In-place & memory-efficient fft(gpuArray) calls for real input?
For a real NxM gpuArray, Matlab's fft() returns a complex NxM gpuArray which doubles the allocated memory. This makes it impossible to operate on gpuArrays that occupy close to the maximum RAM on the GPU.
Is there a way to use an in-place (padded) R2C (complex N/2+1 x M output) cufft() call from Matlab without writing a custom CUDA kernel or mexcuda call?
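For context on why an R2C transform only needs N/2+1 rows: the DFT of a real column is Hermitian-symmetric, X[k] = conj(X[N-k]), so the upper half of the spectrum is redundant. The sketch below checks this in plain Python with a naive DFT. It is not cuFFT and not Matlab's gpuArray fft; the helper names dft, half_spectrum and full_from_half are made up for illustration.

```python
import cmath

def dft(x):
    """Naive O(N^2) DFT of a sequence (stand-in for an FFT call)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def half_spectrum(x):
    """Keep only bins 0..N/2, the N/2+1 values an R2C transform stores."""
    return dft(x)[: len(x) // 2 + 1]

def full_from_half(half, n):
    """Rebuild all N bins of a real signal's DFT from the stored half."""
    return [half[k] if k <= n // 2 else half[n - k].conjugate() for k in range(n)]

x = [0.5, -1.0, 2.0, 3.5, 0.0, 1.25, -0.75, 4.0]  # one real column, N = 8
full = dft(x)
half = half_spectrum(x)
rebuilt = full_from_half(half, len(x))

# Largest difference between the full spectrum and the one rebuilt from the half
max_err = max(abs(a - b) for a, b in zip(full, rebuilt))
print(len(half), max_err < 1e-9)
```

N/2+1 complex values occupy the same space as N+2 real values, which is why in-place R2C layouts pad each real column slightly instead of doubling the allocation.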
See also questions close to this topic

Implement the MATLAB 'fitdist' in Python
I am working on an image processing tool, and I am having some trouble finding a good Python substitute for MATLAB's fitdist.
The matlab code works something like this:
pdR = fitdist(Red,'Kernel','Support','positive');
Has anyone found a good implementation of this in Python?

This is an error of matlab. I am not able to figure out what type of error is this?
I am a complete beginner in MATLAB. I was solving a machine learning problem from the Stanford machine learning course. When I ran the program an error occurred, and when I opened the script again, something like this showed up: [image of the error]
I am not able to figure out this error.

smoothing curve plot in Matlab
When I run the code below in MATLAB, the curves in the output graph are not completely smooth. I have also attached a snapshot of the current graph. How can I smooth the curves of my graph?
function A = glo
ga=19; ba=0.81; Asat=1.4; ap=48000; bp=0.123; q1=3.8; q2=3.7;
r=[0,0.2,0.4,0.6,0.8,1,1.2,1.4,1.6,1.8];
r1=[0 , 0.1475 , 0.2550 , 0.4308 , 0.8000 , 1.0000 , 0.2027, 0.1728, 0.4969 , 0.2742];
%r1=abs([0.7145 0.0293 0.0712 0.5145 0.2620 0.6894 0.2276 0.5719 0.0057 0.0078]);
r2=abs(r1).*r;
for i=1:length(r)
    A(i)=(ga*r(i))/(1+((ga*r(i))/Asat)^(2*ba))^(1/(2*ba));
end
for i=1:length(r)
    A1(i)=(ga*r2(i))/(1+((ga*r2(i))/Asat)^(2*ba))^(1/(2*ba));
end
plot (r,A,'k');
hold on
plot (r,A1,'k');
hold on
plot (r,r2,'k');
end

Sampling an arbitrary point within a DFT?
What I'm trying to do: I want to compress a 2D greyscale map (2D array of float values between 0 and 1) into a DFT. I then want to be able to sample the value of points in continuous coordinates (i.e. arbitrary points in between the data points in the original 2D map).
What I've tried: So far I've looked at Exocortex and some similar libraries, but they seem to be missing functions for sampling a single point or performing lossy compression. Though the math is a bit above my level, I might be able to derive methods to do these things. Ideally someone can point me to a C# library that already has this functionality. I'm also concerned that libraries that use the row-column FFT algorithm don't produce sinusoid functions that can be easily sampled this way, since they unwind the 2D array into a 1D array.
More detail on what I'm trying to do: The intended application for all this is an experiment in efficiently precomputing, storing, and querying line of sight information. This is similar to the way spherical harmonic light probes are used to approximate lighting on dynamic objects. A grid of visibility probes stores compressed visibility data using a small number of float values each. From this grid, an observer position can calculate an interpolated probe, then use that probe to sample the estimated visibility of nearby positions. The results don't have to be perfectly accurate; this is intended as a first pass that can cheaply identify objects that are almost certainly visible or obscured, and then maybe perform more expensive raycasting on the few on-the-fence objects.

Javascript P5 get frequency data from AudioBuffer Channel
I'm trying to get the frequency data from an AudioBuffer channel in p5.js. The fft function included in p5.sound.js doesn't correspond to my needs because it is based on an Analyzer Node and can't be fed the sound file buffer.
I've tried to use browserify with the fftjs package but that was no good. In my sketch.js I have a line like this:
let fft_js = require('fftjs').fft;
And then I browserified my sketch.js to a bundle.js like so:
browserify sketch.js -o bundle.js
And added bundle.js to my index.html:
<script src="bundle.js"></script>
When refreshing my localhost browser I get a blank page where I used to have a functioning script before.
Ideally I am looking for a way to get something like this using p5.js as a basis :
let frequencies = fft.getFrequencies(channel);
Could anyone guide me to resources that would allow me to process sound files in such a way ?
Thank you very much !

How to denoise complex signals with fft and ifft
I have a complex IQ signal in a 128x32 matrix. The signal itself is actually continuous and can be unrolled into a single-row vector of 4096 complex values (unrolling columns one at a time).
I wanted to remove some of the noise from the signal, so I created an image from the FFT of the data (see code below) and removed the noise from the image. Now I want to know how to get the original clean signal back without the noise.
I can't use ifft, because the noise was cleaned from the absolute value of the fft, so the clean image is not complex.
Let's say IQ is the variable holding the 128x32 complex signal. This is the matlab code:
img = abs(fft(IQ)); % this is also a 128x32 matrix
% normalizing the data
mu = mean(reshape(img, 1, []));
sig = std(reshape(img, 1, []), 1);
img = (img - mu) ./ sig;
% using some function on the image to clean it from noise
img = remove_noise(img)
The function that removes noise from the image only replaces certain values with zeros. It doesn't change other frequencies.
Now, the variable img contains the absolute value of the fft of the signal, after it was cleaned. I can't just use ifft since the signal was complex and the image is a matrix containing real values.
How do I get the original signal back?
Thanks.
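One possible way to frame the problem, sketched in plain Python rather than MATLAB: since the cleaning step only zeroes some bins, the cleaned magnitude image can be read as a keep/discard mask and applied to the original complex FFT, so the phase is never lost. This assumes that an exact zero in the cleaned image marks a removed bin, which is an assumption about remove_noise and not something stated above. The sketch handles a single column with a naive DFT, and the names dft, idft and apply_magnitude_mask are hypothetical.

```python
import cmath

def dft(x, sign=-1):
    """Naive DFT (sign=-1) or unnormalized inverse DFT (sign=+1)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse DFT with the usual 1/N normalization."""
    n = len(spectrum)
    return [v / n for v in dft(spectrum, sign=+1)]

def apply_magnitude_mask(signal, cleaned_mag):
    """Zero the complex bins whose cleaned magnitude is zero; keep the others unchanged."""
    spectrum = dft(signal)
    masked = [c if m != 0 else 0j for c, m in zip(spectrum, cleaned_mag)]
    return idft(masked)

# Synthetic column: a complex tone at bin 2 plus a weaker "noise" tone at bin 5
n = 16
tone = [cmath.exp(2j * cmath.pi * 2 * t / n) for t in range(n)]
noise = [0.1 * cmath.exp(2j * cmath.pi * 5 * t / n) for t in range(n)]
signal = [a + b for a, b in zip(tone, noise)]

# Stand-in for remove_noise: zero the magnitude of bin 5 only
cleaned_mag = [abs(c) for c in dft(signal)]
cleaned_mag[5] = 0.0

recovered = apply_magnitude_mask(signal, cleaned_mag)
max_err = max(abs(a - b) for a, b in zip(recovered, tone))
print(max_err < 1e-9)
```

For the 128x32 case the same mask would be applied column by column, matching the column-wise fft(IQ) in the question.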

cuFFT static linking failed
I tried to link cuFFT statically.
nvcc -ccbin g++ -dc -O3 -arch=sm_35 -c fftStat.cu fftStat.o;
nvcc -ccbin g++ -dlink -arch=sm_35 fftStat.o -o link.o;
g++ main.cc link.o fftStat.o -lcudart -lcudadevrt -lcufft_static -lculibos -ldl -pthread -lrt -L/usr/local/cuda10.2/lib64 -o run
It gave me the following errors (not showing all the errors)
/usr/local/cuda10.2/lib64/libcufft_static.a(fft_dimension_class_multi.o): In function `__sti____cudaRegisterAll()': fft_dimension_class_multi.compute_75.cudafe1.cpp:(.text+0xdad): undefined reference to `__cudaRegisterLinkedBinary_44_fft_dimension_class_multi_compute_75_cpp1_ii_466e44ab' /usr/local/cuda10.2/lib64/libcufft_static.a(fft_dimension_class_multi.o): In function `global constructors keyed to BaseListMulti::radices': fft_dimension_class_multi.compute_75.cudafe1.cpp:(.text+0x1c8d): undefined reference to float_64bit_regular_RT_SM50_plus.compute_75.cudafe1.cpp:(.text+0x3d): undefined reference to `__cudaRegisterLinkedBinary_51_float_64bit_regular_RT_SM50_plus_compute_75_cpp1_ii_66731515' /usr/local/cuda10.2/lib64/libcufft_static.a(float_64bit_regular_RT_SM50_plus.o): In function `global constructors keyed to compile_unitsforce_compile_float_width64_t_regular_fft_kernels__SM50_unbounded()': float_64bit_regular_RT_SM50_plus.compute_75.cudafe1.cpp:(.text+0x29d): undefined reference to `__cudaRegisterLinkedBinary_51_float_64bit_regular_RT_SM50_plus_compute_75_cpp1_ii_66731515' /usr/local/cuda10.2/lib64/libcufft_static.a(float_64bit_regular_RT_SM60_plus.o): In function `__sti____cudaRegisterAll()': float_64bit_regular_RT_SM60_plus.compute_75.cudafe1.cpp:(.text+0x3d): undefined reference to `__cudaRegisterLinkedBinary_51_float_64bit_regular_RT_SM60_plus_compute_75_cpp1_ii_dbb979db' /usr/local/cuda10.2/lib64/libcufft_static.a(float_64bit_regular_RT_SM60_plus.o): In function `global constructors keyed to compile_unitsforce_compile_float_width64_t_regular_fft_kernels__SM60_unbounded()': float_64bit_regular_RT_SM60_plus.compute_75.cudafe1.cpp:(.text+0x18d): undefined reference to `__cudaRegisterLinkedBinary_51_float_64bit_regular_RT_SM60_plus_compute_75_cpp1_ii_dbb979db' /usr/local/cuda10.2/lib64/libcufft_static.a(half_32bit_regular_RT_SM53_plus.o): In function `__sti____cudaRegisterAll()': half_32bit_regular_RT_SM53_plus.compute_75.cudafe1.cpp:(.text+0x3d): undefined 
reference to `__cudaRegisterLinkedBinary_50_half_32bit_regular_RT_SM53_plus_compute_75_cpp1_ii_96a57339' /usr/local/cuda10.2/lib64/libcufft_static.a(half_32bit_regular_RT_SM53_plus.o): In function `global constructors keyed to compile_unitsforce_compile_half_width32_t_regular_fft_kernels__SM53_unbounded()': half_32bit_regular_RT_SM53_plus.compute_75.cudafe1.cpp:(.text+0x1b0d): undefined reference to `__cudaRegisterLinkedBinary_50_half_32bit_regular_RT_SM53_plus_compute_75_cpp1_ii_96a57339' /usr/local/cuda10.2/lib64/libcufft_static.a(half_32bit_vector_RT_SM53_plus.o): In function `__sti____cudaRegisterAll()': half_32bit_vector_RT_SM53_plus.compute_75.cudafe1.cpp:(.text+0x3d): undefined reference to dpRadix0343C_cb.compute_75.cudafe1.cpp:(.text+0xa54): undefined reference to `__cudaRegisterLinkedBinary_34_dpRadix0343C_cb_compute_75_cpp1_ii_b592a056' collect2: error: ld returned 1 exit status
Dynamic linking works:
g++ main.cc link.o fftStat.o -lcudart -lcudadevrt -lcufft -L/usr/local/cuda10.2/lib64 -o run
I followed this guide https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#code-changes-for-separate-compilation and this guide https://docs.nvidia.com/cuda/cufft/index.html#static-library but apparently something is missing.

Planning and executing cufft in separate functions
Is there a way to define and initialize an fft plan in one function, then execute that fft plan in a separate function? Something like
cufftHandle make_plan(int params...){
    cufftHandle plan;
    cufftPlanMany(&plan, ...);
    return plan;
}
void exec_fft(cufftHandle plan, cuDoubleComplex* idata, cuDoubleComplex* odata){
    cufftExecZ2Z(plan, idata, odata, CUFFT_FORWARD);
}
When I run the above code, cufftExecZ2Z fails, with the error CUFFT_INVALID_PLAN. The fft execution is successful if I move the cufftExecZ2Z into the same function as the plan:
void plan_and_exec_fft(int params ..., cuDoubleComplex* idata, cuDoubleComplex* odata){
    cufftHandle plan;
    cufftPlanMany(&plan, ...);
    cufftExecZ2Z(plan, idata, odata, CUFFT_FORWARD);
}
This question seems to be similar, but lacks details on implementation.

Using cufft to generate Spectrograms
I am looking to create a spectrogram from a wav file using CUDA's cuFFT library. From what I have researched, the general steps seem to be:
1. read sound file data into appropriate array
2. split the data into chunks
3. apply window function to chunks
4. perform FFT on each chunk (which is a STFT)
5. return results in a contiguous array
6. compute the squared magnitude of each element of the array
7. plot the data using frequency vs time with amplitude as the color scale
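The chunk, window, transform and squared-magnitude steps listed above can be sketched on the CPU as follows. This is plain Python with a naive DFT standing in for the FFT, not cuFFT code; the function names, the Hann window choice, and the frame and hop sizes are all illustrative assumptions, and a synthetic tone replaces the wav file.

```python
import cmath
import math

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def dft_real(frame):
    """Naive DFT of a real frame, returning bins 0..n/2 only."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]

def spectrogram(samples, frame_len, hop):
    """Chunk the samples, window each chunk, transform it, take |X|^2."""
    window = hann(frame_len)
    rows = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_len)]
        rows.append([abs(c) ** 2 for c in dft_real(frame)])
    return rows  # rows[time_index][frequency_bin]

# Synthetic input: a sine with 4 cycles per 32-sample frame, 128 samples long
frame_len, hop = 32, 16
samples = [math.sin(2 * math.pi * 4 * t / frame_len) for t in range(128)]

spec = spectrogram(samples, frame_len, hop)
peak_bins = [row.index(max(row)) for row in spec]  # strongest bin per frame
print(len(spec), peak_bins)
```

Because every frame is independent, the loop body has the shape of a batched transform. On the GPU this would presumably mean copying the samples once and windowing all frames on the device before a single batched FFT, rather than one transfer per chunk.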
I am trying to get steps 2 through 4 to work, and am using this as a reference from the FFTW library, and trying to convert that to cuFFT code. However, it seems that this will require an unreasonable number of memcpys to do the windowing, since we have to transfer the data to the GPU after each chunk is windowed over the last FFT.
I also looked at CUDA: How do I use float audio data with cuFFT? but they were computing the FFT of the entire file and weren't doing any windowing, so I am not sure the approach is correct. How would I do a STFT to get a spectrogram using the cuFFT library?

I have cuMemAlloc error in my pycuda gpuarray
I am trying to implement odd-even sort in pycuda in a Google Colab notebook. However, this line
randomArray_gpu = gpuarray.to_gpu(randomArray)
gives me this error:
LogicError: cuMemAlloc failed: an illegal memory access was encountered
Here is my full code:
from pycuda.compiler import SourceModule
from random import randint
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray
import pycuda.autoinit
import numpy as np
import time
import sys
start_time = time.time()
randomArray = np.array([33,23,45,65,34,87,4,67,98,43, 3,31,56,67,43,9,56,67], dtype=np.int32)
randomArray_gpu = gpuarray.to_gpu(randomArray)
mod = SourceModule("""
__global__ void sort(int *array, int *len, int *result)
{
    int l;
    int temp;
    if(*len%2==0)
        l = *len/2;
    else
        l=(*len/2)+1;
    for(int i=0;i<l;i++)
    {
        if((!(threadIdx.x&1)) && (threadIdx.x<(*len-1))) //even phase
        {
            if(array[threadIdx.x]>array[threadIdx.x+1])
                temp =array[threadIdx.x];
            array[threadIdx.x] = array[threadIdx.x+1];
            array[threadIdx.x+1] = temp;
        }
        __syncthreads();
        if((threadIdx.x&1) && (threadIdx.x<(*len-1))) //odd phase
        {
            if(array[threadIdx.x]>array[threadIdx.x+1])
                temp =array[threadIdx.x];
            array[threadIdx.x] = array[threadIdx.x+1];
            array[threadIdx.x+1] = temp;
        }
        __syncthreads();
    }
    result = array;
}
""")
func = mod.get_function("sort")
sorted_array = gpuarray.zeros((10,), np.int32)
sorted_array_gpu = gpuarray.to_gpu(sorted_array)
func(randomArray_gpu, len(randomArray), sorted_array_gpu, block=(10,1,1))
print("Elapsed Time: " + str(time.time() - start_time))
print(randomArray)
print(sorted_array_gpu)
Can someone please help me figure out why I am getting this error? Is there an error in my implementation? Thank you.

How to construct a matrix from selected columns of a 3D array?
I have a 3D GPU array A with dimensions K x M x N and an int vector v of length M and want to construct 2D GPU arrays of the form X = [A(:,1,v(1)), A(:,2,v(2)),..., A(:,M,v(M))] (depending on v) in the most time-efficient way. Since all these are GPU arrays, I was wondering if there is a faster way to accomplish this than preallocating X and using the obvious for loop. My code needs to invoke several millions of these instances, so this becomes quite the bottleneck. Typical orders of magnitude would be K = 350 000, 2<=M<=15, N<=2000, if that matters.
EDIT: Here is a minimal working version of the original bottleneck code I am trying to improve. Conversion to the 3D array A has been commented out. Adjust the array size parameters as you see fit.
% generate test data:
K = 4000;
M = 2;
% N = 100
A_cell = cell(1,M);
s = zeros(1,M,'uint16');
for m=1:M
    s(m) = m*50; % defines some widths for the matrices in the cells
    A_cell{m} = cast(randi([0 1],K,s(m),'gpuArray'),'logical');
end
N = max(s,[],2);
% % A_cell can be replaced by a 3D array A of dimensions K x M x N:
% A = true(K,M,N,'gpuArray');
% for m=1:M
%     A(:,m,1:s(m)) = permute(A_cell{m},[1 3 2]);
% end
% bottleneck code starts here and has M = 2 nested loops:
I_true = true(K,1,'gpuArray');
I_01 = false(K,1,'gpuArray');
I_02 = false(K,1,'gpuArray');
for j_1=1:s(1)
    for j_2=1:s(2)
        v = [j_1,j_2];
        I_tmp = I_true;
        for m=1:M
            I_tmp = I_tmp & A_cell{m}(:,v(m));
        end
        I_02 = I_02 | I_tmp;
    end
    I_01 = I_01 | I_02;
end
Out = gather(I_01);
% A_cell can be replaced by 3D array A above