In-place & memory-efficient fft(gpuArray) calls for real input?
For a real NxM gpuArray, Matlab's fft() returns a complex NxM gpuArray which doubles the allocated memory. This makes it impossible to operate on gpuArrays that occupy close to the maximum RAM on the GPU.
Is there a way to use an in-place (padded) R2C cufft() call (complex (N/2+1) x M output) from Matlab without writing a custom CUDA kernel or a mexcuda call?
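For context, here is a small NumPy sketch (np.fft.rfft standing in for a cuFFT R2C plan; this is not the gpuArray API) of why the R2C output only needs N/2+1 complex rows: for real input the full spectrum is conjugate-symmetric, so the discarded half is redundant and the R2C result has roughly the same footprint as the real input rather than double it.

```python
import numpy as np

# R2C storage saving, illustrated with NumPy as a stand-in for cuFFT R2C.
N, M = 8, 4
x = np.random.default_rng(0).standard_normal((N, M))   # real input: N*M*8 bytes

full = np.fft.fft(x, axis=0)    # complex N x M: twice the input's bytes
half = np.fft.rfft(x, axis=0)   # complex (N/2+1) x M: about the input's size

print(full.shape, half.shape)   # → (8, 4) (5, 4)
# The discarded rows are redundant for real input (conjugate symmetry):
print(np.allclose(full[:N // 2 + 1], half))  # → True
```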
See also questions close to this topic

How to properly store variables to use with parfor?
My calculations are based on a binary tree, which takes two previous instances of a block of variables (called Assemblies) and produces another one. A new assembly is generated based on two assemblies from the upper branch, so all variables must be stored.
To this end, I use cell arrays with the following syntax:
Assembly_ij = Tree{ithBranch}{jthAssembly}
where Assembly is an 18x3 double matrix. This approach is allowed by Matlab; however, it doesn't improve the execution of the code at all. I believe this is due to the inappropriate way in which I pass variables to workers. I get the following warning:

The entire array or structure 'Tree' is a broadcast variable. This might result in unnecessary communication overhead.
Most of the work is done in this part of the code, and it should convey the mistake I am making.
initialBranch = initialize();
Tree{1} = initialBranch;
for i = 2 : Nbranches
    branch = cell(1, elmsInBranch(i));
    parfor j = 1 : elmsInBranch(i)
        branch{j} = assembleBlocks(Tree{i-1}{2*j-1}, Tree{i-1}{2*j});
    end
    Tree{i} = branch;
end
Matlab must surely be copying the whole Tree structure to each worker, which is a lot of useless copying. I don't know how to rewrite this so it works properly; is there some clever way to extract just the variables each worker needs?

return number of rows inserted into table
I have a SQL command that inserts some rows into a SQL table using Matlab.
n = exec(db, sql);
I use the exec function, which works and inserts the rows from one table to another. However, I would like to know how many rows it is inserting. Is this possible in Matlab? In C# I can use the command below:
num_rows = cmd.ExecuteNonQuery();

How can I optimize saturation coefficients from known Inductance values
I am trying to optimize/solve some saturation coefficients based on measured inductance values. So I have the Inductances L(x,i) for certain values of x and i and would like to find the saturation coefficients s_k (which is an array) and s_x (which is a scalar). The optimization problem is shown below
I don't have any constraints for s_k and s_x. Is this absolutely necessary, or can I find the values without constraints? If so, I will be very happy to be pointed in the right direction. Thanks. (Unfortunately, I have no prior knowledge about optimization in MATLAB.)

FFT in Python - wrong data?
On my Raspberry Pi 4, I have an xlsx file which I read with:
df = pd.read_excel('')
This is a dataset of a pressure sensor with 146651 rows x 1 columns. I want to perform an FFT on this dataset, but when I plot the FFT I get a curve exactly the same as the time signal (shifted on the x-axis???). So what is the problem?

import numpy as np
import matplotlib.pyplot as plt
from scipy import fftpack
import pandas as pd

# Import xlsx file
df = pd.read_excel('/home/pi/Downloads/test_11.12.19_500_neuHz.xlsx', skiprows=1)
print(df)

sig_fft = fftpack.fft(df)
power = np.abs(sig_fft)
print(power)

sample_freq = fftpack.fftfreq(df.size, 0.02)

plt.figure(figsize=(6, 5))
plt.plot(sample_freq, power)
plt.show()
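A hedged guess at what is going on (the real spreadsheet isn't available here, so a synthetic stand-in is used): plotting the raw complex FFT of a 2-D DataFrame, with a large DC offset and both frequency halves included, can look deceptively like the input. Working on a 1-D, mean-removed array and looking only at the positive-frequency magnitudes gives an interpretable spectrum:

```python
import numpy as np
from scipy import fftpack

# Synthetic stand-in for the pressure column: a 5 Hz sine sampled at 50 Hz
# (0.02 s spacing, matching the question), plus a constant offset.
fs = 50.0
t = np.arange(0, 20, 1 / fs)
sig = 2.0 + np.sin(2 * np.pi * 5.0 * t)

# Key steps: work on a 1-D array (e.g. df.iloc[:, 0].to_numpy()), and remove
# the mean so the DC component does not dwarf everything else.
sig = sig - sig.mean()

power = np.abs(fftpack.fft(sig))
freqs = fftpack.fftfreq(sig.size, d=1 / fs)

# Keep only the positive half of the spectrum for plotting/peak-finding.
mask = freqs > 0
peak_freq = freqs[mask][np.argmax(power[mask])]
print(peak_freq)  # → 5.0
```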

how to make a high pass filter?
I have a 3D data matrix of sea level data (time, y, x) and I found the power spectrum by taking the square of the FFT but there are low frequencies that are really dominant. I want to get rid of those low frequencies by applying a high pass filter... how would I go about doing that? Example of data set and structure/code is below:
This is the data set and creating the arrays:
Yearmin = 2018
Yearmax = 2019
year_len = Yearmax - Yearmin + 1.0  # number of years
direcInput = "filepath"
a = s.Dataset(direcInput + "test.nc", mode='r')

# creating arrays
lat = a.variables["latitude"][:]
lon = a.variables["longitude"][:]
time1 = a.variables["time"][:]  # DAYS SINCE JAN 1ST 1950
sla = a.variables["sla"][:,:,:]  # t, y, x
time = Yearmin + (year_len * (time1 - np.min(time1)) / (np.max(time1) - np.min(time1)))

# detrending and normalizing data
def standardize(y, detrend=True, normalize=True):
    if detrend == True:
        y = signal.detrend(y, axis=0)
    y = (y - np.mean(y, axis=0))
    if normalize == True:
        y = y / np.std(y, axis=0)
    return y

sla_standard = standardize(sla)
print(sla_standard.shape)  # (710, 81, 320)

# fft
fft = np.fft.rfft(sla_standard, axis=0)
spec = np.square(abs(fft))
frequencies = (0, nyquist, df)

# PLOTTING THE FREQUENCIES VS SPECTRUM FOR A FEW DIFFERENT SPATIAL LOCATIONS
plt.plot(frequencies, spec[:, 68, 85])
plt.plot(frequencies, spec[:, 23, 235])
plt.plot(frequencies, spec[:, 39, 178])
plt.plot(frequencies, spec[:, 30, 149])
plt.xlim(0, .05)
plt.show()
My goal is to make a high pass filter of the ORIGINAL time series (sla_standard) to remove the two really big peaks. Which type of filter should I use? Thank you!
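One common approach is a zero-phase Butterworth high-pass applied along the time axis with scipy.signal. This is a sketch, not a definitive recipe: the cutoff, sampling rate, and filter order below are placeholder assumptions that should be set from the real time spacing of the data.

```python
import numpy as np
from scipy import signal

fs = 1.0        # samples per day (assumed)
cutoff = 0.01   # cycles per day (assumed): removes periods longer than ~100 days

# 4th-order Butterworth high-pass in second-order-sections form (numerically robust).
sos = signal.butter(4, cutoff, btype='highpass', fs=fs, output='sos')

# Synthetic stand-in for sla_standard: a slow trend plus a fast signal,
# replicated over a small (time, y, x) grid.
t = np.arange(710)
low = np.sin(2 * np.pi * t / 355.0)          # slow component, should be removed
high = 0.5 * np.sin(2 * np.pi * t / 10.0)    # fast component, should survive
sla_standard = (low + high)[:, None, None] * np.ones((1, 4, 5))

# sosfiltfilt filters forward and backward (zero phase) along axis 0,
# so the whole 3-D array can be filtered in one call.
filtered = signal.sosfiltfilt(sos, sla_standard, axis=0)
print(filtered.shape)  # → (710, 4, 5)
```

The same call works directly on the real (710, 81, 320) array; only the `fs`/`cutoff` placeholders need adjusting.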

How to interpret this fft graph
I want to apply a Fourier transform using the fft function to my time series data to find "patterns" by extracting the dominant frequency components in the observed data, i.e. the lowest 5 dominant frequencies, to predict the y value (bacteria count) at the end of each time series. I would like to keep the 5 smallest coefficients as features and eliminate the rest. My code is below:
df = pd.read_csv('/content/drive/My Drive/df.csv', sep=',')
X = df.iloc[0:2, 0:10000]
dft_X = np.fft.fft(X)
print(dft_X)
print(len(dft_X))

plt.plot(dft_X)
plt.grid(True)
plt.show()
# What is the graph about (freq/amplitude)? How much data did it use?

for i in dft_X:
    m = i[np.argpartition(i, 5)[:5]]
    n = i[np.argpartition(i, range(5))[:5]]
    print(m, '\n', n)
But I am not sure how to interpret this graph. To be precise:

1) Does the graph show the transformed values of the input data? I only used 2 rows of data (each row is a time series), so the data is 2x10000. Why are there so many lines in the graph?

2) To obtain the frequency values, should I use np.fft.fftfreq(n, d=timestep)?

Parameters:
n : int Window length.
d : scalar, optional Sample spacing (inverse of the sampling rate). Defaults to 1.

Returns:
f : ndarray Array of length n containing the sample frequencies.

How do I determine n (window length) and the sample spacing?

3) Why are the transformed values all complex numbers?
Thanks
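A brief sketch of how the pieces fit together, with a synthetic series in place of the CSV (which isn't available here): the FFT output is complex because each bin carries both amplitude and phase; the magnitude np.abs(...) is what a spectrum plot normally shows; and np.fft.fftfreq's n is just the length of the series while d is the time between samples.

```python
import numpy as np

timestep = 1.0                                  # assumed sample spacing
x = np.sin(2 * np.pi * 0.1 * np.arange(100))    # one time series, length 100

dft = np.fft.fft(x)                 # complex: amplitude AND phase per bin
freqs = np.fft.fftfreq(x.size, d=timestep)  # n = series length, d = spacing
magnitude = np.abs(dft)             # plot this, not the raw complex values

# Ranking the raw complex array with argpartition compares complex numbers
# lexicographically, which is not meaningful here; rank by magnitude instead,
# and skip the mirrored negative-frequency half of the spectrum.
half = x.size // 2
dominant = np.argsort(magnitude[:half])[::-1][:5]
print(freqs[dominant][0])  # → 0.1 (the frequency of the sine above)
```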

Multi-GPU batched 1D FFTs: only a single GPU seems to work
I have three Tesla V100s on RHEL 8 with CUDA toolkit version 10.2.89.
I'm attempting to compute a batch of 1D FFTs of the columns of a row-major matrix. In the example below, the matrix is 16x8, so with three GPUs I'd expect GPU 0 to perform the FFTs of the first 3 columns, GPU 1 to perform FFTs of the next 3, and GPU 2 to perform FFTs of the final 2.
The plan created in the example works as expected on a single GPU, but when running on three only the first three columns are computed (correctly), the remainder are untouched.
When I inspect the descriptor that is filled by cufftXtMalloc, I see that it has allocated space for 123 elements on GPUs 0 and 1, and 122 on GPU 2. This seems weird: I would expect 48 = 16*3 on GPUs 0 and 1 and 32 = 16*2 on GPU 2. Indeed this is the size of the workspaces filled by cufftMakePlanMany. When I inspect the data that was copied, elements 0-122 are in the buffer on GPU 0, and elements 123-127 are at the beginning of the buffer on GPU 1. The remainder of that buffer and the buffer on GPU 2 are junk.
In addition, when I increase the number of rows to 1024, I get a SIGABRT on the cufftXtFree call with the message 'free(): corrupted unsorted chunks'.
#include "cufft.h"
#include "cufftXt.h"
#include <vector>
#include <cuComplex.h>
#include <cassert>

#define CUDA_CHECK(x) assert(x == cudaSuccess)
#define CUFFT_CHECK(x) assert(x == CUFFT_SUCCESS)

int main() {
    static const int numGPUs = 3;
    int gpus[numGPUs] = {0, 1, 2};

    int nr = 16;
    int nc = 8;

    // Fill with junk data
    std::vector<cuFloatComplex> h_x(nr * nc);
    for (int i = 0; i < nr * nc; ++i) {
        h_x[i].x = static_cast<float>(i);
    }

    cufftHandle plan;
    CUFFT_CHECK(cufftCreate(&plan));
    CUFFT_CHECK(cufftXtSetGPUs(plan, numGPUs, gpus));

    std::vector<size_t> workSizes(numGPUs);
    int n[] = {nr};
    CUFFT_CHECK(cufftMakePlanMany(plan,
                                  1,  // rank
                                  n,  // n
                                  n,  // inembed
                                  nc, // istride
                                  1,  // idist
                                  n,  // onembed
                                  nc, // ostride
                                  1,  // odist
                                  CUFFT_C2C,
                                  nc,
                                  workSizes.data()));

    cudaLibXtDesc *d_x;
    CUFFT_CHECK(cufftXtMalloc(plan, &d_x, CUFFT_XT_FORMAT_INPLACE));
    CUFFT_CHECK(cufftXtMemcpy(plan, d_x, (void *)h_x.data(), CUFFT_COPY_HOST_TO_DEVICE));
    CUFFT_CHECK(cufftXtExecDescriptorC2C(plan, d_x, d_x, CUFFT_FORWARD));

    std::vector<cuFloatComplex> h_out(nr * nc);
    CUFFT_CHECK(cufftXtMemcpy(plan, (void *)h_out.data(), d_x, CUFFT_COPY_DEVICE_TO_HOST));

    CUFFT_CHECK(cufftXtFree(d_x));
    CUFFT_CHECK(cufftDestroy(plan));
    CUDA_CHECK(cudaDeviceReset());
    return 0;
}

cuFFT in column direction
I have a complex matrix of size nx * ny. I want to perform an FFT in the column direction only. One way is to transpose the entire matrix and then use cufftPlan1d to obtain the FFT. Is there any other efficient way to obtain the FFT without taking the transpose of the matrix? Will cufftPlanMany help to obtain the FFT in the column direction? For example, let us assume nx = 8192 and ny = 32768. The parameters for cufftPlanMany are as follows:
rank = 1;
idist = 1;      // distance between batches
odist = 1;
istride = nx;
ostride = nx;
int inembed[] = {nx};
int onembed[] = {nx};
int n[] = {ny};

cufftPlanMany(&plan, rank, n, inembed, istride, idist,
              onembed, ostride, odist, CUFFT_C2C, 1);
Is this the right way to use cufftPlanMany?

L2 error norm of FFT solution not converging
Given a function u(x,y), I want to calculate its Laplacian.
I am doing this via cuFFT's forward and inverse transforms. Below is the code
#include <cufft.h>
#include <stdio.h>
#include <math.h>
#include <string.h>

#define BSZ 8
#define gpuErrChk(ans) { checkCUDAError((ans), __FILE__, __LINE__); }

inline void checkCUDAError(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (cudaSuccess != code) {
        fprintf(stderr, "CUDA Error in %s(%d): %s.\n", file, line, cudaGetErrorString(code));
        exit(EXIT_FAILURE);
    }
}

__global__ void scale_drv(cufftComplex *ft, float *k, int N)
{
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    int j = threadIdx.y + blockIdx.y*blockDim.y;
    int idx = j*(N/2+1)+i;
    if (i < (N/2+1) && j < N) {
        float k2 = float(k[i]*k[i] + k[j]*k[j]);
        ft[idx].x *= k2;
        ft[idx].y *= k2;
    }
}

int main(int argc, char **argv)
{
    // Default problem size
    int N = 64;

    // User can change problem size
    for (int i = 0; i < argc; i++) {
        if (strcmp("N", argv[i]) == 0) {
            N = atoi(argv[i+1]);
            printf(" found N %d\n", N);
        }
    }

    // Allocate everything else
    float xmax = 1.0f, xmin = 0.0f, ymin = 0.0f;
    float h = (xmax - xmin)/((float)N), s = 0.1, s2 = s*s;
    float *x, *y, *f, *f_a, *u_a, *k;
    float r2;
    size_t memSize = N*N*sizeof(float);
    size_t vecSize = N*sizeof(float);
    x   = (float *)malloc(memSize);
    y   = (float *)malloc(memSize);
    f   = (float *)malloc(memSize);
    f_a = (float *)malloc(memSize);
    u_a = (float *)malloc(memSize);
    k   = (float *)malloc(vecSize);

    // Set coordinates, RHS vector, and analytical solution
    for (int j = 0; j < N; j++) {
        for (int i = 0; i < N; i++) {
            x[N*j+i] = xmin + i*h;
            y[N*j+i] = ymin + j*h;
            r2 = (x[N*j+i] - 0.5)*(x[N*j+i] - 0.5) + (y[N*j+i] - 0.5)*(y[N*j+i] - 0.5);
            f_a[N*j+i] = (r2 - 2*s2)/(s2*s2)*exp(-r2/(2*s2));
            u_a[N*j+i] = exp(-r2/(2*s2));
            //u_a[N*j+i] = sin(2*M_PI*i/float(N))*sin(2*M_PI*j/float(N));
            //f[N*j+i] = 8*M_PI*M_PI*u_a[N*j+i];
        }
    }

    // Fourier coeff?
    float freq = 1.f/N;
    for (int i = 0; i <= N/2; i++) k[i] = i*2*M_PI*freq;
    for (int i = N/2+1; i < N; i++) k[i] = (i - N)*2*M_PI*freq;

    // Allocate device information
    float *k_d, *u_d;
    gpuErrChk(cudaMalloc((void**)&k_d, vecSize));
    gpuErrChk(cudaMalloc((void**)&u_d, memSize));
    gpuErrChk(cudaMemcpy(u_d, u_a, memSize, cudaMemcpyHostToDevice));
    gpuErrChk(cudaMemcpy(k_d, k, vecSize, cudaMemcpyHostToDevice));

    // Initiate cuFFT
    cufftComplex *u_dk;
    gpuErrChk(cudaMalloc((void**)&u_dk, sizeof(cufftComplex)*N*(N/2+1)));

    // Create cuFFT plans
    cufftHandle planf, planb;
    size_t wrksize;
    int dims[2] = {N, N};
    cufftCreate(&planf);
    cufftCreate(&planb);
    cufftMakePlanMany(planf, 2, dims, NULL, 1, 0, NULL, 1, 0, CUFFT_R2C, 1, &wrksize);
    cufftMakePlanMany(planb, 2, dims, NULL, 1, 0, NULL, 1, 0, CUFFT_C2R, 1, &wrksize);

    // Forward transform
    cufftExecR2C(planf, u_d, u_dk);

    // Kernel: scale derivative
    dim3 dimBlock(BSZ, BSZ);
    dim3 dimGrid((N/2)/BSZ + 1, N/BSZ);
    scale_drv<<<dimGrid, dimBlock>>>(u_dk, k_d, N);
    cudaDeviceSynchronize();

    // Inverse transform
    cufftExecC2R(planb, u_dk, u_d);

    // Transfer data back to host
    gpuErrChk(cudaMemcpy(f, u_d, memSize, cudaMemcpyDeviceToHost));

    // Save expected vs computed solutions to file
    FILE *fp1, *fp2;
    fp1 = fopen("expected.dat", "w");
    fp2 = fopen("computed.dat", "w");
    float error = 0.0f;
    for (int i = 0; i < N*N; i++) {
        fprintf(fp1, "%.3e\n", f_a[i]);
        fprintf(fp2, "%.3e\n", f[i]);
        // Calculate L2 error norm
        error += (f_a[i] - f[i])*(f_a[i] - f[i]);
    }
    fclose(fp1);
    fclose(fp2);

    // Print L2 error norm
    printf("error = %0.3e\n", sqrt(error));

    // Clean up - list incomplete
    cufftDestroy(planf);
    cufftDestroy(planb);
    return 0;
}
When I do an L2 error norm comparison of the expected f(x,y) vs the computed f(x,y) for N = 4, 8, 16, 32, 64, 128, and 256, I get the following:
error = 2.190e+02
error = 8.118e+00
error = 2.887e-02
error = 6.510e-02
error = 1.715e-01
error = 5.509e-01
error = 2.874e+00
I expect the error norm to drop (as it did for N = 4 through 16) but it actually goes up when N = 32 and beyond. Why is this happening? Here's how I am computing the L2 error norm:
for (int i = 0; i < N*N; i++) {
    ...
    // Calculate L2 error norm
    error += (f_a[i] - f[i])*(f_a[i] - f[i]);
}
...
// Print L2 error norm
printf("error = %0.3e\n", sqrt(error));
I wonder if the increasing numbers result from pollution errors or actually mean that the FFT is not getting any more accurate. Any input is appreciated.
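Not the author's CUDA code, but a NumPy sketch of two points that matter for this convergence study: a discrete L2 norm should carry the grid-spacing weight h = 1/N, since the raw sum of squares grows with the number of samples even at fixed pointwise accuracy (and, separately, cuFFT's C2R inverse is unnormalized, unlike np.fft.ifft2 used here). The periodic sine product is a stand-in for the (non-periodic) Gaussian in the question, chosen to isolate the norm issue.

```python
import numpy as np

def spectral_laplacian_error(N):
    """h-weighted L2 error of a spectrally computed Laplacian on [0,1)^2."""
    h = 1.0 / N
    xs = np.arange(N) * h
    X, Y = np.meshgrid(xs, xs, indexing='ij')
    u = np.sin(2 * np.pi * X) * np.sin(2 * np.pi * Y)
    f_a = -8 * np.pi**2 * u                       # exact Laplacian

    # Wavenumbers in the same layout the question builds by hand.
    kx = 2 * np.pi * np.fft.fftfreq(N, d=h)
    K2 = kx[:, None]**2 + kx[None, :]**2

    # ifft2 already divides by N*N; with cuFFT's C2R you must do that yourself.
    f = np.fft.ifft2(-K2 * np.fft.fft2(u)).real

    # error_h = h * sqrt(sum (f_a - f)^2): the h factor makes norms
    # comparable across resolutions.
    return h * np.sqrt(np.sum((f_a - f)**2))

for N in (16, 32, 64):
    print(N, spectral_laplacian_error(N))   # stays at machine-precision level
```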

How to construct a matrix from selected columns of a 3D array?
I have a 3D GPU array A with dimensions K x M x N and an int vector v of length M, and I want to construct 2D GPU arrays of the form

X = [A(:,1,v(1)), A(:,2,v(2)), ..., A(:,M,v(M))]

(depending on v) in the most time-efficient way. Since all of these are GPU arrays, I was wondering if there is a faster way to accomplish this than preallocating X and using the obvious for loop. My code needs to invoke several millions of these instances, so this becomes quite the bottleneck. Typical orders of magnitude would be K = 350 000, 2 <= M <= 15, N <= 2000, if that matters.

EDIT: Here is a minimal working version of the original bottleneck code I am trying to improve. Conversion to the 3D array A
has been commented out. Adjust the array size parameters as you see fit.

% generate test data:
K = 4000;
M = 2;
% N = 100
A_cell = cell(1,M);
s = zeros(1,M,'uint16');
for m = 1:M
    s(m) = m*50; % defines some widths for the matrices in the cells
    A_cell{m} = cast(randi([0 1],K,s(m),'gpuArray'),'logical');
end
N = max(s,[],2);

% % A_cell can be replaced by a 3D array A of dimensions K x M x N:
% A = true(K,M,N,'gpuArray');
% for m = 1:M
%     A(:,m,1:s(m)) = permute(A_cell{m},[1 3 2]);
% end

% bottleneck code starts here and has M = 2 nested loops:
I_true = true(K,1,'gpuArray');
I_01 = false(K,1,'gpuArray');
I_02 = false(K,1,'gpuArray');
for j_1 = 1:s(1)
    for j_2 = 1:s(2)
        v = [j_1,j_2];
        I_tmp = I_true;
        for m = 1:M
            I_tmp = I_tmp & A_cell{m}(:,v(m));
        end
        I_02 = I_02 | I_tmp;
    end
    I_01 = I_01 | I_02;
end
Out = gather(I_01); % A_cell can be replaced by 3D array A above
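As a language-neutral illustration of the vectorized idea (NumPy here, with names of my own choosing, not the question's gpuArray code): picking page v(m) for column m can be done as a single gather via advanced indexing instead of an M-iteration loop. In MATLAB a similar effect can be pursued with linear indexing (sub2ind-style), though that is not shown here.

```python
import numpy as np

# K x M x N array and one page index per column.
K, M, N = 6, 3, 4
rng = np.random.default_rng(1)
A = rng.standard_normal((K, M, N))
v = np.array([2, 0, 3])

# Loop version, mirroring the obvious MATLAB for loop: X(:,m) = A(:,m,v(m)).
X_loop = np.empty((K, M))
for m in range(M):
    X_loop[:, m] = A[:, m, v[m]]

# Vectorized gather: advanced indexing pairs each column index m with v[m],
# producing the whole K x M result in one operation.
X_vec = A[:, np.arange(M), v]

print(np.array_equal(X_loop, X_vec))  # → True
```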

How to evaluate a very complicated function on a GPU using MATLAB
I have a very complicated symbolic function in Matlab and I want to evaluate it on a GPU.
As I searched, for evaluation we need to use gpuArray, but when I try to evaluate the function on the GPU using this code:

joint_pdf_function_handle = @(x1,x2) sqrt(1.99e+2).*1.0./pi.^2.*exp(2.4e+1./2.5e+1).*exp(sqrt(2.0).*erfcinv(integral(@(x)(1.0./sqrt(x.^2+9.0).*besselk(1,sqrt(x.^2+9.0).*(4.0./2.5e+1)).*7.757157130525889e-1)./pi,-Inf,x1).*2.0).*(sqrt(2.0).*erfcinv(integral(@(x)(1.0./sqrt(x.^2+9.0).*besselk(1,sqrt(x.^2+9.0).*(4.0./2.5e+1)).*7.757157130525889e-1)./pi,-Inf,x1).*2.0).*4.925125628140704e+1-sqrt(2.0).*erfcinv(integral(@(x)(1.0./sqrt(x.^2+9.0).*besselk(1,sqrt(x.^2+9.0).*(4.0./2.5e+1)).*7.757157130525889e-1)./pi,-Inf,x2).*2.0).*4.974874371859296e+1).*(1.0./2.0)+(sqrt(2.0).*erfcinv(integral(@(x)(1.0./sqrt(x.^2+9.0).*besselk(1,sqrt(x.^2+9.0).*(4.0./2.5e+1)).*7.757157130525889e-1)./pi,-Inf,x2).*2.0).*(sqrt(2.0).*erfcinv(integral(@(x)(1.0./sqrt(x.^2+9.0).*besselk(1,sqrt(x.^2+9.0).*(4.0./2.5e+1)).*7.757157130525889e-1)./pi,-Inf,x1).*2.0).*4.974874371859296e+1-sqrt(2.0).*erfcinv(integral(@(x)(1.0./sqrt(x.^2+9.0).*besselk(1,sqrt(x.^2+9.0).*(4.0./2.5e+1)).*7.757157130525889e-1)./pi,-Inf,x2).*2.0).*4.925125628140704e+1))./2.0).*1.0./sqrt(x1.^2+9.0).*1.0./sqrt(x2.^2+9.0).*besselk(1,sqrt(x1.^2+9.0).*(4.0./2.5e+1)).*besselk(1,sqrt(x2.^2+9.0).*(4.0./2.5e+1)).*1.157788944723618e-1

x1_space = gpuArray.linspace(-10,10,25);
x2_space = gpuArray.linspace(-10,10,25);
[X Y] = meshgrid(x1_space,x2_space);
F = arrayfun(joint_pdf_function_handle, X, Y);
then I get the following error:
The function 'symengine' cannot be executed because it has resolved to the Pcode file: C:\Program Files\Polyspace\R2019a\toolbox\symbolic\symbolic\symengine.p
The thing is, if I run the code on the CPU like this, it works fine:
x1_space = linspace(-10,10,25);
x2_space = linspace(-10,10,25);
[X Y] = meshgrid(x1_space,x2_space);
F = arrayfun(joint_pdf_function_handle, X, Y);
I would appreciate any help you can provide.