More efficient way of computing multiple FFTs with cuFFT than batching

According to the NVIDIA documentation, a batched cuFFT plan executes its transforms in parallel:

batch denotes the number of transforms that will be executed in parallel (

I want to perform a 2D FFT with 500 batches, and I noticed that the computing time of those FFTs depends almost linearly on the number of batches. This made me wonder whether the batches are really computed in parallel. A batched run of 500 FFTs, each 1500 by 1500 pixels, takes approximately 200 ms.
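For reference, the measurement described above could be reproduced with something like the following minimal sketch. It assumes single-precision complex-to-complex transforms done in place on contiguous data; the helper name `time_batched_fft` and the batch counts in `main` are hypothetical and chosen only for illustration (500 batches of 1500 by 1500 `cufftComplex` values need about 9 GB of device memory).

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error: %s\n",                       \
                         cudaGetErrorString(err_));                        \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

// Abort with the numeric status if a cuFFT call fails.
#define CUFFT_CHECK(call)                                                  \
    do {                                                                   \
        cufftResult res_ = (call);                                         \
        if (res_ != CUFFT_SUCCESS) {                                       \
            std::fprintf(stderr, "cuFFT error code %d\n", (int)res_);      \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

// Hypothetical helper: times one in-place batched 2D C2C forward FFT
// and returns the elapsed GPU time in milliseconds.
float time_batched_fft(int nx, int ny, int batch)
{
    cufftComplex *data = nullptr;
    size_t bytes = sizeof(cufftComplex) * (size_t)nx * ny * batch;
    CUDA_CHECK(cudaMalloc(&data, bytes));
    CUDA_CHECK(cudaMemset(data, 0, bytes));

    // Passing NULL for inembed/onembed selects the default contiguous
    // layout, so the stride and distance arguments are ignored.
    int n[2] = {nx, ny};
    cufftHandle plan;
    CUFFT_CHECK(cufftPlanMany(&plan, 2, n,
                              NULL, 1, nx * ny,
                              NULL, 1, nx * ny,
                              CUFFT_C2C, batch));

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    // Warm-up run so one-time initialization is not part of the timing.
    CUFFT_CHECK(cufftExecC2C(plan, data, data, CUFFT_FORWARD));
    CUDA_CHECK(cudaDeviceSynchronize());

    // Timed run on the default stream.
    CUDA_CHECK(cudaEventRecord(start));
    CUFFT_CHECK(cufftExecC2C(plan, data, data, CUFFT_FORWARD));
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUFFT_CHECK(cufftDestroy(plan));
    CUDA_CHECK(cudaFree(data));
    return ms;
}

int main()
{
    // Illustrative batch counts only; larger values need more device memory.
    const int batches[] = {1, 10, 50, 100};
    for (int b : batches) {
        float ms = time_batched_fft(1500, 1500, b);
        std::printf("batch = %3d : %.3f ms total, %.3f ms per transform\n",
                    b, ms, ms / b);
    }
    return 0;
}
```

This should build with something like `nvcc fft_batch.cu -lcufft`. Printing the time per transform next to the total makes the roughly linear scaling easy to see.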

When a large number of FFTs have to be run concurrently, is batching the best approach to reduce the computing time, or should I consider streams or some other method?
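To make the streamed alternative concrete, here is a rough sketch of what I have in mind, not a claim that it is any faster. It assumes one single-transform plan per CUDA stream (attached with `cufftSetStream`) and dispatches the transforms round-robin; the helper name `time_streamed_fft` and the stream count are hypothetical, and error checking is reduced to a single helper for brevity.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

// Minimal error helper: abort if the condition does not hold.
static void require(bool ok, const char *what)
{
    if (!ok) {
        std::fprintf(stderr, "failed: %s\n", what);
        std::exit(EXIT_FAILURE);
    }
}

// Hypothetical helper: runs `count` in-place 2D C2C forward FFTs spread
// round-robin over `num_streams` streams and returns the wall-clock
// time in milliseconds, measured on the host.
float time_streamed_fft(int nx, int ny, int count, int num_streams)
{
    size_t elems = (size_t)nx * ny;
    cufftComplex *data = nullptr;
    require(cudaMalloc(&data, sizeof(cufftComplex) * elems * count) == cudaSuccess,
            "cudaMalloc");
    require(cudaMemset(data, 0, sizeof(cufftComplex) * elems * count) == cudaSuccess,
            "cudaMemset");

    // One plan per stream, so each stream has its own work area.
    std::vector<cudaStream_t> streams(num_streams);
    std::vector<cufftHandle> plans(num_streams);
    for (int s = 0; s < num_streams; ++s) {
        require(cudaStreamCreate(&streams[s]) == cudaSuccess, "cudaStreamCreate");
        require(cufftPlan2d(&plans[s], nx, ny, CUFFT_C2C) == CUFFT_SUCCESS,
                "cufftPlan2d");
        require(cufftSetStream(plans[s], streams[s]) == CUFFT_SUCCESS,
                "cufftSetStream");
    }

    require(cudaDeviceSynchronize() == cudaSuccess, "sync before timing");
    auto t0 = std::chrono::steady_clock::now();

    // Transform i operates on its own slice of the buffer.
    for (int i = 0; i < count; ++i) {
        cufftComplex *slice = data + elems * i;
        require(cufftExecC2C(plans[i % num_streams], slice, slice,
                             CUFFT_FORWARD) == CUFFT_SUCCESS,
                "cufftExecC2C");
    }

    // Wait for all streams to finish before stopping the clock.
    require(cudaDeviceSynchronize() == cudaSuccess, "sync after launch");
    auto t1 = std::chrono::steady_clock::now();

    for (int s = 0; s < num_streams; ++s) {
        cufftDestroy(plans[s]);
        cudaStreamDestroy(streams[s]);
    }
    cudaFree(data);
    return std::chrono::duration<float, std::milli>(t1 - t0).count();
}

int main()
{
    // Illustrative values only: 50 transforms spread over 4 streams.
    float ms = time_streamed_fft(1500, 1500, 50, 4);
    std::printf("50 transforms on 4 streams: %.3f ms\n", ms);
    return 0;
}
```

Comparing this number against a single batched plan with the same transform count would show directly whether streams buy anything over batching for this problem size.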

I have not yet been able to find more detailed information in the NVIDIA documentation about how the batches are executed internally.