In-place & memory-efficient fft(gpuArray) calls for real input?

For a real NxM gpuArray, MATLAB's fft() returns a complex NxM gpuArray, which takes twice the memory of the input and is allocated in addition to it. This makes it impossible to transform gpuArrays that occupy close to the maximum available memory on the GPU.
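To make the footprint concrete, here is a minimal sketch of the situation. The sizes N and M and the variable names are only illustrative assumptions, and the byte counts are plain arithmetic for single precision rather than measured allocations (the actual usage may be higher because of FFT workspace).

```matlab
% Requires Parallel Computing Toolbox and a supported GPU.
N = 4096; M = 4096;                      % illustrative sizes (assumption)
x = rand(N, M, 'single', 'gpuArray');    % real single-precision input on the GPU

bytesIn  = N * M * 4;                    % real single: 4 bytes per element
bytesOut = N * M * 8;                    % complex single: 8 bytes per element
bytesR2C = (floor(N/2) + 1) * M * 8;     % size of a padded in-place R2C buffer

y = fft(x);                              % out-of-place: x and y coexist on the GPU

fprintf('input:            %.1f MiB\n', bytesIn  / 2^20);
fprintf('fft() output:     %.1f MiB\n', bytesOut / 2^20);
fprintf('in-place R2C buf: %.1f MiB\n', bytesR2C / 2^20);
```

So with fft() the peak is roughly three times the input size, while a padded in-place R2C transform would need only slightly more than the input itself.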

Is there a way to make an in-place (padded) R2C cuFFT call, with a complex (N/2+1) x M output, from MATLAB without writing a custom CUDA kernel or a mexcuda function?