How to release the GPU memory used by Numba CUDA?

x_cpu, y_cpu and z_cpu are big numpy arrays of the same length. Result is the gridded result that reduces the x, y, z resolution and keeps only one point per grid cell. They cannot all fit into GPU memory together, so I divided x, y, z into several parts, but I still put the whole Result into GPU memory:

from numba import cuda
from math import ceil

SegmentSize = 1000000
Loops = ceil(len(x_cpu) / SegmentSize)  # number of segments needed to cover the input
Result = cuda.device_array((maxX-minX, maxY-minY))
for lopIdx in range(Loops):
    # copy the current segment of each input array onto the GPU
    x = cuda.to_device(x_cpu[lopIdx*SegmentSize:(lopIdx+1)*SegmentSize])
    y = cuda.to_device(y_cpu[lopIdx*SegmentSize:(lopIdx+1)*SegmentSize])
    z = cuda.to_device(z_cpu[lopIdx*SegmentSize:(lopIdx+1)*SegmentSize])
    CudaProc[blocks, 1024](x, y, z, Result)
    cuda.synchronize()
Result_CPU = Result.copy_to_host()

But when I did so, an unknown CUDA error was raised. I noticed that the occupied GPU memory kept rising. I think it is because, in the loop, new x, y, z arrays keep being written into GPU memory without the previous ones being released. I couldn't find much information about how to release the GPU memory. Can anyone help?

1 answer

  • answered 2020-07-30 09:46 talonmies

    You are pretty much at the mercy of standard Python object lifetime semantics and Numba internals (which are terribly documented) when it comes to GPU memory management in Numba. The best solution is probably to manage everything as explicitly as possible, which means not performing GPU object creation in things like loops unless you understand that its impact on performance and resource consumption will be trivial.
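
    If you really do need to give memory back mid-run, the usual pattern is to drop every Python reference to a device array (so ordinary reference counting can reclaim it) and then flush Numba's pending-deallocation queue. A minimal sketch follows, with the caveat that deallocations.clear() pokes at an internal attribute of the context object rather than a stable public API; the documented knob in this area is the cuda.defer_cleanup() context manager, which works in the opposite direction and postpones frees:

    import numpy as np
    from numba import cuda

    # allocate a device array, then release it explicitly
    a_gpu = cuda.to_device(np.zeros(1_000_000, dtype=np.float32))

    del a_gpu  # drop the only reference so the array becomes collectable
    # flush Numba's queue of pending frees immediately; 'deallocations'
    # is a Numba internal and may change between versions
    cuda.current_context().deallocations.clear()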

    I would suggest moving GPU array creation out of the loop:

    from numba import cuda
    from math import ceil
    
    SegmentSize = 1000000
    Loops = ceil(len(x_cpu) / SegmentSize)
    Result = cuda.device_array((maxX-minX, maxY-minY)) # you should explicitly pass a dtype here too
    x = cuda.device_array(SegmentSize, dtype=x_cpu.dtype) # typed explicitly from the host array
    y = cuda.device_array(SegmentSize, dtype=y_cpu.dtype) # typed explicitly from the host array
    z = cuda.device_array(SegmentSize, dtype=z_cpu.dtype) # typed explicitly from the host array
    
    for lopIdx in range(Loops):
        # reuse the preallocated device buffers instead of allocating new ones
        x.copy_to_device(x_cpu[lopIdx*SegmentSize:(lopIdx+1)*SegmentSize])
        y.copy_to_device(y_cpu[lopIdx*SegmentSize:(lopIdx+1)*SegmentSize])
        z.copy_to_device(z_cpu[lopIdx*SegmentSize:(lopIdx+1)*SegmentSize])
        CudaProc[blocks, 1024](x, y, z, Result)
        cuda.synchronize()
    Result_CPU = Result.copy_to_host()
    

    [ Code written in browser, never tested, use at own risk ]

    That way you ensure that the memory is only allocated once, and you reuse the same allocations through all the loop trips. You still don't have explicit control over when the intermediate arrays will be destroyed, but this approach prevents running out of memory within the loop.
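
    One further caveat with the preallocated buffers: if len(x_cpu) is not an exact multiple of SegmentSize, the final slice will be shorter than SegmentSize and copy_to_device will reject the shape mismatch. A minimal sketch of one way to handle the tail, assuming you can change CudaProc to accept the number of valid elements as an extra parameter (the n argument below is an assumption, not part of the question):

    for lopIdx in range(Loops):
        start = lopIdx * SegmentSize
        n = min(SegmentSize, len(x_cpu) - start) # may be smaller on the last trip
        # copy into views of the preallocated buffers so the shapes match
        x[:n].copy_to_device(x_cpu[start:start+n])
        y[:n].copy_to_device(y_cpu[start:start+n])
        z[:n].copy_to_device(z_cpu[start:start+n])
        # the kernel must ignore elements beyond n
        CudaProc[blocks, 1024](x, y, z, n, Result)
        cuda.synchronize()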