How to use onnxruntime with Flask

I created a Flask server that runs an onnxruntime session from multiple threads.

With 3 threads, GPU memory usage stays below 8 GB and the program runs. With 4 threads, GPU memory usage would exceed 8 GB and the program fails with the error: onnxruntime::CudaCall CUBLAS failure 3: CUBLAS_STATUS_ALLOC_FAILED.

I know the problem is that the GPU runs out of memory, but I don't want the program to crash. I tried to limit the number of threads by setting intra_op_num_threads = 2, inter_op_num_threads = 2, or os.environ["OMP_NUM_THREADS"] = "2", but none of these worked.
I also tried 'gpu_mem_limit', which didn't work either.
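
For reference, here is a minimal sketch of how 'gpu_mem_limit' can be passed as a CUDA provider option. The helper name cuda_providers and the 2 GB value are illustrative assumptions, and whether tuple-style provider options are accepted depends on the onnxruntime version. As documented, this option only caps the execution provider's memory arena, so total device memory usage can still be higher.

```python
def cuda_providers(limit_bytes, device_id=0):
    """Build a providers list that caps the CUDA memory arena (hypothetical helper)."""
    cuda_options = {
        "device_id": device_id,
        # Upper bound, in bytes, for the CUDA execution provider's arena.
        "gpu_mem_limit": limit_bytes,
        # Grow the arena only by what is requested instead of by powers of two.
        "arena_extend_strategy": "kSameAsRequested",
    }
    return [("CUDAExecutionProvider", cuda_options)]


# Assumed limit of 2 GB; pick a value that suits the model.
providers = cuda_providers(2 * 1024**3)

# The list is then handed to the session, for example:
# sess = rt.InferenceSession(model_XXX, providers=providers)
```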

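import onnxruntime as rt
from flask import Flask, request
app = Flask(__name__)

# The session is created once, at import time, as a module-level object.
sess = rt.InferenceSession(model_XXX, providers=['CUDAExecutionProvider'])

@app.route('/algorithm', methods=['POST'])
def parser():
    # Each request thread calls run() on the module-level session.
    prediction = sess.run(...)

if __name__ == '__main__':
    # threaded=True handles each request in its own thread.
    app.run(host='127.0.0.1', port=12345, threaded=True)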

My understanding is that the Flask HTTP server may be using a different session for each call.
How can I make every call use the same onnxruntime session?
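
A possible sketch of one way to approach this, under stated assumptions: within a single Python process, request threads share module-level objects, so a session created at import time is the same object in every call. What grows with the thread count is then the number of concurrent run() calls. The wrapper below caps that number with a semaphore. BoundedSession and MAX_CONCURRENT_RUNS are hypothetical names, and the limit of 3 is taken from the observation above that 3 threads fit in memory.

```python
import threading

# Assumption: 3 concurrent runs fit in GPU memory, as observed above.
MAX_CONCURRENT_RUNS = 3


class BoundedSession:
    """Wrap one shared session and cap how many threads call run() at once."""

    def __init__(self, session, max_concurrent):
        self._session = session
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, *args, **kwargs):
        # Extra callers block here until a slot is released.
        with self._slots:
            return self._session.run(*args, **kwargs)


# Usage with the server code (not executed here):
# sess = BoundedSession(
#     rt.InferenceSession(model_XXX, providers=['CUDAExecutionProvider']),
#     MAX_CONCURRENT_RUNS,
# )
```

With this wrapper a fourth simultaneous request waits for a free slot instead of allocating more GPU memory; the route handler keeps calling sess.run(...) unchanged.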

System information

  • OS Platform and Distribution: Windows 10
  • ONNX Runtime version: 1.8
  • Python version: 3.7
  • GPU model and memory: RTX 3070, 8 GB