Python: GridSearchCV taking too long to finish running

I'm trying to run a grid search to optimize my model, but it's taking far too long to execute. My dataset has only about 15,000 observations and 30-40 variables. I was able to run a random forest through the grid search, which took about an hour and a half, but now that I've switched to SVC it has already run for over 9 hours and still hasn't finished. Below is a sample of my code for the cross-validation:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

SVM_Classifier = SVC(random_state=7)

param_grid = {'C': [0.1, 1, 10, 100],
              'gamma': [1,0.1,0.01,0.001],
              'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
              'degree' : [0, 1, 2, 3, 4, 5, 6]}

grid_obj = GridSearchCV(SVM_Classifier,
                        param_grid=param_grid,
                        scoring='roc_auc',
                        cv=3,
                        return_train_score=True,
                        n_jobs=-1)

grid_fit = grid_obj.fit(X_train, y_train)
SVMC_opt = grid_fit.best_estimator_

print('='*20)
print("best estimator: " + str(grid_obj.best_estimator_))
print("best params: " + str(grid_obj.best_params_))
print('best score:', grid_obj.best_score_)
print('='*20)

I have already reduced the number of cross-validation folds from 10 to 3, and I'm using n_jobs=-1 so all of my cores are engaged. Is there anything else I can do to speed up the process?

1 answer

  • answered 2022-05-03 15:08 user2246849

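    Unfortunately, SVC's fit time scales at least quadratically with the number of samples, so it really is extremely slow on a dataset this size. Even the documentation suggests using LinearSVC instead once you get into the tens of thousands of samples, and with 15,000 you are close to that ballpark.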
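
    As a rough illustration of the LinearSVC route, here is a minimal sketch. It runs on synthetic data from make_classification as a stand-in for your X_train and y_train, and the scaling step, dual=False, and max_iter=5000 are my own assumptions rather than anything from your setup. Note that this only covers the linear kernel, so it is not a drop-in replacement for the whole grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real data (assumption: binary target, ~30 features).
X_demo, y_demo = make_classification(n_samples=2000, n_features=30, random_state=7)

# Scaling helps the linear solver converge; dual=False is usually preferred
# when there are more samples than features.
linear_pipe = make_pipeline(
    StandardScaler(),
    LinearSVC(dual=False, max_iter=5000, random_state=7),
)

# make_pipeline names the step "linearsvc", hence the "linearsvc__C" key.
linear_grid = GridSearchCV(
    linear_pipe,
    param_grid={"linearsvc__C": [0.1, 1, 10, 100]},
    scoring="roc_auc",
    cv=3,
)  # add n_jobs=-1 here to use all cores, as in the question

linear_grid.fit(X_demo, y_demo)
print("best params:", linear_grid.best_params_)
print("best score:", linear_grid.best_score_)
```

    If the linear model scores close to what the kernel SVC gives you on a subsample, the expensive kernels may not be worth searching at all.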

    You could also try increasing the kernel cache_size (given in MB; the default is 200). I would suggest timing a single SVC fit with different cache sizes to see whether you gain anything.
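
    A minimal sketch of that timing experiment, again on synthetic data. The helper name time_svc_fit, the chosen cache sizes, and the fixed C and gamma values are placeholders of mine; on your machine you would pass X_train and y_train and stay within your available RAM.

```python
import time

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption).
X_demo, y_demo = make_classification(n_samples=2000, n_features=30, random_state=7)


def time_svc_fit(cache_size_mb, X, y):
    """Return wall-clock seconds for one SVC fit with the given cache size (MB)."""
    clf = SVC(kernel="rbf", C=1, gamma=0.01,
              cache_size=cache_size_mb, random_state=7)
    start = time.perf_counter()
    clf.fit(X, y)
    return time.perf_counter() - start


# Time the same fit with several cache sizes; 200 MB is the default.
timings = {size: time_svc_fit(size, X_demo, y_demo)
           for size in (200, 500, 1000, 2000)}

for size, seconds in timings.items():
    print(f"cache_size={size} MB: {seconds:.2f} s")
```

    On a small sample the differences will likely be noise, so the comparison is only meaningful on data of roughly the real size.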

    EDIT: By the way, you are needlessly computing a lot of SVC fits for different degree values with kernels that ignore that parameter (every kernel except poly). I suggest splitting the grid into one run for poly and one for the other kernels; you will save a lot of time.
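
    One way to do the split without running two separate searches is to pass GridSearchCV a list of dicts, which it treats as independent sub-grids. The sketch below only counts the candidates with ParameterGrid, so it runs instantly. The names original_grid and split_grid are mine, and I have additionally dropped gamma for the linear kernel, which does not use it.

```python
from sklearn.model_selection import ParameterGrid

# The grid from the question: every kernel is crossed with every degree.
original_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [1, 0.1, 0.01, 0.001],
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "degree": [0, 1, 2, 3, 4, 5, 6],
}

# Split grid: each kernel only gets the parameters it actually uses.
split_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10, 100]},
    {"kernel": ["rbf", "sigmoid"],
     "C": [0.1, 1, 10, 100],
     "gamma": [1, 0.1, 0.01, 0.001]},
    {"kernel": ["poly"],
     "C": [0.1, 1, 10, 100],
     "gamma": [1, 0.1, 0.01, 0.001],
     "degree": [0, 1, 2, 3, 4, 5, 6]},
]

n_original = len(ParameterGrid(original_grid))
n_split = len(ParameterGrid(split_grid))

cv = 3  # folds used in the question
print("original:", n_original, "candidates,", n_original * cv, "fits")  # 448, 1344
print("split:   ", n_split, "candidates,", n_split * cv, "fits")        # 148, 444
```

    The list can be passed straight to GridSearchCV as param_grid=split_grid; the rest of the call stays the same.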
