How do I apply random undersampling and oversampling in StratifiedKFold cross-validation?

I am currently working on a classification task where I have to predict customer default using a dataset provided by LendingClub. For my first model I decided to test logistic regression using SGD.

I created this initial pipeline:

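from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

imputer = SimpleImputer(strategy = "median")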

scaler = StandardScaler()

model = SGDClassifier(loss='log',random_state=42,n_jobs=-1,warm_start=True)

pipeline_sgdlogreg = make_pipeline(imputer, scaler, model)

Defined my cross-validation strategy:

KF = StratifiedKFold(n_splits = 5)

And performed GridSearchCV:

grid_sgdlogreg = GridSearchCV(pipeline_sgdlogreg, param_grid_sgdlogreg,
                              scoring = 'roc_auc', pre_dispatch = 3,
                              n_jobs = -1, cv = KF, verbose = 5)

search = grid_sgdlogreg.fit(X_train, y_train)

Due to class imbalance, the model is severely lacking in both recall and precision, which makes sense.

I wanted to test out different sampling strategies. Consider this undersampling approach. I made new subsamples only on the training data:

X_train_subsample, y_train_subsample = rus.fit_resample(X_train, y_train)

Pipeline that includes RandomUnderSampler:

pipeline_sgdlogreg_rus = Pipeline([('rus', RandomUnderSampler(sampling_strategy = "majority", random_state = 42)),
                                   ('imputer', SimpleImputer(strategy = "median")),
                                   ('scaler', StandardScaler()),
                                   ('model', SGDClassifier(loss='log',random_state=42,n_jobs=-1,warm_start=True))])

Performed GridSearchCV again:

grid_sgdlogreg = GridSearchCV(pipeline_sgdlogreg_rus, param_grid_sgdlogreg_rus,
                              scoring='roc_auc', pre_dispatch=3, n_jobs=-1, cv=KF, verbose=5)

search = grid_sgdlogreg.fit(X_train_subsample, y_train_subsample)

What I would like to know is whether I am doing this correctly.

I have already dealt with outliers and label encoding before the split, and I want to make sure that the resampling is applied within every fold.

Do I need to split the data again, or does using RandomUnderSampler() in the pipeline do that automatically for every fold?

Thank You!
