Should I restandardize training data during retraining?

I am running a simple Keras deep learning model which I will train once and then retrain every month as new data becomes available.

My data is made up of monetary values, so I will first standardize it using StandardScaler(). However, once new data comes in and I want to retrain, can I use the same StandardScaler object? Let's assume the new data contains a maximum datapoint higher than my current maximum, which would alter the standardization of the entire dataset.

Should I re-standardize, or can I use the same standardization for the new data?

1 answer

  • answered 2021-06-09 14:39 Waleed Aldhahi

    As I understand your question, when you retrain with new data, the inputs will differ from those used to compute the standardization parameters.

    In that case, the new inputs might fall outside the range of the values you originally standardized.

    But for a good predictive model, the training data and future data need to have similar distributions; otherwise, your model won't perform as expected.

    So I think it is best to re-standardize your training data. Also make sure the standardization parameters are computed from the training set alone and then applied to the validation set, i.e. use the training set's mean and standard deviation, not the validation set's:

    scaled_train = (train - train_mean) / train_std_deviation
    
    scaled_test = (test - train_mean) / train_std_deviation
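    The workflow above can be sketched with scikit-learn's StandardScaler. This is a minimal illustration with made-up monetary values: the scaler is fitted on the training data only, reused to transform the test data, and refitted from scratch when the enlarged training set arrives the following month.

    ```python
    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Illustrative monetary values (hypothetical data)
    train = np.array([[100.0], [250.0], [400.0]])
    test = np.array([[300.0], [500.0]])

    scaler = StandardScaler()
    scaled_train = scaler.fit_transform(train)  # fit on training data only
    scaled_test = scaler.transform(test)        # reuse the training mean/std

    # Next month: new data arrives, so refit on the enlarged training set
    new_train = np.vstack([train, [[900.0]]])
    scaler = StandardScaler()
    scaled_new_train = scaler.fit_transform(new_train)
    ```

    Refitting recomputes the mean and standard deviation over all training rows, so previously seen values get new scaled representations, which is why the model is retrained on the re-standardized data rather than fed a mix of old and new scalings.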