Python multiprocessing code running infinitely

I am trying to train 2 models concurrently using sklearn and Python's built-in multiprocessing library.

def train_model(model, X, y):
    model.fit(X, y)
    return model

from multiprocessing import Process

p1 = Process(target = train_model, args = (dt, X_train, y_train))
p2 = Process(target = train_model, args = (lr, X_train, y_train))

p1.start()
p2.start()

p1.join()
p2.join()

However, this code never finishes running, even though training the two models individually takes no more than a few seconds.

If my approach is wrong, how do I train 2 models in parallel?

Edit: Python version is 3.8.0. I am running this code on Jupyter Notebook on Windows 10.

Edit 2: The problem seems to lie with Jupyter Notebook. The same code runs without any problem on Google Colab.

Edit 3: I am now trying to run this code from my terminal:

dt = DecisionTreeClassifier(class_weight='balanced')
lr = LogisticRegression(class_weight='balanced')


def train_model(model, X, y):
    model.fit(X, y)
    return model


p1 = Process(target=train_model, args=(dt, X_train, y_train))
p2 = Process(target=train_model, args=(lr, X_train, y_train))

if __name__ == '__main__':
    p1.start()
    p2.start()
    p1.join()
    p2.join()

    dt_pred = dt.predict(X_test)
    lr_pred = lr.predict(X_test)

    print("Classification report for Decision Tree:",classification_report(y_test,dt_pred))
    print("Classification report for Logistic Regression", classification_report(y_test, lr_pred))

and I get the following error:

Traceback (most recent call last):
  File "D:/Bennett/HPC/E19CSE058_Lab3/E19CSE058_Lab3_Pt2.py", line 33, in <module>
    dt_pred = dt.predict(X_test)
  File "E:\Anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 436, in predict
    check_is_fitted(self)
  File "E:\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "E:\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 1041, in check_is_fitted
    raise NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.exceptions.NotFittedError: This DecisionTreeClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

It seems the training done through multiprocessing isn't being reflected outside the processes. How do I counter this?

1 answer

  • answered 2022-02-22 04:40 Tim Roberts

    Aaron has the right answer. On Windows, each new process starts by running your script over from the beginning, so unguarded code that starts processes runs again in every child, which tries to launch two more processes, each of which tries to launch two more, and so on. Anything that must run ONLY in the main process needs to be protected by the "__main__" test:

    from multiprocessing import Process
    
    def train_model(model, X, y):
        model.fit(X, y)
        return model
    
    def main():
        # dt, lr, X_train and y_train are assumed to be defined as in the question.
        p1 = Process(target = train_model, args = (dt, X_train, y_train))
        p2 = Process(target = train_model, args = (lr, X_train, y_train))
    
        p1.start()
        p2.start()
    
        p1.join()
        p2.join()
    
    if __name__ == "__main__":
        main()
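
Regarding Edit 3: each Process works on its own copy of the model, so the fit never reaches dt and lr in the parent, and the return value of a Process target is discarded. One way to get the fitted models back is to let a process pool return them. The following is a minimal sketch rather than a drop-in fix: ToyModel is a hypothetical stand-in for a scikit-learn estimator (so the example runs without extra dependencies), and train_in_parallel is a helper name made up for illustration.

```python
from concurrent.futures import ProcessPoolExecutor


class ToyModel:
    """Hypothetical stand-in for a scikit-learn estimator (any picklable object with fit)."""

    def __init__(self, name):
        self.name = name
        self.mean_ = None  # set by fit(), similar to sklearn's trailing-underscore attributes

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self


def train_model(model, X, y):
    # Runs in a child process on a copy of `model`, not on the parent's object.
    model.fit(X, y)
    return model  # the fitted copy is pickled and sent back to the parent


def train_in_parallel(models, X, y):
    # Submit one task per model; result() blocks until that fitted copy arrives.
    with ProcessPoolExecutor(max_workers=len(models)) as executor:
        futures = [executor.submit(train_model, m, X, y) for m in models]
        return [f.result() for f in futures]


if __name__ == "__main__":
    # Toy data, made up for this sketch.
    X_train, y_train = [[0], [1], [2], [3]], [0, 0, 1, 1]
    dt, lr = ToyModel("dt"), ToyModel("lr")
    # Rebind the names to the fitted copies returned by the workers.
    dt, lr = train_in_parallel([dt, lr], X_train, y_train)
    print(dt.mean_, lr.mean_)
```

With real estimators, the call would presumably look like dt, lr = train_in_parallel([dt, lr], X_train, y_train), followed by the predict calls from the question. In Jupyter on Windows, the worker function generally has to live in an importable .py file, because child processes cannot look up functions defined in notebook cells.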
    
