How do I split a dataset into training and test sets while retaining the proportions of a binary label (i.e., some drugs work, some don't)?

I have a dataset of drugs, their associated chemical features, and whether each drug is "responsive" or "unresponsive". I need to ensure that once I split the dataset into train and test sets, both have the same proportion of responsive:unresponsive. I know how to randomly split the data so that training is 80% and test is 20%, but I'm not sure how to do the stratified sampling necessary here. Is this what I'm meant to use: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html?

1 answer

  • answered 2022-05-04 13:48 Alex Serra Marrugat

    The train_test_split function already has a parameter that lets you keep the class proportions of y. The parameter is stratify, and it is defined in the documentation as "If not None, data is split in a stratified fashion, using this as the class labels".

    An example of code would be:

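    from sklearn.model_selection import train_test_split
    # stratify=y keeps the responsive:unresponsive ratio the same in both splits;
    # test_size=0.2 gives the 80/20 split from the question (the default is 0.25).
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)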
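
    To check that the split really preserves the class balance, a minimal sketch along the following lines may help. The data here is a hypothetical stand-in (100 drugs, 5 random features, 30 responsive and 70 unresponsive) and the names rng, train_ratio and test_ratio are illustrative only; random_state is set just to make the run repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data (not from the question): 100 drugs with
# 5 chemical features each; 30 responsive (1) and 70 unresponsive (0).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.array([1] * 30 + [0] * 70)

# 80/20 split, stratified on the binary label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fraction of responsive drugs in each split; with stratification
# these should match the fraction in the full dataset.
train_ratio = y_train.mean()
test_ratio = y_test.mean()
print(f"overall: {y.mean():.2f}  train: {train_ratio:.2f}  test: {test_ratio:.2f}")
```

    With these counts the classes divide evenly, so all three fractions should print as 0.30. StratifiedKFold, the class linked in the question, applies the same idea to cross-validation folds rather than to a single train/test split.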
    
