# Optimizing season prediction from categorical purchase data

I am writing an algorithm to predict the season of purchase (winter, spring, summer or fall) from data like the following:

``````
df.head(4)

            shop   category  subcategory     season
date
2013-09-04  abc    weddings  shoes           winter
2013-09-04  def    jewelry   watches         summer
2013-09-05  ghi    sports    sneakers        spring
2013-09-05  jkl    jewelry   necklaces       fall
``````

The predictor variables are `shop`, `category` and `subcategory`, and the target variable is `season`.

I have two questions: 1) best practices for preprocessing, and 2) the best classification models for this type of problem.

1) Preprocessing: below is my code. I'm unsure whether I need one-hot encoding to handle the categorical variables properly:

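``````
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

le = LabelEncoder()
ss = StandardScaler()
# Dummy-encode every predictor column (all columns except the last, `season`)
X = pd.get_dummies(store_df.iloc[:, :-1], drop_first=True).values.astype('float')
# Encode the season labels as integer class ids
y = le.fit_transform(store_df.iloc[:, -1].values)
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit the scaler on the training split only, then apply it to the test split
xtrain = ss.fit_transform(xtrain)
xtest = ss.transform(xtest)
``````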

The shapes of the resulting arrays look correct:

``````
Training set: (67915, 1040), (67915,)
Testing set: (29107, 1040), (29107,)
``````

Would preprocessing benefit from one-hot encoding? What are the best practices here?
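To make the preprocessing question concrete, here is a minimal sketch on just the four sample rows shown above (the `sample` frame is rebuilt by hand, so it is an illustration rather than the real `store_df`). It compares the output of `pd.get_dummies` with and without `drop_first=True`: the call already produces one indicator column per category level, and `drop_first=True` removes one level per predictor.

```python
import pandas as pd

# The four sample rows from the question, rebuilt by hand for illustration.
sample = pd.DataFrame(
    {
        "shop": ["abc", "def", "ghi", "jkl"],
        "category": ["weddings", "jewelry", "sports", "jewelry"],
        "subcategory": ["shoes", "watches", "sneakers", "necklaces"],
        "season": ["winter", "summer", "spring", "fall"],
    },
    index=pd.to_datetime(["2013-09-04", "2013-09-04", "2013-09-05", "2013-09-05"]),
)
sample.index.name = "date"

features = sample.drop(columns="season")

# One indicator column per distinct value of each predictor (4 + 3 + 4 levels).
full = pd.get_dummies(features)
# drop_first=True drops the first level of each predictor as a reference level.
reduced = pd.get_dummies(features, drop_first=True)

print(full.shape)     # (4, 11)
print(reduced.shape)  # (4, 8)
print(list(reduced.columns))
```

On the real data the same logic would explain the 1040 columns: roughly the total number of distinct levels across the three predictors, minus one per predictor.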

2) Model selection: so far I have tried two classifiers, both of which reach about 66% accuracy on the test set (not ideal):

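Logistic regression:

``````
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# A very large C means almost no regularization
lr = LogisticRegression(C=100000)
lr.fit(xtrain, ytrain)
lr_pred = lr.predict(xtest)
lr_acc = accuracy_score(ytest, lr_pred)
``````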

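Random forest classifier:

``````
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 100 trees; each split considers 3 randomly chosen features
rfc = RandomForestClassifier(n_estimators=100, max_features=3)
rfc.fit(xtrain, ytrain)
rfc_pred = rfc.predict(xtest)
rfc_acc = accuracy_score(ytest, rfc_pred)
``````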
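Whether 66% is good or bad depends on how the four seasons are distributed, so a majority-class baseline is a useful reference point. The sketch below uses scikit-learn's `DummyClassifier`. The label arrays and their class proportions are invented for illustration and are not taken from the real data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels standing in for the encoded seasons (0..3); the class
# proportions are made up purely for illustration.
rng = np.random.default_rng(42)
proportions = [0.55, 0.20, 0.15, 0.10]
y_train_demo = rng.choice(4, size=700, p=proportions)
y_test_demo = rng.choice(4, size=300, p=proportions)

# The dummy model ignores the features, so a column of zeros is enough.
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

# Always predict the most frequent class seen during training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)
baseline_acc = accuracy_score(y_test_demo, baseline.predict(X_test_demo))
print(f"majority-class baseline accuracy: {baseline_acc:.3f}")
```

With the real `xtrain`, `ytrain`, `xtest` and `ytest` in place of the demo arrays, a baseline close to 66% would suggest the models are learning little beyond the class frequencies.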

I would expect several classification methods to work, provided my preprocessing is done correctly. Any pointers are welcome.
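As one possible starting point for comparing approaches, here is a minimal sketch that puts the encoding inside a scikit-learn pipeline, so that the encoder is fitted only on the training folds during cross-validation. The `toy` frame and the rule linking `category` to `season` are entirely made up, and `OneHotEncoder(handle_unknown="ignore")` is a swapped-in alternative to the `pd.get_dummies` call above, not the original preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the real data; all values are invented.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame(
    {
        "shop": rng.choice(["abc", "def", "ghi", "jkl"], size=n),
        "category": rng.choice(["weddings", "jewelry", "sports"], size=n),
        "subcategory": rng.choice(
            ["shoes", "watches", "sneakers", "necklaces"], size=n
        ),
    }
)

# Invented rule: the season mostly follows the category, with 20% noise.
season_by_category = {"weddings": "summer", "jewelry": "winter", "sports": "spring"}
toy["season"] = toy["category"].map(season_by_category)
noise = rng.random(n) < 0.2
toy.loc[noise, "season"] = rng.choice(
    ["winter", "spring", "summer", "fall"], size=int(noise.sum())
)

predictors = ["shop", "category", "subcategory"]

# handle_unknown="ignore" encodes levels unseen during fit as all zeros
# instead of raising an error at prediction time.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=1000),
)

# 5-fold cross-validated accuracy; the encoder is refitted inside each fold.
scores = cross_val_score(model, toy[predictors], toy["season"], cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```

Swapping `LogisticRegression` for another estimator in `make_pipeline` keeps the rest of the comparison unchanged, which makes it easier to try several classifiers under identical preprocessing.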