GridSearchCV fit fails with ValueError and TypeError
Full code here: https://github.com/JYjunyang/FYPDEMO
I'm trying to do sentiment analysis on a dataset and run a grid search cross-validation before fitting an SVM model.
However, the call to gridsearchcv.fit() fails. It raises a
ValueError: setting an array element with a sequence
and
TypeError: float() argument must be a string or a real number, not 'list'
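To make these concrete, here is a minimal sketch of the two failure modes as I understand them (hypothetical toy data, not my dataset):

import numpy as np

row = [['some', 'tokens'], 631, 0.35]  # a list-valued cell next to plain numbers
np.array([row], dtype=np.float64)      # raises the ValueError above
float(['some', 'tokens'])              # raises the TypeError above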
I printed the data out to check it, but it looks fine to me.
Dataset target:
0 0.350000
1 0.000000
2 0.416667
3 0.357143
4 0.700000
...
995 0.500000
996 0.000000
997 -0.352857
998 0.000000
999 0.000000
Name: Sentiment, Length: 1000, dtype: float64
Dataset samples & features:
id full_text retweet_count Sentiment
0 1.240000e+18 [ebs_the_great, lot, would, actually, benefit,... 631 0.350000
1 1.240000e+18 [damii_aros, mayorkun, somewhere, studio, sing... 2568 0.000000
2 1.240000e+18 [iam_erhnehst, everything, fine, world, liverp... 1004 0.416667
3 1.240000e+18 [chandlerriggs, here, deleted, scene, twds, sp... 12631 0.357143
4 1.240000e+18 [realsaavedra, good, came, china] 159 0.700000
..           ...                                                ...    ...       ...
995 1.240000e+18 [candicebenbow, generation, z, want, name, fol... 103642 0.500000
996 1.240000e+18 [biancaixvi, corona, day, feel, like, sunday, ... 89753 0.000000
997 1.240000e+18 [ioproducer, nasty, flu, went, round, december... 48755 -0.352857
998 1.240000e+18 [projectplase, everyone, need, food, corona, c... 6 0.000000
999 1.240000e+18 [goncalvsrafa, corona, coming, dont, wash, han... 132 0.000000
This is the code that runs everything up to and including the grid search:
import re
import nltk
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

def TriggerCleaning():
    # Clean, tokenize, remove stopwords and lemmatize the text column
    dfOri['full_text'] = dfOri['full_text'].apply(Cleaning)
    dfOri['full_text'] = dfOri['full_text'].apply(nltk.word_tokenize)
    dfOri['full_text'] = dfOri['full_text'].apply(lambda x: [item for item in x if item not in stopwords])
    dfOri['full_text'] = dfOri['full_text'].apply(lemmatization)
    # Show the tokenized table in the GUI
    tableAfterToken = Table(datasetsResults, dataframe=dfOri, width=300, editable=True)
    TokenizedLabel.grid(column=0, row=6)
    tableAfterToken.show()
    buttonFE.grid(column=0, row=8, pady=5)
    print(dfOri)
    print(dfOri['Sentiment'])
    # Splitting the data starts here
    param_grid = {'gamma': [0.001, 0.01, 0.1, 1, 10, 100],
                  'C': [0.001, 0.01, 0.1, 1, 10, 100]}
    grid_search = GridSearchCV(SVC(), param_grid, cv=5, error_score='raise')
    X_train, X_test, y_train, y_test = train_test_split(dfOri, dfOri['Sentiment'], random_state=10)
    grid_search.fit(X_train, y_train)  # <- this is the call that fails
    print("Score on test set: {:.2f}".format(grid_search.score(X_test, y_test)))
    print("Best parameters: {}".format(grid_search.best_params_))
    print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))
def Cleaning(text):
    text = text.replace('RT', '')        # drop the retweet marker
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # strip punctuation
    text = re.sub(r'\W*\d\w', '', text)  # strip a digit and its neighbouring characters
    text = re.sub(r'[0-9]+', '', text)   # strip remaining digits
    return text
def lemmatization(text):
    return [wn.lemmatize(word) for word in text]
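One extra check that might help narrow things down (my own addition, assuming dfOri as above): look at what fit() actually receives.

print(dfOri.dtypes)                       # full_text comes out as object, not numeric
print(type(dfOri['full_text'].iloc[0]))   # <class 'list'> after the tokenization step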
I searched the internet for this problem, and the answers say it is caused by arrays with inconsistent dimensions. But if that were the cause, tokenizing the sentences into lists of different lengths would surely trigger it, yet other tutorials on the internet use exactly this approach without any problem. I wonder what is going on and how to solve it.
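For reference, what those tutorials seem to do differently is convert the text into a numeric matrix (for example with TfidfVectorizer) before calling fit, instead of passing the raw token lists. Below is a minimal sketch of that step; the TfidfVectorizer and the binarised labels are my assumptions, not code from my project:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

texts = dfOri['full_text'].apply(' '.join)    # token lists back into single strings
X = TfidfVectorizer().fit_transform(texts)    # sparse numeric matrix, one row per tweet
y = (dfOri['Sentiment'] > 0).astype(int)      # SVC is a classifier, so the continuous
                                              # polarity scores are binarised here

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)
param_grid = {'gamma': [0.001, 0.01, 0.1, 1, 10, 100],
              'C': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)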