Shop name classification
I have a list of merchant names and their corresponding Merchant Category Codes (MCC). About 80 percent of the MCC labels appear to be correct. There are about 300 distinct MCCs. A merchant name may contain one, two, or three words. I need to predict the MCC from the merchant name. How can I do that?
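A common baseline for this kind of short-text classification is character n-gram TF-IDF features fed to a linear classifier; character n-grams cope well with one-to-three-word names, abbreviations, and misspellings. This is a sketch under assumptions, not a prescription: the merchant names and MCC codes below are invented, and the ~20% label noise is simply tolerated (linear models are usually fairly robust to it).

```python
# Minimal baseline sketch: MCC prediction as short-text classification.
# All merchant names and MCC labels here are made-up examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy training data: merchant name -> MCC (hypothetical labels)
names = ["joes pizza", "pizza palace", "city taxi", "metro taxi cab",
         "corner pharmacy", "main st pharmacy"]
mccs  = ["5812", "5812", "4121", "4121", "5912", "5912"]

model = make_pipeline(
    # character 2-4-grams within word boundaries are robust to
    # abbreviations and spelling variants in merchant names
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(names, mccs)

print(model.predict(["downtown pizza"])[0])
```

With ~300 classes the same pipeline shape applies; a held-out evaluation split is worth adding to see how far the noisy labels hurt.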
See also questions close to this topic
R: How can I add titles based on grouping variable in word_associate?
I am using the word_associate package in R Markdown to create word clouds across a grouping variable with multiple categories. I would like the titles of each word cloud to be drawn from the character values of the grouping variable.
I have added trans_cloud(title=TRUE) to my code, but have not been able to resolve my problem. Here's my code, which runs but doesn't produce graphs with titles:
```r
library(qdap)

word_associate(df$text,
               match.string = c("cat"),
               grouping.var = c(df$id),
               text.unit = "sentence",
               stopwords = c(Top200Words),
               wordcloud = TRUE,
               cloud.colors = c("#0000ff", "#FF0000"),
               trans_cloud(title = TRUE))
```
I have also tried the following, which does not run:
```r
library(qdap)

word_associate(df$text,
               match.string = c("cat"),
               grouping.var = c(df$id),
               text.unit = "sentence",
               stopwords = c(Top200Words),
               wordcloud = TRUE,
               cloud.colors = c("#0000ff", "#FF0000"),
               title = TRUE)
```
Can anyone help me figure this out? I can't find any guidance in the documentation, and there are hardly any examples of or discussions about word_associate on the web.
Here's an example data frame that reproduces the problem:
```
id         text
question1  I love cats even though I'm allergic to them.
question1  I hate cats because I'm allergic to them.
question1  Cats are funny, cute, and sweet.
question1  Cats are mean and they scratch me.
question2  I only pet cats when I have my epipen nearby.
question2  I avoid petting cats at all cost.
question2  I visit a cat cafe every week. They have 100 cats.
question2  I tried to pet my friend's cat and it attacked me.
```
Note that if I run this in R (instead of R Markdown), the figures automatically print "question1_list1" and "question2_list1" in bright blue at the top of the figure file. This doesn't work for me because I need the titles to exclude "_list1" and to be written in black. These automatically generated titles do not respond to changes in my trans_cloud specifications. For example:
```r
library(qdap)

word_associate(df$text,
               match.string = c("cat"),
               grouping.var = c(df$id),
               text.unit = "sentence",
               stopwords = c(Top200Words),
               wordcloud = TRUE,
               cloud.colors = c("#0000ff", "#FF0000"),
               trans_cloud(title = TRUE, title.color = "#000000"))
```
In addition, I'm locked in to using this package (as opposed to other options for creating word clouds in R) because I'm using it to create network plots, too.
kwic() function returns less rows than it should
I'm currently trying to perform a sentiment analysis on a kwic object, but I'm afraid that the kwic() function does not return all the rows it should. I'm not quite sure what exactly the issue is, which makes it hard to post a reproducible example, so I hope that a detailed explanation of what I'm trying to do will suffice.
I subsetted the original dataset containing speeches I want to analyze to a new data frame that only includes speeches mentioning certain keywords. I used the following code to create this subset:
```r
ostalgie_cluster <- full_data %>%
  filter(grepl('Schwester Agnes|Intershop|Interflug|Trabant|Trabi|Ostalgie',
               speechContent,
               ignore.case = TRUE))
```
The resulting data frame consists of 201 observations. When I run kwic() on the same initial dataset using the following code, however, it returns a data frame with only 82 observations. Does anyone know what might cause this? Again, I'm sorry I can't provide a reproducible example; when I try to create a reprex from scratch, it just... works.
```r
# create quanteda corpus object
qtd_speeches_corp <- corpus(full_data,
                            docid_field = "id",
                            text_field = "speechContent")

# tokenize speeches
qtd_tokens <- tokens(qtd_speeches_corp,
                     remove_punct = TRUE,
                     remove_numbers = TRUE,
                     remove_symbols = TRUE,
                     padding = FALSE) %>%
  tokens_remove(stopwords("de"), padding = FALSE) %>%
  tokens_compound(pattern = phrase(c("Schwester Agnes")), concatenator = " ")

ostalgie_words <- c("Schwester Agnes", "Intershop", "Interflug",
                    "Trabant", "Trabi", "Ostalgie")

test_kwic <- kwic(qtd_tokens, pattern = ostalgie_words, window = 5)
```
How to integrate an email classification model into Outlook?
Let's say I built an email classification model which classifies email into two classes, A and B. Is it possible to link this model to Outlook and automate the classification, so that whenever I receive a new email it gets moved to folder A or B in Outlook?
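In principle, yes. One pattern (a sketch under assumptions: desktop Outlook on Windows with the pywin32 package installed; classify_email below is a hypothetical stand-in for the trained model) is to read the inbox through Outlook's COM interface and move each message with MailItem.Move:

```python
# Sketch: route incoming mail into folder "A" or "B" based on a classifier.
# classify_email is a hypothetical placeholder for the trained model; the
# COM wiring assumes desktop Outlook on Windows with pywin32 installed.

def classify_email(subject, body):
    # stand-in for model.predict(...); returns "A" or "B"
    return "A" if "invoice" in (subject + body).lower() else "B"

def route_messages(messages, folders):
    """Move each message to the folder chosen by the classifier.

    `messages` is an iterable of objects with .Subject, .Body and
    .Move(folder); `folders` maps class labels to Outlook folder objects.
    """
    routed = []
    for msg in messages:
        label = classify_email(msg.Subject, msg.Body)
        msg.Move(folders[label])  # MailItem.Move is the Outlook COM method
        routed.append(label)
    return routed

def run_on_outlook():
    # Real wiring (requires Windows + pywin32; not exercised here):
    import win32com.client
    ns = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
    inbox = ns.GetDefaultFolder(6)  # 6 = olFolderInbox
    folders = {"A": inbox.Folders["A"], "B": inbox.Folders["B"]}
    return route_messages(list(inbox.Items), folders)
```

For Microsoft 365 / Outlook on the web, the Microsoft Graph API (mail message endpoints plus change-notification subscriptions) is the server-side alternative to the COM approach.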
Which text classification model will suit a multi-class dataset with a large number of labels?
I have a dataset where there is a single sentence as input and a single label as output. There are over 40 different class labels with each label having a certain important keyword. For example:
"This phone is most durable in the market"
Here "durable" is the keyword, and the sentence has label X.
So far I have tried SVM, but to no use; it fails to classify well. What would be a good model for classifying a multi-class dataset with a large number of labels?
Naive Bayes TfidfVectorizer predicts everything to one class
I'm trying to run a Multinomial Naive Bayes classifier on various balanced data sets and to compare 2 different vectorizers: TfidfVectorizer and CountVectorizer. I have 3 classes: NEG, NEU and POS. I have 10000 documents: the NEG class has 2474, NEU 5894 and POS 1632. Out of those I have made 3 differently balanced data sets:
```
dataset                 NEU    NEG    POS    total
NEU-balanced dataset    5894   2474   1632   10000
NEG-balanced dataset    2474   2474   1632    6580
POS-balanced dataset    1632   1632   1632    4896
```
The problem occurs when I try to classify: everything is okay on every dataset except NEU. When I classify the NEU-balanced dataset with CountVectorizer it runs okay. Here is the confusion matrix:
[[ 231 247 17]
[ 104 1004 71]
[ 24 211 91]]
But when I use TfidfVectorizer, the model predicts everything as the NEU class:
[[ 1 494 0]
[ 0 1179 0]
[ 0 326 0]]
Here is some of my code:
```python
sentences = svietimas_data['text']
y = svietimas_data['sentiment']

#vectorizer = CountVectorizer()
vectorizer = TfidfVectorizer(lowercase=False)
vectorizer.fit(sentences)
sentences = vectorizer.transform(sentences)

X_train, X_test, y_train, y_test = train_test_split(
    sentences, y, test_size=0.2, random_state=42, stratify=y)

classifier = MultinomialNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```
My guess is that this happens because the NEU-balanced dataset is the worst balanced of the three. But then why does the model predict fine with CountVectorizer?
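Not a diagnosis of the run above, but one hedged thing to try: scikit-learn's ComplementNB is a MultinomialNB variant designed for imbalanced classes, and the library's documentation suggests it as the better pairing with TF-IDF-style features. It is a drop-in swap in the same pipeline shape; the three-class data below is invented for illustration:

```python
# Sketch: ComplementNB as a drop-in replacement for MultinomialNB on
# imbalanced TF-IDF data. Toy sentiment-like documents, invented labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB

texts = ["good great fine", "bad awful poor", "okay average so so",
         "great good", "poor bad", "average okay", "so so okay average"]
y = ["POS", "NEG", "NEU", "POS", "NEG", "NEU", "NEU"]  # NEU is oversized

vec = TfidfVectorizer(lowercase=False)
X = vec.fit_transform(texts)

clf = ComplementNB().fit(X, y)
print(clf.predict(vec.transform(["good great", "awful bad"])))
```

If MultinomialNB has to stay, comparing clf.class_log_prior_ against the per-document likelihood contributions can show whether TF-IDF's small feature values are letting the majority-class prior dominate every prediction.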