Which text classification model will suit a multi-class dataset with a large number of labels?
I have a dataset where the input is a single sentence and the output is a single label. There are over 40 class labels, and each label is associated with a certain important keyword. For example:
This phone is most durable in the market
Here "durable" is the keyword, and the sentence has label X.
So far I have tried an SVM, but to no avail; it fails to classify well. What would be a good model for a multi-class dataset with a large number of labels?
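For reference, a strong baseline for this kind of problem is a linear classifier over TF-IDF n-gram features, which tends to pick up on label-specific keywords like "durable". A minimal sketch, assuming `sentences` and `labels` are lists holding your data (those names are illustrative, not from the question):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# sentences: list of input sentences; labels: list of 40+ class labels
X_train, X_test, y_train, y_test = train_test_split(
    sentences, labels, test_size=0.2, stratify=labels, random_state=42)

# Word unigrams and bigrams let the model latch onto label-specific keywords
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LinearSVC(class_weight="balanced"))
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))

If this baseline also fails, the per-class report above usually shows whether the problem is a few confusable labels or too little data per label.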
See also questions close to this topic
-
How do I disable the Debian Python path/recursion limit?
Lately I've been running into path length limit and recursion limit issues, so I really need to know how to disable these.
I can't even install modules like discord.py!
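For the recursion side, Python's limit can be raised at runtime; a minimal sketch (note this only affects the interpreter, while path length limits come from the OS or filesystem, not from Python):

import sys

print(sys.getrecursionlimit())  # CPython's default is usually 1000
sys.setrecursionlimit(10_000)   # raise it; truly unbounded recursion can
                                # still crash the process, so use with care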
-
TypeError: 'float' object cannot be interpreted as an integer on linspace
TypeError                                 Traceback (most recent call last)
d:\website\SpeechProcessForMachineLearning-master\SpeechProcessForMachineLearning-master\speech_process.ipynb Cell 15' in <cell line: 1>()
----> 1 plot_freq(signal, sample_rate)

d:\website\SpeechProcessForMachineLearning-master\SpeechProcessForMachineLearning-master\speech_process.ipynb Cell 10' in plot_freq(signal, sample_rate, fft_size)
      2 def plot_freq(signal, sample_rate, fft_size=512):
      3     xf = np.fft.rfft(signal, fft_size) / fft_size
----> 4     freq = np.linspace(0, sample_rate/2, fft_size/2 + 1)
      5     xfp = 20 * np.log10(np.clip(np.abs(xf), 1e-20, 1e100))
      6     plt.figure(figsize=(20, 5))

File <__array_function__ internals>:5, in linspace(*args, **kwargs)

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\numpy\core\function_base.py:120, in linspace(start, stop, num, endpoint, retstep, dtype, axis)
     23 @array_function_dispatch(_linspace_dispatcher)
     24 def linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None,
     25              axis=0):
     26     """
     27     Return evenly spaced numbers over a specified interval.
    (...)
    118
    119     """
--> 120     num = operator.index(num)
    121     if num < 0:
    122         raise ValueError("Number of samples, %s, must be non-negative." % num)

TypeError: 'float' object cannot be interpreted as an integer
What is the solution to this problem?
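The third argument to np.linspace (num) must be an integer, but fft_size/2 + 1 is a float under Python 3's true division. Rewriting the offending line with floor division should fix it:

freq = np.linspace(0, sample_rate / 2, fft_size // 2 + 1)  # // keeps num an int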
-
IndexError: list index out of range with api
all_currencies = currency_api('latest', 'currencies')  # {'eur': 'Euro', 'usd': 'United States dollar', ...}
all_currencies.pop('brl')
qtd_moedas = len(all_currencies)
texto = f'{qtd_moedas} Moedas encontradas\n\n'
moedas_importantes = ['usd', 'eur', 'gbp', 'chf', 'jpy', 'rub', 'aud', 'cad', 'ars']
while len(moedas_importantes) != 0:
    for codigo, moeda in all_currencies.items():
        if codigo == moedas_importantes[0]:
            cotacao, data = currency_api('latest', f'currencies/{codigo}/brl')['brl'], currency_api('latest', f'currencies/{codigo}/brl')['date']
            texto += f'{moeda} ({codigo.upper()}) = R$ {cotacao} [{data}]\n'
            moedas_importantes.remove(codigo)
        if len(moedas_importantes) == 0:
            break  # WITHOUT THIS LINE, GIVES ERROR
Why am I getting this error? The list really does run out of elements, but the code only works with that final if/break.
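The likely cause: once moedas_importantes becomes empty inside the for loop, the next iteration still evaluates moedas_importantes[0], which raises IndexError on an empty list; the break exits just before that happens. A restructured sketch that sidesteps the problem entirely (it also calls the API once per currency instead of twice; currency_api and the other names are taken from the question):

# Iterate over the wanted codes directly, so nothing indexes a list
# that is being emptied while it is scanned
for codigo in moedas_importantes:
    if codigo in all_currencies:
        dados = currency_api('latest', f'currencies/{codigo}/brl')  # one call
        texto += (f'{all_currencies[codigo]} ({codigo.upper()}) '
                  f'= R$ {dados["brl"]} [{dados["date"]}]\n')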
-
Training an ML model on two different datasets before using test data?
I have the task of using a CNN for facial recognition: classifying faces into different classes of people, with each individual person being their own class. The training data I am given is very limited; I only have one image per class, and there are 100 classes (so 100 images in total, one per person).

The approach I am using is transfer learning with the GoogLeNet architecture. However, instead of training GoogLeNet only on the images of the people I have been given, I want to first train it on a separate, larger set of different face images, so that by the time I train it on my own data the model has already learned the features it needs to classify faces in general. Does this make sense, and will it work?

Using Matlab, I have so far replaced the fully connected layer and the classification layer and trained the network on the Yale Face Database, which consists of 15 classes. I achieved 91% validation accuracy with this database. Now I want to retrain this saved model on my provided data (100 classes with one image each). What would I have to do to the saved model to train it on this new dataset without losing the features it learned from the Yale database? Do I just replace the last fully connected and classification layers again and retrain? Would that be pointless and make me lose all of the earlier progress, i.e. will it create new weights from scratch, or will it use the previously learned weights to fit my new dataset even better? Or should I train the model on my training data and the Yale database all at once?

I have a separate set of test data, for which I do not have the labels; this is what the final model will be tested on to give me my score/grade. Please help me understand whether what I'm proposing is viable; I'm confused and would appreciate being pointed in the right direction.
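The question is about Matlab, but the mechanics may be easier to see in a short sketch. Here is the same workflow expressed in Keras terms, purely as an illustration (the file name, layer index, and optimizer are assumptions): replacing only the head keeps the pretrained weights, freezing the body guarantees they are not overwritten, and the new Dense layer is the only part initialized from scratch.

from tensorflow import keras

# Hypothetical file name standing in for the saved Yale-trained network
base = keras.models.load_model("yale_pretrained.h5")

# Keep everything up to (but not including) the old classification head;
# which layer index marks the head depends on the architecture
body = keras.Model(base.input, base.layers[-2].output)
body.trainable = False  # freeze the learned face features

outputs = keras.layers.Dense(100, activation="softmax")(body.output)  # new 100-class head
model = keras.Model(body.input, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

The analogous step in Matlab's transfer-learning workflow is replacing the last fully connected and classification layers while keeping (and optionally freezing) the earlier layers: retraining the head does not discard the previously learned features, because only the replaced layers start from fresh weights.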
-
What's the best way to select variables in a random forest model?
I am training RF models in R. What is the best way of selecting variables for my models? The datasets are pretty big; each has around 120 variables in total. I know there is a cross-validation approach to selecting variables for other classification algorithms such as KNN. Is there a similar approach for variable selection or parameter tuning when training RF models?
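The question is about R, but the usual idea is language-agnostic: rank variables by forest importance, then cross-validate over subset sizes. A sketch in scikit-learn terms, with X and y standing in for the ~120-variable data (all parameter values are illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

rf = RandomForestClassifier(n_estimators=500, random_state=42)
selector = RFECV(rf, step=10, cv=5)  # drop 10 variables per elimination round
selector.fit(X, y)                   # X: feature matrix, y: class labels
print(selector.n_features_)          # cross-validated choice of variable count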
-
How would I put my own dataset into this code?
I have been looking at a TensorFlow tutorial on unsupervised learning, and I'd like to plug in my own dataset; the code currently uses the MNIST dataset. I know how to create my own datasets in TensorFlow, but I have trouble adapting the code used here to my own data. I am pretty new to TensorFlow. The file paths to my dataset in my project are
\data\training
and \data\test-val\
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# TensorFlow ≥2.0-preview is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

# Common imports
import numpy as np
import os

(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full.astype(np.float32) / 255
X_test = X_test.astype(np.float32) / 255
X_train, X_valid = X_train_full[:-5000], X_train_full[-5000:]
y_train, y_valid = y_train_full[:-5000], y_train_full[-5000:]

def rounded_accuracy(y_true, y_pred):
    return keras.metrics.binary_accuracy(tf.round(y_true), tf.round(y_pred))

tf.random.set_seed(42)
np.random.seed(42)

conv_encoder = keras.models.Sequential([
    keras.layers.Reshape([28, 28, 1], input_shape=[28, 28]),
    keras.layers.Conv2D(16, kernel_size=3, padding="SAME", activation="selu"),
    keras.layers.MaxPool2D(pool_size=2),
    keras.layers.Conv2D(32, kernel_size=3, padding="SAME", activation="selu"),
    keras.layers.MaxPool2D(pool_size=2),
    keras.layers.Conv2D(64, kernel_size=3, padding="SAME", activation="selu"),
    keras.layers.MaxPool2D(pool_size=2)
])
conv_decoder = keras.models.Sequential([
    keras.layers.Conv2DTranspose(32, kernel_size=3, strides=2, padding="VALID", activation="selu", input_shape=[3, 3, 64]),
    keras.layers.Conv2DTranspose(16, kernel_size=3, strides=2, padding="SAME", activation="selu"),
    keras.layers.Conv2DTranspose(1, kernel_size=3, strides=2, padding="SAME", activation="sigmoid"),
    keras.layers.Reshape([28, 28])
])
conv_ae = keras.models.Sequential([conv_encoder, conv_decoder])

conv_ae.compile(loss="binary_crossentropy", optimizer=keras.optimizers.SGD(lr=1.0), metrics=[rounded_accuracy])
history = conv_ae.fit(X_train, X_train, epochs=5, validation_data=[X_valid, X_valid])
conv_encoder.summary()
conv_decoder.summary()
conv_ae.save("\models")
Do note that I got this code from another StackOverflow answer.
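A sketch of how the fashion_mnist loading lines could be swapped for a directory loader (keras.utils.image_dataset_from_directory exists in recent TensorFlow releases; the grayscale 28x28 settings are assumptions chosen to match the autoencoder's input shape, so adjust them to your images):

import tensorflow as tf
from tensorflow import keras

train_ds = keras.utils.image_dataset_from_directory(
    r"\data\training", labels=None, color_mode="grayscale",
    image_size=(28, 28))

def to_pair(x):
    x = tf.squeeze(x, axis=-1) / 255.0  # (batch, 28, 28), scaled to [0, 1]
    return x, x                          # an autoencoder reconstructs its own input

conv_ae.fit(train_ds.map(to_pair), epochs=5)

The \data\test-val\ folder can be loaded the same way and passed as validation_data.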
-
Shop name classification
I have a list of merchant names and their corresponding Merchant Category Codes (MCC). It seems that about 80 percent of the MCC labels are correct. The total number of MCCs is about 300. A merchant name may contain one, two, or three words. I need to predict the MCC from the merchant name. How can I do that?
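One common starting point: short, noisy shop names often classify better with character n-grams than with whole words. A minimal sketch, where merchant_names and mcc_labels are illustrative stand-ins for your two columns:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# char_wb n-grams are robust to abbreviations and misspellings in names
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LinearSVC())
model.fit(merchant_names, mcc_labels)
predicted_mcc = model.predict(["acme coffee shop"])

Note that if roughly 20 percent of the labels are wrong, that noise also caps the accuracy any model can appear to reach on a held-out set.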
-
R: How can I add titles based on grouping variable in word_associate?
I am using the word_associate function from the qdap package in R Markdown to create word clouds across a grouping variable with multiple categories. I would like the titles of each word cloud to be drawn from the character values of the grouping variable.
I have added trans_cloud(title=TRUE) to my code, but have not been able to resolve my problem. Here's my code, which runs but doesn't produce graphs with titles:
library(qdap)
word_associate(df$text,
               match.string = c("cat"),
               grouping.var = c(df$id),
               text.unit = "sentence",
               stopwords = c(Top200Words),
               wordcloud = TRUE,
               cloud.colors = c("#0000ff", "#FF0000"),
               trans_cloud(title = TRUE))
I have also tried the following, which does not run:
library(qdap)
word_associate(df$text,
               match.string = c("cat"),
               grouping.var = c(df$id),
               text.unit = "sentence",
               stopwords = c(Top200Words),
               wordcloud = TRUE,
               cloud.colors = c("#0000ff", "#FF0000"),
               title = TRUE)
Can anyone help me figure this out? I can't find any guidance in the documentation and there's hardly any examples of or discussions about word_associate on the web.
Here's an example data frame that reproduces the problem:
id          text
question1   I love cats even though I'm allergic to them.
question1   I hate cats because I'm allergic to them.
question1   Cats are funny, cute, and sweet.
question1   Cats are mean and they scratch me.
question2   I only pet cats when I have my epipen nearby.
question2   I avoid petting cats at all cost.
question2   I visit a cat cafe every week. They have 100 cats.
question2   I tried to pet my friend's cat and it attacked me.
Note that if I run this in R (instead of R Markdown), the figures automatically print "question1_list1" and "question2_list1" in bright blue at the top of the figure file. This doesn't work for me because I need the titles to exclude "_list1" and to be written in black. These automatically generated titles do not respond to changes in my trans_cloud specifications. For example:
library(qdap)
word_associate(df$text,
               match.string = c("cat"),
               grouping.var = c(df$id),
               text.unit = "sentence",
               stopwords = c(Top200Words),
               wordcloud = TRUE,
               cloud.colors = c("#0000ff", "#FF0000"),
               trans_cloud(title = TRUE, title.color = "#000000"))
In addition, I'm locked into using this package (as opposed to other options for creating word clouds in R) because I'm using it to create network plots, too.
-
kwic() function returns fewer rows than it should
I'm currently trying to perform a sentiment analysis on a kwic object, but I'm afraid that the kwic() function does not return all the rows it should. I'm not quite sure what exactly the issue is, which makes it hard to post a reproducible example, so I hope that a detailed explanation of what I'm trying to do will suffice.

I subsetted the original dataset containing the speeches I want to analyze to a new data frame that only includes speeches mentioning certain keywords. I used the following code to create this subset:
ostalgie_cluster <- full_data %>%
  filter(grepl('Schwester Agnes|Intershop|Interflug|Trabant|Trabi|Ostalgie',
               speechContent,
               ignore.case = TRUE))
The resulting data frame consists of 201 observations. When I perform kwic() on the same initial dataset using the following code, however, it returns a data frame with only 82 observations. Does anyone know what might cause this? Again, I'm sorry I can't provide a reproducible example, but when I try to create a reprex from scratch it just... works...

#create quanteda corpus object
qtd_speeches_corp <- corpus(full_data,
                            docid_field = "id",
                            text_field = "speechContent")

#tokenize speeches
qtd_tokens <- tokens(qtd_speeches_corp,
                     remove_punct = TRUE,
                     remove_numbers = TRUE,
                     remove_symbols = TRUE,
                     padding = FALSE) %>%
  tokens_remove(stopwords("de"), padding = FALSE) %>%
  tokens_compound(pattern = phrase(c("Schwester Agnes")), concatenator = " ")

ostalgie_words <- c("Schwester Agnes", "Intershop", "Interflug", "Trabant", "Trabi", "Ostalgie")

test_kwic <- kwic(qtd_tokens,
                  pattern = ostalgie_words,
                  window = 5)
-
How to integrate an email classification model into Outlook?
Let's say I built an email classification model which classifies emails into two classes, A and B. Is it possible to link this model to Outlook and make the classification automatic, so that whenever I receive a new email it gets moved to folder A or B in Outlook?
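One way this is commonly done on Windows is a script that drives Outlook through its COM interface via pywin32. A hedged sketch, where classify is an assumed stand-in for your model's predict function and folders A and B are assumed to already exist under the inbox:

import win32com.client  # pywin32; talks to a locally running Outlook

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
inbox = outlook.GetDefaultFolder(6)   # 6 = olFolderInbox
folder_a = inbox.Folders["A"]         # target subfolders must already exist
folder_b = inbox.Folders["B"]

for message in list(inbox.Items):     # copy the collection: Move() mutates it
    label = classify(message.Subject + " " + message.Body)  # your model
    message.Move(folder_a if label == "A" else folder_b)

Such a script has to be run (or scheduled) periodically; for server-side, always-on classification, the Microsoft Graph API is the usual alternative.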
-
Naive Bayes with TfidfVectorizer predicts everything as one class
I'm trying to run a Multinomial Naive Bayes classifier on various balanced datasets and comparing two different vectorizers: TfidfVectorizer and CountVectorizer. I have 3 classes: NEG, NEU and POS, and 10000 documents in total: 2474 NEG, 5894 NEU and 1632 POS. From these I have made 3 differently balanced datasets, as follows:
text counts            NEU    NEG    POS    Total
NEU-balanced dataset   5894   2474   1632   10000
NEG-balanced dataset   2474   2474   1632    6580
POS-balanced dataset   1632   1632   1632    4896
The problem appears when I try to classify. Everything is okay on every dataset except the NEU-balanced one. When I classify the NEU-balanced dataset with CountVectorizer it runs okay; here is the confusion matrix:
[[ 231 247 17]
[ 104 1004 71]
[ 24 211 91]]
But when I use TfidfVectorizer, the model predicts everything as the NEU class:
[[ 1 494 0]
[ 0 1179 0]
[ 0 326 0]]
Here is some of my code:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

sentences = svietimas_data['text']
y = svietimas_data['sentiment']

#vectorizer = CountVectorizer()
vectorizer = TfidfVectorizer(lowercase=False)
vectorizer.fit(sentences)
sentences = vectorizer.transform(sentences)

X_train, X_test, y_train, y_test = train_test_split(sentences, y, test_size=0.2, random_state=42, stratify=y)

classifier = MultinomialNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
My guess is that the reason is that the NEU-balanced dataset is the least balanced. But the question is: why does the model predict fine with CountVectorizer?
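A plausible explanation: TF-IDF features are small fractional values, so the likelihood term in Naive Bayes carries less weight relative to the learned class priors than raw counts do, and the majority class (NEU) wins every prediction. Two things worth trying, sketched below (both are standard scikit-learn options, not guaranteed fixes):

from sklearn.naive_bayes import ComplementNB, MultinomialNB

classifier = MultinomialNB(fit_prior=False)  # uniform prior instead of a learned one
# classifier = ComplementNB()                # NB variant designed for imbalanced text

Either drop-in replacement slots into the pipeline above in place of MultinomialNB().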