out-of-sample data classification
A question from an online course:
A difficulty that arises when trying to classify out-of-sample data is that the actual classification may not be known, which makes it hard to judge whether the result is accurate.
True or False?
See also questions close to this topic
Input dimension error on pytorch's forward check
I am creating an RNN with PyTorch; it looks like this:
class MyRNN(nn.Module):
    def __init__(self, batch_size, n_inputs, n_neurons, n_outputs):
        super(MyRNN, self).__init__()
        self.n_neurons = n_neurons
        self.batch_size = batch_size
        self.n_inputs = n_inputs
        self.n_outputs = n_outputs
        self.basic_rnn = nn.RNN(self.n_inputs, self.n_neurons)
        self.FC = nn.Linear(self.n_neurons, self.n_outputs)

    def init_hidden(self):
        # (num_layers, batch_size, n_neurons)
        return torch.zeros(1, self.batch_size, self.n_neurons)

    def forward(self, X):
        self.batch_size = X.size(0)
        self.hidden = self.init_hidden()
        lstm_out, self.hidden = self.basic_rnn(X, self.hidden)
        out = self.FC(self.hidden)
        return out.view(-1, self.n_outputs)
x looks like this:
tensor([[-1.0173e-04, -1.5003e-04, -1.0218e-04, -7.4541e-05, -2.2869e-05, -7.7171e-02, -4.4630e-03, -5.0750e-05, -1.7911e-04, -2.8082e-04, -9.2992e-06, -1.5608e-05, -3.5471e-05, -4.9127e-05, -3.2883e-01],
        [-1.1193e-04, -1.6928e-04, -1.0218e-04, -7.4541e-05, -2.2869e-05, -7.7171e-02, -4.4630e-03, -5.0750e-05, -1.7911e-04, -2.8082e-04, -9.2992e-06, -1.5608e-05, -3.5471e-05, -4.9127e-05, -3.2883e-01],
        ...
        [-6.9490e-05, -8.9197e-05, -1.0218e-04, -7.4541e-05, -2.2869e-05, -7.7171e-02, -4.4630e-03, -5.0750e-05, -1.7911e-04, -2.8082e-04, -9.2992e-06, -1.5608e-05, -3.5471e-05, -4.9127e-05, -3.2883e-01]], dtype=torch.float64)
It is a batch of 64 vectors, each of size 15.
When trying to test this model by doing:
BATCH_SIZE = 64
N_INPUTS = 15
N_NEURONS = 150
N_OUTPUTS = 1

model = MyRNN(BATCH_SIZE, N_INPUTS, N_NEURONS, N_OUTPUTS)
model(x)
I get the following error:
File "/home/tt/anaconda3/envs/venv/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 126, in check_forward_args
    expected_input_dim, input.dim()))
RuntimeError: input must have 3 dimensions, got 2
How can I fix it?
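The likely cause, sketched below under the assumption that x really is 2-D: nn.RNN expects a 3-D input of shape (seq_len, batch, input_size) when batch_first=False, but x is (batch, 15) and float64, so it needs a sequence dimension and a cast to float32. This is a minimal illustration, not the asker's full model:

```python
import torch
import torch.nn as nn

# Stand-ins for the question's setup: 64 vectors of size 15, hidden size 150.
rnn = nn.RNN(input_size=15, hidden_size=150)
x = torch.randn(64, 15, dtype=torch.float64)   # 2-D, float64: triggers the error

x3d = x.unsqueeze(0).float()                   # -> (1, 64, 15): (seq_len, batch, input_size)
hidden = torch.zeros(1, 64, 150)               # (num_layers, batch, hidden_size)

out, hidden = rnn(x3d, hidden)
print(out.shape)                               # torch.Size([1, 64, 150])
```

Note that after unsqueezing, X.size(0) in the asker's forward() would be the sequence length (1), not the batch size, so the batch size there should be read from X.size(1) instead.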
Removing stopwords; can't get rid of certain stopwords no matter what?
I'm trying to clean up a training set of news articles. My output here is the 10 most common words in the articles.
When using the NLTK stopwords, certain words still got through: ['the','would','said','one','also','like','could','he']. So I added them to the stopword list myself. I tried both the append method and the extend method, as shown in the code below, but the words I want omitted, "the" and "he", are not removed.
Does anyone know why, or what I might be doing wrong?
(And yes, I've googled it a lot.)

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string
import re
from IPython.display import display
from sklearn.feature_extraction.text import CountVectorizer

# importing dataset and making a copy as string
data = pd.read_csv('train.csv', encoding="ISO-8859-1")
data1 = data.copy()
data1.text = data1.text.astype(str)

to_drop = ['id', 'title', 'author']
data1.drop(to_drop, inplace=True, axis=1)

# cleaning text for punctuation, whitespace, splitting, and set to lower
data1['text'] = data1['text'].str.strip().str.lower().str.replace('[^\w\s] ', '').str.split()

# removing stopwords
stopwords = nltk.corpus.stopwords.words('english')
custom_words = ['the','would','said','one','also','like','could','he']
stopwords.extend(custom_words)
data1['text'] = data1['text'].apply(lambda x: [item for item in x if item not in stopwords])
data1['text'] = data1['text'].apply(lambda x: " ".join(x))

vectorizer = CountVectorizer(max_features=1500, analyzer='word')
train_voc = vectorizer.fit_transform(data1['text'])
sum_words = train_voc.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x, reverse=True)
print(words_freq[:10])
display(data1.head())
[('the', 31138), ('people', 28975), ('new', 28495), ('trump', 24752), ('president', 18701), ('he', 17254), ('us', 16969), ('clinton', 16039), ('first', 15520), ('two', 15491)]

                                                text  label
0  house dem aidewe didnât even see comeyâs l...      1
1  ever get feeling life circles roundabout rathe...      0
2  truth might get fired october 292016 tension i...      1
3  videos 15 civilians killed single us airstrike...      1
4  print iranian woman sentenced six years prison...      1
Someone asked for an example. Below are two outputs: one before removing stopwords and one after removing them. The same goes for my data, only it's a much larger dataset, and the output is the most common words as well. Example output:
Before removing stopwords: ['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
After removing stopwords: ['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
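One plausible explanation, reproduced below without nltk or sklearn: the cleaning regex '[^\w\s] ' contains a trailing space, so it only removes punctuation that is followed by a space (and it eats that space, gluing words together, which matches the "aidewe" artifacts in the output above). Sentence-final tokens like "the." keep their punctuation, slip past the stopword filter, and are then counted as "the" once a word tokenizer strips the dot. (Separately, sorted(..., key=lambda x: x) sorts the (word, count) tuples by word, not by count.) The stopword list and text here are toy stand-ins:

```python
import re
from collections import Counter

stop = {'the', 'he'}
text = "he sat. the cat ate the."

# Buggy pattern: requires a trailing space, glues "sat." + "the" -> "satthe"
# and leaves sentence-final "the." intact, so it survives the filter.
buggy_tokens = [w for w in re.sub(r'[^\w\s] ', '', text).split()
                if w not in stop]
print(buggy_tokens)                    # ['satthe', 'cat', 'ate', 'the.']

# A word tokenizer (as CountVectorizer uses) then strips the "." and
# resurrects "the" in the counts:
print(Counter(re.findall(r'\w+', ' '.join(buggy_tokens))))

# Fix: strip punctuation without the trailing space, before filtering.
fixed_tokens = [w for w in re.sub(r'[^\w\s]', '', text).split()
                if w not in stop]
print(fixed_tokens)                    # ['sat', 'cat', 'ate']
```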
I need help with my final year project on sentiment classification
I need help with my final year project. I am working on a sentiment classifier in Python to classify people's reactions. The available research only classifies a text as negative, positive, or neutral, but my research is about classifying people's reactions based on the post: we get Facebook posts and the corresponding reactions, and then classify them, so the sentiment depends on both the reaction and the post.
I also have to extract features that can determine the sentiment of a reaction, like age, sex, education background, etc. Any help is appreciated. Thank you.
How can I save the upsampled training data generated using trainControl(sampling="up"...) option in R's Caret library?
I am using the caret package in R to train different binary classifiers.
If I use the option trainControl(sampling = "up", ...), the training data is up-sampled prior to fitting the model.
Is there a way to save the up-sampled training data?
PS: In some cases, it seems that the up-sampled training data is saved by the classifier itself (e.g. for PLS, it's in $finalModel$model), but I'd like to find a caret-based solution if possible.
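What sampling = "up" does can be reproduced outside the fitting call, so the resampled set can be saved directly: minority-class rows are drawn with replacement until the classes balance. (If memory serves, caret itself also exports an upSample(x, y) helper that returns exactly this balanced data frame.) Here is the idea sketched in Python with toy data, not caret's own code:

```python
import numpy as np

rng = np.random.default_rng(0)

y = np.array([0] * 10 + [1] * 3)        # imbalanced labels: 10 vs 3
X = np.arange(len(y)).reshape(-1, 1)    # stand-in feature matrix

# Draw minority rows with replacement until both classes have equal counts.
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=(y == 0).sum() - minority.size, replace=True)
keep = np.concatenate([np.arange(len(y)), extra])

X_up, y_up = X[keep], y[keep]
print(np.bincount(y_up))                # [10 10] -> balanced
```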
I need to classify by categories without mixing the data of different columns
I have the following dataset:
Year  Company  Product  Sales
2017  X        A        10
2017  Y        A        20
2017  Z        B        20
2017  X        B        10
2018  X        B        20
2018  Y        B        30
2018  X        A        10
2018  Z        A        10
I want to obtain the following summary:
Year  Product  Sales
2017  A        30
      B        30
2018  A        50
      B        20
and also the following summary:
Year  Company  Sales
2017  X        20
      Y        20
      Z        20
2018  X        50
      Y        10
      Z        10
Is there any way to do it without using loops?
I know I could do something with the aggregate function, but I don't know how to proceed without mixing the data of company, product, and year. For example, I get the total sales of products A and B, but it mixes the sales of both years instead of giving A and B for 2017 and for 2018 separately.
Do you have any suggestions?
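Grouping by two columns at once keeps the years separate, which is what the question asks for; in R itself, something like aggregate(Sales ~ Year + Product, data, sum) should give the analogous result. Here are the two summaries sketched in pandas with the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'Year':    [2017, 2017, 2017, 2017, 2018, 2018, 2018, 2018],
    'Company': ['X', 'Y', 'Z', 'X', 'X', 'Y', 'X', 'Z'],
    'Product': ['A', 'A', 'B', 'B', 'B', 'B', 'A', 'A'],
    'Sales':   [10, 20, 20, 10, 20, 30, 10, 10],
})

# Sales per product, kept separate per year (no loops needed).
by_product = df.groupby(['Year', 'Product'], as_index=False)['Sales'].sum()
# Sales per company, kept separate per year.
by_company = df.groupby(['Year', 'Company'], as_index=False)['Sales'].sum()

print(by_product)
print(by_company)
```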
Unify classifiers based on their predictions
I am working on classifying texts and images from scientific articles. From the texts I use the title and abstract. So far I have achieved good results using an SVM on the texts, and not so good results using a CNN on the images. I also tried a multimodal classification, which did not show any improvement.
What I would like to do now is use the SVM and CNN predictions to classify, something like a voting ensemble. However, the VotingClassifier from sklearn does not accept mixed inputs. Do you have any idea how I could implement this, or a guideline to follow?
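One way around the VotingClassifier restriction, sketched below: run each model on its own modality, collect the class probabilities, and average them (soft voting). The svm_proba and cnn_proba arrays are stand-ins for the outputs of svm.predict_proba(text_features) (the SVC needs probability=True) and the CNN's softmax over the images:

```python
import numpy as np

# Each row is one article; columns are P(class 0) and P(class 1).
svm_proba = np.array([[0.9, 0.1],
                      [0.4, 0.6],
                      [0.2, 0.8]])
cnn_proba = np.array([[0.6, 0.4],
                      [0.7, 0.3],
                      [0.1, 0.9]])

# Soft voting: average the probabilities, then pick the argmax class.
avg = (svm_proba + cnn_proba) / 2   # equal weights; tune if one model is stronger
pred = avg.argmax(axis=1)
print(pred)                         # [0 0 1]
```

With hard voting and only two models, ties need a rule; averaging probabilities sidesteps that, which is why soft voting is the usual choice for a two-model ensemble.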