Can I delete columns in the test dataset? (machine learning)
If I deleted two columns from the training dataset, should I delete the same columns from the test dataset?
I know I'm not supposed to apply anything to the test dataset, but I deleted the columns only in the training dataset, and then I got an error saying the number of features did not match the test dataset...
1 answer
-
answered 2021-04-08 03:05
ubershmekel
Machine learning models don't know which column is which; they only see the feature matrix you give them. If you drop columns from the training data, drop the same columns from the test data, so the model sees the same features, in the same order, at prediction time.
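In practice that means dropping the identical columns from both sets before fitting and predicting. A minimal sketch with pandas (the DataFrames and column names here are made up for illustration):

import pandas as pd

# Toy train/test frames, just to illustrate the idea
train = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
test = pd.DataFrame({"a": [7, 8], "b": [9, 10], "c": [11, 12]})

cols_to_drop = ["b", "c"]                 # columns removed during feature selection
train = train.drop(columns=cols_to_drop)
test = test.drop(columns=cols_to_drop)    # drop the same columns from the test set

print(train.columns.tolist(), test.columns.tolist())  # ['a'] ['a']

A model fit on the reduced training set can then call predict on the reduced test set without a feature-count mismatch.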
See also questions close to this topic
-
Sparse Matrix Creation : KeyError: 579 for text datasets
I am trying to use the make_sparse_matrix function to create a sparse matrix for my text dataset, and I am getting KeyError: 579. Does anyone have any leads on the root of the error?
def make_sparse_matrix(df, indexed_words, labels):
    """
    Returns sparse matrix as dataframe.
    df: A dataframe with words in the columns with a document id as an index (X_train or X_test)
    indexed_words: index of words ordered by word id
    labels: category as a series (y_train or y_test)
    """
    nr_rows = df.shape[0]
    nr_cols = df.shape[1]
    word_set = set(indexed_words)
    dict_list = []

    for i in range(nr_rows):
        for j in range(nr_cols):
            word = df.iat[i, j]
            if word in word_set:
                doc_id = df.index[i]
                word_id = indexed_words.get_loc(word)
                category = labels.at[doc_id]
                item = {'LABEL': category, 'DOC_ID': doc_id,
                        'OCCURENCE': 1, 'WORD_ID': word_id}
                dict_list.append(item)

    return pd.DataFrame(dict_list)

make_sparse_matrix(X_train, word_index, y_test)
X_train is a DataFrame that contains a single word in each cell, word_index contains the index of all words, and y_test stores all the labels.
The KeyError I am facing is:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\New folder\envs\geo_env\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3079             try:
-> 3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 579

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
 in

 in make_sparse_matrix(df, indexed_words, labels)
     20                 doc_id = df.index[i]
     21                 word_id = indexed_words.get_loc(word)
---> 22                 category = labels.at[doc_id]
     23
     24                 item = {'LABEL': category, 'DOC_ID': doc_id,

~\New folder\envs\geo_env\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   2154             return self.obj.loc[key]
   2155
-> 2156         return super().__getitem__(key)
   2157
   2158     def __setitem__(self, key, value):

~\New folder\envs\geo_env\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   2101
   2102         key = self._convert_key(key)
-> 2103         return self.obj._get_value(*key, takeable=self._takeable)
   2104
   2105     def __setitem__(self, key, value):

~\New folder\envs\geo_env\lib\site-packages\pandas\core\series.py in _get_value(self, label, takeable)
    959
    960         # Similar to Index.get_value, but we do not fall back to positional
--> 961         loc = self.index.get_loc(label)
    962         return self.index._get_values_for_loc(self, loc, label)
    963

~\New folder\envs\geo_env\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3080             return self._engine.get_loc(casted_key)
   3081         except KeyError as err:
-> 3082             raise KeyError(key) from err
   3083
   3084         if tolerance is not None:

KeyError: 579
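One detail worth noting (speculation, based only on the snippet above): the call iterates over X_train's rows but looks up labels in y_test, so a training DOC_ID such as 579 may simply not exist in y_test's index, which would raise exactly this KeyError at labels.at[doc_id]. Pairing each feature set with its own label series would look like:

# Hypothetical pairing: each feature frame with its matching label series
train_matrix = make_sparse_matrix(X_train, word_index, y_train)
test_matrix = make_sparse_matrix(X_test, word_index, y_test)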
-
Finding part of string in list of strings
GCM = ([519, 520, 521, 522, 533], [534, 525], [526, 527, 530, 531], [4404])
slice = int(str(df["CGM"][row_count])[:3])
I am looking through a row in a CSV file and taking out the number I want. I want the number that starts with one of the numbers I have in GCM, since they represent info I want in other columns. This has been working fine with the slice approach, because all the numbers I wanted started with the same 3 digits. Now that I need to look for any number that starts with 4404, and later on will probably need to look for 57052, the slice approach no longer works.
Is there a way I can, instead of slicing and comparing to the list, take a 5-digit number and see if part of it is in the list, preferably matching on the first 3 or more digits? The real point of this part of the code is finding out which list inside GCM the number belongs to. It needs to be able to take the number 44042 and know that the part I care about is in GCM[3], but on the other hand it should not say that 32519 is in GCM[0], since I only care about numbers that start with 519, not ones that end with it.
PS: I am Norwegian and have been teaching myself programming; it has been some long nights, so something here may be lost in translation.
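A possible approach (a sketch only, reusing the GCM values from the question): compare string prefixes with startswith instead of slicing to a fixed width, so 519, 4404 and 57052 can all be handled the same way.

GCM = ([519, 520, 521, 522, 533], [534, 525], [526, 527, 530, 531], [4404])

def find_group(number, groups=GCM):
    # Return the index of the group containing a prefix of `number`, or None.
    text = str(number)
    for idx, group in enumerate(groups):
        if any(text.startswith(str(prefix)) for prefix in group):
            return idx
    return None

print(find_group(44042))   # 3    -> starts with 4404, so it belongs to GCM[3]
print(find_group(32519))   # None -> 519 appears inside, but not at the start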
-
How to forecast a time series out-of-sample using an ARIMA model in Python?
I have seen similar questions on Stack Overflow, but either the questions were different enough, or, where they were similar, they have not actually been answered. I gather it is something that modelers run into often and have a hard time solving.
In my case I am using two variables, one Y and one X, with 50 sequential time series observations. They are both random numbers representing % changes (they could be anything you want; their true values do not matter, this is just to set up an example of my coding problem). Here is my basic code to build this ARIMAX(1,0,0) model.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_excel('/Users/gaetanlion/Google Drive/Python/Arima/df.xlsx', sheet_name='final')

from statsmodels.tsa.arima_model import ARIMA

endo = df['y']
exo = df['x']
Next, I build the ARIMA model, using the first 41 observations
modelho = sm.tsa.arima.ARIMA(endo.loc[0:40], exo.loc[0:40], order=(1, 0, 0)).fit()
print(modelho.summary())
So far everything works just fine.
Next, I attempt to forecast or predict the next 9 observations out-of-sample. Here I want to use the X values over these 9 observations to predict Y, and I just can't do it. I am showing below the one call that I think gets me closest to where I need to go.
modelho.predict(exo.loc[41:49], start=41, end=49, dynamic=False)

TypeError: predict() got multiple values for argument 'start'
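For what it's worth, the TypeError suggests exo.loc[41:49] is being taken as the positional start argument. Passing the out-of-sample exogenous values by keyword might already fix it (an untested sketch, assuming the statsmodels ARIMA results object accepts exog for out-of-sample prediction):

# Out-of-sample prediction, passing the future X values by keyword
forecast = modelho.predict(start=41, end=49, exog=exo.loc[41:49], dynamic=False)

# or, for a purely out-of-sample forecast of the next 9 steps
forecast = modelho.forecast(steps=9, exog=exo.loc[41:49])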
-
How to connect a Python ML model to a MERN webapp?
I want to display the charts and results of an LDA model I coded in Python using Gensim in a web app. I have tried both Django and MERN (Express, Node.js), and I saw that child_process.spawn() could be useful, but I am not quite sure how to use it in my web app.
-
What is the difference between 'transform' and 'fit_transform' when using preprocessing methods?
I noticed that for one-hot encoding or imputation we always use fit_transform for the training set and transform for the validation set. What is the difference between these two methods?
Example 1
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))
Example 2
from sklearn.impute import SimpleImputer

# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
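As a rough illustration of the difference (a toy sketch; the numbers are made up): fit_transform first learns the preprocessing parameters from the data it is given and then applies them, while transform only applies parameters learned earlier.

import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0], [3.0], [np.nan]])
X_valid = np.array([[np.nan], [10.0]])

imputer = SimpleImputer()                        # default strategy: mean
imputed_train = imputer.fit_transform(X_train)   # learns mean = 2.0 from X_train, then fills
imputed_valid = imputer.transform(X_valid)       # reuses the training mean, ignores X_valid's values

print(imputer.statistics_)   # the learned mean: [2.]
print(imputed_valid)         # the NaN is filled with the training mean, not a value from X_valid

Fitting on the training set only is what keeps information from the validation set from leaking into the preprocessing.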
-
Content Based Movie Recommendation system in R
I am trying to produce a content-based movie recommendation system in R. However, I am not quite seeing the results I expect. I have been following this website:
https://muffynomster.wordpress.com/2015/06/07/building-a-movie-recommendation-engine-with-r/
When I run just the bare minimum of the program, I get very skewed and duplicate results. I think it is because my binary filtering process is off.
Any help or advice?
movie_data <- read.csv("movies.csv", stringsAsFactors=FALSE)
rating_data <- read.csv("ratings.csv", stringsAsFactors=FALSE)
summary(movie_data)
summary(rating_data)
rating_data$timestamp <- NULL
Prepare Data
genres <- as.data.frame(movie_data$genres, stringsAsFactors=FALSE)
genres2 <- as.data.frame(tstrsplit(genres[,1], '[|]', type.convert=TRUE), stringsAsFactors=FALSE)
colnames(genres2) <- c(1:10)

list_genres <- c("Action", "Adventure", "Animation", "Children", "Comedy", "Crime",
                 "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical",
                 "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western")

genre_matrix <- matrix(0, 10330, 18)
genre_matrix[1,] <- list_genres
colnames(genre_matrix) <- list_genres

for (i in 1:nrow(genres2)) {
  for (c in 1:ncol(genres2)) {
    genmat_col = which(genre_matrix[1,] == genres2[i,c])
    genre_matrix[i+1, genmat_col] <- 1
  }
}

genre_matrix2 <- as.data.frame(genre_matrix[-1,], stringsAsFactors=FALSE)
for (c in 1:ncol(genre_matrix2)) {
  genre_matrix2[,c] <- as.integer(genre_matrix2[,c])
}
str(genre_matrix2)
Turn the ratings into binary: ratings of 4 and 5 are mapped to 1, ratings of 3 and below are mapped to -1.
ratingMatrix <- dcast(rating_data, userId~movieId, value.var = "rating", na.rm=FALSE)
ratingMatrix <- as.matrix(ratingMatrix[,-1])
ratingMatrix <- as(ratingMatrix, "realRatingMatrix")
ratingMatrix

binaryratings <- rating_data
for (i in 1:nrow(binaryratings)){
  if (binaryratings[i,3] > 3){
    binaryratings[i,3] <- 1
  } else{
    binaryratings[i,3] <- -1
  }
}

# transform the data from a long format to a wide format
binaryratings2 <- dcast(binaryratings, movieId~userId, value.var = "rating", na.rm=FALSE)
for (i in 1:ncol(binaryratings2)){
  binaryratings2[which(is.na(binaryratings2[,i]) == TRUE),i] <- 0
}
binaryratings2 = binaryratings2[,-1]
# remove the movies that have never been rated from the genres matrix
movieIds <- length(unique(movie_data$movieId))
ratingmovieIds <- length(unique(rating_data$movieId))
movie_data2 <- movie_data[-which((movieIds %in% ratingmovieIds) == FALSE),]
row.names(movie_data) <- NULL
genre_matrix3 <- genre_matrix2[-which((movieIds %in% ratingmovieIds) == FALSE),]
row.names(genre_matrix3) <- NULL
genre_matrix4 <- genre_matrix3[-which((movie_data2 %in% ratingmovieIds) == FALSE),]
row.names(genre_matrix4) <- NULL
# Calculate dot product
result = matrix(0, 18, 668)
for (c in 1:ncol(binaryratings2)){
  for (i in 1:ncol(genre_matrix4)){
    result[i,c] <- sum((genre_matrix4[i,]) * (binaryratings2[,c]))
  }
}

# Convert to Binary scale
for (i in 1:nrow(result)){
  if (result[i] < 0){
    result[i] <- 1
  } else {
    result[i] <- 0
  }
}
result2 <- result[,1]
sim_mat <- rbind.data.frame(result2, genre_matrix4)
sim_mat <- data.frame(lapply(sim_mat, function(x){as.integer(x)}))
sim_results <- dist(sim_mat, method = "Jaccard")
sim_results <- as.data.frame(as.matrix(sim_results[1:10329]))
recs <- which(sim_results == min(sim_results))
Sim_results2 <- tibble(movie_data[recs,])
View(Sim_results2)

SearchMatrix <- cbind(movie_data[,1:2], genre_matrix2[])
# head(SearchMatrix)   # DataFlair
-
Database or dataset with all regions, cities and districts within the cities in Russia
I need to find a data source to fill the database for my work project, but I struggle to find something that is full enough.
All datasets contain only information about regions and cities, but when it comes to the districts, there is nothing suitable.
I've spent about two weeks on this, but still can't find it. Is there anywhere I can look for the relevant data?
P.S. It would be much better if the data source was in Russian, but other languages will work too.
P.P.S. I've tried to collect the data from the Google Maps API, but couldn't find the right way to fetch all the required information.
-
Pass dataset to keras.evaluate directly
By studying numerous tutorials and also experimenting myself, I've found that in the training phase, when building a Keras model, I can pass a complete Keras Dataset (both features and labels) as one argument, for both training and validation. Very elegant.
model.fit( train_ds, validation_data=val_ds, /.../)
However, when evaluating the model on a novel dataset of exactly the same type and shape, this way of passing data does not seem to work. I tried this:
model.evaluate( test_ds)
It does not work. It seems like evaluate needs features and labels as separate arguments.
This forces me to make cumbersome transformations to NumPy arrays for both features and labels, and increases the risk of making mistakes compared to the fitting case. Alternatively, I need to construct test_ds without using Dataset, which forces me to write two different sets of code to assemble the train and test data. This seems very suboptimal.
Am I missing some important detail here, or is this the way model.evaluate(...) always works? It seems strange that I can't use the same call signature for such similar methods as fit and evaluate.
Or maybe there's a more elegant way to extract features and labels from the test_ds dataset prior to passing them as arguments?
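For reference, model.evaluate should accept a tf.data.Dataset that yields (features, labels) batches, just like fit; a minimal sketch with made-up data and a made-up model:

import numpy as np
import tensorflow as tf

# Hypothetical data, only to show the shape of the pipeline
x = np.random.rand(100, 4).astype("float32")
y = np.random.randint(0, 2, size=(100,)).astype("float32")
ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(32)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

model.fit(ds, epochs=1, verbose=0)   # fit accepts the dataset of (x, y) pairs...
model.evaluate(ds)                   # ...and evaluate accepts the same dataset object

If evaluate complains, it may be worth checking that the test dataset actually yields (features, labels) tuples and is batched, since a dataset of features only would indeed fail.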
-
I couldn't download the VALID dataset based on the instructions available on its site: https://www.epfl.ch/labs/mmspg/downloads/valid/
The error message of FileZilla is:

Status:   Resolving address of tremplin.epfl.ch
Status:   Connecting to 128.178.218.41:21...
Status:   Connection established, waiting for welcome message...
Status:   Insecure server, it does not support FTP over TLS.
Command:  USER VALID_dataset@grebvm2.epfl.ch
Error:    Connection timed out after 20 seconds of inactivity
Error:    Could not connect to server
Status:   Waiting to retry...
Status:   Resolving address of tremplin.epfl.ch
Status:   Connecting to 128.178.218.41:21...
Status:   Connection established, waiting for welcome message...
Response: 220- Derval FTP Proxy Server ready.
Response: 220 Utilisation: voir http://tremplin.epfl.ch/proxyftp/
Command:  AUTH TLS
Response: 530 Please login with USER and PASS.
Command:  AUTH SSL
Response: 530 Please login with USER and PASS.
Status:   Insecure server, it does not support FTP over TLS.
Command:  USER VALID_dataset@grebvm2.epfl.ch
Error:    Connection timed out after 20 seconds of inactivity
Error:    Could not connect to server
-
How to merge points at center in Houdini
In Blender, if you hit the M key and select Merge At Center, all the selected points are merged into one point.
I want to ask how I should perform the same action in Houdini.
I have researched a lot and discovered that some people recommend using the Fuse node, but I still cannot figure out which settings in the Fuse node are needed to merge all points at the center.
-
Which software can handle the following?
I am working on a model that is a bit tricky to regress (at least for me), and I couldn't make it work in either Stata or RStudio. It is the following system of equations:
The set "OD" has 40875 elements and the set "I" has 16, and I have data variables unique to each "od" and "i". I have about 42 variables unique to each "od", so to make it work in Stata I would need 42*40875 columns (yikes), which Stata doesn't allow. Does anyone have a suggestion for which software I can use? I can try to write the code for an OLS (or SUR) myself, but then I will have to code all the other tests as well, which isn't nice :(
-
macro or model to select numerical values from an array
Can anyone help with this difficult problem? I want to create a spreadsheet where I can enter 48 values in a list, representing the lengths of picture framing mouldings required to make, say, 12 picture frames of different sizes using the same moulding profile (4 sides per frame). The uncut mouldings are 4 metres long. The spreadsheet needs to automatically generate arrays populated with the lengths to cut from each 4 metre length of profile in order to minimize offcut wastage, with each array representing one full 4 metre length. Basically, the spreadsheet needs to optimize the mix of lengths cut from each 4 metre long picture frame moulding. For example, the 12 frames to be made are as follows: 4 frames 80 x 120 cm, 2 frames 65 x 95 cm, 2 frames 70 x 110 cm, 2 frames 30 x 50 cm, 1 frame 105 x 160 cm, 1 frame 25 x 65 cm. The spreadsheet also needs to indicate how many lengths of moulding will be required. Thanks in advance for any suggestions.
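If a scripted approach is acceptable, a first-fit-decreasing heuristic gives a rough starting point. The sketch below is Python rather than a spreadsheet formula, uses the frame sizes from the question, ignores saw kerf and mitre allowances, and is a heuristic rather than a guaranteed-optimal cut plan.

STOCK_CM = 400   # one uncut moulding is 4 metres

frames = [(80, 120)] * 4 + [(65, 95)] * 2 + [(70, 110)] * 2 + \
         [(30, 50)] * 2 + [(105, 160)] + [(25, 65)]
# each frame needs two pieces of each dimension (4 sides), 48 pieces in total
pieces = sorted((side for w, h in frames for side in (w, w, h, h)), reverse=True)

bars = []   # each entry is the list of cut lengths taken from one 4 m moulding
for piece in pieces:
    for bar in bars:
        if sum(bar) + piece <= STOCK_CM:
            bar.append(piece)
            break
    else:
        bars.append([piece])

print(len(bars), "mouldings needed")
for i, bar in enumerate(bars, 1):
    print(i, bar, "offcut:", STOCK_CM - sum(bar))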
-
For KNN, I want to find the RMSE for training set and test set within one function
I want to compute the training RMSE and test RMSE for a KNN regressor using the 'brute' algorithm.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

def getTrainTestRMSE(regr, X_train, X_test, y_train, y_test):
    # How can I compute both the training and test rmse, in the same fn?
    # I think it might look something like this
    knn = KNeighborsRegressor(algorithm='brute')
    knn.fit(X_train, y_train)
    # RMSE_train = np.sqrt(((predictions - actual)**2).mean())
    # RMSE_test = np.sqrt(((predictions - actual)**2).mean())
    # ...
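One possible shape for such a function (a sketch only; it assumes the inputs are NumPy arrays or DataFrames that KNeighborsRegressor accepts, and simply evaluates RMSE on both splits):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def get_train_test_rmse(X_train, X_test, y_train, y_test):
    # Fit a brute-force KNN regressor and return (train RMSE, test RMSE)
    knn = KNeighborsRegressor(algorithm='brute')
    knn.fit(X_train, y_train)
    rmse_train = np.sqrt(np.mean((knn.predict(X_train) - y_train) ** 2))
    rmse_test = np.sqrt(np.mean((knn.predict(X_test) - y_test) ** 2))
    return rmse_train, rmse_test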
-
Are APIs like Faker, mimesis, or fauxfactory in Python open source and legal to use in enterprise projects?
Are APIs like Faker, mimesis, or fauxfactory in Python open source and legal to use in enterprise projects? Do any copyrights or content permissions have to be handled?