Data omission when reading a folder with sklearn.datasets load_files
I'm trying classification and clustering with sklearn. To load the data from text files, I used code like this:

categories = ['Business', 'Entertainment', 'Living', 'Metro', 'Shopping', 'Sports', 'Tech']
data = load_files(container_path="C:/Users/정주영/Desktop/SNU 2-2/데관분/DMA_project3/CC/text_all",
                  categories=categories, shuffle=True,
                  encoding='utf-8', decode_error='replace')
There are definitely more than 10 files in this folder, so I can't understand why this code reads only 5 unrelated documents!
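A likely cause worth checking: load_files expects one subfolder per category and only reads files placed directly inside those subfolders; files at the top level are ignored and nested subdirectories are not recursed into. A minimal sketch with made-up data (the folder names and texts here are hypothetical) shows how to inspect exactly what was picked up:

```python
import os
import tempfile
from sklearn.datasets import load_files

# Build a tiny category tree to illustrate the layout load_files expects.
root = tempfile.mkdtemp()
for cat, texts in {'Sports': ['goal', 'match'], 'Tech': ['gpu']}.items():
    os.makedirs(os.path.join(root, cat))
    for i, text in enumerate(texts):
        with open(os.path.join(root, cat, f'{i}.txt'), 'w', encoding='utf-8') as f:
            f.write(text)

data = load_files(root, encoding='utf-8', decode_error='replace')
print(len(data.data))      # 3 documents found
print(data.target_names)   # ['Sports', 'Tech']
print(data.filenames)      # the exact files that were read
```

Printing data.filenames against your real folder shows which files were actually loaded and which were silently skipped.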
See also questions close to this topic
Any easy way to transform 1-1 to Jan 1 in Pandas?
I have a column of str-like dates as follows:

1-1
1-2
1-3
...
1-31
2-1
...
12-31

Is there any easy way to transform it to

Jan 1
Jan 2
...
Dec 31

I want to plot the transformed dates on the matplotlib x-axis. I am a beginner in Python and pandas. I looked at methods such as strftime and to_datetime but didn't find a solution. Thanks to anyone who can help!
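One possible sketch, assuming the strings are month-day pairs: to_datetime can parse them with an explicit format (the year defaults to 1900), and the label can be built from strftime's abbreviated month name plus the day as a plain integer, avoiding the leading zero of %d:

```python
import pandas as pd

# Hypothetical sample column of 'month-day' strings.
s = pd.Series(['1-1', '1-2', '1-31', '12-31'])
dates = pd.to_datetime(s, format='%m-%d')           # parsed as 1900-01-01 etc.
labels = dates.dt.strftime('%b ') + dates.dt.day.astype(str)
print(labels.tolist())  # ['Jan 1', 'Jan 2', 'Jan 31', 'Dec 31']
```

The parsed datetimes (rather than the labels) can also be passed straight to matplotlib, which then handles tick formatting itself.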
How to define timezone in datetime strptime python
In Python, I'm parsing dates like below:
datetime.strptime(date_var + " " + time_var, '%Y-%m-%d %H:%M:%S')
I want to attach a timezone to this, because the result is 3 hours off. How can I do this? Thanks for answering.
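A minimal sketch: strptime returns a naive datetime, so a timezone has to be attached afterwards. The UTC+3 offset and the sample strings below are assumptions for illustration; replace() attaches an offset without shifting the clock time, while astimezone() would convert between zones:

```python
from datetime import datetime, timezone, timedelta

date_var, time_var = "2021-06-01", "12:00:00"   # hypothetical inputs
dt = datetime.strptime(date_var + " " + time_var, '%Y-%m-%d %H:%M:%S')

# Attach a fixed UTC+3 offset to the naive datetime (assumed example offset).
aware = dt.replace(tzinfo=timezone(timedelta(hours=3)))
print(aware.isoformat())  # 2021-06-01T12:00:00+03:00
```

For named zones with DST rules, the standard-library zoneinfo module (Python 3.9+) can supply the tzinfo instead of a fixed offset.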
Can I get a list of skipped rows of a .csv file when using error_bad_lines=False?
I have a .csv file which I read with:

e = pd.read_csv('file.csv', error_bad_lines=False, sep=',', engine='python')
And I know the file contains rows that get skipped; during the read Python prints:

Skipping line 4: Expected 87 fields in line 4, saw 88
Skipping line 5: Expected 87 fields in line 5, saw 89
Skipping line 7: Expected 87 fields in line 7, saw 89
Skipping line 25: Expected 87 fields in line 25, saw 89

This output gets trimmed since there are 2M rows. Can I somehow retrieve the lines that are being skipped, possibly placing them into a separate .csv file with either just the row numbers or the actual contents of the skipped lines?
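One hedged approach is a second pass over the file with the csv module, recording every line whose field count differs from the expected width — the same condition pandas uses to skip them. A self-contained sketch on inline data (the sample rows and expected width of 3 are made up):

```python
import csv
import io

# Inline stand-in for file.csv; line 3 has one field too many.
raw = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"
expected_fields = 3

bad_lines = [lineno
             for lineno, row in enumerate(csv.reader(io.StringIO(raw)), start=1)
             if len(row) != expected_fields]
print(bad_lines)  # [3]
```

Against the real file you would open it with open('file.csv', newline='') instead of io.StringIO, set expected_fields to 87, and write the collected line numbers (or the rows themselves) out to a separate .csv.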
TfidfVectorizer results in 1x1 sparse matrix with just 1 element
I'm trying to apply text-based multilabel classification to a subset of this dataset. When I try to transform my data, the result is a 1x1 sparse matrix that I can't do anything with, because its length doesn't match my labels. My data before the split:
1        No7 Lift & Luminate Triple Action Serum 50...
2        No7 Stay Perfect Foundation Cool Vanilla by No7
3        Wella Koleston Perfect Hair Colour 44/44 Mediu...
4        Lacto Calamine Skin Balance Oil control 120 ml...
5        Mary Kay Satin Hands Hand Cream Travel MINI Si...
...      ...
98671    Panasonic Shockwave Portable Compact Disc Play...
98672    Jensen SC-340 Home-Theater Universal Remote Co...
98673    Motorola TalkAbout T250 2-Mile 14-Channel Two-...
98674    Sharp MDMT821 Ultra-Thin Minidisc Player/Recorder
98675    KLH KHP201TW Digital Headphones
98675 rows × 1 columns
vectorizer = TfidfVectorizer(max_features=10000)
vectorizer.fit(data)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(data, labels, test_size=0.2, random_state=1)
x_train = vectorizer.transform(Xtrain)
x_test = vectorizer.transform(Xtest)
x_train
<1x1 sparse matrix of type '<class 'numpy.float64'>' with 1 stored elements in Compressed Sparse Row format>
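A common cause of exactly this symptom — assuming data is a one-column DataFrame, as the "98675 rows × 1 columns" output suggests — is that iterating a DataFrame yields its column names, not its rows, so the vectorizer sees a single one-word "document". A sketch with made-up data (the 'title' column and sample strings are hypothetical):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({'title': ['red lipstick', 'blue eyeliner', 'hand cream']})
vec = TfidfVectorizer()

X_wrong = vec.fit_transform(df)           # iterates column names -> one "document"
X_right = vec.fit_transform(df['title'])  # iterates the rows -> three documents
print(X_wrong.shape)     # (1, 1)
print(X_right.shape[0])  # 3
```

Passing the Series (df['title']) rather than the DataFrame gives one row per document, matching the length of the labels.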
XGB model (or any other ML model) objective function vs scoring metrics
I was trying to set the random state for XGB using a numpy RandomState generator for hyperparameter tuning, so that each instance would give a different column subsampling and so on.
However, unlike normal sklearn regressors such as random forest, it seems I cannot set the random_state parameter like this:
regr = XGBRegressor(random_state=np.random.RandomState(42))
regr.fit(x_train, y_train)
pred_y_test = regr.predict(x_test)
The following error occurs:
xgboost.core.XGBoostError: Invalid Parameter format for seed expect int but value='RandomState(MT19937)'
Do I have to set it as an integer only? What if I want the seed to change after every hyperparameter trial? Is there an alternative random seed generator I can use, or should I just leave the parameter as None?
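One hedged workaround, since the error indicates xgboost wants a plain int: keep a single RandomState and draw a fresh integer seed from it for each trial. The helper name next_seed and the commented XGBRegressor call are illustrative assumptions, not xgboost API:

```python
import numpy as np

rng = np.random.RandomState(42)

def next_seed():
    # Draw a plain Python int in the range xgboost accepts for its seed.
    return int(rng.randint(0, 2**31 - 1))

seeds = [next_seed() for _ in range(3)]
# e.g. regr = XGBRegressor(random_state=seeds[0])  # hypothetical usage per trial
print(seeds)
```

This keeps the whole tuning run reproducible from the single seed 42 while still giving each trial its own integer random_state.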
MultiLabelBinarizer with duplicated values
I have an expected array [1, 1, 3] and a predicted array [1, 2, 2, 4] for which I want to calculate precision_recall_fscore_support, so I need a matrix in the following format:
>>> mlb = MultiLabelBinarizerWithDuplicates()
>>> transformed = mlb.fit_transform([(1, 1, 3), (1, 2, 2, 4)])
array([[1, 1, 0, 0, 1, 0],
       [1, 0, 1, 1, 0, 1]])
>>> mlb.classes_
[1, 1, 2, 2, 3, 4]
For the duplicated values I don't care which one of them is turned on, meaning that this is also a valid result:
MultiLabelBinarizer's documentation clearly says "All entries should be unique (cannot contain duplicate classes)", so it doesn't support this use case.
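One hedged workaround is to make the duplicates unique before binarizing, e.g. by tagging each label with its occurrence index; dedupe below is a hypothetical helper, not part of sklearn, and string tags sort lexicographically, which matches numeric order here only because the labels are single digits:

```python
from collections import Counter
from sklearn.preprocessing import MultiLabelBinarizer

def dedupe(labels):
    # Tag each label with its occurrence index so repeats become distinct classes.
    seen = Counter()
    out = []
    for lab in labels:
        out.append(f"{lab}#{seen[lab]}")
        seen[lab] += 1
    return out

mlb = MultiLabelBinarizer()
transformed = mlb.fit_transform([dedupe([1, 1, 3]), dedupe([1, 2, 2, 4])])
print(transformed.tolist())  # [[1, 1, 0, 0, 1, 0], [1, 0, 1, 1, 0, 1]]
print(list(mlb.classes_))    # ['1#0', '1#1', '2#0', '2#1', '3#0', '4#0']
```

Since both arrays pass through the same fit_transform, the resulting rows line up and can be fed to precision_recall_fscore_support directly.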