Numpy np.newaxis
saleprice_scaled = \
    StandardScaler().fit_transform(df_train['SalePrice'][:, np.newaxis])
Can anyone please explain what's happening in this line? Why is newaxis being used here? I know what newaxis does in general, but I can't figure out its use in this particular situation.
Thanks in advance.
1 answer

df_train['SalePrice']
is a pandas.Series (a vector / 1D array) of shape (N,). Modern (0.17+) scikit-learn methods don't accept 1D arrays (vectors); they expect 2D arrays.
df_train['SalePrice'][:, np.newaxis]
transforms the 1D array (shape: (N,)) into a 2D array (shape: (N, 1)).
Demo:
In [21]: df = pd.DataFrame(np.random.randint(10, size=(5, 3)), columns=list('abc'))

In [22]: df
Out[22]:
   a  b  c
0  4  3  8
1  7  5  6
2  1  3  9
3  7  5  7
4  7  0  6

In [23]: from sklearn.preprocessing import StandardScaler

In [24]: df['a'].shape
Out[24]: (5,)      # <-- 1D array

In [25]: df['a'][:, np.newaxis].shape
Out[25]: (5, 1)    # <-- 2D array
There is a Pandas way to do the same:
In [26]: df[['a']].shape
Out[26]: (5, 1)    # <-- 2D array

In [27]: StandardScaler().fit_transform(df[['a']])
Out[27]:
array([[-0.5 ],
       [ 0.75],
       [-1.75],
       [ 0.75],
       [ 0.75]])
What happens if we pass a 1D array:
In [28]: StandardScaler().fit_transform(df['a'])
C:\Users\Max\Anaconda4\lib\site-packages\sklearn\utils\validation.py:429: DataConversionWarning: Data with input dtype int32 was converted to float64 by StandardScaler.
  warnings.warn(msg, DataConversionWarning)
C:\Users\Max\Anaconda4\lib\site-packages\sklearn\preprocessing\data.py:586: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
C:\Users\Max\Anaconda4\lib\site-packages\sklearn\preprocessing\data.py:649: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
Out[28]: array([-0.5 ,  0.75, -1.75,  0.75,  0.75])
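For completeness, a quick sketch (with made-up numbers standing in for df_train['SalePrice']) of three equivalent ways to get the (N, 1) shape scikit-learn expects: np.newaxis, the reshape(-1, 1) suggested by the deprecation warning, and a one-column DataFrame:

```python
import numpy as np
import pandas as pd

# Stand-in for df_train['SalePrice']; the values here are made up
s = pd.Series([208500, 181500, 223500, 140000], name='SalePrice')

col_newaxis = s.values[:, np.newaxis]  # 1D -> 2D via np.newaxis
col_reshape = s.values.reshape(-1, 1)  # same result, as the warning suggests
col_frame = s.to_frame()               # the pandas way: a one-column DataFrame

print(col_newaxis.shape, col_reshape.shape, col_frame.shape)  # all (4, 1)
```

All three produce an object that StandardScaler().fit_transform accepts without warnings.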
See also questions close to this topic

How to groupby column headers using a regex?
I have a dataframe like this
   S1,0  S1,0.1  S1,0.2  S1,1  S1,1.1  S1,1.2  S2,0  S2,0.1  S2,1  S2,1.1
0     4       0       3     3       3       1     3       2     4       0
1     0       4       2     1       0       1     1       0     1       4
2     3       0       3     0       2       3     0       1     3       3
Now I want to groupby its column headers, whereby S1,0 should be in one group, S1,1 in another one, and the same for S2, and apply certain operations on those groups. My expected outcome looks like this (in case I calculate the mean, called m, and the standard deviation, called s):

         S1,0      S1,1      S2,0      S2,1
m 0  2.333333  2.333333  2.500000  2.000000
  1  2.000000  0.666667  0.500000  2.500000
  2  2.000000  1.666667  0.500000  3.000000
s 0  2.081666  1.154701  0.707107  2.828427
  1  2.000000  0.577350  0.707107  2.121320
  2  1.732051  1.527525  0.707107  0.000000
I can get this output doing:
import pandas as pd
import numpy as np

np.random.seed(0)
data = np.random.randint(0, 5, 30).reshape(3, 10)
df = pd.DataFrame(data, columns=['S1,0', 'S1,0.1', 'S1,0.2', 'S1,1', 'S1,1.1',
                                 'S1,1.2', 'S2,0', 'S2,0.1', 'S2,1', 'S2,1.1'])
df = df.T
gdf = df.groupby(lambda x: x.split('.', 1)[0])[df.columns].agg({'m': np.mean, 's': np.std}).T.sort_index()
My question is whether there is a way that avoids this split operation on the column names, where one can pass an actual regex instead? Something along the lines of:

import re

reg = re.compile(r'^S\d,\d')
gdf2 = df.groupby(reg)[df.columns].agg({'m': np.mean, 's': np.std}).T.sort_index()

This does not work, but is anything comparable possible?
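Not an answer from the page, but worth noting: groupby does not take a compiled regex directly, yet it does accept a callable applied to each label, so the regex can still do the grouping. A sketch against the question's own data:

```python
import re
import numpy as np
import pandas as pd

np.random.seed(0)
data = np.random.randint(0, 5, 30).reshape(3, 10)
df = pd.DataFrame(data, columns=['S1,0', 'S1,0.1', 'S1,0.2', 'S1,1', 'S1,1.1',
                                 'S1,1.2', 'S2,0', 'S2,0.1', 'S2,1', 'S2,1.1'])

# groupby accepts a function of each label; the regex extracts the group key
reg = re.compile(r'^S\d,\d')
g = df.T.groupby(lambda label: reg.match(label).group(0))

# stack mean and std into the (m/s, row) layout shown in the question
result = pd.concat({'m': g.mean().T, 's': g.std().T})
print(result)
```

The lambda wraps the regex only to pick out `.group(0)`; otherwise the grouping logic is entirely regex-driven.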

Python pandas rolling winsorize
I have a time-series pandas DataFrame and I have calculated a new column:

df['std_series'] = (df['series1'] - df['series1'].rolling(252).mean()) / df['series1'].rolling(252).std()

However, I want to winsorize to the 5% level before I standardize, on a rolling basis. So for any data point: look back 252 days, and if it is outside the 5% quantiles, clip it to the 5% quantile and then standardize. I couldn't figure out how to make it work with rolling.apply.
For instance (rolling on 10 elements):
df = pd.DataFrame({'series1':[78, 1, 3, 4, 5, 6, 7, 8, 99]})
and assume I clip at (0.15 and 0.85). Then the clip levels are (min=3.2, max=64), and the winsorized window expected before standardization will be:

[ 64  3.2  3.2  4  5  6  7  8  64 ]

All the examples I found winsorize either the whole DataFrame or an entire column.

Restructure a pandas dataframe
I have the following pandas DataFrame:
>>> df = pd.DataFrame([
...     [np.nan, 2, 'x', 0],
...     [3, 4, 'y', 0],
...     [9, 6, 'x', 1],
...     [np.nan, np.nan, 'y', 1]],
...     columns=['ignore', 'value', 'col', 'row'])
>>> df
   ignore  value col  row
0     NaN    2.0   x    0
1     3.0    4.0   y    0
2     9.0    6.0   x    1
3     NaN    NaN   y    1
I want to be able to convert it to something like the following:
     x    y
0  2.0  4.0
1  6.0  NaN
Is it possible using pivot or a multiindex or anything else? Or is looping through the individual values the only possible solution?
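This looks like a straightforward pivot, since each (row, col) pair occurs only once. A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, 'x', 0],
                   [3, 4, 'y', 0],
                   [9, 6, 'x', 1],
                   [np.nan, np.nan, 'y', 1]],
                  columns=['ignore', 'value', 'col', 'row'])

# 'row' becomes the index, 'col' values become columns, 'value' fills the cells
out = df.pivot(index='row', columns='col', values='value')
print(out)
```

If (row, col) pairs could repeat, pivot would raise and pivot_table with an aggfunc would be needed instead.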

Differ between integer and float in an array
I have an assignment where I'm handed an array of numbers: integers, floats, and possibly strings. I then have to identify which of the elements are contained in another array of pure integers and which are not. Those not contained in the integer array must be printed, and the user must change each such element to a value that is contained in the integer array. My problem is that if the element in the given array is a float, the output from the user's input also becomes a float (unless the input is a value from the integer array). The same problem occurs if the element in the given array is an integer and the user's input is a float; then the float rounds down to an integer. Can anyone give any tips on how I should change this code so the script runs flawlessly?
grades = np.array([3, 10.5, 0, 7, 4, 8, 4, 7, 10, 5])
SevenGradeScale = np.array([3, 0, 2, 4, 7, 10, 12])
SevenGradeScale = SevenGradeScale.astype(int)

for i in range(np.size(grades)):
    if grades[i] not in SevenGradeScale:
        while True:
            if grades[i] in SevenGradeScale:
                grades = grades.astype(int)
                print("The grade has been changed to {:d}. ".format(grades[i]))
                break
            elif type(grades[i]) is np.float64:
                print("\n{:.1f} is not a valid grade. The grade must be an integer.".format(grades[i]))
            elif type(grades[i]) is np.int32:
                print("\n{:d} is not within the seven grade scale.".format(grades[i]))
            elif type(grades[i]) is str:
                print("\n{:s} is not a valid grade.".format(grades[i]))
            try:
                grades[i] = float(input("Insert new grade: "))
            except ValueError:
                pass
You would probably comment on the "float(input())", but this somehow helped my script; I don't know if there are other possibilities.
When running the code and typing random inputs, I get the following results:
10.5 is not a valid grade. The grade must be an integer.
Insert new grade: 10.7

10.7 is not a valid grade. The grade must be an integer.
Insert new grade: 10
The grade has been changed to 10.

8 is not within the seven grade scale.
Insert new grade: 7.5
The grade has been changed to 7.

5 is not within the seven grade scale.
Insert new grade: 5.5

5 is not within the seven grade scale.
Insert new grade: string

5 is not within the seven grade scale.
Insert new grade: 0
The grade has been changed to 0.
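The root of the behaviour described above is that a NumPy array has a single dtype: np.array([3, 10.5, ...]) upcasts everything to float64, so per-element checks like type(grades[i]) is np.int32 can never fire. A small sketch of the effect, and of float.is_integer() as one way to tell whole numbers from true fractions instead:

```python
import numpy as np

grades = np.array([3, 10.5, 0, 7, 4])
print(grades.dtype)        # float64: the whole array was upcast
print(type(grades[0]))     # np.float64, even though 3 "looks like" an int

# is_integer() distinguishes whole numbers from genuine fractions
whole = [g.is_integer() for g in grades]
print(whole)               # [True, False, True, True, True]
```

To keep each element's original type (int, float, or str), a plain Python list or a dtype=object array would be needed instead of a numeric ndarray.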

Sum an array to a matrix diagonal (element wise sum)
I need to sum to a matrix diagonal an array (element wise sum):
import numpy as np

mx = np.matrix([[0,0,0,0],[1,1,1,1],[2,3,4,5],[6,7,8,9]])
print mx
print '\n'
v = np.array([20,0,10,0.5])
print v
print '\n'
This doesn't work:
mx = np.diagonal(mx, v)
print mx
What can I do? Thank you all
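np.diagonal only extracts a diagonal (its second argument is an integer offset, not an array to add), which is why the call above fails. A sketch of two ways to do the element-wise sum onto the diagonal:

```python
import numpy as np

mx = np.matrix([[0, 0, 0, 0], [1, 1, 1, 1], [2, 3, 4, 5], [6, 7, 8, 9]])
v = np.array([20, 0, 10, 0.5])

# Option 1: add a diagonal matrix built from v
result = mx + np.diag(v)

# Option 2: address the diagonal cells directly (on a float copy of the data)
out = np.asarray(mx, dtype=float)
out[np.diag_indices_from(out)] += v
print(result)
```

Note the float dtype: mx holds ints, and adding 0.5 to an int array in place would truncate.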

Python: how to summarize values in one column based on the other column?
I have an array z with 2 columns: x and y. And I have a much smaller array X0. What I have to do is sort the [x, y] array by the x column (I know how to do that). Then I want to sum all y values for x in the range X0[i] - 0.1 to X0[i] + 0.1, and get an array Y0 with the summed y values. It must be of the same length as X0.
Could you please help me to do that? My attempt so far:

Y0 = numpy.zeros(len(X0))
z = sorted(z, key=lambda z_entry: z_entry[1])
for i in range(len(X0)):
    for j in range(len(z)):
        if x[j] >= X0[i] - 0.1 and x[j] < X0[i] + 0.1:
            Y0[i] = Y0[i] + y[j]
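A sketch of a vectorized version, using hypothetical toy data: a boolean mask per X0 entry removes the need for both the sorting and the inner loop.

```python
import numpy as np

# hypothetical data: z has two columns, x and y
z = np.array([[0.05, 1.0],
              [0.12, 2.0],
              [0.31, 3.0],
              [0.95, 4.0]])
X0 = np.array([0.1, 0.3])

x, y = z[:, 0], z[:, 1]
# for each center c, sum the y values whose x falls in [c - 0.1, c + 0.1)
Y0 = np.array([y[(x >= c - 0.1) & (x < c + 0.1)].sum() for c in X0])
print(Y0)
```

Y0 comes out with one summed value per entry of X0, as required.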

How to run predict_generator on large dataset with limited memory?
Currently I am feeding all the images at once to predict_generator. I want to be able to feed small set of images which are being stored in the validation_generator and make predictions on them so that there are no memory issues with large datasets. How should I change the following code?
top_model_weights_path = '/home/rehan/ethnicity.071217.230.28.hdf5'
path = "/home/rehan/countries/pakistan/guys/"
img_width, img_height = 139, 139
confidence = 0.8
model = applications.InceptionResNetV2(include_top=False, weights='imagenet',
                                       input_shape=(img_width, img_height, 3))
print("base pretrained model loaded")
validation_generator = ImageDataGenerator(rescale=1./255).flow_from_directory(
    path, target_size=(img_width, img_height), batch_size=32, shuffle=False)
print("validation_generator")
features = model.predict_generator(validation_generator, steps=10)

Training-testing data set for a learning machine using R
I have been using a learning machine in order to forecast a variable from time-series data. My question concerns how I create both data sets with the following script:
> library(caret)
> ind = createDataPartition(Data$variable, p = 2/3, list = FALSE)
> train <- Data[ind,]
> test <- Data[-ind,]
These data sets are randomly chosen from the whole data set, having 2/3 for training and 1/3 for testing.
Do you consider this technique correct? From my point of view, the predicted data will have a high r^2 because it is a time-series dataset (highly correlated). Do you consider it would be more beneficial to pick the last 1/3 of the data instead (an ordinal technique)?
Thanks,
Regards. 
Finding which dimensions were most significant in making a given Machine Learning classification
I'm using machine learning to classify documents. I'm hoping to find which words in a document were most significant for the algorithm when deciding which category the document belongs to. The ultimate reason is that I eventually want to use that information to highlight the most relevant terms in the document text.
A little more concretely, assume we have training data with 5 dimensions consisting of the TF-IDF values for certain words.
The classifier has 3 classes: Apple, Banana, Cherry.
If the classifier thinks the document "I like the black cherry" belongs to class Cherry, then is there any way to determine that the most significant dimension in that classification was, say, dimension=4, which in turn corresponds to the word "cherry"?
I'm not bound to a particular kind of classifier or anything like that. I'm more wondering whether this could be done with any classifier, whether (ideally) in practice or in principle. Perhaps something like a post-facto PCA.
Update:
Just to clarify a little. I have already trained a model and my classifier sorts documents into categories (e.g. A, B, C) successfully. What I'm hoping to do now is, when a new document is classified, to work backwards and find which of the dimensions in my data (in other words, which word in the text, given my dimensions correspond to words) were the most significant in making that classification. In my above example, I assume the word "cherry" will have been more significant in making the decision than the word "I", for example. Is there any way of getting a "score" for the significance of each dimension/word either during or after the classification process?

warm start for scikit learn regression
My dataset size is in the 8 digits and, as expected, training any scikit-learn regression model gives an out-of-memory error. In MLPRegressor there is a warm_start feature that can be set to true, after which I can split the training data and keep training in batches. I can't find the same in KernelRidge or SVR. I am an amateur at machine learning, so is there any way of solving this problem?
Custom model moved to scikit-learn
In short: I wrote a custom model which worked well, but when I tried to reimplement it in scikit-learn, it worked poorly. I am not sure if my code (below) is buggy, if I'm missing recommended preprocessing for a scikit-learn project, or if my choice of model is wrong.
The goal of the project is: Given the title of a blog post about a product, predict the actual product that is being written about. There are about 2,000 products overall.
First I built a custom model, using some "language model" principles that I adopted from a textbook.
I went through the labeled data and, for each product, got a tally of all words used in all its titles (e.g. {car: 10, windshield: 3, husband: 2, tires: 5}). Then, to make predictions for unseen titles, I tokenized the title and constructed a score for each product: score = (word1's percent frequency in the product's tally) * (word2's percent frequency in the product's tally), etc. I used a default low percent frequency for words that were missing from a product's tally (rather than 0, which would mess up the rankings).
Then I'd sort the products by score and return the top 5. I found that 87% of the time that I did this, the correct product was within those top 5.
So now I tried implementing it using Naive Bayes in scikit-learn. For now, as an evaluation metric, I'm using the default score() method, which is harsher since the top (only) prediction of the model has to be correct. But I am getting 44% accuracy, which surprises me.
- Notably, I also get 44% when scoring on my training data; I think this should be much higher, as the model has already seen this data.
- I also get low scores using logistic regression: 88% when scoring on the already-seen training data, but 47% for unseen test data.
My code:
titles = []
products = []
with open('1pct_singlelabel.csv', 'r', encoding="utf8") as one_pct:
    reader = csv.reader(one_pct, delimiter=',', quotechar='"', lineterminator='\n')
    for i, row in enumerate(reader):
        if i == 0:
            continue  # skip header
        titles.append(row[2])
        products.append(row[1])

text_train, text_test, y_train, y_test = train_test_split(titles, products, random_state=0)

vect = CountVectorizer(min_df=0)
vect.fit(titles)
X_train = vect.transform(text_train)
X_test = vect.transform(text_test)

le = preprocessing.LabelEncoder()
le.fit(products)
y_train = le.transform(y_train)
y_test = le.transform(y_test)

clf = LogisticRegression()
clf.fit(X_train, y_train)
print("Logistic Regression: ")
print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))
In sum, I don't understand why scikit-learn is so much worse than my custom code, which was not complex and was (I am told) similar to Naive Bayes. I am not sure if I am using scikit-learn correctly.
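One factor worth separating out before comparing the models: the custom model is scored on top-5 membership (87%), while score() is strict top-1 (44%), so the two numbers are not directly comparable. Any scikit-learn classifier with predict_proba can be scored the same top-k way — a sketch of such a helper (the function name and shape are mine):

```python
import numpy as np

def top_k_accuracy(clf, X, y, k=5):
    """Fraction of samples whose true label is among the k most probable classes."""
    proba = clf.predict_proba(X)
    top_k = np.argsort(proba, axis=1)[:, -k:]  # indices of the k best classes
    y_idx = np.searchsorted(clf.classes_, y)   # classes_ is sorted, so this maps labels to columns
    return float(np.mean([yi in row for yi, row in zip(y_idx, top_k)]))
```

scikit-learn 0.24+ also ships sklearn.metrics.top_k_accuracy_score with the same intent.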

Python Sklearn linear regression not callable
I am implementing simple linear regression and multiple linear regression using pandas and sklearn
My code is as follows
import pandas as pd
import numpy as np
import scipy.stats
from sklearn import linear_model
from sklearn.metrics import r2_score

df = pd.read_csv("Auto.csv", na_values='?').dropna()
lr = linear_model.LinearRegression()
y = df['mpg']
x = df['displacement']
X = x.values.reshape(-1, 1)
sklearn_model = lr.fit(X, y)
This works fine, but for multiple linear regression it for some reason doesn't work with the () at the end of sklearn's LinearRegression; when I use it with the brackets I get the following error:
TypeError: 'LinearRegression' object is not callable
My multiple linear regression code is as follows:
lr = linear_model.LinearRegression
feature_1 = np.array(df[['displacement']])
feature_2 = np.array(df[['weight']])
feature_1 = feature_1.reshape(len(feature_1), 1)
feature_2 = feature_2.reshape(len(feature_2), 1)
X = np.hstack([feature_1, feature_2])
sklearn_mlr = lr(X, df['mpg'])
I want to know what I'm doing wrong. Additionally, I'm not able to print the various attributes of the linear regression object if I don't use the () at the end, e.g.:
print(sklearn_mlr.coef_)
Gives me the error:
AttributeError: 'LinearRegression' object has no attribute 'coef_'
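The fix is to keep both steps from the working single-feature version: instantiate with (), then call .fit. Calling the fitted instance like lr(X, y) is what raises "object is not callable", and never calling fit is why coef_ doesn't exist. A sketch with made-up numbers standing in for the Auto.csv columns:

```python
import numpy as np
from sklearn import linear_model

# stand-in for the two feature columns and the target (values are made up)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([3.0, 3.0, 7.0, 7.0])   # here simply the sum of the two features

lr = linear_model.LinearRegression()  # () instantiates the estimator...
sklearn_mlr = lr.fit(X, y)            # ...and fit() trains it; lr(X, y) would fail
print(sklearn_mlr.coef_)              # coef_ exists only after fit
```

fit returns the estimator itself, so lr and sklearn_mlr are the same object; the pattern works identically for one feature or many.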

Spark-SQL Server connection
Can we connect Spark with SQL Server? If so, how? I am new to Spark; I want to connect the server to Spark and work directly from SQL Server instead of uploading a .txt or .csv file. Please help, thank you.

Can we map columns using pandas merge with different sizes without getting NaN?
I want to map the columns of two data frames of different lengths using merge, but I am getting NaN values. I did an outer join.

Where can we get historical weather data of India?
We have a final-year project pitched about data analysis of weather data, which will be historical, and we very much need data such as temperature, humidity, pressure, wind speed, etc. Getting the data itself was the first task; there are plenty of sources such as https://www.wunderground.com/, but they didn't help. After getting the data we are planning to use Python or R as the language for developing a machine learning algorithm.