Numpy np.newaxis
saleprice_scaled = \
    StandardScaler().fit_transform(df_train['SalePrice'][:, np.newaxis]);
Can anyone please explain what's happening in this line? Why is newaxis being used here? I know what newaxis does in general, but I can't figure out its use in this particular situation.
Thanks in advance
1 answer

df_train['SalePrice']
is a pandas Series (a vector, i.e. a 1D array) of shape (N,). Modern (version 0.17+) scikit-learn methods don't like 1D arrays (vectors); they expect 2D arrays.
df_train['SalePrice'][:,np.newaxis]
transforms the 1D array (shape: (N,)) into a 2D array (shape: (N, 1)).
Demo:
In [21]: df = pd.DataFrame(np.random.randint(10, size=(5, 3)), columns=list('abc'))

In [22]: df
Out[22]:
   a  b  c
0  4  3  8
1  7  5  6
2  1  3  9
3  7  5  7
4  7  0  6

In [23]: from sklearn.preprocessing import StandardScaler

In [24]: df['a'].shape
Out[24]: (5,)            # <--- 1D array

In [25]: df['a'][:, np.newaxis].shape
Out[25]: (5, 1)          # <--- 2D array
There is a Pandas way to do the same:
In [26]: df[['a']].shape
Out[26]: (5, 1)          # <--- 2D array

In [27]: StandardScaler().fit_transform(df[['a']])
Out[27]:
array([[-0.5 ],
       [ 0.75],
       [-1.75],
       [ 0.75],
       [ 0.75]])
What happens if we pass a 1D array:
In [28]: StandardScaler().fit_transform(df['a'])
C:\Users\Max\Anaconda4\lib\site-packages\sklearn\utils\validation.py:429: DataConversionWarning: Data with input dtype int32 was converted to float64 by StandardScaler.
  warnings.warn(msg, _DataConversionWarning)
C:\Users\Max\Anaconda4\lib\site-packages\sklearn\preprocessing\data.py:586: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
C:\Users\Max\Anaconda4\lib\site-packages\sklearn\preprocessing\data.py:649: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
Out[28]: array([-0.5 ,  0.75, -1.75,  0.75,  0.75])
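For completeness, a few equivalent ways (standard NumPy/pandas idioms, not part of the original answer) to get the same (N, 1) shape. Note that on newer pandas versions, multi-axis indexing directly on a Series raises an error, so going through the underlying array is safer:

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 7, 1, 7, 7])       # stand-in for df_train['SalePrice']

a = s.values[:, np.newaxis]          # np.newaxis on the underlying array
b = s.values.reshape(-1, 1)          # reshape, as the deprecation warning suggests
c = s.to_frame().values              # via a one-column DataFrame

print(a.shape, b.shape, c.shape)     # all (5, 1)
```

All three produce the 2D shape scikit-learn expects for a single feature.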
See also questions close to this topic

how to check if a value exists in a dataframe
Hi, I am trying to get the name of a DataFrame column which contains a specific word.
eg: i have a dataframe,
NA
good employee     Not available
best employer     not required
well manager      not eligible
super reportee

my_word = ["well"]
How do I check if "well" exists in the df, and get the name of the column which contains "well"?
Thanks in advance!
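One common way to sketch this (not from the original question; the frame below is hypothetical, assuming string columns) is to build a boolean mask per column with str.contains and then keep the columns where any cell matched:

```python
import pandas as pd

# hypothetical two-column frame resembling the question's data
df = pd.DataFrame({
    'employee': ['good employee', 'best employer', 'well manager', 'super reportee'],
    'status':   ['Not available', 'not required', 'not eligible', 'ok'],
})

word = 'well'
mask = df.apply(lambda col: col.astype(str).str.contains(word))  # cell-wise booleans
cols_with_word = df.columns[mask.any()].tolist()                 # columns with any match
print(cols_with_word)  # -> ['employee']
```

mask.any() collapses the cell-wise booleans to one flag per column, so the result is exactly the list of column names containing the word.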

Outlier Analysis Python: Is there a better/more efficient way?
I am trying to do my outlier analysis in Python. Since I have multiple dataframes with varying length, I want to deduct 2.5% of both the tail and head when the dataframe has 10 observations, 0.25% when it has 100 etc. Currently, I have some code that seems to work. However, I still have the feeling it could be a little bit more efficient. This is mainly because of the last 2 lines. I feel like the filter could be done in one line. Also, I am unsure if the .loc is of good use here. Perhaps there is a better way to do this? Does anyone have suggestions?
This is my first question, so please let me know if there is anything I can improve with my question =)
Currently, this is my code:
df_filtered_3['variable'] = df_filtered_3['variable1'] / df_filtered_3['variable2']

if len(df_filtered_3.index) <= 10:
    low = .025
    high = .0975
elif len(df_filtered_3.index) <= 100:
    low = .0025
    high = .00975
elif len(df_filtered_3.index) <= 1000:
    low = .00025
    high = .000975
elif len(df_filtered_3.index) <= 10000:
    low = .000025
    high = .0000975
else:
    low = .0000025
    high = .00000975

quant_df = df_filtered_3.quantile([low, high])
df_filtered_3 = df_filtered_3.loc[df_filtered_3['variable'] > int(quant_df.loc[low, 'variable']), :]
df_filtered_3 = df_filtered_3.loc[df_filtered_3['variable'] < int(quant_df.loc[high, 'variable']), :]
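As a sketch of the one-line filter the asker is after (an illustration with toy data, not the original frame; the quantile values here are only examples), the two .loc calls can be combined into a single boolean filter:

```python
import pandas as pd

df = pd.DataFrame({'variable': [1.0, 2.0, 3.0, 50.0, -40.0]})

lo = df['variable'].quantile(.025)
hi = df['variable'].quantile(.975)

# both tail cuts in one line, no .loc needed
filtered = df[(df['variable'] > lo) & (df['variable'] < hi)]
print(len(filtered))  # 3
```

This also avoids the int() truncation of the thresholds in the original code, which silently coarsens the cutoffs for float data.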

why this panda data frame doesn't append
I am hoping that, in the following code, the for loop will loop over all the CSVs in the folder, and the df data frame will be appended to after reading each one. However, df here never appends; it only contains the content of the first CSV. Any thoughts? Thanks!
We are in python 3.6 and pandas 0.21
path = "/home/ubuntu/QA/client_" + CLIENT_ID + "_raw_data_" + year + "/_ACTUAL_*_Accrual*.xls"

if CLIENT_ID in ('7'):
    df_columns = pd.DataFrame(columns=['PropID', 'PROPERTY_CODE', 'TreeNodeID', 'ACCOUNT_CODE', 'TreeNodeName', 'ReportYear', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
    OUTPUT_CSV = "Client_" + CLIENT_ID + "_" + year + "_" + ACCOUNTING_TYPE + "_QA.csv"
    df_columns.to_csv(OUTPUT_CSV, header=True, index=False, encoding='utf8', na_rep="NA", mode='w')

    df = pd.DataFrame()
    for fname in glob.iglob(path):
        print(fname)
        df2 = pd.DataFrame()
        # convert the account code in the raw data into strings. dropna drops the row
        # if column 4 (the IAM account code) is NA
        df2 = pd.read_excel(fname, skiprows=4,
                            converters={'TreeNodeCode': np.int64, 'PropCode': np.str}
                            ).dropna(subset=['TreeNodeCode'], how='any')
        print(df2)
        df = df.append(df2)

    df = df.rename(columns={'TreeNodeCode': 'ACCOUNT_CODE'})
    df = df.rename(columns={'PropCode': 'PROPERTY_CODE'})
    df['PROPERTY_CODE'] = df_QA['PROPERTY_CODE'].astype(np.str)
    df['ACCOUNT_CODE'] = df_QA['ACCOUNT_CODE'].astype(np.str)
    df_QA['PROPERTY_CODE'] = df_QA['PROPERTY_CODE'].astype(np.str)
    df_QA['ACCOUNT_CODE'] = df_QA['ACCOUNT_CODE'].astype(np.str)

    print("this is df")
    print(df)
    print("this is df_QA")
    print(df_QA)

    df_check = pd.merge(df, df_QA, how='inner', on=['PROPERTY_CODE', 'ACCOUNT_CODE'])
    #print(df_check)
    # tricks in this ticket:
    # https://stackoverflow.com/questions/384192823/subtracting-multiple-columns-and-appending-results-in-pandas-dataframe
    df_check[['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']] = df_check[['Jan_x', 'Feb_x', 'Mar_x', 'Apr_x', 'May_x', 'Jun_x', 'Jul_x', 'Aug_x', 'Sep_x', 'Oct_x', 'Nov_x', 'Dec_x']] - df_check[['Jan_y', 'Feb_y', 'Mar_y', 'Apr_y', 'May_y', 'Jun_y', 'Jul_y', 'Aug_y', 'Sep_y', 'Oct_y', 'Nov_y', 'Dec_y']].values
    #print(df_check)

    df_check2 = df_check[['PropID', 'PROPERTY_CODE', 'TreeNodeID', 'ACCOUNT_CODE', 'TreeNodeName', 'ReportYear', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']]
    #print(df_check2)
    # tricks of pandas query:
    # https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html#pandas-dataframe-query
    df_check3 = df_check2.query('Jan > 0 | Jan < 0 | Feb > 0 | Feb < 0 | Mar > 0 | Mar < 0 | Apr > 0 | Apr < 0 | May > 0 | May < 0 | Jun > 0 | Jun < 0 | Jul > 0 | Jul < 0 | Aug > 0 | Aug < 0 | Sep > 0 | Sep < 0 | Oct > 0 | Oct < 0 | Nov > 0 | Nov < 0 | Dec > 0 | Dec < 0')
    #print(df_check3)
    #print(df_check3.info())
    df_check3.to_csv(OUTPUT_CSV, header=False, index=False, na_rep="NA", mode='a')
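Separately from the bug itself, the idiomatic way to build one frame from many files is to collect the pieces in a list and concatenate once with pd.concat, rather than calling append inside the loop. A minimal sketch with in-memory frames standing in for the Excel files:

```python
import pandas as pd

# stand-ins for the frames read from each file in the glob loop
pieces = [pd.DataFrame({'TreeNodeCode': [i], 'PropCode': [str(i)]})
          for i in range(3)]

# one concat at the end instead of repeated append calls
df = pd.concat(pieces, ignore_index=True)
print(df.shape)  # (3, 2)
```

Besides being clearer, this avoids the quadratic cost of repeatedly copying a growing frame, and makes it obvious whether the loop actually yielded more than one piece.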

Modified Gram Schmidt in Python for complex vectors
I wrote some code to implement the modified Gram Schmidt process. When I tested it on real matrices, it is correct. However, when I tested it on complex matrices, it went wrong.
I believe my code is correct, having done a step-by-step check. Therefore, I wonder if there are numerical reasons why the modified Gram Schmidt process fails on complex vectors.
Following is the code:
import numpy as np

def modifiedGramSchmidt(A):
    """
    Gives an orthonormal matrix, using modified Gram Schmidt Procedure

    :param A: a matrix of column vectors
    :return: a matrix of orthonormal column vectors
    """
    # assuming A is a square matrix
    dim = A.shape[0]
    Q = np.zeros(A.shape, dtype=A.dtype)
    for j in range(0, dim):
        q = A[:, j]
        for i in range(0, j):
            rij = np.vdot(q, Q[:, i])
            q = q - rij * Q[:, i]
        rjj = np.linalg.norm(q, ord=2)
        if np.isclose(rjj, 0.0):
            raise ValueError("invalid input matrix")
        else:
            Q[:, j] = q / rjj
    return Q
Following is the test code:
import numpy as np

dim = 3

# If testing on random matrices:
# X = np.random.rand(dim, dim)*10 + np.random.rand(dim, dim)*5*1j

# If testing on some good one:
v1 = np.array([1, 0, 1j]).reshape((3, 1))
v2 = np.array([1, 1j, 1]).reshape((3, 1))
v3 = np.array([0, 1, 1j + 1]).reshape((3, 1))
X = np.hstack([v1, v2, v3])

Y = modifiedGramSchmidt(X)
Y3 = np.linalg.qr(X, mode="complete")[0]

if np.isclose(Y3.conj().T.dot(Y3), np.eye(dim, dtype=complex)).all():
    print("The QR-complete gives orthonormal vectors")
if np.isclose(Y.conj().T.dot(Y), np.eye(dim, dtype=complex)).all():
    print("The Gram Schmidt process is tested against a random matrix")
else:
    print("But my modified GS goes wrong!")
    print(Y.conj().T.dot(Y))
Update
The problem is that I implemented an algorithm designed for an inner product that is linear in its first argument,
whereas I thought it was linear in the second argument.
Thanks @landogardner
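To make the convention concrete (this is standard, documented NumPy behaviour, added here for illustration): np.vdot conjugates its first argument, so it is linear in the second argument, and swapping the arguments conjugates the result:

```python
import numpy as np

a = np.array([1j, 0])
b = np.array([1 + 0j, 0])

# vdot(a, b) = sum(conj(a) * b): conjugate-linear in a, linear in b
print(np.vdot(a, b))  # conj(1j) * 1 = -1j
print(np.vdot(b, a))  # conj(1)  * 1j =  1j
```

For real matrices the two conventions coincide, which is why the bug only shows up on complex input.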

Install latest numpy package using apt-get
I used the command below to install the numpy package, but it installed version 1.11.0. How do I install the most recent version?
sudo apt-get install python-numpy

Figuring out the proper numpy.reshape for the -1 option
I have a (hopefully) quick Numpy question, and I hope you can help me. I want to use numpy.reshape to convert (5000, 32, 32, 3) into (5000, 3072), and the only clue I got for the assignment is this:
# Reshape each image data into a 1-dim array
print(X_train.shape, X_test.shape)  # Should be: (5000, 32, 32, 3) (500, 32, 32, 3)
#####################################################################
# TODO (2):                                                         #
# Reshape the image data to one dimension.                          #
#                                                                   #
# Hint: Look at the numpy reshape function and have a look at the   #
# -1 option                                                         #
#####################################################################
X_train =
X_test =
#####################################################################
#                        END OF YOUR CODE                           #
#####################################################################
print(X_train.shape, X_test.shape)  # Should be: (5000, 3072) (500, 3072)
I've been spending the last day scouring Google for examples, but apparently this is too trivial to warrant an ask. Help?
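Without giving away more than the hint already does: reshape accepts -1 for one dimension, and NumPy infers that dimension from the total number of elements. A sketch on dummy data with the question's shapes:

```python
import numpy as np

X_train = np.zeros((5000, 32, 32, 3))  # dummy data with the expected shape

# -1 tells numpy to infer the second dimension: 32 * 32 * 3 = 3072
X_train = X_train.reshape(X_train.shape[0], -1)
print(X_train.shape)  # (5000, 3072)
```

The same one-liner applies to X_test, since reshape only needs the leading (sample) dimension to be kept explicit.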

Has Anyone Installed Mac OS High Sierra and is Python Anaconda still working?
My mac is bugging me to install the Mac OS High Sierra.
Has anyone installed it without Anaconda or other Python libraries breaking?

Save params between script runs in iPython console
I want to find the iPython console equivalent of the Spyder console command.
When I use the Spyder app, all my variables are preserved between script runs. By that I mean not only that I can inspect the values after the script has finished running, but also that those values are still available during the next script run.
Spyder console command (doesn't work in iPython console):
runfile('some_file.py', wdir='/some/project/folder')
There is a similar command in iPython console:
%run -i "some_script.py"
The problem is that this command deletes old values when new script starts executing.
Why is this important?
Let's say my script, among other things, builds some model which takes longer than I'm willing to wait (every time). In Spyder I can run it just the first time and then comment out that line of code; the next time, only the rest of the code runs and the model is pulled from working memory.
(yes I know I can save the model in pickle format etc. but that's totally beside the point)
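One workaround sketch (an assumption on my part, not from the question): since %run -i executes the script in the interactive namespace, the script itself can skip the slow step when the object already exists there, with no commenting-out needed. build_model below is a hypothetical stand-in for the slow part:

```python
def build_model():
    # hypothetical expensive step
    return {"trained": True}

try:
    model  # noqa: F821 - may survive from a previous %run -i
except NameError:
    # only reached on the first run, when `model` is not yet defined
    model = build_model()

print(model)
```

On the first run the NameError branch builds the model; on later %run -i invocations the name survives in the console namespace and the guard skips the rebuild.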

Gathering list of 2d tensors from a 3d tensor in Keras
I have a 3d Tensor named main_decoder of shape (None,9,256)
I want to extract 9 tensors of shape (None,256)
I have tried using Keras gather; the following is my code snippet:
for i in range(0, 9):
    sub_decoder_input = Lambda(lambda main_decoder: gather(main_decoder, (i)),
                               name='lambda' + str(i))(main_decoder)
The result is 9 lambda layers of shape (9, 256).
How can I modify it so that I can get or gather 9 tensors of shape (None, 256)?
Thanks.
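A plain-NumPy analogue of the shape issue (not Keras code; it only illustrates which axis the index must be taken along): picking index i along axis 0 of a (batch, 9, 256) array drops the batch axis, while slicing along axis 1 keeps it:

```python
import numpy as np

batch = np.zeros((4, 9, 256))   # stand-in for a (None, 9, 256) tensor, batch size 4

taken_axis0 = batch[0]          # shape (9, 256): batch dimension is gone
taken_axis1 = batch[:, 0, :]    # shape (4, 256): batch dimension preserved
print(taken_axis0.shape, taken_axis1.shape)
```

Gathering along the default axis 0 corresponds to the first case, which matches the (9, 256) shapes the asker is seeing; the fix is to index along the middle axis instead.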

how to create curvefit graph using linear regression
I have this data:
      FarmerCropId      NDVI
7208      251784.0  0.235035
7209      251784.0  0.345980
7210      251784.0  0.286614
7211      251784.0  0.233536
7212      251784.0  0.167464
7213      251784.0  0.137915
7214      251784.0  0.111309
7497      251907.0  0.250681
7498      251907.0  0.299998
Here is my code:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
import pandas as pd
from scipy.optimize import curve_fit
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

Y = df1['FarmerCropId']
X = df1['NDVI']
X = X.values.reshape(len(X), 1)
Y = Y.values.reshape(len(Y), 1)

plt.scatter(X, Y, color='green')
plt.title('Test Data')
plt.ylabel('Ndvi')
plt.xlabel('farmer crop id')

regr = linear_model.LinearRegression()
poly = PolynomialFeatures(degree=10)
regr.fit(X, Y)
plt.plot(X, regr.predict(Y), color='orange', linewidth=3)
plt.show()
I tried this code, but it gives only a line prediction. I want a curve prediction, i.e. a polynomial plot, but with this code I can only get a line plot. Please help me with this. Thanks
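A hedged sketch of what a polynomial (curve) fit looks like in scikit-learn: transform X with PolynomialFeatures first, then fit LinearRegression on the transformed features. The toy data below is illustrative, not the NDVI data; note the question's code creates poly but never applies it, which is why only a line comes out:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
y = (3 * X ** 2 - X + 0.5).ravel()      # a known quadratic, so the fit is checkable

X_poly = PolynomialFeatures(degree=2).fit_transform(X)  # columns: 1, x, x^2
regr = LinearRegression().fit(X_poly, y)
y_pred = regr.predict(X_poly)           # follows the curve, not a straight line

print(round(float(regr.score(X_poly, y)), 3))
```

The regression itself stays linear; the curvature comes entirely from the expanded feature columns, so predict must also receive transformed inputs.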

Optimizing code mimiking sklearn kNN algorithm
I have written a script performing kNN classification using homemade functions. I have compared its performance against a similar script but using sklearn package.
Results: homemade ~ 20 seconds, sklearn ~ 2 seconds
So now I would like to know if the performance difference is mainly due to the fact that sklearn is executed at a lower level (in C as far as I understand) or because my script is not efficient.
If any of you have references with information on writing efficient Python scripts and programs, I am all ears.
Here is the data file : DataFile
The filename, os.environ['R_HOME'], and os.environ['R_USER'] values in both scripts must be made user-specific according to your directory structure.
My code using homemade kNN classification
# Start Timer
import time
tic = time.time()

# Begin Script
import os
os.environ['R_HOME'] = r'C:\Users\MyUser\Documents\R\R-3.4.1'  # setting temporary PATH variables: R_HOME
# a permanent solution could be achieved but is more complicated
os.environ['R_USER'] = r'C:\Users\MyUser\AppData\Local\Programs\Python\Python36\Lib\site-packages\rpy2'  # same story
import rpy2.robjects as robjects
import numpy as np
import matplotlib.pyplot as plt

## Read R data from ESLII book
dir = os.path.dirname(__file__)
filename = os.path.join(dir, '../ESL.mixture.rda')
robjects.r['load'](filename)  # load rda file in R workspace
rObject = robjects.r['ESL.mixture']  # read variable in R workspace and save it into python workspace

# Extract Blue and Orange classes data
# note that information about rObject is known by outputting the object into the console
# numpy is able to convert R data natively
classes = np.array(rObject[0])
BLUE = classes[0:100, :]
# the [:, None] is necessary to make the 1D array 2D.
# Indeed concatenate requires identical dimensions
# other functions exist such as np.column_stack but they take more time to execute than basic concatenate
BLUE = np.concatenate((BLUE, np.zeros(np.size(BLUE, axis=0))[:, None]), axis=1)
ORANGE = classes[100:200]
ORANGE = np.concatenate((ORANGE, np.ones(np.size(ORANGE, axis=0))[:, None]), axis=1)
trainingSet = np.concatenate((BLUE, ORANGE), axis=0)

## create meshgrid
minBound = -3
maxBound = 4.5
xmesh = np.linspace(minBound, maxBound, 100)
ymesh = np.linspace(minBound, maxBound, 100)
xv, yv = np.meshgrid(xmesh, ymesh)
gridSet = np.stack((xv.ravel(), yv.ravel())).T

def predict(trainingSet, queryPoint, k):
    # create list for distances and targets
    distances = []
    # compute euclidean distance
    for i in range(np.size(trainingSet, 0)):
        distances.append(np.sqrt(np.sum(np.square(trainingSet[i, :-1] - queryPoint))))
    # find k nearest neighbors to the query point and compute its outcome
    distances = np.array(distances)
    indices = np.argsort(distances)  # provides indices, sorted from short to long distances
    kindices = indices[0:k]
    kNN = trainingSet[kindices, :]
    queryOutput = np.average(kNN[:, 2])
    return queryOutput

k = 1
gridSet = np.concatenate((gridSet, np.zeros(np.size(gridSet, axis=0))[:, None]), axis=1)
i = 0
for point in gridSet[:, :-1]:
    gridSet[i, 2] = predict(trainingSet, point, k)
    i += 1

#k = 1
#test = predict(trainingSet, np.array([4.0, 1.2]), k)

col = np.where(gridSet[:, 2] < 0.5, 'b', 'r').flatten()  # flatten is necessary. 2D arrays are only accepted with RGBA colors
plt.scatter(gridSet[:, 0], gridSet[:, 1], c=col, s=0.2)
col = np.where(trainingSet[:, 2] < 0.5, 'b', 'r').flatten()
plt.scatter(trainingSet[:, 0], trainingSet[:, 1], c=col, s=1.0)
plt.contour(xv, yv, gridSet[:, 2].reshape(xv.shape), 0.5)
plt.savefig('kNN_homeMade.png', dpi=600)
plt.show()

# Stop timer
toc = time.time()
print(toc - tic, 'sec Elapsed')
My code using sklearn kNN
# Start Timer
import time
tic = time.time()

# Begin Script
import os
os.environ['R_HOME'] = r'C:\Users\MyUser\Documents\R\R-3.4.1'  # setting temporary PATH variables: R_HOME
# a permanent solution could be achieved but is more complicated
os.environ['R_USER'] = r'C:\Users\MyUser\AppData\Local\Programs\Python\Python36\Lib\site-packages\rpy2'  # same story
import rpy2.robjects as robjects
import numpy as np
import matplotlib.pyplot as plt
from sklearn import neighbors

## Read R data from ESLII book
dir = os.path.dirname(__file__)
filename = os.path.join(dir, '../ESL.mixture.rda')
robjects.r['load'](filename)  # load rda file in R workspace
rObject = robjects.r['ESL.mixture']  # read variable in R workspace and save it into python workspace

# Extract Blue and Orange classes data
classes = np.array(rObject[0])
BLUE = classes[0:100, :]
BLUE = np.concatenate((BLUE, np.zeros(np.size(BLUE, axis=0))[:, None]), axis=1)  # the [:, None] is necessary to make the 1D array 2D
ORANGE = classes[100:200]
ORANGE = np.concatenate((ORANGE, np.ones(np.size(ORANGE, axis=0))[:, None]), axis=1)
trainingSet = np.concatenate((BLUE, ORANGE), axis=0)

## create meshgrid
minBound = -3
maxBound = 4.5
xmesh = np.linspace(minBound, maxBound, 100)
ymesh = np.linspace(minBound, maxBound, 100)
xv, yv = np.meshgrid(xmesh, ymesh)
gridSet = np.stack((xv.ravel(), yv.ravel())).T
gridSet = np.concatenate((gridSet, np.zeros(np.size(gridSet, axis=0))[:, None]), axis=1)

## classify using kNN
k = 1
clf = neighbors.KNeighborsClassifier(k, weights='uniform', algorithm='brute')
clf.fit(trainingSet[:, :-1], trainingSet[:, -1:].ravel())  # learn; ravel necessary to obtain (n,) shape instead of a vector (n, 1)
gridSet[:, 2] = clf.predict(np.c_[xv.ravel(), yv.ravel()])

# Plot
col = np.where(gridSet[:, 2] < 0.5, 'b', 'r').flatten()  # flatten is necessary. 2D arrays are only accepted with RGBA colors
plt.scatter(gridSet[:, 0], gridSet[:, 1], c=col, s=0.2)
col = np.where(trainingSet[:, 2] < 0.5, 'b', 'r').flatten()
plt.scatter(trainingSet[:, 0], trainingSet[:, 1], c=col, s=1.0)
plt.contour(xv, yv, gridSet[:, 2].reshape(xv.shape), 0.5)
plt.savefig('kNN_sciKit.png', dpi=600)
plt.show()

# Stop timer
toc = time.time()
print(toc - tic, 'sec Elapsed')
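One likely source of the gap (an assumption, since the scripts were not profiled here) is the per-point Python loop inside predict. The distances for all query points can be computed at once with broadcasting, which is the kind of vectorization sklearn's C backend does internally:

```python
import numpy as np

rng = np.random.RandomState(0)
train = rng.rand(200, 2)     # stand-in training coordinates
queries = rng.rand(5, 2)     # stand-in query points

# broadcast to all pairwise differences, then reduce over the coordinate axis
diff = queries[:, None, :] - train[None, :, :]   # shape (5, 200, 2)
dists = np.sqrt((diff ** 2).sum(axis=2))         # shape (5, 200)
nearest = dists.argmin(axis=1)                   # index of the 1-NN per query
print(dists.shape, nearest.shape)
```

For k > 1, np.argpartition(dists, k, axis=1) selects the k smallest distances per row without a full sort, replacing the argsort-then-slice pattern in the homemade version.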

Suggest problems of monolithic architecture for one off data analysis
To try and make my situation clear: I am doing a one-off data analysis project in R that will ideally be moderately reproducible. My current architecture is as follows:
 1 file with the 'story' which is essentially calls to functions I wrote
 1 file containing all the functions I wrote, separated into 'story functions' which are called once, and more general 'helper' functions which are called more than once or else are written more generally.
 a data folder for everything that is input / produced
This architecture is appealing to me, as everything is pretty much in these two files, as well as a data folder. What I am trying to avoid is having to search through lots of files to find the relevant bit of code. However, it seems a little bit monolithic, which I know is bad for large software projects.
IMO I am not conducting a large software project. My question:
 Can you point out problems with this architecture for my purposes?

tensorflow for classifying unknown input
I have been working on convolutional neural networks for image classification. I've gotten an assignment where they gave me a text file containing input/output pairs, where the output is either 0 or 1, while the input is a list of generated values. For example:
0 X225006700,X773579,X236398246,X773545,X51769735,X315340932,X44092910,Y2,Y1132,Y2257,Y2793,Y1080,Y1555,Y1222,Y2072,Y1238,Y1791,Y1705,Y2684,Y1725,Y2641,Y2640,Y1690,Y1367,Y1353,Y2949,Y2557,Y1478,Y2024,Y1486,Y1522,Y1456,Y1940,Y977,Y1468,Z4,Z15
0 X123474229,X51578397,X40087170,X236398246,X367227997,X62716661,X127972441,X344420902,X40087738,X103413307,X51769735,X524837224,X37875376,X79805718,X773579,X44092910,Y1353,Y1555,Y2849,Y1478,Y2321,Y1238,Y1486,Y3143,Y1522,Y2817,Y1702,Y1940,Z4,Z15,Z29
1 X62716661,X277692318,Y1367,Y3269,Y1353,Y2949,Y2814,Y2267,Y2257,Y3250,Y3021,Y2557,Y3232,Y1080,Y1555,Y1222,Y2849,Y1478,Y2321,Y3145,Y1486,Y3143,Y1791,Y1522,Y2817,Y1702,Y1456,Y2641,Y2640,Y1940,Y1468,Y2170,Y3585,Z4,Z15,Z27
Now, with normal neural networks I always need (at least as far as I know) a fixed number of inputs in order to create the input layer (e.g. 10x10-pixel images, or cat/dog/bird classes etc.). Is it possible to do the same for these variable-length inputs?
If not, is there any recommended method for solving this problem? (I'm thinking of using MATLAB, since there appear to be X/Y/Z values in the data.)
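One common approach for variable-length token lists like these (a suggestion, not from the question) is to turn each sample into a fixed-length multi-hot vector over the token vocabulary, which any ordinary input layer can accept. The vocabulary below is a small stand-in for the full set of X/Y/Z identifiers:

```python
# shortened stand-in vocabulary built from the question's identifiers
vocab = sorted({"X225006700", "X773579", "Y1132", "Z4", "Z15"})
index = {tok: i for i, tok in enumerate(vocab)}

def multi_hot(tokens):
    # fixed-length vector regardless of how many tokens a sample has
    vec = [0.0] * len(vocab)
    for t in tokens:
        if t in index:
            vec[index[t]] = 1.0
    return vec

sample = multi_hot(["X773579", "Z4"])
print(len(sample), sum(sample))  # 5 2.0
```

Every sample then has the same input dimension (the vocabulary size), so a standard dense input layer works even though the raw token lists vary in length.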