Numpy np.newaxis
saleprice_scaled = \
    StandardScaler().fit_transform(df_train['SalePrice'][:, np.newaxis]);
Can anyone please explain what's happening in this line? Why is newaxis being used here? I know what newaxis does in general, but I can't figure out its use in this particular situation.
Thanks in advance
1 answer

df_train['SalePrice']
is a pandas.Series (a vector, i.e. a 1D array) of shape (N,). Modern (0.17+) scikit-learn methods don't like 1D arrays (vectors); they expect 2D arrays.
df_train['SalePrice'][:,np.newaxis]
transforms the 1D array (shape: (N,)) into a 2D array (shape: (N, 1)).
Demo:
In [21]: df = pd.DataFrame(np.random.randint(10, size=(5, 3)), columns=list('abc'))

In [22]: df
Out[22]:
   a  b  c
0  4  3  8
1  7  5  6
2  1  3  9
3  7  5  7
4  7  0  6

In [23]: from sklearn.preprocessing import StandardScaler

In [24]: df['a'].shape
Out[24]: (5,)            # <-- 1D array

In [25]: df['a'][:, np.newaxis].shape
Out[25]: (5, 1)          # <-- 2D array
There is Pandas way to do the same:
In [26]: df[['a']].shape
Out[26]: (5, 1)          # <-- 2D array

In [27]: StandardScaler().fit_transform(df[['a']])
Out[27]:
array([[-0.5 ],
       [ 0.75],
       [-1.75],
       [ 0.75],
       [ 0.75]])
What happens if we pass a 1D array:
In [28]: StandardScaler().fit_transform(df['a'])
C:\Users\Max\Anaconda4\lib\site-packages\sklearn\utils\validation.py:429: DataConversionWarning: Data with input dtype int32 was converted to float64 by StandardScaler.
  warnings.warn(msg, _DataConversionWarning)
C:\Users\Max\Anaconda4\lib\site-packages\sklearn\preprocessing\data.py:586: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
C:\Users\Max\Anaconda4\lib\site-packages\sklearn\preprocessing\data.py:649: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
Out[28]: array([-0.5 ,  0.75, -1.75,  0.75,  0.75])
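The same (5, 1) column can be produced with reshape(-1, 1), and the standardization itself is easy to check by hand; a minimal sketch using the same data as column 'a' above:

```python
import numpy as np

a = np.array([4, 7, 1, 7, 7])           # 1D, shape (5,), same values as df['a']
col = a.reshape(-1, 1)                  # 2D column, shape (5, 1), like np.newaxis

# Standardize by hand: subtract the mean, divide by the population std.
scaled = (col - col.mean()) / col.std()
print(scaled.ravel())                   # same values as StandardScaler above
```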
See also questions close to this topic

Filter columns by values in a row in Pandas
I have obtained the statistics for my dataframe by df.describe() in Pandas.
statistics = df.describe()
I want to filter the statistics dataframe base on count:
main    Meas1     Meas2  Meas3  Meas4   Meas5
sublvl  Value     Value  Value  Value   Value
count   7.000000  1.0    1.0    582.00  97.000000
mean    30        37.0   26.0   33.03   16.635350
I want to get something like this: filter out all Value columns whose count is less than 30 and keep only the columns with count > 30 in a new dataframe (or give me a list of all main labels that have count > 30).
For the above example, I want:
main    Meas4   Meas5
sublvl  Value   Value
count   582.00  97.000000
mean    33.03   16.635350
and
[Meas4, Meas5]
I have tried
thresh = statistics.columns[statistics['count']>30]
And variations thereof.
Thank you!
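A sketch of one way to get there (not an accepted answer; the toy column names are made up): `count` is a row label of `describe()`'s output, so it has to be selected with `.loc`, which is likely why the `statistics['count']` attempt failed.

```python
import numpy as np
import pandas as pd

# Toy frame: 'Meas4' has 40 valid values, 'Meas1' only one.
df = pd.DataFrame({'Meas1': [1.0] + [np.nan] * 39, 'Meas4': np.arange(40.0)})
statistics = df.describe()

# 'count' is a ROW of describe()'s output, so select it with .loc, not [].
keep = statistics.columns[statistics.loc['count'] > 30]
print(list(keep))                  # column labels whose count exceeds 30
filtered = statistics[keep]        # describe() output restricted to those columns
```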

Groupby loop over, group second dataframe in the same loop
Consider this same small example as followup to a recent post:
sd = {"X": [1, 2, 3, 4, 5], "Y": [6, 7, 8, 9, 10], "Z": [11, 12, 13, 14, 15]}
frame1 = pd.DataFrame(sd, columns=["X", "Y", "Z"], index=["A", "A", "A", "B", "B"])
I want to group "frame1" with
grouped_frame1=frame1.groupby(frame1.index)
Now I want to loop over the groups by:
for name,group in grouped_frame1: ...
However, I have a second dataframe
mean = {"X": [21, 22, 23, 24, 25], "Y": [26, 27, 28, 29, 30], "Z": [31, 32, 33, 34, 35]}
frame2 = pd.DataFrame(mean, columns=["X", "Y", "Z"], index=["A", "A", "A", "B", "B"])
which I also want to split into groups following the groups in frame1, inside the same for loop, since identical groups (size, index) exist. How can I subset frame2 into the matching groups inside the "grouped_frame1" loop?
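Since both frames share an identical index, one sketch (not an accepted answer) is to select the matching rows of frame2 by the group label inside the loop:

```python
import pandas as pd

sd = {"X": [1, 2, 3, 4, 5], "Y": [6, 7, 8, 9, 10], "Z": [11, 12, 13, 14, 15]}
frame1 = pd.DataFrame(sd, index=["A", "A", "A", "B", "B"])
mean = {"X": [21, 22, 23, 24, 25], "Y": [26, 27, 28, 29, 30], "Z": [31, 32, 33, 34, 35]}
frame2 = pd.DataFrame(mean, index=["A", "A", "A", "B", "B"])

sizes = {}
for name, group in frame1.groupby(frame1.index):
    group2 = frame2.loc[name]          # rows of frame2 with the same index label
    sizes[name] = (len(group), len(group2))
print(sizes)
```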

Pandas Row Date Conditional Filter Prior to Groupby - MAXIFS/MINIFS
I am trying to do MAXIFS-style calculations in Pandas.
I am trying to add a column containing the next (if it exists) & last (if it exists) flagged date for each unique ID.
Sample Table: (Trying to get the Next Flag & Last Flag Columns)
Edit: To form a more generic case, what if you wanted to perform another function, e.g. a distinct count, over the period <= to the row's date?
The idea is to be able to apply custom functions that are only applied to a filtered subset where each Id = row ID and Date <= row Date (I have created pandas-compatible row functions but they are way too slow).
Table:
Id  Date    Flag  Next Flag  Last Flag  Flag2  UniqueFlags
1   21-Aug  0     NaN        18-Aug            1
1   20-Aug  0     NaN        18-Aug            1
1   19-Aug  0     NaN        18-Aug            1
1   18-Aug  1     NaN        18-Aug     A      1
1   17-Aug  0     18-Aug     15-Aug            1
1   16-Aug  0     18-Aug     15-Aug            1
1   15-Aug  1     18-Aug     15-Aug     A      1
1   14-Aug  0     15-Aug     NaN               0
1   13-Aug  0     15-Aug     NaN               0
2   21-Aug  0     NaN        19-Aug            2
2   20-Aug  0     NaN        19-Aug            2
2   19-Aug  1     NaN        19-Aug     A      2
2   18-Aug  0     19-Aug     15-Aug            1
2   17-Aug  0     19-Aug     15-Aug            1
2   16-Aug  0     19-Aug     15-Aug            1
2   15-Aug  1     19-Aug     15-Aug     B      1
2   14-Aug  0     15-Aug     NaN               0
2   13-Aug  0     15-Aug     NaN               0
3   21-Aug  0     NaN        17-Aug            1
3   20-Aug  0     NaN        17-Aug            1
3   19-Aug  0     NaN        17-Aug            1
3   18-Aug  0     NaN        17-Aug            1
3   17-Aug  1     NaN        17-Aug     A      1
3   16-Aug  0     17-Aug     NaN               0
3   15-Aug  0     17-Aug     NaN               0
3   14-Aug  0     17-Aug     NaN               0
3   13-Aug  0     17-Aug     NaN               0
I've tried groupby but can't get it to restrict to dates <= the row date while also working per ID.
Thanks
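One hedged sketch for the "Last Flag" / "Next Flag" columns (not an accepted answer; the concrete dates are made up): sort by date within each Id, keep the date only where Flag is set, and carry it forward/backward per group.

```python
import pandas as pd

# One Id's rows, mirroring part of the sample table.
df = pd.DataFrame({
    "Id":   [1, 1, 1, 1, 1],
    "Date": pd.to_datetime(["2017-08-17", "2017-08-16", "2017-08-15",
                            "2017-08-14", "2017-08-13"]),
    "Flag": [0, 0, 1, 0, 0],
})

df = df.sort_values(["Id", "Date"])
flag_date = df["Date"].where(df["Flag"] == 1)   # the date only where flagged

# "Last Flag": most recent flagged date at or before each row, per Id.
df["Last Flag"] = flag_date.groupby(df["Id"]).ffill()
# "Next Flag": next flagged date at or after each row, per Id.
# (The sample table's Next Flag excludes the current row; shifting
#  flag_date by -1 within each group before bfill would match that exactly.)
df["Next Flag"] = flag_date.groupby(df["Id"]).bfill()
print(df)
```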

How can I use a Cython memoryview on a numpy bool array?
I use bool arrays quite often, so I would like to do this in Cython as well. However, I just cannot make the new memoryview interface work on a numpy bool matrix.
here is my test:
def test_oldbuffer_uint8(np.ndarray[np.uint8_t, ndim=2] input):
    cdef size_t i, j
    cdef long total = 0
    cdef size_t J = input.shape[0]
    cdef size_t I = input.shape[1]
    for j in range(J):
        for i in range(I):
            total += input[i, j]
    return total

def test_memview_uint8(np.uint8_t[:, :] input):
    cdef size_t i, j
    cdef long total = 0
    cdef size_t J = input.shape[0]
    cdef size_t I = input.shape[1]
    for j in range(J):
        for i in range(I):
            total += input[i, j]
    return total

def test_oldbuffer_bool(np.ndarray[np.uint8_t, ndim=2, cast=True] input):
    cdef size_t i, j
    cdef long total = 0
    cdef size_t J = input.shape[0]
    cdef size_t I = input.shape[1]
    for j in range(J):
        for i in range(I):
            total += input[i, j]
    return total

from cpython cimport bool

def test_memview_bool(bool[:, :] input):
    cdef size_t i, j
    cdef long total = 0
    cdef size_t J = input.shape[0]
    cdef size_t I = input.shape[1]
    for j in range(J):
        for i in range(I):
            total += input[i, j]
    return total
Then I just pass a random boolean array into each of them:
def test_memview():
    import fuzedtest1
    a = np.random.randn(10000, 10000)
    a = a > 0
    sum = a.sum()
    b = a.astype(np.uint8)
    funcs = [
        (fuzedtest1.test_oldbuffer_uint8, b),
        (fuzedtest1.test_memview_uint8, b),
        (fuzedtest1.test_oldbuffer_bool, a),
        (fuzedtest1.test_memview_bool, a),
    ]
    for _func, _arr in funcs:
        try:
            _sum = _func(_arr)
            t = timeit.timeit(lambda: _func(_arr), number=10)
            print("{} time = {}, res = {} _sum = {}".format(_func, t, _sum, sum))
        except Exception as err:
            print(err)
The result is like this:
<built-in function test_oldbuffer_uint8> time = 1.4905898699998943, res = 50000391 _sum = 50000391
<built-in function test_memview_uint8> time = 1.483763039999758, res = 50000391 _sum = 50000391
<built-in function test_oldbuffer_bool> time = 1.488173633999395, res = 50000391 _sum = 50000391
Does not understand character buffer dtype format string ('?')
How can I do the equivalent of cast=True in the memoryview context?
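Memoryviews have no cast=True, so a common workaround (sketched here in plain NumPy rather than Cython) is to reinterpret the bool buffer as uint8 with `.view()`, which copies nothing; the resulting array is exactly what a `np.uint8_t[:, :]` memoryview function would accept:

```python
import numpy as np

a = np.random.randn(100, 100) > 0   # bool array; each element occupies 1 byte
b = a.view(np.uint8)                # same buffer reinterpreted as uint8, no copy

# A uint8 memoryview function like test_memview_uint8 would accept `b` directly.
assert b.dtype == np.uint8
assert int(b.sum()) == int(a.sum())  # True/False map to 1/0, sums agree
```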

sort numpy array with custom predicate
I'd like to sort my numpy array of shape [n,4], along first dimension (size:n) using a custom predicate operating on the 2nd dimension vector (size:4). The C++ version of what I'd like to do is below, it's quite simple really. I've seen how to do this with python lists, but I can't find the syntax to do it with numpy arrays. Is this possible? The documentation on np.sort, np.argsort, np.lexsort doesn't mention custom predicates.
// C++ version
std::vector<std::array<float, 4>> v = init_v();
std::array<float, 4> p = init_p();
std::sort(v.begin(), v.end(), [&p](const auto& lhs, const auto& rhs) {
    return myfn(p, lhs) > myfn(p, rhs);
});
EDIT: below is the Python code I would like to use for the sorting, i.e. for each 'row' (of size 4) of my array, I'd calculate the square of the euclidean 3D distance (using only the first 3 columns) to a fixed point.
# these both operate on numpy vectors of shape [4] (i.e. a single row of my data matrix)
def dist_sq(a, b):
    d = a[:3] - b[:3]
    return np.dot(d, d)

def sort_pred(lhs, rhs, p):
    return dist_sq(lhs, p) > dist_sq(rhs, p)
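NumPy's sorts indeed take no comparator, but the usual idiom (a sketch, not the thread's accepted answer) is to compute the key once per row and sort indices with np.argsort:

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.random((6, 4))            # n rows of [x, y, z, w]
p = np.array([0.5, 0.5, 0.5])       # the fixed point

# Row-wise squared 3D distance to p, computed once as a sort key.
d = arr[:, :3] - p
key = np.einsum('ij,ij->i', d, d)   # per-row dot(d, d)

# NumPy sorts have no custom predicate; sort indices by the key instead.
# Negating the key gives descending order, like the `>` comparator in C++.
order = np.argsort(-key)
arr_sorted = arr[order]
```

This is also much faster than a comparator would be, since the key is vectorized instead of being called O(n log n) times.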

Cython: how to go from numpy memory view to vector[pair[double,double]] without needing the GIL?
I am trying to convert all Python calls in my Cython code to pure C to be able to release the GIL and do parallelisation.
I used to work with a list of lists-of-size-2 initialized from a 2D numpy array, so I did something like this:
cdef double[:, :, :] init = np.random.uniform(size=(10, 4, 2))
cdef int i
cdef int N = init.shape[0]
for i in range(N):
    a = init[i].tolist()   # I then get this list of lists
    # a = [[1., 1.], [1., 1.], [1., 1.]]
    # f acting on a list of lists
    f(a)
I need to release the GIL inside the loop so I need to remove all the calls to Python. By using vector[pair[double,double]] instead of lists and modifying f accordingly I now have:
cdef vector[pair[double, double]] a
cdef double[:, :, :] init = np.ones((10, 4, 2), dtype=np.float64)
cdef int i
cdef int N = init.shape[0]
for i in prange(N):
    # I need to get a vector[pair[double,double]] from the numpy init[i],
    # with f now cdef, acting on vector[pair[double,double]]
    a = np.asarray(init[i])   # actually works, but it goes through Python!
    f(a)
How can I convert init[i] (thus a double[:, :] type) to a vector[pair[double, double]] without going through Python?

Why batch normalization over channels only in CNN
I am wondering if, in convolutional neural networks, batch normalization should be applied with respect to every pixel separately, or whether I should take the mean of pixels with respect to each channel.
I saw that in the description of Tensorflow's tf.layers.batch_normalization it is suggested to perform bn with respect to the channels, but if I recall correctly, I have used the other approach with good results.
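The difference between the two options is just which axes the statistics are pooled over; a minimal NumPy sketch of both (shapes are made up, NHWC layout assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 3.0, size=(8, 5, 5, 4))    # (batch, height, width, channels)

# Per-channel batch norm: one mean/variance per channel, pooled over the
# batch AND both spatial axes (the channel-wise convention the question mentions).
mean = x.mean(axis=(0, 1, 2), keepdims=True)   # shape (1, 1, 1, 4)
var = x.var(axis=(0, 1, 2), keepdims=True)
x_norm = (x - mean) / np.sqrt(var + 1e-5)

# The "per pixel" alternative would pool over the batch axis only:
mean_px = x.mean(axis=0, keepdims=True)        # shape (1, 5, 5, 4)
```

Per-channel pooling is the convention that matches how a convolution shares its weights across spatial positions, which is the usual argument for it.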

MNIST dataset: created neural network with numpy, now can't fix error about broadcasting
Here is my code:
import numpy as np
import random

class Network:
    # layers, biases, weights
    def __init__(self, size):
        self.nr_layers = len(size)
        self.size = size
        self.bias = [np.random.rand(y, 1) for y in size[1:]]
        self.weights = [np.random.randn(x, y) for x, y in zip(size[1:], size[:-1])]

    def feedfoward(self, a):
        # a is activation of last layer (or input)
        for b, w in zip(self.bias, self.weights):
            a = sigmoid(np.dot(w, a) + b)
        return(a)

    def SGD(self, training_data, test_data, nr_epoch, mini_batch_size, learning_rate):
        test_data = list(test_data)
        training_data = list(training_data)
        n_test_data = len(test_data)
        n_training_data = len(training_data)
        # build mini batches
        for i in range(nr_epoch):
            random.shuffle(training_data)
            mini_batches = [training_data[j:j + mini_batch_size]
                            for j in range(0, n_training_data, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, learning_rate)
            print("Epoch {} : {} / {}".format(i, self.evaluate(test_data), n_test_data))

    def update_mini_batch(self, mini_batch, learning_rate):
        bias_gradient = [np.zeros(b.shape) for b in self.bias]
        weights_gradient = [np.zeros(w.shape) for w in self.weights]
        # summing up gradients for weights and biases (calculate each gradient with backprop)
        for x, y in mini_batch:
            delta_b, delta_w = self.backprop(x, y)
            bias_gradient = [b + db for b, db in zip(bias_gradient, delta_b)]
            weights_gradient = [w + db for w, db in zip(weights_gradient, delta_w)]
        # now we update original weights and biases with gradient descent formula
        self.bias = [b - (learning_rate / len(mini_batch)) * change
                     for b, change in zip(self.bias, bias_gradient)]
        self.weights = [w - (learning_rate / len(mini_batch)) * change
                        for w, change in zip(self.weights, weights_gradient)]

    def backprop(self, x, y):
        bias_gradient = [np.zeros(bias.shape) for bias in self.bias]
        weights_gradient = [np.zeros(weights.shape) for weights in self.weights]
        activation = x
        activations = [x]
        # zs are weighted inputs
        zs = []
        # FEEDFOWARD
        for b, w in zip(self.bias, self.weights):
            z = np.dot(w, activation) + b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # BACKWARD PASS
        # first last layer (backprop formula #1), then we assign BP3 and BP4
        delta = self.last_layer_cost(activations[-1], y) * sigmoid_derivative(zs[-1])
        bias_gradient = delta
        weights_gradient = np.dot(delta, activations[-2].transpose())
        # now we apply BP formula #2 to all other (l >= 2) layers, then we assign BP3 and BP4
        # first layer in this loop is last layer before output (a^L)
        for l in range(2, self.nr_layers):
            delta = np.dot(self.weights[-l + 1].transpose(), delta) * sigmoid_derivative(zs[-l])
            bias_gradient = delta
            weights_gradient = np.dot(delta, activations[-l - 1].transpose())
        return weights_gradient, weights_gradient

    def last_layer_cost(self, last_layer_activation, y):
        return(last_layer_activation - y)

    def evaluation(self, test_data):
        test_result = [(np.argmax(self.feedfoward(x), y)) for x, y in test_data]
        return sum(int(x == y) for x, y in test_result)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    return sigmoid(z) * (1 - sigmoid(z))

import pickle
import gzip

# Next part is copied from solutions
import numpy as np

def load_data():
    """Return the MNIST data as a tuple containing the training data,
    the validation data, and the test data.

    The ``training_data`` is returned as a tuple with two entries.
    The first entry contains the actual training images. This is a
    numpy ndarray with 50,000 entries. Each entry is, in turn, a
    numpy ndarray with 784 values, representing the 28 * 28 = 784
    pixels in a single MNIST image.

    The second entry in the ``training_data`` tuple is a numpy ndarray
    containing 50,000 entries. Those entries are just the digit values
    (0...9) for the corresponding images contained in the first entry
    of the tuple.

    The ``validation_data`` and ``test_data`` are similar, except each
    contains only 10,000 images.

    This is a nice data format, but for use in neural networks it's
    helpful to modify the format of the ``training_data`` a little.
    That's done in the wrapper function ``load_data_wrapper()``, see
    below.
    """
    f = gzip.open('mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = pickle.load(f, encoding="latin1")
    f.close()
    return (training_data, validation_data, test_data)

def load_data_wrapper():
    """Return a tuple containing ``(training_data, validation_data,
    test_data)``. Based on ``load_data``, but the format is more
    convenient for use in our implementation of neural networks.

    In particular, ``training_data`` is a list containing 50,000
    2-tuples ``(x, y)``. ``x`` is a 784-dimensional numpy.ndarray
    containing the input image. ``y`` is a 10-dimensional
    numpy.ndarray representing the unit vector corresponding to the
    correct digit for ``x``.

    ``validation_data`` and ``test_data`` are lists containing 10,000
    2-tuples ``(x, y)``. In each case, ``x`` is a 784-dimensional
    numpy.ndarray containing the input image, and ``y`` is the
    corresponding classification, i.e., the digit values (integers)
    corresponding to ``x``.

    Obviously, this means we're using slightly different formats for
    the training data and the validation / test data. These formats
    turn out to be the most convenient for use in our neural network
    code."""
    tr_d, va_d, te_d = load_data()
    training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
    training_results = [vectorized_result(y) for y in tr_d[1]]
    training_data = zip(training_inputs, training_results)
    validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
    validation_data = zip(validation_inputs, va_d[1])
    test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
    test_data = zip(test_inputs, te_d[1])
    return (training_data, validation_data, test_data)

def vectorized_result(j):
    """Return a 10-dimensional unit vector with a 1.0 in the j-th
    position and zeroes elsewhere. This is used to convert a digit
    (0...9) into a corresponding desired output from the neural
    network."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

################################################################################

training_data, validation_data, test_data = load_data_wrapper()
net = Network([784, 30, 10])
net.SGD(training_data, test_data, 30, 10, 3.0)
And these are the solutions. The part that is copied from the solutions is in the file
mnist_loader.py
. Here is my error:
Traceback (most recent call last):
  File "C:/Users/PycharmProjects/MachineLearning/ex.py", line 157, in <module>
    net.SGD(training_data, test_data, 30, 10, 3.0)
  File "C:/Users/PycharmProjects/MachineLearning/ex.py", line 29, in SGD
    self.update_mini_batch(mini_batch, learning_rate)
  File "C:/Users/PycharmProjects/MachineLearning/ex.py", line 39, in update_mini_batch
    weights_gradient = [w + db for w, db in zip(weights_gradient, delta_w)]
  File "C:/Users/PycharmProjects/MachineLearning/ex.py", line 39, in <listcomp>
    weights_gradient = [w + db for w, db in zip(weights_gradient, delta_w)]
ValueError: operands could not be broadcast together with shapes (10,30) (784,)
I am a beginner in DL and have only known Python and numpy for 2-3 months, but I do know what broadcasting is... I still can't fix this bug anyway, so can anyone please take a look at this and suggest how to fix it? What's most confusing to me is that this line is identical to the solutions (which work, I tried).
Oh, and a short terminology remark: nabla_b and nabla_w in the solutions are bias_gradient and weights_gradient in my version.
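A hedged observation rather than a verified fix: in the backprop above, the gradient *lists* get overwritten with single arrays (`bias_gradient = delta`) instead of having one slot filled per layer, so the later `zip(weights_gradient, delta_w)` pairs up mismatched shapes. In Nielsen-style solutions those assignments are indexed; a minimal shape sketch with made-up toy layer sizes:

```python
import numpy as np

size = [4, 3, 2]                     # toy layer sizes, stand-ins for [784, 30, 10]
bias = [np.random.randn(y, 1) for y in size[1:]]
weights = [np.random.randn(y, x) for x, y in zip(size[:-1], size[1:])]

# One gradient container per layer; each layer is filled BY INDEX so that
# zip(weights_gradient, delta_w) in update_mini_batch stays aligned.
bias_gradient = [np.zeros(b.shape) for b in bias]
weights_gradient = [np.zeros(w.shape) for w in weights]

delta = np.random.randn(2, 1)        # stand-in for the output-layer error
activation_prev = np.random.randn(3, 1)
bias_gradient[-1] = delta                               # not: bias_gradient = delta
weights_gradient[-1] = np.dot(delta, activation_prev.T)

print([w.shape for w in weights_gradient])   # one entry per layer, matching weights
```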

Pattern Recognition in Datasets without Visualisation for Data Analysis
Using machine learning, how can a machine recognise patterns in data without using data visualisation, so that it finds the patterns on its own and I can use them for further analysis without needing to inspect visualisations myself?
Patterns such as: the pattern of my sales in different months, years or weeks; the pattern of the attendance of a particular student in school; the pattern of the websites being viewed each month, year, week...
So such patterns need to be identified by the machine (via unsupervised learning, I guess) and without using graphs, charts or any kind of visualisation.
Can you tell me if that's doable? If yes, then how?

Acceleration of k-means in the scikit-learn library
I have a problem: my code takes a long time to execute. I am using Python for coding and scikit-learn as the machine learning library. My problem is that the k-means computation takes a long time to finish (basically we have about 3000 data points to group into 400 clusters). This method repeats about 250 times. As for timing, it takes about 40 mins to finish. Any suggestion on how to accelerate it? Thanks in advance.
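One common direction (a sketch, not a tested answer for this exact workload; the data here is random): MiniBatchKMeans updates the centers from small random batches instead of the full dataset on every iteration, which is usually much faster at a small cost in accuracy, and lowering n_init cuts runtime linearly.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
X = rng.random((3000, 8))            # stand-in for the real 3000 data points

# Mini-batch k-means: center updates use batches of 1024 points, and only
# 3 random restarts instead of the default, trading accuracy for speed.
km = MiniBatchKMeans(n_clusters=400, n_init=3, batch_size=1024, random_state=0)
labels = km.fit_predict(X)
print(km.cluster_centers_.shape)
```

If the 250 repetitions are independent, running them in parallel (e.g. with joblib) is another orthogonal speedup.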

ValueError: Unknown label type: array in scikit-learn
I was trying to use scikit-learn to train and test my dataset. But first of all, here's my dataset (I only show the top 4 of 800 rows):
Full,Id,Id & PPDB,Id & Words Sequence,Id & Synonyms,Id & Hypernyms,Id & Hyponyms,Gold Standard
1.667,0.476,0.952,0.476,1.429,0.952,0.476,2.345
3.056,1.111,1.667,1.111,3.056,1.389,1.111,1.9
1.765,1.176,1.176,1.176,1.765,1.176,1.176,2.2
0.714,0.714,0.714,0.714,0.714,0.714,0.714,0.0
And then I divided the columns into features and a label, where the label is
"Gold Standard"
. Here's my code:

import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

dataset = pd.read_csv("datasupervised.csv")
columns = ["Full", "Id", "Id & PPDB", "Id & Words Sequence",
           "Id & Synonyms", "Id & Hypernyms", "Id & Hyponyms"]
label = dataset["Gold Standard"].values
features = dataset[list(columns)].values

X = features
y = label
print(X.shape)
print(y.shape)
print
dframe = pd.DataFrame(X, y).head(800)
print dframe
print

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=4)
per = MLPClassifier()
print(per.fit(X_train, y_train))
But it said :
File "C:\Python27\lib\site-packages\sklearn\utils\multiclass.py", line 98, in unique_labels
    raise ValueError("Unknown label type: %s" % repr(ys))
ValueError: Unknown label type: (array([ 5.   ,  5.   ,  2.682,  3.375,  5.   ,  2.2  ,  3.125,  1.5  ,
I don't understand why it says so. Can anyone explain? Thanks.
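A hedged reading of the error (only part of the traceback is shown): MLPClassifier expects discrete class labels, but "Gold Standard" holds continuous scores like 2.345, which scikit-learn's target check rejects as an unknown (continuous) label type; a regressor such as MLPRegressor fits that kind of target. A numpy-only sketch of the distinction the check makes:

```python
import numpy as np

def is_classification_target(y):
    """Rough sketch of the idea behind the check: classification targets
    hold discrete values; fractional floats indicate a regression target."""
    y = np.asarray(y)
    if y.dtype.kind == 'f' and np.any(y != y.astype(int)):
        return False          # fractional values -> continuous -> regression
    return True

print(is_classification_target([0, 1, 1, 0]))            # class labels
print(is_classification_target([2.345, 1.9, 2.2, 0.0]))  # continuous scores
```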

How do we penalize a prediction error of Random Forest Regressor in Python?
I am trying to fit a model using a
RandomForestRegressor
for the dataset found in this link:

import pandas as pd
import math
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score, mean_squared_error, make_scorer
from sklearn.model_selection import train_test_split
from math import sqrt
from sklearn.cross_validation import train_test_split

n_features = 1000

df = pd.read_csv('cubic32.csv')

for i in range(1, n_features):
    df['X_t' + str(i)] = df['X'].shift(i)

print(df)

df.dropna(inplace=True)

X = df.drop('Y', axis=1)
y = df['Y']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)
X_train = X_train.drop('time', axis=1)
X_test = X_test.drop('time', axis=1)

parameters = {'n_estimators': [10]}
clf_rf = RandomForestRegressor(random_state=1)
clf = GridSearchCV(clf_rf, parameters, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
model = clf.fit(X_train, y_train)
model.cv_results_['params'][model.best_index_]
math.sqrt(model.best_score_ * -1)
model.grid_scores_

#####
print()
print(model.grid_scores_)
print("The best score: ", model.best_score_)
print("RMSE:", math.sqrt(model.best_score_ * -1))

clf_rf.fit(X_train, y_train)
modelPrediction = clf_rf.predict(X_test)
print(modelPrediction)
print("Number of predictions:", len(modelPrediction))

meanSquaredError = mean_squared_error(y_test, modelPrediction)
print("Mean Square Error (MSE):", meanSquaredError)
rootMeanSquaredError = sqrt(meanSquaredError)
print("Root-Mean-Square Error (RMSE):", rootMeanSquaredError)

fig, ax = plt.subplots()
index_values = range(0, len(y_test))

y_test.sort_index(inplace=True)
X_test.sort_index(inplace=True)

modelPred_test = clf_rf.predict(X_test)
ax.plot(pd.Series(index_values), y_test.values)

Plot_In_One = pd.DataFrame(pd.concat([pd.Series(modelPred_test),
                                      pd.Series(y_test.values)], axis=1))

plt.figure()
Plot_In_One.plot()
plt.legend(loc='best')
However, the plot of the predicted values (the blue line, shown below) seems to be very coarse. The orange line is the actual value.
Is there any way that we can penalize the prediction error so that we get a plot closer to the actual values? If we plot the actual value alone, it looks like the following.
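One hedged direction (not from the original thread): random forests minimize plain squared error per leaf, but you can at least *score* models with a metric that penalizes large deviations more heavily and hand it to GridSearchCV via make_scorer. A numpy sketch of such a metric (the exponent 4 is an arbitrary choice):

```python
import numpy as np

def heavy_tail_error(y_true, y_pred):
    """Like MSE, but raises residuals to the 4th power so that
    large prediction errors dominate the score."""
    r = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(r ** 4)

y_true = np.array([1.0, 2.0, 3.0])
print(heavy_tail_error(y_true, np.array([1.0, 2.0, 5.0])))  # one big miss
print(heavy_tail_error(y_true, np.array([1.5, 2.5, 3.5])))  # several small misses
```

With `make_scorer(heavy_tail_error, greater_is_better=False)` this could replace `'neg_mean_squared_error'` in the grid search, steering model selection toward fits with fewer large misses.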

How can I access Facebook data using an API such as Facepager?
I have to access Facebook data. How do I access it using the API? I used
Facepager
to fetch the data, but it doesn't fetch anything. Why? Is there any other way to get Facebook data?

How can I collect a particular type of data from all of the different existing websites
Suppose I want to collect the weather information of every city.
I don't want to collect it manually, or even from one particular weather API. I want it from all kinds of existing websites: I want to collect that data and then process it according to my requirements. I want to create a bot that can dynamically read data from existing websites.
I am a beginner; please advise me on how I can do this job, and also tell me whether it is possible or not.