How to implement t-SNE in a model?
I split my data into train/test sets. When I use PCA it is straightforward:
from sklearn.decomposition import PCA
pca = PCA()
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
From here I can use X_train_pca and X_test_pca in the next step, and so on.
But when I use t-SNE:
from sklearn.manifold import TSNE
X_train_tsne = TSNE(n_components=2, random_state=0).fit_transform(X_train)
I can't seem to transform the test set, so I can't use the t-SNE output in the next step, e.g. an SVM.
Any help?
2 answers

I believe that what you're trying to do is impossible.
t-SNE computes a projection that tries to preserve pairwise distances between the samples that you fit. So you cannot use a t-SNE model to project new data without refitting.
On the other hand, I would not feed the output of t-SNE into a classifier. Mainly because t-SNE is highly non-linear and somewhat random, so you can get very different outputs across runs and with different values of perplexity.
See this explanation of t-SNE.
However, if you really wish to use t-SNE for this purpose, you'll have to fit your t-SNE model on the whole data, and once it is fitted, make your train and test splits.
import numpy as np
from sklearn.manifold import TSNE

size_train = X_train.shape[0]
X = np.vstack((X_train, X_test))
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)
X_train_tsne = X_tsne[0:size_train, :]
X_test_tsne = X_tsne[size_train:, :]

According to the documentation, TSNE is a tool to visualize high-dimensional data. A bit lower in the description we can find: "it is highly recommended to use another dimensionality reduction method (e.g. PCA for dense data or TruncatedSVD for sparse data) to reduce the number of dimensions."
My suggestion would be to use TSNE for visualisation and PCA or TruncatedSVD as part of the machine learning model.
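To make that suggestion concrete, here is a minimal sketch with synthetic data standing in for X_train/X_test: PCA feeds the classifier because it has a transform() for unseen data, while t-SNE is fit only to visualise the training set.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_train, y_train = rng.randn(80, 10), rng.randint(0, 2, 80)
X_test = rng.randn(20, 10)

# PCA supports transform(), so it can be applied to unseen test data.
pca = PCA(n_components=5).fit(X_train)
clf = SVC().fit(pca.transform(X_train), y_train)
pred = clf.predict(pca.transform(X_test))

# t-SNE has no transform(); use its embedding only to visualise the train set.
X_vis = TSNE(n_components=2, random_state=0).fit_transform(X_train)
print(pred.shape, X_vis.shape)  # (20,) (80, 2)
```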
See also questions close to this topic

DataFrame: each column in a different plot in subplots
I have a pandas DataFrame and I want each column to be represented on its own subplot (2 dimensions).
I know the default subplots output of pandas is the desired output, but 1-dimensional:
pallet        45   46   47   48   49   50
date
2019-04-15   4.0  NaN  2.0  NaN  NaN  2.0
2019-04-16   3.0  2.0  2.0  2.0  1.0  1.0
2019-04-17   2.0  2.0  2.0  2.0  1.0  1.0
2019-04-18   2.0  2.0  2.0  NaN  1.0  1.0
2019-04-19   2.0  2.0  2.0  NaN  1.0  1.0
2019-04-20   2.0  2.0  2.0  NaN  1.0  NaN
pivot.plot(subplots=True)
plt.show()
Output: https://imgur.com/E61XREF.jpg
I want to output each column, but in 2-dimensional subplots with a common X and Y axis. The number of columns is dynamic, so I want to put, say, 6 columns on each figure; if the number of pallets > 6, open a new same-shaped figure.
So I want it to look like this: https://imgur.com/8GxWEah but with a common X and Y axis.
Thank you!
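One way to get this is to lay the columns out on a 2-D grid of axes with shared X and Y, six panels per figure. A sketch; the `pivot` DataFrame here is a synthetic stand-in for the one in the question, and the output file names are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # write files instead of opening windows
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# synthetic stand-in for the pivoted DataFrame in the question
pivot = pd.DataFrame(np.random.rand(6, 8),
                     index=pd.date_range("2019-04-15", periods=6),
                     columns=range(45, 53))

per_fig = 6  # six pallets per figure, as requested
for start in range(0, len(pivot.columns), per_fig):
    cols = pivot.columns[start:start + per_fig]
    # 2x3 grid with X and Y axes shared across all panels
    fig, axes = plt.subplots(2, 3, sharex=True, sharey=True, figsize=(10, 6))
    for ax, c in zip(axes.ravel(), cols):
        ax.plot(pivot.index, pivot[c])
        ax.set_title(str(c))
    for ax in axes.ravel()[len(cols):]:  # hide unused panels
        ax.set_visible(False)
    fig.autofmt_xdate()
    fig.savefig(f"pallets_{start}.png")
```

With 8 columns this writes two figures: the first with 6 panels, the second with 2 and the spare axes hidden.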

How to remove automatically added backticks while using explode() in pyspark?
I want to add a new column with some expression as defined here (https://www.mien.in/2018/03/25/reshapingdataframeusingpivotandmeltinapachesparkandpandas/#pivotinspark). While doing so, my explode() function changes the column names to be looked up by adding backticks (`) at the beginning and end of each column name, which then gives the error:
Cannot resolve column name `Column_name` from [Column_name, Column_name2]
I tried reading the documentation and few other questions on SO but they don't address this issue.
I tried logging the different steps, in order to give the reader some clarity.
The error is at the line:
_tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))
The output of explode(...) is available here (https://pastebin.com/LU9p53th). The function snippet is:
def melt_df(
        df: DataFrame,
        id_vars: Iterable[str],
        value_vars: Iterable[str],
        var_name: str = "variable",
        value_name: str = "value") -> DataFrame:
    """Convert :class:`DataFrame` from wide to long format."""
    print("Value name is {} and value vars is {}".format(
        value_name, value_vars))
    # df2 = df2.select([col(k).alias(actual_cols[k]) for k in keys_de_cols])
    # Create array<struct<variable: str, value: ...>>
    _vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars))
    print("Explode: ")
    print(explode(_vars_and_vals))
    # Add to the DataFrame and explode
    _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))
    print("_tmp:")
    print(_tmp)
    sys.exit()
    cols = id_vars + [
        col("_vars_and_vals")[x].alias(x)
        for x in [var_name, value_name]]
    return _tmp.select(*cols)
Whereas the whole code is:
import sys
from datetime import datetime
from itertools import chain
from typing import Iterable

from pyspark.context import SparkContext
from pyspark.sql import (DataFrame, DataFrameReader, DataFrameWriter, Row,
                         SparkSession)
from pyspark.sql.functions import *
from pyspark.sql.functions import array, col, explode, lit, struct
from pyspark.sql.types import *

spark = SparkSession.builder.appName('navydish').getOrCreate()
last_correct_constant = 11
output_file = "april19_1.csv"
input_file_name = "input_for_aviral.csv"

def melt_df(
        df: DataFrame,
        id_vars: Iterable[str],
        value_vars: Iterable[str],
        var_name: str = "variable",
        value_name: str = "value") -> DataFrame:
    """Convert :class:`DataFrame` from wide to long format."""
    print("Value name is {} and value vars is {}".format(
        value_name, value_vars))
    # df2 = df2.select([col(k).alias(actual_cols[k]) for k in keys_de_cols])
    # Create array<struct<variable: str, value: ...>>
    _vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars))
    print("Explode: ")
    print(explode(_vars_and_vals))
    # Add to the DataFrame and explode
    _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))
    print("_tmp:")
    print(_tmp)
    sys.exit()
    cols = id_vars + [
        col("_vars_and_vals")[x].alias(x)
        for x in [var_name, value_name]]
    return _tmp.select(*cols)

def getrows(df, rownums=None):
    return df.rdd.zipWithIndex().filter(
        lambda x: x[1] in rownums).map(lambda x: x[0])

df = spark.read.csv(
    input_file_name,
    header=True
)
df2 = df
for _col in df.columns:
    if _col.startswith("_c"):
        df = df.drop(_col)
        if int(_col.split("_c")[1]) > last_correct_constant:
            df2 = df2.drop(_col)
    else:
        # removes the reqd cols, keeps the messed up ones only.
        df2 = df2.drop(_col)

actual_cols = getrows(df2, rownums=[0]).collect()[0].asDict()
keys_de_cols = actual_cols.keys()
# df2 = df2.select([col(x).alias("right_" + str(x)) for x in right_cols])
df2 = df2.select([col(k).alias(actual_cols[k]) for k in keys_de_cols])

periods = []
periods_cols = getrows(df, rownums=[0]).collect()[0].asDict()
for k, v in periods_cols.items():
    if v not in periods:
        periods.append(v)
# periods = list(set(periods))

expected_columns_from_df = [
    'Value Offtake(000 Rs.)',
    'Sales Volume (Volume(LITRES))'
]
for _col in df.columns:
    if _col.startswith('Value Offtake(000 Rs.)') or _col.startswith('Sales Volume (Volume(LITRES))'):
        continue
    df = df.drop(_col)

df2 = df2.withColumn("id", monotonically_increasing_id())
df = df.withColumn("id", monotonically_increasing_id())
df = df2.join(df, "id", "inner").drop("id")

print("After merge, cols of final dataframe are: ")
for _col in df.columns:
    print(_col)

# creating a list of all constant columns
id_vars = []
for i in range(len(df.columns)):
    if i < 12:
        id_vars.append(df.columns[i])

# creating a list of Values from expected columns
value_vars = []
for _col in df.columns:
    if _col.startswith(expected_columns_from_df[0]):
        value_vars.append(_col)
value_vars = id_vars + value_vars
print("Sending this value vars to melt:")
print(value_vars)

# the name of the column in the resulting DataFrame, Value Offtake(000 Rs.)
var_name = expected_columns_from_df[0]
# final value for which we want to melt, Periods
value_name = "Periods"

df = melt_df(
    df, id_vars, value_vars, var_name, value_name
)
print("The final headers of the resultant dataframe are: ")
print(df.columns)
The whole error is here (https://pastebin.com/9cUupTy3).
I understand one would need the data, but if someone could clarify how explode() works such that the extra unwanted backticks (`) can be avoided, I can work from there.

How to solve "ValueError: Shapes must be equal rank" when I use a customized env and use baseline to do DQN?
I use a Gym environment produced by others, which can be found at gym-gomoku. When I use baselines to try to train a model, an error occurs:
ValueError: Shapes must be equal rank, but are 1 and 2 for 'deepq/Select' (op: 'Select') with input shapes: [?], [?], [?,361].
I think there is something wrong with the environment, but I can't pin it down, because training works when I test other Gym environments such as 'CartPole-v0'.
Thanks a lot!
here is my code:
import gym
from baselines import deepq

def callback(lcl, _glb):
    # stop training if reward exceeds 199
    is_solved = lcl['t'] > 0.9 and sum(lcl['episode_rewards'][-101:-1]) / 100 >= 0.9
    return is_solved

def main():
    env = gym.make("Gomoku19x19-v0")
    model = deepq.models.mlp([32, 16], layer_norm=True)
    act = deepq.learn(
        env,
        q_func=model,
        lr=0.01,
        max_timesteps=10000,
        print_freq=1,
        checkpoint_freq=1000
    )
    print("Saving model to Gomoku9x9.pkl")
    act.save("Gomoku9x9.pkl")
    print('Finish!')

if __name__ == '__main__':
    main()

How to fix 'Getting 2D array from XGBoost instead of 1D array after prediction of probabilities' in python?
I'm getting a 2D array when I predict probabilities with XGBoost on test data. It should be a 1D array. The shapes of the training, validation and test data columns are the same. The expected output is a 1D array of probabilities. How do I fix it?
[1]: https://drive.google.com/open?id=1h5MaSa9ojKYfS67_JhfTRiXFfsMfpr
I have tried checking the shape of the data and it all looks fine, but I still cannot understand why it gives a 2D array.
Fitting XGBoost on train data
xgb_model.fit(X_train, y_train)
Predicting probabilities on test data
pred_xgb_test = xgb_model.predict_proba(test)
Checking predictions (getting 2D array here instead of 1D)
pred_xgb_test
Expected result is a 1D array of prediction probabilities. But instead getting a 2D array.
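For context, predict_proba in the scikit-learn API (which XGBoost's sklearn wrapper follows) returns one column per class, shape (n_samples, n_classes), so a 2D result is expected; slicing column 1 gives the usual 1D positive-class probabilities. A sketch with scikit-learn's GradientBoostingClassifier standing in for XGBoost:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X, y)

proba = clf.predict_proba(X)   # shape (200, 2): columns are [P(y=0), P(y=1)]
pos_proba = proba[:, 1]        # 1-D array of positive-class probabilities
print(proba.shape, pos_proba.shape)  # (200, 2) (200,)
```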

How to integrate machine learning micro services to rails application?
I have a Rails application and this application needs to process images. For this image processing I would like to use Python OpenCV (machine learning). In this case, how can I integrate the machine-learning app into the Rails application? I have already googled it and didn't find the right solution.
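One common pattern is to run the Python/OpenCV code as a small HTTP microservice and call it from Rails (e.g. with Net::HTTP or Faraday). A minimal stdlib-only sketch of the Python side; the /process endpoint and the response field are illustrative assumptions, and the handler just echoes the payload size where the OpenCV call would go:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

class ImageHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        image_bytes = self.rfile.read(length)
        # A real service would decode with cv2.imdecode and run the model;
        # here we just report the payload size.
        body = json.dumps({"bytes_received": len(image_bytes)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), ImageHandler)  # 0 = pick a free port
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# This is the request a Rails controller would make over HTTP.
resp = urlopen(f"http://127.0.0.1:{port}/process", data=b"fake-image-bytes")
result = json.loads(resp.read())
print(result)  # {'bytes_received': 16}
server.shutdown()
```

The alternative is shelling out from Rails to a Python script, but a service keeps the model loaded between requests.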

Keras model giving TypeError: only size-1 arrays can be converted to Python scalars
I'm training a model to produce image masks. This error keeps popping up, and I cannot determine the cause. Help would be appreciated.
Error statement:
File "\Users\\Anaconda3\lib\site-packages\keras\initializers.py", line 209, in __call__
    scale /= max(1., float(fan_in + fan_out) / 2)
TypeError: only size-1 arrays can be converted to Python scalars
Researching online, this error occurs when plain lists are used with numpy functions; but in my case, the arrays used are numpy arrays. Below, I've attached the code.
import cv2
import glob
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.utils import np_utils
from keras.datasets import mnist
from keras import backend as K

K.set_image_dim_ordering('tf')
np.random.seed(123)  # for reproducibility

image_list = []
test_list = []
for filename in glob.glob("image/*.jpg*"):
    im = cv2.imread(filename)
    im_r = cv2.resize(im, (200, 200), interpolation=cv2.INTER_AREA)
    image_list.append(im_r)
for filename in glob.glob("test/*.png*"):
    im = cv2.imread(filename)
    im_r = cv2.resize(im, (200, 200), interpolation=cv2.INTER_AREA)
    im_r = np.ravel(im_r)
    test_list.append(im_r)

x_data = np.array(image_list)
y_data = np.array(test_list)
x_data = x_data.astype("float32")
y_data = y_data.astype("float32")
x_data /= 255
y_data /= 255
X_train = x_data
Y_train = y_data

model = Sequential()
model.add(Convolution2D(32, 5, 5, activation='relu', input_shape=(200, 200, 3)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Convolution2D(32, 5, 5, activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Convolution2D(32, 3, 3, activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
# note: Dense expects an integer unit count; Y_train[0] is an array,
# which is likely the source of the TypeError (should be e.g. Y_train.shape[1])
model.add(Dense(Y_train[0], activation='sigmoid'))
print('hello')
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

loss = acc = 0
while acc < 0.9999:
    model.fit(X_train, Y_train, batch_size=32, nb_epoch=10, verbose=1)
    loss, acc = model.evaluate(X_train, Y_train, verbose=1)
model.save("model_state_no_mapping")
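For reference, the TypeError in the traceback is exactly what NumPy raises when float() is applied to an array with more than one element, which is consistent with a non-scalar value reaching a layer-size argument somewhere in the model definition:

```python
import numpy as np

float(np.array([3.0]))  # a size-1 array converts to a Python scalar fine
try:
    float(np.array([1.0, 2.0]))  # more than one element: TypeError
except TypeError as e:
    msg = str(e)
print(msg)
```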

ValueError: shapes (1,4) and (5,4) not aligned: 4 (dim 1) != 5 (dim 0), when adding my variables to my prediction machine
I am creating a prediction machine with four variables. When I add the variables it all messes up and gives me:
ValueError: shapes (1,4) and (5,4) not aligned: 4 (dim 1) != 5 (dim 0)
code
import pandas as pd
from pandas import DataFrame
from sklearn import linear_model
import tkinter as tk
import statsmodels.api as sm
# Approach 1: Import the data into Python
Stock_Market = pd.read_csv(r'Training_Nis_New2.csv')
df = DataFrame(Stock_Market, columns=['Month 1', 'Month 2', 'Month 3', 'Month 4', 'Month 5', 'Month 6', 'Month 7', 'Month 8', 'Month 9', 'Month 10', 'Month 11', 'Month 12', 'FSUTX', 'MMUKX', 'FUFRX', 'RYUIX', 'Interest R', 'Housing Sale', 'Unemployement Rate', 'Conus Average Temperature Rank', '30FSUTX', '30MMUKX', '30FUFRX', '30RYUIX'])

# here we have several variables for multiple regression; for simple linear
# regression use a single column, e.g. X = df['Interest R']
X = df[['Month 1', 'Interest R', 'Housing Sale', 'Unemployement Rate',
        'Conus Average Temperature Rank']]
Y = df[['30FSUTX', '30MMUKX', '30FUFRX', '30RYUIX']]

# with sklearn
regr = linear_model.LinearRegression()
regr.fit(X, Y)
print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)

# prediction with sklearn
HS = 5.5
UR = 6.7
CATR = 8.9
New_Interest_R = 4.6
# note: X was fit on five features ('Month 1' included), but only four
# values are passed here, hence shapes (1,4) and (5,4) not aligned
print('Predicted Stock Index Price: \n',
      regr.predict([[UR, HS, CATR, New_Interest_R]]))

# with statsmodels
X = df[['Month 1', 'Interest R', 'Housing Sale', 'Unemployement Rate',
        'Conus Average Temperature Rank']]
Y = df['30FSUTX']
print('\n\n*** Fund = FSUTX')
X = sm.add_constant(X)  # adding a constant
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
print_model = model.summary()
print(print_model)
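The model above is fit on five features ('Month 1' plus the four others), while the predict call passes only four values. A sketch with synthetic data (the variable names are illustrative) showing the aligned call:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(50, 5)  # five features, as in the question's X
Y = rng.rand(50, 4)  # four target funds
regr = LinearRegression().fit(X, Y)

# All five feature values must be supplied, in the same order as in X.
month1, UR, HS, CATR, new_interest_r = 1.0, 6.7, 5.5, 8.9, 4.6
pred = regr.predict([[month1, UR, HS, CATR, new_interest_r]])
print(pred.shape)  # (1, 4): one row of predictions, one per target
```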

having ambiguity using customized kernel for `sklearn.svm` regressor
I want to use a customized kernel function in the Epsilon-Support Vector Regression module of sklearn.svm. I found this code as an example of a customized kernel for SVC in the scikit-learn documentation:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features. We could
                      # avoid this ugly slicing by using a two-dim dataset
Y = iris.target

def my_kernel(X, Y):
    """
    We create a custom kernel:

                 (2  0)
    k(X, Y) = X  (    ) Y.T
                 (0  1)
    """
    M = np.array([[2, 0], [0, 1.0]])
    return np.dot(np.dot(X, M), Y.T)

h = .02  # step size in the mesh

# we create an instance of SVM and fit out data.
clf = svm.SVC(kernel=my_kernel)
clf.fit(X, Y)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired, edgecolors='k')
plt.title('3-Class classification using Support Vector Machine with custom'
          ' kernel')
plt.axis('tight')
plt.show()
I want to define some function like:
def my_new_kernel(X):
    a, b, c = (random.randint(0, 100) for _ in range(3))
    # imagine f1, f2, f3 are functions like sin(x), cos(x), ...
    ans = a*f1(X) + b*f2(X) + c*f3(X)
    return ans
What I thought about a kernel method is that it's a function that takes a matrix of features (X) as input and returns a matrix of shape (n, 1). Then svm appends the returned matrix to the feature columns and uses that to classify the labels Y.
In the code above, the kernel is used in the svm.fit function, and I can't figure out what the X and Y inputs of the kernel are, or their shapes. If X and Y (the inputs of the my_kernel method) are the features and labels of the dataset, then how does the kernel work for test data, where we have no labels?
Actually I want to use svm for a dataset with shape (10000, 6) (5 columns = features, 1 column = label). If I want to use the my_new_kernel method, what would the inputs and output be, and what are their shapes?
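As background on the question's premise: in scikit-learn, a callable kernel receives two sample matrices, not features and labels. During fit it is called as kernel(X_fit, X_fit), and during predict as kernel(X_test, X_fit), so no labels are needed at test time; it must return a Gram matrix of shape (n_samples_1, n_samples_2), not (n, 1). A sketch with an illustrative kernel (the sin/cos combination is an assumption standing in for f1, f2, f3) and synthetic data:

```python
import numpy as np
from sklearn.svm import SVR

def my_new_kernel(X1, X2):
    # a valid kernel built from elementwise functions: the feature map is
    # [sin(x), cos(x)], so K = sin(X1)·sin(X2)^T + cos(X1)·cos(X2)^T
    return np.sin(X1) @ np.sin(X2).T + np.cos(X1) @ np.cos(X2).T

rng = np.random.RandomState(0)
X_train, y_train = rng.rand(100, 5), rng.rand(100)  # 5 feature columns
X_test = rng.rand(20, 5)                            # no labels needed here

svr = SVR(kernel=my_new_kernel).fit(X_train, y_train)  # kernel(X_train, X_train)
pred = svr.predict(X_test)  # internally kernel(X_test, X_train) -> (20, 100)
print(pred.shape)  # (20,)
```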
sklearn LogisticRegression a proof that a model could not be improved
I am trying to tune my logistic regression model in sklearn. I use the F-score to evaluate the model. Here is the main idea:
clf = LogisticRegression()
parameters = {
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'max_iter': [100, 200, 300],
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}
scorer = make_scorer(fbeta_score, beta=2)
grid_obj = GridSearchCV(clf, parameters, scoring=scorer, verbose=True)
Also, I tried to keep only the solvers which support the l1 and l2 penalties.
In the end, I got these results:

Unoptimized model
Accuracy score on testing data: 0.8419
F-score on testing data: 0.6832

Optimized model
Final accuracy score on the testing data: 0.8418
Final F-score on the testing data: 0.6828
It seems that the default model setup gives the best results. I admit that it may well be so, but I am looking for an explanation of why. Is there any paper or article that gives more insight and explains, with evidence drawn from the training dataset, why a logistic regression model could not perform better?

Decoding Keras Multiclass Classifications
I am preparing input to feed into a Keras neural network for a multi-class problem as:
encoder = LabelEncoder()
encoder.fit(y)
encoded_Y = encoder.transform(y)
# convert integers to dummy variables (i.e. one hot encoded)
dummy_y = np_utils.to_categorical(encoded_Y)

X_train, X_test, y_train, y_test = train_test_split(X, dummy_y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.06, random_state=42)
After having trained the model, I try to run the following lines to obtain a prediction that reflects the original class names:
y_pred = model.predict_classes(X_test)
y_pred = encoder.inverse_transform(y_pred)
y_test = np.argmax(y_test, axis=1)
y_test = encoder.inverse_transform(y_test)
However, I obtain surprisingly low accuracy (0.36), as opposed to training and validation, which reach 0.98. Is this the right way to transform classes back into the original labels?
I compute accuracies as:
# For training
history.history['acc']
# For testing
accuracy_score(y_test, y_pred)
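As a sanity check of the decoding itself, the argmax + inverse_transform round trip does recover the original labels (np.eye indexing stands in for to_categorical here), so a low test score would have to come from elsewhere:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y = np.array(["cat", "dog", "bird", "dog"])
encoder = LabelEncoder().fit(y)
encoded = encoder.transform(y)                    # integer class indices
one_hot = np.eye(len(encoder.classes_))[encoded]  # stand-in for to_categorical
decoded = encoder.inverse_transform(one_hot.argmax(axis=1))
print(list(decoded))  # ['cat', 'dog', 'bird', 'dog']
```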

Is there any difference if I use cropped objects or full frames for training a cascade classifier?
Can I use cropped objects from full frames as training dataset for a cascade classifier (LBP or HAAR)?
I know that I have to use full frames with annotations when retraining a neural net (TensorFlow, YOLO and so on).
But do I need that for a cascade classifier? Or are cropped images OK?
It seems I can do it, because we have positive and negative images, so it should be OK to crop objects from the positive images.

Input formatting for models such as logistic regression and KNN for Python
In my training set I have 24 feature vectors (FVs). Each FV contains 2 lists. When I try to fit this with model = LogisticRegression() or model = KNeighborsClassifier(n_neighbors=k), I get this error:
ValueError: setting an array element with a sequence.
In my dataframe, each row represents one FV. There are 3 columns. The first column contains a list of an individual's heart rate, the second a list of the corresponding activity data, and the third the target. Visually, it looks something like this:
HR                         ACT                        Target
[0.5018, 0.5106, 0.4872]   [0.1390, 0.1709, 0.0886]   1
[0.4931, 0.5171, 0.5514]   [0.2423, 0.2795, 0.2232]   0
Should I:
 - Join both lists to form one long FV, or
 - Expand both lists so that each column represents one value. In other words, if there are 5 items in the HR and ACT data for a FV, the new dataframe would have 10 feature columns and 1 for Target.
How do logistic regression and KNN handle input data? I understand that logistic regression combines the inputs linearly using weights or coefficient values. But I am not sure what that means when it comes to lists vs. dataframe columns. Does it automatically convert corresponding values of dataframe columns to a list before transforming? Is there a difference between methods 1 and 2?
Additionally, if a long list is required, should the long list be [HR, HR, HR, ACT, ACT, ACT] or [HR, ACT, HR, ACT, HR, ACT]?
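A sketch of option 2 (flattening each row's lists into scalar feature columns), which resolves the ValueError because sklearn estimators expect a 2-D array of scalars; the toy data mirrors the table above:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# toy version of the dataframe: each cell in HR/ACT holds a list
df = pd.DataFrame({
    "HR":  [[0.5018, 0.5106, 0.4872], [0.4931, 0.5171, 0.5514]],
    "ACT": [[0.1390, 0.1709, 0.0886], [0.2423, 0.2795, 0.2232]],
    "Target": [1, 0],
})

# concatenate each row's lists into one flat vector [HR..., ACT...]
X = np.array([np.concatenate([hr, act]) for hr, act in zip(df["HR"], df["ACT"])])
y = df["Target"].values

model = LogisticRegression().fit(X, y)  # no "setting an array element" error
print(X.shape)  # (2, 6): 2 FVs, 6 scalar features each
```

As for ordering, logistic regression and KNN have no notion of column grouping, so [HR, HR, HR, ACT, ACT, ACT] and [HR, ACT, HR, ACT, HR, ACT] are equivalent as long as the order is consistent across all rows.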