Choosing the number of samples to average the gradient over in scikit-learn's SGDClassifier
I'm aware that the SGDClassifier in scikit-learn picks one random sample from the training dataset at each step to calculate the gradient and update the model weights (w and b) accordingly.
My question: among the parameters of the SGDClassifier, there doesn't seem to be an option to select the number of samples to pick each time (instead of just one instance) to average the gradient over. That would give us mini-batch gradient descent.
I've already had a look at the partial_fit() method, which receives a chunk of the training dataset each time to train on, but when using it with the SGDClassifier, doesn't it just boil down to picking a random training instance from the chunk instead of from the whole dataset?
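One workaround (a minimal sketch, not a built-in SGDClassifier option — the `minibatch_fit` helper and its `batch_size`/`n_epochs` parameters are made up here) is to drive partial_fit with mini-batches yourself. Note that each partial_fit call still iterates over the batch sample by sample rather than averaging a single gradient over it, so this only approximates true mini-batch gradient descent:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def minibatch_fit(X, y, batch_size=32, n_epochs=5, seed=0):
    """Hypothetical helper: emulate mini-batch training via partial_fit."""
    clf = SGDClassifier(random_state=seed)
    classes = np.unique(y)          # full label set, needed on the first call
    rng = np.random.RandomState(seed)
    for _ in range(n_epochs):
        idx = rng.permutation(len(X))        # reshuffle each epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            # one SGD pass over this mini-batch only
            clf.partial_fit(X[batch], y[batch], classes=classes)
    return clf
```

Passing `classes` on the first call is required so partial_fit knows the full label set up front.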
See also questions close to this topic

Setting seaborn/matplotlib to always use the edgecolor='k' parameter
I am constructing a lot of bar plots which are all in the same style, for example
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('darkgrid')
# some more configuration options that work as expected here
sns.barplot([1, 2, 3], [4, 5, 6], edgecolor='k')
which gives
Is there a way to configure my plotting environment to always use edgecolor='k' so that I don't have to pass the parameter every time? I tried

sns.set_style(rc={'patch.force_edgecolor': True})

but it has no effect. If I don't explicitly use edgecolor='k' when plotting, the bars don't have an edge. I also noticed that there is plt.rc('edgecolor', ???), but I cannot figure out what to pass for ??? (it has to be a keyword argument).
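For what it's worth, one approach that may work (assuming matplotlib ≥ 2.0, which introduced the patch.force_edgecolor rc key) is to set the patch rcParams directly. Since sns.set_style() resets style-related rcParams, these should be set after it:

```python
import matplotlib as mpl

# Make every patch-based artist (bars, histogram rectangles, ...) draw a
# black edge without passing edgecolor='k' on each call.
mpl.rcParams["patch.edgecolor"] = "k"
# Apply the edge color even when a plotting function sets its own default.
mpl.rcParams["patch.force_edgecolor"] = True
```

After this, a plain `sns.barplot([1, 2, 3], [4, 5, 6])` should show black bar edges.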
Python: missing value in output
Currently I am trying to learn how to merge two singly linked lists. However, I can't understand why the first value goes missing when I key in the values.
This is the merge code:

def mergeList(self, list):
    p = self.head
    q = list.head
    s = None
    if not p:
        return q
    if not q:
        return p
    if p and q:
        if p.data <= q.data:
            s = p
            p = s.next
        else:
            s = q
            q = s.next
        new_head = s
    while p and q:
        if p.data <= q.data:
            s.next = p
            s = p
            p = s.next
        else:
            s.next = q
            s = q
            q = s.next
    if not p:
        s.next = q
    if not q:
        s.next = p
    return new_head
This is the code where I request the user to input the values:

array1 = raw_input("Enter a list of numbers in descending order for list 1 separated by commas:")
array1 = [int(x) for x in array1.split(",")]
array2 = raw_input("Enter a list of numbers in descending order for list 2 separated by commas:")
array2 = [int(x) for x in array2.split(",")]
s1 = SinglyLinkedList()
s2 = SinglyLinkedList()
# insert items into s1, starting with the largest number at the end of array1
for i in range(len(array1) - 1, -1, -1):
    n = SinglyListNode(array1[i])
    s1.insertAtHead(n)
# insert items into s2, starting with the largest number at the end of array2
for i in range(len(array2) - 1, -1, -1):
    n = SinglyListNode(array2[i])
    s2.insertAtHead(n)
And this is the printing code:

def printList(self):
    temp = self.head
    print "[",
    while temp is not None:
        print temp.data,
        temp = temp.next
    print "]"

s1.mergeList(s2)
print "Content of merged list"
s1.printList()
When the user inputs:

3 6 6 10 45 45 50 into s1
2 3 55 60 into s2
The output:
3 3 6 6 10 45 45 50 55 60
The value 2 in this case does not get printed. I have tried printing the value at new_head inside mergeList and I got 2.
What I don't understand is why the 2 at the head of the list disappears when the list is printed.
Thanks for the help.
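For illustration, here is a small self-contained sketch (not the poster's code — it uses the common dummy-node style) showing that the merged list starts at the node the merge returns, which may come from the second list; the caller has to keep that returned head instead of printing from the old one:

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.next = None

def to_linked(values):
    """Build a linked chain from a Python list; returns the head node."""
    head = None
    for v in reversed(values):
        node = Node(v)
        node.next = head
        head = node
    return head

def merge(p, q):
    """Merge two sorted chains; the returned head may belong to either chain."""
    dummy = Node(None)          # placeholder so we never special-case the head
    s = dummy
    while p and q:
        if p.data <= q.data:
            s.next, p = p, p.next
        else:
            s.next, q = q, q.next
        s = s.next
    s.next = p or q             # append whichever chain is left over
    return dummy.next           # real head: first node after the placeholder

def to_list(head):
    out = []
    while head:
        out.append(head.data)
        head = head.next
    return out
```

Here merge(to_linked([3, 6]), to_linked([2, 5])) returns the node holding 2, which came from the second chain — printing from the first chain's old head would skip it.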

Django encrypted PayPal form with M2Crypto on Python 3
I was trying to add PayPal integration to Django with the help of django-paypal (https://django-paypal.readthedocs.io/en/stable/). Since I didn't want the values to be changeable through the form, I switched from PayPalPaymentsForm to PayPalEncryptedPaymentsForm:

paypal_dict = {
    'business': settings.PAYPAL_RECEIVER_EMAIL,
    'amount': str(subscription_rule.price_in_usd),
    'item_name': "Software License Key",
    'invoice': "Test Payment Invoice",
    'currency_code': 'USD',
    "custom": "some extra data",
    'notify_url': 'http://mynotifyurl.com/',
    'return_url': 'http://{}{}'.format(host, reverse('payment_done')),
    'cancel_return': 'http://{}{}'.format(host, reverse('payment_canceled')),
}
form = PayPalEncryptedPaymentsForm(initial=paypal_dict)

But this requires the library M2Crypto, which is officially available only for Python 2. So I cloned the unofficial edition of M2Crypto (https://gitlab.com/m2crypto/m2crypto) into my site-packages. When I added this I got an import error for _m2crypto. What can I do now?

Solving Mountain Car (gym) with linear value-function approximation and temporal-difference weight updates
So, in my assignment I need to solve Mountain Car by optimizing the action-value function using linear function approximation, specifically using polynomials as features. For this I am using Episodic Semi-gradient Sarsa for estimating the optimal action-value function (as in http://www.cs.cmu.edu/~rsalakhu/10703/Lecture_VFA.pdf, page 26). My feature vector for a pair [state=(position, velocity), action] looks as follows:

[1, action, position, velocity, position*velocity, action*velocity, action*position, action*position*velocity]

so I chose a first-degree polynomial. I am initializing the weight vector w with zeros and pretty much following the exact algorithm. Unfortunately, every time I run it (I also tried many episodes) my agent never succeeds in reaching the top of the mountain, and therefore only ever receives -1 rewards, so no improvement happens. I don't want to change the environment, as I feel that would be "cheating". Any ideas what to change? My code pretty much follows the algorithm on page 26 of the slides linked above.

Here n is the polynomial order, phi calculates the feature vector, qhat calculates the dot product of the feature vector and the weight vector, and qhat_derived_policy is an implementation of the epsilon-greedy algorithm for choosing an action.
def Episodic_Semi_gradient_Sarsa(alpha, epsilon, n):
    env = gym.make("MountainCar-v0")
    env.seed(3333)  # Set a seed for reproducibility
    w = np.ones((n+1)**3)
    w[((n+1)**3) - 1] = 0
    for i_episode in range(10000):
        state = env.reset()
        for t in range(200):
            env.render()
            action = qhat_derived_policy(state, epsilon, w, n)
            nextstate, reward, done, info = env.step(action)
            if done:
                w += alpha*(reward - qhat(state, action, w, n))*phi(state, action, n)
                print("Episode finished after {} timesteps".format(t+1))
                break
            nextaction = qhat_derived_policy(nextstate, epsilon, w, n)
            w += alpha*(reward + qhat(nextstate, nextaction, w, n) - qhat(state, action, w, n))*phi(state, action, n)
            action = nextaction
            state = nextstate
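For reference, a minimal sketch of what the phi helper described above might compute for the first-degree case (this is an assumption about the poster's feature map, not their actual code):

```python
import numpy as np

def phi(state, action):
    """First-order interaction features for a (state, action) pair,
    matching the 8-element vector described in the question."""
    position, velocity = state
    return np.array([
        1.0,                            # bias term
        action,
        position,
        velocity,
        position * velocity,
        action * velocity,
        action * position,
        action * position * velocity,
    ])
```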

How to efficiently train a CNN model on a large image dataset
I am a beginner in machine learning. I am making a CNN model with Keras to detect pests from leaf images. While training, memory is exceeded and I am unable to train; I have tried Kaggle and Google Colab, but in both I have the memory problem. I was advised to use a data generator, but when I tried I was unable to make it work. Is there another way to train efficiently, or an example where a data generator is used? (I have seen many examples but have problems integrating them.)
import numpy as np
import pickle
import cv2
from os import listdir
from sklearn.preprocessing import LabelBinarizer
from keras.models import Sequential
from keras.layers.normalization import BatchNormalization
from keras.layers.convolutional import Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.layers.core import Activation, Flatten, Dropout, Dense
from keras import backend as K
from keras.preprocessing.image import ImageDataGenerator
from keras.optimizers import Adam
from keras.preprocessing import image
from keras.preprocessing.image import img_to_array
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

EPOCHS = 25
INIT_LR = 1e-3
BS = 32
default_image_size = tuple((256, 256))
image_size = 0
directory_root = 'PlantVillage/'
width = 256
height = 256
depth = 3

# Function to convert images to arrays
def convert_image_to_array(image_dir):
    try:
        image = cv2.imread(image_dir)
        if image is not None:
            image = cv2.resize(image, default_image_size)
            return img_to_array(image)
        else:
            return np.array([])
    except Exception as e:
        print(f"Error : {e}")
        return None

image_list, label_list = [], []
try:
    print("[INFO] Loading images ...")
    root_dir = listdir(directory_root)
    # Looping inside root directory
    for directory in root_dir:
        # remove .DS_Store from list
        if directory == ".DS_Store":
            root_dir.remove(directory)
    for plant_folder in root_dir:
        plant_disease_folder_list = listdir(f"{directory_root}/{plant_folder}")
        print(f"[INFO] Processing {plant_folder} ...")
        # looping over images
        for disease_folder in plant_disease_folder_list:
            # remove .DS_Store from list
            if disease_folder == ".DS_Store":
                plant_disease_folder_list.remove(plant_folder)
        # If all data is taken, unable to train
        for images in plant_disease_folder_list:
            image_directory = f"{directory_root}/{plant_folder}/{images}"
            if image_directory.endswith(".jpg") == True or image_directory.endswith(".JPG") == True:
                image_list.append(convert_image_to_array(image_directory))
                label_list.append(plant_folder)
    print("[INFO] Image loading completed")
except Exception as e:
    print(f"Error : {e}")

# Get size of processed images
image_size = len(image_list)

# Converting multi-class labels to binary labels (belongs or does not belong to the class)
label_binarizer = LabelBinarizer()
image_labels = label_binarizer.fit_transform(label_list)

# Saving label binarizer instance using pickle
pickle.dump(label_binarizer, open('label_transform.pkl', 'wb'))
n_classes = len(label_binarizer.classes_)
print(label_binarizer.classes_)

# Normalizing images from [0, 255] to [0, 1]
np_image_list = np.array(image_list, dtype=np.float) / 255.0

# Splitting data into training and test set 80:20
print('Splitting data to train, test')
x_train, x_test, y_train, y_test = train_test_split(np_image_list, image_labels, test_size=0.2, random_state=42)

# Creating image generator object which performs random rotations, shifts, flips, crops, shears
aug = ImageDataGenerator(
    rotation_range=25, width_shift_range=0.1, height_shift_range=0.1,
    shear_range=0.2, zoom_range=0.2, horizontal_flip=True, fill_mode="nearest")

model = Sequential()
inputShape = (height, width, depth)
chanDim = -1
if K.image_data_format() == "channels_first":
    inputShape = (depth, height, width)
    chanDim = 1
model.add(Conv2D(32, (3, 3), padding="same", input_shape=inputShape))
model.add(Activation("relu"))
model.add(BatchNormalization(axis=chanDim))
model.add(MaxPooling2D(pool_size=(3, 3)))
model.add(Dropout(0.25))
model.add(Conv2D(64, (3, 3), padding="same"))
model.add(Activation("relu"))
model.add(BatchNormalization(axis=chanDim))
model.add(Conv2D(64, (3, 3), padding="same"))
model.add(Activation("relu"))
model.add(BatchNormalization(axis=chanDim))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(128, (3, 3), padding="same"))
model.add(Activation("relu"))
model.add(BatchNormalization(axis=chanDim))
model.add(Conv2D(128, (3, 3), padding="same"))
model.add(Activation("relu"))
model.add(BatchNormalization(axis=chanDim))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(32))
model.add(Activation("relu"))
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(n_classes))
model.add(Activation("softmax"))
# model.summary()

# Compiling the CNN
opt = Adam(lr=INIT_LR, decay=INIT_LR / EPOCHS)
model.compile(loss="binary_crossentropy", optimizer=opt, metrics=["accuracy"])

# Training the model
print("Training Model.....")
history = model.fit_generator(
    aug.flow(x_train, y_train, batch_size=BS),
    validation_data=(x_test, y_test),
    steps_per_epoch=len(x_train) // BS,
    epochs=EPOCHS,
    verbose=1
)
You can find the code in this link too.
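One framework-agnostic way to avoid holding every image in memory is a plain Python generator that loads one mini-batch of files at a time. This is only a sketch: `batch_generator` is a hypothetical name, and `load_image` stands in for whatever per-file loader is used (e.g. cv2.imread plus a resize):

```python
import numpy as np

def batch_generator(paths, labels, batch_size, load_image):
    """Yield (x, y) mini-batches forever, loading images lazily from disk."""
    n = len(paths)
    while True:  # Keras-style generators loop indefinitely
        order = np.random.permutation(n)  # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # only batch_size images are ever in memory at once
            x = np.stack([load_image(paths[i]) for i in idx])
            y = np.stack([labels[i] for i in idx])
            yield x, y
```

A generator like this can be handed to model.fit_generator(...) with steps_per_epoch=len(paths)//batch_size; Keras's own ImageDataGenerator.flow_from_directory offers the same idea with augmentation built in.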

Calculating Sensitivity for binary segmented image vs ground truth using some pixel tolerance
I have used a segmentation algorithm on some images with curvilinear structures (image size 256x256). The segmented image is binary, and I also have binary ground-truth images for these segmentations. I am able to calculate the sensitivity of the predictions by comparing pixel by pixel, but I have seen papers where a tolerance of a few pixels is allowed when calculating sensitivity. Can someone give an idea of how to calculate sensitivity with, say, a 3-pixel tolerance? (I'm using MATLAB.)
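Although the question asks about MATLAB, the idea is easy to sketch in Python/NumPy (the function name and the Chebyshev-distance reading of "tolerance" are assumptions — some papers use Euclidean distance instead): dilate the prediction by the tolerance before counting true positives. Note that np.roll wraps around the image border, which is harmless for structures away from the edges but should be replaced by padding for exact results:

```python
import numpy as np

def sensitivity_with_tolerance(pred, gt, tol=3):
    """A ground-truth positive counts as detected if any predicted positive
    lies within `tol` pixels (Chebyshev distance), i.e. the prediction is
    dilated by a (2*tol+1) x (2*tol+1) square before the comparison."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    dilated = np.zeros_like(pred)
    for dy in range(-tol, tol + 1):
        for dx in range(-tol, tol + 1):
            dilated |= np.roll(np.roll(pred, dy, axis=0), dx, axis=1)
    tp = np.logical_and(gt, dilated).sum()
    return tp / gt.sum()
```

In MATLAB the same dilation can be done with imdilate and strel before the pixelwise comparison.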

Sklearn RandomizedSearchCV OSError: [Errno 5] Input/output error
I'm trying to use RandomizedSearchCV to determine the best hyperparameters for scikit-learn's MLP and for XGBoost. While running the optimization, after roughly 50 runs an OSError occurred.
The code I used for the randomized search with XGBoost:

from scipy.stats import randint as sp_randint
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV
from joblib import dump, load
from sklearn.model_selection import StratifiedShuffleSplit
import numpy as np
import pickle

# Reference f1 eval -> https://stackoverflow.com/questions/51587535/custom-evaluation-function-based-on-f1-for-use-in-xgboost-python-api
from sklearn.metrics import f1_score

def f1_eval(y_pred, dtrain):
    y_true = dtrain.get_label()
    err = 1 - f1_score(y_true, np.round(y_pred))
    print("Score: ", str(1 - err))
    return 'f1_err', err

neg_samples = len(y[y['canceled_in_6_mon'] == 0])
pos_samples = len(y[y['canceled_in_6_mon'] == 1])

xgb_model = xgb.XGBClassifier(objective='reg:logistic', nthread=1,
                              scale_pos_weight=neg_samples / pos_samples)

parameters = {
    'learning_rate': [0.005, 0.01, 0.05, 0.1, 0.15, 0.25, 0.35],  # so-called `eta` value
    'max_depth': sp_randint(10, 100),
    'min_child_weight': sp_randint(1, 8),
    'silent': [1],
    'gamma': [0, 0.2, 0.5, 0.7, 1],
    'subsample': [0.5, 0.7, 1],
    'colsample_bytree': [0.5, 0.7, 1],
    "n_estimators": sp_randint(20, 100),
    "max_features": sp_randint(10, 400),
    "min_samples_split": sp_randint(2, 20),
    "seed": [42],
    "min_samples_leaf": sp_randint(1, 5),
}

ss = StratifiedShuffleSplit(n_splits=3, test_size=0.24, random_state=42)
clf = RandomizedSearchCV(estimator=xgb_model, param_distributions=parameters,
                         cv=ss, verbose=10, n_jobs=4, scoring='f1', n_iter=50)
clf.fit(X=X_train, y=np.ravel(y_train), eval_metric=f1_eval)
Here's the complete output I got including the error: Pastebin Link
Any ideas why this happens? The error indicates a problem with joblib, but I executed the same randomized-search code with a RandomForest classifier and everything works fine.

Predicting outcome of NFL games
A bit of background: I've scraped together 4 seasons of NFL game stats (a little under 1,100 games) and am trying to figure out the best way to predict straight-up winners. Here's an example of the layout of one row, i.e. one game:

Date        year    team  opp  team_score  opp_score  team_first downs
2014-09-04  2014.0  SEA   GB   36          16         25

team_net pass yards  team_total yards  team_turnovers  team_time of possession
191                  398               1               33.33
There are about 40+ more columns of potential stats for each game too. What I did was make each game two rows, so each team in each game has its own row with team stats and opp stats, as well as a home column of either 0 or 1 to keep track of who's playing where. Of course, there is a win column of either 0 or 1, which is the target variable. In total, the data have around 2,200 rows.

My ultimate goal is to have a model simulate a matchup and provide a probability for the winner; e.g., NYG: 10%, LAR: 90%.

Now, this is where my head unfortunately starts to spin, and I'm frankly not even sure I've set the data up in an efficient manner. My first thought was to fit a classifier like logistic regression or a random forest (maybe with something like RFE to reduce the features) to try to predict a "winner" for any given game, but I'm not sure that's useful when so much has to be taken into account concerning who they're playing.
I'm wondering what, at first glance, the best route might be to set up such a model. Sorry if it's too broad a question, I'm happy to provide as much info as possible about the data!
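As a starting point, here is a hedged sketch of the classifier route on synthetic data (the five feature columns and the w_true signal are invented stand-ins for per-game stat differences, not real NFL data): fit a logistic regression on one-row-per-team-per-game data and read the matchup probability off predict_proba:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Synthetic stand-in: 2,200 rows (one per team per game), 5 made-up
# stat-difference columns instead of the real 40+.
X = rng.normal(size=(2200, 5))
w_true = np.array([1.0, -0.5, 0.8, 0.0, 0.3])   # invented "true" signal
y = (X @ w_true + rng.normal(scale=0.5, size=2200) > 0).astype(int)  # win = 1

clf = LogisticRegression().fit(X, y)

# One hypothetical matchup row -> [P(loss), P(win)] for that row's team.
matchup = rng.normal(size=(1, 5))
proba = clf.predict_proba(matchup)[0]
```

proba[1] is then the modeled win probability for the row's team, which can be reported in the NYG: 10%, LAR: 90% style; encoding each row as the difference between a team's stats and its opponent's is one common way to make "who they're playing" part of the features.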

Chaining transformations in scikit pipeline
I'm using the scikit-learn Pipeline to preprocess a dataset. I have a dataset with four variables, ['monetary', 'frequency1', 'frequency2', 'recency'], and I want to preprocess all but recency. To preprocess, I first want to take the log and then standardize. However, when I get the transformed data from the pipeline, I get 7 columns (3 log, 3 standardized, recency). Is there a way to chain the transformations so that standardize is applied after log and I get a 4-feature dataset?

def create_pipeline(df):
    all_but_recency = ['monetary', 'frequency1', 'frequency2']

    # Preprocess
    preprocessor = ColumnTransformer(
        transformers=[
            ('log', FunctionTransformer(np.log), all_but_recency),
            ('standardize', preprocessing.StandardScaler(), all_but_recency)
        ], remainder='passthrough')

    # Pipeline
    estimators = [('preprocess', preprocessor)]
    pipe = Pipeline(steps=estimators)

    print(pipe.set_params().fit_transform(df).shape)
Thanks in advance
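One way that may fix this (a sketch, using integer column indices 0-2 in place of ['monetary', 'frequency1', 'frequency2'] and column 3 for recency) is to nest the two steps in a Pipeline and hand that Pipeline to the ColumnTransformer as a single transformer, so the same three columns are logged and then standardized in sequence instead of in parallel:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Chain log -> standardize as ONE transformer by nesting a Pipeline
# inside the ColumnTransformer, instead of listing two parallel ones.
log_then_scale = Pipeline([
    ("log", FunctionTransformer(np.log)),
    ("standardize", StandardScaler()),
])

preprocessor = ColumnTransformer(
    transformers=[("log_scale", log_then_scale, [0, 1, 2])],
    remainder="passthrough",  # column 3 ("recency") passes through untouched
)

# Strictly positive dummy data so np.log is defined.
X = np.random.RandomState(0).rand(10, 4) + 1.0
out = preprocessor.fit_transform(X)
```

out then has 4 columns (three log-then-standardized features plus untouched recency) instead of 7.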

Gradient descent algorithm in MATLAB
First, thank you for taking the time to help. I thought I had a really good handle on the calculus behind gradient descent, but for some reason my algorithm only gets very close to the optimal parameter values.
My algorithm results in parameter values of -3.5884 and 1.1237 for minimizing the cost function, when the correct values are -3.6303 and 1.1664.
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
    m = length(y); % number of training examples
    for iter = 1:num_iters
        for i = 1:m
            hypothesis = (theta(1) + theta(2)*X(i,2));
            d1 = (1/m) * (hypothesis - y(i));
            d2 = (X(i,2)/m) * (hypothesis - y(i));
            theta(1) = theta(1) - d1 * alpha;
            theta(2) = theta(2) - d2 * alpha;
        end
    end
end
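For comparison, here is a sketch of true batch gradient descent in NumPy (variable names are mine): the gradient is averaged over all m examples before a single, simultaneous update of both parameters per iteration, whereas the MATLAB loop above updates theta once per sample, which can leave it hovering near, rather than exactly at, the least-squares optimum:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1500):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(y)
    theta = np.zeros(2)
    for _ in range(num_iters):
        h = theta[0] + theta[1] * X          # predictions for all m samples at once
        g0 = (h - y).mean()                  # average gradient w.r.t. theta0
        g1 = ((h - y) * X).mean()            # average gradient w.r.t. theta1
        theta = theta - alpha * np.array([g0, g1])  # simultaneous update
    return theta
```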

How to write Multiplicative Update Rules for Matrix Factorization when one doesn't have access to the whole matrix?
So we want to approximate the matrix A, with m rows and n columns, by the product of two matrices P and Q of dimensions m x k and k x n respectively. Here is an implementation of the multiplicative update rule due to Lee, in C++ using the Eigen library:
void multiplicative_update() {
    Q = Q.cwiseProduct((P.transpose()*matrix).cwiseQuotient(P.transpose()*P*Q));
    P = P.cwiseProduct((matrix*Q.transpose()).cwiseQuotient(P*Q*Q.transpose()));
}
where P, Q, and matrix (matrix = A) are global variables in the class mat_fac. Thus I train them using the following method:

void train_2() {
    double error_trial = 0;
    for (int count = 0; count < num_iterations; count++) {
        multiplicative_update();
        error_trial = (matrix - P*Q).squaredNorm();
        if (error_trial < 0.001) {
            break;
        }
    }
}
where num_iterations is also a global variable in the class mat_fac.

The problem is that I am working with very large matrices, and in particular I do not have access to the entire matrix. Given a triple (i, j, matrix[i][j]), I have access to the row vector P[i][:] and the column vector Q[:][j]. So my goal is to rewrite the multiplicative update rule in such a way that I update these two vectors every time I see a nonzero matrix value.
In code, I want to have something like this:
void multiplicative_update(int i, int j, double mat_value) {
    Eigen::MatrixXd q_vect = get_vector(1, j);  // get_vector returns Q[:][j] as a column vector
    Eigen::MatrixXd p_vect = get_vector(0, i);  // get_vector returns P[i][:] as a column vector

    // Somehow compute coeff_AQ_t, coeff_PQQ_t, coeff_P_tA and coeff_P_tPQ.

    for (int l = 0; l < k; l++) {
        p_vect[l] = p_vect[l] * (coeff_AQ_t) / (coeff_PQQ_t);
        q_vect[l] = q_vect[l] * (coeff_P_tA) / (coeff_P_tPQ);
    }
}
Thus the problem boils down to computing the required coefficients given the two vectors. Is this possible? If not, what additional data do I need for the multiplicative update to work in this manner?
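For reference, the dense Lee-Seung update that the C++ method above implements can be sketched in NumPy as follows (the eps guard against division by zero is my addition):

```python
import numpy as np

def multiplicative_update(A, P, Q, eps=1e-9):
    """One dense Lee-Seung multiplicative update step, Q first then P,
    mirroring the C++ code: elementwise multiply by a quotient of
    matrix products."""
    Q = Q * (P.T @ A) / (P.T @ P @ Q + eps)
    P = P * (A @ Q.T) / (P @ Q @ Q.T + eps)
    return P, Q
```

Iterating this while watching ||A - PQ|| shrink mirrors train_2(); the per-entry variant asked about would have to reproduce the same numerators and denominators from the stored row and column vectors alone.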

Tensorflow Projected Gradient Descent with Box Constraints using native Optimizer's apply_gradients
Say that our model parameters w have box constraints (e.g. 0 < w_i < 1). How can I implement projected gradient descent in TensorFlow, respecting these constraints, when I optimize using a subclass of tf.train.Optimizer (e.g. AdamOptimizer)?
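A framework-agnostic sketch of the idea, in plain NumPy (in TensorFlow the analogous move would be to run the optimizer's apply_gradients and then assign tf.clip_by_value(w, lo, hi) back to the variable, or to construct the variable with a constraint= callable): take the ordinary gradient step, then project back into the box:

```python
import numpy as np

def projected_gd_step(w, grad, lr=0.1, lo=0.0, hi=1.0):
    """One projected gradient descent step for box constraints lo <= w <= hi."""
    w = w - lr * grad           # ordinary gradient step
    return np.clip(w, lo, hi)   # projection onto the box [lo, hi]
```

Because the box is a convex set, the Euclidean projection is exactly this componentwise clip, which is what makes the approach cheap to bolt onto any optimizer step.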