Choose the number of samples to average the gradient over in the SGDClassifier of Scikit-learn
I'm aware that the SGDClassifier in Scikit-learn picks one random sample from the training dataset each time to calculate the gradient, and updates the model weights (w and b) accordingly.
My question: among the parameters of the SGDClassifier, there doesn't seem to be an option to select the number of samples to pick each time (instead of just one instance) to average the gradient over. That would give us Mini-batch Gradient Descent.
I've already had a look at the partial_fit() method, which takes chunks of the training dataset to train on, but when using it with the SGDClassifier, doesn't it just boil down to picking a random training instance from the chunk instead of from the whole dataset?
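As a sketch of the usual workaround (an assumption on my part, not something the question confirms): you can drive partial_fit with your own batching loop. Note this is still not true mini-batch averaging, because inside each partial_fit call the weights are updated once per sample rather than once per batch.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

BATCH = 32  # hypothetical mini-batch size; not an SGDClassifier parameter

X, y = make_classification(n_samples=1000, random_state=0)
clf = SGDClassifier(random_state=0)
classes = np.unique(y)  # must be passed on the first partial_fit call

rng = np.random.default_rng(0)
for epoch in range(5):
    order = rng.permutation(len(X))          # reshuffle every epoch
    for start in range(0, len(X), BATCH):
        idx = order[start:start + BATCH]
        clf.partial_fit(X[idx], y[idx], classes=classes)
```

Each partial_fit call performs one SGD pass over its batch, so this approximates, rather than exactly reproduces, averaged mini-batch gradient descent.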
See also questions close to this topic

Regex lookbehind and lookahead don't find any match
I have a lot of data that I need to parse and output in a different format. The data looks something like this:
tag="001">utb20181009818< tag="003">CZ PrNK< ...
Now, I want to extract 'utb20181009818' after 'tag="001">' and before the last '<'.
This is my code in Python:

regex_pattern = re.compile(r'''(?=(tag="001(.*?)">)).*?(?<=[<])''')
ID = regex_pattern.match(one_line)
print(ID)

My variable one_line already contains the necessary data and I just need to extract the value, but it doesn't match no matter what I do. I have looked at it for hours, but I can't seem to find out what I'm doing wrong.
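Not from the original thread, but as a sketch of one way the extraction could work on the sample line above: a fixed-width lookbehind for the opening tag, combined with re.search (re.match only matches at the very start of the string).

```python
import re

one_line = 'tag="001">utb20181009818< tag="003">CZ PrNK<'

# Fixed-width lookbehind for the tag, then take everything up to the next '<'
m = re.search(r'(?<=tag="001">)[^<]+', one_line)
if m:
    print(m.group())  # utb20181009818
```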

Python: show type inheritance
I'm trying to look under the hood in IDLE to wrap my head around Python custom classes and how they are stored in memory. Suppose I have the following code:

class Point:
    pass

x = Point()
print(x)
Given the following output:
<__main__.Point object at 0x000002A3A071DF60>
I know that since my class consists of no code, when I create an object of type Point, an object of type object is implicitly created, from which the Point object x inherits such methods as __str__ etc. However, I can't seem to see the connection, i.e. when I type dir(x), I don't see any attribute that stores a reference to an object of type object. Am I misunderstanding how it works, or is there some attribute that I am unaware of?
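As an illustrative sketch: the link to object is stored on the class rather than on the instance, which is why dir(x) shows no such attribute; it is visible through Point.__bases__ and the method resolution order.

```python
class Point:
    pass

x = Point()

print(Point.__bases__)        # the implicit base: (<class 'object'>,)
print(Point.__mro__)          # lookup order: Point first, then object
print('__bases__' in dir(x))  # False: the link lives on the class side
print(type(x) is Point)       # the instance only records its class
```

Attribute lookups on x such as x.__str__ walk this MRO, which is how the instance "inherits" methods from object without storing any reference itself.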
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U78') dtype('<U78') dtype('<U78')
I am reading in images from directories, and as I loop through the file names I get the error mentioned in the title. The variable 'imagePath' is the path to an image on my local machine. When 'np.fromfile(imagePath)' is removed the code runs; it will even print the image's path, but it blows up when I try to read the images in with NumPy.
def getTrainingDataFromFile():
    for subdir, dirs, images in os.walk(directory):
        for sub, dirs, images in os.walk(subdir):
            for currentImage in images:
                imagePath = str(os.getcwd() + "/" + sub.replace("./", "") + "/" + currentImage)
                if '.jpg' in imagePath:
                    face = np.fromfile(imagePath)
                    images.append(face)

Multiply multiple tensors pairwise keras
I want to ask if it is possible to multiply two tensors pairwise. For example, I have the tensor output from an LSTM layer:

lstm = LSTM(128, return_sequences=True)(input)
output = some_function()(lstm)

some_function() should compute h1*h2, h2*h3, ..., h(n-1)*hn.
I found How do I take the squared difference of two Keras tensors? a little helpful, but since I will have a trainable parameter, I will have to make my own layer. Also, will the some_function layer interpret the input dimension automatically, as it will be n-1? I am confused about how to deal with call().
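As a sketch of the indexing only, in NumPy rather than Keras (the array below is a stand-in for the LSTM output): consecutive-timestep products can be written as one slice multiplication. Inside Keras, the same expression could live in a Lambda layer, or in the call() of a custom layer if a trainable parameter is needed, e.g. Lambda(lambda t: t[:, :-1, :] * t[:, 1:, :]).

```python
import numpy as np

# Stand-in for an LSTM output of shape (batch, timesteps, units)
h = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)

# h1*h2, h2*h3, ..., h(n-1)*hn: multiply the sequence by itself shifted by one
pairs = h[:, :-1, :] * h[:, 1:, :]

print(pairs.shape)  # (2, 3, 3): the output has n-1 timesteps
```

The output timestep dimension is n-1, which follows directly from the slicing, so a layer built this way determines it from the input shape.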

Lower loss for validation set than training set
I am training an image super-resolution CNN on some medical data. I split our dataset into 300 patients for training and 50 patients for testing.
I am using dropout of 50% and am aware that dropout can cause a similar phenomenon. However, I am not talking about training loss during the training phase versus testing loss during the testing phase.
The two metrics are both produced in the testing phase. I am using the testing mode to predict BOTH training patients and testing patients. The result is intriguing.
In terms of the loss between the super-resolved image and the ground truth, the testing patients are lower than the training patients.
The plot below shows that the losses for both the training set and the testing set drop steadily over epochs. But the testing loss is always lower than the training loss, and it is very puzzling to me to explain why. Also, the glitch between epoch 60 and epoch 80 is weird to me too. If someone has an explanation for those problems, it will be more than appreciated.

Why is my RNN throwing a batch input shape error?
My x_train shape is (798, 3) and my y input shape is (798, 1). I am creating an RNN like this:

def create_rnn_model():
    model = Sequential()
    model.add(SimpleRNN(20, return_sequences=False, stateful=stateful,
                        activation='relu', batch_input_shape=(1, 3, 1)))
    model.add(Activation('relu'))
    adam = optimizers.Adam(lr=0.001)
    model.compile(loss='mean_squared_error', optimizer=adam,
                  metrics=[root_mean_squared_error])
    return model
But this returns the error
ValueError: Error when checking input: expected simple_rnn_1_input to have 3 dimensions, but got array with shape (798, 3)
My batch size = 1, my timestep is 3 and data_dim = 1. Then where am I going wrong? Any help is appreciated.
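A sketch of what the error message is asking for, using placeholder arrays: batch_input_shape=(1, 3, 1) means the SimpleRNN expects 3-D input of shape (batch, timesteps, data_dim), so the 2-D (798, 3) array needs a trailing feature axis.

```python
import numpy as np

x_train = np.zeros((798, 3))   # placeholder for the real data
y_train = np.zeros((798, 1))

# (samples, timesteps) -> (samples, timesteps, data_dim)
x_train = x_train.reshape((x_train.shape[0], 3, 1))
print(x_train.shape)  # (798, 3, 1)
```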

Type Error with sklearn make_scorer Function
I am trying to build a custom scoring function (using sklearn.metrics.make_scorer) to be used in a GridSearchCV object. The documentation for make_scorer says:

score_func : callable
    Score function (or loss function) with signature score_func(y, y_pred, **kwargs).

Here is the code I am using:
class model(object):
    def __init__(self):
        pass

    def fit(self, X, y):
        score_func = make_scorer(self.make_custom_score)
        clf = GradientBoostingClassifier()
        model = GridSearchCV(estimator=clf, param_grid=grid, scoring=score_func, cv=3)
        model.fit(X, y)
        return self

    def make_custom_score(y_true, y_score):
        df_out = pd.DataFrame()
        df = pd.DataFrame({'true': y_true.tolist(), 'probability': y_score.tolist()})
        for threshold in np.arange(0.01, 1.0, 0.01):
            above_thresh = df[df['probability'] > threshold].groupby('true').count().reset_index()
            tp = above_thresh.loc[[1.0]]['probability'].sum()
            df_threshold = pd.DataFrame({'threshold': [threshold], 'tp': tp})
            df_out = df_out.append(df_threshold)
        df_out = df_out.sort_values(by=['threshold'], ascending=False)
        tp_score = tp[5]
        return tp_score
The error I get is:
TypeError: make_custom_score() takes 2 positional arguments but 3 were given.
I am planning on adding more to the scoring function using the **kwargs in the future so I would like to use make_scorer if I can.
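One plausible reading of the traceback (an assumption, not a confirmed diagnosis): make_custom_score is defined inside the class without self, so calling it through self.make_custom_score passes the instance as a hidden first argument, making three arguments in total. A minimal sketch of that fix, with a toy score function (the scoring logic here is illustrative only, not the thresholding code above):

```python
class Model:
    @staticmethod            # keeps the (y_true, y_score) signature intact
    def make_custom_score(y_true, y_score):
        # toy score: fraction of probabilities on the right side of 0.5
        hits = sum(1 for t, s in zip(y_true, y_score) if (s > 0.5) == bool(t))
        return hits / len(y_true)

m = Model()
print(m.make_custom_score([1, 0, 1], [0.9, 0.2, 0.4]))  # 2 of 3 correct
```

A plain module-level function outside the class would behave the same way when handed to make_scorer.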

What is the rationale for dask's LinearRegression and how to use it?
I've been playing around with dask and am running into some trouble.
Assume my data is kept in a DataFrame (either pandas or dask style) called data, and I'm trying to fit a LinearRegression model of data[yname] against data[xname], where yname and xname are the names of some columns in my dataframe.
1) Scikit-learn + pandas dataframe version:

sklearn.linear_model.LinearRegression().fit(data[xname].values.reshape(-1, 1), data[yname])
2) Scikit-learn + dask dataframe version:

chunks = list(data[xname].map_partitions(len).compute())
sklearn.linear_model.LinearRegression().fit(data[xname].to_dask_array(chunks).reshape(-1, 1), data[yname])
3) dask-ml + dask dataframe version:

chunks = list(data[xname].map_partitions(len).compute())
dask_ml.linear_model.LinearRegression(C=1e12).fit(data[xname].to_dask_array(chunks).reshape(-1, 1), data[yname])
Here are my issues with this:
- The first version is very fast on a pandas dataframe, but if my data doesn't fit in memory I have to use the dask dataframe, which takes a lot of time because the data[xname] column needs to be computed across all chunks, which is very slow. In practice, I'd like to run one of those models per column in my data. How do I make the most of dask's capabilities in that case?
- What is the rationale for using version 3) vs version 2)? They seem to do roughly the same thing. Am I using dask_ml.linear_model wrong here? Also, there doesn't seem to be an easy way to have no penalty (l1/l2) in the model except setting the regularization parameter C to a high value.
- Is there an easier way than what I'm using to give my data[xname] the correct format (an array of shape (n_samples, n_features)) for the LinearRegression API? This is in regard to the to_dask_array cast and reshaping.
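On the last point, a minimal sketch of the shape LinearRegression expects (plain NumPy here; the same reshape(-1, 1) applies to the array returned by to_dask_array):

```python
import numpy as np

col = np.array([1.0, 2.0, 3.0])  # stand-in for data[xname].values
X = col.reshape(-1, 1)           # (n_samples, n_features) with one feature
print(X.shape)  # (3, 1)
```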

Incremental training of Keras image classification model
I used the smaller VGG model and modified the training script from the following tutorial to train a previously trained model. Original source of the model and script: https://www.pyimagesearch.com/2018/04/16/keras-and-convolutional-neural-networks-cnns/
This is what I did:
1st training session:
- train the model with an image dataset of 2 classes, A and B, using the original training script from the tutorial
2nd training session:
- load the model and train it with an image dataset of class C, without any new data of classes A and B, using the following modified training script
- load the trained model from the 1st session and train it with new data, as the following Stack Overflow thread suggested
- load the pickled array of labels from the 1st session, combine it with the new label array from the 2nd session, and save it in lb.pickle
Reference of loading Keras model: Loading a trained Keras model and continue training
Result:
The trained model after the 2nd session can only recognize the new class in the 2nd session. It seems other classes trained in the 1st session are lost. It just doesn't work.
My question: how do I fix the following script to make incremental training work? Or is there any other suggestion or reference for incremental training similar to my case?
My modified training script:
from keras.utils import np_utils
from keras.preprocessing.image import ImageDataGenerator
from keras.optimizers import Adam
from keras.preprocessing.image import img_to_array
from keras.models import load_model
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from smallervggnet import SmallerVGGNet
from imutils import paths
import numpy as np
import argparse, os, sys
import random
import pickle
import cv2

ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required=True,
    help="path to input dataset (i.e., directory of images)")
ap.add_argument("-im", "--loadmodel", required=True,
    help="path to model to be loaded")
ap.add_argument("-m", "--model", required=True,
    help="path to output model")
ap.add_argument("-l", "--labelbin", required=True,
    help="path to output label binarizer")
ap.add_argument("-p", "--plot", type=str, default="plot.png",
    help="path to output accuracy/loss plot")
args = vars(ap.parse_args())

EPOCHS = 100
INIT_LR = 1e-3
BS = 10
IMAGE_DIMS = (256, 256, 3)

data = []
labels = []

print("[INFO] loading images...")
imagePaths = sorted(list(paths.list_images(args["dataset"])))
random.seed(42)
random.shuffle(imagePaths)

for imagePath in imagePaths:
    image = cv2.imread(imagePath)
    image = cv2.resize(image, (IMAGE_DIMS[1], IMAGE_DIMS[0]))
    image = img_to_array(image)
    data.append(image)
    label = imagePath.split(os.path.sep)[-2]
    labels.append(label)

data = np.array(data, dtype="float") / 255.0
labels = np.array(labels)
print("[INFO] data matrix: {:.2f}MB".format(data.nbytes / (1024 * 1000.0)))

lb = LabelBinarizer()
bLabels = lb.fit_transform(labels)

(trainX, testX, trainY, testY) = train_test_split(data, bLabels,
    test_size=0.2, random_state=42)

# add these 2 lines to avoid error
trainY = np_utils.to_categorical(trainY, 2)
testY = np_utils.to_categorical(testY, 2)

aug = ImageDataGenerator(rotation_range=25, width_shift_range=0.1,
    height_shift_range=0.1, shear_range=0.2, zoom_range=0.2,
    horizontal_flip=True, fill_mode="nearest")

print("[INFO] load previously trained model")
modelPath = args["loadmodel"]
model = load_model(modelPath)

print("[INFO] training network...")
H = model.fit_generator(
    aug.flow(trainX, trainY, batch_size=BS),
    validation_data=(testX, testY),
    steps_per_epoch=len(trainX) // BS,
    epochs=EPOCHS, verbose=1)

print("[INFO] serializing network...")
model.save(args["model"])

# my attempt to keep the labels of all the training sessions in the label binarizer
prevArray = './train_output/previous_data_array.pickle'
arrPickle = labels
if os.path.getsize(prevArray) > 0:
    prev = pickle.loads(open(prevArray, 'rb').read())
    arrPickle = np.concatenate((prev, labels), axis=0)
lb = LabelBinarizer()
lb.fit_transform(arrPickle)

print("[INFO] serializing combined label array...")
f = open(prevArray, "wb")
f.write(pickle.dumps(arrPickle))
f.close()

print("[INFO] serializing label binarizer...")
f = open(args["labelbin"], "wb")
f.write(pickle.dumps(lb))
f.close()

Gradient descent search implemented in Matlab: theta1 incorrect
I studied the Machine Learning course taught by Prof. Andrew Ng. This is the link.
I am trying to implement the 1st assignment of this course, Exercise 2: Linear Regression, based on a supervised learning problem:
1. Implement gradient descent using a learning rate of alpha = 0.07. Since Matlab/Octave index vectors starting from 1 rather than 0, you'll probably use theta(1) and theta(2) in Matlab/Octave to represent theta0 and theta1.
I wrote the following Matlab code to solve this problem:
clc
clear
close all
x = load('ex2x.dat');
y = load('ex2y.dat');
figure % open a new figure window
plot(x, y, '*');
ylabel('Height in meters')
xlabel('Age in years')
m = length(y); % store the number of training examples
x = [ones(m, 1), x]; % Add a column of ones to x
theta = [0 0];
temp = 0; temp2 = 0;
h = [];
alpha = 0.07; n = 2; % alpha = learning rate
for i = 1:m
    temp1 = 0;
    for j = 1:n
        h(j) = theta(j) * x(i,j);
        temp1 = temp1 + h(j);
    end
    temp = temp + (temp1 - y(i));
    temp2 = temp2 + ((temp1 - y(i)) * (x(i,1) + x(i,2)));
end
theta(1) = theta(1) - (alpha * (1/m) * temp);
theta(2) = theta(2) - (alpha * (1/m) * temp2);
I get the answer:

>> theta
theta =
    0.0745    0.4545

Here, 0.0745 is the exact answer but the 2nd value is not accurate.
Actual answer
theta =
0.0745 0.3800
The data set is provided in the link. Can anyone help me fix the problem?
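For comparison, a sketch of a single batch gradient-descent step in NumPy (toy data standing in for the ex2 files, not the course's actual numbers): note that the update for each theta(j) uses only that parameter's own feature column x(:, j), whereas the code above multiplies the residual by (x(i,1) + x(i,2)) in the theta(2) accumulator.

```python
import numpy as np

# toy dataset: a column of ones (intercept) plus one feature
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([1.0, 2.0, 3.0])
alpha = 0.07
theta = np.zeros(2)

h = X @ theta                   # predictions for all m examples
grad = X.T @ (h - y) / len(y)   # each component uses its own feature column
theta = theta - alpha * grad
```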

How to calculate the term back-propagated in PReLU?
I need to implement the PReLU activation function, but I do not know how to calculate the term dE/df(y_i). I know that in classic backprop, when calculating the derivatives of the error with respect to the weights w, delta_j and delta_k are calculated. But in this case, with PReLU, how is it done? Should the first term (dE/df(y_i)) be delta_j?
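A sketch of the PReLU forward and backward passes in NumPy, under the assumption of a single shared slope a; here delta stands for the dE/df(y) term arriving from the layer above (the usual delta_j of backprop), and the chain rule supplies the rest:

```python
import numpy as np

def prelu_forward(y, a):
    # f(y) = y for y > 0, a*y otherwise
    return np.where(y > 0, y, a * y)

def prelu_backward(y, a, delta):
    # chain rule: dE/dy = dE/df(y) * df/dy, with df/dy = 1 or a
    dE_dy = delta * np.where(y > 0, 1.0, a)
    # gradient for the trainable slope: df/da = 0 for y > 0, y otherwise
    dE_da = np.sum(delta * np.where(y > 0, 0.0, y))
    return dE_dy, dE_da

y = np.array([-2.0, 3.0])
delta = np.ones_like(y)
dE_dy, dE_da = prelu_backward(y, 0.25, delta)
# dE_dy == [0.25, 1.0], dE_da == -2.0
```

So yes, under this reading the incoming term is exactly the delta of the next layer; the PReLU only scales it by df/dy and accumulates a separate gradient for a.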

Losses are increasing in binary classification using the gradient descent optimization method
This is my program for binary classification using the gradient descent optimization method. I am not sure about my loss function. The error in my case is increasing when plotted.
def sigmoid_activation(x):
    return 1.0 / (1 + np.exp(-x))

def predict(testX, W):
    preds = sigmoid_activation(np.dot(testX, W))
    # apply a step function to threshold (=0.5) the outputs to binary class labels
    # start your code here
    p = []
    for i in range(len(preds)):
        if preds[i] < 0.5:
            p.append(0)
        if preds[i] >= 0.5:
            p.append(1)
    return p

epochs = 50
alpha = 0.01
(X, y) = make_moons(n_samples=1000, noise=0.15)
y = y.reshape(y.shape[0], 1)
X = np.c_[X, np.ones((X.shape[0]))]
(trainX, testX, trainY, testY) = train_test_split(X, y, test_size=0.5, random_state=42)

print("[INFO] training...")
W = np.random.randn(X.shape[1], 1)
losses = []
for epoch in np.arange(0, epochs):
    # start your code here
    Z = np.dot(trainX, W)
    yhat = sigmoid_activation(Z)
    error = trainY - yhat
    loss = np.sum(error ** 2)
    losses.append(loss)
    gradient = trainX.T.dot(error) / trainX.shape[0]
    W = W - alpha * gradient  # moving in -ve direction
    # check to see if an update should be displayed
    if epoch == 0 or (epoch + 1) % 5 == 0:
        print("[INFO] epoch={}, loss={:.7f}".format(int(epoch + 1), loss))

# evaluate our model
print("[INFO] evaluating...")
preds = predict(testX, W)
print(classification_report(testY, preds))

# plot the (testing) classification data
plt.style.use("ggplot")
plt.figure()
plt.title("Data")
plt.scatter(testX[:, 0], testX[:, 1], marker="o", c=testY[:, 0], s=30)

# construct a figure that plots the loss over time
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, epochs), losses)
plt.title("Training Loss")
plt.xlabel("Epoch #")
plt.ylabel("Loss")
plt.show()
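For comparison, a self-contained sketch (toy separable data instead of make_moons, so the numbers are illustrative only) where the sign convention is kept consistent: with error defined as prediction minus target, the descent update W -= alpha * gradient makes the loss fall. In the code above, error = trainY - yhat has the opposite sign, so subtracting that gradient moves uphill.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # note the minus sign

# toy linearly separable data with a bias column
X = np.c_[rng.normal(size=(200, 2)), np.ones(200)]
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

W = rng.normal(size=(3, 1))
alpha = 0.5
losses = []
for _ in range(100):
    yhat = sigmoid(X @ W)
    error = yhat - y                 # prediction minus target
    losses.append(float(np.sum(error ** 2)))
    gradient = X.T @ error / len(X)  # consistent sign...
    W -= alpha * gradient            # ...so subtracting descends

print(losses[0], "->", losses[-1])  # loss decreases over the run
```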