Low accuracy with logistic regression, desired error not necessarily achieved due to precision loss
I've been trying to implement Coursera's machine learning course in Python. I am stuck on exercise 3 (recognizing handwritten digits): my accuracy is 84% instead of 94%, and I am getting warnings as well. I've been checking the gradient and compute_cost functions for days, but I just can't find the problem. I would really, really appreciate any insights.
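The actual compute_cost and gradient are not shown here, so only a hedged sketch is possible. A common cause of both the precision-loss warning and a few points of lost accuracy in this exercise is log(0) in an unregularized cost. A regularized, vectorised version to compare against (all names and the epsilon guard are assumptions, not the course's own code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(theta, X, y, lam):
    """Regularized logistic-regression cost; theta[0] (bias) is not penalised."""
    m = len(y)
    h = sigmoid(X @ theta)
    eps = 1e-12  # guard against log(0), the usual source of the precision warning
    cost = -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
    cost += (lam / (2 * m)) * np.sum(theta[1:] ** 2)
    return cost

def gradient(theta, X, y, lam):
    """Gradient of the regularized cost, matching compute_cost above."""
    m = len(y)
    h = sigmoid(X @ theta)
    grad = (X.T @ (h - y)) / m
    grad[1:] += (lam / m) * theta[1:]
    return grad
```

If the optimizer still reports "desired error not necessarily achieved due to precision loss", checking that the gradient matches the cost numerically (finite differences) usually isolates the bug.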
See also questions close to this topic

Python list comprehension with replace: 'a bytes-like object is required, not 'str''
I am trying to split up a tab-delimited bytes object into lines and fields. In my input data, when a field is supposed to be empty it instead contains a placeholder value. I want to replace that placeholder with something that will act as an empty string '' when I use it to build a MySQL insert. I am new to list comprehensions, but I found a few examples that seemed similar.
for line in line_split[1:]:
    field_split = line.split(b'\t')
    field_split = [x.replace('', '') for x in field_split]
    print("f", field_split)
    report_list.append(field_split)
If I comment out the replace line that errors, so it can print, I get back the following line. If you scroll right, the field value I want to replace shows as b''. This seems like it should be a simple fix, but I have been messing around with it for far longer than I care to admit.
f [b'1020569383', b'X012312', b'42132LVPG0U', b'Glow', b'Sports', b'Glow', b'Amazon', b'18.85', b'18.85', b'11.61', b'10.67', b'1.54', b'36.02', b'inches', b'0.52', b'pounds', b'LgStdNonMedia', b'USD', b'6.02', b'2.83', b'0.00', b'', b'', b'', b'3.19']
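For what it's worth, the usual cause of this error is that bytes.replace requires bytes arguments (b'...'), not str ('...'). A minimal sketch, assuming the sentinel to strip is the two-character sequence \N (the actual placeholder value is not shown above, so treat it as a hypothetical):

```python
line = b'1020569383\t\\N\t18.85'  # stand-in for one tab-delimited input line
field_split = line.split(b'\t')
# bytes.replace needs bytes arguments, not str, or it raises
# "a bytes-like object is required, not 'str'"
field_split = [x.replace(b'\\N', b'') for x in field_split]
```

The same rule applies to every str-style method on bytes: both the needle and the replacement must themselves be bytes.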

Box plots in Python using Seaborn: creating duplicates for bigrams and trigrams
I am using Spyder as part of Anaconda and trying to classify tweets (text) by event type. To do this, I am using the function cross_val_score, having already vectorised my tweets using TfidfVectorizer and then transforming my training data using fit_transform for unigrams, bigrams and trigrams, as per the below:
# TF-IDF on unigrams, bigrams and trigrams
tfidf_words = TfidfVectorizer(sublinear_tf=True, min_df=0, norm='l2', encoding='latin1',
                              ngram_range=(1, 1), stop_words='english')
# vectorize for bigrams
tfidf_bigrams = TfidfVectorizer(sublinear_tf=True, min_df=0, norm='l2', encoding='latin1',
                                ngram_range=(2, 2), stop_words='english')
# vectorize for trigrams
tfidf_trigrams = TfidfVectorizer(sublinear_tf=True, min_df=0, norm='l2', encoding='latin1',
                                 ngram_range=(3, 3), stop_words='english')

# Transform and fit each of the outputs from TF-IDF (unigrams, bigrams and trigrams)
x_train_words = tfidf_words.fit_transform(x_train_sm.preprocessed).toarray()
# bigrams
x_train_bigrams = tfidf_bigrams.fit_transform(x_train_sm.preprocessed).toarray()
# trigrams
x_train_trigrams = tfidf_trigrams.fit_transform(x_train_sm.preprocessed).toarray()
Now I perform cross-validation using cross_val_score to calculate the average accuracy for unigrams, bigrams and trigrams. Once complete, I am trying to produce and save a box plot of the accuracies achieved. This is done for 4 different models:
# Create list of models to be tested: Random Forest, Linear SVC, Naive Bayes & Logistic Regression
models = [OneVsRestClassifier(RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0)),
          OneVsRestClassifier(LinearSVC()),
          OneVsRestClassifier(MultinomialNB()),
          OneVsRestClassifier(LogisticRegression(random_state=0))]
# number of folds (10-fold cross-validation performed for each model)
CV = 10

########## Fitting, predicting and calculating average accuracy for unigrams data ##########
# create blank dataframe with an index equal to the number of CV folds * number of models tested
cv_words = pd.DataFrame(index=range(CV * len(models)))
# create an empty list, which will be populated with the accuracies of each model at each fold
entries = []
# list of the names of the models tested
names = ["Random Forest", "Linear SVC", "Naive Bayes", "Logistic Regression"]
# convert y_train_sm from an array into a series to work in the 'cross_val_score' function
# this series contains all of the event_ids for the corresponding encoded tweets (labels)
# cross_val_score is a function used to calculate performance scores and implement cross-validation
y_train_sm = pd.Series(y_train_sm.tolist())

# calculate the accuracy at each fold and populate the results in the 'entries' list
# populate the dataframe 'cv_words' with the fold and accuracy scores at each fold
i = 0
for model in models:
    model_name = names[i]  # alternatively: model.__class__.__name__
    # model => the model that will be used to fit the data
    # x_train_words => x training data after oversampling (unigrams)
    # y_train_sm => y training data after oversampling (event_id)
    # scoring => the type of score you want the function 'cross_val_score' to return
    # cv => number of folds you want to be performed with cross-validation
    accuracies = cross_val_score(model, x_train_words, y_train_sm, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
    cv_words = pd.DataFrame(entries, columns=['model_name_unigrams', 'fold_idx', 'accuracy'])
    i = i + 1

# plot the results of each model on a single box plot
box_words = sns.boxplot(x='model_name_unigrams', y='accuracy', data=cv_words)
fig_words = box_words.get_figure()
fig_words.savefig('boxplot_unigrams.png')
The output of the unigrams is exactly what I want:
Now when I run the code for bigrams and trigrams (highlight ALL code and hit 'play'), I get the following:
Bigrams:
Trigrams:
The code for each of these is identical, except they use 'cv_bigrams' and 'cv_trigrams' for the data input for the box plots. Code for each is below.
Bigram code:
# create blank dataframe with an index equal to the number of CV folds * number of models tested
cv_bigrams = pd.DataFrame(index=range(CV * len(models)))
# clear the previous list called 'entries' that was populated with values
entries = []
# calculate the accuracy at each fold and populate the results in the 'entries' list
# populate the dataframe 'cv_bigrams' with the fold and accuracy score at each fold
i = 0
for model in models:
    model_name = names[i]  # alternatively: model.__class__.__name__
    # model => the model that will be used to fit the data
    # x_train_bigrams => x training data after oversampling (bigrams)
    # y_train_sm => y training data after oversampling (event_id)
    # scoring => the type of score you want the function 'cross_val_score' to return
    # cv => number of folds you want to be performed with cross-validation
    accuracies = cross_val_score(model, x_train_bigrams, y_train_sm, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
    cv_bigrams = pd.DataFrame(entries, columns=['model_name_bigrams', 'fold_idx', 'accuracy'])
    i = i + 1
Trigrams code:
# create blank dataframe with an index equal to the number of CV folds * number of models tested
cv_trigrams = pd.DataFrame(index=range(CV * len(models)))
# clear the previous list called 'entries' that was populated with values
entries = []
# calculate the accuracy at each fold and populate the results in the 'entries' list
# populate the dataframe 'cv_trigrams' with the fold and accuracy score at each fold
i = 0
for model in models:
    model_name = names[i]  # alternatively: model.__class__.__name__
    # model => the model that will be used to fit the data
    # x_train_trigrams => data that is to be fitted by the selected model (trigrams)
    # y_train_sm => y training data after oversampling (event_id)
    # scoring => the type of score you want the function 'cross_val_score' to return
    # cv => number of folds you want to be performed with cross-validation
    accuracies = cross_val_score(model, x_train_trigrams, y_train_sm, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
    cv_trigrams = pd.DataFrame(entries, columns=['model_name_trigrams', 'fold_idx', 'accuracy'])
    i = i + 1
Here is what happens if I select the below code only and run:
# plot the results of each model as a box plot
box_bigrams = sns.boxplot(x='model_name_bigrams', y='accuracy', data=cv_bigrams)
fig_bigrams = box_bigrams.get_figure()
fig_bigrams.savefig('boxplot_bigrams.png')
Same for trigrams:
# plot the results of each model as a box plot
box_trigrams = sns.boxplot(x='model_name_trigrams', y='accuracy', data=cv_trigrams)
fig_trigrams = box_trigrams.get_figure()
fig_trigrams.savefig('boxplot_trigrams.png')
Output:
Any idea why I am getting duplicate boxplots overlapping each other when I run all of the code at once (which I need to do when I put this code into production), rather than highlighting the snippets and running them separately?
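Not a definitive diagnosis, but seaborn draws onto the current matplotlib Axes, so when all three sections run in one go the bigram and trigram boxplots land on whatever axes already exist from the previous plot. One sketch of isolating each plot in its own figure (the DataFrame here is a small stand-in for cv_bigrams):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe for scripted runs
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

cv_bigrams = pd.DataFrame({'model_name_bigrams': ['A', 'A', 'B', 'B'],
                           'accuracy': [0.70, 0.80, 0.60, 0.90]})

fig, ax = plt.subplots()  # fresh figure and axes for this plot only
sns.boxplot(x='model_name_bigrams', y='accuracy', data=cv_bigrams, ax=ax)
fig.savefig('boxplot_bigrams.png')
plt.close(fig)            # release the figure so later plots start clean
```

Creating a new figure per plot (or calling plt.figure() before each sns.boxplot) keeps the unigram, bigram and trigram plots from stacking on top of each other when the whole script runs at once.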

How to make Scrapy crawl in DFS Order
I have a scrapy code where the structure is like
parse()
parse2()
parse3()
I want Scrapy to crawl the pages in DFS order, i.e. all the level-3 links first, followed by level 2 and then level 1. But Scrapy doesn't crawl that way. I have tried everything I could think of to achieve this but have been unable to find a solution. Can someone suggest the correct way to do this?
Ex:
def parse(self, response):
    print "url1"
    yield scrapy.Request(url, callback=self.parse2)

def parse2(self, response):
    print "url2"
    yield scrapy.Request(url, callback=self.parse3)

def parse3(self, response):
    print "url3"
    # Do something
Output should be something like
url1
url2
url3
....
....
....
url2
url3
....
....
url2
url3
....
....
url1
Thanks in advance
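For reference, Scrapy's scheduler pops pending requests LIFO by default, which is already roughly depth-first; what usually breaks the strict url1 → url2 → url3 ordering is concurrency, since several requests are in flight at once. A sketch of settings.py values that push the crawl toward strict DFS (treat these as assumptions to check against the Scrapy docs for your version):

```python
# settings.py sketch: bias the crawl toward strict depth-first order
DEPTH_PRIORITY = 0  # default; do not de-prioritise deeper requests
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'  # LIFO = depth-first
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'
CONCURRENT_REQUESTS = 1  # serialise requests so the LIFO order is actually observable
```

Setting CONCURRENT_REQUESTS = 1 costs throughput, but with concurrency enabled the printed order will always interleave regardless of the queue type.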

GDAL 2.3.1 is installed but the Linux terminal is using GDAL 2.2.2
I've installed GDAL 2.3.1 using pip on an Ubuntu 16.04 machine. The package is in the correct site-packages directory, and yet when I run a Python script I receive this error:
Error 1: NUMPY driver was compiled against GDAL 2.3, but the current library version is 2.2
When using gdal-config --version, the output shows 2.2.2.
I want to know how to change the version of GDAL that Linux appears to be using from 2.2.2 to 2.3.1, but I have no idea how to do this.
Any help would be greatly appreciated!

Python: unable to merge tables
I have two tables with the following fields.
Table 1: Year, Week, Quarter, ...
Table 2: YearMonth, ...
I want to merge these two tables so that I can do further analysis. Since month information is not available for Table 1, I am confused. Please help!
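Since only Table 2 has month information, one sketch is to derive a month key for Table 1 from its Year/Week pair and then merge on it. This assumes pandas, ISO week numbering, and that a week straddling a month boundary is assigned to one month by convention (here, the month of that week's Monday); the frames below are hypothetical stand-ins for the two tables:

```python
from datetime import date
import pandas as pd

# Hypothetical frames standing in for Table 1 and Table 2
t1 = pd.DataFrame({'Year': [2018, 2018], 'Week': [1, 30], 'Quarter': [1, 3]})
t2 = pd.DataFrame({'YearMonth': ['2018-01', '2018-07'], 'value': [10, 20]})

# Month of the Monday of each ISO (Year, Week) pair becomes the join key
t1['YearMonth'] = [date.fromisocalendar(y, w, 1).strftime('%Y-%m')
                   for y, w in zip(t1['Year'], t1['Week'])]

merged = t1.merge(t2, on='YearMonth', how='left')
```

The how='left' keeps every Table 1 row even when no matching month exists in Table 2; date.fromisocalendar requires Python 3.8+.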

What is a quicker way to generate a matrix which has at least one non-zero element in each row?
I am trying to come up with a big matrix of size 10000*10000 that should have dtype float, where at least one of the elements in each of its rows is non-zero.
I was using:
import numpy as np
list_going_in = np.random.rand(10000, 10000)
but it takes more than a second to come up with the values, rendering it useless for my application.
I have also tried using
np.empty()
but that returns uninitialized values (in practice often all zeros), so it gives no non-zero guarantee and can't be used.
Can someone please suggest a possible way to do this in well under a second? I would prefer a NumPy array.
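If the only requirement is dtype float with at least one non-zero per row, there is no need to fill all 10^8 cells with random numbers; allocating zeros and setting a single column is far cheaper than np.random.rand. A sketch (using a smaller n here to keep memory modest; the same approach scales to 10000):

```python
import numpy as np

n = 2000
m = np.zeros((n, n), dtype=np.float64)  # fast: zeroed pages, no RNG work
m[:, 0] = 1.0                           # guarantee one non-zero in every row
```

np.zeros is typically serviced by the OS with pre-zeroed pages, so it is much faster than generating random values; np.empty would be faster still, but as noted its contents are uninitialized and give no non-zero guarantee.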

How do I round off the elements of a scipy.sparse.csr_matrix to 2 decimal places?
It can't be converted to a NumPy array due to a memory error.
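A CSR matrix keeps its non-zero values in a flat .data array, so the rounding can be applied there directly without ever densifying; a minimal sketch:

```python
import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix(np.array([[0.12345, 0.0],
                         [0.0, 2.71828]]))
# Round only the stored (non-zero) entries in place; no dense copy is made,
# so this works even when toarray() would exhaust memory
m.data = np.round(m.data, 2)
```

The implicit zeros are untouched (rounding a zero is a no-op anyway), so the sparsity structure is preserved.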

2d interpolation with NaN values in python
I have a 2d matrix (1800*600) with many NaN values.
I would like to conduct a 2d interpolation, which is very simple in MATLAB. But if scipy.interpolate.interp2d is used, the result is a NaN matrix. I know the NaN values could be filled using scipy.interpolate.griddata, but I don't want to fill the NaNs. What other functions can I use to conduct a 2d interpolation?

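One hedged option: rather than handing the full grid to interp2d (which chokes on NaNs), pass only the valid samples to scipy.interpolate.griddata and evaluate wherever needed; the NaN cells then never enter the fit at all. A small sketch:

```python
import numpy as np
from scipy.interpolate import griddata

a = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 8.0, 9.0]])

rows, cols = np.indices(a.shape)
valid = ~np.isnan(a)
# Fit on the valid samples only; the NaN cells contribute nothing to the fit
filled = griddata((rows[valid], cols[valid]), a[valid],
                  (rows, cols), method='linear')
```

If the goal is to interpolate onto a different grid while simply ignoring NaN inputs, the same valid-mask trick applies with the new coordinates as the third argument instead of (rows, cols).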
Generalised least squares fit python
There are lots of questions and answers on least-squares fitting in Python. My personal favourite is "How can I do a least squares fit in python, using data that is only an upper limit?"
But how can I write a generalised fit_lambda function which fits any function to the data?

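If "generalised" here means fitting an arbitrary model function (as the body suggests) rather than statistical GLS, scipy.optimize.curve_fit already accepts any callable, so a fit_lambda wrapper can be a few lines. A sketch (the name fit_lambda and its signature are taken from the question as assumptions, not an established API):

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_lambda(f, x, y, p0):
    """Least-squares fit of an arbitrary model f(x, *params) to data (x, y)."""
    params, _cov = curve_fit(f, x, y, p0=p0)
    return params

# Usage: fit a straight line, passing the model as a lambda
x = np.linspace(0.0, 1.0, 50)
y = 3.0 * x + 1.0
params = fit_lambda(lambda x, a, b: a * x + b, x, y, p0=[1.0, 0.0])
```

Any callable with signature f(x, p1, p2, ...) works the same way, so the "generalisation" comes for free from curve_fit's interface.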
How to get feature importance in logistic regression using weights?
I have a dataset of reviews which has a class label of positive/negative. I am applying logistic regression to that reviews dataset. First, I am converting it into a bag of words. Here sorted_data['Text'] contains the reviews and final_counts is a sparse matrix:
count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(sorted_data['Text'].values)
standardized_data = StandardScaler(with_mean=False).fit_transform(final_counts)
I split the data set into train and test sets:
X_1, X_test, y_1, y_test = cross_validation.train_test_split(final_counts, labels, test_size=0.3, random_state=0)
X_tr, X_cv, y_tr, y_cv = cross_validation.train_test_split(X_1, y_1, test_size=0.3)
I am applying the logistic regression algorithm as follows:
optimal_lambda = 0.001000
log_reg_optimal = LogisticRegression(C=optimal_lambda)
# fitting the model
log_reg_optimal.fit(X_tr, y_tr)
# predict the response
pred = log_reg_optimal.predict(X_test)
# evaluate accuracy
acc = accuracy_score(y_test, pred) * 100
print('\nThe accuracy of the Logistic Regression for C = %f is %f%%' % (optimal_lambda, acc))
My weights are:
weights = log_reg_optimal.coef_   # <class 'numpy.ndarray'>
# array([[0.23729528, 0.16050616, 0.1382504 , ..., 0.27291847, 0.35857267, 0.41756443]])
# weights.shape is (1, 38178)
I want to get the feature importance, i.e. the top 100 features which have the highest weights. Could anyone tell me how to get them?

How to do multicollinearity check in logistic regression?
I want to do a multicollinearity check on a bag of words for logistic regression. I have to add noise to the matrix, drawn from N(0, 0.1). I want to check the weights prior to adding the noise and also after adding it. If the weights differ a lot, then I will know that there is multicollinearity. I converted the text into a matrix:
count_vect = CountVectorizer()   # in scikit-learn
final_counts = count_vect.fit_transform(data['CleanedText'].values)
standardized_data = StandardScaler(with_mean=False).fit_transform(final_counts)
The contents of standardized_data (a sparse matrix) are as follows:
(0, 232)     5.28663039106
(0, 1026)    2.09754160944
(0, 4351)    47.1484208356
(0, 4894)    3.62576585703
(0, 6326)    17.496202036
(0, 7585)    12.2994564729
(0, 9033)    55.0542695865
(0, 9480)    5.60252663694
(0, 9489)    34.3093270041
Could anyone tell me how to add a noise to the matrix and get the weights?
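A sparse matrix keeps its explicit values in .data, so Gaussian noise can be added there directly. Note this perturbs only the stored non-zeros; noising the zero cells too would force the matrix dense, which defeats the sparse format. A sketch with a stand-in matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
m = csr_matrix(np.array([[5.29, 0.0, 2.10],
                         [0.0, 3.63, 0.0]]))

noisy = m.copy()
# scale=0.1 treats 0.1 as the standard deviation; take sqrt(0.1) if N(0, 0.1)
# is meant as a variance
noisy.data = noisy.data + rng.normal(0.0, 0.1, size=noisy.data.shape)

# Refitting LogisticRegression on `noisy` and comparing coef_ against the
# original fit then shows how sensitive the weights are.
```

Fitting the model once on m and once on noisy, then comparing the two coef_ arrays, gives the before/after weight comparison described above.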

Multiple logistic regression in R with binned data
Suppose I have a data frame like this (the actual data is irrelevant):
df <- data.frame(bin = rep(1:10, 5), x = sample(c(0:1), 50, replace = T), y = sample(c(0:1), 50, replace = T))
I want to make a logistic regression that implements the following:
where i is every bin. I know that I can implement it in R with
glm(y ~ x, family = binomial(link = "logit")), but I can't figure out how to do it without getting every x_i for i from 1 up to N bins and then running the model y ~ x_1 + x_2 + ... + x_N. Is there a better way of doing it?


How to give restrictions on Gradient Descent Optimizations
I have the function:
a*b = 13
I am using gradient descent to obtain the correct values for a and b, so I am getting a = 1.29 and b = 10.03, which approximately satisfies the condition, but I need to get a = 1 and b = 13. I don't want a and b to be decimals, so that whenever I set a*b equal to some prime number I should get a, b = 1 and the prime number. How do I add a restriction that a and b should be integers?
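A hedged note: gradient descent needs a continuous search space, so integrality cannot be expressed as a constraint inside it (the integers form a discrete set with no usable gradient information). For a*b = n with integers, enumeration is the natural tool, and for a prime n the only positive factorisation is (1, n). A sketch:

```python
def integer_factor_pairs(n):
    """All integer pairs (a, b) with a*b == n and 1 <= a <= b."""
    pairs = []
    a = 1
    while a * a <= n:          # only need to test a up to sqrt(n)
        if n % a == 0:
            pairs.append((a, n // a))
        a += 1
    return pairs
```

For composite n this returns every factor pair, so picking the one closest to the continuous gradient-descent solution is a simple post-processing step.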

Machine Learning: Stochastic gradient descent for logistic regression in R: Calculating Eout and average number of epochs
I am trying to write a code to solve the following problem (As stated in HW5 in the CalTech course Learning from Data):
In this problem you will create your own target function f (a probability in this case) and data set D to see how Logistic Regression works. For simplicity, we will take f to be a 0/1 probability so y is a deterministic function of x. Take d = 2 so you can visualize the problem, and let X = [-1, 1] × [-1, 1] with uniform probability of picking each x ∈ X. Choose a line in the plane as the boundary between f(x) = 1 (where y has to be +1) and f(x) = 0 (where y has to be -1) by taking two random, uniformly distributed points from X and taking the line passing through them as the boundary between y = ±1. Pick N = 100 training points at random from X, and evaluate the outputs y_n for each of these points x_n. Run Logistic Regression with Stochastic Gradient Descent to find g, and estimate E_out (the cross-entropy error) by generating a sufficiently large, separate set of points to evaluate the error. Repeat the experiment for 100 runs with different targets and take the average. Initialize the weight vector of Logistic Regression to all zeros in each run. Stop the algorithm when ||w(t-1) - w(t)|| < 0.01, where w(t) denotes the weight vector at the end of epoch t. An epoch is a full pass through the N data points (use a random permutation of 1, 2, ..., N to present the data points to the algorithm within each epoch, and use different permutations for different epochs). Use a learning rate of 0.01.
I am required to calculate the nearest value to Eout for N=100, and the average number of epochs for the required criterion.
I wrote and ran the code but I'm not getting the right answers (as stated in the solutions, E_out should be near 0.1 and the number of epochs near 350). The required number of epochs for a delta-w of 0.01 comes out far too small (around 10), leaving the error far too big (around 2). I then tried replacing the criterion with ||w(t-1) - w(t)|| < 0.001 (rather than 0.01). Then the average required number of epochs was about 250 and the out-of-sample error was about 0.35.
Is there something wrong with my code/solution, or is it possible that the answers provided are faulty? I've added comments to indicate what I intend to do at each step. Thanks in advance.
library(pracma)
h <- 0   # h will later be updated to the number of required epochs
p <- 0   # p will later be updated to Eout
C <- matrix(ncol=10000, nrow=2)   # Testing set, used to calculate out-of-sample error
d <- matrix(ncol=10000, nrow=1)
for(i in 1:10000){
  C[, i] <- c(runif(2, min = -1, max = 1))   # Sample data
  d[1, i] <- sign(C[2, i] - f(C[1, i]))
}
for(g in 1:100){   # 100 runs of the experiment
  x <- runif(2, min = -1, max = 1)
  y <- runif(2, min = -1, max = 1)
  fit <- lm(y ~ x)
  t <- summary(fit)$coefficients[, 1]
  f <- function(x){   # Target function
    t[2]*x + t[1]
  }
  A <- matrix(ncol=100, nrow=2)   # Sample data
  b <- matrix(ncol=100, nrow=1)
  norm_vec <- function(x) {sqrt(sum(x^2))}   # vector norm calculator
  w <- c(0, 0)   # weights initialized to zero
  for(i in 1:100){
    A[, i] <- c(runif(2, min = -1, max = 1))   # Sample data
    b[1, i] <- sign(A[2, i] - f(A[1, i]))
  }
  q <- matrix(nrow = 2, ncol = 1000)   # q tracks the weight vector at the end of each epoch
  l <- 1
  while(l < 1001){
    E <- function(z){   # cross-entropy error function
      x <- z[1]
      y <- z[2]
      v <- z[3]
      return(log(1 + exp(-v * t(w) %*% c(x, y))))
    }
    err <- function(xn1, xn2, yn){   # gradient of the error function
      return(-c(yn*xn1, yn*xn2) * (exp(-yn * t(w) * c(xn1, xn2)) / (1 + exp(-yn * t(w) * c(xn1, xn2)))))
    }
    e <- matrix(nrow = 2, ncol = 100)   # e will track the required gradient at each data point
    e[, 1:100] <- 0
    perm <- sample(100, 100, replace = FALSE, prob = NULL)   # Random permutation of the data indices
    for(j in 1:100){   # One complete epoch
      r <- A[, perm[j]]   # pick the perm[j]-th entry in A
      s <- b[perm[j]]     # pick the perm[j]-th entry in b
      e[, perm[j]] <- err(r[1], r[2], s)   # Gradient of the error
      w <- w - 0.01*e[, perm[j]]   # update the weight vector (step size times gradient)
    }
    q[, l] <- w   # the l-th entry is the weight vector at the end of the l-th epoch
    if(l > 1 & norm_vec(q[, l] - q[, l-1]) < 0.001){   # given criterion to terminate the algorithm
      break
    }
    l <- l + 1   # move to the next epoch
  }
  for(n in 1:10000){
    p[g] <- mean(E(c(C[1, n], C[2, n], d[n])))   # average of the error function over 10000 data points, in experiment no. g
  }
  h[g] <- l   # g-th entry in h tracks the number of epochs in the g-th run of the experiment
}
mean(h)   # Mean number of epochs needed
mean(p)   # average Eout, over 100 experiments

Gradient descent weights initialization
I came back to gradient descent with a question. You might have already seen the following lines of code (normally the weights get initialized with the randn function).
Why should I initialize these with randn? Is there any issue if I start both m and b at zero? I am aware that randn draws values from a standard normal distribution. I am doubting whether the assigned randn values might start above my actual point values. Could you kindly explain this?

m = np.random.randn()
b = np.random.randn()
Thanks
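For a model this simple the answer can be shown directly: the squared-error loss of y = m*x + b is convex, so gradient descent reaches the same minimum from m = b = 0 as from a randn start (random initialisation mainly matters for models with symmetric hidden units, such as neural networks, where identical weights would stay identical). A sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 100)
y = 2.0 * x + 0.5              # true m = 2.0, b = 0.5

m, b = 0.0, 0.0                # zero initialisation instead of randn
lr = 0.5
for _ in range(2000):
    pred = m * x + b
    grad_m = 2.0 * np.mean((pred - y) * x)  # d(MSE)/dm
    grad_b = 2.0 * np.mean(pred - y)        # d(MSE)/db
    m -= lr * grad_m
    b -= lr * grad_b
```

Starting m and b above or below the true values only changes the path taken, not the destination; convexity guarantees the same (m, b) either way, up to the stopping tolerance.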