Calculate residual deviance from a scikit-learn logistic regression model
Is there any way to calculate the residual deviance of a scikit-learn logistic regression model? This is a standard output from R model summaries, but I couldn't find it in any of sklearn's documentation.
1 answer

You cannot do it in scikit-learn, but check out statsmodels' GLMResults (see its API docs).
See also questions close to this topic

How to detach Python child process on Windows (without setsid)?
I'm migrating some process code to Windows which worked well on POSIX. Very simply put: the code to launch a subprocess and immediately detach will not work, because setsid() is not available:

```python
import os, subprocess, sys

p = subprocess.Popen([sys.executable, '-c', "print 'hello'"],
                     preexec_fn=os.setsid)
```

I can remove the use of setsid, but then the child process ends when the parent ends. My question is: how do I achieve the same effect as setsid on Windows, so that the child process lifetime is independent of the parent's? I'd be willing to use a particular Python package if one exists for this sort of thing. I'm already using psutil, for example, but I didn't see anything in it that could help me.
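On Windows, a commonly suggested analogue is to pass the `DETACHED_PROCESS` and `CREATE_NEW_PROCESS_GROUP` creation flags to `Popen`. A sketch (flag values hard-coded so the module imports anywhere; the actual detaching behavior is Windows-only):

```python
import subprocess
import sys

# Windows process-creation flags; also exposed as subprocess.DETACHED_PROCESS
# and subprocess.CREATE_NEW_PROCESS_GROUP on Windows builds of Python.
DETACHED_PROCESS = 0x00000008
CREATE_NEW_PROCESS_GROUP = 0x00000200

flags = DETACHED_PROCESS | CREATE_NEW_PROCESS_GROUP

if sys.platform == "win32":
    # The child gets no console and its own process group, so it
    # keeps running after the parent exits.
    p = subprocess.Popen(
        [sys.executable, "-c", "print('hello')"],
        creationflags=flags,
        close_fds=True,
    )
```

Whether this fully matches setsid semantics depends on what the child does with console handles, so treat it as a starting point rather than a drop-in replacement.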
Recursively finding a base sequence
findStartRec(goal, count)
recursively searches forward from an initial value of 0, and returns the smallest starting value whose sequence reaches or exceeds the goal. The preconditions are that goal >= 0 and count > 0. If the double (x * 2) and add 5 (+ 5) sequence starting at 0 cannot reach the goal in count steps, then try starting at 1. Continue this process until the program finds a starting value N that does reach or exceed the goal in count steps, and return that start value.
Example:
findStartRec(100, 3) returns 9
Here is what I have come up with so far:

```python
def findStartRec(goal, count, sequence=0, itter=0):
    if sequence == goal and count == 0:
        print("Sequence: ", sequence, "Itter: ", itter)
        return sequence, itter
    else:
        while count > 0:
            sequence = (itter * 2) + 5
            count = count + 1
            #return findStartRec(goal, count + 1, sequence, itter)
        else:
            return findStartRec(goal, count, sequence, itter + 1)
```
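For reference, one way the search can be written recursively (a sketch; the names are illustrative, not necessarily the assignment's required signature):

```python
def find_start_rec(goal, count, start=0):
    """Smallest start whose 'double and add 5' sequence reaches or
    exceeds goal within count steps."""
    def run(value, steps):
        # Apply value -> value * 2 + 5 exactly `steps` times.
        return value if steps == 0 else run(value * 2 + 5, steps - 1)

    if run(start, count) >= goal:
        return start
    # This start falls short; try the next one.
    return find_start_rec(goal, count, start + 1)
```

For example, `find_start_rec(100, 3)` follows 9 → 23 → 51 → 107 ≥ 100, so it returns 9, matching the example above.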

Data preparation with Python
I'm making my debut in this forum, so I apologize for any non-compliance with the rules. I have a text file that I want to divide into four parts, as indicated in the code. It always generates errors; I really ask for your help. Thank you.
```python
# First import pandas and the regex module
import pandas as pd
import numpy as np
import re

data = open("Discussion.txt", encoding="utf8")
contenu = data.read()
data.close()
print(contenu)

# Read the .txt file into a string
data = open("Discussion.txt", encoding="utf8")
string = data.read()
data.close()

# Split separate lines into list of strings
splitstring = string.splitlines()

# For each list item find the data needed (with regex or indexing)
# and assign to a dictionary
df = {}
for i in range(len(splitstring)):
    match = re.search(r'(.* .*) - (.*): (.*)', splitstring[1])
    line = {
        'Date': splitstring[i][:10],
        'Time': match.group(1),
        'Number': match.group(2),
        'Text': match.group(3)}
    df[i] = line
```

This raises:

```
AttributeError                            Traceback (most recent call last)
<ipython-input-54-3a1f0fdf7c6> in <module>()
      8     line = {
      9         'Date' : splitstring[i][:10],
---> 10         'Time' : match.group(1),
     11         'Number' : match.group(2),
     12         'Text' : match.group(3)}

AttributeError: 'NoneType' object has no attribute 'group'
```

```python
# Convert dictionary to pandas dataframe
dataframe = pd.DataFrame(df).T

# Finally send to csv
dataframe.to_csv(filepath)
```

```
File "<ipython-input-6-2b1b4e00c433>", line 3
    Finally send to csv
    ^
IndentationError: unexpected indent
```
Here is a preview of the output of print(contenu), attached as an image:
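The `AttributeError` happens whenever `re.search` returns `None` for a line that doesn't fit the pattern, so it pays to guard the match before calling `.group()`. A defensive sketch with hypothetical sample lines (the `' - '` separator is an assumption about the chat-export format):

```python
import re

# Hypothetical lines mimicking an exported chat file: "date time - sender: text"
lines = [
    "12/05/2018 10:21 - Alice: hello",
    "this line does not match the pattern",
]

pattern = re.compile(r'(.* .*) - (.*): (.*)')
rows = []
for raw in lines:
    match = pattern.search(raw)
    if match is None:
        continue  # skip lines the regex cannot parse
    rows.append({
        'Date': raw[:10],
        'Time': match.group(1),
        'Number': match.group(2),
        'Text': match.group(3),
    })
```

The skipped lines can also be collected and inspected, which usually reveals whether the regex or the data needs adjusting.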

PCA with several time series as features of one instance with sklearn
I want to apply PCA on a data set where I have 20 time series as features for one instance. I have some 1000 instances of this kind and am looking for a way to reduce dimensionality. For every instance I have a pandas DataFrame, like:
```python
import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.random.normal(0, 1, (300, 20)))
```
Is there a way to use sklearn's fit on all instances together, with each instance having a set of time series as its feature space? I mean, I could apply fit on each instance separately, but I want the same principal components for all. Is there a way? The only unsatisfying idea I have so far is to concatenate all the series of one instance into one, so that I have a single time series per instance.
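One way to get a single set of principal components is to treat each instance as one sample by flattening its (300, 20) frame into a vector and fitting PCA once on the stacked matrix. A sketch with hypothetical data (100 instances instead of 1000, for brevity):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 100 instances, each with 20 time series of length 300.
rng = np.random.default_rng(0)
instances = [rng.normal(0, 1, (300, 20)) for _ in range(100)]

# Flatten each instance into one row so a single PCA sees all instances at once.
X = np.stack([inst.ravel() for inst in instances])  # shape (100, 6000)

pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # (100, 5)
```

Note the flattening discards the series/time structure, so whether the shared components are meaningful depends on the data; it does, however, guarantee one common set of components for all instances.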

Sklearn supervised learning with 3d array
NOTE: You can probably ignore the paragraph below if you have deep technical knowledge of sklearn and ML in general.
I am working on indexing image objects based on their position in an image. Their index is relative to the other image objects in each image, which varies significantly, so simple math will not work to index them. Moreover, I have tried to index them via their middle x coordinate in the image, but that only yields an accuracy of ~75% with sklearn's DecisionTreeRegressor. Now I want to try to train a model to index them from their detection box's (obtained from TensorFlow object recognition + a pretrained neural network) x1,y1 or x1,x2,y1,y2 coordinates.
So here's my question:
Is an array such as
```python
[[[x0_0_0, x0_0_1],    # <- object 1 x,y coords for image 1
  [x0_1_0, x0_1_1]],   # <- object 2 x,y coords for image 1
 [ ... ],
 [[xn_0_0, xn_0_1],    # <- object 1 x,y coords for image n
  [xn_1_0, xn_1_1]]]   # <- object 2 x,y coords for image n
```
with a target array of
```python
[[y0_0, y0_1],   # <- indices of objects 1 and 2 in image 1
 [ ... ],
 [yn_0, yn_1]]   # <- indices of objects 1 and 2 in image n
```
viable for use in any of the supervised ML algorithms packaged in sklearn?
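In general, sklearn estimators expect a 2D feature matrix, so one common approach is to flatten each image's coordinate array into a single row; multi-output targets like the per-object index pairs are supported by several estimators. A sketch with hypothetical random data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: 50 images, 2 objects per image, (x, y) per object.
rng = np.random.default_rng(0)
X3d = rng.uniform(0, 1, (50, 2, 2))    # (n_images, n_objects, n_coords)
y = np.argsort(X3d[:, :, 0], axis=1)   # toy target: objects ordered by x

# sklearn estimators expect 2D input, so flatten each image to one row.
X2d = X3d.reshape(len(X3d), -1)        # (50, 4)

# DecisionTreeRegressor accepts 2D multi-output targets directly.
model = DecisionTreeRegressor().fit(X2d, y)
pred = model.predict(X2d)
print(pred.shape)  # (50, 2)
```

So the 3D array itself isn't usable as-is, but reshaped to (n_images, n_objects * n_coords) it fits the standard sklearn interface.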

How to use sklearn LassoCV, what am I doing wrong?
This is hopefully a very simple question for anyone who has used sklearn.linear_model.LassoCV successfully. I'm doing my first Lasso regression on a very simple simulated data set, as follows. I'm getting unsatisfactory results and I want to know what I'm doing wrong.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as pp
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV

X = np.random.uniform(0, 10, 100)
e = np.random.uniform(0, 1, 100)

# coefficients
b0, b1, b2, b3 = 0.0, 0.1, 0.2, 0.3

# target
Y = b0 + b1 * X + b2 * X**2 + b3 * X**3 + e

# the dataset x**1, ..., x**10
data = pd.DataFrame({"Y": Y, "X1": X})
for i in range(2, 11):
    data["X{:d}".format(i)] = data["X1"]**i

X = data.drop(axis=1, labels='Y')
Y = data['Y']

# standardize the data
scaler = StandardScaler()
Xscl = pd.DataFrame(
    data=scaler.fit_transform(X),
    columns=['X' + str(i) for i in range(1, 11)])

# lasso constraints
alphas = np.logspace(-3, 1, 1000)

# perform regression with 10-fold cv
model = LassoCV(alphas=alphas, cv=10, max_iter=10000, tol=0.0001, eps=0.0001)
result = model.fit(Xscl, Y)

# reverse-scale coefficients and plot fit over data
coeff = model.coef_ / scaler.scale_
x1 = np.linspace(0., 10., 100)
pp.plot(X['X1'], Y, 'o')
pp.plot(x1, np.polyval(coeff[::-1], x1), '-')

print(model.coef_)
# [6.3122168   38.18296697 30.20713128 16.3567352  7.30950212
#  2.27074138  0.          0.          1.16784659  1.88575215]
```
Plotting the fit over the data gives the following (see image). What am I doing wrong?
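One detail worth checking in the plotting step: `np.polyval` expects coefficients ordered from highest degree down to the constant term, and the constant (intercept) has to be included explicitly. A minimal sketch with made-up coefficients:

```python
import numpy as np

# Suppose a fit produced coefficients for x**1, x**2, x**3 plus an intercept.
coeffs_low_to_high = np.array([0.1, 0.2, 0.3])  # b1, b2, b3
intercept = 0.5

# polyval wants [c3, c2, c1, c0]: reverse the slopes, append the intercept.
poly = np.concatenate([coeffs_low_to_high[::-1], [intercept]])
value = np.polyval(poly, 2.0)  # 0.3*8 + 0.2*4 + 0.1*2 + 0.5
print(value)  # 3.9
```

Passing slope coefficients alone (without the intercept, or in the wrong order) shifts and distorts the plotted curve even when the fitted model itself is fine.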

Minimal glmnet example for factors
I am trying to understand how to use the R package glmnet.
Suppose I have a dataset, representing games played between two teams, with the 'win' column defining the result.
```r
library(RcppAlgos)
library(dplyr)

data <- RcppAlgos::permuteGeneral(c("A", "B", "C", "D", "E"), 2, repetition = TRUE) %>%
  as.data.frame() %>%
  setNames(c("team1", "team2")) %>%
  mutate(win = rbinom(25, 1, 0.5))
```
where 1 represents that team1 won, and 0 represents that team1 lost.
I now want to run this data through glmnet, with the 'win' column as the response.
I know that I need to use model.matrix with my factor variables, but it doesn't seem to me that that would give the right result.
For example:
```r
x <- model.matrix(data$win ~ data$team1 + data$team2)
fit <- glmnet(x, data$win)
```
Can anyone help?
Thanks!

One-shot learning for a regression task
I know one-shot learning can be used for classification, as in a Siamese network, but can we use one-shot learning for a regression task?

Centering variables for multiple regression - interested in group effects
I'm trying to run a multiple regression model looking at the length-weight relationship in fish. So y = weight, x = length. What I want to examine specifically is whether the length-weight relationship between different populations (same species) differs. I've run the model as:
weight = length * population
BUT I have also been reading a lot about centering data in regression models. It seems to make no sense to me to grand-mean-centre length for this analysis, as I'm specifically interested in the differences in the LW relationship between the groups, but should I group-centre length? Or not centre at all?
Any help or pointers greatly appreciated.
Cheers. G.
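The interaction model and the group-centring step described above can be sketched in Python with statsmodels' formula API (hypothetical fish data; the population-specific slopes are invented for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: two populations whose length-weight slope differs slightly.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "length": rng.uniform(10, 50, 200),
    "population": np.repeat(["A", "B"], 100),
})
slope = np.where(df["population"] == "A", 2.0, 2.5)
df["weight"] = slope * df["length"] + rng.normal(0, 1, 200)

# length * population expands to both main effects plus their interaction,
# so the interaction coefficient tests whether slope differs between groups.
fit = smf.ols("weight ~ length * C(population)", data=df).fit()

# Group-centring length subtracts each population's own mean length.
df["length_c"] = df["length"] - df.groupby("population")["length"].transform("mean")
```

Group-centring changes the interpretation of the main effects (they become effects at the group mean) but leaves the interaction term, which carries the between-group slope difference, intact.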

Binary logistic regression: significance without an increase in Overall Percentage?
I have some data with two independent variables and one dependent variable. I'm using SPSS and my IVs have an interaction. My results are below.
I don't have a stats background and am new to logistic regression, so I'm not sure how to interpret my results. Specifically, as I highlight below, the data seem to show significance (χ2(1) = 7.737, p = .005), but the Overall Percentage for the model is the same as for the null model (60.0)?
Am I doing something wrong, or can binary logistic regression show significance in the data without a bump in Overall Percentage?

Unbalanced training samples for binary classification (90% vs 10%) - TensorFlow
I have a training sample of 100,000 examples with 5 features (90,000 classified as '0' and the rest classified as '1').
I am getting 98% accuracy, but the precision/recall rates are around 55%.
Any suggestions to improve the precision/recall rates, using TensorFlow?
```python
# Loss function after sigmoid applied on yy_
loss = tf.losses.log_loss(yy_, scores, scope="loss")
optimizer = tf.train.GradientDescentOptimizer(learning_rate=.01)
train_op = optimizer.minimize(loss)
prediction = (scores > 0.5)
```
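One common remedy for a 90/10 imbalance is to weight the loss inversely to class frequency, so minority-class errors cost more. Sketched here with scikit-learn's `class_weight='balanced'` on hypothetical data, the same idea as weighting the per-example log-loss terms in a TensorFlow model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Hypothetical 90/10 imbalanced data with 5 features; the minority class
# mean is shifted so the classes are partially separable.
rng = np.random.default_rng(0)
n = 10000
y = (rng.uniform(size=n) < 0.1).astype(int)
X = rng.normal(size=(n, 5)) + y[:, None]

# class_weight='balanced' scales each example's loss by n / (2 * n_class),
# pushing the decision boundary toward the minority class.
clf = LogisticRegression(class_weight="balanced").fit(X, y)
pred = clf.predict(X)
precision = precision_score(y, pred)
recall = recall_score(y, pred)
```

Weighting typically trades some precision for much better recall; tuning the classification threshold (instead of the fixed 0.5 in the snippet above) is the complementary knob.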

Logistic Regression Predicted Values Are Clustered
I have built a logistic regression model (using MSFT Machine Learning Studio). The overall model seems to be decent.
However, when I use the predictive functionality of the web platform to see how it does, the values are not as evenly distributed as I saw in the training/testing phase. I did a rough check on the training data and data I am using to do predictions and there is not a significant difference.