Partial residual plot with confidence intervals in R for multivariate generalized linear regression
I have fit the following regression model:
mod <- betareg(connectance ~ fc * size, data = net.land.3000, link = "logit")
summary(mod)
Call:
betareg(formula = connectance ~ fc * size, data = net.land.3000, link = "logit")

Standardized weighted residuals 2:
    Min      1Q  Median      3Q     Max
-2.4777 -0.1919  0.0185  0.1435  2.6781

Coefficients (mean model with logit link):
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.160853   0.939846   2.299   0.0215 *
fc          0.062797   0.027081   2.319   0.0204 *
size        0.109050   0.101085   1.079   0.2807
fc:size     0.005822   0.003427   1.699   0.0894 .

Phi coefficients (precision model with identity link):
      Estimate Std. Error z value Pr(>|z|)
(phi)    85.06      42.32    2.01   0.0445 *

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Type of estimator: ML (maximum likelihood)
Log-likelihood: 12.88 on 5 Df
Pseudo R-squared: 0.6826
Number of iterations: 35 (BFGS) + 3 (Fisher scoring)
I am trying to plot the relationship between fc and connectance, with the effects of size and the fc * size interaction taken into account, and with a 95% confidence interval. From what I understand, this is a partial residual (or marginal effects) plot.
How can I create one in R given that I have multiple independent variables, an interaction, and a nonlinear regression? I know the standard method would be to regress the residuals of (y ~ my other independent variables) vs. (my variable of interest ~ other independent variables), but I'm not sure how to handle that with the interaction and the nonlinear relationship.
I have given it a shot using the plot_model() function and got the following:
plot_model(mod, type = "pred", terms = "fc")
However, I was expecting the regression line in my partial residual plot to have a slope equal to the model estimate for fc (0.06) and an intercept equal to the intercept in my model output (2.16), which it does not.
To summarize my questions:
- Is what I want a partial residual plot?
- If yes, how is that related to the slope of my variable of interest in my model output, if at all?
- Is plot_model giving me the correct results, or is there another function I can use to calculate a regression line and confidence interval?
Sorry if this is very confusing, I am something of a beginner at stats and quite confused. I would appreciate any help, and be happy to provide more information as it is useful!
For reference, here are my data:
connectance fc size
1 0.3333333 37.96319 8
3 0.2500000 11.33780 8
5 0.3809524 18.16915 13
6 0.5000000 47.88571 5
8 0.2500000 14.02959 10
9 0.1904762 17.87691 13
11 0.2777778 19.11214 9
12 0.2291667 29.03701 14
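For what it's worth, one way to get this kind of plot is a sketch along the following lines, assuming the ggeffects package (which supports betareg objects). Note that because of the logit link and the interaction, the fitted curve on the response scale is not a straight line: the fc coefficient (0.06) is the slope on the logit scale with size held at zero, so neither the slope nor the intercept of the plotted curve should match the raw coefficients, and that alone does not mean plot_model is wrong.

```r
library(ggeffects)  # supports betareg objects
library(ggplot2)

# Adjusted predictions of connectance across the observed fc values,
# with size held at a typical value, plus a 95% confidence band
pred <- ggpredict(mod, terms = "fc [all]")

plot(pred) +                        # ggeffects' built-in ggplot method
  geom_point(data = net.land.3000,  # overlay the raw data
             aes(x = fc, y = connectance),
             inherit.aes = FALSE)
```

This shows adjusted predictions (marginal effects) rather than classical partial residuals; for this model, that is usually what "the effect of fc with the other terms accounted for" means in practice.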
See also questions close to this topic

Group several columns then aggregate a set of columns in Pandas (It crashes badly compared to R's data.table)
I am relatively new to the world of Python and trying to use it as a backup platform to do data analysis. I generally use data.table for my data analysis needs. The issue is that when I run a group-aggregate operation on a big CSV file (randomized, zipped, uploaded at http://www.filedropper.com/ddataredact_1), Python throws:
grouping pandas return getattr(obj, method)(*args, **kwds) ValueError: negative dimensions are not allowed
OR (I have even encountered...)
File "C:\Anaconda3\lib\site-packages\pandas\core\reshape\util.py", line 65, in cartesian_product
  for i, x in enumerate(X)]
File "C:\Anaconda3\lib\site-packages\pandas\core\reshape\util.py", line 65, in <listcomp>
  for i, x in enumerate(X)]
File "C:\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py", line 445, in repeat
  return _wrapfunc(a, 'repeat', repeats, axis=axis)
File "C:\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py", line 51, in _wrapfunc
  return getattr(obj, method)(*args, **kwds)
MemoryError
I have spent three days trying to reduce the file size (I was able to reduce the size by 89%), adding breakpoints, debugging it, but I was not able to make any progress.
Surprisingly, I thought of running the same group/aggregate operation with data.table in R, and it hardly took 1 second. Moreover, I didn't have to do any data type conversion etc., as suggested at https://www.dataquest.io/blog/pandas-big-data/. I also researched other threads: Avoiding Memory Issues For GroupBy on Large Pandas DataFrame; Pandas: df.groupby() is too slow for big data set. Any alternatives methods?; and pandas groupby with sum() on large csv file?. It seems these threads are more about matrix multiplication. I'd appreciate it if you wouldn't tag this as a duplicate.
Here's my Python code:
finaldatapath = "..\Data_R"
ddata = pd.read_csv(finaldatapath + "\\" + "ddata_redact.csv",
                    low_memory=False, encoding="ISO-8859-1")

# before optimization: 353MB
ddata.info(memory_usage="deep")

# optimize file: Object types are the biggest culprit.
ddata_obj = ddata.select_dtypes(include=['object']).copy()
# Now convert this to category type:
# Float type didn't help much, so I am excluding it here.
for col in ddata_obj:
    del ddata[col]
    ddata.loc[:, col] = ddata_obj[col].astype('category')

# release memory
del ddata_obj

# after optimization: 39MB
ddata.info(memory_usage="deep")

# Create a list of grouping variables:
group_column_list = [
    "Business", "Device_Family", "Geo", "Segment", "Cust_Name", "GID",
    "Device ID", "Seller", "C9Phone_Margins_Flag", "C9Phone_Cust_Y_N",
    "ANDroid_Lic_Type", "Type", "Term", 'Cust_ANDroid_Margin_Bucket',
    'Cust_Mobile_Margin_Bucket',
    # 'Cust_Android_App_Bucket',
    'ANDroind_App_Cust_Y_N'
]

print("Analyzing data now...")

def ddata_agg(x):
    names = {
        'ANDroid_Margin': x['ANDroid_Margin'].sum(),
        'Margins': x['Margins'].sum(),
        'ANDroid_App_Qty': x['ANDroid_App_Qty'].sum(),
        'Apple_Margin': x['Apple_Margin'].sum(),
        'P_Lic': x['P_Lic'].sum(),
        'Cust_ANDroid_Margins': x['Cust_ANDroid_Margins'].mean(),
        'Cust_Mobile_Margins': x['Cust_Mobile_Margins'].mean(),
        'Cust_ANDroid_App_Qty': x['Cust_ANDroid_App_Qty'].mean()
    }
    return pd.Series(names)

ddata = ddata.reset_index(drop=True)
ddata = ddata.groupby(group_column_list).apply(ddata_agg)
The code crashes in the above .groupby operation. Can someone please help me? Compared to my other posts, I have probably spent the most amount of time on this StackOverflow post, trying to fix it and learn new stuff about Python. However, I have reached saturation; it frustrates me even more because R's data.table package processes this file in <2 seconds. This post is not about the pros and cons of R and Python, but about using Python to be more productive. I am completely lost, and I'd appreciate any help.
Here's my data.table R code:

path_r = "../ddata_redact.csv"
ddata <- data.table::fread(path_r, stringsAsFactors=FALSE, data.table = TRUE, header = TRUE)
group_column_list <- c(
  "Business", "Device_Family", "Geo", "Segment", "Cust_Name", "GID",
  "Device ID", "Seller", "C9Phone_Margins_Flag", "C9Phone_Cust_Y_N",
  "ANDroid_Lic_Type", "Type", "Term", 'Cust_ANDroid_Margin_Bucket',
  'Cust_Mobile_Margin_Bucket',
  # 'Cust_Android_App_Bucket',
  'ANDroind_App_Cust_Y_N'
)
ddata <- ddata[, .(ANDroid_Margin = sum(ANDroid_Margin, na.rm = TRUE),
                   Margins = sum(Margins, na.rm = TRUE),
                   Apple_Margin = sum(Apple_Margin, na.rm = TRUE),
                   Cust_ANDroid_Margins = mean(Cust_ANDroid_Margins, na.rm = TRUE),
                   Cust_Mobile_Margins = mean(Cust_Mobile_Margins, na.rm = TRUE),
                   Cust_ANDroid_App_Qty = mean(Cust_ANDroid_App_Qty, na.rm = TRUE),
                   ANDroid_App_Qty = sum(ANDroid_App_Qty, na.rm = TRUE)),
               by = group_column_list]
I have a 4-core, 16 GB RAM, Win10 x64 machine. I can provide any details needed by experts.

Installing RODBC in R
I am trying to install RODBC and when I do so i get this error:
Installing package into ‘\\lakesh/Documents/R/win-library/3.5’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/RODBC_1.3-15.zip'
Content type 'application/zip' length 879575 bytes (858 KB)
downloaded 858 KB
package ‘RODBC’ successfully unpacked and MD5 sums checked
Warning in install.packages :
  cannot remove prior installation of package ‘RODBC’
Then I tried to remove the package:
> remove.packages("RODBC", lib = NULL)
Removing package from ‘\\lakesh/Documents/R/win-library/3.5’
(as ‘lib’ is unspecified)
Error in remove.packages : there is no package called ‘RODBC’
I need some guidance on installing the RODBC package properly.
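A sketch of one common workaround, assuming (and this is an assumption, not certain from the output alone) that the network home drive (\\lakesh\...) is what prevents R from removing the prior installation: install into a local, writable library instead and load the package from there. The path C:/Rlibs is just an example.

```r
# Create a local library and put it first on the search path
dir.create("C:/Rlibs", showWarnings = FALSE)  # example path, adjust as needed
.libPaths(c("C:/Rlibs", .libPaths()))

# Install RODBC into the local library and load it from there
install.packages("RODBC", lib = "C:/Rlibs")
library(RODBC, lib.loc = "C:/Rlibs")
```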

Rearranging columns in R
I have this table:
ID  Gene1  Gene2  Gene3
a.  1a.    2a.    3a.
b.  1b.    2b.    3b.
c.  1c.    2c.    3c.
I want two new columns "Gene" and "effect" to generate this table:

ID  Gene  effect
a.  1.    1a.
a.  2.    2a.
a.  3.    3a.
b.  1.    1b.
b.  2.    2b.
b.  3.    3b.
c.  1.    1c.
c.  2.    2c.
c.  3.    3c.
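This wide-to-long reshape can be sketched with tidyr, assuming the table is a data frame named df (the names here are illustrative):

```r
library(tidyr)

df <- data.frame(ID    = c("a", "b", "c"),
                 Gene1 = c("1a", "1b", "1c"),
                 Gene2 = c("2a", "2b", "2c"),
                 Gene3 = c("3a", "3b", "3c"))

# Gather Gene1..Gene3 into a Gene/effect pair of columns;
# names_prefix strips "Gene" so the Gene column holds "1", "2", "3"
long <- pivot_longer(df, cols = starts_with("Gene"),
                     names_to = "Gene", names_prefix = "Gene",
                     values_to = "effect")
```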

How to change y axis range to percent (%) from number in barplot with R (sjp.frq)
Hi everybody!
Does anyone know how to change the y-axis range to percent (%) from counts in a barplot using the sjPlot package? I want to set the range to percent, from 0 to 100.
Here is the code that I used
sjp.frq(CAAPORA$Q2, type = c("bar"), sort.frq = c("desc"), geom.colors = "grey", show.na = TRUE, axis.labels = c( "Infantil", "Fund (Inicial)", "Fund (Final)"))
The y-axis shows counts rather than percentages (as you can see in the image).
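If sjp.frq doesn't offer a percent axis directly, a hedged alternative sketch is to compute the percentages yourself and plot with ggplot2, fixing the axis at 0-100 (the data object CAAPORA$Q2 is taken from the question):

```r
library(ggplot2)

# Percentage of responses per category
tab <- as.data.frame(prop.table(table(CAAPORA$Q2)) * 100)
names(tab) <- c("level", "pct")

ggplot(tab, aes(x = reorder(level, -pct), y = pct)) +  # sorted descending
  geom_col(fill = "grey") +
  scale_y_continuous(limits = c(0, 100)) +             # axis fixed at 0-100
  labs(x = NULL, y = "Percent")
```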

how do I plot interactions with continuous and categorical predictors in mixed models?
I'm very unsure how to plot mixed-level data consisting of a mixture of categorical and continuous predictors, so any help would be appreciated.
This is the data
library(lme4)

subject <- factor(rep(c(1,2,3,4,5,6), each=12))
dep <- c(0.3763244126,0.2185001692,0.4191841742,0.9812978664,0.7429273683,0.6254715486,0.6200958213,0.4693300191,0.1779032899,0.0035873980,0.8821949826,0.4818012617,0.0008013437,0.6280700732,0.7126500814,0.4984349359,0.2457996449,0.3085733312,0.5903398243,0.3704800352,0.8215325437,0.0445236221,0.1849731791,0.3670945817,0.0022268933,0.1630332691,0.9734050406,0.2638539758,0.8550054496,0.9413964085,0.4548943471,0.0440815873,0.5222098769,0.6553600784,0.6853486744,0.0571945074,0.0923124240,0.6976544929,0.9257440316,0.5658967043,0.0636543627,0.1038574059,0.0662497468,0.9165439918,0.5200087291,0.9528015053,0.5347318368,0.1848373057,0.9948602219,0.9633110918,0.1482162909,0.9000614029,0.0898618386,0.7975253051,0.8334557347,0.8629821099,0.0001795699,0.2488384889,0.6382902598,0.1103540359,0.2199716354,0.2737281912,0.5694398067,0.7940423761,0.4906677457,0.5191186895,0.4770589883,0.2823238128,0.2458788699,0.6363522802,0.0306954833,0.6979198116)
f1 <- factor(rep(c("Female","Male","Female","Male","Male","Female"), each=12))
f2 <- c(0.098788608,0.934606288,0.145045152,0.841969882,0.498234471,0.562897249,0.359740488,0.082046687,0.183987342,0.082418820,0.173424633,0.799291329,0.041450568,0.686708743,0.352092230,0.823550310,0.650857094,0.331705763,0.659111451,0.745187314,0.066165065,0.870759966,0.154977488,0.031703774,0.065251788,0.707452073,0.564604314,0.224798417,0.656363138,0.047954841,0.500513114,0.923316812,0.706629266,0.561530974,0.670860932,0.414969178,0.709973062,0.452946384,0.187624344,0.278656351,0.562138433,0.655193272,0.014868182,0.518697012,0.414113229,0.273464316,0.844080831,0.962636550,0.952739605,0.728627219,0.761122951,0.309977150,0.755239042,0.208627128,0.481429897,0.376021223,0.753871400,0.164842337,0.921081061,0.859677311,0.600462073,0.119193708,0.276722102,0.854752641,0.962710853,0.956277061,0.228313179,0.920405764,0.001594131,0.104930433,0.241548888,0.549643015)
f3 <- factor(rep(c("day1","day2","day3","day4"), each=3, times=6))
data <- data.frame(sub=subject, dep=dep, f1=f1, f2=f2, f3=f3)
m <- lmer(dep ~ f1*f2*f3 + (1|sub), data=data)
How can I plot the 3-way interaction f1*f2*f3? I tried using ggplot like this:

ggplot(aes(x=f2, y=dep, color=f1), data=data) + geom_smooth(method="lm") + facet_grid(".~f3")

but I don't think this is taking into account the fact that this is a mixed-model design, is it?
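One sketch that does respect the mixed-model structure: plot the model's own predictions, which include the subject random intercept, instead of geom_smooth fits to the raw data. This assumes the lmer model m defined above.

```r
library(lme4)
library(ggplot2)

# Conditional predictions, including the (1 | sub) random intercept
data$pred <- predict(m)

ggplot(data, aes(x = f2, y = dep, colour = f1)) +
  geom_point() +
  geom_line(aes(y = pred, group = sub)) +  # one line per subject
  facet_grid(. ~ f3)
```

For population-level curves (random effects averaged out), predict(m, re.form = NA) can be used instead.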

Change size of points in R using plot() with multiple added points()
The data is available here.
I am trying to generate the below plot (proportion of individuals ~ year), with the exception that I need to scale all the points (resF, resM, immF, and immM) to the number of observations for each column (resFN, resMN, immFN, and immMN). The code I use to make the above plot is below:
library(plotrix)
plot(resF ~ year, data=data, type="b", col="black", xlab="Settlement year",
     ylab="Number of individuals", bty="l", pch=17, ylim=c(0,1))
ablineclip(v=1993, col="grey95", lwd=14, y1=0)
ablineclip(v=1998, col="grey95", lwd=14, y1=0)
ablineclip(v=2005, col="grey95", lwd=14, y1=0)
ablineclip(v=2010, col="grey95", lwd=14, y1=0)
ablineclip(v=2014, col="grey95", lwd=14, y1=0)
points(resF ~ year, data=data, col="black", type="b", pch=17)
points(resM ~ year, data=data, col="grey", type="b", pch=16)
points(immF ~ year, data=data, col="red", type="b", pch=17)
points(immM ~ year, data=data, col="orange", type="b", pch=16)
legend("topright", c("Resident females", "Resident males", "Immigrant females", "Immigrant males"),
       col=c("black", "grey", "red", "orange"), pch=c(17, 16, 17, 16), box.lty=0)
I am plotting resF, and adding points for resM, immF, and immM. I want to scale the points according to the number of observations for each column. For example, resF would need to be rescaled according to the number of observations in column resFN, resM scaled to the number in resMN, etc. Based on what I have read, I should be able to rescale the points by adding

symbols(x=data$resFN, y=data$year, circles=sqrt(data$resFN/pi), inches=1/3, ann=F, bg="steelblue2", fg=NULL)

to my plot() code. I am encountering problems since I am adding multiple points() to the original plot() and I am also rescaling according to values in different columns. Any suggestions for how I can accomplish this?
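An alternative sketch that avoids symbols() entirely: cex= accepts a vector, so each points() call can size its markers by its own count column. The scaling constant (2 * count / max) is an arbitrary choice to keep symbols readable, and the column names are taken from the question.

```r
# Empty frame, then one sized-points layer per group
plot(resF ~ year, data = data, type = "n", xlab = "Settlement year",
     ylab = "Proportion of individuals", bty = "l", ylim = c(0, 1))
points(resF ~ year, data = data, pch = 17, col = "black",
       cex = 2 * data$resFN / max(data$resFN, na.rm = TRUE))
points(resM ~ year, data = data, pch = 16, col = "grey",
       cex = 2 * data$resMN / max(data$resMN, na.rm = TRUE))
points(immF ~ year, data = data, pch = 17, col = "red",
       cex = 2 * data$immFN / max(data$immFN, na.rm = TRUE))
points(immM ~ year, data = data, pch = 16, col = "orange",
       cex = 2 * data$immMN / max(data$immMN, na.rm = TRUE))
```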

Baseball definitions?
All, I have a programming assignment due where I need to build a regression model on baseball data, and I have no clue what the rules of baseball are! I have looked up a few baseball sites and Google in general for what the following three variables mean, but had no luck. Could you please tell me what they mean and whether they are good or bad in terms of scoring? Thanks a bunch! ...and please don't kick me off the site for such a weird question :)
- Walks allowed
- Hits allowed
- Home runs allowed

Linear Regression Model: Why is my test score higher than my training score?
I'm running a LinearRegression() model using sklearn. The dataset contains features for vehicles, such as engine size, cylinders, mpg, fuel type, and CO2 emissions. The target variable is CO2 emissions. I'm splitting the data 70/30. When I calculate the scores for the training and test sets, the training set scores at 0.992 and the test set scores at 0.993. (The high accuracy for both seems to be due to the utilization of dummy variables for fuel type. I reran the model without the dummy variables and the accuracy dropped a lot to 0.867 for train and 0.870 for test. Again, test is scoring higher).
I've looked around on the web and haven't found any helpful insights yet. I've seen others say it may be due to the dataset being small, but this dataset has over 18,000 observations (13,090 for train and 5,611 for test). Since the training and test scores are nearly identical, is this an issue? What could be causing the test score to be higher?
The data for this ML project can be found here. I'm using the 2000-2018 fuel consumption data.

What criteria does the leaps and regsubsets function use in order to select the best variables for the regression? (R language)
I was just wondering what criteria are used by the leaps package and the regsubsets function to find the best combination of regressors for a linear regression with the exhaustive search method.
Does the algorithm try to maximize the value of R², minimize the standard error, maximize the F-statistic, etc.?
I just wasn't able to find what criteria are being used when I was reading the package documentation.
Sorry for any inconvenience in this question.
EDIT: Also, is there a way to filter results in order to remove models that contain regressors that are not statistically significant? (For example, remove models in which some of the regressors have p-values >= 5%.)
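For what it's worth: within each model size, regsubsets keeps the subset with the lowest residual sum of squares (equivalently, the highest R², since size is fixed); the criteria for comparing across sizes are reported by summary(). A quick illustration on a built-in dataset:

```r
library(leaps)

# Best subsets of up to 5 predictors for mpg
fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = 5)
s <- summary(fit)

# For each size, the kept subset minimizes RSS; these criteria
# are what you then compare across sizes:
s$rsq    # R-squared
s$adjr2  # adjusted R-squared
s$cp     # Mallows' Cp
s$bic    # BIC
```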

Plotting prediction intervals for timeseries using Python matplotlib
I'm trying to plot prediction intervals of my prediction model on time series using errorbar() and fill_between() from matplotlib.pyplot. The model is an MLP implemented in Keras. After running model.predict() on test data of 138 time series with 56 time steps each, I calculated the variance and mean of each prediction and stored them in two ndarrays of shape (56, 138):

mean.shape = (56, 138)
var.shape = (56, 138)
So each column of these two ndarrays, mean and var, has the mean and variance of one time series. The thing is that I'm not sure if I'm plotting the prediction intervals correctly! Here are my two tries:

for ts in range(number_of_timeseries):
    pyplot.errorbar(x=np.arange(0, 56, 1), y=mean[:, ts], yerr=var[:, ts], ecolor='red', elinewidth=0.5)
    pyplot.show()
And the second one uses fill_between():

for ts in range(number_of_timeseries):
    x = np.linspace(0, 56, 56)
    pyplot.plot(x, mean[:, ts], 'k', color='#3F7F4C')
    pyplot.fill_between(x, mean[:, ts] - var[:, ts], mean[:, ts] + var[:, ts],
                        alpha=1, edgecolor='#3F7F4C', facecolor='#7EFF99', linewidth=0)
    pyplot.show()
Am I plotting them correctly?!

Errors using boot() and boot.ci (parametric vs. nonparametric)
I'm pretty new to R and bootstrapping, so I'm having trouble understanding where this is going wrong. Below is my full code I am using. For this dataset I want to do both nonparametric and parametric bootstrap analysis concerning the log of the sample mean as an estimator of the log of the population mean. It seems to run fine for the nonparametric, but doesn't seem to work for the parametric.
# Data and necessary library
library(boot)
data("aircondit")

# Define "statistic" function.
logmean <- function(mydat, indices) {
  log(mean(mydat[indices]))
}

# The first problem requires us to use the nonparametric bootstrap, which we will do below.
boot.outnp <- boot(as.matrix(aircondit), logmean, R=999)
# Confidence interval
boot.ci(boot.outnp)

# Next we'll do the same process, but for the parametric bootstrap
boot.outp <- boot(as.matrix(aircondit), logmean, R=999, sim='parametric')
# Confidence interval
boot.ci(boot.outp)
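The likely sticking point is that with sim = "parametric", boot() does not resample rows at all: you must supply ran.gen (how to simulate a dataset) and mle (the fitted parameters), and the statistic is then called on the simulated data without an indices argument. A hedged sketch, assuming an exponential model for the aircondit hours (the distributional choice is an assumption on my part, not something boot infers):

```r
library(boot)
data("aircondit")

# Statistic for the parametric case: data argument only, no indices
logmean.p <- function(d) log(mean(d$hours))

# MLE of the exponential rate for the observed failure times
rate.hat <- 1 / mean(aircondit$hours)

# How to generate one simulated dataset from the fitted model
gen <- function(d, mle) {
  d$hours <- rexp(nrow(d), rate = mle)
  d
}

boot.outp <- boot(aircondit, logmean.p, R = 999,
                  sim = "parametric", ran.gen = gen, mle = rate.hat)
boot.ci(boot.outp, type = c("norm", "basic"))
```

(BCa intervals are not requested here because they rely on empirical influence values that the parametric setup does not provide.)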

Unnesting a dataframe within a dataframe
I have been trying to calculate confidence intervals for binomial distributions through the Hmisc R package. Specifically, I used the binconf function which does its job perfectly.
library(plyr)
library(Hmisc)

Student <- c("A", "B", "C")
TP <- c(13, 36, 43)
obs.pos <- c(16, 37, 48)
df <- data.frame(Student, TP, obs.pos)

df1 <- df %>%
  plyr::mutate(Sen = binconf(TP, obs.pos, alpha = 0.05, method = "wilson", return.df = TRUE))
df1 %>% View()

#  Student TP obs.pos Sen.PointEst Sen.Lower Sen.Upper
#1       A 13      16    0.8125000 0.5699112 0.9340840
#2       B 36      37    0.9729730 0.8617593 0.9986137
#3       C 43      48    0.8958333 0.7783258 0.9546783
Unfortunately, I feel that the function creates a data frame within my original data frame and that does not allow me to apply basic functions on my output anymore. For instance, I cannot select columns (by using dplyr) or round digits because R is not able to find the created columns (such as Sen.PointEst, Sen.Lower, Sen.Upper). Below, the structure of my output.
df1 %>% str()
#'data.frame': 3 obs. of 4 variables:
# $ Student: Factor w/ 3 levels "A","B","C": 1 2 3
# $ TP     : num 13 36 43
# $ obs.pos: num 16 37 48
# $ Sen    :'data.frame': 3 obs. of 3 variables:
#  ..$ PointEst: num 0.812 0.973 0.896
#  ..$ Lower   : num 0.57 0.862 0.778
#  ..$ Upper   : num 0.934 0.999 0.955
I would like to have all the columns at the first level of my output so that I can easily apply all the normal functions to my output.
Thanks for any help!
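One common fix, sketched against the df1 object above: do.call(data.frame, df1) rebuilds the data frame from its columns, which splices the nested Sen data frame into ordinary top-level columns named Sen.PointEst, Sen.Lower, and Sen.Upper:

```r
# Flatten the nested "Sen" data frame into top-level columns
df2 <- do.call(data.frame, df1)
str(df2)  # all columns now at the first level

# dplyr then sees them like any other column, e.g.:
# df2 %>% dplyr::select(Student, Sen.PointEst) %>%
#   dplyr::mutate(Sen.PointEst = round(Sen.PointEst, 3))
```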