How to subset from aov summary in R?
Maybe this is a simple question, but I'm wondering how I can subset the Df and F values for the terms appearing in an aov summary. For example, using the base R built-in dataset npk, how can I extract the residual and other Dfs and F values that appear in the summary of the following model?
```r
fit <- summary(aov(yield ~ block + N * P + K, data = npk))  # example is fully reproducible
```
P.S. I'm looking for base R solutions.
1 answer

The fit output is a list of length 1 (check with str(fit)). We extract the single element with [[ and then use $ or [[ to pull out the components:

```r
fit[[1]]$Df
#[1]  5  1  1  1  1 14   # where 14 is the Residuals df

fit[[1]]$`F value`
#[1]  4.391098 12.105541  0.537330  6.088639  1.361073        NA
```
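If you want the term names alongside the statistics, the same pieces can be assembled into a small data frame. A sketch, reusing fit from the question (trimws() only strips the whitespace padding that the summary adds to the row names):

```r
tab <- fit[[1]]
data.frame(term = trimws(rownames(tab)),
           Df = tab$Df,
           F.value = tab$`F value`)
```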
See also questions close to this topic

R: proportions of a range of numbers sum to > 1
I am trying to determine the proportion of a range of numbers for subsets on a long dataframe. (The aim is to write a function.)
```r
below.green <- mean(results$Value < 0.04)
green.amber <- mean(results$Value > 0.04:0.4)
amber.red   <- mean(results$Value > 0.4:4)
red.plus    <- mean(results$Value > 4)
meanresults <- c(below.green, green.amber, amber.red, red.plus)
```
e.g. 1:

```r
Values <- c(0.1501, 0.1276, 0.0838, 0, 0, 0.4544, 0.2573, 0.1788, 1.291, 1.4737,
            1.8191, 0.5986, 4.5846, 4.9056, 2.4809, 2.1021, 3.3741, 0.0085, 0.0302,
            0.0033, 0.0405, 0, 0, 0, 0, 0, 0.3262, 0.0462, 0.2536, 0.3661, 0.4311,
            0.4719, 0.8482, 2.3731, 0.656, 0.3967, 0.0399, 0.0302, 0.2723, 0.3833,
            0.5907, 0.3725, 0.0258, 0.0483)

sum(meanresults)
#[1] 1.247892
```
e.g. 2:

```r
Values2 <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
             0.0726, 0.0077, 0.0444)

sum(meanresults)
#[1] 1
```
On some subsets the returned proportions sum to > 1 (see e.g. 1). On other sites the proportions total 1 (e.g. 2); this seems to happen only on sites where Value < 0.4. Where am I going wrong?
I have looked at multiple QAs on the site and haven't found similar examples.
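For what it's worth, `:` does not express a range here: 0.04:0.4 evaluates to just 0.04, and 0.4:4 to c(0.4, 1.4, 2.4, 3.4) (which then gets recycled in the comparison), so the bands overlap and the proportions can sum to more than 1. A sketch of disjoint bands instead (the exact boundary handling is an assumption):

```r
below.green <- mean(results$Value < 0.04)
green.amber <- mean(results$Value >= 0.04 & results$Value < 0.4)
amber.red   <- mean(results$Value >= 0.4  & results$Value < 4)
red.plus    <- mean(results$Value >= 4)
# the four bands now partition the values, so the proportions sum to 1
```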

Text processing in R
I have a text file with many lines (first two are shown below)
```
1: 146 189 229
2: 191 229
```
I need to convert to the output
```
1 146
1 189
1 229
2 191
2 229
```
I have read the lines in a loop, removed the ":", and split by " ".
```r
fbnet <- readLines("0.egonet")
for (line in fbnet) {
  line <- gsub(":", "", line)
  line <- unlist(strsplit(line, " ", fixed = TRUE), use.names = FALSE)
  friend = line[1]
}
```
How do I proceed from here?
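One way to finish this off, sketched under the file format shown above: pair the leading id with each remaining number on the line.

```r
fbnet <- readLines("0.egonet")
pairs <- do.call(rbind, lapply(fbnet, function(line) {
  parts <- unlist(strsplit(gsub(":", "", line), " ", fixed = TRUE))
  parts <- parts[parts != ""]               # guard against stray empty strings
  cbind(id = parts[1], friend = parts[-1])  # the id is recycled against the friends
}))
```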

Convert a long list to a binary dataframe having duplicates
According to this question and answer it is possible to convert a long list to a binary dataframe.
However, how can this be done with a data frame that contains the same value more than once per user?
Example of dataframe:
```r
d_long <- data.frame(
  nameid = c("sally", "sally", "sally", "sally", "Robert", "annie", "annie", "annie"),
  value  = c("product1", "ra", "ent", "ra", "ra", "ra", "product1", "product1"))
```
```
  nameid    value
1  sally product1
2  sally       ra
3  sally      ent
4  sally       ra
5 Robert       ra
6  annie       ra
7  annie product1
8  annie product1
```
The expected output is this:
```r
d_exist <- data.frame(nameid = c("sally", "Robert", "annie"),
                      product1 = c(1, 0, 1),
                      ra = c(1, 1, 1),
                      ent = c(1, 0, 0))
```
```
  nameid product1 ra ent
1  sally        1  1   1
2 Robert        0  1   0
3  annie        1  1   0
```
But when I try this:
```r
d_long %>%
  group_by(nameid, value) %>%
  mutate(count = n()) %>%
  ungroup() %>%
  spread(value, count, fill = 0) %>%
  as.data.frame()
```
I receive the error:
```
Error: Duplicate identifiers for rows (7, 8), (2, 4)
```
Is it right to use only

```r
d_long[!duplicated(d_long), ]
```
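For reference, one base R route that tolerates the duplicates: tabulate nameid against value and convert the counts to presence/absence (a sketch using the d_long from the question):

```r
tab <- table(d_long$nameid, d_long$value)      # counts; duplicates collapse naturally
binary <- as.data.frame.matrix((tab > 0) * 1)  # 1 if the user ever had the value
```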

PHP code inserts values, but when called via a function the result is different
I'm trying to execute code by calling a function, but the result is different from executing the code by itself.
```php
$arr1 = array(
    0 => array(
        "id" => 5,
        "SKS" => 2,
        "assignStatus" => 0
    )
);
$arr2 = array(
    0 => array(             // Monday
        0 => array(         // session 1
            0 => array(     // room 405
                0 => "",    // draft
                1 => "",    // sks
                2 => 0      // count3sks
            )
        )
    )
);

// execute code with function
assignNilai($arr2, 0, 0, 0, $arr1, 0);
echo $arr2[0][0][0][0]." ".$arr2[0][0][0][1]." ".$arr2[0][0][0][2];
echo "<br>";

// execute code without function
$arr2[0][0][0][0] = $arr1[0]['id'];
$arr2[0][0][0][1] = $arr1[0]['SKS'];
if ($arr2[0][0][0][1] == 3) {
    $arr2[0][0][0][2] = $csp[0][0][0][2] + 1;
}
echo $arr2[0][0][0][0]." ".$arr2[0][0][0][1]." ".$arr2[0][0][0][2];

function assignNilai($arr2, $hari, $sesi, $ruang, $arr1, $draft) {
    $arr2[$hari][$sesi][$ruang][0] = $arr1[$draft]['id'];
    $arr2[$hari][$sesi][$ruang][1] = $arr1[$draft]['SKS'];
    if ($arr2[$hari][$sesi][$ruang][1] == 3) {
        $arr2[$hari][$sesi][$ruang][2] = $csp[$hari][$sesi][$ruang][2] + 1;
    }
}
```
I'm trying to achieve, using the function, what the manually executed code displays. How do I do it?

Loops with a condition - Processing
Say that a ball is falling down the screen and resets once it hits the border as such:
```java
float BallY = 50;  // y value of the ball
float BallX = 260; // x value of the ball

void setup() {
  size(512, 348); // width and height of screen
}

void draw() {
  background(255);
  fill(0);
  ellipse(BallX, BallY, 15, 15); // ball that will fall
  BallY++;                       // ball's y value increases each frame
  if (BallY > height) {          // if ball's y value is greater than the screen
    BallY = 0;                   // reset the y value of the ball back to 0
  }
}
```
How can I make my "if statement" a "for loop" that creates for example a square on the top left of the screen and creates another one directly beside it each time the ball reaches the end of the screen?
Because my logic was something like:
```java
for (float BallY = 0, BallY < height, BallY++) {
  rect(20, 20, 20, 20);
}
```
But I know this spits out an error... one of my mentors recommended using nested for loops but I am not sure how exactly to put it together. So what is the best method to approach this?

How to add a callback into a function with JavaScript?
I want to create a function that wraps swal inside, then call that function with another job...
here is my code:
```javascript
function confirmSwal(ket, callback) {
    swal({
        title: ket,
        showCancelButton: true,
        cancelButtonText: 'Batal',
        confirmButtonClass: 'btn-success',
        confirmButtonText: 'Hapus',
        closeOnConfirm: true
    }, function() {
        callback();
    });
}

$("#hapusBulk").click(function() {
    confirmSwal("Apakah Anda Yakin Hapus Data Terpilih?", function() {
        alert("Asd");
    });
});
```
but the alert doesn't work. Please help.

Identifying influential points in count model in R
I'm running count models in R; e.g., negative binomial models using MASS::glm.nb().
I'm finding I have a few very extreme scores that have a strong influence over the model.
Normally, in a linear regression, I could use popular tools to find values such as Cook's distance to identify highly influential points. However, I don't seem to see any similar analogue for count models such as negative binomial. Does anyone know of a procedure for flagging such influential points that is implemented in R?
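For what it's worth, glm.nb() returns an object that inherits from "glm", so the standard influence measures should apply to it directly; a sketch on made-up data:

```r
library(MASS)

# toy negative binomial data, just to have a fitted model to inspect
set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- rnbinom(100, mu = exp(0.5 + 0.3 * d$x), size = 2)
m <- glm.nb(y ~ x, data = d)

cd <- cooks.distance(m)     # one Cook's distance per observation
which(cd > 4 / length(cd))  # a common rule-of-thumb cutoff for flagging points
```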

Gamma regression only intercept
I am new to Python. I am trying to fit a gamma regression, hoping to obtain estimates similar to R's, but I cannot work out the Python syntax and it generates an error. Any idea how to solve it?
My R code:
```r
set.seed(1)
y = rgamma(18, 10, .1)
print(y)
# [1]  76.67251 140.40808 138.26660 108.20993  53.46417 110.61754 119.11950 113.57558  85.82045  71.96892
#[11]  76.81693  86.00139  93.62010  69.49795 121.99775 114.18707 125.43608 120.63640

# Option 1
model = glm(y ~ 1, family = Gamma)
summary(model)

# Option 2
# x = rep(1, 18)
# summary(glm(y ~ x, family = Gamma))
```
Output:
```
summary(model)

Call:
glm(formula = y ~ 1, family = Gamma)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.57898  -0.24017   0.07637   0.17489   0.34345  

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.009856   0.000581   16.96 4.33e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Gamma family taken to be 0.06255708)

    Null deviance: 1.1761  on 17  degrees of freedom
Residual deviance: 1.1761  on 17  degrees of freedom
AIC: 171.3

Number of Fisher Scoring iterations: 4
```
Python code:

```python
import numpy as np
import statsmodels.api as sm

y = [76.67251, 140.40808, 138.26660, 108.20993, 53.46417, 110.61754,
     119.11950, 113.57558, 85.82045, 71.96892, 76.81693, 86.00139,
     93.62010, 69.49795, 121.99775, 114.18707, 125.43608, 120.63640]
x = np.repeat(1, 18)

model = sm.GLM(x, y, family=sm.families.Gamma()).fit()
print(model.summary())
```
I expected an output similar to R's.

python regression analysis project
We have a dataset about cars (age of the car, engine volume, etc.), which we converted into a txt file. We need a program that reads the txt file and then estimates the price of a car by regression analysis. We want the program to take the car's information through its interface and print the price as a result.

Looping apply function over list of dataframes
I have looked through various Overflow pages with similar questions (some linked) but haven't found anything that seems to help with this complicated task.
I have a series of data frames in my workspace and I would like to loop the same function (rollmean or some version of that) over all of them, then save the results to new data frames.
I have written a couple of lines of code to generate a list of all data frames and a for loop that should iterate an apply statement over each data frame; however, I'm having problems trying to accomplish everything I'm hoping to achieve (my code and some sample data are included below):
1) I would like to restrict the rollmean function to all columns except the 1st (or first several), so that the 'info' column(s) do not get averaged. I would also like to add these column(s) back to the output data frame.

2) I want to save the output as a new data frame (with a unique name). I do not care whether it is saved to the workspace or exported as an xlsx, as I already have batch import code written.

3) Ideally, I would like the resulting data frame to have the same number of observations as the input, whereas rollmean shrinks your data. I also do not want these to become NA, so I don't want to use fill = NA. This could be accomplished by writing a new function, by passing type = "partial" in rollmean (though that still shrinks my data by 1 in my hands), or by starting the rolling mean on the nth+2 term and binding the non-averaged nth and nth+1 terms to the resulting data frame. Any way is fine. (See the picture for detail; it illustrates what the latter would look like.)

My code only accomplishes parts of these things, and I cannot get the for loop to work as a whole, though I can get parts to work if I run them on single data frames.
Any input is greatly appreciated because I'm out of ideas.
```r
# reproducible data frames
a = as.data.frame(cbind(info = 1:10, matrix(rexp(200), 10)))
b = as.data.frame(cbind(info = 1:10, matrix(rexp(200), 10)))
c = as.data.frame(cbind(info = 1:10, matrix(rexp(200), 10)))
colnames(a) = c("info", 1:20)
colnames(b) = c("info", 1:20)
colnames(c) = c("info", 1:20)

# identify all data frames for looping rollmean
dflist = as.list(ls()[sapply(mget(ls(), .GlobalEnv), is.data.frame)])

# for loop to create rolling average and save as new data frame
for (j in 1:length(dflist)) {
  list = as.list(ls()[sapply(mget(ls(), .GlobalEnv), is.data.frame)])
  new.names = as.character(unique(list))
  smoothed = as.data.frame(
    apply(X = names(list), MARGIN = 1, FUN = rollmean, k = 3, align = 'right'))
  assign(new.names[i], smoothed)
}
```
I also tried a nested apply approach but couldn't get it to call the rollmean/rollapply function (similar to the issue linked here), so I went back to for loops; but if someone can make this work with nested applies, I'm down!
Picture is ideal output: Top is single input dataframe with colored boxes demonstrating a rolling average across all columns, to be iterated over each column; bottom is ideal output with colors reflecting the location of output for each colored window above
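A sketch of one direction (assuming the data frames a, b, and c from the sample data above, and the zoo package): zoo::rollapply() with partial = TRUE keeps the output the same length as the input, and skipping the first column leaves 'info' untouched.

```r
library(zoo)

smooth_df <- function(df, k = 3) {
  out <- df
  out[, -1] <- apply(df[, -1], 2, function(col)
    rollapply(col, k, mean, partial = TRUE, align = "right"))
  out  # same dimensions as the input, 'info' column preserved
}

smoothed_list <- lapply(mget(c("a", "b", "c")), smooth_df)
```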

PySpark: Search For substrings in text and subset dataframe
I am brand new to pyspark and want to translate my existing pandas/python code to PySpark. I want to subset my dataframe so that only rows containing the specific key words I'm looking for in the 'original_problem' field are returned. Below is the Python code I tried in PySpark:
```python
def pilot_discrep(input_file):
    df = input_file
    searchfor = ['cat', 'dog', 'frog', 'fleece']
    df = df[df['original_problem'].str.contains(''.join(searchfor))]
    return df
```
When I try to run the above, I get the following error:
```
AnalysisException: u"Can't extract value from original_problem#207: need struct type but got string;"
```

subsetting a dataframe with conditions for a lot of columns
For subsetting dataframes with multiple conditions, one could use
```r
# my condition
x = 1
a = dat[dat[, 1] > x & dat[, 2] > x, ]
```
This time I'm facing quite a lot of columns that I have to check. I tried the following examples but couldn't find a way to get it working:
```r
a = dat[dat[, 1:10] > x, ]
d = dat[which(dat$V1:dat$V10 > x)]
c = subset(dat, dat$V1:dat$V10 > x)
```
They basically all produce the same error:
```
numerical expression has XXX elements: only the first used
```
Does anyone know a way around? Thanks in advance!
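For reference, a common way around this is to collapse the per-column comparisons into one logical value per row, e.g. with rowSums() (a sketch using the first 10 columns, as in the attempts above):

```r
x <- 1
a <- dat[rowSums(dat[, 1:10] > x) == 10, ]  # rows where all 10 columns exceed x
# equivalently:
a <- dat[apply(dat[, 1:10] > x, 1, all), ]
```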

R one-way ANOVA: extracting the p_value
I'm trying to do a one-way ANOVA on several rows of a dataset and extract the p_value to use afterwards.
Here's what I've done:
```r
anova <- function(x) {
  summary(aov(x ~ bt.factor))[[1]]["Pr(>F)"]
}
anv.pval <- apply(golubALL, 1, anova)
```
With this formula I'm able to extract the p-value, but it comes with other elements:
```
$`1414_at`
            Pr(>F)
bt.factor   0.7871
Residuals
```
What I would like to have as a result is only the p-value itself, in a list. How could I extract it?
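Indexing one level further pulls out just the number; a sketch of the modified function (object names as in the question):

```r
anova_p <- function(x) {
  summary(aov(x ~ bt.factor))[[1]][["Pr(>F)"]][1]  # first row is the factor term
}
anv.pval <- apply(golubALL, 1, anova_p)            # plain numeric vector of p-values
```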

r repeated measures output error (main effects shown twice, between subject shown as within)
I am having trouble discerning the error strata output from a repeated measures anova in R, as they appear to be doing something funky. I have a repeated measures anova where each participant gets a score for two different roles and a score for two different valences (Role and Valence are dichotomous categorical within subjects factors), and I am including them in a model that has Gender as a between subjects factor (also dichotomous categorical).
My model is as follows:
```r
summary(aov(data = data, score ~ Role * Valence * Gender + Error(Subject_ID / (Role*Valence))))
```
The output looks unusual:
```
summary(aov(data = data,
            score ~ Role * Valence * Gender + Error(Subject_ID / Role*Valence)))

Error: Subject_ID
       Df  Sum Sq Mean Sq
Gender  1 0.06647 0.06647

Error: Valence
        Df Sum Sq Mean Sq
Valence  1  6.774   6.774

Error: Subject_ID:Role
     Df  Sum Sq Mean Sq
Role  1 0.04595 0.04595

Error: Subject_ID:Valence
               Df  Sum Sq Mean Sq
Valence:Gender  1 0.06981 0.06981

Error: Subject_ID:Role:Valence
             Df Sum Sq Mean Sq
Role:Valence  1  1.329   1.329

Error: Within
                     Df Sum Sq Mean Sq F value Pr(>F)
Role                  1   0.00  0.0000   0.000 0.9986
Gender                1   0.08  0.0781   0.382 0.5371
Role:Valence          1   0.65  0.6457   3.159 0.0767
Role:Gender           1   0.04  0.0354   0.173 0.6777
Valence:Gender        1   0.04  0.0443   0.217 0.6420
Role:Valence:Gender   1   0.24  0.2447   1.197 0.2749
Residuals           252  51.50  0.2044
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```
I don't understand why the F and P values only displayed for the within error strata, why my main effects are appearing twice (e.g., Role under within and Subject_ID:Role), or why my between subjects variable is showing up in the Within error strata. I'm not sure how to begin troubleshooting this, so any insights would be greatly appreciated.
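One thing worth checking: the call that produced the output above has Error(Subject_ID / Role*Valence), without the parentheses shown in the model as first written. In R's formula syntax, / and * associate left to right, so that term parses as (Subject_ID/Role)*Valence rather than nesting both within-subject factors inside Subject_ID. A sketch of the parenthesised form:

```r
summary(aov(score ~ Role * Valence * Gender +
              Error(Subject_ID / (Role * Valence)),
            data = data))
```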

How to do ANOVA test to compare the performance of different clustering algorithms
I am trying to compare the performance of the different clustering algorithm results (kmeans++ and hierarchical agglomerative clustering) applied to the same dataset. I have 4 different results in total (2 of them have KPCA preprocessing, 2 of them do not have), that is why I chose ANOVA to conclude which one yielded the best result.
However, I do not know what input to give the ANOVA test. Can anyone suggest which data I should take from the algorithm results to provide as input? (I am using Python and scikit-learn.)