Can the Kruskal-Wallis test be used to test the significance of multiple groups within multiple factors?
I have tried to read what I can on the Kruskal-Wallis test, and while I have found some useful information, I still cannot find the answer to my question. I am trying to use the Kruskal-Wallis test to determine the significance of multiple groups, within multiple factors, in predicting a set of dependent variables.
Here is an example of my data:
ID Date Point Season Grazing Cattle_Type AvgVOR PNatGr NatGrHt
181 7/21/2015 B22 late pre Large 0.8 2 20
182 7/21/2016 B32 early post Small 1.0 4 24
In this example, my dependent variables are "AvgVOR", "PNatGr" and "NatGrHt", while the independent variables (factors) are "Season", "Grazing", and "Cattle_Type". As you can see, each of my factors has two group levels.
What I am trying to accomplish is to run a nonparametric test that looks at the separate and combined importance of my factor groups for each of my dependent variables. I chose Kruskal-Wallis, and it seems to work for testing one of my grouping factors at a time.
Here is the result for AvgVor ~ Grazing
kruskal.test(AvgVOR ~ Grazing, data = Veg)
Kruskal-Wallis rank sum test
data: AvgVOR by Grazing
Kruskal-Wallis chi-squared = 94.078, df = 1, p-value < 2.2e-16
This tells me that AvgVOR differs significantly according to whether it was recorded pre- or post-grazing.
Is there a way to build a similar model using Kruskal-Wallis that includes all of my grouping factors, even if I have to run a separate one for each dependent variable?
I attempted the following code, but it is flawed.
lapply(Veg[,c("Grazing", "Cattle_Type", "Season")]),function(AvgVOR) kruskal.test(AvgVOR ~ Veg)
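Not part of the original question, but a hedged sketch of the intended per-factor loop, translated to Python with `scipy.stats.kruskal` (the toy data below is invented and only mimics the shape of the Veg data frame):

```python
from scipy.stats import kruskal

# Toy stand-in for the Veg data frame; all values invented for illustration.
avg_vor = [0.8, 0.9, 0.7, 1.0, 1.2, 1.1]
factors = {
    "Grazing":     ["pre", "pre", "pre", "post", "post", "post"],
    "Season":      ["late", "early", "late", "early", "late", "early"],
    "Cattle_Type": ["Large", "Small", "Large", "Small", "Large", "Small"],
}

results = {}
for name, labels in factors.items():
    levels = sorted(set(labels))
    # One sample of the response per factor level.
    groups = [[y for y, g in zip(avg_vor, labels) if g == lvl] for lvl in levels]
    stat, p = kruskal(*groups)
    results[name] = p
```

This still tests one factor at a time; Kruskal-Wallis itself has no notion of combined factors or interactions, so each factor gets its own test.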
See also questions close to this topic

R - Looping inside a case_when statement
I'm trying to automate the following code using a for loop:
mydata$DISCCURVE <- case_when(
  grepl(tolower(curve_name_1), tolower(mydata$REFERENCIA)) == T ~ "discount1",
  grepl(tolower(curve_name_2), tolower(mydata$REFERENCIA)) == T ~ "discount2",
  grepl(tolower(curve_name_3), tolower(mydata$REFERENCIA)) == T ~ "discount3",
  grepl(tolower(curve_name_4), tolower(mydata$REFERENCIA)) == T ~ "discount4",
  grepl(tolower(curve_name_5), tolower(mydata$REFERENCIA)) == T ~ "discount5",
  grepl(tolower(curve_name_6), tolower(mydata$REFERENCIA)) == T ~ "discount6"
)
writing something like this:
mydata$DISCCURVE <- case_when(
  for (i in 1:5) {
    eval(parse(text = paste0("grepl(tolower(curve_name_", i, "), tolower(mydata$REFERENCIA)) == T ~ 'discount", i, "',")))
  }
  grepl(tolower(curve_name_6), tolower(mydata$REFERENCIA)) == T ~ "discount6"
)
But it's not working. Any alternatives? Is it possible to use a for loop inside the definition of a variable? Thanks!
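Not from the question itself, but a hedged sketch of the same first-match-wins logic done with an ordinary loop, in Python (the curve-name patterns below are hypothetical stand-ins for `curve_name_1`..`curve_name_6`):

```python
import re

# Hypothetical patterns standing in for curve_name_1 .. curve_name_6.
curve_names = ["swap", "libor", "ois", "bond", "repo", "fx"]
references = ["EUR SWAP 10Y", "USD OIS", "GBP Bond idx"]

def discount_label(reference):
    # First matching pattern wins, mirroring case_when's top-down evaluation.
    for i, name in enumerate(curve_names, start=1):
        if re.search(name.lower(), reference.lower()):
            return f"discount{i}"
    return None  # no pattern matched

labels = [discount_label(r) for r in references]
```

The point of the sketch: instead of generating branches with `eval`/`parse`, iterate over the patterns and return on the first hit.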

R: Convert values into pipe-delimited format
I'm trying to create a RedCap data dictionary from an SPSS output. SPSS lists the allowed values, or factors, for each variable like this:
SEX      0 Male   1 Female
LANGUAGE 1 English   2 Spanish   3 Other   6 Unknown
How can I convert the above to this format for RedCap:
Variable   Values
SEX        0, Male | 1, Female
LANGUAGE   1, English | 2, Spanish | 3, Other | 6, Unknown
The language I'm best with is R. Thanks in advance!
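Not part of the question, but the conversion itself is a small join once the values are parsed into (code, label) pairs; a hedged Python sketch (the dictionary layout is an assumption about how the SPSS output would be parsed):

```python
# Assumed intermediate form: per-variable lists of (code, label) pairs.
spss_values = {
    "SEX": [(0, "Male"), (1, "Female")],
    "LANGUAGE": [(1, "English"), (2, "Spanish"), (3, "Other"), (6, "Unknown")],
}

# REDCap choice strings are "code, label | code, label | ...".
redcap = {
    var: " | ".join(f"{code}, {label}" for code, label in pairs)
    for var, pairs in spss_values.items()
}
```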

Looping over each row and comparing values in multiple columns of the row being iterated
I have the following dataframe
yearly
ID   Jan Feb March April May Jun Jul Aug Sept Oct Nov Dec
ABC  0   0   0     1     0   0   0   0   1    0   0   0
DEF  0   0   0     1     1   0   0   0   1    0   0   0
GHI  0   0   0     1     0   1   0   0   0    1   0   0
MNO  0   0   0     1     0   1   0   0   1    0   0   0
QAL  0   1   1     1     0   0   1   0   0    1   0   0
I want to iterate through each row and find each column (containing a 1) after which the next three columns are 0. I would like to get something like this, which lists the months followed by nothing but 0s for at least three months:
ID   col1  col2
ABC  April Sept
DEF  May   Sept
GHI  Jun   N/A
MNO  Sept  N/A
QAL  N/A   N/A
I had figured out how to iterate through the vector and get the index:
vec2 <- names(yearly)
# vec is each row of yearly
for (i in 1:length(vec)) {
  if (vec[i] == 1) {
    if (vec[i+1] + vec[i+2] + vec[i+3] == 0) {
      # I think R automatically takes care of the out-of-bounds index
      print(vec2[1])
      print(vec2[i+1])
    }
  }
}
But I am finding it a bit difficult to link it to the original dataframe and get the columns. Any functions or resources that can guide me?
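Not part of the question, but the "1 followed by three 0s" scan can be sketched compactly; a hedged Python version using the two example rows from the table (the data shapes are taken from the question, the function name is invented):

```python
# Month names and two of the example rows from the question.
months = ["Jan", "Feb", "March", "April", "May", "Jun",
          "Jul", "Aug", "Sept", "Oct", "Nov", "Dec"]
rows = {
    "ABC": [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0],
    "DEF": [0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0],
}

def months_before_three_zeros(vals):
    hits = []
    for i, v in enumerate(vals):
        # Need a 1 here and three 0s afterwards, staying inside the row.
        if v == 1 and i + 3 < len(vals) and sum(vals[i + 1:i + 4]) == 0:
            hits.append(months[i])
    return hits

result = {rid: months_before_three_zeros(v) for rid, v in rows.items()}
```

On the two rows above this reproduces the expected output (April/Sept for ABC, May/Sept for DEF).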

Update weights in multiple regression which is based on neural network
How do you update the weights in a multiple regression that is based on a neural network? What is the equation?
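Not part of the question, but the standard answer is gradient descent on the mean squared error: w <- w - eta * dL/dw. A hedged sketch on synthetic data (all values invented):

```python
import numpy as np

# Synthetic multiple-regression data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # 200 samples, 3 predictors
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)

w = np.zeros(3)   # weights to learn
lr = 0.1          # learning rate (eta)
for _ in range(1000):
    grad = X.T @ (X @ w - y) / len(y)  # gradient of the mean squared error
    w = w - lr * grad                  # the update rule: w <- w - eta * dL/dw
```

For a linear model this converges to the ordinary least-squares solution; in a multi-layer network the same update is applied per layer via backpropagation.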

How to refresh data automatically on excel graph
I work with daily checks using an Excel graph, and every day the Excel chart is generated with new data.
The issue is that I have to click Refresh manually, whereas I would prefer it to happen automatically.
I have tried going into Settings/Advanced, but it hasn't worked. I have obviously googled this too.

ANOVA data argument error
I am attempting to run an analysis of covariance on a mixed-effects model. My data set is as follows:
> str(try)
Classes 'tbl_df', 'tbl' and 'data.frame': 864 obs. of 7 variables:
 $ Site      : chr "BISC1" "BISC1" "BISC1" "BISC1" ...
 $ SET       : Factor w/ 3 levels "SET1","SET2",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ ARM       : chr "A_0001" "A_0001" "A_0001" "A_0001" ...
 $ Pin       : num 1 2 3 4 5 6 7 8 9 1 ...
 $ SETarmpin : chr "SET1_A_0001_1" "SET1_A_0001_2" "SET1_A_0001_3" "SET1_A_0001_4" ...
 $ Days      : num 145 145 145 145 145 145 145 145 145 145 ...
 $ AbsPinDiff: num 1 4 7 12 5 0 5 1 0 22 ...

> dput(head(try))
structure(list(Site = c("BISC1", "BISC1", "BISC1", "BISC1", "BISC1", "BISC1"),
    SET = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("SET1", "SET2", "SET3"), class = "factor"),
    ARM = c("A_0001", "A_0001", "A_0001", "A_0001", "A_0001", "A_0001"),
    Pin = c(1, 2, 3, 4, 5, 6),
    SETarmpin = c("SET1_A_0001_1", "SET1_A_0001_2", "SET1_A_0001_3", "SET1_A_0001_4", "SET1_A_0001_5", "SET1_A_0001_6"),
    Days = c(145, 145, 145, 145, 145, 145),
    AbsPinDiff = c(1, 4, 7, 12, 5, 0)),
    row.names = c(NA, 6L), class = c("tbl_df", "tbl", "data.frame"))
My mixed-effects model is as follows:
trymodel <- lme(AbsPinDiff ~ Days + SET,
                random = ~1 | SETarmpin,
                correlation = corAR1(form = ~Days | SETarmpin),
                data = try, na.action = na.exclude, method = "REML")
I am using the Anova function from the 'car' package. Yet when I run the function on the above model, I get the following error message:
> Anova(trymodel4)
Error in terms.formula(object, data = data) : 'data' argument is of the wrong type
I am a bit confused because from what I can surmise, the Anova function does not have a 'data' argument.

Improper priors for Bayesian multivariate generalized linear models in R
I am trying to run a Bayesian multivariate generalized linear model with OFT1, MIS1, wt, and g.rate as the response variables; grid, collar, sex, fieldBirthDate, and trialnumber as the predictor variables; and sq_id as the random effect. My data looks like this:
sq_id sex grid trialnumber collar OFT1       MIS1       fieldBirthDate wt          g.rate
22640 F   KL   2           Y      2.1947851  0.08686934 2.036220       2.04349949  0.202092846
22640 F   KL   1           Y      0.7661517  2.65544077 2.036220       0.09300674  0.546493570
22641 F   KL   1           Y      0.9689955  1.38944543 2.036220       0.04942701  0.582793646
22653 F   SU   1           Y      2.1524967  1.03831742 1.665209       0.44824150  0.691384500
22657 M   KL   1           N      1.0918323  2.03883227 1.788879       0.40636099  0.008592439
22657 M   KL   2           N      3.1173521  2.34449199 1.788879       2.44231398  0.249185968
And it is available here. The R package used was MCMCglmm. I start by stating my prior:
# inverse-Wishart prior
prior.miw <- list(R = list(V = diag(4), nu = 0.002),
                  G = list(G1 = list(V = diag(4), nu = 0.002,
                                     alpha.mu = c(0, 0, 0, 0),
                                     alpha.V = diag(4) * 1000)))
Then my model:
mod.1 <- MCMCglmm(cbind(OFT1, MIS1, wt, g.rate) ~ (trait - 1):grid + (trait - 1):collar +
                    (trait - 1):sex + (trait - 1):fieldBirthDate + (trait - 1):trialnumber,
                  random = ~us(trait):sq_id,
                  rcov = ~us(trait):units,  # allows each trait to have different residuals
                  family = c("gaussian", "gaussian", "gaussian", "gaussian"),  # response variable distributions
                  data = multi_data,
                  prior = prior.miw,
                  verbose = FALSE,
                  nitt = 103000,  # number of iterations
                  thin = 100,     # interval at which the Markov chain is stored
                  burnin = 3000)  # number of iterations before samples are stored
But I then get the following two errors (which change each time I run my code):
G-structure 1 is ill-conditioned: use proper priors if you haven't or rescale data if you have
and
ill-conditioned cross-product: can't form Cholesky factor
I made sure to scale my variables properly (mean-centered and standardized), and I have tried other priors, such as the inverse-gamma prior (which works). However, I want to use the inverse-Wishart prior. If I modify my inverse-Wishart prior code by removing the alpha.mu and alpha.V components from it, like so:
prior.miw <- list(R = list(V = diag(4), nu = 0.002),
                  G = list(G1 = list(V = diag(4), nu = 0.002)))
my multivariate model runs. But I would like to keep alpha.mu and alpha.V in my prior. I have two questions:
Why am I getting these errors? (i.e., why is my inverse-Wishart prior, as I've currently written it, causing this issue?)
What is the proper inverse-Wishart prior with alpha.mu and alpha.V in it?
Any suggestions or ideas would be much appreciated!

How to colour species in multivariate plot according to group in other df?
I am performing multivariate analysis on species (spec) and environmental (env) data. I know how to show groups in my environmental data (e.g. using orditorp), but not how to do this in my species data, since the grouping is present in a separate df, veg_groups (EDIT: this other df does not necessarily have the same order of species as my species data, and it may contain more species). Also, I don't know how to just colour the dots according to group (e.g. env$fac1) for my environmental data, instead of using e.g. orditorp, which also adds lines. I assume it works the same for species and environmental data, so in this question I focus on the species data, since for me this one is even more difficult. Any help is highly appreciated!
Example data:
data(dune)
spec = rbind(dune, dune)
spec = spec[1:36,]
v = paste0(rep(LETTERS[1:9], each = 4), rep(1:4, 9))
row.names(spec) = v
veg_groups = as.data.frame(t(rbind(names(spec), rep(c("sp1", "sp2", "sp3"), each = 10))))
env = data.frame(replicate(40, sample(0.0:20.0, 36, rep = TRUE)))
env = env / 10
row.names(env) = v
env$fac1 = as.factor(rep(c("gr1", "gr2", "gr3"), each = 12))
env$year = as.numeric(rep(1:4, 9))
env_excl_year = subset(env, select = -year)
What I plotted so far:
library(vegan)
library(ggplot2)
spec.ca = cca(spec)
ef_ca = envfit(spec.ca, env_excl_year, permu = 999)  # without year
spec.cca = cca(spec ~ X13 + year + fac1, env)  # since X13 and fac1 had p < 0.05 for ef_ca
plot(spec.cca)  # very crowded, seems not to be adaptable
plot(spec.ca, scaling = 2)  # less crowded, and adaptable
p1 = plot(spec.ca, scaling = 2, type = "n")
p1_sel = orditorp(spec.ca, display = "species", labels = names(spec), pcol = "green", pch = "+")
p1_sel = orditorp(spec.ca, display = "sites", labels = row.names(spec), pcol = "blue", pch = "o")
plot(ef_ca, p.max = 0.05, col = "red")
with(env, ordispider(spec.ca, fac1, col = c("orange", "purple", "magenta"), label = T))
# this groups environmental data, but makes the graph really full
Another thing I am not sure about is when to use which scaling. I think scaling = 1 means sites are weighted averages of species, and scaling = 2 means species are weighted averages of sites, but then which one do you choose, and why? If I change it in the code above, ef_ca keeps plotting the arrows in the same direction, even though the arrangement of the species and sites differs. But maybe I should ask this in a separate question. For now my main objective is to know how to colour species based on grouping in a separate df.

Is there a GRS (or GMV) statistic in R to test the portfolio efficiency of adding test assets to a benchmark portfolio?
There are many ways to test the efficiency of a portfolio, where the particular test statistic tests whether adding a couple of test assets to a portfolio consisting of one or more benchmark assets is efficient or not. One way is to use the GRS or GMV test; another is the mean-variance spanning test.
My question is: are there any ways in R to compute some kind of test of the efficiency of adding test assets to the benchmark portfolio? If so, is there also a way to compare this when there is a short-sales restriction?
A couple of papers to illustrate what I mean:
Testing for Mean-Variance Spanning with Short Sales Constraints and Transaction Costs, by Nijman, Theo; de Roon, Frans; Werker, Bas
https://pure.uvt.nl/ws/files/429652/roonijwer1.pdf
An Exact Test of the Improvement of the Minimum Variance Portfolio Paskalis Glabadanidis https://acfr.aut.ac.nz/__data/assets/pdf_file/0019/105481/89759gmv7.pdf
EDIT: I did find the GRS.test package; however, there seems to be no option to add a short-sales constraint: https://cran.r-project.org/web/packages/GRS.test/GRS.test.pdf
Thank you

Is there an Anderson-Darling implementation for Python that returns a p-value?
I want to find the distribution that best fit some data. This would typically be some sort of measurement data, for instance force or torque.
Ideally I want to run Anderson-Darling with multiple distributions and select the distribution with the highest p-value. This would be similar to the 'Goodness of fit' test in Minitab. I am having trouble finding a Python implementation of Anderson-Darling that calculates the p-value.
I have tried scipy's stats.anderson(), but it only returns the AD statistic and a list of critical values with the corresponding significance levels, not the p-value itself. I have also looked into statsmodels, but it seems to only support the normal distribution. I need to compare the fit of several distributions (normal, Weibull, lognormal, etc.). Is there an implementation of Anderson-Darling in Python that returns a p-value and supports non-normal distributions?

Finding a parallel between a t-test for means and a t-test for slope = 0
So picture these two groups of female and male ages:
femalesage <- c(30, 52, 59, 25, 26, 72, 46, 32, 64, 45)
malesage <- c(40, 56, 31, 63, 63, 78, 42, 45, 67)
I can easily run t.test(femalesage, malesage) to get the following result:
t.test(femalesage, malesage)

    Welch Two Sample t-test

data: femalesage and malesage
t = -1.2013, df = 16.99, p-value = 0.2461
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -24.224797   6.647019
sample estimates:
mean of x mean of y
 45.10000  53.88889
Now, suppose I have this same data organized differently, so something like this:
ages <- c(30, 52, 59, 25, 26, 72, 46, 32, 64, 45, 40, 56, 31, 63, 63, 78, 42, 45, 67)
genders <- c("F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "M", "M", "M", "M", "M", "M", "M", "M", "M")
df <- data.frame(ages, genders)
I want to achieve a similar result to the Welch two-sample t-test using some sort of regression test, testing the slope beta1 = 0 vs. beta1 not equal to 0, where beta1 is the coefficient of gender and the response is age. Any idea how I could get the same result?
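Not part of the question, but one caveat worth noting: the t-test on a regression slope with a 0/1 gender indicator reproduces the pooled-variance two-sample t-test exactly, not Welch's unequal-variance version. A hedged Python sketch using the ages from the question:

```python
import numpy as np
from scipy import stats

ages = np.array([30, 52, 59, 25, 26, 72, 46, 32, 64, 45,      # females
                 40, 56, 31, 63, 63, 78, 42, 45, 67], float)  # males
is_male = np.array([0] * 10 + [1] * 9, float)

# Pooled-variance two-sample t-test (equal_var=True); this, not Welch's
# version, is the one the regression slope test matches exactly.
t_pooled, p_pooled = stats.ttest_ind(ages[is_male == 1], ages[is_male == 0],
                                     equal_var=True)

# Simple regression age ~ gender: the slope equals the difference in group
# means, and res.pvalue tests slope == 0.
res = stats.linregress(is_male, ages)
```

So to get exactly the Welch result from a regression you would need a heteroscedasticity-robust variant; the plain slope test matches t.test(..., var.equal = TRUE).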

STAT!! Hypothesis test
Random samples of 50 male athletes who had won at least one medal in Athletics were selected from both the 2000 and the 2012 Summer Olympics.
Did the average age of male medalists in Athletics differ between the 2000 and the 2012 Summer Olympics?
So what test should be applied here: a paired test or a two-sample test?

R: Kruskal-Wallis test in a loop over specified columns in a data frame
I would like to run a KW test over certain numerical variables from a data frame, using one grouping variable. I'd prefer to do this in a loop instead of typing out all the tests, as there are many variables (more than in the example below).
Simulated data:
library(dplyr)
set.seed(123)
Data <- tbl_df(
  data.frame(
    muttype = as.factor(rep(c("missense", "frameshift", "nonsense"), each = 80)),
    ados.tsc = runif(240, 0, 10),
    ados.sa = runif(240, 0, 10),
    ados.rrb = runif(240, 0, 10))
  ) %>%
  group_by(muttype)
ados.sim <- as.data.frame(Data)
The following code works just fine outside of the loop.
kruskal.test(formula (paste((colnames(ados.sim)[2]), "~ muttype")), data = ados.sim)
But it doesn't inside the loop:
for (i in names(ados.sim[, 2:4])) {
  ados.mtp <- kruskal.test(formula(paste((colnames(ados.sim)[i]), "~ muttype")), data = ados.sim)
}
I get the error:
Error in terms.formula(formula, data = data) : invalid term in model formula
Does anybody know how to solve this? Much appreciated!
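Not from the question, but the likely culprit: inside the loop, `i` is already a column name, and `colnames(ados.sim)[i]` indexes an unnamed character vector by name, which yields NA, so the formula becomes "NA ~ muttype"; using `paste(i, "~ muttype")` directly would be the fix. For reference, the same per-column loop can be sketched in Python with scipy (toy data invented, shaped like ados.sim):

```python
from scipy.stats import kruskal

# Toy stand-in for ados.sim: numeric columns plus a grouping variable.
muttype = ["missense"] * 4 + ["frameshift"] * 4 + ["nonsense"] * 4
columns = {
    "ados.tsc": [1.2, 3.4, 2.2, 0.5, 4.1, 2.8, 3.0, 1.9, 0.7, 2.5, 3.3, 1.1],
    "ados.sa":  [2.0, 1.5, 3.1, 2.7, 0.9, 4.0, 1.8, 2.2, 3.6, 0.4, 2.9, 1.3],
}

pvals = {}
for name, values in columns.items():
    # Split each numeric column into one sample per muttype level.
    groups = [[v for v, g in zip(values, muttype) if g == lvl]
              for lvl in ("missense", "frameshift", "nonsense")]
    stat, p = kruskal(*groups)
    pvals[name] = p
```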

Summarise p-values for Kruskal-Wallis test in R
Is there a way to create a table with just the p-values for the Kruskal-Wallis test?
For example, I am able to create a table of p-values for the Wilcoxon test with the following code:
tar <- xs %>%
  summarise_each(funs(wilcox.test(.[RVF_cat == "Group 1"], .[RVF_cat == "Group 2"])$p.value),
                 vars = RAVol_N, RA_N)
But when I do it for the Kruskal test using the code:
z <- xs %>%
  summarise_each(funs(kruskal.test(.[RVF_cat == "Group 1"], .[RVF_cat == "Group 2"])$p.value),
                 vars = RAVol_N, RA_N)
I am getting this error:
`summarise_each()` is deprecated.
Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
To map `funs` over a selection of variables, use `summarise_at()`.
Error in summarise_impl(.data, dots) :
  Evaluation error: 'x' and 'g' must have the same length.
My variable RVF_cat is a factor variable. Any suggestions? Thanks!

Kruskal-Wallis in Python for feature selection
I am new to machine learning, especially using statistical methods. I am using the KW test for preliminary feature extraction for a classification problem. I have two groups, positive and negative, and about 4000 features.
From what I understand from the guide to using the KW test, the syntax says that the input should be from two different populations. I have implemented it according to this answer in this community.
1] So for feature selection, should I go through each feature, club the values corresponding to positive and negative labels, respectively, and proceed as follows:
H, pval = mstats.kruskalwallis(featureN_positive_set, featureN_negative_set)
and select the features with lower p-values?
Observations: does not give really low p-values like approach 2. Values are not repeated.
OR 2] Should I go through combinations of various feature pairs from the positive and negative sets, and proceed as follows:
H, pval = mstats.kruskalwallis(featureN_positive_set, featureM_negative_set)
and select the features with lower p-values?
Observations: gives really low p-values, of the order of 10^-13. However, many of them are repeated to the 8th decimal, even when I give different features for comparison.
OR 3] Should I test the distributions of pairs of two different features as follows:
H, pval = mstats.kruskalwallis(featureN_entire_set, featureM_entire_set)
and choose the ones with different distributions (lower p-values)?
Observations: gives really low p-values, of the order of 10^-25. However, many of them are repeated even to the 8th decimal, even when I give different features for comparison.
OR 4] Although the fourth approach does not fit my understanding of the syntax, my mentor has implemented it in MATLAB and insists that the same works for Python. In MATLAB, we are supposed to give a matrix of features and the corresponding classification labels, and the result contains an array of p-values, one per feature. But I don't suppose the Python implementation does that. Hence, I want more opinions regarding the following approach:
H, pval = mstats.kruskalwallis(featureN_entire_set, class_labels_pos_or_neg)
Finally, 5] is there any difference between stats.kruskal and mstats.kruskalwallis?
Sorry for the really long question, but I am really confused by all these possible approaches I have come up with, and I do not know which one is correct.
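Not part of the question, but a hedged sketch of approach 1] (one test per feature, values split by class label), which is the reading that matches the test's intent; the data below is synthetic, with feature 0 deliberately made class-dependent. As far as I know, `stats.kruskal` and `mstats.kruskalwallis` compute the same test, with the mstats version additionally handling masked arrays.

```python
import numpy as np
from scipy.stats import kruskal

# Synthetic stand-in: 40 samples x 5 features, binary class labels.
rng = np.random.default_rng(7)
X = rng.normal(size=(40, 5))
y = np.array([0] * 20 + [1] * 20)
X[:, 0] += 3.0 * y            # make feature 0 strongly class-dependent

pvals = []
for j in range(X.shape[1]):
    # Approach 1]: same feature, values split by class label.
    stat, p = kruskal(X[y == 1, j], X[y == 0, j])
    pvals.append(p)

ranking = np.argsort(pvals)   # features ordered by increasing p-value
```

Approaches 2] and 3] compare different features to each other, which tests whether two features have different distributions, not whether a feature separates the classes; that is why their p-values look implausibly small and repetitive.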