Can the Kruskal-Wallis test be used to test the significance of multiple groups within multiple factors?
I have tried to read what I can on the Kruskal-Wallis test, and while I have found some useful information, I still cannot seem to find the answer to my question. I am trying to use the Kruskal-Wallis test to determine the significance of multiple groups, within multiple factors, in predicting a set of dependent variables.
Here is an example of my data:
ID Date Point Season Grazing Cattle_Type AvgVOR PNatGr NatGrHt
181 7/21/2015 B22 late pre Large 0.8 2 20
182 7/21/2016 B32 early post Small 1.0 4 24
In this example, my dependent variables are "AvgVOR", "PNatGr", and "NatGrHt", while the independent variables (factors) are "Season", "Grazing", and "Cattle_Type". As you can see, each of my factors has two group levels.
What I am trying to accomplish is to run a nonparametric test that looks at the separate and combined importance of my factor groups for each of my dependent variables. I chose Kruskal-Wallis, and it seems to work for testing one of my grouping factors at a time.
Here is the result for AvgVOR ~ Grazing:
kruskal.test(AvgVOR ~ Grazing, data = Veg)

        Kruskal-Wallis rank sum test

data:  AvgVOR by Grazing
Kruskal-Wallis chi-squared = 94.078, df = 1, p-value < 2.2e-16
This tells me that AvgVOR differs significantly depending on whether it was recorded pre- or post-grazing.
Is there a way to build a similar model using Kruskal-Wallis that includes all of my grouping factors? Even if I have to run a separate one for each dependent variable.
I attempted the following code, but it is flawed.
lapply(Veg[,c("Grazing", "Cattle_Type", "Season")]),function(AvgVOR) kruskal.test(AvgVOR ~ Veg)
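In R, one working version of this idea is lapply(Veg[, c("Grazing", "Cattle_Type", "Season")], function(g) kruskal.test(Veg$AvgVOR ~ g)), which runs one test per factor rather than one combined model; Kruskal-Wallis itself only handles a single grouping factor. The same per-factor loop can be sketched in Python with scipy. The data frame below is made up purely to mirror the question's column names:

```python
import pandas as pd
from scipy.stats import kruskal

# Toy data mimicking the question's columns; values are invented.
veg = pd.DataFrame({
    "AvgVOR":      [0.8, 1.0, 1.2, 0.6, 0.9, 1.1],
    "Season":      ["late", "early", "late", "early", "late", "early"],
    "Grazing":     ["pre", "post", "pre", "post", "pre", "post"],
    "Cattle_Type": ["Large", "Small", "Large", "Small", "Large", "Small"],
})

# One Kruskal-Wallis test of AvgVOR per grouping factor.
results = {}
for factor in ["Season", "Grazing", "Cattle_Type"]:
    groups = [grp["AvgVOR"].values for _, grp in veg.groupby(factor)]
    stat, p = kruskal(*groups)
    results[factor] = p

print(results)
```

This answers the "separate importance" part of the question; testing the combined effect of several factors nonparametrically needs a different tool (e.g. an aligned-rank or permutation-based approach), since Kruskal-Wallis has no interaction terms.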
See also questions close to this topic

R regression: Error in `[.data.frame`(data, , all.vars(Terms), drop = FALSE) :undefined columns selected
I've been trying to build regression models in different ways, and I was trying out this code using the caret package:
library(caret)
set.seed(222)
ind <- sample(2, nrow(model2), replace = T, prob = c(0.7, 0.3))
train <- model2[ind == 1, ]
test <- model2[ind == 2, ]
custom <- trainControl(method = "repeatedcv",
                       number = 6,
                       repeats = 6,
                       verboseIter = T)
lm <- train(train$SS ~ ., train, method = 'lm', trControl = custom)
lm$results
But I kept receiving this error note:
Error in `[.data.frame`(data, , all.vars(Terms), drop = FALSE) :undefined columns selected
Here is the str() of my data set:
> str(train)
'data.frame': 19 obs. of 15 variables:
 $ SST      : num 0 0 0 0 1 0 0 0 0 1 ...
 $ SSA      : num 0 1 0 0 0 0 0 1 0 0 ...
 $ SSR      : num 0 0 0 1 0 0 0 0 0 0 ...
 $ SSC      : num 0 0 0 0 0 0 0 0 0 1 ...
 $ SSF      : num 1 1 1 0 1 1 1 1 1 1 ...
 $ SSS      : num 0 0 0 1 0 0 0 0 0 0 ...
 $ SST      : num 1 1 1 0 1 1 1 1 1 1 ...
 $ SSH      : num 1 1 1 0 1 1 1 1 1 1 ...
 $ SSC      : num 1 1 1 0 1 1 1 1 1 1 ...
 $ SSW      : num 0 0 0 0 0 0 0 0 0 0 ...
 $ QTY      : num 45 45 49 13 48 109 45 42 45 31 ...
 $ SS       : num 470000 550000 460000 630000 1060000 530000 480000 510000 460000 630000 ...
 $ BASE..SS : num 6.27e+09 6.67e+09 6.14e+09 8.54e+09 1.43e+10 ...
 $ Ex       : num 13341 13341 13341 13341 13341 ...
 $ TPH.     : num 45 45 45 90 95 65 45 45 45 45 ...
Any kind of help will be greatly appreciated. Thank you for taking the time to look at this question!

How to execute a function without argument using lapply
I have the following functions:
set.seed(1)
make_seq <- function() {
  paste0(sample(LETTERS, size = 30, replace = TRUE), collapse = "")
}
make_seq()
#> [1] "GJOXFXYRQBFERJUMSZJUYFQDGKAJWI"
It takes no argument and spits out a sequence.
What I want to do is to compactly create 100 sequences with the above function using lapply. But why does this fail?
> lapply(1:100, make_seq())
Error in get(as.character(FUN), mode = "function", envir = envir) :
  object 'GJOXFXYRQBFERJUMSZJUYFQDGKAJWI' of mode 'function' was not found
What's the right way to do it?
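In R, the usual fix is to wrap the call, e.g. lapply(1:100, function(i) make_seq()) or replicate(100, make_seq()); lapply(1:100, make_seq()) fails because make_seq() is evaluated once and its string result is passed where a function is expected. The same wrap-the-call pattern in Python, as an illustrative sketch:

```python
import random
import string

def make_seq():
    # Zero-argument function returning a random 30-letter sequence.
    return "".join(random.choices(string.ascii_uppercase, k=30))

# Wrap the zero-argument function so each iteration calls it afresh,
# instead of calling it once and passing the resulting string around.
seqs = [make_seq() for _ in range(100)]
print(len(seqs), len(seqs[0]))
```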

ggplot change color of one bar from stacked bar chart
Is there a way to manually change the color of one bar (x value) in ggplot?
Data:
for_plot_test=structure(list(name = c("A", "B", "C", "A1", "A2", "A3", "A4", "BI", "A", "B", "C", "A1", "A2", "A3", "A4", "BI"), n = c(1L, 3L, 5L, 7L, 9L, 11L, 13L, 15L, 2L, 4L, 6L, 8L, 10L, 12L, 14L, 16L), value = c(0, 0.05, 0, 0.05, 0.05, 0.1, 0.05, 0, 1, 0.7, 0.6, 0.5, 0.4, 0.2, 0.2, 0.1), variable = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("PROGRESS", "prev_progress"), class = "factor")), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, 16L), vars = "name", labels = structure(list(name = c("Applications", "BI", "Clients", "CRE & Scoring", "Portfolio & Production", "SG Russia", "Transactions", "УКЛ & Prescoring")), row.names = c(NA, 8L), class = "data.frame", vars = "name", drop = TRUE, indices = list(0:1, 14:15, 6:7, 10:11, 2:3, 12:13, 8:9, 4:5), group_sizes = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), biggest_group_size = 2L, .Names = "name"), indices = list(c(0L, 8L), c(7L, 15L), c(3L, 11L), c(5L, 13L), c(1L, 9L), c(6L, 14L), c(4L, 12L), c(2L, 10L)), drop = TRUE, group_sizes = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), biggest_group_size = 2L, .Names = c("name", "n", "value", "variable"))
Current plot
colot_progress <- c("#be877a", "#dcbfad")
s <- ggplot(for_plot_test,
            aes(x = reorder(name, n), y = value, fill = variable,
                label = ifelse(for_plot$value == 0, "",
                               scales::percent(for_plot$value)))) +
  geom_bar(stat = 'identity', position = "stack") +
  scale_fill_manual(values = colot_progress, aesthetics = "fill") +
  coord_flip() +
  theme_minimal() +
  theme(axis.title = element_blank(),
        axis.text.x = element_blank(),
        panel.grid = element_blank(),
        legend.position = "none") +
  geom_text(size = 5, position = position_stack(vjust = 0.5))
s

Scipy poisson distribution with an upper limit
I am generating a random number using scipy stats. I used the Poisson distribution. Below is an example:
import scipy.stats as sct

A = 2.5
Pos = sct.poisson.rvs(A, size = 20)
When I print Pos, I got the following numbers:
array([1, 3, 2, 3, 1, 2, 1, 2, 2, 3, 6, 0, 0, 4, 0, 1, 1, 3, 1, 5])
You can see from the array that some numbers, such as 6, are generated.
What I want to do is to limit the biggest number (let's say to 5), i.e. any random number generated using sct.poisson.rvs should be less than or equal to 5.
How can I tweak my code to achieve this? By the way, I am using this in a Pandas DataFrame.
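Two common ways to impose the cap are sketched below: clipping with np.minimum (fast, but it piles the tail's probability mass onto 5) and rejection sampling (redraw any value above the cap, which keeps the relative probabilities of 0..5 intact, i.e. a truncated Poisson). This is an illustrative sketch, not the only approach:

```python
import numpy as np
import scipy.stats as sct

A, cap, n = 2.5, 5, 20

# Option 1: clip values above the cap onto the cap itself.
clipped = np.minimum(sct.poisson.rvs(A, size=n), cap)

# Option 2: rejection sampling -- redraw until everything is <= cap,
# which preserves the shape of the distribution over 0..cap.
samples = sct.poisson.rvs(A, size=n)
while (samples > cap).any():
    over = samples > cap
    samples[over] = sct.poisson.rvs(A, size=over.sum())

print(clipped.max(), samples.max())
```

Either result is a plain integer array, so it drops straight into a pandas column.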

Why do some analysts use binomial distribution to describe click through rates/probabilities?
The binomial distribution assumes that p is the same for all individual events, but the probability of clicking is different for every user. Isn't that a contradiction?
How could I apply the Poisson distribution and z-test to A/B testing with average order numbers per user as the metric? Is it possible to break down Poisson into binomial in this kind of experiment?

Summary Statistics in JMP
New to JMP for statistical analysis. When you run the distribution (Analyze > Distribution), it usually gives you a histogram of the data and a list of summary statistics. The summary statistics are no longer appearing; instead, only the histogram and a frequency chart appear. I reset my summary statistics to the default and that didn't work. The data in the columns is set to numeric. Not sure what's going on. Thanks for your responses in advance.

NMDS and adonis
I know people usually run the adonis function on the original community matrix against an environmental matrix, or against a matrix of site parameters for the community. I am wondering, though, whether it would be OK to run the original community matrix against a matrix that is a consolidated form of the original (for example, collapsing plant guilds) in order to determine what in the original community matrix is causing the differences visualized in the NMDS.
Example: I had plant community composition data that I used to plot the NMDS. I then created a second matrix that consolidated groups like Native Forbs, Exotic Shrubs, Native Grasses, etc., originating from the original community matrix. To me it makes sense, but I am not sure because it technically uses the same data set. BUT I am not trying to figure out whether separate site characteristics are causing this, just why the NMDS plots differ. I know it is circular thinking, but is it acceptable?
Here is the plot: Plot output
I can post the data set if it is helpful, but I figured this was more of a theoretical question
YLRveg2018 <- ddply(VegComp, c("Plot"), summarise,
  Avena.fatua = sum(Avena.fatua), Bromus.diandrus = sum(Bromus.diandrus),
  Bromus.hordeaceous = sum(Bromus.hordeaceous), Festuca.muyros = sum(Festuca.muyros),
  Festuca.perennis = sum(Festuca.perennis), Carduus.pycnocephalus = sum(Carduus.pycnocephalus),
  Cirsium.vulgare = sum(Cirsium.vulgare), Erodioum.cicatarium = sum(Erodioum.cicatarium),
  Geranium.dissectum = sum(Geranium.dissectum), Helminthotheca.echiodies = sum(Helminthotheca.echiodies),
  Lactuca.seriola = sum(Lactuca.seriola), Medicago.polymorpha = sum(Medicago.polymorpha),
  Oxalis.pes.capre = sum(Oxalis.pes.capre), Raphanus.sativus = sum(Raphanus.sativus),
  Senecio.vulgare = sum(Senecio.vulgare), Sonchus.oleraceous = sum(Sonchus.oleraceous),
  Vicia.sativa = sum(Vicia.sativa), Artemisia.californica = sum(Artemisia.californica),
  Baccharis.pilularis = sum(Baccharis.pilularis), Ericameria.ericoides = sum(Ericameria.ericoides),
  Mimulus.aurantiacus = sum(Mimulus.aurantiacus), Bromus.carinatus = sum(Bromus.carinatus),
  Elymus.triticoides = sum(Elymus.triticoides), Hordeum.Brachyantherum = sum(Hordeum.Brachyantherum),
  Stipa.pulchra = sum(Stipa.pulchra), Achillea.millefolium = sum(Achillea.millefolium),
  Eschscholzia.californica = sum(Eschscholzia.californica), Lupinus.variicolor = sum(Lupinus.variicolor),
  Echium.sp = sum(Echium), Bareground = sum(Bare), Thatch = sum(Thatch))

Guilds2018 <- ddply(GuildComp, c("Plot"), summarise,
  Thatch.Depth = mean(Thatch.depth), Thatch = sum(Thatch), Bareground = sum(Bare),
  OverallPlantCover = sum(VegCover), ExoticGrasses = sum(ExoGrass),
  ExoticForbs = sum(ExoForb), ExoticShurbs = sum(ExoShrub),
  NativeGrasses = sum(NativeGrass), NativeForbs = sum(NativeForb),
  NativeShrubs = sum(NativeShrub))
adonis(Veg2018 ~ NativeGrasses+Bareground+ExoticGrasses+Thatch+Thatch.Depth+ExoticForbs+NativeForbs+NativeShrubs, permutations = 999, distance = "bray", data=PlantGuilds)

Multivariate polynomial regression
I am trying to do a multivariate polynomial regression on my data in Python. I have 4 independent variables and 1 dependent variable. I'm unsure even where to begin. I've gone through a lot of blogs in the last few days but didn't get much clarity; most of them cover only univariate data. Is there a good post anywhere that anyone could direct me to, one that explains the concept and has sample code? Any help is much appreciated.
Thank you!
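A minimal sketch of the idea, using only numpy on synthetic data: expand the four predictors into degree-2 polynomial terms (intercept, linear terms, squares, and pairwise products) and fit ordinary least squares on the expanded design matrix. scikit-learn's PolynomialFeatures plus LinearRegression packages these same steps; all names and data below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))                 # 4 independent variables
# Synthetic response with a known linear and a known quadratic term.
y = (1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] ** 2
     + rng.normal(scale=0.1, size=n))

# Degree-2 design matrix. Column order: intercept (0), linear terms
# (1-4), then products X_i * X_j for i <= j (5-14).
cols = [np.ones(n)]
cols += [X[:, i] for i in range(4)]
cols += [X[:, i] * X[:, j] for i in range(4) for j in range(i, 4)]
design = np.column_stack(cols)              # shape (n, 15)

# Ordinary least squares on the expanded features.
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print(coef[:5])
```

The fitted coefficients recover the generating model (intercept near 1.0, the X0 linear term near 2.0, and the X1-squared term near 0.5), which is the whole trick: polynomial regression is just linear regression on expanded features.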

Multivariate fractional outcome regression
In an NBER 2010 paper, Mullahy extended Papke and Wooldridge's univariate fractional regression estimation strategies (implemented e.g. in the fracreg Stata command) to multivariate regression. However, I am not aware of Stata commands or R packages to run such an analysis. Do they exist? What if, of the two dependent variables, one is fractional and the other continuous?
Assessing the significance of change in one parameter between two samples
Let's assume we have two data samples. One of them has an additional parameter that did not initially exist in that sample. After it was added, we can see that another attribute of the sample (e.g. performance percentage) has increased considerably. In the other sample, which does not have this parameter, the performance percentage has not increased. How can we make sure that the change in performance percentage in the first sample is directly related to the addition of that specific parameter?
Thanks

How to judge the results of the Wilcoxon test?
Hi all! I want to conduct the Wilcoxon rank-sum test on two datasets, i.e., x1 and x2. My code is as follows:
from scipy import stats
x1 = [9, 5, 8, 7, 10, 6, 7]    # len(x1) = 7
x2 = [7, 4, 5, 6, 3, 6, 4, 4]  # len(x2) = 8
stats.mannwhitneyu(x1, x2)
Then I get the result like this,
> MannwhitneyuResult(statistic=6.5, pvalue=0.006966479792405637)
I do know how the statistic variable is calculated. But should I reject the null hypothesis or not, according to the result generated by the function mannwhitneyu()?
I also noticed that some other Python functions return statistic and pvalue variables. I am quite confused about the meaning of the p-value generated by these functions, such as ranksums(), wilcoxon(), etc. Can I simply compare it with 0.05 to judge whether I should reject the null hypothesis or not?
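Essentially yes: the pvalue returned by mannwhitneyu (and likewise by ranksums and wilcoxon) is compared directly against your chosen significance level, commonly 0.05; if p is below it, you reject the null hypothesis at that level. A short sketch with the question's data, passing alternative="two-sided" explicitly since older scipy versions default to a one-sided variant:

```python
from scipy import stats

x1 = [9, 5, 8, 7, 10, 6, 7]
x2 = [7, 4, 5, 6, 3, 6, 4, 4]

stat, p = stats.mannwhitneyu(x1, x2, alternative="two-sided")

alpha = 0.05                 # chosen significance level
reject_null = p < alpha      # True -> the two samples differ significantly
print(stat, p, reject_null)
```

Note that the different functions test different nulls (mannwhitneyu/ranksums compare two independent samples; wilcoxon is for paired samples), so the p-value's interpretation depends on which test matches your data, even though the reject-if-below-alpha rule is the same.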
Reporting means in a t-test after log transformation
I have a variable, namely INSULIN, and I want to test the significance of its means using a t-test. The original data had some outliers and was not normally distributed. I did a log transformation, after which it followed normality, which I checked using the Kolmogorov-Smirnov test. See the table below of INSULIN and log-transformed INSULIN.
Group Statistics
Variable                 Type      N    Mean     Std. Deviation  Std. Error Mean
INSULIN_Log_Transformed  Cases     74   2.4006   0.56882         0.06612
                         Controls  196  1.9629   0.73320         0.05237
INSULIN                  Cases     74   13.0678  8.61817         1.00184
                         Controls  196  9.5318   8.88716         0.63480
After applying the t-test to the log-transformed data, the mean difference was significant. Now I am confused about which means I should report. If I have to report the transformed means, how do I do that?

Pandas: apply Kruskal-Wallis to numeric columns
I have a dataframe of 27 columns (26 are numeric variables and the 27th column tells me which group each row is associated with). There are 7 groups in total. I'm trying to apply the Kruskal-Wallis test to each variable, split by group, to determine if there is a significant difference or not.
I have tried:
df.groupby(['treatment']).apply(kruskal)
which throws the error "Need at least two groups in stats.kruskal()".
My other attempts haven't produced an output either. I'll be doing similar analyses on a regular basis and with larger datasets. Can someone help me understand this issue and how to fix it?
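The error arises because stats.kruskal expects one positional array per group, whereas groupby().apply hands it one group at a time. A sketch of the intended per-column loop, with illustrative column and group names standing in for the real 26 variables and 7 treatments:

```python
import pandas as pd
from scipy.stats import kruskal

# Toy frame: two numeric columns plus a 'treatment' grouping column.
df = pd.DataFrame({
    "var1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
    "var2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0, 9.0],
    "treatment": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
})

# For each numeric column, split the values by treatment group and pass
# the per-group arrays to kruskal as separate positional arguments.
numeric_cols = df.columns.drop("treatment")
pvals = {
    col: kruskal(*[grp[col].values
                   for _, grp in df.groupby("treatment")]).pvalue
    for col in numeric_cols
}
print(pvals)
```

This yields one p-value per variable, which scales directly to the 26-column case.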

R: Kruskal-Wallis test in loop over specified columns in data frame
I would like to run a KW test over certain numerical variables from a data frame, using one grouping variable. I'd prefer to do this in a loop instead of typing out all the tests, as there are many variables (more than in the example below).
Simulated data:
library(dplyr)
set.seed(123)
Data <- tbl_df(
  data.frame(
    muttype  = as.factor(rep(c("missense", "frameshift", "nonsense"), each = 80)),
    ados.tsc = runif(240, 0, 10),
    ados.sa  = runif(240, 0, 10),
    ados.rrb = runif(240, 0, 10))
  ) %>%
  group_by(muttype)
ados.sim <- as.data.frame(Data)
The following code works just fine outside of the loop.
kruskal.test(formula(paste(colnames(ados.sim)[2], "~ muttype")), data = ados.sim)
But it doesn't inside the loop:
for (i in names(ados.sim[, 2:4])) {
  ados.mtp <- kruskal.test(formula(paste(colnames(ados.sim)[i], "~ muttype")),
                           data = ados.sim)
}
I get the error:
Error in terms.formula(formula, data = data) : invalid term in model formula
Anybody who knows how to solve this? Much appreciated!!

Summarise p-values for Kruskal-Wallis test in R
Is there a way to create a table with just p-values for the Kruskal-Wallis test?
For example, I am able to create a table of pvalues for Wilcoxon test with the following code
tar <- xs %>%
  summarise_each(funs(wilcox.test(.[RVF_cat == "Group 1"],
                                  .[RVF_cat == "Group 2"])$p.value),
                 vars = RAVol_N, RA_N)
But when I am doing it for kruskal test using the code:
z <- xs %>%
  summarise_each(funs(kruskal.test(.[RVF_cat == "Group 1"],
                                   .[RVF_cat == "Group 2"])$p.value),
                 vars = RAVol_N, RA_N)
I am getting this error:
`summarise_each()` is deprecated.
Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
To map `funs` over a selection of variables, use `summarise_at()`.
Error in summarise_impl(.data, dots) :
  Evaluation error: 'x' and 'g' must have the same length.
My variable RVF_cat is a factor (via as.factor). Any suggestions? Thanks