Can the Kruskal-Wallis test be used to test the significance of multiple groups within multiple factors?
I have tried to read what I can on Kruskal-Wallis, and while I have found some useful information, I still cannot find the answer to my question. I am trying to use the Kruskal-Wallis test to determine the significance of multiple groups, within multiple factors, in predicting a set of dependent variables.
Here is an example of my data:
ID Date Point Season Grazing Cattle_Type AvgVOR PNatGr NatGrHt
181 7/21/2015 B22 late pre Large 0.8 2 20
182 7/21/2016 B32 early post Small 1.0 4 24
In this example, my dependent variables are "AvgVOR", "PNatGr" and "NatGrHt", while the independent variables (factors) are "Season", "Grazing", and "Cattle_Type". As you can see, each of my factors has two levels.
What I am trying to accomplish is to run a nonparametric test that looks at the separate and combined importance of my factor groups for each of my dependent variables. I chose Kruskal-Wallis, and it seems to work for testing one of my grouping factors at a time.
Here is the result for AvgVOR ~ Grazing:

kruskal.test(AvgVOR ~ Grazing, data = Veg)

        Kruskal-Wallis rank sum test

data:  AvgVOR by Grazing
Kruskal-Wallis chi-squared = 94.078, df = 1, p-value < 2.2e-16

This tells me that AvgVOR is significantly different according to whether it was recorded pre- or post-grazing.
Is there a way to build a similar model using Kruskal-Wallis that includes all of my grouping factors? Even if I have to run a separate one for each dependent variable.
I attempted the following code, but it is flawed.
lapply(Veg[,c("Grazing", "Cattle_Type", "Season")]),function(AvgVOR) kruskal.test(AvgVOR ~ Veg)
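For what it's worth, a sketch of one way to do this, using the column names from the example above. Note that kruskal.test only handles one grouping factor at a time, so this runs each response/factor pair separately rather than modelling combined effects; for combined or interaction effects a rank-based factorial method such as the Scheirer-Ray-Hare test is often suggested instead.

```r
# Run a separate Kruskal-Wallis test for every response/factor pair.
responses <- c("AvgVOR", "PNatGr", "NatGrHt")
factors   <- c("Season", "Grazing", "Cattle_Type")

results <- lapply(setNames(responses, responses), function(resp) {
  lapply(setNames(factors, factors), function(fac) {
    kruskal.test(Veg[[resp]] ~ as.factor(Veg[[fac]]))
  })
})

# e.g. results$AvgVOR$Grazing reproduces the single test shown above
```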
See also questions close to this topic

Color values in boxplot based on x-axis variable in ggplot
I have a dataframe like the following:
df = data.frame(cat = rep(c("A", "B", "C", "D"), each = 20), val = runif(80))
And an annotation dataframe like the following:
ann = data.frame(cat = c("A", "B", "C", "D"), col = c(34, 84, 23, 85))
I want to make a boxplot for each of these "cat"s along the x-axis, with the value in the data frame as the y-axis, but I also want to color each boxplot by the value in ann$col (continuous color mapping). I can get the boxplot like the following:
ggplot(df, aes(x = cat, y = val)) + geom_boxplot(width = 0.12)
But I am unsure how to color each boxplot by the category value.
How can this be done?
Thanks, Jack
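A sketch of one way to do this with the df and ann frames above: merge the annotation onto the data first, then map fill to the continuous col value (group = cat keeps one box per category even though fill is continuous):

```r
library(ggplot2)

# Merge the annotation colours onto the data, then map fill to ann$col
df2 <- merge(df, ann, by = "cat")

ggplot(df2, aes(x = cat, y = val, fill = col, group = cat)) +
  geom_boxplot(width = 0.5) +
  scale_fill_gradient(low = "lightblue", high = "darkblue")
```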

Loop in R, extracting the last line of the output
Hi guys,
I've been trying to extract the values from a loop for a while but it seems I just can't find the answer.
To make it simple, here is the code I'm using in R programme:
seatsInitial <- function(df, x, y) {
  for (i in ModQuota) {
    filtered <- subset(df, Poll > x)
    OnlyPoll <- filtered$Poll
    Calculus <- OnlyPoll / i
    Calculus2 <- floor(Calculus)
    print(Calculus2)
    if (sum(Calculus2) == y) stop("WE GOT IT")
  }
}
Then, to get the results:
seatsInitial(PartiesE, 5, 33)
The loop works properly and I am able to get the results I need. The loop stops when the condition is met. However, I would also need to convert the last line of the output (when the loop stops as a result of meeting the condition) into a vector.
Do you have any idea how this could be done?
If I run the loop, I get the following results (showing only the last 5 lines of the output here):
[1] 8 14 4 3 5
[1] 8 14 4 3 5
[1] 8 14 4 3 5
[1] 8 14 4 3 5
[1] 7 14 4 3 5
Error in seatsInitial(PartiesEPP, 5, 33) : WE GOT IT
I therefore need the last line (7,14,4,3,5) to be converted into a vector.
Thanks for your help!
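One way to get that last line out, sketched on the function above: replace the stop() with return(), so the loop hands back the vector that met the condition instead of raising an error (this assumes ModQuota and PartiesE exist as in the question):

```r
seatsInitial <- function(df, x, y) {
  for (i in ModQuota) {
    filtered  <- subset(df, Poll > x)
    Calculus2 <- floor(filtered$Poll / i)
    print(Calculus2)
    if (sum(Calculus2) == y) return(Calculus2)  # hand the winning vector back
  }
  NULL  # condition never met
}

result <- seatsInitial(PartiesE, 5, 33)  # result is now the "last line" vector
```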

Remove records matching a value on a single column occurring within a 5 minute window of a different value on the same column
I have a data frame that looks like this:
require(data.table)
require(tidyverse)

df <- as.data.frame(matrix(
  c(123, "2018-01-05 09:09:02", "Mobile",
    123, "2018-01-06 11:11:15", "Organic",
    123, "2018-01-07 13:24:45", "Email",
    123, "2018-01-07 13:24:55", "Organic",
    321, "2018-01-05 15:15:29", "Organic",
    989, "2018-01-08 08:09:21", "Feeds",
    989, "2018-01-08 08:09:55", "Organic",
    989, "2018-01-10 10:21:40", "Email"),
  nrow = 8, ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("customer_id", "entry_time", "channel"))))
df$entry_time <- as.POSIXct(df$entry_time)

df
  customer_id          entry_time channel
1         123 2018-01-05 09:09:02  Mobile
2         123 2018-01-06 11:11:15 Organic
3         123 2018-01-07 13:24:45   Email
4         123 2018-01-07 13:24:55 Organic
5         321 2018-01-05 15:15:29 Organic
6         989 2018-01-08 08:09:21   Feeds
7         989 2018-01-08 08:09:55 Organic
8         989 2018-01-10 10:21:40   Email
What I would like to do is remove all "Organic" records occurring within a five-minute window of a non-Organic record, for each customer.
In other words, I want to remove all records where: 1) channel = Organic, 2) entry_time is less than 5 minutes after the previous record, and 3) the previous record's channel != Organic. I need to be able to do this for each customer_id.
My desired output looks as follows:
df_desired <- as.data.frame(matrix(
  c(123, "2018-01-05 09:09:02", "Mobile",
    123, "2018-01-06 11:11:15", "Organic",
    123, "2018-01-07 13:24:45", "Email",
    321, "2018-01-05 15:15:29", "Organic",
    989, "2018-01-08 08:09:21", "Feeds",
    989, "2018-01-10 10:21:40", "Email"),
  nrow = 6, ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("customer_id", "entry_time", "channel"))))
df_desired$entry_time <- as.POSIXct(df_desired$entry_time)

df_desired
  customer_id          entry_time channel
1         123 2018-01-05 09:09:02  Mobile
2         123 2018-01-06 11:11:15 Organic
3         123 2018-01-07 13:24:45   Email
4         321 2018-01-05 15:15:29 Organic
5         989 2018-01-08 08:09:21   Feeds
6         989 2018-01-10 10:21:40   Email
I am able to do this with the following nested loop (apologies for exposing you to this monstrosity).
dat_splt <- split(df, df$customer_id)
for (h in 1:length(dat_splt)) {
  dat_splt[[h]]$prox_flag <- 0
  if (nrow(dat_splt[[h]]) == 1) {
    next
  } else {
    for (g in 2:nrow(dat_splt[[h]])) {
      if (dat_splt[[h]][g, ]$channel != "Organic") {
        next
      } else if (dat_splt[[h]][g - 1, ]$channel != "Organic" &
                 as.numeric(difftime(dat_splt[[h]][g, ]$entry_time,
                                     dat_splt[[h]][g - 1, ]$entry_time,
                                     units = "mins")) < 5) {
        dat_splt[[h]][g, ]$prox_flag <- 1
      } else {
        next
      }
    }
  }
}
dat <- rbindlist(dat_splt)
dat <- dat %>% filter(prox_flag != 1)
Needless to say, this does not scale well. Can someone please help me unravel this Gordian knot of a solution into something more practical?
Much thanks in advance.
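For what it's worth, a dplyr sketch of the same rule using lag() within each customer; on the example data this reproduces df_desired, assuming records are sorted by time within customer (arrange() enforces that):

```r
library(dplyr)

df_clean <- df %>%
  arrange(customer_id, entry_time) %>%
  group_by(customer_id) %>%
  filter(!(channel == "Organic" &
           lag(channel, default = "Organic") != "Organic" &
           as.numeric(difftime(entry_time, lag(entry_time),
                               units = "mins")) < 5)) %>%
  ungroup()
```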

Data perturbation: how to perform it?
I am doing some projects related to statistics simulation in R, based on "Introduction to Scientific Programming and Simulation Using R". In the student projects section (chapter 24) I am doing "The pipe spiders of Brunswick" problem, but I am stuck on one part of an evolutionary algorithm, where you need to perform some data perturbation according to the sentence below:
"With probability 0.5 each element of the vector is perturbed, independently of the others, by an amount normally distributed with mean 0 and standard deviation 0.1"
What does being "perturbed" really mean here? I don't really know which operation I should be doing to my vector to make this perturbation happen, and I'm not finding any answers to this problem. Thanks in advance!
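A minimal sketch of what that sentence usually means in evolutionary algorithms: for each element independently, flip a fair coin; on "heads", add a draw from N(0, 0.1) to that element, otherwise leave it alone:

```r
perturb <- function(v, p = 0.5, s = 0.1) {
  hit <- runif(length(v)) < p           # TRUE with probability p, per element
  v + hit * rnorm(length(v), mean = 0, sd = s)
}

set.seed(1)
perturb(c(1, 2, 3, 4, 5))  # some elements nudged slightly, others unchanged
```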

Python: Weighted coefficient of variation
How can I calculate the weighted coefficient of variation (CV) over a NumPy array in Python? It's okay to use any popular third-party Python package for this purpose.
I can calculate the (unweighted) CV using scipy.stats.variation:

import numpy as np
from scipy.stats import variation

arr = np.arange(-5, 5)
weights = np.arange(9, -1, -1)  # same size as arr
cv = abs(variation(arr))  # isn't weighted
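One way to sketch this directly with NumPy, since np.average accepts weights: the weighted CV is the weighted standard deviation divided by the weighted mean. With equal weights this reduces to scipy.stats.variation.

```python
import numpy as np

def weighted_cv(values, weights):
    """Weighted coefficient of variation: weighted std dev / weighted mean."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mean = np.average(values, weights=weights)               # weighted mean
    var = np.average((values - mean) ** 2, weights=weights)  # weighted variance
    return np.sqrt(var) / mean

# With equal weights this matches abs(scipy.stats.variation(...))
cv_equal = weighted_cv([1.0, 2.0, 3.0, 4.0], [1, 1, 1, 1])
```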

PCA results on imbalanced data with duplicates
I am using sklearn IPCA decomposition and surprised that if I delete duplicates from my dataset, the result differs from the "unclean" one.
What is the reason? As I think, the variance is the same.

Gaussian process regression with multiple independent variables in Python
I was looking into the development of Python code that permits Gaussian process regression with multi-variable inputs. This exact question has been asked in the past, but I am seeking an actual test example with code. I have been searching packages and perhaps am simply overlooking the right documentation, but I cannot find any routines in Python that can handle multiple independent variables (i.e. x_1, x_2, ..., x_n) to predict a single output (y).
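One sketch using scikit-learn, whose GaussianProcessRegressor accepts a 2-D X out of the box, so multi-variable inputs need no special handling (the toy data and kernel choice here are illustrative assumptions, not a recommendation):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.RandomState(0)
X = rng.uniform(0.0, 1.0, size=(50, 3))             # three independent variables
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.5 * X[:, 2]  # single output

# Fit a GP with an RBF kernel; X is (n_samples, n_features)
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-6)
gpr.fit(X, y)
y_pred, y_std = gpr.predict(X[:5], return_std=True)  # mean and uncertainty
```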
PLS in R: Model training and predicting values with two Y variables
I'd like to train and predict values using a PLS model with more than one Y variable, but I have some problems when I try this approach. My code is below:
# First simulate some data
set.seed(123)
bands <- 20
data <- data.frame(matrix(runif(60 * bands), ncol = bands))
colnames(data) <- paste0(1:bands)
data$nitrogen <- rpois(60, 10)
data$carbon <- rpois(60, 10)

# Training data set
cal_BD <- data[1:50, ]
# Validation data set
val_BD <- data[51:60, ]

# Define explanatory variables (x)
spectra <- cal_BD[, 1:20]

# Build PLS model using training data only
mod_pls <- plsr(carbon + nitrogen ~ spectra, ncomp = 20, data = cal_BD,
                validation = "LOO", jackknife = TRUE)
summary(mod_pls)

# Prediction in validation data set
est_pls <- predict(mod_pls, comps = 20, newdata = val_BD)
est_pls
1) It doesn't work when I try carbon + nitrogen in the model; and
2) I'd like to create a new data frame with the estimated values for carbon and nitrogen, using the code below:
val_BD2 <- val_BD[, -(21:22)]  # remove carbon + nitrogen because my goal is to predict these values
est_pls <- predict(mod_pls, comps = 20, newdata = val_BD)  # prediction in validation data set (only X's)
final_est_DF <- cbind(val_BD2, est_pls[, 1], est_pls[, 2])
And my desirable output with estimated carbon and nitrogen and not observed values is:
            1          2         3 ... carbon nitrogen
51 0.04583117 0.93529980 0.6299731 ...   15.3      8.6
52 0.44220007 0.30122890 0.1838285 ...   10.0      7.1
53 0.79892485 0.06072057 0.8636441 ...    9.0      7.3
54 0.12189926 0.94772694 0.7465680 ...   11.1      6.5
55 0.56094798 0.72059627 0.6682846 ...   10.3      8.4
56 0.20653139 0.14229430 0.6180179 ...   13.9      9.1
...
Is this possible?
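A sketch of the usual pls-package idiom for multi-response models: bind the responses into a matrix with cbind() on the left-hand side (a plain `carbon + nitrogen` in an R formula means their sum, which is likely why the original call misbehaves), and keep the predictors as a single matrix column in the data frame:

```r
library(pls)

# Responses as a two-column matrix; predictors as one matrix variable
cal_BD$spectra <- as.matrix(cal_BD[, 1:20])
val_BD$spectra <- as.matrix(val_BD[, 1:20])

mod_pls <- plsr(cbind(carbon, nitrogen) ~ spectra, ncomp = 20,
                data = cal_BD, validation = "LOO", jackknife = TRUE)

# Predictions come back with one column per response; bind them to the X's
est_pls <- predict(mod_pls, comps = 20, newdata = val_BD)
final_est_DF <- cbind(val_BD[, 1:20], est_pls)
```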

Covariance matrix and correlation matrix in R
Hello, I am using the dystrophy data from package ipred. I've used subsets to separate carriers from normal:
carrier = subset(dystrophy, dystrophy$Class == "carrier")
normal = subset(dystrophy, dystrophy$Class == "normal")
and I've reduced this data by selecting only the patients with one visit to the hospital:

carrier = subset(carrier, carrier$OBS == "1")
normal = subset(normal, normal$OBS == "1")
So now I would like to practice calculating the mean vector, covariance matrix and correlation matrix of the proteins, but by separate groups (the Class factor).
I've tried with cor and cov, but I think I am doing something wrong. Any help would be appreciated. Thanks!!
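A sketch of the split/lapply pattern for per-group statistics. The protein column names here (CK, H, PK, LD) are my assumption about the dystrophy data set; check names(dystrophy) for the exact variables:

```r
library(ipred)
data(dystrophy)

proteins  <- c("CK", "H", "PK", "LD")        # assumed protein columns
one_visit <- subset(dystrophy, OBS == 1)
by_class  <- split(one_visit[, proteins], one_visit$Class)

lapply(by_class, colMeans, na.rm = TRUE)                   # mean vectors
lapply(by_class, function(d) cov(d, use = "complete.obs")) # covariance matrices
lapply(by_class, function(d) cor(d, use = "complete.obs")) # correlation matrices
```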

Excel Normal Distribution for proving null hypothesis
I have been trying to test the null hypothesis for step 4, question 2,
with this data in Excel: excel table
I have been following my college's presentation file formula to get H8; I assume it is a random variable? My presentation file is like this: slide1 slide2 slide3
and my h8 excel table formula is
=SQRT(D2/SQRT(E2)^2+(D9)/SQRT(E9)^2)
and H7 in the Excel table is just the 2017 average minus the 2010 average,
and my normal distribution formula is
=NORMDIST(H7,0,H8,TRUE)
Is it supposed to return 0 in the normal distribution formula? Is the value correct for the question I have? Thanks.

Optimal sample size for control/test groups for a t-test
Recently we launched a feature on one of our website pages. I have six months of historical data about the page, including impressions and CTR. It has been 20 days since we launched the feature, and now we want to know if there is any significant lift in CTR post-launch. Is there a way to determine how many impressions are needed for a statistically meaningful t-test of the lift in CTR, treating page views before launch as the control group and post-launch as the test group? How much historical data do I need to look at for the control group, and how do I evaluate the required sample size for the test group based on that? Any lead or different approach is highly appreciated.
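One common approach, sketched in R: treat each impression as a Bernoulli click/no-click trial and use base R's power.prop.test to find the impressions needed per group. The baseline CTR and the lift worth detecting below are placeholders to replace with your own historical numbers:

```r
# Impressions needed in EACH group to detect a lift from 2.0% to 2.2% CTR
# at 5% significance with 80% power (both CTR values are assumptions).
power.prop.test(p1 = 0.020, p2 = 0.022, sig.level = 0.05, power = 0.80)
# the $n in the output is the required per-group impression count
```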

Mielke & Berry 1985 significance test for Goodman & Kruskal tau
How can I perform Mielke & Berry's 1985 (DOI 10.1177/0049124185013004005) nonasymptotic significance test for Goodman & Kruskal's Marginal Predictability Coefficient (Goodman-Kruskal tau) in R? The test seems simple enough to program a function for, but I simply do not know how to program functions in R yet, and I have multiple deadlines for multiple assignments this year, so I do not have the time to learn before the deadline. If anyone who does know R programming is willing to sift through the article (8 pages in full, though the math I need to encode is contained entirely on pages 4-5) and reply with simple code that I can copy and paste into RStudio to obtain a function that calculates the p-value for Goodman & Kruskal's tau test, I would be infinitely grateful until the end of my academic life and more than willing to include the good Samaritan in the bibliographical references section of my report.
P.S.: I have already asked a similar (broader) question here before, but someone closed the topic because, for reasons I cannot comprehend, they mistook Goodman & Kruskal's tau test for Marginal Predictability between 2 nominal categorical variables (which is an improvement on Goodman & Kruskal's lambda test for Modal Predictability between 2 nominal categorical variables, since the latter only considers the modal categories whilst the former considers all) for Kruskal & Wallis's H test for Equality of Rank Sums for one quantitative or ordinal variable amongst 2 or more nominal categorical groups (which is itself an extension of the Mann-Whitney U test for Equality of Rank Sums, which can only compare quantitative or ordinal variables amongst exactly 2 nominal categorical groups). I enjoy Kruskal-Wallis ANOVA as much as the next guy, but it simply doesn't apply to my case (nominal variables that cannot be ranked or compared; besides, I want to test for increased marginal predictability of one variable's value conditioned by knowledge of the other variable's value, not for the likelihood that observations with at least one certain value for one variable have "superior" values for the other), so linking me to an already-answered question regarding variation of diamond price per karat based on diamond colour just won't help me.
P.P.S.: The packages DescTools, GoodmanKruskal, etc. don't give p-values.
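This is not Mielke & Berry's exact nonasymptotic procedure, but a generic Monte-Carlo permutation test for Goodman-Kruskal tau can be sketched in a few lines and may serve until someone encodes the article's formulas:

```r
# tau(y | x): proportional reduction in classification error for y given x
gk_tau <- function(x, y) {
  tab <- table(x, y); n <- sum(tab)
  e_marg <- 1 - sum(colSums(tab)^2) / n^2      # error rate ignoring x
  e_cond <- 1 - sum(tab^2 / rowSums(tab)) / n  # error rate knowing x
  (e_marg - e_cond) / e_marg
}

# Permutation p-value: shuffle y and see how often tau is at least as large
gk_tau_test <- function(x, y, nperm = 9999) {
  obs  <- gk_tau(x, y)
  perm <- replicate(nperm, gk_tau(x, sample(y)))
  list(tau = obs, p.value = (1 + sum(perm >= obs)) / (1 + nperm))
}
```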
Test for difference by outcome (Yes/No) at each level of a factor, with one outcome level missing in a factor level
I need a method to divide a vector (blood-sample values) by each level of a factor of time intervals (gacat) and compare these data (by t-test/ANOVA or Kruskal-Wallis) between the two levels of an outcome factor variable (EPL (yes/no)).
In the mtcars df:
df1 < mtcars df1$cyl < factor(df1$cyl) df1$gear < factor(df1$gear)
This code solves my problem nicely, using ANOVA:
lapply(split(df1, df1$gear), function(d){summary(aov(mpg~cyl, data=d))})
However, as the last level of my data in split (in the above example, gear) only has one of the yes/no outcomes, the entire code throws the error:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels
My data:
             No Yes
<6 weeks     89  21
6-8 weeks   166  37
8-10 weeks  158  18
10-12 weeks 131   5
>12 weeks    90 **0**
It is the zero marked with asterisks that seems to be the problem... In the example, this is not a problem, as every factor has at least two levels:
table(df1$cyl, df1$gear)

     3  4  5
  4  1  8  2
  6  2  4  1
  8 12  0  2
As I need to examine a lot of blood samples, I would love a way to do these comparisons in a short code block. Is there a way to make R return NaN for the last level, instead of the entire code failing?
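A small sketch of one way to keep the lapply running: wrap the model call in tryCatch so the level with only one outcome yields NA instead of aborting everything:

```r
lapply(split(df1, df1$gear), function(d) {
  tryCatch(summary(aov(mpg ~ cyl, data = d)),
           error = function(e) NA)  # NA for groups the model can't handle
})
```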

Get KruskalWallis Ranks from function
I'm using Kruskal-Wallis to do some analysis on non-normal/heteroskedastic data, and I'm keen to use a Welch's t-test implementation on ranked data as described by Cribbie et al. here. Their process involves performing Welch's t-test on the ranks from KW.
The Python part of the question is: is there a way to save and recall the ranked values from the KW test? I'm using the KW implementation scipy.stats.kruskal and I don't see much info in the API. Is there an alternate implementation that would give me what I'm after?
Thanks in advance.
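One sketch: scipy.stats.kruskal does not expose its internal ranks, but scipy.stats.rankdata applied to the pooled data reproduces the same joint ranking (midranks for ties), which can then be split back per group and fed to Welch's t-test:

```python
import numpy as np
from scipy import stats

def pooled_ranks(*groups):
    """Rank all observations jointly (as Kruskal-Wallis does internally),
    then split the ranks back into their original groups."""
    pooled = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    ranks = stats.rankdata(pooled)  # midranks for ties, matching stats.kruskal
    cuts = np.cumsum([len(g) for g in groups])[:-1]
    return np.split(ranks, cuts)

a = [1.2, 3.4, 5.6]
b = [2.1, 0.7]
ra, rb = pooled_ranks(a, b)
# Welch's t-test on the ranks (the Cribbie et al. approach described above):
t_stat, p_val = stats.ttest_ind(ra, rb, equal_var=False)
```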

Can I use both parametric and nonparametric tests?
I have 3 variables, A1, A2 and A3
A1 is temperature A2 is month A3 is location
A2 has 2 months: March and May. A3 has 2 cities: Chennai and Dubai.
My data is non-normally distributed and I am trying to compare the 3 groups with the Kruskal-Wallis test to see whether there is a difference between A2 and A3:
kruskal.test(A2 ~ A3, data = my_data)

        Kruskal-Wallis rank sum test

data:  A2 by A3
Kruskal-Wallis chi-squared = 0, df = 1, p-value = 1

Is a p-value of 1 correct, and can I do an ANOVA (parametric test) to make sure that there is no difference between A2 and A3, even though the data is non-normally distributed?
Many Thanks, Ishack
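For reference, kruskal.test expects a numeric response on the left of the formula and a grouping factor on the right; A2 ~ A3 puts a grouping factor on both sides, which is likely why the test degenerates to chi-squared = 0 and p = 1. Calls that actually compare temperature across the groups would look like:

```r
# kruskal.test wants "numeric_response ~ grouping_factor"
kruskal.test(A1 ~ A2, data = my_data)  # temperature by month
kruskal.test(A1 ~ A3, data = my_data)  # temperature by city
```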