Can the Kruskal-Wallis test be used to test significance of multiple groups within multiple factors?
I have tried to read what I can on the Kruskal-Wallis test and, while I have found some useful information, I still cannot find the answer to my question. I am trying to use the Kruskal-Wallis test to determine the significance of multiple groups, within multiple factors, in predicting a set of dependent variables.
Here is an example of my data:
ID Date Point Season Grazing Cattle_Type AvgVOR PNatGr NatGrHt
181 7/21/2015 B22 late pre Large 0.8 2 20
182 7/21/2016 B32 early post Small 1.0 4 24
In this example, my dependent variables are "AvgVOR", "PNatGr", and "NatGrHt", while the independent variables (factors) are "Season", "Grazing", and "Cattle_Type". As you can see, each of my factors has two group levels.
What I am trying to accomplish is to run a nonparametric test that looks at the separate and combined importance of my factor groups to each of my dependent variables. I chose Kruskal-Wallis, and it seems to work for testing one of my grouping factors at a time.
Here is the result for AvgVor ~ Grazing
kruskal.test(AvgVOR ~ Grazing, data = Veg)
Kruskal-Wallis rank sum test
data: AvgVOR by Grazing
Kruskal-Wallis chi-squared = 94.078, df = 1, p-value < 2.2e-16
This tells me that AvgVOR differs significantly depending on whether the measurements were recorded pre- or post-grazing.
Is there a way to build a similar model using Kruskal-Wallis that includes all of my grouping factors? Even if I have to run a separate one for each dependent variable.
I attempted the following code, but it is flawed.
lapply(Veg[,c("Grazing", "Cattle_Type", "Season")]),function(AvgVOR) kruskal.test(AvgVOR ~ Veg)
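The lapply call above has a misplaced parenthesis and passes the whole data frame into the formula. A minimal working sketch of the same idea, assuming Veg contains the columns shown above, loops over the factor names instead:

```r
factors <- c("Grazing", "Cattle_Type", "Season")

# One Kruskal-Wallis test of AvgVOR per grouping factor
results <- lapply(factors, function(f) kruskal.test(Veg$AvgVOR ~ Veg[[f]]))
names(results) <- factors
results
```

Note that kruskal.test() only handles one grouping factor at a time; for the combined effect of several factors, one commonly suggested rank-based route is the Scheirer-Ray-Hare extension of the Kruskal-Wallis test (e.g. scheirerRayHare() in the rcompanion package).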
See also questions close to this topic

Quantstrat: Is it possible to execute several orders for the same instrument within a single day?
I'm a bit puzzled by how exactly Quantstrat places orders when they are generated within a single day by several signals. I'm trying to implement a simple pair-trading strategy (with zero being both the entry and the exit threshold), so when the z-score flips its sign, two signals should be generated for each symbol (SELL LONG -> SELL SHORT, or COVER SHORT -> BUY LONG). The problem is that QS always ignores the exit order (marks it as 'replaced' by the subsequent entry order). What should I do to prevent signals from being replaced or canceled? Is it possible to execute several orders on a single bar at all?
Here's my setup. Signals:
add.signal(strategy = strategy_st, name = 'sigFormula',
           arguments = list(columns = c('score', 'entry_threshold'),
                            formula = 'score > entry_threshold', cross = TRUE),
           label = 'sigEnterUpper')
add.signal(strategy = strategy_st, name = 'sigFormula',
           arguments = list(columns = c('score', 'entry_threshold'),
                            formula = 'score <= entry_threshold', cross = TRUE),
           label = 'sigEnterLower')
#exit
add.signal(strategy = strategy_st, name = 'sigFormula',
           arguments = list(columns = c('score', 'exit_threshold'),
                            formula = 'score <= exit_threshold', cross = TRUE),
           label = 'sigExitUpper')
add.signal(strategy = strategy_st, name = 'sigFormula',
           arguments = list(columns = c('score', 'exit_threshold'),
                            formula = 'score >= exit_threshold', cross = TRUE),
           label = 'sigExitLower')
Rules:
#entries
#longs
add.rule(strategy = strategy_st, name = 'ruleSignal', label = 'increase_long_up',
         arguments = list(sigcol = 'sigEnterUpper', sigval = TRUE,
                          orderqty = default_order_quantity, ordertype = 'market',
                          orderside = 'long', osFUN = orderSizeIncrease),
         type = 'enter', path.dep = TRUE)
add.rule(strategy = strategy_st, name = 'ruleSignal', label = 'increase_long_low',
         arguments = list(sigcol = 'sigEnterLower',
                          #sigcol = 'sigFlipper',
                          #sigcol = 'sigStartTrading',
                          sigval = TRUE,
                          orderqty = default_order_quantity, ordertype = 'market',
                          orderside = 'long', osFUN = orderSizeIncrease),
         type = 'enter', path.dep = TRUE)
#shorts
add.rule(strategy = strategy_st, name = 'ruleSignal', label = 'increase_short_up',
         arguments = list(sigcol = 'sigEnterUpper',
                          #sigcol = 'sigFlipper',
                          #sigcol = 'sigStartTrading',
                          sigval = TRUE,
                          orderqty = default_order_quantity, ordertype = 'market',
                          orderside = 'short', osFUN = orderSizeIncrease),
         type = 'enter', path.dep = TRUE)
add.rule(strategy = strategy_st, name = 'ruleSignal', label = 'increase_short_low',
         arguments = list(sigcol = 'sigEnterLower',
                          #sigcol = 'sigFlipper',
                          #sigcol = 'sigStartTrading',
                          sigval = TRUE,
                          orderqty = default_order_quantity, ordertype = 'market',
                          orderside = 'short', osFUN = orderSizeIncrease),
         type = 'enter', path.dep = TRUE)
#exits
add.rule(strategy = strategy_st, name = 'ruleSignal', label = 'close_position_up',
         arguments = list(sigcol = 'sigExitUpper', sigval = TRUE, orderqty = 'all',
                          ordertype = 'market', orderside = NULL),
         type = 'exit')
add.rule(strategy = strategy_st, name = 'ruleSignal', label = 'close_position_low',
         arguments = list(sigcol = 'sigExitLower', sigval = TRUE, orderqty = 'all',
                          ordertype = 'market', orderside = NULL),
         type = 'exit')
Signals generated:
           EWA.Close       score entry_threshold exit_threshold sigEnterUpper sigEnterLower sigExitUpper sigExitLower
2006-04-04     22.95  0.00000000               0              0            NA            NA           NA           NA
2006-04-05     22.78  0.00000000               0              0            NA            NA           NA           NA
2006-04-06     22.94  0.00000000               0              0            NA            NA           NA           NA
2006-04-07     23.05  0.00000000               0              0            NA            NA           NA           NA
2006-04-08     23.32  0.00000000               0              0            NA            NA           NA           NA
2006-04-09     23.12  0.00000000               0              0            NA            NA           NA           NA
2006-04-10     23.01  0.00000000               0              0            NA            NA           NA           NA
2006-04-11     23.33  0.00000000               0              0            NA            NA           NA           NA
2006-04-12     23.23  0.00000000               0              0            NA            NA           NA           NA
2006-04-13     23.54  0.00000000               0              0            NA            NA           NA           NA
2006-04-14     23.56  0.00000000               0              0            NA            NA           NA           NA
2006-04-15     23.17  0.00000000               0              0            NA            NA           NA           NA
2006-04-16     22.76  0.00000000               0              0            NA            NA           NA           NA
2006-04-17     22.20  0.00000000               0              0            NA            NA           NA           NA
2006-04-18     22.29  0.00000000               0              0            NA            NA           NA           NA
2006-04-19     21.79  0.00000000               0              0            NA            NA           NA           NA
2006-04-20     21.47  0.00000000               0              0            NA            NA           NA           NA
2006-04-21     21.55  0.00000000               0              0            NA            NA           NA           NA
2006-04-22     21.35  0.00000000               0              0            NA            NA           NA           NA
2006-04-23     21.37  0.00000000               0              0            NA            NA           NA           NA
2006-04-24     21.15  0.00000000               0              0            NA            NA           NA           NA
2006-04-25     21.99  0.00000000               0              0            NA            NA           NA           NA
2006-04-26     22.20 -1.89349767               0              0            NA             1           NA            1
2006-04-27     22.10 -2.05270575               0              0            NA            NA           NA           NA
2006-04-28     22.38 -2.68490142               0              0            NA            NA           NA           NA
2006-04-29     22.38 -2.40434146               0              0            NA            NA           NA           NA
2006-04-30     22.60 -1.87012946               0              0            NA            NA           NA           NA
2006-05-01     22.09 -1.42771737               0              0            NA            NA           NA           NA
2006-05-02     21.86 -1.49838977               0              0            NA            NA           NA           NA
2006-05-03     21.44 -1.18071344               0              0            NA            NA           NA           NA
2006-05-04     21.04 -0.59868857               0              0            NA            NA           NA           NA
2006-05-05     21.32 -0.85730960               0              0            NA            NA           NA           NA
2006-05-06     21.02 -0.77078417               0              0            NA            NA           NA           NA
2006-05-07     20.23 -0.74856764               0              0            NA            NA           NA           NA
2006-05-08     20.41  0.01843937               0              0             1            NA            1           NA   #I expect to cover short and buy long here
2006-05-09     21.00  0.26363152               0              0            NA            NA           NA           NA
2006-05-10     20.78  0.44662726               0              0            NA            NA           NA           NA
2006-05-11     20.32  0.71957534               0              0            NA            NA           NA           NA
2006-05-12     20.29  0.76088199               0              0            NA            NA           NA           NA
Orders generated:
           Order.Qty Order.Price Order.Type Order.Side Order.Threshold Order.Status Order.StatusTime      Prefer Order.Set Txn.Fees Rule                 Time.In.Force
2006-04-26 "-1.046"  "22.2"      "market"   "short"    NA              "closed"     "2006-04-27 00:00:00" ""     NA        "0"      "increase_short_low" ""
2006-05-08 "all"     "20.41"     "market"   "short"    NA              "replaced"   "2006-05-08"          ""     NA        "0"      "close_position_up"  ""
2006-05-08 "1.046"   "20.41"     "market"   "long"     NA              "closed"     "2006-05-09 00:00:00" ""     NA        "0"      "increase_long_up"   ""

R: Row sums by same column patterns
I'm trying to merge columns that share the same defined string pattern and calculate their row sums.
For example:
mat <- matrix(1:20, nrow = 2, ncol = 10, byrow = TRUE)
colnames(mat) <- c("slurry","slurrys","liquid","liquids","solut","solution","aqueou","aqueous","agent","agents")
mat
     slurry slurrys liquid liquids solut solution aqueou aqueous agent agents
[1,]      1       2      3       4     5        6      7       8     9     10
[2,]     11      12     13      14    15       16     17      18    19     20
I want to get the following result:
slurry liquid solut aqueou agent [1,] 3 7 11 15 19 [2,] 23 27 31 35 39
I have tried using sapply, but the calculated result is wrong.
nams <- c("slurry", "liquid", "solut", "aqueou", "agent")
nams_bind <- sapply(nams, function(i) rowSums(mat[, nams == i, drop = FALSE]))
nams_bind
     slurry liquid solut aqueou agent
[1,]      7      9    11     13    15
[2,]     27     29    31     33    35
Is any way to revise it?
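The comparison nams == i is the bug: it produces a logical vector of length 5 that gets recycled across the 10 columns, selecting the wrong pairs. A sketch of one fix, assuming the goal is to group columns by name prefix, matches with startsWith() instead:

```r
mat <- matrix(1:20, nrow = 2, ncol = 10, byrow = TRUE)
colnames(mat) <- c("slurry", "slurrys", "liquid", "liquids", "solut",
                   "solution", "aqueou", "aqueous", "agent", "agents")

nams <- c("slurry", "liquid", "solut", "aqueou", "agent")
# sum every column whose name starts with the given prefix
nams_bind <- sapply(nams, function(i)
  rowSums(mat[, startsWith(colnames(mat), i), drop = FALSE]))
nams_bind
#      slurry liquid solut aqueou agent
# [1,]      3      7    11     15    19
# [2,]     23     27    31     35    39
```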

Aggregate data in one column before a date
I am trying to sum values in one column, grouped by id. One id may have several accounts, which were opened on different days. For each account, I want to sum the amounts recorded before that account was opened, i.e. on dates earlier than the account's open date. Here is the sample data, and this is the expected result: sum_amount is the sum of amount over any accounts that were opened before the account in question.
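The sample data did not survive in the text, so here is a hypothetical reconstruction of the task in base R (the column names id, opened, and amount are assumptions):

```r
# Hypothetical data standing in for the sample that was posted with the question
accounts <- data.frame(
  id     = c(1, 1, 1, 2, 2),
  opened = as.Date(c("2015-01-10", "2015-03-05", "2015-06-20",
                     "2015-02-01", "2015-04-15")),
  amount = c(100, 50, 25, 200, 75)
)

# For each account, sum the amounts of the same id's accounts opened earlier
accounts$sum_amount <- sapply(seq_len(nrow(accounts)), function(i) {
  with(accounts, sum(amount[id == id[i] & opened < opened[i]]))
})
accounts
# sum_amount: 0 100 150 0 200
```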

Anomaly detection on sequence of events which are time ordered
I have a continuous sequence of events that are time ordered. For example, I can have events like A, B, A, D, C, F, A, E, F, ... I want to find subsequences that are very different from the other subsequences. I am not sure which machine learning algorithm suits this problem. The data points don't have labels, so this will be an unsupervised technique. I also want to understand how to choose the window length of the subsequence. Please suggest which machine learning algorithms work better, give me some intuition on how to choose the window length, or suggest techniques that can learn the length of the sequences dynamically.

Nested Anova in python with Spm1d. Can't print f statistics and p values
I'm looking for a simple solution to perform multi-factor ANOVA analysis in Python. A two-factor nested ANOVA is what I'm after, and the spm1d Python module is one way to do that; however, I am having an issue.
http://www.spm1d.org/doc/Stats1D/anova.html#twowaynestedanova
For any of the nested-approach examples, no F-statistics or p-values are ever printed, nor can I find any way to print them or assign them to a variable.
To go through the motions of running one of their examples, where B is nested inside A, with Y observations:
import numpy as np
from matplotlib import pyplot
import spm1d

dataset = spm1d.data.uv1d.anova2nested.SPM1D_ANOVA2NESTED_3x3()
Y, A, B = dataset.get_data()

#(1) Conduct ANOVA:
alpha = 0.05
FF = spm1d.stats.anova2nested(Y, A, B, equal_var=True)
FFi = FF.inference(0.05)
print( FFi )

#(2) Plot results:
pyplot.close('all')
FFi.plot(plot_threshold_label=True, plot_p_values=True)
pyplot.show()
The only indication of statistical significance provided is whether the h0 hypothesis is rejected or not.
> print( FFi )
SPM{F} inference list
   design   : ANOVA2nested
   nEffects : 2
Effects:
   A   z=(1x101) array   df=(2, 6)    h0reject=True
   B   z=(1x101) array   df=(6, 36)   h0reject=False
In reality, that should be enough. However, scientists like to think of results as more or less significant, which is actually kind of crap; significance is binary. But that's how they think about it, so I have to play along in order to get work published.
The example code produces a matplotlib plot, and this DOES have the f statistic and p_values on it!
#(2) Plot results:
pyplot.close('all')
FFi.plot(plot_threshold_label=True, plot_p_values=True)
pyplot.show()
But I can't seem to get any output which prints it.
FFi.get_p_values
and
FFi.get_f_values
produce the output:
<bound method SPMFiList.get_p_values of SPM{F} inference list
   design   : ANOVA2nested
   nEffects : 2
Effects:
   A   z=(1x101) array   df=(2, 6)    h0reject=True
   B   z=(1x101) array   df=(6, 36)   h0reject=False>
(and likewise for get_f_values)
So I don't know what to do. Clearly the FFi.plot method can access the p-values (with plot_p_values), but FFi.get_p_values can't!? Can anyone lend a hand?
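The `<bound method ...>` in that output is the clue: the methods were referenced without parentheses, so Python returns the method object itself rather than calling it. A toy illustration (a stand-in class with made-up values, not spm1d):

```python
class Toy:
    """Hypothetical stand-in mimicking an object that exposes get_p_values."""
    def get_p_values(self):
        return [0.01, 0.20]

t = Toy()
print(t.get_p_values)    # <bound method Toy.get_p_values of ...>  -- the symptom
print(t.get_p_values())  # [0.01, 0.2]  -- calling it returns the values
```

So `FFi.get_p_values()` (with parentheses) may be all that is missing, though I have not verified this against spm1d itself.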
cheers, K

How to accurately calculate a running average in JavaScript without summing entire set
Say I am tracking the time it takes for a function to execute, and am showing the average, updating each time a function completes.
The array of times would be like:
var completionTimes = [123, 1234, 128, 1000, ...]
But it could get very large, into the millions or billions of runs. Averaging that every frame would be expensive.
var sum = completionTimes.reduce(function(a, b) { return a + b })
var avg = sum / completionTimes.length
I am wondering if there is a trick of some sort to perform this running average without having to sum all the values each time. Ideally without loss of accuracy/precision, but if that's not possible, a small loss of accuracy works too.
Maybe there is a way to sort them and group them into chunks, average the chunks, then do it that way. Not sure what best practices are here.
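One standard trick, sketched below, is the incremental mean update avg_n = avg_{n-1} + (x_n - avg_{n-1}) / n, which is O(1) per new sample and never stores or re-sums the array:

```javascript
// Incremental (running) mean: O(1) per sample, no stored array needed.
// Update rule: avg_n = avg_{n-1} + (x_n - avg_{n-1}) / n
function makeRunningAverage() {
  let count = 0;
  let avg = 0;
  return function add(x) {
    count += 1;
    avg += (x - avg) / count;
    return avg;
  };
}

const add = makeRunningAverage();
console.log(add(123));   // 123
console.log(add(1234));  // 678.5
```

This form is numerically better behaved than keeping a giant running sum, because the accumulator stays near the magnitude of the data instead of growing without bound.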

Correlation within a group of variables
I have a group of variables, e.g. 5 variables. I have used corrcoef() in Matlab to calculate the correlation matrix between each pair of variables, so now I have the correlation value between any two variables. My question is: is there any approach that can be used to calculate one correlation value denoting the correlation of the variables within this group (5 variables)? Thanks.

Questions answers from the book Analyzing Multivariate Data
Do you have the answers from the book "Analyzing Multivariate Data" (J. Douglas Carroll, Paul E. Green, James Lattin, Harue Avritscher)? There is a teacher's book (only available to teachers) that provides the answers. Unofficial answers would help too.

Why does the hypothesis have to introduce two parameters, namely θ0 and θ1?
I was learning Machine Learning from this course on Coursera taught by Andrew Ng. The instructor defines the hypothesis as a linear function of the "input" (x, in my case) like the following:
h_θ(x) = θ_0 + θ_1 * x
In supervised learning, we have some training data, and based on it we try to "deduce" a function that closely maps the inputs to the corresponding outputs. To deduce the function, we introduce the hypothesis as a linear function of the input x. My question is: why is a function involving two θs chosen? Why can't it be as simple as y^(i) = a * x^(i), where a is a coefficient? Later we could go about finding a "good" value of a for a given example (i) using an algorithm. This question might look very stupid; I apologize, but I'm not very good at machine learning, I am just a beginner. Please help me understand this. Thanks!
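For reference, the single-coefficient form in the question is simply the two-parameter hypothesis with the intercept θ0 fixed at zero, which forces the fitted line through the origin:

```latex
h_\theta(x) = \theta_0 + \theta_1 x
\qquad\text{and with } \theta_0 = 0:\qquad
h(x) = a x \quad (a = \theta_1)
```

The extra parameter θ0 lets the fitted line take a nonzero value at x = 0, which a pure y = a·x model cannot do.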

Maximum Likelihood Hypothesis Definition
I need the definition of the maximum likelihood hypothesis. I know that maximum likelihood estimation is a statistical method for fitting the parameters of different models. So can I say that the maximum likelihood hypothesis is the hypothesis whose parameters were found using ML estimation? Thanks for the help.

Kruskal-Wallis test between sublists of a list in R
I'm pretty new to R. I'm trying to run a Kruskal-Wallis test between data-framed sublists (containing numeric data) within one list, but I keep getting errors.
Each sublist has one column but an unequal number of rows (hence, they can't be stored, as far as I know, within one dataframe)
data:
data_list <- list(
  tumor = c(0.004255040, 0.002703172, 0.007478089, 0.003554968, 0.003803952,
            0.005225325, 0.004816366, 0.005674340, 0.003474605, 0.004784456),
  t = c(0.004326186, 0.008126497, 0.009110830, 0.004030094, 0.005784066,
        0.006752136, 0.009840556),
  b = c(0.004872971, 0.009066809, 0.005964638, 0.003622466, 0.011660714),
  caf = c(0.003618611, 0.007463386, 0.007463134, 0.005453387, 0.010409640,
          0.012020965))
So it looks like this:
$tumor
1  0.004255040
2  0.002703172
3  0.007478089
4  0.003554968
5  0.003803952
6  0.005225325
7  0.004816366
8  0.005674340
9  0.003474605
10 0.004784456

$t
1 0.004326186
2 0.008126497
3 0.009110830
4 0.004030094
5 0.005784066
6 0.006752136
7 0.009840556

$b
1 0.004872971
2 0.009066809
3 0.005964638
4 0.003622466
5 0.011660714

$caf
1 0.003618611
2 0.007463386
3 0.007463134
4 0.005453387
5 0.010409640
6 0.012020965
I've tried many things, all came back with errors and unsuccessful:
> kruskal.test(data_list)
Error in `[.data.frame`(u, complete.cases(u)) : undefined columns selected
> kruskal.test(list(data_list$tumor, data_list$t, data_list$b, data_list$caf))
Error in `[.data.frame`(u, complete.cases(u)) : undefined columns selected
> kruskal.test(list(data_list$tumor[,1], data_list$t, data_list$b[,1], data_list$caf[,1]))
Error in `[.data.frame`(u, complete.cases(u)) : undefined columns selected
> kruskal.test(unlist(data_list))
Error in kruskal.test.default(unlist(data_list)) : argument "g" is missing, with no default
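kruskal.test() accepts either a list of plain numeric vectors or a single value vector plus a parallel grouping factor. A sketch of the second form, flattening the list (values abbreviated for illustration):

```r
data_list <- list(
  tumor = c(0.004255040, 0.002703172, 0.007478089),
  t     = c(0.004326186, 0.008126497),
  b     = c(0.004872971, 0.009066809, 0.005964638),
  caf   = c(0.003618611, 0.007463386)
)

# One value vector plus a grouping factor built from the list names
values <- unlist(data_list, use.names = FALSE)
groups <- factor(rep(names(data_list), lengths(data_list)))
kruskal.test(values, groups)
```

If the sublists really are one-column data frames, extracting the column first (e.g. lapply(data_list, `[[`, 1)) yields plain vectors, and kruskal.test() on that list of vectors should then also work.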
Thank you! :)

Post hoc test for Kruskal-Wallis nonparametric test
I am trying to run a post hoc test for the comparisons that showed significant differences in my Kruskal-Wallis test, and I keep getting a variety of errors. I don't think I have my data set up right, as I keep getting errors such as
object not found
or
invalid term in model formula
Here is a sample of the data I am trying to run:
LMWSed: 77 0 3.4 22.7 73.5 79 57 19 16 70
group3: ref ref ref ref low low low low high high high high
The script I used was:
> dunnTest(LMWSed ~ group3, data = tpah, method = "bh")
This returned this error:
Error in eval(expr, envir, enclos) : object 'LMWSed' not found
I also tried it with quotes around the LMWSed, and had this error:
Error in terms.formula(formula, data = data) : invalid term in model formula
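For what it's worth, dunnTest() (from the FSA package) looks LMWSed and group3 up as columns of the data frame passed via data=, so both vectors must live inside tpah; quoting the name turns it into an invalid formula term, which matches the second error. A minimal sketch, with the group vector trimmed to match the ten LMWSed values for illustration:

```r
library(FSA)  # provides dunnTest()

# Both variables must be columns of the data frame named in `data =`
tpah <- data.frame(
  LMWSed = c(77, 0, 3.4, 22.7, 73.5, 79, 57, 19, 16, 70),
  group3 = factor(c("ref", "ref", "ref", "ref", "low",
                    "low", "low", "high", "high", "high"))
)
dunnTest(LMWSed ~ group3, data = tpah, method = "bh")
```

Note that the two vectors must be the same length; the group3 sample shown above has twelve labels against ten LMWSed values, which would also cause an error.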
Thanks for any help in advance.
Jenn