Can the Kruskal-Wallis test be used to test significance of multiple groups within multiple factors?
I have tried to read what I can on the Kruskal-Wallis test and, while I have found some useful information, I still haven't found the answer to my question. I am trying to use the Kruskal-Wallis test to determine the significance of multiple groups, within multiple factors, in predicting a set of dependent variables.
Here is an example of my data:
ID Date Point Season Grazing Cattle_Type AvgVOR PNatGr NatGrHt
181 7/21/2015 B22 late pre Large 0.8 2 20
182 7/21/2016 B32 early post Small 1.0 4 24
In this example, my dependent variables are "AvgVOR", "PNatGr" and "NatGrHt", while the independent variables (factors) are "Season", "Grazing", and "Cattle_Type". As you can see, each of my factors has 2 group levels.
What I am trying to accomplish is to run a nonparametric test that looks at the separate and combined importance of my factor groups for each of my dependent variables. I chose Kruskal-Wallis, and it seems to work for testing one of my grouping factors at a time.
Here is the result for AvgVOR ~ Grazing:
kruskal.test(AvgVOR ~ Grazing, data = Veg)
	Kruskal-Wallis rank sum test

data:  AvgVOR by Grazing
Kruskal-Wallis chi-squared = 94.078, df = 1, p-value < 2.2e-16
This tells me that AvgVOR is significantly different according to whether the measurements were recorded pre- or post-grazing.
Is there a way to build a similar model using Kruskal-Wallis that includes all of my grouping factors, even if I have to run a separate one for each dependent variable?
I attempted the following code, but it is flawed.
lapply(Veg[,c("Grazing", "Cattle_Type", "Season")]),function(AvgVOR) kruskal.test(AvgVOR ~ Veg)
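For reference, here is a hedged sketch of what that lapply call may have been aiming at: one Kruskal-Wallis test per factor, plus a test on the interaction of all factors as a rough "combined" grouping. The Veg data frame here is a hypothetical stand-in generated so the snippet is self-contained.

```r
# Hypothetical stand-in for the Veg data frame described in the question
set.seed(1)
Veg <- data.frame(
  AvgVOR      = runif(40, 0.5, 2.0),
  Season      = sample(c("early", "late"),  40, replace = TRUE),
  Grazing     = sample(c("pre", "post"),    40, replace = TRUE),
  Cattle_Type = sample(c("Large", "Small"), 40, replace = TRUE)
)

# One Kruskal-Wallis test per factor, for a single dependent variable
factors <- c("Grazing", "Cattle_Type", "Season")
results <- lapply(factors, function(f) kruskal.test(Veg$AvgVOR ~ Veg[[f]]))
names(results) <- factors

# A "combined" grouping can be approximated by testing the interaction of
# all factors (up to 8 groups here), though this is not a true multi-factor
# model the way a two-way ANOVA is:
combined <- kruskal.test(AvgVOR ~ interaction(Season, Grazing, Cattle_Type),
                         data = Veg)
```

Note that the Kruskal-Wallis test is inherently a one-factor test; for genuinely multi-factor nonparametric designs, rank-based approaches such as the aligned rank transform or the Scheirer-Ray-Hare extension are usually suggested instead.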
See also questions close to this topic

In R, from a csv file, get rows where a column contains more than 2 characters
I have a csv file from which I have to get rows which have more than 2 characters in a specific column. My csv file looks like this:
"Name" "Age" "ID" "RefID" "ABC" "12" "Abccc" "xyzw" "AAA" "14" "A" "X" "BBB" "18" "DEfff" "dfg" "CCY" "10" "F" "XY" "CCZ" "20" "R" "XYC"
So from columns 3 and 4 I have to take rows which have >= 2 characters.
I tried the following:
data = read.table(file = 'res.csv', header = T)
dat2 = as.character(data[,3])
ind = dat2[(which(nchar(dat2) >= 2)),]
But it's giving me an error, and I am not able to find out how I can proceed with both columns at once. My result should look like the below:
"Name" "Age" "ID" "RefID" "ABC" "12" "Abccc" "xyzw" "BBB" "18" "DEfff" "dfg" "CCY" "10" "F" "XY" "CCZ" "20" "R" "XYC"
Any help would be appreciated.
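A minimal base-R sketch of one way to do this, using a small in-memory stand-in for res.csv. Judging by the expected output above (where "CCY" is kept although its ID is a single character), a row is kept when either of the two columns has at least 2 characters:

```r
# Stand-in for the data read from res.csv
data <- data.frame(
  Name  = c("ABC", "AAA", "BBB", "CCY", "CCZ"),
  Age   = c(12, 14, 18, 10, 20),
  ID    = c("Abccc", "A", "DEfff", "F", "R"),
  RefID = c("xyzw", "X", "dfg", "XY", "XYC"),
  stringsAsFactors = FALSE
)

# Keep rows where column 3 OR column 4 has >= 2 characters
keep   <- nchar(data[, 3]) >= 2 | nchar(data[, 4]) >= 2
result <- data[keep, ]
```

If instead both columns must qualify, replace `|` with `&`.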

Error in attr(x, "memberships") <- value : attempt to set an attribute on NULL
I am trying to implement fuzzy logic for classifying the weather into different climatic zones. I am getting this null error when I call the fuzzy_inference() function of the sets package in RStudio. The following is my code defining the sets and fuzzy rules:
sets_options("universe", seq(1, 200, 0.5))

variables <- set(
  rainfall = fuzzy_partition(varnames = c(range1 = 30, range2 = 70, range3 = 90), sd = 5.0),
  maxTemp  = fuzzy_partition(varnames = c(hot = 30, hotter = 70), sd = 5.0),
  humidity = fuzzy_partition(varnames = c(less_humid = 30, humid = 60, more_humid = 80), sd = 3.0),
  soil = fuzzy_partition(varnames = c(Black_soil = 10, Mixed_Red_and_Black_soil = 20,
    Mixed_Red_and_clay_soil = 30, Mixed_red_and_sandy_soil = 40,
    Mixed_Red_Loamy_and_sandy_soil = 50, Mixed_sandy_and_clayey_soil = 60,
    Red_forest_soil = 70, Red_Lateritic_soil = 80, Red_Loamy_soil = 90,
    Red_sandy_soil = 100), sd = 3.0),
  climate = fuzzy_partition(varnames = c(Northern_Dry_Zone = 10, Eastern_Dry_Zone = 20,
    North_Eastern_Transition_Zone = 30, Southern_Dry_Zone = 40, Eastern_Dry_Zone = 50,
    Hilly_Zone = 60, Central_Dry_Zone = 70, Coastal_Zone = 80,
    Southern_Transition_Zone = 90, Northern_Transition_Zone = 100), sd = 3.0),
  FUN = fuzzy_cone, radius = 10)

rules <- set(
  fuzzy_rule(humidity %is% less_humid && maxTemp %is% hot && rainfall %is% range2 && soil %is% Black_soil, climate %is% Northern_Dry_Zone),
  fuzzy_rule(humidity %is% humid && maxTemp %is% hot && rainfall %is% range2 && soil %is% Red_Loamy_soil, climate %is% Eastern_Dry_Zone),
  fuzzy_rule(humidity %is% humid && maxTemp %is% hot && rainfall %is% range2 && soil %is% Black_soil, climate %is% Northern_Dry_Zone),
  fuzzy_rule(humidity %is% less_humid && maxTemp %is% hotter && rainfall %is% range1 && soil %is% Mixed_Red_and_Black_soil, climate %is% Northern_Dry_Zone),
  fuzzy_rule(humidity %is% less_humid && maxTemp %is% hot && rainfall %is% range1 && soil %is% Black_soil, climate %is% Northern_Transition_Zone),
  fuzzy_rule(humidity %is% humid && maxTemp %is% hot && rainfall %is% range3 && soil %is% Red_forest_soil, climate %is% Southern_Dry_Zone),
  fuzzy_rule(humidity %is% humid && maxTemp %is% hot && rainfall %is% range2 && soil %is% Mixed_red_and_sandy_soil, climate %is% Eastern_Dry_Zone),
  fuzzy_rule(humidity %is% less_humid && maxTemp %is% hotter && rainfall %is% range2 && soil %is% Mixed_Red_and_Black_soil, climate %is% Central_Dry_Zone),
  fuzzy_rule(humidity %is% more_humid && maxTemp %is% hot && rainfall %is% range3 && soil %is% Red_Loamy_soil, climate %is% Hilly_Zone),
  fuzzy_rule(humidity %is% more_humid && maxTemp %is% hotter && rainfall %is% range3 && soil %is% Red_Lateritic_sandy_soil, climate %is% Coastal_Zone),
  fuzzy_rule(humidity %is% less_humid && maxTemp %is% hotter && rainfall %is% range3 && soil %is% Black_soil, climate %is% North_Eastern_Dry_Zone),
  fuzzy_rule(humidity %is% more_humid && maxTemp %is% hot && rainfall %is% range3 && soil %is% Red_Loamy_soil, climate %is% Southern_Transition_Zone),
  fuzzy_rule(humidity %is% humid && maxTemp %is% hot && rainfall %is% range2 && soil %is% Mixed_Red_Loamy_and_sandy_soil, climate %is% Eastern_Dry_Zone),
  fuzzy_rule(humidity %is% less_humid && maxTemp %is% hotter && rainfall %is% range2 && soil %is% Black_soil, climate %is% Northern_Dry_Zone),
  fuzzy_rule(humidity %is% humid && maxTemp %is% hot && rainfall %is% range3 && soil %is% Red_Loamy_soil, climate %is% Southern_Dry_Zone),
  fuzzy_rule(humidity %is% less_humid && maxTemp %is% hot && rainfall %is% range1 && soil %is% Black_soil, climate %is% North_Eastern_Dry_Zone),
  fuzzy_rule(humidity %is% humid && maxTemp %is% hotter && rainfall %is% range2 && soil %is% Mixed_Red_Loamy_and_sandy_soil, climate %is% Eastern_Dry_Zone)
)

model <- fuzzy_system(variables, rules)

### error code ####
example.1 <- fuzzy_inference(model, list(humidity = 63, maxTemp = 31, rainfall = 72.5, soil = 41.5))
The following line of the code is throwing the null error:
### error code ####
example.1 <- fuzzy_inference(model, list(humidity = 63, maxTemp = 31, rainfall = 72.5, soil = 41.5))

R neuralnet on NIR data - poor results
I am attempting to predict a validation set with neuralnet() on NIR spectra, and I have been getting very poor results.
Within the literature, some have reported using a sigmoid function from the input layer to the hidden layer and a linear one from hidden to output, which I have chosen here. I have tried several different iterations (i.e. number of neurons, PLS factors as inputs, etc.) and still have gotten fairly poor results in my validation.
maxs1 <- apply(cal, 2, max)
mins1 <- apply(cal, 2, min)
scaledcal <- as.data.frame(scale(cal, center = mins1, scale = maxs1 - mins1))
maxs2 <- apply(val, 2, max)
mins2 <- apply(val, 2, min)
scaledval <- as.data.frame(scale(val, center = mins2, scale = maxs2 - mins2))
library(neuralnet)
sigmoid = function(x) { 1 / (1 + exp(-x)) }
set.seed(400)
n <- names(cal)
z <- length(n) - 1
f <- as.formula(paste("DM ~", paste(n[!n %in% "DM"], collapse = " + ")))
nn <- neuralnet(f, data = scaledcal, act.fct = sigmoid, hidden = 20,
                linear.output = T, lifesign = "full", threshold = 0.01, stepmax = 1e+07)
pr.nn <- compute(nn, scaledval[, 1:83])
pr.nn_ <- pr.nn$net.result * (max(cal$DM) - min(cal$DM)) + min(cal$DM)
test.cv.r <- (scaledval$DM) * (max(val$DM) - min(val$DM)) + min(val$DM)
MSE.nn <- sum((pr.nn_ - val$DM)^2) / nrow(val)
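One thing worth checking, as a hedged aside: the validation set above is scaled with its own min/max (mins2, maxs2), but the predictions are un-scaled with the calibration range, which mixes two different scales. A common fix is to scale the validation data with the calibration statistics. A minimal sketch with hypothetical stand-in data frames for cal and val:

```r
# Hypothetical stand-ins for the calibration and validation NIR data
set.seed(400)
cal <- data.frame(x = runif(50, 0, 10), DM = runif(50, 9, 22))
val <- data.frame(x = runif(20, 0, 10), DM = runif(20, 9, 22))

maxs1 <- apply(cal, 2, max)
mins1 <- apply(cal, 2, min)

scaledcal <- as.data.frame(scale(cal, center = mins1, scale = maxs1 - mins1))
# Scale the validation set with the CALIBRATION min/max, not its own:
scaledval <- as.data.frame(scale(val, center = mins1, scale = maxs1 - mins1))

# Predictions would then be un-scaled with the same calibration range, e.g.
# pred <- pr.nn$net.result * (maxs1["DM"] - mins1["DM"]) + mins1["DM"]
```

With consistent scaling, the network sees validation inputs on the same scale it was trained on, which often closes a large part of a surprising NN-vs-LM gap.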
The MSE I have been getting on the nn has been anything between 2 and 20; this is surprising because I am getting an MSE of 0.8 (range is 9-22) when performing a simple LM on the same data.
lm.fit <- glm(DM ~ ., data = cal)
pr.lm <- predict(lm.fit, val)
MSE.lm <- sum((pr.lm - val$DM)^2) / nrow(val)
Cheers,

C subroutine in R's lm.fit function
I'm trying to understand how the regression coefficients are generated in the lm.fit function. The only place that math could be done is this line.
143: z <- .Call(C_Cdqrls, x, y, tol, FALSE)
How do I look at the code being called by .Call? Thanks in advance!

How can I create a boxplot in R, by column?
   year month t.max t.min frost rain    NA
25 1961     1   6.1   0.5    13  320  29.2
26 1961     2   9.9   3.8     1  210  55.7
27 1961     3  13.1   3.3     2  390 139.6
28 1961     4  13.9   5.8     0  411  77.2
29 1961     5  15.1   5.4     3   69 188.9
30 1961     6  19.8   8.7     0   84 223.6
31 1961     7  19.3  10.3     0  281 157.4
32 1961     8  19.9  10.8     0  413 184.0
33 1961     9  19.7  10.2     0  152 129.5
34 1961    10  14.5   6.6     1  260 110.8
35 1961    11   9.2   2.7     6  238  60.0
36 1961    12   4.9   1.2    19  429  48.1
37 1962     1   7.4   0.5     8  290  72.0
38 1962     2   7.4   1.1    10   66  63.1
39 1962     3   6.4   1.7    21  131  98.5
40 1962     4  11.5   3.1     3  215 130.4
41 1962     5  14.0   5.7     2  143 132.3
42 1962     6  18.5   7.8     2  414 194.6
43 1962     7  18.9  10.3     0  108 114.9
44 1962     8  18.3  10.0     0  469 130.0
45 1962     9  16.4   8.6     0  489 108.9
46 1962    10  14.1   5.8     2  486  90.2
47 1962    11   8.3   2.1     7  112  37.4
48 1962    12   4.5   2.6    20  188  43.7
I have the following data, and I want to create 12 boxplots (one for each month) of the variable t.max. My problem is that there is more than one of each month, and I don't know how to tell R that I want one boxplot for month 1, one for month 2, etc. Thanks
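The formula interface of boxplot() does this grouping automatically. A minimal sketch, assuming the data above is in a data frame called weather (hypothetical made-up values here so the snippet runs standalone):

```r
# Hypothetical stand-in resembling the data above
set.seed(42)
weather <- data.frame(
  year  = rep(1961:1962, each = 12),
  month = rep(1:12, times = 2),
  t.max = runif(24, 5, 20)
)

# One box per month (12 boxes), pooling all years together
boxplot(t.max ~ month, data = weather,
        xlab = "Month", ylab = "t.max")
```

With real data read from a file, only the data frame construction changes; the boxplot call stays the same.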

xtsum for panel data in Stata - understanding Tbar
I'm using the xt commands, including xtsum, to analyze panel data in Stata. I'm either misunderstanding the output for "Tbar", or I made a mistake in one of my variables.
My xtsum output looks like this:
Variable          Mean      Std. Dev.   Min         Max        Observations
hours   overall   43.83559  31.24379    1           160        N = 1215108
        between             25.89261    1           160        n = 44773
        within              19.69052    -92.83108   188.2023   Tbar = 27.1393
My understanding is that Tbar represents the average number of observations per panel variable (in this case, per person). However, I also have a variable set up that counts the number of observations per person:
sort hcw_id pp_id
egen ppcount = max(pp_id), by(hcw_id)
pp_id is an observation ID per person, and hcw_id is the person ID. I've checked in the Data Editor to verify that pp_id is counting observations per person and ppcount is taking the max value of pp_id for each person. For example, if there are 10 records for a person, each record will be labeled 1-10 with pp_id, and ppcount will be 10.
Here is the confusing thing: the mean of ppcount is 46. This should mean that people in the data have, on average, 46 observations. But why is this so different from Tbar in the xtsum output? Am I misunderstanding something in the documentation for xtsum or is my ppcount variable way off?
By the way, I just thought of another way of double-checking this: pp_id counts the number of payroll records per person. Based on other variables, I know that each person has worked around a year and a half, which would be about 40 biweekly pay periods. This suggests that ppcount is accurate, and Tbar is measuring something different.
Is anyone familiar with xtsum and able to shed some light on this? Thank you!

R - Testing for homo-/heteroscedasticity and collinearity in a multivariate regression model
I'm trying to optimize a multivariate linear regression model
lmMod=lm(depend_var~var1+var2+var3+var4....,data=df)
and I'm presently working on the premises of the model: the constant variance of residuals and the absence of autocorrelation. For this I'm using the Breusch-Pagan test for homo-/heteroscedasticity:
lmtest::bptest(lmMod)
and the Durbin-Watson test for autocorrelation:
durbinWatsonTest(lmMod)
I found examples which are testing either one independent variable at a time:
example for Breusch-Pagan test - one independent variable: https://datascienceplus.com/how-to-detect-heteroscedasticity-and-rectify-it/
example for Durbin-Watson test - one independent variable: http://math.furman.edu/~dcs/courses/math47/R/library/lmtest/html/dwtest.html
or the whole model with several independent variables at a time:
example for Durbin-Watson test - multiple independent variables: https://www.rdocumentation.org/packages/car/versions/2.1-6/topics/durbinWatsonTest
Here are the questions:
1. Can durbinWatsonTest() and bptest() be fed with a whole multivariate model?
2. If the answer to 1 is yes, how is it then possible to determine which variable is causing heteroscedasticity or autocorrelation in the model in order to fix it, as each of those tests gives only one p-value for the entire multivariate model?
3. If the answer to 1 is no, the test should then be performed with one independent variable at a time. But in the case of homoscedasticity, it can only be tested AFTER a particular regression has been modelled. Hence the pattern of homo-/heteroscedasticity in a univariate regression model
lmMod_1 = lm(depend_var ~ var1, data = df)
will be different from the pattern of a multivariate regression model
lmMod_2 = lm(depend_var ~ var1 + var2 + var3 + var4 ..., data = df)
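Regarding how one might tell which variable drives the heteroscedasticity: the (studentized) Breusch-Pagan statistic is just n times the R-squared of an auxiliary regression of the squared residuals on the regressors, so the same idea can be applied per predictor as a screening heuristic. A minimal base-R sketch with hypothetical data (var1 deliberately drives the error variance), so it does not depend on the lmtest package:

```r
set.seed(1)
df <- data.frame(var1 = runif(200), var2 = runif(200))
# Hypothetical response whose error standard deviation grows with var1
df$depend_var <- 1 + 2 * df$var1 + rnorm(200, sd = 0.2 + 2 * df$var1)

lmMod <- lm(depend_var ~ var1 + var2, data = df)
e2 <- residuals(lmMod)^2

# Global Breusch-Pagan-style (Koenker) statistic: n * R^2 of e^2 ~ regressors
aux     <- lm(e2 ~ var1 + var2, data = df)
bp_stat <- length(e2) * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 2, lower.tail = FALSE)

# Per-variable screening: which predictor explains the squared residuals best?
per_var <- sapply(c("var1", "var2"), function(v) {
  summary(lm(e2 ~ df[[v]]))$r.squared
})
```

The variable with the largest auxiliary R-squared is the most likely culprit; this mirrors what lmtest::bptest computes globally, but broken down by regressor.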
Thanks very much in advance for your help!

MARIMA (Multivariate ARIMA Model) Error in optim(init[mask]
Hi All,
I am currently using a multivariate ARIMA model with multivariable data. I followed some tutorials on the web, but I still get an error in this code. I am trying to estimate the data using multiple variables. The original data that I want to load is as follows:
A B C 34.77452838 191.9425766 2 51.34502757 167.3464789 2 52.22462976 163.9371881 2 44.05790246 153.8406923 2 49.01569605 173.6158416 2 43.29661071 168.8358248 2 52.82309532 162.5718306 2 47.71516049 173.088253 2 40.46453333 173.1037056 2 43.96892905 156.5669003 2 50.53534484 168.1469926 2 54.29771912 191.7929009 2 49.84792352 153.2957975 2 47.29999697 157.3547558 2 50.74218512 167.5583718 2 51.04921973 169.2961421 2 42.94721591 161.0413471 2 46.857445 161.5720713 2 59.63383675 168.1864209 2 41.55891144 164.8500058 2 43.80523336 170.6241115 2 40.54447174 163.5741909 2 54.73254502 145.0296938 2 48.37875497 165.6939519 2 54.28330553 185.9884903 2 54.56595576 187.6435893 2 46.3341881 166.420677 2 48.64491069 197.8226879 2 52.16211736 156.9031211 2 49.69608331 174.5718085 2 44.3995806 165.870505 2 54.30303288 172.1445435 2 59.42004347 181.4705789 2 55.66045916 171.5254118 2 50.9424361 167.3649679 2 52.74883235 173.8478608 2 51.88352442 180.565498 2 48.82856619 181.4147966 2 52.91082788 166.5114416 2 61.81566179 186.9227225 2 59.18481278 171.9375519 2 53.73142791 165.4547203 2 45.90694845 167.5522616 2 53.20516884 177.0629353 2 58.42090189 179.9778995 2 48.20763016 181.7318594 2 64.8245225 182.4593924 2 53.4869312 178.0462498 2 50.98494649 198.5952871 2 57.18331504 180.2024605 2 56.07515264 184.3858003 2 53.02835584 119.6381436 2 52.83894384 190.4696428 2 64.60654461 190.7581833 2 57.41005075 187.114192 2 52.50923717 188.1313879 2 51.53594851 151.6875626 2 52.83034897 178.0217738 2 51.62433934 173.5677154 2 57.34844828 185.234996 2 56.90921938 197.0879734 2 51.47780192 201.7652231 2 53.35943985 184.3207373 2 55.12425864 202.156465 2 58.62819278 176.0629181 2 62.51672983 187.3645886 2 61.35114133 219.0208926 2 69.00147164 235.7991778 2 48.19376576 225.5629393 2 59.97428918 221.5355618 2 61.42004728 221.3157335 2 56.09382308 256.9632484 2 58.12355614 228.7762196 2 61.71775579 243.6576741 2 58.30987263 226.1145379 2 66.17636418 139.7724771 2 55.87241828 205.2602213 2 
55.59852946 223.8182453 2 61.9962275 199.569491 2 61.64428604 209.228291 2 60.89585352 212.1291189 2 35.60671771 147.2426824 2 38.79108548 186.9673003 2 47.54260194 112.340896 2 55.62540507 178.5126261 2 53.74473393 211.8612091 2 46.5023706 196.6053041 2 56.1878252 216.0413246 2 60.25617886 213.8808684 2 55.06903768 191.5933198 2 46.60528803 193.228591 2 61.16524374 196.2754949 2 66.04131413 215.8749835 2 55.65532315 195.3569658 2 64.23627043 188.9093598 2 61.10553396 206.492755 2 47.89633095 124.2445289 2 58.98819304 124.9045309 2 57.35811675 189.5788539 2 55.1452173 196.5306579 2 55.75572991 221.4020661 2 63.98512149 232.496751 2 55.9264102 194.8117255 2 53.8304683 194.4430075 2 64.15636933 233.0761708 2 66.40452218 225.4115469 2 64.83224857 237.8602368 2 61.32524633 273.593856 2 59.92230034 255.1825145 2 55.16024733 233.8121403 2 48.41788888 213.9495593 2 45.87945926 182.9052714 2 37.12352931 214.9122403 2 36.23515606 184.767225 2 38.80360508 203.3612199 2 33.82899797 189.0318566 2 29.51428461 184.9648764 2 28.57971239 186.4950075 2 24.77523589 188.7776908 2 31.33283567 197.7679355 2 40.3275789 186.8421363 2 41.6556977 193.8692738 2 41.73315692 197.7107161 2 49.60428262 188.7511304 2 41.20251334 219.1196343 2 43.95248687 210.6029669 2 44.04459047 203.7891106 2 41.86283159 222.413766 2 44.25611329 219.7560903 2 53.94720805 244.3697403 2 47.52787864 233.2155174 2 59.11544025 83.04917738 2 51.94031441 221.0877964 2 54.44512963 180.7172578 2 74.13162887 141.8472405 2 58.41810071 52.32844725 2 34.09419501 137.1900364 2 44.34173286 204.0982826 2 52.73150897 204.2542315 2 52.38756931 215.1004646 2 54.43326163 198.5332844 2 52.59076023 213.0977851 2 56.87320662 200.364619 2 60.35769665 226.456495 2 55.56278813 183.1343606 2 51.77700472 204.6629449 2 56.84940159 220.0856916 2 50.40147328 234.4078768 2 57.58252537 227.4383631 2 54.13363087 200.6441259 2 64.28838575 187.2168003 2 50.38259399 229.655229 2 61.97002161 169.5985726 2 53.4693228 228.3208836 2 43.46435559 
227.805616 2 50.76040995 221.5824154 2 37.74385309 243.5761809 2 44.50296307 260.2599051 2 52.93274188 208.5485301 2 48.94673824 229.0260541 2 44.69142139 235.2965628 2 45.87330532 230.5345631 2 50.15886772 231.5080565 2 50.46518803 233.3240541 2 39.64570737 252.3943345 2 50.31436396 257.1978301 2 45.72599959 255.5499076 2 35.90772772 235.547594 2 50.64034235 256.1243588 2 42.75332725 250.8706385 2 43.32600331 218.2298083 2 51.61082256 256.5452666 2 44.12634993 239.613958 2 46.19641519 238.9780285 2 51.83620155 215.1587094 2 43.97586417 219.6388278 2 48.92009437 251.0237063 2 49.06462419 71.92791063 2 55.18403387 262.2636031 2 32.57642794 236.0216013 2 43.32588315 217.0611338 2 47.7579248 244.944423 2 47.83448696 218.8873504 2 59.93806362 224.4503085 2 47.25084496 191.4388145 2 53.37664378 215.1277665 2 48.80090845 203.8133075 2 43.18675303 202.2225439 2 50.6951052 191.3762616 2 51.30918467 219.4169811 2 47.40389884 239.3385311 2 41.98253107 224.8199498 2 48.2856251 213.795973 2 44.51645684 218.5239546 2 47.280002 209.8027239 2 40.54046452 220.6853516 2 45.78141475 229.3941645 2 51.80096102 240.4784644 2 61.72670174 237.1822198 2 49.78334701 257.9867854 2 50.54248619 214.5982823 2 48.62856185 223.9721336 2 48.48320973 229.0910186 2 50.2515676 26.9172095 2 50.27965224 256.2915689 2 56.12183511 245.8979531 2 55.52810276 271.3756176 2 44.64924526 222.7377583 2 50.11476493 259.5392736 2 46.87115741 228.0808933 2 56.00424314 199.4617359 2 48.82094455 245.3930835 2 53.08217335 229.5334666 2 51.90354311 244.6746334 2 43.06275117 246.7410669 2 50.51734376 232.7116094 2 48.31047595 232.380875 2 49.67464817 217.3068271 2 57.93947721 231.5371941 2 56.18981779 248.290565 2 50.98111296 275.0819118 2 42.7040962 228.5930888 2 54.32273841 237.5194329 2 40.49863088 246.5566334 2 44.50957453 271.3497973 2 48.52619112 282.228831 2 55.82672036 271.9136146 2 49.16413927 279.3997284 2 57.59784043 281.7114483 2 52.18583846 272.4423745 2 49.37065029 267.0389461 2 47.12135267 236.7228498 2 
52.00103676 248.8965285 2 51.79680061 279.691718 2 55.56773472 282.9113431 2 50.85593855 227.6402154 2
Here is the code to load this file and run MARIMA:
# Remove all variables
rm(list = ls())
library(marima)   # load the marima library
library(tseries)  # load the tseries library
library(ggplot2)
options(max.print = 1000000)
setwd("E:/LoRA Technology/myPaper")
mydata <- read.csv("Copy of combining_2.csv", header = TRUE)
Model5 <- define.model(kvar = 3, ar = 1, ma = 1)
Marima5 <- marima(ts(mydata), Model5$ar.pattern, Model5$ma.pattern, penalty = 1)
nstart <- 418
nstep <- 10
However, I got the following error when I tried to run Marima5:
Error in optim(init[mask], armaCSS, method = optim.method, hessian = FALSE, : non-finite value supplied by optim
I just don't know how to solve this, do you have any solution for this?
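That optim error usually means some value fed to the likelihood became NaN or Inf. A hedged first step is simply to check the input series for non-finite values and for constant columns before calling marima; note that the third column C in the data above appears to be the constant 2, and a zero-variance series can make the estimation degenerate. A base-R sketch with a hypothetical stand-in for mydata:

```r
# Hypothetical stand-in for mydata (three columns, C constant as in the question)
set.seed(1)
mydata <- data.frame(A = rnorm(100, 50, 8),
                     B = rnorm(100, 200, 30),
                     C = rep(2, 100))

# 1) Any NA/NaN/Inf values that would break the optimizer?
bad <- sapply(mydata, function(x) sum(!is.finite(x)))

# 2) Any constant (zero-variance) columns?
vars <- sapply(mydata, var)
const_cols <- names(vars)[vars == 0]
```

If const_cols is non-empty, dropping those columns (and reducing kvar accordingly) before fitting is worth trying; this is a diagnostic suggestion, not a guaranteed fix for marima.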
Best Regards M Idham Habibie

R - Explain variation between groups in panel data
I have a panel data set as follows:
Year Industry FDI RW TFP IY CY GDP LP IR ER PS
1998 AGR      xx  xx xx  xx xx xx  xx
1998 MAN      xx  xx xx  xx xx xx  xx
1998 NTR      xx  xx xx  xx xx xx  xx
1998 TRN      xx  xx xx  xx xx xx  xx
1999 AGR      xx  xx xx  xx xx xx  xx
1999 MAN      xx  xx xx  xx xx xx  xx
1999 NTR      xx  xx xx  xx xx xx  xx
1999 TRN      xx  xx xx  xx xx xx  xx
...
I want to build a multivariate model that can explain the variation in FDI between the Industry groups using the variables RW, TFP, IY, CY, GDP, LP.
The variables, RW, TFP, IY, CY, GDP, LP are specific to the Industry. The variables IR, ER, PS are not specific to any industry but I would like to be able to incorporate them into the model if possible.
How would I go about doing this in R?
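A hedged base-R sketch of one common approach: an OLS model with Industry fixed effects, where the economy-wide variables (IR, ER, PS) enter as ordinary regressors since they vary only over time. All data here is hypothetical, generated just to make the snippet self-contained; the plm package offers more specialized panel estimators.

```r
set.seed(1)
# Hypothetical balanced panel with the column names from the question
panel <- expand.grid(Year = 1998:2005,
                     Industry = c("AGR", "MAN", "NTR", "TRN"))
# Industry-specific regressors (stand-ins)
panel$RW  <- runif(nrow(panel))
panel$TFP <- runif(nrow(panel))
panel$GDP <- runif(nrow(panel))
# Economy-wide variable: one value per year, merged onto every industry
yearly <- data.frame(Year = 1998:2005, IR = runif(8))
panel  <- merge(panel, yearly, by = "Year")
# Hypothetical response
panel$FDI <- 1 + 0.5 * panel$RW + rnorm(nrow(panel), sd = 0.1)

# Industry fixed effects absorb level differences between the groups
mod <- lm(FDI ~ RW + TFP + GDP + IR + factor(Industry), data = panel)
```

The factor(Industry) coefficients measure group-level differences relative to the baseline industry, while the remaining coefficients describe how the continuous variables explain variation in FDI.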

understanding the Durbin-Watson test in R
I seem to misunderstand how the dwtest of the lmtest package in R works.
> r1 <- runif(1000, 0.0, 1.0)
> r2 <- runif(1000, 0.0, 1.0)
> dwtest(lm(r2 ~ r1))

	Durbin-Watson test

data:  lm(r2 ~ r1)
DW = 1.9806, p-value = 0.3789
alternative hypothesis: true autocorrelation is greater than 0

> r1 <- seq(0, 1000, by = 1)
> r2 <- seq(0, 1000, by = 1)
> dwtest(lm(r2 ~ r1))

	Durbin-Watson test

data:  lm(r2 ~ r1)
DW = 2.2123, p-value = 0.8352
If I understand everything right, I first test 2 sets of random numbers against each other (which do not correlate - correct).
Then I test the numbers from 1 to 1000 against themselves (which reportedly does not correlate - uhm... what?).
Can someone point me to the obvious error I'm making?
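For context, the Durbin-Watson test checks autocorrelation of the model residuals, not correlation between r1 and r2; with r2 equal to r1, the fit is perfect and the residuals are essentially numerical noise, so DW stays near 2. A base-R sketch computing the DW statistic by hand on deliberately autocorrelated errors, where it does drop well below 2:

```r
set.seed(1)
n <- 1000
x <- runif(n)

# AR(1) errors: each error depends on the previous one
e <- as.numeric(arima.sim(list(ar = 0.7), n = n))
y <- 1 + 2 * x + e

res <- residuals(lm(y ~ x))

# DW statistic: sum of squared successive differences over sum of squares.
# Near 2 => no autocorrelation; well below 2 => positive autocorrelation.
dw <- sum(diff(res)^2) / sum(res^2)
```

With AR(1) errors of coefficient 0.7, dw comes out around 2 * (1 - 0.7) = 0.6, which is the pattern dwtest is designed to flag.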

Hypothesis on a proportion in R
I have a random sample of 100 YES/NO answers. 13 of them are "YES"; the rest are "NO". I have to test a hypothesis on the proportion.
Hypothesis: H0: p = p0
H1: p > p0
Confidence level is 95%
I have the following code: (The z.prop function calculates the test statistic.)
z.prop = function(k, n, p, p0) {
  zeta = (p - p0) / sqrt(p0 * (1 - p0) / n)
  return(zeta)
}
k <- 13
n <- 100
p <- k / n
p0 <- 0.1
z <- z.prop(k, n, p, p0)
cat("z: ", z)
z.alpha <- qnorm(0.05, lower.tail = FALSE)
cat("z alpha: ", z.alpha)
pval <- pnorm(abs(z), lower.tail = FALSE)
cat("p-value", pval, "\n")
But if I use the following code instead, the p-value is different:
binom.test(k, n, p = p0,alternative ="greater",conf.level = 0.95)
Using my function I got 0.1586553 for the p-value. Using the binom.test function I got p-value = 0.1982.
How is this possible? Is my code wrong, or is it just some kind of rounding error? Thank you.
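The difference is expected rather than a bug: z.prop uses the normal approximation, while binom.test computes the exact binomial tail probability, and with n = 100 and p0 = 0.1 the approximation is rough. A sketch reproducing both numbers in base R:

```r
k <- 13; n <- 100; p0 <- 0.1

# Normal-approximation p-value (what z.prop computes); z = 1 here
z <- (k/n - p0) / sqrt(p0 * (1 - p0) / n)
p_normal <- pnorm(z, lower.tail = FALSE)

# Exact binomial tail P(X >= 13) (what binom.test computes)
p_exact <- pbinom(k - 1, n, p0, lower.tail = FALSE)

# A continuity-corrected normal approximation lands much closer to the exact value
z_cc <- (k/n - p0 - 0.5/n) / sqrt(p0 * (1 - p0) / n)
p_cc <- pnorm(z_cc, lower.tail = FALSE)
```

So 0.1586553 vs 0.1982 is the usual gap between a plain normal approximation and the exact test, not a rounding error; the continuity correction explains most of it.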

Converting categorical values into numeric in R
Before I do my hypothesis test in R, how do I convert the values in my 'gender' variable to numeric? I'm looking for a correlation between party votes and gender, so I need to convert the values of the gender variable (Male/Female/NA) to numeric to use my formula. Thanks in advance.
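A minimal base-R sketch, assuming gender is a character or factor vector with NAs (the example vector below is hypothetical). For a correlation, an explicit 0/1 indicator is usually more interpretable than arbitrary factor codes:

```r
gender <- c("Male", "Female", NA, "Female", "Male")

# Option 1: underlying factor codes (alphabetical: Female = 1, Male = 2)
gender_code <- as.numeric(factor(gender))

# Option 2: explicit 0/1 indicator, keeping NA as NA
gender_male <- ifelse(gender == "Male", 1,
                      ifelse(gender == "Female", 0, NA))
```

With option 2, cor(votes, gender_male, use = "complete.obs") would give a point-biserial style correlation while dropping the NA rows.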

ggpubr: change font size of stat_compare_means Kruskal-Wallis p-values
How can I change the font size of stat_compare_means on the plot below? I.e., change the font size of "Kruskal-Wallis, p = 1.5e-09" and the other p-value labels? I would like to use a smaller font size than the default one. Example data follows:
library(ggpubr)
data("ToothGrowth")
compare_means(len ~ dose, data = ToothGrowth)
# Visualize: specify the comparisons I want
my_comparisons <- list(c("0.5", "1"), c("1", "2"), c("0.5", "2"))
# Plotting
ggboxplot(ToothGrowth, x = "dose", y = "len",
          color = "dose", palette = "jco") +
  stat_compare_means(comparisons = my_comparisons) + # Add pairwise comparison p-values
  stat_compare_means(label.y = 50)                   # Add global p-value
Plot:
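If I recall the ggpubr API correctly (worth double-checking against the package documentation), the labels are drawn with text geoms, so passing a size argument should shrink the p-value text:

```r
library(ggpubr)
data("ToothGrowth")
my_comparisons <- list(c("0.5", "1"), c("1", "2"), c("0.5", "2"))

p <- ggboxplot(ToothGrowth, x = "dose", y = "len",
               color = "dose", palette = "jco") +
  # Pairwise brackets: extra args are forwarded to the underlying signif
  # layer; if `size` does not shrink the bracket text, try `textsize`
  # (the ggsignif argument) instead
  stat_compare_means(comparisons = my_comparisons, size = 3) +
  # Global Kruskal-Wallis label: drawn with a text geom, so `size` applies
  stat_compare_means(label.y = 50, size = 3)
```

This is a sketch rather than a confirmed answer; the bracket labels and the global label are controlled by separate layers, so each stat_compare_means call may need its own size setting.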

Breaking Ties before nonparametric tests
I have a bunch of data, and when running Kruskal-Wallis post-hoc Nemenyi pairwise tests on it I get ties. I'm by no means very fluent in stats, so I've tried a couple of things to deal with this problem.
I first tried ranking the data manually with
rank(..., ties.method = "first")
and this gave a similar result to posthoc.kruskal.nemenyi.test (package PMCMR), which detected ties but spat out what I would expect when looking at the data visualization. However, I found the "Chisq" correction that's built into the command, and this appears to inflate my p-values pretty dramatically. I don't exactly understand what the chisq correction is doing, nor do I understand what a chisq distribution is, so, given my level of understanding, I'm leaning toward using the manual ranking method. I should mention that the kruskal.test results before and after manual ranking are not far off from each other (X^2 = 25.202, df = 6, p = 0.000313; X^2 = 25.586, df = 6, p = 0.00265, respectively). I wanted to get an expert opinion on the matter.
Thanks!