How to subset from aov summary in R?
Maybe this is a simple question, but I'm wondering how I can subset the DF and F value columns for the terms appearing in an aov summary?
For example, using the base R built-in dataset npk, how can I extract the residual and other DFs and F values that appear in the summary of the following model:

fit <- summary(aov(yield ~ block + N * P + K, data = npk)) # example is fully reproducible
P.S. I'm looking for base R solutions.
1 answer

The fit output is a list of length 1 (by checking str(fit)). We extract the single element with [[ and then use $ or [[ to extract the components:

fit[[1]]$Df
#[1]  5  1  1  1  1 14   # where 14 is the Residuals df
fit[[1]]$`F value`
#[1]  4.391098 12.105541  0.537330  6.088639  1.361073        NA
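Since fit[[1]] is a data frame, several components can also be collected at once into a tidy table; a base-R sketch (the tab/res names are just illustrative):

```r
# Fit the model and summarise it (same model as in the question)
fit <- summary(aov(yield ~ block + N * P + K, data = npk))

# The summary is a list of length 1; its first element is a data frame,
# so term names, DFs, and F values can be pulled out together
tab <- fit[[1]]
res <- data.frame(term    = trimws(rownames(tab)),  # row names carry padding
                  Df      = tab$Df,
                  F.value = tab$`F value`)
res
```

The Residuals row is included automatically, with NA in the F.value column.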
See also questions close to this topic

Add Hive odbc driver to server
I'm using hosted RStudio on Red Hat CentOs7.
I would like to connect to a Hive database and was looking at the odbc package after reading a how-to blog post on the RStudio site.
Example code from the page:
library(odbc)
con <- dbConnect(odbc::odbc(),
                 driver = <driver>,
                 host = <host>,
                 dbname = <dbname>,
                 user = <user>,
                 password = <password>,
                 port = 10000)
This is my current objective, to create a connection to Hive. The part that's tripping me up is the driver.
On the Hortonworks add-ons page I copied a link to the CENTOS7 (64-bit) driver: https://public-repo-1.hortonworks.com/HDP/hive-odbc/2.1.16.1023/Linux/EL7/hive-odbc-native-2.1.16.1023-1.el7.x86_64.rpm
Then, on my linux server:
sudo wget https://public-repo-1.hortonworks.com/HDP/hive-odbc/2.1.16.1023/Linux/EL7/hive-odbc-native-2.1.16.1023-1.el7.x86_64.rpm
sudo yum install hive-odbc-native-2.1.16.1023-1.el7.x86_64.rpm
Everything appeared to work OK up till this point.
I added the driver to the con call:
con <- dbConnect(odbc::odbc(),
                 driver = 'hive-odbc-native-2.1.16.1023-1.el7.x86_64',
                 host = 'example.com',
                 dbname = 'mydb',
                 user = 'doug',
                 password = 'password123',
                 port = 10000)
However, R tells me: "Error: nanodbc/nanodbc.cpp:950: 01000: [unixODBC][Driver Manager]Can't open lib 'drivers/hive-odbc-native-2.1.16.1023-1.el7.x86_64' : file not found".
How can I use this odbc driver for my connection?
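The driver argument must be either a driver name registered with unixODBC or the full path to the driver's shared library, not the RPM package name. A sketch of an /etc/odbcinst.ini entry (the .so path is an assumption based on where the Hortonworks RPM typically installs; verify it on your system):

```ini
; /etc/odbcinst.ini -- register the driver under a short name.
; Check the actual install path with: rpm -ql hive-odbc-native | grep '\.so'
[Hortonworks Hive ODBC Driver]
Driver = /usr/lib/hive/lib/native/Linux-amd64-64/libhortonworkshiveodbc64.so
```

With the entry in place, pass driver = "Hortonworks Hive ODBC Driver" to dbConnect(); alternatively, pass the full path to the .so file directly as the driver value.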

Fill-in-the-blanks type question generation NLP (non-English)
I am working on a project related to NLP for the Mandarin language. The objective is to generate fill-in-the-blanks type questions from the corpus text data.
Is there any related existing work, or any references to start with, especially for non-English languages? Suggestions welcome.
Thanks in advance.

How to join two data frames in R
I have two data frames, dt and dt1, as follows:

dt                  dt1
id  date            id  date
1   20180920        1   20180920
1   20180918        2   20180914
2   20180916        3   20180914
2   20180915
How do I combine the two data frames to get the output:

dt2
id  date
1   20180920
1   20180920
1   20180918
2   20180916
2   20180915
2   20180914
3   20180914
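If the expected dt2 is simply both sets of rows stacked and then sorted, base R can produce it without any packages (a sketch; note duplicates are kept, matching the expected output -- use unique() if you want them dropped):

```r
# Rebuild the two frames from the question (dates as yyyymmdd numbers)
dt  <- data.frame(id = c(1, 1, 2, 2), date = c(20180920, 20180918, 20180916, 20180915))
dt1 <- data.frame(id = c(1, 2, 3),    date = c(20180920, 20180914, 20180914))

# Stack the rows, then order by id (ascending) and date (descending)
dt2 <- rbind(dt, dt1)
dt2 <- dt2[order(dt2$id, -dt2$date), ]
rownames(dt2) <- NULL
dt2
```

If instead you need a keyed join (matching rows on id), look at base R's merge().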

What is the common name for mapping one range of numbers to another range of numbers?
I'm pretty sure it's a common pattern, but I'm looking for the name of the pattern where you map one range of numbers to another range of numbers. Something like:

Map(from1: 60, to1: 90, from2: 100, to2: 140, value: 75);  // Result: 120 (middle of from2/to2)
Map(from1: 60, to1: 90, from2: 100, to2: 140, value: 30);  // Result: 100 (clamped bottom)
Map(from1: 60, to1: 90, from2: 100, to2: 140, value: 60);  // Result: 100 (bottom)
Map(from1: 60, to1: 90, from2: 100, to2: 140, value: 500); // Result: 140 (clamped to2)
Map(from1: 60, to1: 90, from2: 100, to2: 140, value: 85);  // Result: 133.33 (in between)
What is the name for this method? I'm specifically looking for a solution in Unity, but I'm pretty sure if I know the name of the pattern I can find it.
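This is usually called remapping a range (linear interpolation after normalising, plus clamping); in Unity the pieces exist as Mathf.InverseLerp, Mathf.Lerp, and Mathf.Clamp. The formula the examples imply, sketched in R (the remap name is just illustrative):

```r
# Remap value from [from1, to1] to [from2, to2], clamped to the target range
remap <- function(from1, to1, from2, to2, value) {
  t <- (value - from1) / (to1 - from1)  # normalise to [0, 1]
  t <- min(max(t, 0), 1)                # clamp
  from2 + t * (to2 - from2)
}

remap(60, 90, 100, 140, 75)   # 120
remap(60, 90, 100, 140, 500)  # 140 (clamped)
```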

VBA function returns #VALUE in Excel with the message: A value used in the formula is of the wrong data type
I am just beginning to learn VBA. I am trying to write a loop that will act as a solver to find the number of payments for a loan:
Function FonctionValue(rate As Double, pmt1 As Double, loan As Double, nbpmt As Double)
    FonctionValue = Application.WorksheetFunction.pmt(rate, nbpmt, loan) - pmt1
End Function

Function Npmt(nbpmtlow As Double, nbpmthigh As Double, rate As Double, loan As Double, pmt1 As Double) As Variant
    Dim i As Integer
    Dim nbpmt As Double
    For i = 1 To 100
        flow = FonctionValue(rate, pmt1, loan, nbpmtlow)
        fhigh = FonctionValue(rate, pmt1, loan, nbpmthigh)
        value = nbpmtlow - flow * (nbpmthigh - nbpmtlow) / (fhigh - flow)
        fnbpmt = FonctionValue(rate, pmt1, loan, nbpmt)
        If fhigh * fnbpmt > 0 Then
            nbpmthigh = value
        Else
            nbpmtlow = value
        End If
    Next i
    value = Npmt
End Function

Is it possible to do a PHP operation inside AJAX?
Is it possible to do a PHP operation inside AJAX? I'm trying to put a Google Maps API marker on my map while receiving the JavaScript attribute, to transform it into a PHP variable.
function rootFolder() {
    var variable = this.getAttribute('data-value');
    $.ajax({
        type: "POST",
        url: 'MapScript.php',
        data: { variable : variable },
        success: function(data) {
            var markerLastFind = new google.maps.Marker({
                position: {lat: <?php echo $row_last['latitude'] ?>, lng: <?php echo $row_last['longitude'] ?>},
                map: map,
                icon: iconBlue2,
            });
        }
    });
}

Minimal glmnet example for factors
I am trying to understand how to use the R package glmnet.
Suppose I have a dataset, representing games played between two teams, with the 'win' column defining the result.
library(RcppAlgos)
library(dplyr)
data <- RcppAlgos::permuteGeneral(c("A", "B", "C", "D", "E"), 2, repetition = TRUE) %>%
  as.data.frame() %>%
  setNames(c("team1", "team2")) %>%
  mutate(win = rbinom(25, 1, 0.5))
where 1 represents that team1 won, and 0 represents that team1 lost.
I now want to run this data through glmnet, with the 'win' column as the response.
I know that I need to use model.matrix with my factor variables, but it doesn't seem to me that that would give the right result.
For example:
x <- model.matrix(data$win ~ data$team1 + data$team2)
fit <- glmnet(x, data$win)
Can anyone help?
Thanks!
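For what it's worth, a self-contained sketch of the design-matrix step (expand.grid stands in for RcppAlgos::permuteGeneral so no extra packages are needed; dropping the intercept column is the usual convention, since glmnet fits its own intercept):

```r
set.seed(1)
# 25 team pairings, as in the question, with a simulated 0/1 win column
data <- expand.grid(team1 = LETTERS[1:5], team2 = LETTERS[1:5])
data$win <- rbinom(25, 1, 0.5)

# Dummy-code the two factors; drop column 1 (the intercept)
x <- model.matrix(~ team1 + team2, data = data)[, -1]
dim(x)  # 25 rows, 8 dummy columns (4 per 5-level factor)

# The glmnet call would then be (family binomial for a 0/1 response):
# fit <- glmnet::glmnet(x, data$win, family = "binomial")
```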

One-shot learning for a regression task
I know one-shot learning can be used for classification, as in the Siamese network, but can we use one-shot learning for a regression task?

Centering variables for multiple regression - interested in group effects
I'm trying to run a multiple regression model looking at the length-weight relationship in fish, so y = weight and x = length. What I specifically want to examine is whether the length-weight relationship differs between different populations (same species). I've run the model as:
weight = length * population
BUT I have also been reading a lot about centering data in regression models. It makes no sense to me to grand-mean centre length for this analysis, as I'm specifically interested in the differences in the length-weight relationship between the groups, but should I group-centre length? Or not centre at all?
Any help or pointers greatly appreciated.
Cheers. G.
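If you do decide to group-centre, base R's ave() returns the group mean for each row, so centring is a one-liner (a sketch with made-up numbers):

```r
# Toy data: two populations with different length ranges (made-up numbers)
fish <- data.frame(population = rep(c("A", "B"), each = 4),
                   length = c(10, 12, 14, 16, 20, 22, 24, 26))

# Subtract each population's own mean length
fish$length_c <- fish$length - ave(fish$length, fish$population)

# Each population's centred lengths now average to zero
tapply(fish$length_c, fish$population, mean)
```

The interaction coefficients from weight ~ length_c * population then compare slopes with the main population effects evaluated at each group's mean length.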

Equalize number of trials in two data subsets based on overlapping distributions
I have an issue and could not find the solution by myself. Here is some background on my problem and how the data set is organized:
I did an experiment where subjects (n = 14) had to respond on a keyboard to stimuli presented on a screen. They could get a monetary penalty for incorrect responses in two different conditions; these two conditions are called Penalty 4 and Penalty 14 in the following. I measured the decision time (DT) of these subjects in the task, among other variables. As expected, the DT distribution is shifted to the right when the penalty was 14, as subjects waited longer to respond (they were more cautious). Hence, the average DT is longer in the Penalty 14 condition than in the Penalty 4 condition.
The group-level density distributions for Penalty 4 (black) and Penalty 14 (green) are represented in the figure here; vertical lines represent the group-level averages.
The DT data is in a dataset which we could call 'data' (hence present in data$DT). In this dataset, each row represents a trial and each column represents a dependent or an independent variable. Here are some of the columns containing variables of interest: data$Trialnbr (the number of the trial in which the DT and other variables were recorded), data$Subjectnbr (from 1 to 14), data$Penalty (4 or 14), data$EMGactivity (the quantity of muscle activity we recorded when subjects responded on the keyboard), etc.

Here is my issue: when I average 'EMGactivity' for a given subject in each penalty condition, the resulting averages could actually depend on the DT (since DT differs between the two penalty conditions). I would like to get rid of this confounding factor. To do so, I would like to homogenize the average DT across the two penalty conditions in each subject. One way to do so would be to select, in each subject, the trials present in the fraction of the distributions plotted above that overlaps (i.e., where the green distribution overlaps the black one).

In other words, I'd like to bin the distribution for each subject and equalize the number of trials in the Penalty 4 and Penalty 14 conditions in each bin. I think that doing so would give me a comparable average DT in each penalty condition. Then, I could compute the average 'EMGactivity' for each subject and each penalty condition without the issue that the averages come from trials in which the DT was different.
How could I do this? Is there any R function already doing it?
Thank you in advance for your help,
Gerard
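One base-R way to implement the per-bin equalization described above, sketched on simulated single-subject data (the Subjectnbr/Penalty/DT column names follow the question; the 10-bin choice and the random subsampling scheme are assumptions to adapt):

```r
set.seed(1)
# Simulated data for one subject, in the shape described in the question
data <- data.frame(Subjectnbr = 1,
                   Penalty = rep(c(4, 14), each = 100),
                   DT = c(rnorm(100, 500, 50), rnorm(100, 560, 50)))

# Bin the DT range (10 bins here; adjust as needed)
breaks <- seq(min(data$DT), max(data$DT), length.out = 11)
data$bin <- cut(data$DT, breaks, include.lowest = TRUE)

# Safe index sampler (sample() misbehaves on length-1 vectors)
pick <- function(idx, n) idx[sample.int(length(idx), n)]

# In each bin, keep the same number of trials from each penalty condition
matched <- do.call(rbind, lapply(split(data, data$bin), function(d) {
  n <- min(table(factor(d$Penalty, levels = c(4, 14))))
  if (n == 0) return(NULL)  # bin has no overlap between conditions
  rbind(d[pick(which(d$Penalty == 4), n), ],
        d[pick(which(d$Penalty == 14), n), ])
}))

table(matched$Penalty)  # equal counts in the two conditions by construction
```

For the full dataset, wrap this in lapply(split(data, data$Subjectnbr), ...) so each subject is matched separately.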

How to optimize a double for loop in R
I am using a nested for loop to subset the data based on each category and its unique subcategories. I have categorical columns like age, gender, state, and region. Based on each category I subset the data in the nested for loop by subcategory; for instance, age contains 97 unique subcategories, gender contains 3, region 4, and so on. After subsetting, each subcategory's data is passed to my function. It takes a long time to execute because of the nested for loop. How can I optimize my code? My code is here:
get_forecasting_daywise_category <- function(data, forecast_by, period, freq,
                                             input_key_column, input_date_column,
                                             input_amt_column, categorical_columns) {
  forecasted_category <- list()
  for (i in 1:length(categorical_columns)) {
    if (categorical_columns[i] %in% names(data) == TRUE) {
      categorical_df_name <- paste(categorical_columns[i], "_df", sep = "")
      forecasted_by_categories <- list()
      for (j in 1:length(unique(data[, categorical_columns[i]]))) {
        categorical_data <- subset(data, data[, categorical_columns[i]] == unique(data[, categorical_columns[i]])[j])
        if (forecast_by == "sales") {
          agg_day <- aggregate(categorical_data[, input_amt_column] ~ categorical_data[, input_date_column], categorical_data, sum)
          names(agg_day) = c(input_date_column, input_amt_column)
          forecast_input_column <- agg_day[, input_amt_column]
        } else if (forecast_by == "customers") {
          agg_day <- aggregate(categorical_data[, input_key_column] ~ categorical_data[, input_date_column], categorical_data, length)
          names(agg_day) = c(input_date_column, input_key_column)
          forecast_input_column <- agg_day[, input_key_column]
        } else if (forecast_by == "average_sales") {
          agg_day <- aggregate(categorical_data[, input_amt_column] ~ categorical_data[, input_date_column], categorical_data, mean)
          names(agg_day) = c(input_date_column, input_amt_column)
          forecast_input_column <- agg_day[, input_amt_column]
        }
        min_day <- min(agg_day[, input_date_column])
        max_day <- max(agg_day[, input_date_column])
        get_autoarima_model <- get_autoarima_model(forecast_input_column, period, min_day, freq)
        if (is.null(get_autoarima_model)) {
          category_forecast <- NULL
        } else {
          forecasted_date <- seq(as.Date(max_day) + 1, by = "day", length.out = period)
          forecasted_date <- as.data.frame(forecasted_date)
          label <- sprintf("D%s", seq(1:period))
          if (forecast_by == "customers") {
            category_forecast <- cbind.data.frame(forecasted_date = forecasted_date, label = label,
                                                  value = round(get_autoarima_model$Point.Forecast))
          } else {
            category_forecast <- cbind.data.frame(forecasted_date = forecasted_date, label = label,
                                                  value = get_autoarima_model$Point.Forecast)
          }
        }
        forecasted_by_categories[[j]] <- list(sub_category = unique(categorical_data[, categorical_columns[i]]),
                                              category_forecast = category_forecast)
      }
    }
    category <- list(category_name = categorical_columns[i])
    category_name <- as.data.frame(category)
    forecasted_category[[i]] <- list(categories = category_name, forecasted_by_categories = forecasted_by_categories)
  }
  return(forecasted_category)
}
Can anyone guide me on how to optimize this, or a different way to achieve this logic?
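The biggest cost above is calling subset() inside the inner loop, which rescans the whole data frame once per subcategory. split() partitions the data in a single pass per category; a minimal sketch of the pattern (made-up columns, with the forecasting call stubbed out by a simple sum):

```r
set.seed(1)
# Made-up data standing in for the real dataset
data <- data.frame(age    = sample(20:60, 1000, replace = TRUE),
                   gender = sample(c("F", "M", "O"), 1000, replace = TRUE),
                   amount = runif(1000))

categorical_columns <- c("age", "gender")

# One split() per category replaces the inner subset() loop entirely
results <- lapply(categorical_columns, function(col) {
  pieces <- split(data, data[[col]])          # all subcategories in one pass
  lapply(pieces, function(d) sum(d$amount))   # replace with your forecast call
})
names(results) <- categorical_columns
```

The per-subcategory model fits still dominate once subsetting is cheap; if that remains too slow, the pieces list is also a natural unit for parallel::mclapply.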

Trying to subset a dataframe in R
I am trying to subset a dataframe by comparing values in two columns. I am using the line below:

open <- open[open$AssignedGroup == open$Assigned.Group, ]

It was working fine, but it doesn't work when some values in the columns have more characters. For example, I got the value below in both columns for the same rows, but the script above is unable to subset the dataframe:

ABC DE Demo Integration E2E test 2
Can anyone please help me to know what is the issue here?
Inserting the first 10 rows for reference:

Num  AssignedGroup                        Priority  Assigned.Group
1    ABC DE Demo Integration E2E test 2   Medium    ABC DE Demo Integration E2E test 2
1    ABC DE Demo Integration E2E test 2   Medium    Group 1
1    ABC DE Demo Integration E2E test 2   Medium    Group 2
2    ABC DE Demo Integration E2E test 2   High      ABC DE Demo Integration E2E test 2
2    ABC DE Demo Integration E2E test 2   High      Group 1
2    ABC DE Demo Integration E2E test 2   High      Group 2
3    ABC DE Demo Integration E2E test 2   Low       ABC DE Demo Integration E2E test 2
3    ABC DE Demo Integration E2E test 2   Low       Group 1
3    ABC DE Demo Integration E2E test 2   Low       Group 2
4    ABC DE Demo Integration E2E test 2   Low       ABC DE Demo Integration E2E test 2
I have inserted the structure of the dataframe for reference:

'data.frame': 82710 obs. of 4 variables:
 $ Num           : chr "INC0615378" "INC0615378" "INC0615378" "INC0615495" ...
 $ AssignedGroup : chr "ABC DE Demo Integration E2E test 2" "ABC DE Demo Integration E2E test 2" "ABC DE Demo Integration E2E test 2" "ABC DE Demo Integration E2E test 2" ...
 $ Priority      : chr "Medium" "Medium" "Medium" "Medium" ...
 $ Assigned.Group: chr "ABC DE Demo Integration E2E test 2" "GROUP 1" "Group 2" "ABC DE Demo Integration E2E test 2" ...
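When visually identical strings fail an == comparison, the usual culprits are stray whitespace or case differences (the str() output above shows "GROUP 1" vs "Group 1", for instance). A sketch of normalising both columns before comparing (whether to fold case is your call; made-up two-row data):

```r
# Two rows that look equal but differ by trailing whitespace / letter case
open <- data.frame(AssignedGroup  = c("ABC DE Demo Integration E2E test 2 ", "Group 1"),
                   Assigned.Group = c("ABC DE Demo Integration E2E test 2",  "GROUP 1"),
                   stringsAsFactors = FALSE)

# Trim whitespace and fold case on both sides before comparing
same <- trimws(tolower(open$AssignedGroup)) == trimws(tolower(open$Assigned.Group))
open[same, ]
```

If normalising fixes it, inspect the raw values with something like which(open$AssignedGroup != trimws(open$AssignedGroup)) to see where the invisible characters come from.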

A fast way to run many anovas and extract certain columns
I have data which consist of a response variable (y) and two factors (sex and time), for several groups:

set.seed(1)
df <- data.frame(y = rnorm(26*18),
                 group = sort(rep(LETTERS, 18)),
                 sex = rep(c(rep("F", 9), rep("M", 9)), 26),
                 time = rep(rep(sort(rep(1:3, 3)), 2), 26))
df$sex <- factor(df$sex, levels = c("M", "F"))
I'd like to test between models using R's anova, for each group, and combine it all in one data.frame that has a column of the anova p-value for each of the factors in the model I'm fitting, where each row is one of the groups I'm running the anova on. Here's what I'm currently doing:

anova.df <- do.call(rbind, lapply(unique(df$group), function(i) {
  an.df <- anova(lm(y ~ sex*time, data = df %>% dplyr::filter(group == i)))
  an.df <- data.frame(factor.name = rownames(an.df)[1:(nrow(an.df)-1)],
                      p.value = an.df[1:(nrow(an.df)-1), which(colnames(an.df) == "Pr(>F)")]) %>%
    tidyr::spread(factor.name, p.value) %>%
    dplyr::mutate(group = i)
  return(an.df)
}))
But in reality I have ~15,000 groups, so I'm wondering if there's any faster way of doing this.
ANOVA dimension reduction on whole data or just on training set
My dataset has 871 samples and 19,900 features. I want to use SVM, and I am using ANOVA to score the features. Should I run the ANOVA on the whole dataset and take the features from that, or should I split the data first, run the ANOVA on the training set only, and then use SVM? Which way is correct? And if I must split first, how can I apply the same features that were selected from my training set?

Statistical analysis of a dependent numerical variable vs Likert-scale variables
I have a sample dataset like:

name <- c("Bob", "JACK", "Steve", "Daniel", "Ed")
ans1 <- c("5", "4", "4", "1", "4")
ans2 <- c("4", "4", "4", "2", "3")
ans3 <- c("4", "3", "4", "2", "5")
hours <- c(200, 100, 150, 10, 500)
datat <- data.frame(name, ans1, ans2, ans3, hours)

As you can see, this dataset has 5 students and their answers to 3 questions on a scale from 1 to 5. You can also see the hours these students spent preparing for these "confidence" tests. I am not sure what kind of analysis I should perform in R on this dataframe, as it seems to me that ans1, ans2, ans3 are correlated since they come from the same student. Is a 2-way anova a solution?

res.aov2 <- aov(hours ~ ans1 + ans2 + ans3, data = datat)
summary(res.aov2)