How to subset from aov summary in R?
Maybe this is a simple question, but how can I subset the Df and F value entries for the terms appearing in an aov summary? For example, using the base R built-in dataset npk, how can I extract the residual and other Df values and F values that appear in the summary of the following model:
fit <- summary(aov(yield ~ block + N * P + K, data = npk)) # example is fully reproducible
P.S. I'm looking for base R solutions.
1 answer

The fit output is a list of length 1 (as checking str(fit) shows). Extract the single element with [[ and then use $ or [[ to pull out the components:

fit[[1]]$Df
#[1]  5  1  1  1  1 14    # where 14 is the Residuals df

fit[[1]]$`F value`
#[1]  4.391098 12.105541  0.537330  6.088639  1.361073        NA
See also questions close to this topic

silhouette calculation in R for a large data
I want to calculate the silhouette for cluster evaluation. There are some packages in R, for example cluster and clValid. Here is my code using the cluster package:
# load the data: a dataset from the UCI website with 434874 obs. and 3 variables
data <- read.csv("./data/spatial_network.txt", sep = "\t", header = F)

# apply kmeans
km_res <- kmeans(data, 20, iter.max = 1000, nstart = 20, algorithm = "MacQueen")

# calculate silhouette
library(cluster)
sil <- silhouette(km_res$cluster, dist(data))

# plot silhouette
library(factoextra)
fviz_silhouette(sil)
The code works well for smaller data, say 50,000 obs., but I get an error like "Error: cannot allocate vector of size 704.5 Gb" when the data is a bit larger. This is presumably also a problem for the Dunn index and other internal indices on large datasets.
I have 32 GB of RAM in my computer. The problem comes from calculating dist(data). I am wondering if it is possible to not calculate dist(data) in advance, and instead compute the corresponding distances only when they are required in the silhouette formula.
I appreciate your help regarding this problem and how I can calculate the silhouette for large and very large datasets.
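One memory-light workaround is the "simplified silhouette", an approximation that replaces pairwise distances with distances to cluster centroids, so the n-by-n dist matrix is never built. A hedged sketch (with stand-in random data instead of the real file):

```r
# Simplified silhouette: a(i) = distance to own centroid,
# b(i) = distance to nearest other centroid, s = (b - a) / max(a, b).
set.seed(1)
data <- matrix(rnorm(2000), ncol = 2)      # stand-in for the real data
km <- kmeans(data, 4, nstart = 5)

# n x k matrix of distances from each point to each centroid
d2cent <- sapply(1:nrow(km$centers), function(k)
  sqrt(rowSums((data - matrix(km$centers[k, ], nrow(data), ncol(data),
                              byrow = TRUE))^2)))

a <- d2cent[cbind(1:nrow(data), km$cluster)]  # distance to own centroid
d2cent[cbind(1:nrow(data), km$cluster)] <- Inf
b <- apply(d2cent, 1, min)                    # nearest other centroid
sil_approx <- (b - a) / pmax(a, b)
mean(sil_approx)
```

Another option, if an approximation on subsamples is acceptable, is cluster::clara, which computes silhouette information on sampled subsets rather than the full distance matrix.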

What is an alternative for this slow for-loop to fill in single days between dates?
For a project I'm working on, I need to have a dataframe to indicate whether a person was absent (0) or not (1) on a particular day.
The problem is: my data is in a format that gives the starting date of absenteeism and then the number of days the person was absent.
Example of my dataframe:
df1 <- data.frame(Person = c(1, 1, 1, 1, 1),
                  StartDate = c("01-01", "02-01", "03-01", "04-01", "05-01"),
                  DAYS = c(3, NA, NA, NA, 1))
Instead of the "Start date" and "number of days absent" per person, it should look like this instead:
df2 <- data.frame(Person = c(1, 1, 1, 1, 1),
                  Date = c("01-01", "02-01", "03-01", "04-01", "05-01"),
                  Absent = c(1, 1, 1, 0, 1))
For now I solved it with this for loop with two if-conditions:

for (i in 1:nrow(df1)) {
  if (!is.na(df1$DAYS[i])) {
    var <- df1$DAYS[i]
  }
  if (var > 0) {
    var <- var - 1
    df1$DAYS[i] <- 1
  }
}
This works, however I have thousands of persons with a full year of dates each, meaning that I have more than 5 million rows in my dataframe. You can imagine how slow the loop is.
Does anyone know a quicker way to solve my problem? I tried looking at the lubridate package to work with periods and dates, but I don't see a solution there.
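A possible vectorized alternative (a sketch, reproducing the loop's logic without iterating per row): carry each person's last non-NA DAYS value forward, count rows since that start, and mark a row absent while it is still within the spell.

```r
df1 <- data.frame(Person = c(1, 1, 1, 1, 1),
                  StartDate = c("01-01", "02-01", "03-01", "04-01", "05-01"),
                  DAYS = c(3, NA, NA, NA, 1))

fill_absent <- function(d) {
  idx <- seq_along(d)
  # index of the most recent non-NA DAYS entry (0 if none yet)
  last_start <- cummax(ifelse(is.na(d), 0L, idx))
  # carry the last non-NA DAYS value forward
  vals <- c(NA, d[!is.na(d)])[cumsum(!is.na(d)) + 1]
  as.integer(last_start > 0 & (idx - last_start) < vals)
}

# ave() applies the function within each person, keeping row order
df1$Absent <- ave(df1$DAYS, df1$Person, FUN = fill_absent)
df1$Absent  # 1 1 1 0 1
```

Because everything is cumsum/cummax arithmetic, this scales to millions of rows without a loop.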

If value in a column starts with...mutate another column with given text, in R
I'm trying to build an if function that allows me to mutate the "city" column of a dataframe with a certain city name if in the "zipcode" column the value starts with a certain number.
For example: if the zipcode starts with 1, set the city column value to "NYC"; else if it starts with 6, set it to "Chicago"; else if it starts with 2, set it to "Boston"; and so on.
Dataframe name: data.frame. Column names: zipcode and city. I want to directly mutate the dataframe as I will need it for further use.
PS: Sorry for bad writing. I'm new in the community.
Thanks in advance!
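A base-R sketch of the idea (dplyr's mutate() with case_when() would work just as well); the digit-to-city mapping and data below are illustrative, not from the question:

```r
# toy data; in practice df already exists with zipcode and city columns
df <- data.frame(zipcode = c("10027", "60614", "21401"),
                 city = NA_character_, stringsAsFactors = FALSE)

# named lookup vector: first zipcode digit -> city name
city_by_digit <- c("1" = "NYC", "2" = "Boston", "6" = "Chicago")

first <- substr(df$zipcode, 1, 1)
df$city <- ifelse(first %in% names(city_by_digit),
                  unname(city_by_digit[first]),
                  df$city)   # leave unmatched rows unchanged
df
```

The named-vector lookup avoids a chain of if/else branches and mutates the dataframe in place.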

WordPress: echo function inside Ajax with foreach loop not working?
I'm really stuck with this :(
I'm trying to echo a function inside an Ajax callback with a foreach loop over a list of users.
To be clear: I have a list of followers/following users inside author.php, and I created Ajax pagination for them with a load-more button. I'm using the "Users Following System" plugin. Below is the plugin function that displays the follow-button link; I echo that function like this:
echo pwuf_get_follow_unfollow_links( $user->ID );
I used this before with another paged users template and it worked without any problems! But when I add it inside the Ajax function it doesn't work: when I click the follow button, the page jumps to the top.
I tried adding #! to the href="#!" link; the page stopped jumping to the top, but the follow button still doesn't work. The console and debugging output are clean, so what am I doing wrong?
Here is all my code; I hope you can find what's wrong.
display-functions.php of the follow button:
function pwuf_get_follow_unfollow_links( $follow_id = null ) {
    global $user_ID;
    if ( empty( $follow_id ) ) return;
    if ( ! is_user_logged_in() ) return;
    if ( $follow_id == $user_ID ) return;
    ob_start(); ?>
    <div class="followlinks">
        <?php
        if ( pwuf_is_following( $user_ID, $follow_id ) ) {
            $classes = "unfollow";
            $text    = "Following";
        } else {
            $classes = "follow";
            $text    = "Follow";
        }
        ?>
        <span><a href="#" class="<?php echo $classes; ?>" data-userid="<?php echo $user_ID; ?>" data-followid="<?php echo $follow_id; ?>"><span><?php echo $text; ?></span></a></span>
        <img src="<?php echo PWUF_FOLLOW_URL; ?>/images/loading.svg" class="pwufajax" style="display:none;"/>
    </div>
    <?php
    return ob_get_clean();
}
Ajax function.php
add_action('wp_ajax_user_following_by_ajax', 'user_following_by_ajax_callback');
add_action('wp_ajax_nopriv_user_following_by_ajax', 'user_following_by_ajax_callback');

function user_following_by_ajax_callback() {
    check_ajax_referer('user_more_following', 'security');
    $paged = $_POST['page'];
    // detect the author data by name or by author id
    $curauth = NULL;
    if ( isset($_POST['author_name']) AND trim($_POST['author_name']) != '' ) {
        $curauth = get_user_by('slug', trim($_POST['author_name']) );
    } elseif ( isset($_POST['author']) AND intval($_POST['author']) > 0 ) {
        $curauth = get_userdata( intval($_POST['author']) );
    }
    if ( !is_null($curauth) ) {
        $include = get_user_meta($curauth->ID, '_pwuf_following', true);
        if ( empty( $include ) ) {
            echo '<div class="nofollower">Not found followers yet <i class="fa fa-slack fa-1x" aria-hidden="true"></i></div>';
        } else {
            $args = array(
                'order'   => 'DESC',
                'include' => $include,
                'number'  => '20',
                'paged'   => $paged
            );
            $wp_user_query = new WP_User_Query( $args );
            $users = $wp_user_query->get_results();
            foreach ( $users as $user ) {
                $avatar = get_avatar($user->user_email, 100);
                $author_profile_url = get_author_posts_url($user->ID);
                $profile = get_userdata($user->ID);
                echo '<div class="membersname9"><a href="', $author_profile_url, '">' . $profile->first_name . '</a>';
                echo '<a href="', $author_profile_url, '">', $avatar, '</a>';
                echo pwuf_get_follow_unfollow_links( $user->ID );
            }
            echo '</div>';
        }
    } else {
        echo '<div class="nofollower">Not found followers yet <i class="fa fa-slack fa-1x" aria-hidden="true"></i></div>';
    }
    wp_die();
}
follow.js
jQuery(document).ready(function($) {
    /*******************************
     follow / unfollow a user
    *******************************/
    $('.followlinks a').on('click', function(e) {
        e.preventDefault();
        var $this = $(this);
        if ( pwuf_vars.logged_in != 'undefined' && pwuf_vars.logged_in != 'true' ) {
            alert( pwuf_vars.login_required );
            return;
        }
        var data = {
            action: $this.hasClass('follow') ? 'follow' : 'unfollow',
            user_id: $this.data('userid'),
            follow_id: $this.data('followid'),
            nonce: pwuf_vars.nonce
        };
        $this.closest('.followlinks').find('img.pwufajax').show();
        $.post( pwuf_vars.ajaxurl, data, function ( response ) {
            if ( response == 'success' ) {
                if ( $this.hasClass( 'follow' ) ) {
                    $this.removeClass( 'follow' ).addClass( 'unfollow' );
                    $this.find( 'span' ).text( 'Unfollow' );
                } else {
                    $this.removeClass( 'unfollow' ).addClass( 'follow' );
                    $this.find( 'span' ).text( 'Follow' );
                }
            } else {
                alert( pwuf_vars.processing_error );
            }
            $this.closest('.followlinks').find('img.pwufajax').hide();
        });
    });
});

WooCommerce loop start with specific category
I am building a WooCommerce shop. I want to build a function to display a specific category at the top of the loop. For example, on the shop page, the first products should come from the Shoes category, and the rest of the products continue normally.
add_action('pre_get_posts', 'shop_filter_cat');
function shop_filter_cat($query) {
    if (!is_admin() && is_post_type_archive( 'product' ) && $query->is_main_query()) {
        $query->set('tax_query', array(
            array(
                'taxonomy' => 'product_cat',
                'field'    => 'slug',
                'terms'    => 'agricole',
            )
        ));
    }
}
Sorry for my bad english and thank you!

Elixir: mirror a tree
I'm new to Elixir and I have a question I'm stuck on.
Implement a function that mirrors a binary tree. A tree is mirrored by mirroring, and swapping the positions of, its two branches.
I appreciate all ideas!

Multiple regression of variables with different units
I'm new to statistical modelling and to R, so please excuse any mistakes in this question.
I want to make a multiple regression model with these variables:
- Revenue (in million USD) as the dependent variable
- Customer experience score (on a scale of 1 to 5) as an independent variable
- Number of package returns (in units) as an independent variable
Since they have different units and the variation is quite big, I'm thinking about standardizing the variables before performing the regression. Would it be better to model with standardized variables, or to run the regression directly? I also read a source about how to rescale variables in R.
But how do I interpret the model if the variables are rescaled and no longer have their original units?
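A small sketch of the trade-off on toy data (variable names are illustrative): standardizing predictors with scale() changes the coefficients' units but not the fit itself, and a standardized slope reads as "revenue change per one-SD increase in x".

```r
set.seed(42)
d <- data.frame(cx_score = runif(100, 1, 5),    # toy "experience score"
                returns  = rpois(100, 50))      # toy "package returns"
d$revenue <- 3 * d$cx_score - 0.5 * d$returns + rnorm(100)

fit_raw <- lm(revenue ~ cx_score + returns, data = d)

# standardize predictors (mean 0, SD 1); [, 1] drops the matrix class
d_std <- transform(d,
                   cx_score = scale(cx_score)[, 1],
                   returns  = scale(returns)[, 1])
fit_std <- lm(revenue ~ cx_score + returns, data = d_std)

# identical predictions; standardized slope = raw slope * sd(x)
all.equal(fitted(fit_raw), fitted(fit_std))
```

So standardizing is a matter of interpretation and numerical convenience, not of model quality: R², residuals, and predictions are unchanged.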

Applying regression and classification to prediction, can I use many records of test data to produce one prediction?
I am trying to make a machine learning based program which can give an early alert of cow's calving/labor time (few hours before). As I search for examples and technique, I am interested in using the predictive maintenance, especially after I read this one in Azure machine learning studio: https://gallery.azure.ai/Experiment/df7c518dcba7407fb855377339d6589f
It has a very clear explanation on the data preparation, etc. And the research questions I would like to adopt are:
- Regression models: how many more cycles will an in-service engine last before it fails?
- Binary classification: is this engine going to fail within several cycles?
Which I will try to adapt to calving prediction, e.g. regression: how many more hours/days until a cow calves? Binary classification: is this cow going to calve within several hours?
Each cow has information on activity features per minute and per hour (such as acceleration and activity type), and previous calving times, which I will use for prediction.
However, I have some confusion about the test data I should use. As far as I understand, with regression or binary classification there is usually only one record per ID (machine or cow). In the Azure case, they also used only one record per engine, the one with the maximum cycle time.
My question is: to predict when one specific cow will calve, I would like to use its historical activity data (for example the activity data for the last 48 hours, which is 48 records per cow), and from those several records for one cow I need one prediction output (hours to calving). Is it possible to do that?
Any help will be appreciated.
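A common way to get one prediction from many records, sketched below with made-up column names: collapse each cow's last-48-hour window into a single feature row (mean, max, SD, ...), then feed that one row per cow to the regression or classifier.

```r
# toy stand-in for 48 hourly records per cow
last48 <- data.frame(cow_id = rep(c("c1", "c2"), each = 48),
                     accel  = runif(96))

# one feature row per cow: summary statistics over its window
feat <- do.call(rbind, lapply(split(last48, last48$cow_id), function(d)
  data.frame(cow_id     = d$cow_id[1],
             accel_mean = mean(d$accel),
             accel_max  = max(d$accel),
             accel_sd   = sd(d$accel))))

feat  # one row per cow, ready for a regression/classification model
```

This mirrors the feature-engineering step in the Azure predictive-maintenance example, where per-cycle sensor readings are aggregated before modelling.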

Using Simex Package to Correct for Measurement Error
I am estimating a regression model that includes a categorical latent variable with 8 levels that was created using Latent Class Analysis. The data is multilevel, so the 3-step approach will not work (confirmed by an expert in this area). During my search, I found that the R package simex is a viable option. Does anyone have experience with this package? If so, how should the correction for latent-class measurement error for each class be specified in R?

subsetting with Relational Operator !=
I have a dataframe df with various columns. Column df$xyz contains about 20 distinct character values. I want to retain 3 values ("HL%", "HH$", "LL$"), and all other values ("truncated", "kk$", "hhb", ...) should be replaced with "other".
That's my data frame:

 xz   xyz
2.5   HL%
4.4   HH$
9.3   kk$
2.4   kk$
4.5   LL$
5.6   truncated
I need:
 xz   xyz
2.5   HL%
4.4   HH$
9.3   other
2.4   other
4.5   LL$
5.6   other
I tried:
df$xyz[df$xyz != "HL%" & df$xyz != "HH$" & df$xyz != "LL$"] <- "other"
That doesn't seem to do the trick.
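One possible fix is %in%, which replaces the chain of != comparisons; a likely reason the assignment "doesn't do the trick" is that xyz is a factor, in which case assigning a new level yields NA, so convert to character first:

```r
df <- data.frame(xz  = c(2.5, 4.4, 9.3, 2.4, 4.5, 5.6),
                 xyz = c("HL%", "HH$", "kk$", "kk$", "LL$", "truncated"),
                 stringsAsFactors = TRUE)   # xyz starts life as a factor

keep <- c("HL%", "HH$", "LL$")
df$xyz <- as.character(df$xyz)    # avoid "invalid factor level" NAs
df$xyz[!df$xyz %in% keep] <- "other"
df$xyz
```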

Subset in R: extract logical factor from multiple columns
I would like to know how to subset in R based on a condition. I have a large object with 10 columns, of which 8 are logical. How can I extract all the rows that are TRUE in any 4 of the 8 logical columns?
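A sketch under the assumption that the 8 logical columns sit in positions 3:10 of the object: rowSums() counts the TRUEs per row, and a threshold of 4 selects the rows.

```r
set.seed(1)
# toy object: 2 non-logical columns followed by 8 logical ones
obj <- data.frame(id = 1:5, x = letters[1:5],
                  matrix(sample(c(TRUE, FALSE), 40, replace = TRUE), 5, 8))

sel <- rowSums(obj[, 3:10]) >= 4   # TRUE in at least 4 of the 8 columns
obj_sub <- obj[sel, ]
```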

Subset data.table by condition but retain all rows belonging to group
I have data that looks like this:
require("data.table")
dt1 <- data.table(
  code = c("A001", "A001", "A001", "A002", "A002", "A002", "A002", "A003", "A003"),
  value = c(40, 38, 55, 10, 12, 16, 18, 77, 87))
I would like to subset it so that any group (code) containing a value over or under a given number is retained. For example, if I wanted any group that contains a value over 50, the result would look like this:

dt2 <- data.table(
  code = c("A001", "A001", "A001", "A003", "A003"),
  value = c(40, 38, 55, 77, 87))
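One possible data.table idiom: evaluate the condition per group with by and keep every row of any group where it holds, via .SD (the subset of data for each group).

```r
library(data.table)
dt1 <- data.table(
  code = c("A001", "A001", "A001", "A002", "A002", "A002", "A002", "A003", "A003"),
  value = c(40, 38, 55, 10, 12, 16, 18, 77, 87))

# keep whole groups where any value exceeds 50
dt2 <- dt1[, .SD[any(value > 50)], by = code]
```

The logical scalar any(value > 50) acts as a keep-all/drop-all row filter on each group's .SD; an equivalent form using row indices is dt1[dt1[, .I[any(value > 50)], by = code]$V1].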

ggplot2: Faceting multiple regressions with slope & p-value labels
I thought I would share some code for those who have struggled to automate adding R2, slope, intercept, and p-value labels onto multi-facet figures. This code shows how to calculate your regression fits and place them onto your figure as labels for multiple regressions (ANCOVA).
Below is how the figure looks following a quick touch-up in vector software.
I tried annotating as many lines as possible. Hopefully this will help some of you!
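A minimal sketch of the same idea on toy data (illustrative column names, not the original post's code): fit one lm() per facet group, format the slope and p-value into a small label data frame, and place it with geom_text().

```r
library(ggplot2)
set.seed(1)
df <- data.frame(grp = rep(c("A", "B"), each = 20), x = rep(1:20, 2))
df$y <- 2 * df$x + rnorm(40)

# one regression per group, collected into a label data frame
labs <- do.call(rbind, lapply(split(df, df$grp), function(d) {
  m <- lm(y ~ x, data = d)
  data.frame(grp = d$grp[1],
             lab = sprintf("slope = %.2f, p = %.3g",
                           coef(m)[2], summary(m)$coefficients[2, 4]))
}))

p <- ggplot(df, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_text(data = labs, aes(x = -Inf, y = Inf, label = lab),
            hjust = -0.1, vjust = 1.5, inherit.aes = FALSE) +
  facet_wrap(~ grp)
```

The x = -Inf / y = Inf trick pins each label to the top-left corner of its own facet.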

Manual Excel computation for standard deviation differs from ANOVA and StatCrunch
I have a problem in Excel where I manually compute a few metrics, e.g. the mean and standard deviation of a frequency table. The mean works out fine, but the standard deviation differs slightly, with an error of 1.68%.
The standard deviation is manually computed from the frequency distribution, spread across multiple columns in Excel, using the formula:

sqrt(sum(frequency * (xi - mean)^2) / sum(frequency))
I get results from the Excel computations for the manual Standard Deviation:
Minimum                 0
Maximum                10
Median                  5
Mode                    5
Mean                    4.6666666666666700
Standard Deviation      2.3711225658371700
Number                 30
For the ANOVA computations (confirmed by Statcrunch):
Minimum                 0
Maximum                10
Median                  5
Mode                    5
Mean                    4.6666666666666700
Standard Deviation      2.4116575117588700
Number                 30
However, you can clearly see that the two standard deviations do not agree. I am inclined to think I have made an error in my computations, but in spite of repeated checking I cannot find the discrepancy, if there is indeed one.
Can anyone think of a suggestion? Any help is gratefully received.
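A likely explanation: the manual formula divides by n (the population standard deviation), while ANOVA/StatCrunch divide by n - 1 (the sample standard deviation). Indeed 2.3711225658 * sqrt(30/29) = 2.4116575118, exactly the second figure. A sketch in R with a toy frequency table (n = 30):

```r
# expand a hypothetical frequency table into raw values
x <- rep(c(0, 3, 5, 7, 10), times = c(3, 7, 10, 7, 3))
n <- length(x)                              # 30

sd_pop  <- sqrt(sum((x - mean(x))^2) / n)   # divide by n (manual formula)
sd_samp <- sd(x)                            # R's sd() divides by n - 1

# converting one to the other recovers the observed discrepancy
all.equal(sd_pop * sqrt(n / (n - 1)), sd_samp)   # TRUE
```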

ANOVA Stats Computation in Spark 2 with Java 8
I have a piece of Java 8 code that computes ANOVA statistics using the Spark SQL API, shown in SNIPPET 1 below. This code segment is adapted from the original Scala code available at https://gist.github.com/srnghn/c74835818802fefabd76f1bcd6746831/77690607caab9039b015d2232c1216500427a995
QUESTION
When I run this as a Spark job I get the error shown in SNIPPET 2 below, where the problem occurs in the DataFrame named "joined". The part where the error occurs is marked with the comment "//!!!! VARIABLE UNDER QUESTION IS AS FOLLOWS !!!!" in SNIPPET 1. Following the definition of this variable, I have provided the original Scala version from the URL above. Could you please point out what I am missing in the Java version? Thanks.
The essence of the problem is the following error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`c.sum(valueSq))`' given input columns: [b.sum(value), d.cat, a.count, c.cat, c.sum(valueSq), b.cat, d.avg(value), a.cat];;
'Project [cat#51, count#74L, sum(value)#70, 'c.sum(valueSq)), 'avg(value))]

SNIPPET 1:
private static AnovaStats computeAnovaStats(SparkSession spark, Dataset<Row> outliersDF) {
    outliersDF.createOrReplaceTempView("outliersDF");
    Dataset<Row> anovaBaseDF = spark.sql(
        "SELECT usercode as cat, cast((frequency) as double) as value FROM outliersDF");
    anovaBaseDF.createOrReplaceTempView("anovaBaseDF");
    Dataset<Row> newDF = spark.sql(
        "SELECT " +
        "A.cat, A.value, " +
        "cast((A.value * A.value) as double) as valueSq, " +
        "((A.value - B.avg) * (A.value - B.avg)) as diffSq " +
        "FROM anovaBaseDF A " +
        "JOIN " +
        "(SELECT cat, avg(value) as avg FROM anovaBaseDF GROUP BY cat) B " +
        "WHERE A.cat = B.cat");
    RelationalGroupedDataset grouped = newDF.groupBy("cat");
    Dataset<Row> sums = grouped.sum("value");
    Dataset<Row> counts = grouped.count();
    long numCats = counts.count();
    Dataset<Row> sumsq = grouped.sum("valueSq");
    Dataset<Row> avgs = grouped.avg("value");
    double totN = toDouble(counts.agg(org.apache.spark.sql.functions.sum("count")).first().get(0));
    double totSum = toDouble(sums.agg(org.apache.spark.sql.functions.sum("sum(value)")).first().get(0));
    double totSumSq = toDouble(sumsq.agg(org.apache.spark.sql.functions.sum("sum(valueSq)")).first().get(0));
    double totMean = totSum / totN;
    double dft = totN - 1;
    double dfb = numCats - 1;
    double dfw = totN - numCats;

    //!!!! VARIABLE UNDER QUESTION IS AS FOLLOWS !!!!
    Dataset<Row> joined = (counts.as("a")
        .join(sums.as("b"), (col("a.cat").$eq$eq$eq(col("b.cat"))))
        .join(sumsq.as("c"), (col("a.cat").$eq$eq$eq(col("c.cat"))))
        .join(avgs.as("d"), (col("a.cat").$eq$eq$eq(col("d.cat"))))
        .select(col("a.cat"), col("count"), col("sum(value)"),
                col("sum(valueSq))"), col("avg(value))")));
    /*
    The original Scala version of the local variable "joined", which is of type
    "Dataset<Row>", is as follows:

    val joined = (counts.as("a").join(sums.as("b"), $"a.cat" === $"b.cat"))
        .join(sumsq.as("c"), $"a.cat" === $"c.cat")
        .join(avgs.as("d"), $"a.cat" === $"d.cat")
        .select($"a.cat", $"count", $"sum(value)", $"sum(valueSq)", $"avg(value)")
    */
    Dataset<Row> finaldf = joined.withColumn("totMean", lit(totMean));
    JavaPairRDD<String, Double> ssb_tmp = finaldf.javaRDD()
        .mapToPair(x -> new Tuple2(x.getString(0),
            ((toDouble(x.get(4)) - toDouble(x.get(4)))
             * (toDouble(x.get(5)) * toDouble(x.get(4))
                - toDouble(x.get(4)) * toDouble(x.get(1))))));
    Dataset<Row> ssbDR = spark.sqlContext()
        .createDataset(JavaPairRDD.toRDD(ssb_tmp),
            Encoders.tuple(Encoders.STRING(), Encoders.DOUBLE())).toDF();
    double ssb = ssbDR.agg(org.apache.spark.sql.functions.sum("_2")).first().getDouble(0);
    Dataset<Row> ssw_tmp = grouped.sum("diffSq");
    double ssw = toDouble(ssw_tmp.agg(org.apache.spark.sql.functions.sum("sum(diffSq)")).first().get(0));
    double sst = ssb + ssw;
    double msb = ssb / dfb;
    double msw = ssw / dfw;
    double fValue = msb / msw;
    double etaSq = ssb / sst;
    double omegaSq = (ssb - ((numCats - 1) * msw)) / (sst + msw);
    AnovaStats anovaStats = new AnovaStats(dfb, dfw, fValue, etaSq, omegaSq);
    return anovaStats;
}

private static double toDouble(Object value) {
    double retVal = 0d;
    if (value instanceof Double) {
        retVal = ((Double) value).doubleValue();
    } else if (value instanceof Long) {
        retVal = ((Long) value).doubleValue();
    } else if (value == null) {
        retVal = 0d;
    }
    return retVal;
}
SNIPPET 2:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`c.sum(valueSq))`' given input columns: [b.sum(value), d.cat, a.count, c.cat, c.sum(valueSq), b.cat, d.avg(value), a.cat];;
'Project [cat#51, count#74L, sum(value)#70, 'c.sum(valueSq)), 'avg(value))]
+ AnalysisBarrier
+ Join Inner, (cat#51 = cat#175)
: Join Inner, (cat#51 = cat#154)
: : Join Inner, (cat#51 = cat#139)
: : : SubqueryAlias a
: : : + Aggregate [cat#51], [cat#51, count(1) AS count#74L]
: : : + Project [cat#51, value#52, cast((value#52 * value#52) as double) AS valueSq#56, ((value#52 - avg#55) * (value#52 - avg#55)) AS diffSq#57]
: : : + Filter (cat#51 = cat#59)
: : : + Join Inner
: : : : SubqueryAlias A
: : : : + SubqueryAlias anovabasedf
: : : : + Project [usercode#10 AS cat#51, cast(frequency#0L as double) AS value#52]
: : : : + SubqueryAlias outliersdf
: : : : + Filter ((cast(frequency#0L as double) >= 718.5) && (cast(frequency#0L as double) <= 1413.5))
: : : : + Project [flowId#6, StateId#9, usercode#10, frequency#0L]
: : : : + Filter (frequency#0L > cast(30 as bigint))
: : : : + SubqueryAlias T
: : : : + SubqueryAlias basedf
: : : : + Project [flowId#6, StateId#9, usercode#10, frequency#0L]
: : : : + Sort [flowId#6 ASC NULLS FIRST, StateId#9 ASC NULLS FIRST, usercode#10 ASC NULLS FIRST], true
: : : : + Aggregate [flowId#6, StateId#9, usercode#10], [flowId#6, StateId#9, usercode#10, count(instanceuserid#25) AS frequency#0L]