R data.table: create the average value over the five previous years
I have data with a variable for which I want the difference between the current level and the average of that same variable, for the same month, over the 5 previous years.
library(tidyverse)
library(data.table)
library(lubridate)
MWE <- as.data.table(ggplot2::economics) %>%
.[,c("pce","psavert","uempmed","unemploy"):=NULL]
> MWE
date pop
1: 1967-07-01 198712.0
2: 1967-08-01 198911.0
3: 1967-09-01 199113.0
4: 1967-10-01 199311.0
5: 1967-11-01 199498.0
---
570: 2014-12-01 319746.2
571: 2015-01-01 319928.6
572: 2015-02-01 320074.5
573: 2015-03-01 320230.8
574: 2015-04-01 320402.3
I can do it by month, but I have trouble incorporating a reference to the current row in order to do something like year(date) < year(currentline) & year(date) >= year(currentline)-6
MWE_2 <- MWE[,MeanPastYears:=mean(pop),by=month(date)]
My desired output would be
date pop avg_5yrs
1: 1967-07-01 198712.0 NA
2: 1967-08-01 198911.0 NA
3: 1967-09-01 199113.0 NA
4: 1967-10-01 199311.0 NA
5: 1967-11-01 199498.0 NA
---
570: 2014-12-01 319746.2 313013.8
571: 2015-01-01 319928.6 313192.1
572: 2015-02-01 320074.5 313350.7
573: 2015-03-01 320230.8 313511.2
574: 2015-04-01 320402.3 313640.3
1 answer
-
answered 2020-11-24 12:38
Abdessabour Mtk
The columns inside [ can be indexed as vectors, so for each row we first create a logical vector
year(date) < year(date[..I]) & year(date) >= year(date[..I]) - 6
that is TRUE when the date falls in the interval, and then take the mean of pop by month (df here is the table called MWE in the question):
df[, year := year(date)
   ][, avg_5yrs := sapply(1:.N, function(..I) mean(pop[year < year[..I] & year >= year[..I] - 6])), by = month(date)
   ][, year := NULL][]
           date      pop avg_5yrs
  1: 1967-07-01 198712.0      NaN
  2: 1967-08-01 198911.0      NaN
  3: 1967-09-01 199113.0      NaN
  4: 1967-10-01 199311.0      NaN
  5: 1967-11-01 199498.0      NaN
 ---
570: 2014-12-01 319746.2 311845.5
571: 2015-01-01 319928.6 312028.1
572: 2015-02-01 320074.5 312192.6
573: 2015-03-01 320230.8 312357.4
574: 2015-04-01 320402.3 312498.1
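A note on the window bounds: the condition year < y & year >= y - 6 admits six calendar years (y-6 through y-1), whereas a bound of y - 5 admits exactly five. The following is a minimal sketch on toy data, not the answerer's code, that makes the window length a parameter and also computes the difference asked for in the question. The names dt, n_years, avg_roll and diff_vs_avg are hypothetical, and the rolling variant assumes exactly one row per month with no gaps and rows sorted by date.

```r
library(data.table)

# Toy monthly series (hypothetical data standing in for MWE):
# 8 years x 12 months, pop = 1..96 so the means are easy to check by hand.
dt <- data.table(
  date = seq(as.Date("2000-01-01"), by = "month", length.out = 96),
  pop  = as.numeric(1:96)
)

n_years <- 5L  # assumption: the window is calendar years y-5 .. y-1

dt[, yr := year(date)]

# Same-month mean over the previous n_years calendar years.
# Rows with no earlier year at all get NA instead of NaN.
dt[, avg_5yrs := sapply(seq_len(.N), function(i) {
  in_window <- yr < yr[i] & yr >= yr[i] - n_years
  if (any(in_window)) mean(pop[in_window]) else NA_real_
}), by = month(date)]

# Difference between the current level and that average.
dt[, diff_vs_avg := pop - avg_5yrs]

# Alternative under the no-gaps assumption: lag by one year within each
# month, then take a rolling mean over n_years values.
dt[, avg_roll := frollmean(shift(pop), n_years), by = month(date)]

dt[, yr := NULL]
```

The two variants differ at the start of the series: the sapply version averages over whatever earlier years exist (fewer than five at first), while the rolling version stays NA until five full previous years are available.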
See also questions close to this topic
-
How to plot Argentina map with ggplot2 and add dots (according to latitude and longitude) on the map
I cannot find a way to plot a map of Argentina with the limits between provinces/states so that I can add special dots (according to lat/lon values). I want to do it with ggplot2.
-
R - Adjust graph axis with cowplot
I want to plot some results of a time-series analysis with ggplot, plotting the variable and its predictions on the same graph, while having the error plotted on a graph below (similar to how plot.ts work, but I cannot use this library).
Here's my code :
library(ggplot2)
library(cowplot)
set.seed(1111)
my_df = data.frame(
  date = 1:150,
  initial = c(runif(100, max=100), rep(NA, 50)),
  predicted_intra = c(rep(50, 100), rep(NA, 50)),
  predicted_extra = c(rep(NA, 100), 51:100),
  err_intra = c(runif(100, max=100), rep(NA, 50))
)
my_colors = c("Init" = "grey30", "Predict" = "red")
p1 <- ggplot(my_df) +
  aes(x = date, y = predicted_intra, color="Predict") +
  geom_line() +
  geom_line(aes(y = predicted_extra, color="Predict")) +
  geom_line(aes(y = initial, color="Init")) +
  scale_color_manual(name = "", values = my_colors) +
  ylab("Numberz")
p2 <- ggplot(my_df) +
  aes(x = date, y = err_intra) +
  geom_line(color="red") +
  ylab("Error")
plot_grid(p1, p2, nrow=2, rel_heights = c(2,1))
Which gives : The result of the code above
All is well until the legend shows. Is there a way to align the "date" axis of the two graphs?
-
Problems plotting log-likelihood-function with ggplot2
I'm currently trying to plot a log-likelihood-function using ggplot2; the function is defined by
y <- rpois(100, lambda = 3)
f_1 <- function(z) -100*z + sum(log(1/factorial(y)*z^y))
When trying to calculate values of f_1, everything works fine (e.g. f_1(1) = -316.1308)
But when I try to plot f_1 using ggplot2, an error pops up:
p <- ggplot(data = data.frame(z = 0), mapping = aes(z=z))
p <- p + stat_function(fun = f_1)
error: "longer object length is not a multiple of shorter object length".
How can I fix this error? Thanks
-
Java: SimpleDateFormat("yyyy-MMM-dd").parse(dateString) throws Exception (unparsable date: "Wed Dec 16 23:48:11 AEDT 2020" )
I am trying to parse (dow mon dd hh:mm:ss zzz yyyy) format to (yyyy-MM-dd) using java SimpleDateFormat
Date dt = new SimpleDateFormat("yyyy-MM-dd").parse("Wed Dec 16 23:48:11 AEDT 2020");
Exception: unparsable date: "Wed Dec 16 23:48:11 AEDT 2020"
Is there any other way to extract (yyyy-MM-dd)
-
How do I reward the user daily?
I want to reward the user daily in my app, but when the app closes the timer always stops
I've tried doing this...
Timer.scheduledTimer(withTimeInterval: 86400, repeats: true) { _ in
    self.points += 100
}
Can you please tell me how to reward the user daily once?
-
Snowflake how to obtain cumulative date values
Can someone help me with obtaining the result for below logic. I have a table with below columns.
TYPE  SRC_CURR  TAR_CURR  EX_RATE  EX_RATE_START_DATE
M     GBP       USD       1.36687  2/1/2021
M     GBP       USD       1.33636  1/1/2021
M     GBP       USD       1.32837  12/1/2020
M     GBP       USD       1.30242  11/1/2020
M     GBP       USD       1.27421  10/1/2020
M     GBP       USD       1.31527  9/1/2020
ZEU   GBP       USD       1.3654   1/20/2021
ZEU   GBP       USD       1.363    1/19/2021
ZEU   GBP       USD       1.3587   1/18/2021
ZEU   GBP       USD       1.359    1/15/2021
ZEU   GBP       USD       1.3689   1/14/2021
ZEU   GBP       USD       1.3639   1/13/2021
ZEU   GBP       USD       1.3664   1/12/2021
ZEU   GBP       USD       1.3518   1/11/2021
ZEU   GBP       USD       1.3568   1/8/2021
So I need to form a new column, EX_RATE_END_DATE, from the above values as shown below. The requirement is that EX_RATE_END_DATE defaults to 9999-12-31 for the latest start date, and for the rest of the records it should be the next later start date minus 1 day.
Please find below the output required,
TYPE  SRC_CURR  TAR_CURR  EX_RATE  EX_RATE_START_DATE  EX_RATE_END_DATE
M     GBP       USD       1.36687  2/1/2021            12/31/9999
M     GBP       USD       1.33636  1/1/2021            1/31/2021
M     GBP       USD       1.32837  12/1/2020           12/31/2020
M     GBP       USD       1.30242  11/1/2020           11/30/2020
M     GBP       USD       1.27421  10/1/2020           10/31/2020
M     GBP       USD       1.31527  9/1/2020            9/30/2020
ZEU   GBP       USD       1.3654   1/20/2021           12/31/9999
ZEU   GBP       USD       1.363    1/19/2021           1/19/2021
ZEU   GBP       USD       1.3587   1/18/2021           1/18/2021
ZEU   GBP       USD       1.359    1/15/2021           1/17/2021
ZEU   GBP       USD       1.3689   1/14/2021           1/14/2021
ZEU   GBP       USD       1.3639   1/13/2021           1/13/2021
ZEU   GBP       USD       1.3664   1/12/2021           1/12/2021
ZEU   GBP       USD       1.3518   1/11/2021           1/11/2021
ZEU   GBP       USD       1.3568   1/8/2021            1/10/2021
It would be great if someone help me with getting the desired result set by any possible ways in snowflake.
-
data.table fifelse giving wrong warning?
I found a difference in warnings when using fifelse from the data.table library:
set.seed(123)
df <- data.table(ID = rep(1:10,each = 2),x = sample(c(1,NA),20,replace = T))
test1 <- df[,fifelse(any(!is.na(x)),max(x,na.rm = T),as.numeric(NA)),by = ID]
produces a warning:
Warning messages
1: In max(x, na.rm = T) : no non-missing arguments to max; returning -Inf
2: In max(x, na.rm = T) : no non-missing arguments to max; returning -Inf
while:
test2 <- df[,ifelse(any(!is.na(x)),max(x,na.rm = T),as.numeric(NA)),by = ID]
doesn't. And the two results are identical:
identical(test1,test2)
[1] TRUE
And there is no -Inf in the result. What does this mean ?
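One possible reading of this behavior, offered as a sketch rather than a definitive answer: fifelse() evaluates both its yes and no arguments before looking at test, so max(x, na.rm = TRUE) runs (and warns) even for an all-NA group, and the resulting -Inf is then discarded because test is FALSE. Base ifelse() only evaluates yes when some element of test is TRUE. The helper count_warnings below is hypothetical and exists only to make the difference observable.

```r
library(data.table)

# Hypothetical helper: evaluate an expression and count the warnings it raises.
count_warnings <- function(expr) {
  n <- 0L
  withCallingHandlers(
    expr,
    warning = function(w) {
      n <<- n + 1L
      invokeRestart("muffleWarning")
    }
  )
  n
}

x <- c(NA_real_, NA_real_)  # a group where every value is missing

# fifelse: `yes` is evaluated even though `test` is FALSE, so max() warns.
w_fifelse <- count_warnings(
  fifelse(any(!is.na(x)), max(x, na.rm = TRUE), NA_real_)
)

# ifelse: `yes` is never evaluated when `test` is FALSE, so no warning.
w_ifelse <- count_warnings(
  ifelse(any(!is.na(x)), max(x, na.rm = TRUE), NA_real_)
)

# Both calls return NA for this group; the -Inf never reaches the result.
res_fifelse <- suppressWarnings(
  fifelse(any(!is.na(x)), max(x, na.rm = TRUE), NA_real_)
)
```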
-
How to speed up forloop grep in a large dataframe using R
Please I need help.
I have a script that works well for many dataframes, even if it takes several hours (on the cluster: > 100 GB memory). For some large dataframes (> 3 million rows) the for loop doesn't finish even after two days of running. So I need help: is there a way to speed up the for loop, or to replace the script with faster functions in R?
This is a short description of my script/data:
snp1 <- c("R0100004", "R0100009", "R0100044", "R0100061", "R0100066","R0100067")
snp2 <- c("R0100039", "R0100152", "R0100066", "R0100067", "R0100068", "R0100082")
blocks <- c("R0100004|R0100009|R0100190|R0100015|R0100016|R0100017|R0100018|R0100021|R0100022|R0100024|R0100025",
            "R0100039|R0100038|R0100037|R0100036|R0100043|R0100044",
            "R0100220|R0100052|R0100053|R0100054|R0100055|R0100057|R0100058|R0100059",
            "R0100061|R0100066|R0100067",
            "R0100068|R0100069|R0100071|R0100072|R0100073|R0100074|R0133440|R0100076|R0100077|R0100078",
            "R0100079|R0100081|R0100082")
# This is my forloop
I <- length(snp1) # 3000000
res1 <- list()
res2 <- list()
for(j in 1:I){
  myres1 <- list(grep(snp1[j], blocks, value=T))
  myres2 <- list(grep(snp2[j], blocks, value=T))
  res1[j] <- myres1
  res2[j] <- myres2
}
How can I replace or speed up this for loop to work with large dataframes (> 3000000 rows)
Thanks in advance.
-
change variable class, without dropping labels
I have a df that looks something like this
library(Hmisc)
library(data.table)
df <- structure(list(
  id1 = structure(c(108791, 154542, 32742, 51033, 123998, 165156, 159221, 51806, 82668, 94864), label = "label 1", format.stata = "%12.0g"),
  id2 = structure(c(372925, 14792, 24970, 24970, 24970, 24970, 324930, 14792, 14792, 23284), label = "label 2", format.stata = "%12.0g")),
  row.names = c(NA,-10L), class = c("tbl_df", "tbl", "data.frame"))
contents(tst)
# Labels Storage
# id1 label 1 double
# id2 label 2 double
It consists of multiple variables that are labelled. I am going through some data management using data.table. My issue is that every time I do some manipulations, my variables lose their labels. Take the following example: I would like to change the variable class from numeric to character using the following approach:
idvars <- c("id1","id2")
df_final <- setDT(df)[,(idvars):=lapply(.SD,function(x) as.character(x)),.SDcols=idvars]
While df_final correctly changes variable class, the resulting variables are no longer labelled.
contents(df_final)
# Storage
# id1 character
# id2 character
Does anyone know how I can continue to use data table to do data management while keeping the variables labels?
thanks
-
Convert columns of a data frame to a list in R
I want to convert the columns in a data frame to a list. The format of the data frame is as follows:
   H1.time H1.response E9.time E9.response F12.time F12.response
1:     0.0  0.00000000     0.0  0.00000000      0.0   0.00000000
2:     0.2  0.00142469     0.2  0.00826733      0.2   0.00703381
3:     0.4 -0.00418229     0.4  0.01416873      0.4   0.00863728
4:     0.6  0.00361758     0.6  0.00845066      0.6   0.00739067
5:     0.8  0.00281592     0.8  0.01258872      0.8   0.00786157
6:     1.0 -0.00293035     1.0  0.01097368      1.0   0.00679848
H1, E9, and F12 are the file names, and I need to convert them into a list, i.e., each file will be one element of the list, and for each element, it is a data frame, with time and response as the column names.
Thank you for your help.
-
Manipulate data in SQL (backfilling, pivoting)
I have a table similar to this small example:
I want to manipulate it to this format:
Here's a sample SQL script to create an example input table:
CREATE TABLE sample_table ( id INT, hr INT, tm DATETIME, score INT, )
INSERT INTO sample_table VALUES
(1, 0, '2021-01-21 00:26:45', 2765),
(1, 0, '2021-01-21 00:49:00', 2765),
(1, 5, '2021-01-21 07:47:03', 1593),
(1, 7, '2021-01-21 11:50:48', 1604),
(1, 7, '2021-01-21 12:00:32', 1604),
(2, 0, '2021-01-21 00:50:45', 3500),
(2, 2, '2021-01-21 01:49:00', 2897),
(2, 2, '2021-01-21 05:47:03', 2897),
(2, 4, '2021-01-21 09:30:48', 2400),
(2, 6, '2021-01-21 12:00:32', 1647);
I tried using a combination of LAG and CASE WHEN, without success so far. I am looking for ideas on how to do the manipulation (which functions to use, etc.). It would be great to see an example script.
Where there are multiple values per id & hr, the earliest value should be used. E.g. for id=1 & hr=7, hr_7 uses the value from 11:50. Although in this example both records have the same value, they can differ.
-
R: changing plot colors
I made the following plot in R :
library(MASS)
a = rnorm(100, 10, 10)
b = rnorm(100, 10, 5)
c = rnorm(100, 5, 10)
group <- sample( LETTERS[1:4], 100, replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25) )
d = data.frame(a, b, c, group)
d$group = as.factor(d$group)
parcoord(d[, c(3, 1, 2)], col = 1 + (0:149) %/% 50)
title(main = "Plot", xlab = "Variable", ylab = "Values")
axis(side = 2, at = seq(0, 5, 0.1), tick = TRUE, las = 1)
Can someone please show me how to color each line in this plot according to the value of "d$group" and add a legend in the corner of the screen?
I think the following line of code changes the color of lines per value of "d$group":
parcoord(d[, c(3, 1, 2)], col = d$group)
And this line of code creates a legend:
legend( "topleft", c("A", "B", "C", "D"), text.col=c("blue", "red", "yellow", "green") )
title("Legend", cex.main = 1.1)
But I am not sure how to have the colors from the legend match the real colors. I tried:
legend( "topleft", c("A", "B", "C", "D"), text.col= d$group)
But this did not work. Can someone please show me how to fix this?
Thanks
EDIT: is there a way to run the code (provided in the answer) if variable "C" is a factor?
e.g.
a = rnorm(100, 10, 10)
b = rnorm(100, 10, 5)
c <- sample( LETTERS[1:2], 100, replace=TRUE, prob=c(0.5,0.5) )
group <- sample( LETTERS[1:4], 100, replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25) )
d = data.frame(a, b, c, group)
d$group = as.factor(d$group)
d$c = as.factor(d$c)
library(tidyverse)
library(ggplot2)
library(dplyr)
d %>%
  mutate(rn = row_number()) %>%
  pivot_longer(a:c) %>%
  ggplot(aes(name, value, group = rn, color = group)) +
  scale_x_discrete(expand = expansion(0, 0)) +
  geom_line()
-
Simulate multiple dates between two dates
library(lubridate)
date1<-ymd("2021/01/01")
date2<-ymd("2021/01/31")
How can I simulate multiple dates between "2021-01-01" and "2021-01-31", for example ten dates like this:
[1] "2021-01-21" "2021-01-07" "2021-01-09" "2021-01-18" "2021-01-02" "2021-01-13" "2021-01-24" "2021-01-30" "2021-01-11" "2021-01-25"
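One way this could be approached, as a minimal sketch: build the full sequence of days between the two endpoints and sample from it. The names all_days and sim are hypothetical, the seed is arbitrary, and base as.Date is used here in place of ymd so the snippet runs without lubridate.

```r
set.seed(42)  # arbitrary seed so the draw is reproducible

date1 <- as.Date("2021-01-01")
date2 <- as.Date("2021-01-31")

# Every calendar day in the closed interval [date1, date2].
all_days <- seq(date1, date2, by = "day")

# Ten distinct dates; use replace = TRUE if repeats are acceptable.
sim <- sample(all_days, size = 10, replace = FALSE)
sim
```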
-
Is there a function (preferably in lubridate) to get the number of weeks in any given year?
I can probably hard code a lookup table and reference from that but I am wondering if there is a handy function in any R package that can return the number of weeks in a given year. ISO 8601 calendar is needed.
lubridate has the function isoweek, which returns the week number for a given date:
lubridate::isoweek("2020-12-31")
lubridate::isoweek("2021-12-31")
However, what I need is something like
isoweeks(2020), which would return 53
isoweeks(2021), which would return 52
-
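A possible sketch of such a helper, relying on the ISO 8601 property that 28 December always falls in the last ISO week of its year. The function isoweeks is the hypothetical name from the question, not an existing lubridate function.

```r
library(lubridate)

# Hypothetical helper: number of ISO 8601 weeks in a given year.
# 28 December always belongs to the final ISO week of its own year,
# so its ISO week number equals the week count for that year.
isoweeks <- function(year) {
  isoweek(as.Date(paste0(year, "-12-28")))
}

isoweeks(2020)  # 53
isoweeks(2021)  # 52
```

Because paste0 and isoweek are vectorized, isoweeks(2015:2021) should also work for a range of years.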
How do I calculate the shortest and longest time intervals in R?
I have the following dataset called phone:
TIMESTAMP                time      date
2021-01-12 10:42:50.221  10:42:50  2021-01-12
2021-01-12 10:46:01.826  10:46:01  2021-01-12
2021-01-12 10:50:10.063  10:50:10  2021-01-12
2021-01-12 10:53:10.715  10:53:10  2021-01-12
2021-01-12 10:53:14.329  10:53:14  2021-01-12
2021-01-12 10:54:19.792  10:54:19  2021-01-12
2021-01-12 11:01:43.044  11:01:43  2021-01-12
2021-01-12 11:04:36.202  11:04:36  2021-01-12
I would like to calculate the time intervals between two consecutive time values for the entire dataset so that I can find the shortest and longest time intervals of the day. I've tried the following code to calculate the time differences, but it gives me negative values and I don't know how accurate it is. I've tried changing the unit to minutes as well, but the output doesn't make sense to me.
v2 <- ymd_hms(phone$TIMESTAMP)
v1 <- difftime(v2[-length(v2)], v2[-1], unit = "hour")
Output of v1:
x
-0.0532236 hours
-0.0689547 hours
-0.0501811 hours
-0.0010039 hours
-0.0181842 hours
How can I calculate the intervals and arrange them in descending order of the length?
dput of my data is as follows:
structure(list(TIMESTAMP = c("2021-01-12 10:42:50.221", "2021-01-12 10:46:01.826", "2021-01-12 10:50:10.063", "2021-01-12 10:53:10.715", "2021-01-12 10:53:14.329", "2021-01-12 10:54:19.792", "2021-01-12 11:01:43.044", "2021-01-12 11:04:36.202", "2021-01-12 11:07:36.636", "2021-01-12 11:18:59.169", "2021-01-12 11:25:44.954", "2021-01-12 11:25:54.263", "2021-01-12 11:26:25.414", "2021-01-12 11:28:05.471", "2021-01-12 11:30:24.349"), time = c("10:42:50", "10:46:01", "10:50:10", "10:53:10", "10:53:14", "10:54:19", "11:01:43", "11:04:36", "11:07:36", "11:18:59", "11:25:44", "11:25:54", "11:26:25", "11:28:05", "11:30:24"), date = structure(c(18639, 18639, 18639, 18639, 18639, 18639, 18639, 18639, 18639, 18639, 18639, 18639, 18639, 18639, 18639), class = "Date")), row.names = c(NA, 15L), class = "data.frame")
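The negative values arise because difftime(v2[-length(v2)], v2[-1]) subtracts each later timestamp from the earlier one. A minimal sketch with the operands swapped, using only the first five timestamps from the data above and base as.POSIXct in place of ymd_hms; the names ts, gaps, gaps_sorted, shortest and longest are hypothetical.

```r
# First five timestamps from the question's data.
ts <- c("2021-01-12 10:42:50.221", "2021-01-12 10:46:01.826",
        "2021-01-12 10:50:10.063", "2021-01-12 10:53:10.715",
        "2021-01-12 10:53:14.329")

# %OS keeps the fractional seconds; the time zone is an assumption.
v2 <- as.POSIXct(ts, format = "%Y-%m-%d %H:%M:%OS", tz = "UTC")

# Later minus earlier, so every gap is positive.
gaps <- difftime(v2[-1], v2[-length(v2)], units = "secs")

# Intervals in descending order of length, plus the two extremes.
gaps_sorted <- sort(gaps, decreasing = TRUE)
shortest <- min(gaps)
longest  <- max(gaps)
```

Seconds are used here because several of the gaps are well under a minute; units = "mins" would work the same way.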