Check if two values within consecutive dates are identical
Let's say I have a tibble like
library(tibble)

df <- tribble(
~date, ~place, ~wthr,
#------------/-----/--------
"2017-05-06","NY","sun",
"2017-05-06","CA","cloud",
"2017-05-07","NY","sun",
"2017-05-07","CA","rain",
"2017-05-08","NY","cloud",
"2017-05-08","CA","rain",
"2017-05-09","NY","cloud",
"2017-05-09","CA",NA,
"2017-05-10","NY","cloud",
"2017-05-10","CA","rain"
)
I want to check whether the weather in a given place on a given day was the same as on the previous day, and attach that logical column to df, so that
tribble(
~date, ~place, ~wthr, ~same,
#------------/-----/------/------
"2017-05-06","NY","sun", NA,
"2017-05-06","CA","cloud", NA,
"2017-05-07","NY","sun", TRUE,
"2017-05-07","CA","rain", FALSE,
"2017-05-08","NY","cloud", FALSE,
"2017-05-08","CA","rain", TRUE,
"2017-05-09","NY","cloud", TRUE,
"2017-05-09","CA", NA, NA,
"2017-05-10","NY","cloud", TRUE,
"2017-05-10","CA","rain", NA
)
Is there a good way to do this?
1 answer
-
answered 2020-11-23 18:23
Ben
To get a logical column, check whether the wthr value is equal to the previous row's value with lag() after grouping by place. I added arrange(date) to make sure the rows are in chronological order.

library(dplyr)

df %>%
  arrange(date) %>%
  group_by(place) %>%
  mutate(same = wthr == lag(wthr, default = NA))
Edit: If you want to make sure the dates are consecutive (1 day apart), you can include an ifelse to check whether the difference between date and lag(date) is 1. If the rows are not 1 day apart, the result is coded as NA.

Note: Also, make sure your date column is a Date:

df$date <- as.Date(df$date)

df %>%
  arrange(date) %>%
  group_by(place) %>%
  mutate(same = ifelse(date - lag(date) == 1, wthr == lag(wthr, default = NA), NA))
Output
   date       place wthr  same
   <chr>      <chr> <chr> <lgl>
 1 2017-05-06 NY    sun   NA
 2 2017-05-06 CA    cloud NA
 3 2017-05-07 NY    sun   TRUE
 4 2017-05-07 CA    rain  FALSE
 5 2017-05-08 NY    cloud FALSE
 6 2017-05-08 CA    rain  TRUE
 7 2017-05-09 NY    cloud TRUE
 8 2017-05-09 CA    NA    NA
 9 2017-05-10 NY    cloud TRUE
10 2017-05-10 CA    rain  NA
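As a small variant (not part of the original answer), dplyr's if_else() is stricter about types than base ifelse() and keeps the logical result; a minimal sketch against the same df:

library(dplyr)

df %>%
  mutate(date = as.Date(date)) %>%
  arrange(date) %>%
  group_by(place) %>%
  mutate(same = if_else(date - lag(date) == 1, wthr == lag(wthr), NA))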
See also questions close to this topic
-
How to plot Argentina map with ggplot2 and add dots (according to latitude and longitude) on the map
I am NOT finding a way to plot a map of Argentina with the limits between provinces/states so that I can add specific dots (according to lat/lon values). I want to do it with ggplot2.
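A minimal sketch of one common approach (not from the question), assuming the rnaturalearth package for the province boundaries and a hypothetical pts data frame with lon/lat columns:

library(ggplot2)
library(rnaturalearth)
library(sf)

# Province-level boundaries for Argentina as an sf object
# (may prompt to install the rnaturalearthhires data package)
arg <- ne_states(country = "argentina", returnclass = "sf")

# Hypothetical points to overlay
pts <- data.frame(lon = c(-58.4, -68.3), lat = c(-34.6, -54.8))

ggplot() +
  geom_sf(data = arg, fill = "grey95", color = "grey40") +
  geom_point(data = pts, aes(x = lon, y = lat), color = "red") +
  theme_minimal()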
-
R - Adjust graph axis with cowplot
I want to plot some results of a time-series analysis with ggplot, plotting the variable and its predictions on the same graph, while having the error plotted on a graph below (similar to how plot.ts work, but I cannot use this library).
Here's my code:

library(ggplot2)
library(cowplot)

set.seed(1111)
my_df = data.frame(
  date = 1:150,
  initial = c(runif(100, max=100), rep(NA, 50)),
  predicted_intra = c(rep(50, 100), rep(NA, 50)),
  predicted_extra = c(rep(NA, 100), 51:100),
  err_intra = c(runif(100, max=100), rep(NA, 50))
)

my_colors = c("Init" = "grey30", "Predict" = "red")

p1 <- ggplot(my_df) +
  aes(x = date, y = predicted_intra, color="Predict") +
  geom_line() +
  geom_line(aes(y = predicted_extra, color="Predict")) +
  geom_line(aes(y = initial, color="Init")) +
  scale_color_manual(name = "", values = my_colors) +
  ylab("Numberz")

p2 <- ggplot(my_df) +
  aes(x = date, y = err_intra) +
  geom_line(color="red") +
  ylab("Error")

plot_grid(p1, p2, nrow=2, rel_heights = c(2,1))
Which gives: [image: the result of the code above]
All is well until the legend shows. Is there a way to align the "date" axis of the two graphs?
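A minimal sketch of one way to handle this (an assumption, not from the question): cowplot::plot_grid() can align the panel areas directly through its align and axis arguments, which compensates for the space the legend takes:

library(cowplot)

plot_grid(p1, p2, nrow = 2, rel_heights = c(2, 1),
          align = "v", axis = "lr")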
-
Problems plotting log-likelihood-function with ggplot2
I'm currently trying to plot a log-likelihood-function using ggplot2; the function is defined by
y <- rpois(100, lambda = 3)
f_1 <- function(z) -100*z + sum(log(1/factorial(y)*z^y))
When trying to calculate values of f_1, everything works fine (e.g. f_1(1) = -316.1308)
But when I try to plot f_1 using ggplot2, an error pops up:
p <- ggplot(data = data.frame(z = 0), mapping = aes(z=z))
p <- p + stat_function(fun = f_1)
error: "longer object length is not a multiple of shorter object length".
How can I fix this error? Thanks
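A sketch of one likely fix (my reading, not confirmed): stat_function() calls the function with a whole vector of z values, but sum() collapses everything to a single number, so the lengths no longer match. Vectorizing the function evaluates it one z at a time:

library(ggplot2)

f_1_vec <- Vectorize(f_1)

ggplot(data.frame(z = seq(1, 6, by = 0.1)), aes(x = z)) +
  stat_function(fun = f_1_vec)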
-
Python Pandas change values from a condition
How are you? I'm pretty new to coding, and I have this question:
I want to iterate through a column and change its values based on a condition. In this case, I want to change every value in column 'a1': if the value contains the word 'Juancito', I want to change it to just 'Juancito'. The for loop runs OK, but the values don't change in the end.
What am I doing wrong?
import pandas as pd

inp = [{'a1': 'Juancito 1'}, {'a1': 'Juancito 2'}, {'a1': 'Juancito 3'}]
df = pd.DataFrame(inp)

for i in df['a1']:
    if 'Juancito' in i:
        i = 'Juancito'
    else:
        pass

df.head()
-
How to iterate over rows of a Frame in DataTable
With pandas I usually iterate over the rows of a DataFrame with itertuples or iterrows. How can I do this kind of iteration on a Frame from Python DataTable?

Example of the pandas iteration that I need:
for row in df_.itertuples(): print(row)
-
How to transform a list of dictionaries, containing nested lists into a pandas df
I have a list of dicts:
list_of_dicts = [{'name': 'a', 'counts': [{'dog': 2}]},
                 {'name': 'b', 'counts': [{'cat': 1}, {'capibara': 5}, {'whale': 10}]},
                 {'name': 'c', 'counts': [{'horse': 1}, {'cat': 1}]}]
I would like to transform this into a pandas dataframe like so:
Name  Animal    Frequency
a     dog       2
b     cat       1
b     capibara  5
b     whale     10
c     horse     1
c     cat       1

In the current code, I try to normalize it:
from pandas import json_normalize

df = json_normalize(list_of_dicts, 'counts')
But I think I am going in the wrong direction. Also, if I do a simple df = pd.DataFrame(list_of_dicts), it results in each list of dicts being a single row value, which is not desired.
-
Finding the differences of paired-columns using dplyr
set.seed(3)
library(dplyr)

dat <- tibble(Measure = c("Height","Weight","Width","Length"),
              AD1_1 = rpois(4,10), AD1_2 = rpois(4,9),
              AD2_1 = rpois(4,10), AD2_2 = rpois(4,9),
              AD3_1 = rpois(4,10), AD3_2 = rpois(4,9),
              AD4_1 = rpois(4,10), AD4_2 = rpois(4,9),
              AD5_1 = rpois(4,10), AD5_2 = rpois(4,9),
              AD6_1 = rpois(4,10), AD6_2 = rpois(4,9))
Suppose I have data that looks like this. I wish to calculate the difference for each AD pair (matched by the underscored number), i.e., AD1diff, AD2diff, AD3diff.
Instead of writing
dat %>%
  mutate(AD1diff = AD1_1 - AD1_2,
         AD2diff = AD2_1 - AD2_2,
         ...)
what would be an efficient way to write this?
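A minimal sketch of one tidyr-based approach (an assumption on my part, not from the question): reshape the paired columns to long form, take the difference within each pair, and join the result back.

library(dplyr)
library(tidyr)

diffs <- dat %>%
  pivot_longer(-Measure, names_to = c("pair", "rep"), names_sep = "_") %>%
  pivot_wider(names_from = rep, values_from = value) %>%
  mutate(diff = `1` - `2`) %>%
  select(Measure, pair, diff) %>%
  pivot_wider(names_from = pair, values_from = diff, names_glue = "{pair}diff")

left_join(dat, diffs, by = "Measure")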
-
How to get the top n% and bottom n% of a data frame in R
Here is my data:
dat <- read.table(text = "id val1 val2 vt
1 14 12 19
2 13 13 12
3 12 12 13
4 12 13 13
5 12 14 22
6 12 12 14
7 12 13 14
8 12 14 12
9 13 13 14
10 13 14 14
11 14 14 14
12 13 14 17
13 13 14 31
14 13 13 14
15 13 14 13
16 13 14 23
", header = TRUE)
I want to get the top 25% and the bottom 45% according to vt.
Here is the output for the top 25%:

id val1 val2 vt
13   13   14 31
16   13   14 23
 5   12   14 22
 1   14   12 19

and the bottom 45% is:

id val1 val2 vt
 7   12   13 14
 9   13   13 14
10   13   14 14
11   14   14 14
14   13   13 14
 3   12   12 13
 4   12   13 13
15   13   14 13
 2   13   13 12
 8   12   14 12

I have tried subset() with quantile(), but it does not seem to work for the bottom n%. Is it possible to do this with dplyr? I have checked other links, but they do not cover the bottom n%. In addition, I do not want to get them by any group.
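A minimal sketch of one dplyr approach (not from the question): slice_max() and slice_min() accept a prop argument, so the two subsets can be taken directly:

library(dplyr)

dat %>% slice_max(vt, prop = 0.25)   # rows with the highest 25% of vt
dat %>% slice_min(vt, prop = 0.45)   # rows with the lowest 45% of vt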
-
Why does `filter` crash with an input length error in my shiny app?
I am pretty new to programming, but I have to make a Shiny app for a university course.
As you can see, I web-scraped a data table that presents different bike geometries, and I wanted to create a Shiny app where I can compare the geometries with each other. I am quite happy with my progress, but now I have the problem that it always shows me this error:
"Error in : Problem with `filter()` input `..1`. x Input `..1` must be of size 19 or 1, not size 0. i Input `..1` is `!=...`."
I want it to be possible in the app to choose one bike, and have it automatically compare the bikes and show me the 10 most similar ones.
# table
Canyon <- read_html("https://enduro-mtb.com/canyon-strive-cfr-9-0-ltd-test-2020/")
Rose <- read_html("https://enduro-mtb.com/rose-root-miller-2020-test/")
Ghost <- read_html("https://enduro-mtb.com/ghost-riot-enduro-2021-erster-test/")
Cube <- read_html("https://enduro-mtb.com/cube-stereo-170-sl-29-test-2020/")

Comparison <- tibble(
  Geometry = Canyon %>% html_nodes(".geometry strong") %>% html_text() %>% str_trim(),
  CanyonStrive = Canyon %>% html_nodes("td:nth-child(3)") %>% html_text() %>% str_trim(),
  GhostRiot = Ghost %>% html_nodes("td:nth-child(3)") %>% html_text() %>% str_trim(),
  CubeStereo = Cube %>% html_nodes("td:nth-child(3)") %>% html_text() %>% str_trim(),
  RoseRootMiller = Rose %>% html_nodes("td:nth-child(3)") %>% html_text() %>% str_trim(),
)

ComparisonTable <- Comparison %>%
  mutate_all(~gsub("mm|°|-.*|/.*|\\.", "", .)) %>%
  mutate_all(~gsub(",", ".", .)) %>%
  mutate_all(type.convert, as.is = TRUE) %>%
  gather("Bikes", "value", 2:ncol(Comparison)) %>%
  spread(Geometry, value)

Art <- c("Enduro", "Enduro", "AllMountain", "Enduro")
ComparisonTableHallo <- ComparisonTable
ComparisonTableHallo$Art <- Art

# server
server <- function(input, output, session) {

  selectedData1 <- reactive({
    ComparisonTableHallo %>%
      filter(ComparisonTableHallo$Bikes != gsub("[[:space:]]*$", "", gsub("- .*", '', input$Bikes)))
  })

  selectedData2 <- reactive({
    selectedData1() %>% select(1:12) %>% filter(selectedData1()$Art %in% input$Art)
  })

  selectedData3 <- reactive({
    ComparisonTableHallo %>%
      select(1:12) %>%
      filter(ComparisonTableHallo$Bikes == gsub("[[:space:]]*$", "", gsub("- .*", '', input$Bikes)))
  })

  selectedData4 <- reactive({ rbind(selectedData3(), selectedData2()) })
  selectedData5 <- reactive({ selectedData4() %>% select(3:11) })
  selectedData6 <- reactive({
    as.numeric(knnx.index(selectedData5(), selectedData5()[1, , drop = FALSE], k = 2))
  })
  selectedData7 <- reactive({ selectedData4()[selectedData6(), ] })
  selectedData8 <- reactive({ selectedData7() %>% select(3:11) })

  # Combine the selected variables into a new data frame
  output$plot1 <- renderPlotly({
    validate(
      need(dim(selectedData2())[1] >= 2,
           "Sorry, no ten similar bikes were found. Please change the input filters.")
    )
    plot_ly(
      type = 'scatterpolar',
      mode = "closest",
      fill = 'toself'
    ) %>%
      add_trace(
        r = as.matrix(selectedData8()[1, ]),
        theta = c("Kettenstrebe", "Lenkwinkel", "Oberrohr", "Radstand", "Reach", "Sattelrohr",
                  "Sitzwinkel", "Stack", "Steuerrohr", "Tretlagerabsenkung"),
        showlegend = TRUE,
        mode = "markers",
        name = selectedData7()[1, 1]
      ) %>%
      add_trace(
        r = as.matrix(selectedData8()[2, ]),
        theta = c("Kettenstrebe", "Lenkwinkel", "Oberrohr", "Radstand", "Reach", "Sattelrohr",
                  "Sitzwinkel", "Stack", "Steuerrohr", "Tretlagerabsenkung"),
        showlegend = TRUE,
        mode = "markers",
        visible = "legendonly",
        name = selectedData7()[2, 1]
      ) %>%
      layout(
        polar = list(radialaxis = list(visible = T, range = c(0, 100))),
        showlegend = TRUE
      )
  })
}

# shiny app
ui <- fluidPage(navbarPage("Bike Comparison",
  tabPanel("Graphic", fluidPage(theme = shinytheme("flatly")),
    tags$head(tags$style(HTML(".shiny-output-error-validation{color: red;}"))),
    pageWithSidebar(
      headerPanel('Apply filters'),
      sidebarPanel(width = 4,
        selectInput('Bike', 'Choose a Bike:', paste(ComparisonTableHallo$Bikes)),
        checkboxGroupInput(inputId = "Art", label = 'Art:',
                           choices = c("Enduro" = "Enduro", "AllMountain" = "AllMountain"),
                           selected = c("Enduro" = "Enduro", "AllMountain" = "AllMountain"),
                           inline = TRUE),
        submitButton("Update filters")
      ),
      mainPanel(
        column(8,
               plotlyOutput("plot1", width = 800, height = 700),
               p("To visualize the graph of the player, click the icon at side of names in the graphic legend. It is worth noting that graphics will be overlapped.",
                 style = "font-size:25px")
        )
      )
    )
  )
))

shinyApp(ui = ui, server = server)
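One observation worth checking (my reading of the code above, not a confirmed fix): the UI defines the select input with the id 'Bike', while the server reads input$Bikes. input$Bikes is therefore NULL, gsub() on NULL returns character(0), and filter() receives a length-0 condition, which matches the "must be of size 19 or 1, not size 0" message. Aligning the two ids would be the first thing to try, e.g.:

# in the UI, use the same id the server expects
selectInput('Bikes', 'Choose a Bike:', paste(ComparisonTableHallo$Bikes))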
-
Simulate multiple dates between two dates
library(lubridate)

date1 <- ymd("2021/01/01")
date2 <- ymd("2021/01/31")
How can I simulate multiple dates between "2021-01-01" and "2021-01-31", for example ten dates like this:

[1] "2021-01-21" "2021-01-07" "2021-01-09" "2021-01-18" "2021-01-02" "2021-01-13" "2021-01-24" "2021-01-30" "2021-01-11" "2021-01-25"
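A minimal sketch of one way to do this (an assumption, not from the question): build the full sequence of days between the two dates and sample from it.

# using date1 and date2 as defined above; 10 random dates without replacement
sample(seq(date1, date2, by = "day"), 10)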
-
Is there a function (preferably in lubridate) to get the number of weeks in any given year?
I can probably hard code a lookup table and reference from that but I am wondering if there is a handy function in any R package that can return the number of weeks in a given year. ISO 8601 calendar is needed.
lubridate has the function isoweek, which returns the week number for a given date:

lubridate::isoweek("2020-12-31")
lubridate::isoweek("2021-12-31")
However, what I need is something like isoweeks(2020), which would return 53, and isoweeks(2021), which would return 52.
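A small sketch of one way to get this (not a built-in lubridate function, just relying on the ISO 8601 convention): 28 December always falls in the last ISO week of its year, so its isoweek() gives the week count.

library(lubridate)

isoweeks <- function(year) isoweek(ymd(paste0(year, "-12-28")))

isoweeks(2020)  # 53
isoweeks(2021)  # 52
-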
How do I calculate the shortest and longest time intervals in R?
I have the following dataset called phone:

TIMESTAMP                time      date
2021-01-12 10:42:50.221  10:42:50  2021-01-12
2021-01-12 10:46:01.826  10:46:01  2021-01-12
2021-01-12 10:50:10.063  10:50:10  2021-01-12
2021-01-12 10:53:10.715  10:53:10  2021-01-12
2021-01-12 10:53:14.329  10:53:14  2021-01-12
2021-01-12 10:54:19.792  10:54:19  2021-01-12
2021-01-12 11:01:43.044  11:01:43  2021-01-12
2021-01-12 11:04:36.202  11:04:36  2021-01-12

I would like to calculate the time intervals between two consecutive time values for the entire dataset, so that I can find the shortest and longest time intervals of the day. I've tried the following code to calculate the time differences, but it gives me negative values and I don't know exactly how accurate it is. I've tried changing the unit to minutes as well, but the output doesn't make sense to me.

v2 <- ymd_hms(phone$TIMESTAMP)
v1 <- difftime(v2[-length(v2)], v2[-1], unit = "hour")
Output of v1:
x
-0.0532236 hours
-0.0689547 hours
-0.0501811 hours
-0.0010039 hours
-0.0181842 hours

How can I calculate the intervals and arrange them in descending order of length?
dput of my data is as follows:
structure(list(TIMESTAMP = c("2021-01-12 10:42:50.221", "2021-01-12 10:46:01.826", "2021-01-12 10:50:10.063", "2021-01-12 10:53:10.715", "2021-01-12 10:53:14.329", "2021-01-12 10:54:19.792", "2021-01-12 11:01:43.044", "2021-01-12 11:04:36.202", "2021-01-12 11:07:36.636", "2021-01-12 11:18:59.169", "2021-01-12 11:25:44.954", "2021-01-12 11:25:54.263", "2021-01-12 11:26:25.414", "2021-01-12 11:28:05.471", "2021-01-12 11:30:24.349"), time = c("10:42:50", "10:46:01", "10:50:10", "10:53:10", "10:53:14", "10:54:19", "11:01:43", "11:04:36", "11:07:36", "11:18:59", "11:25:44", "11:25:54", "11:26:25", "11:28:05", "11:30:24"), date = structure(c(18639, 18639, 18639, 18639, 18639, 18639, 18639, 18639, 18639, 18639, 18639, 18639, 18639, 18639, 18639), class = "Date")), row.names = c(NA, 15L), class = "data.frame")
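A minimal sketch of one approach (my assumption about the intent, using the phone data above): subtract each timestamp from the following one so the differences come out positive, then sort.

library(dplyr)
library(lubridate)

# assumes the dput output above has been assigned to `phone`
phone %>%
  mutate(ts = ymd_hms(TIMESTAMP),
         gap_mins = as.numeric(difftime(ts, lag(ts), units = "mins"))) %>%
  arrange(desc(gap_mins))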
-
Using tidyverse to "unnest" a data.frame column inside a tibble
I'm working with some data returned from a www call, which jsonlite and as_tibble somehow convert into a data.frame column.

This result data has an Id integer column and an ActionCode data.frame column with two internal columns. These show in the console as:

> result
# A tibble: 117 x 2
      Id ActionCode$Code $Name
   <int> <chr>           <chr>
 1     1 A1              First Code
 2     2 A2              Second Code
 3     3 A3              Third Code
 4     4 A4              Fourth Code
...
and this can be inspected with str() as:

> result %>% str()
tibble [117 x 2] (S3: tbl_df/tbl/data.frame)
 $ Id        : int [1:117] 1 2 3 4 ...
 $ ActionCode:'data.frame': 117 obs. of 2 variables:
  ..$ Code: chr [1:117] "A1" "A2" "A3" "A4" ...
  ..$ Name: chr [1:117] "First Code" "Second Code" "Third Code" "Fourth Code" ...
I've seen from e.g. https://tibble.tidyverse.org/articles/types.html that this sort of data.frame column is perfectly legal, but I'm struggling to work out how to access the data in this column from tidy dplyr pipelines - e.g. I can't select(ActionCode$Code).
Is there a way to work with these columns in dplyr pipelines? Or is there a way to somehow flatten these columns, similar to how unnest can be used on list columns (although I realise here that I'm not creating extra rows - I'm just flattening the column hierarchy)?

i.e. I'm trying to find a function foo which can output:

> result %>% foo() %>% str()
tibble [117 x 2] (S3: tbl_df/tbl/data.frame)
 $ Id  : int [1:117] 1 2 3 4 ...
 $ Code: chr [1:117] "A1" "A2" "A3" "A4" ...
 $ Name: chr [1:117] "First Code" "Second Code" "Third Code" "Fourth Code" ...
I can't provide the www call as a sample, but as a working example I think the sort of data I am presented with is something like:
sample_data <- tibble(
  Id = 1:10,
  ActionCode = tibble(
    Code = paste0("Id", 1:10),
    Name = paste0("Name ", 1:10),
  )
)
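A minimal sketch of one candidate for foo (an assumption on my part, not from the question): tidyr::unpack() flattens a data.frame column into ordinary columns without adding rows.

library(tidyr)

# using the sample_data defined above
unpack(sample_data, ActionCode)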
-
Using case_when, how to mutate a new list-column that nests a vector within?
I'm trying to use dplyr's case_when() to mutate a new column based on conditions in other columns. However, I want the new column to nest a vector.

Example
Consider the following toy data. Based on it, I want to summarize the geographical territory of the UK.
library(tibble)

set.seed(1)
my_mat <- matrix(sample(c(TRUE, FALSE), size = 40, replace = TRUE), nrow = 10, ncol = 4)
colnames(my_mat) <- c("England", "Wales", "Scotland", "Northern_Ireland")
my_df <- as_tibble(my_mat)

> my_df
## # A tibble: 10 x 4
##    England Wales Scotland Northern_Ireland
##    <lgl>   <lgl> <lgl>    <lgl>
##  1 TRUE    TRUE  TRUE     FALSE
##  2 FALSE   TRUE  TRUE     FALSE
##  3 TRUE    TRUE  TRUE     TRUE
##  4 TRUE    TRUE  TRUE     FALSE
##  5 FALSE   TRUE  TRUE     TRUE
##  6 TRUE    FALSE TRUE     TRUE
##  7 TRUE    FALSE FALSE    FALSE
##  8 TRUE    FALSE TRUE     TRUE
##  9 FALSE   FALSE TRUE     FALSE
## 10 FALSE   TRUE  FALSE    FALSE
I want to mutate a new collective_geo_territory column:
- if England, Scotland, Wales, and Northern_Ireland are all TRUE, then we say this is United_Kingdom.
- otherwise, if only England, Scotland, and Wales are TRUE, then we say this is Great_Britain.
- any other combination simply returns a vector with the names of the countries that are TRUE.
My attempt
So far, I know how to address conditions (1) and (2) detailed above, using the following code
library(dplyr)

my_df %>%
  mutate(collective_geo_territory = case_when(
    England == TRUE & Wales == TRUE & Scotland == TRUE & Northern_Ireland == TRUE ~ "United_Kingdom",
    England == TRUE & Wales == TRUE & Scotland == TRUE ~ "Great_Britain"))
Desired Output
However, I want to achieve an output with a collective_geo_territory column that looks like the following:

## # A tibble: 10 x 5
##    England Wales Scotland Northern_Ireland collective_geo_territory
##    <lgl>   <lgl> <lgl>    <lgl>            <list>
##  1 TRUE    TRUE  TRUE     FALSE            <chr [1]>  # c("Great_Britain")
##  2 FALSE   TRUE  TRUE     FALSE            <chr [2]>  # c("Wales", "Scotland")
##  3 TRUE    TRUE  TRUE     TRUE             <chr [1]>  # c("United_Kingdom")
##  4 TRUE    TRUE  TRUE     FALSE            <chr [1]>  # c("Great_Britain")
##  5 FALSE   TRUE  TRUE     TRUE             <chr [3]>  # c("Wales", "Scotland", "Northern_Ireland")
##  6 TRUE    FALSE TRUE     TRUE             <chr [3]>  # c("England", "Scotland", "Northern_Ireland")
##  7 TRUE    FALSE FALSE    FALSE            <chr [1]>  # c("England")
##  8 TRUE    FALSE TRUE     TRUE             <chr [3]>  # c("England", "Scotland", "Northern_Ireland")
##  9 FALSE   FALSE TRUE     FALSE            <chr [1]>  # c("Scotland")
## 10 FALSE   TRUE  FALSE    FALSE            <chr [1]>  # c("Wales")
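A minimal sketch of one way to get there (my suggestion, not the asker's code): because each row may need a different-length vector, it is easier to build the value row by row and wrap it in list() than to force it through case_when().

library(dplyr)

my_df %>%
  rowwise() %>%
  mutate(collective_geo_territory = list({
    x <- c(England = England, Wales = Wales, Scotland = Scotland,
           Northern_Ireland = Northern_Ireland)
    if (all(x)) "United_Kingdom"
    else if (all(x[c("England", "Wales", "Scotland")])) "Great_Britain"
    else names(x)[x]
  })) %>%
  ungroup()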
-
R: Loop through columns in tibble to find differences between each and create new for each difference
I have been working on this for a while now, but I can't seem to figure it out. I'm looking for a solution that can: calculate difference between col1 and col2 and create colA based on this; then calculate difference between col2 and col3 and create colB based on this, etc. etc. I have about 70 rows and 42 of these columns so it's not something I want to do by hand (at this point I am almost desperate enough).
To give a note also, some of the cells in the rows are empty (NA). An emergency solution would be to fill these with zeroes, but I'd rather not.
Also, the dataframe I use is a tibble, however, I am not bound to this so much that I can't change it to a real dataframe.
My data looks like this: [image: testdata]
As you can see, the columns have annoyingly long names that I did not know how to change either :). I usually use the column numbers, which are 77:119. I hope this is complete enough. Sorry for the noob-ness and possibly unclear explanation; this is my first question on here and I'm not that crafty in R!
Finally, to create the 'user/intermittent_answers/n_length' columns I used the following loop, so I thought it'd be possible to reuse this for the calculations that I need now.
# loop through PARTS of testdata to create _length's
for (i in names(testdata[34:76]))
  testdata[[paste(i, 'length', sep="_")]] <- str_length(testdata[[i]])
Then I tried something similar which I found here: FOR loop to calculate difference on dates in R
for (j in 2:length(testdata$`user/intermittant_answers/42_length`))
  testdata$lag[j] <- as.numeric(difftime(testdata$`user/intermittant_answers/42_length`[j],
                                         testdata$`user/intermittant_answers/42_length`[j-1],
                                         units=c("difference")), units = "days")

Error in as.POSIXct.numeric(time1) : 'origin' must be supplied
I figured this was because I am not working with anything time-related, but I don't know how to find another 'diff'-related function that is not bound to matrices, like the one from the matrixStats package.
I hope someone can push me in the right direction!
Thank you!!
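A minimal sketch of one way to approach this (an assumption about the intent, reusing the testdata object and the 77:119 column range mentioned above): loop over consecutive pairs of those columns and store each difference in a new column.

# columns to compare, by position as described in the question
cols <- names(testdata)[77:119]

for (k in seq_len(length(cols) - 1)) {
  new_name <- paste(cols[k], "minus", cols[k + 1], sep = "_")
  testdata[[new_name]] <- testdata[[cols[k]]] - testdata[[cols[k + 1]]]
}
# NA values simply propagate: any pair containing NA gives an NA difference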