counting rows between two specific rows with a condition
df < structure(list(inv = c("INV_1", "INV_1", "INV_1", "INV_1", "INV_1"), ass = c("x", "x", "x", "x", "x"), datetime = c("20100101",
"20100102", "20100103", "20100108", "20100119"), portfolio = c(10,
0, 5, 2, 0)), operation = c(10, 10, 5, 3, 2), class = "data.frame", row.names = c(NA, 5L))
So I have 4000
investors with 6000
different assets, for each investor I have his trading operations in two different variables: operation tells me if he is buying/selling; portfolio tells me how much he has in the portfolio.
What I want to do is computing the number of days a position stays open in the portfolio, so I though about computing the difference between the day in which the portfolio goes back to zero and the day in which the portfolio went positive (it is not possible to get negative portfolio).
so in the dataset above I would count row2  row1 ==> 20100102  20100101
and row 5  row 3 ==> 20100119  20100103
and so on...
I want to do this computation for all the investor & asset I have in my dataset for all the rows in which I find that portfolio > 0
.
So my dataset will have a further column called duration
which would be equal, in this case to c(0,1,0,5,16)
(so of course i also had to compute raw1  raw1
and raw3  raw3
)
Hence my problem is to restart the count everytime portfolio
goes back to zero.
2 answers

library(dplyr) df %>% mutate(datetime = as.Date(datetime, "%Y%m%d")) %>% group_by(investor, asset) %>% arrange(datetime) %>% mutate(grp.pos = cumsum(lag(portfolio, default = 1) == 0)) %>% group_by(investor, asset, grp.pos) %>% mutate(`Open (#days)` = datetime  datetime[1]) #> # A tibble: 5 x 6 #> # Groups: investor, asset, grp.pos [2] #> investor asset datetime portfolio grp.pos `Open (#days)` #> <chr> <chr> <date> <dbl> <int> <drtn> #> 1 INV_1 x 20100101 10 0 0 days #> 2 INV_1 x 20100102 0 0 1 days #> 3 INV_1 x 20100103 5 1 0 days #> 4 INV_1 x 20100108 2 1 5 days #> 5 INV_1 x 20100119 0 1 16 days
Data:
df < structure(list(investor = c("INV_1", "INV_1", "INV_1", "INV_1", "INV_1"), asset = c("x", "x", "x", "x", "x"), datetime = c("20100101", "20100102", "20100103", "20100108", "20100119"), portfolio = c(10, 0, 5, 2, 0)), operation = c(10, 10, 5, 3, 2), class = "data.frame", row.names = c(NA, 5L))

Here is a way how we could do it, that is expandable if necessary for
ass
First we group by
inv
to use for the original dataset. Then transformdatetime
to date format to do calculations easily (here we useymd()
function).The next step could be done in different ways:
Main idea is to group the column
portfolio
indicated by the last row of the group that is 0. For this we arrangedatetime
in descending form to easily apply the grouping id withcumsum == 0
.After rearranging
datetime
we can calculate the last from the first as intended:library(dplyr) library(lubridate) df %>% group_by(inv) %>% mutate(datetime = ymd(datetime)) %>% arrange(desc(datetime)) %>% group_by(position_Group = cumsum(portfolio==0)) %>% arrange(datetime) %>% mutate(position_open = last(datetime)first(datetime)) %>% ungroup()
inv ass datetime portfolio operation id_Group position_open <chr> <chr> <date> <dbl> <dbl> <int> <drtn> 1 INV_1 x 20100101 10 10 2 1 days 2 INV_1 x 20100102 0 10 2 1 days 3 INV_1 x 20100103 5 5 1 16 days 4 INV_1 x 20100108 2 3 1 16 days 5 INV_1 x 20100119 0 2 1 16 days
do you know?
how many words do you know