How to create a new column by grouping values in an existing column
How do I create a new column whose value is "low" if cyl < 6, "high" if cyl > 6, and NA otherwise?
Input:

head(mtcars) %>% select(cyl)

                  cyl
Mazda RX4           6
Mazda RX4 Wag       6
Datsun 710          4
Hornet 4 Drive      6
Hornet Sportabout   8
Valiant             6
Output:

                  cyl new_column
Mazda RX4           6         NA
Mazda RX4 Wag       6         NA
Datsun 710          4        low
Hornet 4 Drive      6         NA
Hornet Sportabout   8       high
Valiant             6         NA
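One way to produce exactly this output is dplyr::case_when(), which falls back to NA for rows that match no condition; a sketch using the built-in mtcars:

```r
library(dplyr)

# "low" below 6 cylinders, "high" above 6, NA (the unmatched default) at exactly 6
result <- mtcars %>%
  select(cyl) %>%
  mutate(new_column = case_when(
    cyl < 6 ~ "low",
    cyl > 6 ~ "high"
  ))

head(result)
```

A nested ifelse(cyl < 6, "low", ifelse(cyl > 6, "high", NA)) would work too; case_when just keeps the branches flat.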
See also questions close to this topic

How to make a function that loops over two lists
I have an event A that is triggered when the majority of coin tosses in a series of tosses comes up heads. I have an unfair coin, and I'd like to see how the likelihood of A changes as the number of tosses and the probability of heads on each toss change.
This is my function, assuming 3 tosses:

n <- 3  # victory requires majority of tosses heads
        # tosses only occur in odd intervals
k <- seq(n/2 + .5, n)
victory <- function(n, k, p) {
  for (i in p) {
    x <- 0
    for (i in k) {
      x <- x + choose(n, k) * p^k * (1 - p)^(n - k)
    }
    z <- x
  }
  return(z)
}
p <- seq(0, 1, .1)
victory(n, k, p)
My hope is that the victory() function would:

1) find the probability of each of the outcomes where the majority of tosses are heads, given a particular value p
2) sum up those probabilities and add them to a vector z
3) go back and do the same thing given another probability p

I tested this with n <- 3, k <- c(2, 3), and p <- c(.5, .75), and the output was 0.75000, 0.84375. I know that the output should have been 0.625, 0.0984375.
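For reference, one way the intended computation could be written is to loop over p (e.g. with sapply) and, for each value, sum the binomial terms over the majority counts k; a sketch, not the asker's original code:

```r
# P(majority of n tosses are heads) for each probability in p:
# sum C(n, k) p^k (1 - p)^(n - k) over the majority values of k
victory <- function(n, p) {
  k <- seq(n / 2 + 0.5, n)  # majority head-counts, assuming odd n
  sapply(p, function(pi) sum(choose(n, k) * pi^k * (1 - pi)^(n - k)))
}

victory(3, c(0.5, 0.75))
```

In the posted version, the inner loop ignores its index and adds the same vectorized expression on every pass, so the element-wise terms get doubled instead of summed across k, which is where 0.75000, 0.84375 came from.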
Exponentiation of Log Transformed Values in Mixed Effects Model
I have run a linear mixed-effects model in R using the nlme package, in which my response variable (Proximal_Lead_Bowing) was transformed to log10 scale (Log_Bowing) due to a non-normal distribution of values. The estimated differences in Log_Bowing between different Deep Brain Stimulation Electrodes (DBS_Electrode), as estimated by the model using the "glht" function for multiple comparisons of means (Tukey contrasts), are as follows (view screenshot for full glht() output: https://imgur.com/WVJ9KM6):
Linear Hypothesis:
Medtronic 3389 - Boston Scientific Versice == 0             Estimate:  0.5766*
St. Jude Medical Infinity - Boston Scientific Versice == 0  Estimate:  0.2208
St. Jude Medical Infinity - Medtronic 3389 == 0             Estimate: -0.3558*

*Denotes significance
Exponentiating these values (10^abs(Estimate)) provides me with the following estimates for true differences in Proximal_Lead_Bowing as estimated by our mixed-effects model:
Linear Hypothesis:
Medtronic 3389 - Boston Scientific Versice == 0             3.77 (in millimeters)
St. Jude Medical Infinity - Boston Scientific Versice == 0  1.66
St. Jude Medical Infinity - Medtronic 3389 == 0             2.27
These values do not make sense considering that the average Proximal_Lead_Bowing ± 95% CI for each DBS_Electrode in the sample is as follows:
Boston Scientific Versice:  2.10 ± 0.67 (in millimeters)
Medtronic 3389:             2.95 ± 0.58
St. Jude Medical Infinity:  2.00 ± 0.35
Thus I would expect true differences in Proximal_Lead_Bowing, as estimated by our linear mixed model, to be approximately 1.0 mm between Medtronic 3389 and the other DBS_Electrode models, but instead the exponentiated values I have calculated don't seem to make sense. Am I missing something in the process of exponentiating log10 values and/or in the use of the "glht" function for multiple comparisons of means? Any feedback would be appreciated.
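For context, a difference of log10 values back-transforms to a ratio on the original scale, not to an additive difference in millimeters; a minimal illustration using the sample means quoted above:

```r
# group means on the original scale (values from the question, in mm)
a <- 2.95   # Medtronic 3389
b <- 2.10   # Boston Scientific Versice

diff_log <- log10(a) - log10(b)   # what a contrast on Log_Bowing estimates

10^diff_log   # equals a / b, a multiplicative fold-change (~1.40), not a - b
a - b         # the additive difference (~0.85 mm) lives on the original scale
```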

What kind of statistical method should I use to test for enrichment or overrepresentation in a rank-ordered vector with binary status?
I have gene expression data from 1065 different cell lines for, say, the "BRAF" gene. The BRAF expression levels are rank-ordered. Most TP53-mutated cell lines have high BRAF expression (see the figure below). What kind of statistical method should I use to test for enrichment or overrepresentation of TP53 status (WT vs Mutant) along BRAF expression?
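One rank-based option often used in this setting is the Wilcoxon rank-sum (Mann-Whitney) test, which asks whether one group tends to sit higher in the ordering; a sketch on simulated stand-in data (the variable names and numbers are made up for illustration):

```r
set.seed(42)

# simulated stand-ins: expression values with mutant lines shifted upward
braf_expr   <- c(rnorm(600, mean = 6), rnorm(400, mean = 7))
tp53_status <- factor(rep(c("WT", "Mutant"), times = c(600, 400)))

# do Mutant cell lines rank higher in BRAF expression than WT?
res <- wilcox.test(braf_expr ~ tp53_status)
res$p.value
```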

How to use seq() to create column of date/times with increments of milliseconds (deciseconds)
My dataframe (df) looks like this:
ID  CH_1  CH_2  CH_3  CH_4  date_time
 1 10096 11940  9340  9972  2018-07-24 10:45:01.1
 2 10088 11964  9348  9960  <NA>
 3 10084 11940  9332  9956  <NA>
 4 10088 11956  9340  9960  <NA>
The last column, date_time, is coded in POSIXct format. What I need to do is populate the rest of the date_time column (the dataframe is quite large) with increasing deciseconds (100-millisecond steps), so that the column looks like this, and so on...
ID  CH_1  CH_2  CH_3  CH_4  date_time
 1 10096 11940  9340  9972  2018-07-24 10:45:01.1
 2 10088 11964  9348  9960  2018-07-24 10:45:01.2
 3 10084 11940  9332  9956  2018-07-24 10:45:01.3
 4 10088 11956  9340  9960  2018-07-24 10:45:01.4
I have tried using the following,

startDate <- df[["date_time"]][1]
datasetname2$date_time = as.POSIXct(startDate) +
  seq.POSIXt(datasetname2[6, 1], units = "seconds", by = .1)
but it returns an error (see below)
Error in seq.POSIXt(datasetname2[6, 1], units = "seconds", by = 0.1) : 'from' must be a "POSIXt" object
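For reference, one way around that error is to skip seq.POSIXt() entirely: adding a plain numeric sequence of seconds to a POSIXct value shifts it element-wise; a sketch on made-up data:

```r
# small stand-in for the real dataframe
df <- data.frame(ID = 1:4, CH_1 = c(10096, 10088, 10084, 10088))

# starting timestamp; sub-second digits need options(digits.secs) to print
startDate <- as.POSIXct("2018-07-24 10:45:01.1", tz = "UTC")

# POSIXct + numeric is vectorized: offsets of 0.0, 0.1, 0.2, ... seconds
df$date_time <- startDate + seq(0, by = 0.1, length.out = nrow(df))

options(digits.secs = 1)
df$date_time
```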

How to split columns with 3 data attributes in the name into several columns, then collapse across different levels of aggregation?
I have a large dataframe that takes the below form, where each column labels year, commodity, and unit. Each observation corresponds to a mine, and each value is amount produced.
library(tibble)
rdf <- tribble(
  ~`1997_Silver_oz`, ~`1998_Diamonds_ct`, ~`1999_Coal_lbs`, ~`1999_Copper_tonnes`,
  150000,            20000,               NA_integer_,      NA_integer_,
  NA_integer_,       50000,               NA_integer_,      1,
  NA_integer_,       NA_integer_,         NA_integer_,      NA_integer_,
  40000,             205000,              NA_integer_,      NA_integer_
)
I want to collapse these data down to two levels of aggregation, to see where there's nonzero production for each Year and Commodity/Year.
What is the intermediate step I need to take to split my existing columns into multiple, like the below?
rdf_gathered <- tribble(
  ~year, ~commodity, ~unit,    ~amount,
  1997,  'Silver',   'oz',     150000,
  1997,  'Silver',   'oz',     NA_integer_,
  1997,  'Silver',   'oz',     NA_integer_,
  1997,  'Silver',   'oz',     40000,
  1998,  'Diamonds', 'ct',     20000,
  1998,  'Diamonds', 'ct',     50000,
  1998,  'Diamonds', 'ct',     NA_integer_,
  1998,  'Diamonds', 'ct',     205000,
  1999,  'Coal',     'lbs',    NA_integer_,
  1999,  'Coal',     'lbs',    NA_integer_,
  1999,  'Coal',     'lbs',    NA_integer_,
  1999,  'Coal',     'lbs',    NA_integer_,
  1999,  'Copper',   'tonnes', NA_integer_,
  1999,  'Copper',   'tonnes', 1,
  1999,  'Copper',   'tonnes', NA_integer_,
  1999,  'Copper',   'tonnes', NA_integer_
)
And after that step, what step should I take to collapse this dataframe into one that flags nonzero production, like the below? [NA -> 0, else 1]
# Collapse
rdf_collapsed_v1 <- tribble(
  ~`1997_Silver`, ~`1998_Diamonds`, ~`1999_Coal`, ~`1999_Copper`,
  1,              1,                0,            1
)
rdf_collapsed_v2 <- tribble(
  ~`1997`, ~`1998`, ~`1999`,
  1,       1,       1
)
I use and mostly prefer tidyverse functions, but I'm interested in any elegant base solution as well.
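One possible intermediate step is gather() followed by separate() on the underscore, then a grouped summarise for the collapse; a sketch that rebuilds the rdf tibble so it runs standalone:

```r
library(tibble)
library(tidyr)
library(dplyr)

rdf <- tribble(
  ~`1997_Silver_oz`, ~`1998_Diamonds_ct`, ~`1999_Coal_lbs`, ~`1999_Copper_tonnes`,
  150000,      20000,       NA_integer_, NA_integer_,
  NA_integer_, 50000,       NA_integer_, 1,
  NA_integer_, NA_integer_, NA_integer_, NA_integer_,
  40000,       205000,      NA_integer_, NA_integer_
)

# long format, then split "year_commodity_unit" into three columns
rdf_gathered <- rdf %>%
  gather(key, amount) %>%
  separate(key, into = c("year", "commodity", "unit"), sep = "_", convert = TRUE)

# nonzero-production flag per year/commodity (NA -> 0, else 1)
rdf_flags <- rdf_gathered %>%
  group_by(year, commodity) %>%
  summarise(produced = as.integer(any(!is.na(amount) & amount > 0)), .groups = "drop")
```

From rdf_flags, a second group_by(year) pass would give the year-only level of aggregation.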

structuring binary data for a sankey plot
I am having trouble figuring out how to make a sankey plot for data where there are multiple opportunities for success (1) or failure (0). You can generate my sample with the following code:
# example
library(networkD3)
library(tidyverse)
library(tidyr)

set.seed(900)
n = 1000
example.data <- data.frame("A" = rep(1, n),
                           "B" = sample(c(0, 1), n, replace = T),
                           "C" = rep(NA, n),
                           "D" = rep(NA, n),
                           "E" = rep(NA, n),
                           "F" = rep(NA, n),
                           "G" = rep(NA, n))
for (i in 1:n) {
  example.data$C[i] <- ifelse(example.data$B[i] == 1,
                              sample(c(0, 1), 1, prob = c(0.3, 0.7), replace = F),
                              sample(c(0, 1), 1, prob = c(0.55, 0.45), replace = F))
  example.data$D[i] <- ifelse(example.data$C[i] == 1,
                              sample(c(0, 1), 1, prob = c(0.95, 0.05), replace = F),
                              sample(c(0, 1), 1, prob = c(0.65, 0.35), replace = F))
  example.data$E[i] <- ifelse(example.data$C[i] == 0 & example.data$D[i] == 0,
                              sample(c(0, 1), 1, prob = c(.9, .1), replace = F),
                       ifelse(example.data$C[i] == 0 & example.data$D[i] == 1,
                              sample(c(0, 1), 1, prob = c(.3, .7), replace = F),
                       ifelse(example.data$C[i] == 1 & example.data$D[i] == 0,
                              sample(c(0, 1), 1, prob = c(.9, .1), replace = F),
                              sample(c(0, 1), 1, prob = c(.1, .9), replace = F))))
  example.data$F[i] <- ifelse(example.data$E[i] == 1,
                              sample(c(1, 0), 1, prob = c(.85, .15), replace = F),
                              sample(c(1, 0), 1, prob = c(.01, .99), replace = F))
  example.data$G[i] <- sample(c(1, 0), 1, prob = c(.78, .22), replace = F)
}
example.data.1 <- example.data %>%
  gather() %>%
  mutate(ORDER = c(rep(0, n), rep(1, n), rep(2, n), rep(3, n),
                   rep(4, n), rep(5, n), rep(6, n))) %>%
  dplyr::select("Event" = key, "Success" = value, ORDER) %>%
  group_by(ORDER) %>%
  summarise("YES" = sum(Success == 1), "NO" = sum(Success == 0))
The tricky part for me is how I can generate the links data without having to manually specify the source targets and values.
I used the sankey example from this website, and proceeded to muscle my own example data in the least elegant way possible:
links <- data.frame("source" = sort(rep(seq(0, 10, 1), 2)),
                    "target" = c(1, 2, 3, 4, 3, 4, 5, 6, 5, 6, 7, 8,
                                 7, 8, 9, 10, 9, 10, 11, 12, 11, 12),
                    "value" = c(sum(example.data$A == 1 & example.data$B == 1),  #1
                                sum(example.data$A == 1 & example.data$B == 0),  #2
                                sum(example.data$B == 1 & example.data$C == 1),  #3
                                sum(example.data$B == 1 & example.data$C == 0),  #4
                                sum(example.data$B == 0 & example.data$C == 1),  #5
                                sum(example.data$B == 0 & example.data$C == 0),  #6
                                sum(example.data$C == 1 & example.data$D == 1),  #7
                                sum(example.data$C == 1 & example.data$D == 0),  #8
                                sum(example.data$C == 0 & example.data$D == 1),  #9
                                sum(example.data$C == 0 & example.data$D == 0),  #10
                                sum(example.data$D == 1 & example.data$E == 1),  #11
                                sum(example.data$D == 1 & example.data$E == 0),  #12
                                sum(example.data$D == 0 & example.data$E == 1),  #13
                                sum(example.data$D == 0 & example.data$E == 0),  #14
                                sum(example.data$E == 1 & example.data$F == 1),  #15
                                sum(example.data$E == 1 & example.data$F == 0),  #16
                                sum(example.data$E == 0 & example.data$F == 1),  #17
                                sum(example.data$E == 0 & example.data$F == 0),  #18
                                sum(example.data$F == 1 & example.data$G == 1),  #19
                                sum(example.data$F == 1 & example.data$G == 0),  #20
                                sum(example.data$F == 0 & example.data$G == 1),  #21
                                sum(example.data$F == 0 & example.data$G == 0))) #22
nodes <- data.frame("name" = names(example.data))
example.list <- list(nodes, links)
names(example.list) <- c("nodes", "links")
My problem is twofold: 1) trying to use this data in the sankeyNetwork function does not actually produce a plot at all, and 2) this method will obviously be prone to a lot of error, especially if there are more than 2 targets per node.
I found an example on Stack Overflow where the person used a match call inside a dplyr::mutate function that looked promising for what I'm trying to accomplish, but the data had a slightly different structure, and I didn't really know how to get the match call to work with my own data.
The output I'm going for is a sankey plot that shows the number of observations moving between each of the events/outcomes [A:F]. So imagine each of the columns represents an event, either successful or not successful. The sankey plot would illustrate a summary of total successes and failures of each event: all 1000 observations start at A, with 493 going to the node B = 1 and the remaining 507 going to the node B = 0. Of the 493 in B = 1, 345 go to the node C = 1 and 148 go to the node C = 0. Of the 507 in B = 0, 263 go to C = 1 and 244 go to C = 0, and so on for the rest of the events A through F. I hope I've made this clear enough. Any help on this would be greatly appreciated.
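One way to avoid enumerating the links by hand is to count transitions between each pair of consecutive event columns and derive the node indices with match(); a sketch on a small stand-in dataframe (the real example.data would drop in the same way):

```r
set.seed(1)
# toy stand-in for example.data: one binary column per event
dat <- data.frame(A = rep(1, 100),
                  B = sample(0:1, 100, replace = TRUE),
                  C = sample(0:1, 100, replace = TRUE))

# count transitions from one event column to the next
pair_counts <- function(df, from, to) {
  tab <- as.data.frame(table(df[[from]], df[[to]]), stringsAsFactors = FALSE)
  names(tab) <- c("from_val", "to_val", "value")
  tab$source_name <- paste0(from, " = ", tab$from_val)
  tab$target_name <- paste0(to, " = ", tab$to_val)
  tab[tab$value > 0, ]   # links with zero flow are just clutter
}

cols <- names(dat)
links <- do.call(rbind, lapply(seq_len(length(cols) - 1),
                               function(i) pair_counts(dat, cols[i], cols[i + 1])))

# one node per (event, outcome) pair; sankeyNetwork() wants 0-based indices
nodes <- data.frame(name = unique(c(links$source_name, links$target_name)))
links$source <- match(links$source_name, nodes$name) - 1
links$target <- match(links$target_name, nodes$name) - 1
```

These frames should be in the shape sankeyNetwork(Links = links, Nodes = nodes, Source = "source", Target = "target", Value = "value", NodeID = "name") expects. Note this deliberately splits each event into separate success/failure nodes: a plain A:G node list, as in the posted nodes dataframe, cannot distinguish B = 1 from B = 0, which may be why no plot appeared.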