How to transform column with lists of strings into a new column of unique identifiers

Apologies if a similar question has been asked before, I was unable to find one, possibly because of wording of the question.

Some current sample data looks like this, where the first column is a list of identifiers (genes) and the second column is a set of descriptors (gene ontology IDs):

Gene    Gene_Ontology_ID
Gene1   GO1, GO2, GO4, GO6
Gene2   GO2, GO3, GO4
Gene3   GO5, GO7

I am wondering if there is an efficient way to transform a large table formatted similarly so that the "Gene_Ontology_ID" column now serves as the indentifying column, and the "Gene" column is now a list of genes with that Gene_Ontology_ID, like this:

Gene_Ontology_ID    Gene
GO1                 Gene1
GO2                 Gene1,Gene2
GO3                 Gene2
GO4                 Gene1,Gene2
GO5                 Gene3
GO6                 Gene1
GO7                 Gene3

Is there a solution to this, preferably using Unix, Python, or R? Any help would be greatly appreciated, thank you.

4 answers

  • answered 2018-01-11 20:57 Jason

    library(dplyr)
    library(tidyr)
    out <- df %>% separate(Gene_Ontology_ID, into=paste("genes", 1:7, sep = "_"),sep =", ",fill="right") %>% 
      gather(key,Gene_Ontology_ID, -Gene,na.rm=TRUE) %>%
      arrange(Gene_Ontology_ID,Gene) %>%
      group_by(Gene_Ontology_ID) %>% 
      summarise(Gene= paste(Gene,collapse =", "))
    
    out
    # A tibble: 7 x 2
      Gene_Ontology_ID         Gene
                 <chr>        <chr>
    1              GO1        Gene1
    2              GO2 Gene1, Gene2
    3              GO3        Gene2
    4              GO4 Gene1, Gene2
    5              GO5        Gene3
    6              GO6        Gene1
    7              GO7        Gene3
    

  • answered 2018-01-11 21:02 tyluRp

    A similar option with strsplit and unnest:

    library(dplyr)
    library(tidyr)
    
    df %>% 
      mutate(Gene_Ontology_ID = strsplit(Gene_Ontology_ID, ", ")) %>%
      unnest(Gene_Ontology_ID) %>% 
      group_by(Gene_Ontology_ID) %>% 
      summarise(Gene = paste(Gene, collapse = ", "))
    
    # A tibble: 7 x 2
      Gene_Ontology_ID Gene        
      <chr>            <chr>       
    1 GO1              Gene1       
    2 GO2              Gene1, Gene2
    3 GO3              Gene2       
    4 GO4              Gene1, Gene2
    5 GO5              Gene3       
    6 GO6              Gene1       
    7 GO7              Gene3      
    

  • answered 2018-01-11 21:02 samadhi

    Using data.table and cSplit from splitstackshape:-

    library(data.table)
    library(splitstackshape)
    df <- data.frame(Gene = c("Gene1", "Gene2", "Gene3"), Gene_Ontology_ID = c("GO1, GO2, GO4, GO6", "GO2, GO3, GO4", "GO5, GO7"))
    df <- cSplit(df, 'Gene_Ontology_ID', ',', 'long', drop = FALSE)
    setDT(df)
    df[, Gene := as.character(Gene)]
    df[, Gene := paste0(Gene, collapse = ", "), by = Gene_Ontology_ID]
    setcolorder(df, c("Gene_Ontology_ID", "Gene"))
    df <- unique(df)
    

    You'll get:-

     Gene_Ontology_ID         Gene
    1:              GO1        Gene1
    2:              GO2 Gene1, Gene2
    3:              GO4 Gene1, Gene2
    4:              GO6        Gene1
    5:              GO3        Gene2
    6:              GO5        Gene3
    7:              GO7        Gene3
    

  • answered 2018-01-11 21:03 snoram

    Here is a solution using only base R (probably not the most efficient):

    # Obtain a vector of unique "gene ontology ids"
    all_genes_id <- paste0(df$Gene_Ontology_ID, collapse = ", ")
    all_genes_id <- unique(strsplit(all_genes_id, ", ")[[1]])
    
    # Initalize and fill vector of genes per each "gene ontology ids"
    genes_per_id <- vector(mode = "character", length(all_genes_id))
    for(i in 1:length(all_genes_id)) {
      rows_df <- grepl(all_genes_id[i], df$Gene_Ontology_ID)
      genes_per_id[i] <- paste0(df$Gene[rows_df], collapse = ",")
    }
    
    # New data frame
    df2 <- data.frame(Gene_Ontology_ID = all_genes_id,
                      Gene = genes_per_id)
    df2
    # Result
      Gene_Ontology_ID        Gene
    1              GO1       Gene1
    2              GO2 Gene1,Gene2
    3              GO4 Gene1,Gene2
    4              GO6       Gene1
    5              GO3       Gene2
    6              GO5       Gene3
    7              GO7       Gene3