Extract a specific string from values in a column, excluding values that match a specific string

I have a column with values like this

01_PLAGL1
02_PLAGL1
03_GRB10
04_GRB10
05_H19
06_H19
07_H19
control_11
control_12
# Actually it is longer than this, but it follows the same pattern

And I need a list like this

PLAGL1, GRB10, H19

I don't need the control values.

How can I do this?

3 answers

  • answered 2022-05-04 09:57 Tim Biegeleisen

    Use str.extract:

    df["output"] = df["col"].str.extract(r'([^_]+)$')
    

    Or maybe use str.replace:

    df["output"] = df["col"].str.replace(r'.*_', '', regex=True)
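
    To see what the two patterns return on the data from the question, here is a minimal sketch. The column name "col" is an assumption, and regex=True is passed explicitly because pandas 2.0 and later treat the pattern as a literal string by default. Note that neither pattern filters out the control rows; they simply come back as "11" and "12".

```python
import pandas as pd

# Sample data copied from the question; the column name "col" is an assumption.
df = pd.DataFrame({"col": ["01_PLAGL1", "02_PLAGL1", "03_GRB10", "04_GRB10",
                           "05_H19", "06_H19", "07_H19",
                           "control_11", "control_12"]})

# Capture everything after the last underscore.
extracted = df["col"].str.extract(r"([^_]+)$", expand=False)

# Same result: delete everything up to and including the last underscore.
replaced = df["col"].str.replace(r".*_", "", regex=True)

# The control rows are still present at this point.
print(extracted.tolist())
```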
    

  • answered 2022-05-04 10:10 Zero

    This removes every row whose value has control as its prefix from the DataFrame.

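    # Split each value on "_" into a helper frame, one column per part.
    new_df = pd.DataFrame(df["Column"].str.split("_").tolist())
    # Row labels in the helper frame whose first part is "control".
    invalid_index = new_df[new_df[0] == "control"].index
    df.drop(index=invalid_index, inplace=True)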
    
    Column
    0 01_PLAGL1
    1 02_PLAGL1
    2 03_GRB10
    3 04_GRB10
    4 05_H19
    5 06_H19
    6 07_H19

    Would appreciate some suggestions on improving this code!
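
    One possible simplification, offered as a sketch rather than the definitive way: compare the prefix directly and filter with a boolean mask. This skips the helper DataFrame and does not rely on df having a default 0..n-1 index, which the drop-by-index version does. The sample data is taken from the question and the name filtered is made up for illustration.

```python
import pandas as pd

# Sample data from the question, using the column name from this answer.
df = pd.DataFrame({"Column": ["01_PLAGL1", "02_PLAGL1", "03_GRB10", "04_GRB10",
                              "05_H19", "06_H19", "07_H19",
                              "control_11", "control_12"]})

# Text before the first underscore, one entry per row.
prefix = df["Column"].str.split("_").str[0]

# Keep only the rows whose prefix is not "control"; df itself is left unchanged.
filtered = df[prefix != "control"]
print(filtered)
```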

  • answered 2022-05-04 10:10 mozway

    Building on @Tim's answer, do you want something like this?

    out = (df
       .loc[~df['col'].str.startswith('control'), 'col']
       .str.extract(r'([^_]+)$', expand=False)
       .drop_duplicates().to_list()
    )
    

    output: ['PLAGL1', 'GRB10', 'H19']
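
    For anyone who wants to run this end to end, here is a self-contained sketch on the question's data. The function name unique_suffixes and its skip_prefix parameter are hypothetical, introduced only for illustration; the final join is only needed if a single comma-separated string is wanted instead of a Python list.

```python
import pandas as pd

# Sample data from the question; the column name "col" is an assumption.
df = pd.DataFrame({"col": ["01_PLAGL1", "02_PLAGL1", "03_GRB10", "04_GRB10",
                           "05_H19", "06_H19", "07_H19",
                           "control_11", "control_12"]})


def unique_suffixes(series, skip_prefix="control"):
    """Hypothetical helper: unique text after the last "_", skipping rows that start with skip_prefix."""
    kept = series[~series.str.startswith(skip_prefix)]
    # drop_duplicates keeps the first occurrence, so the original order is preserved.
    return kept.str.extract(r"([^_]+)$", expand=False).drop_duplicates().to_list()


out = unique_suffixes(df["col"])
joined = ", ".join(out)  # one comma-separated string, as written in the question
print(joined)
```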
