Binary encoding in data processing

I want to binary-encode the income column of a DataFrame, which has two categories, "<=50k" and ">50k", as 0 and 1 respectively. How should I do that?

2 answers

  • answered 2018-11-08 06:25 jezrael

    Create a boolean mask and convert it to integers; True becomes 1 and False becomes 0:

    df['binary'] = (df['col'] > 50000).astype(int)
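
    The line above assumes a numeric column. If the column instead holds the two string labels from the question, the same mask idea still applies. A minimal sketch, assuming the labels are exactly "<=50k" and ">50k" (the sample rows and the column names `binary` and `binary_map` are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data; the labels follow the question's two categories.
df = pd.DataFrame({"income": ["<=50k", ">50k", ">50k", "<=50k"]})

# Option 1: compare against the positive label and cast the boolean mask.
df["binary"] = (df["income"] == ">50k").astype(int)

# Option 2: explicit mapping; labels missing from the dict become NaN,
# which surfaces unexpected values instead of silently encoding them as 0.
df["binary_map"] = df["income"].map({"<=50k": 0, ">50k": 1})

print(df)
```

    The mask version treats anything that is not ">50k" as 0, so the mapping version may be the safer choice if the column could contain stray labels or whitespace.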
    

    Performance:

    import numpy as np
    import pandas as pd
    
    np.random.seed(423)
    
    df = pd.DataFrame({'col':np.random.randint(100000, size=1000)})
    
    
    In [30]: %timeit df['income']=df['col'].apply(lambda x: 1 if x>50000 else 0)
    762 µs ± 32.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    In [31]: %timeit df['binary'] = (df['col'] > 50000).astype(int)
    357 µs ± 4.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    In [43]: %timeit df["income"] = np.where(df["col"] <50000, 0, 1)
    375 µs ± 24.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

  • answered 2018-11-08 07:17 Abhinav Vajpeyi

    You can use "apply":

    df['income']=df['income'].apply(lambda x: 1 if x>50000 else 0)
    

    Edit 1:

    I think this would be much faster than my previous answer:

    df["income"] = np.where(df["col"] <= 50000, 0, 1)
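
    The three approaches in this thread should give identical results only if they treat the value 50000 the same way. A small sketch that checks this on hand-picked values around the boundary, with the np.where condition written as `<= 50000` so that exactly 50000 is encoded as 0 (the sample values and variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Hand-picked values, including the boundary case 50000.
df = pd.DataFrame({"col": [10000, 49999, 50000, 50001, 90000]})

mask_result = (df["col"] > 50000).astype(int)
apply_result = df["col"].apply(lambda x: 1 if x > 50000 else 0)
where_result = np.where(df["col"] <= 50000, 0, 1)

# Compare as plain lists so platform-specific integer dtypes do not matter.
all_agree = mask_result.tolist() == apply_result.tolist() == where_result.tolist()
print(mask_result.tolist(), all_agree)
```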
    

    Performance:

    %timeit df["income"] = np.where(df["col"] <50000, 0, 1)
    1000 loops, best of 3: 256 µs per loop
    
    %timeit df['income']=df['col'].apply(lambda x: 1 if x>50000 else 0)
    1000 loops, best of 3: 477 µs per loop
    
    %timeit df['binary'] = (df['col'] > 50000).astype(int)
    1000 loops, best of 3: 275 µs per loop
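
    `%timeit` only works inside IPython/Jupyter. A rough equivalent with the standard `timeit` module is sketched below, assuming the same random test data as in the other answer. It times only the computation, not the assignment back into the DataFrame, and the function names are made up for illustration. Absolute numbers will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(423)
df = pd.DataFrame({"col": np.random.randint(100000, size=1000)})


def with_apply():
    return df["col"].apply(lambda x: 1 if x > 50000 else 0)


def with_mask():
    return (df["col"] > 50000).astype(int)


def with_where():
    return np.where(df["col"] <= 50000, 0, 1)


timings = {}
for func in (with_apply, with_mask, with_where):
    # Best of 3 repeats of 100 calls each, converted to microseconds per call.
    best = min(timeit.repeat(func, number=100, repeat=3))
    timings[func.__name__] = best / 100 * 1e6

for name, micros in timings.items():
    print(f"{name}: {micros:.0f} µs per call")
```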