Finding Euclidean distance from multiple mean vectors

This is what I am trying to do. I was able to do steps 1 to 4 and need help with steps 5 onward.

Basically, for each data point I would like to find the Euclidean distance from all mean vectors, where the mean vectors are computed per value of column y (the 'class' column in the code below).

  1. take data
  2. separate out non numerical columns
  3. find mean vectors by y column
  4. save means
  5. subtract each mean vector from each row based upon y value
  6. square each column
  7. add all columns
  8. join back to numerical dataset and then join non numerical columns

import pandas as pd

data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data,columns=['Name','Age','weight','class'])
# dtype=float on the whole frame can raise on the string Name column in newer pandas,
# so only the numeric columns are cast
df = df.astype({'Age': float, 'weight': float, 'class': float})
print(df)
df_numeric=df.select_dtypes(include='number')
df_non_numeric=df.select_dtypes(exclude='number')
means=df_numeric.groupby('class').mean()

For each row of means, subtract that row from each row of df_numeric. Then square each column of the result and, for each row, sum all columns. Finally, join this data back to df_numeric and df_non_numeric.

-------------- Update 1

I added the code below. My questions have changed; the updated questions are at the end.

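import numpy as np

def calculate_distance(row):
    # squared distance to the first class mean; np.sqrt of this is the Euclidean distance
    return np.sum(np.square(row-means.iloc[0]))

def calculate_distance2(row):
    # squared distance to the last class mean
    return np.sum(np.square(row-means.iloc[-1]))


# drop("class",1) no longer works in recent pandas, so the axis is named explicitly
df_numeric2=df_numeric.drop(columns="class")
features=df_numeric2.copy()  # feature columns only, before the distance columns are added
df_numeric2['distance0']= features.apply(calculate_distance, axis=1)
df_numeric2['distance1']= features.apply(calculate_distance2, axis=1)

print(df_numeric2)

final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"]=df["class"]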

Could anyone confirm that this is a correct way to achieve the result? I am mainly concerned about the last two statements. Would the second-to-last statement do a correct join? Would the final statement assign the original class? I would like to confirm that Python won't do the concat and the class assignment in a random order, and that the order in which the rows appear is maintained.

final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"]=df["class"]

2 answers

  • answered 2019-03-13 19:10 mortysporty

    I think this is what you want

    import pandas as pd
    import numpy as np
    data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
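    df = pd.DataFrame(data,columns=['Name','Age','weight','class'])
    # Cast only the numeric columns; dtype=float on the whole frame can raise on 'Name'
    df = df.astype({'Age': float, 'weight': float, 'class': float})
    print(df)
    df_numeric=df.select_dtypes(include='number')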
    # Make df_non_numeric a copy and not a view
    df_non_numeric=df.select_dtypes(exclude='number').copy()
    
    # Subtract mean (calculated using the transform function which preserves the 
    # number of rows) for each class  to create distance to mean
    df_dist_to_mean =  df_numeric[['Age', 'weight']] - df_numeric[['Age', 'weight', 'class']].groupby('class').transform('mean')
    # Finally calculate the euclidean distance (hypotenuse)
    df_non_numeric['euc_dist'] = np.hypot(df_dist_to_mean['Age'], df_dist_to_mean['weight'])
    df_non_numeric['class'] = df_numeric['class']
    # If you want a separate dataframe named 'final' with the end result
    df_final = df_non_numeric.copy()
    print(df_final)
    

    It is probably possible to write this more compactly, but this way you can see what's going on.
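
The snippet above measures each row against the mean of its own class only. The question asks for the distance from every class mean, so here is a minimal sketch of that variant using NumPy broadcasting. The names feature_cols, diffs, dist_df and the dist_to_class_* columns are made up for illustration and do not come from the answer.

```python
import numpy as np
import pandas as pd

data = [['Alex', 10, 5, 0], ['Bob', 12, 4, 1], ['Clarke', 13, 6, 0], ['brke', 15, 1, 0]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'weight', 'class'])

feature_cols = ['Age', 'weight']  # hypothetical name: the columns that enter the distance
means = df.groupby('class')[feature_cols].mean()  # one mean vector per class

# (n_rows, 1, n_features) minus (1, n_classes, n_features)
# broadcasts to (n_rows, n_classes, n_features)
diffs = df[feature_cols].to_numpy(dtype=float)[:, None, :] - means.to_numpy()[None, :, :]
dists = np.sqrt((diffs ** 2).sum(axis=2))  # Euclidean distance, shape (n_rows, n_classes)

# Reuse the original row index so the new columns line up with df
dist_cols = [f'dist_to_class_{c}' for c in means.index]
dist_df = pd.DataFrame(dists, index=df.index, columns=dist_cols)

final = pd.concat([df, dist_df], axis=1)
print(final)
```

Because dist_df is built with df.index, the concat keeps every row next to its own name and class. Dropping the np.sqrt call gives the squared distances that steps 6 and 7 of the question describe.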

  • answered 2019-03-13 19:57 Colton Neary

    I'm sure there is a better way to do this, but I iterated through the data class by class and followed the exact steps.

    1. Assigned the 'class' as the index.
    2. Rotated so that the 'class' was in the columns.
    3. Subtracted the class means from the matching df_numeric values.
    4. Squared the values.
    5. Summed the rows.
    6. Concatenated the dataframes back together.

      import numpy as np
      import pandas as pd

      data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
      df = pd.DataFrame(data,columns=['Name','Age','weight','class'])
      df = df.astype({'Age': float, 'weight': float, 'class': float})

      df_numeric=df.select_dtypes(include='number')
      df_non_numeric=df.select_dtypes(exclude='number')

      # One column of means per class
      means=df_numeric.groupby('class').mean().T

      # Changed index to the class (set_index returns a new frame, df is left alone)
      df_numeric = df_numeric.set_index('class')

      # Rotated the numeric data sideways so the class was in the columns
      df_numeric = df_numeric.T

      # Iterated through the classes in means and picked the matching df_numeric columns
      store = [] # Assigned an empty list
      for j in means:
          sto = df_numeric[j]
          if isinstance(sto, pd.Series): # A single matching column comes out as a pd.Series
              sto = sto.to_frame() # Need to convert to dataframe type
          store.append(sto.sub(means[j], axis=0)) # Subtract the mean of class j, not the label j

      # Squaring the values and summing over the features, one Series per class
      summed = []
      for i in store:
          summed.append((i**2).sum(axis=0))

      # Squared distance of every row to its own class mean, indexed by class
      # (rows come out grouped by class, not in the original order)
      df_new = pd.concat(summed)
      print(df_new)
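
Step 6 above and the last two statements in the question both depend on how pandas lines rows up. A minimal sketch, assuming the usual behaviour that pd.concat with axis=1 and column assignment from a Series both match rows by index label rather than by position. The frames left and right and the Series labels are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({'Name': ['Alex', 'Bob', 'Clarke', 'brke']})
# Same index labels as left, but deliberately stored in a different row order
right = pd.DataFrame({'Age': [15.0, 13.0, 12.0, 10.0]}, index=[3, 2, 1, 0])
labels = pd.Series([0.0, 0.0, 1.0, 0.0], index=[3, 2, 1, 0])

# concat with axis=1 matches rows on index labels, not on position
final = pd.concat([left, right], axis=1)
# assigning a Series as a column is aligned on the index as well
final['class'] = labels

print(final)
```

So nothing is shuffled as long as the frames share the original index. Once the index is replaced by something that repeats, such as the class values, the match is no longer one-to-one and the join may fail or pair rows unexpectedly; keeping the original index until the final join avoids that.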