Finding Euclidean distance from multiple mean vectors
This is what I am trying to do. I was able to do steps 1 to 4 and need help with steps 5 onward.
Basically, for each data point I would like to find the Euclidean distance from all mean vectors, based on column y.
 1. take data
 2. separate out non numerical columns
 3. find mean vectors by y column
 4. save means
 5. subtract each mean vector from each row based upon y value
 6. square each column
 7. add all columns
 8. join back to numerical dataset and then join non numerical columns
import pandas as pd
data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data, columns=['Name','Age','weight','class'])
# cast the numeric columns to float; 'Name' stays a string column
df = df.astype({'Age': float, 'weight': float, 'class': float})
print(df)
df_numeric = df.select_dtypes(include='number')
df_non_numeric = df.select_dtypes(exclude='number')
# one mean vector per class, indexed by class
means = df_numeric.groupby('class').mean()
For each row of means, subtract that row from each row of df_numeric. Then take the square of each column in the output and, for each row, add all columns. Then join this data back to df_numeric and df_non_numeric.
Update 1
I added the code below. My questions have changed; the updated questions are at the end.
import numpy as np

def calculate_distance(row):
    # squared distance from the mean vector of the first class
    return np.sum(np.square(row[means.columns] - means.iloc[0]))

def calculate_distance2(row):
    # squared distance from the mean vector of the last class
    return np.sum(np.square(row[means.columns] - means.iloc[-1]))

df_numeric2 = df_numeric.drop("class", axis=1)
#np.sum(np.square(df_numeric2.head(1) - means.head(1)), 1)
df_numeric2['distance0'] = df_numeric.apply(calculate_distance, axis=1)
df_numeric2['distance1'] = df_numeric.apply(calculate_distance2, axis=1)
print(df_numeric2)
final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"] = df["class"]
Could anyone confirm that this is a correct way to achieve the results? I am mainly concerned about the last two statements. Would the second-to-last statement do a correct join? Would the final statement assign the original class? I would like to confirm that Python won't do the concat and class assignment in a random order, and that it maintains the order in which rows appear.
final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"]=df["class"]
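On the ordering concern, a small sanity check may help. pandas aligns both pd.concat(..., axis=1) and column assignment from a Series on the index labels rather than on row position, so rows stay matched as long as every frame keeps the original index. The frames left and right and the Series labels below are made up purely for illustration.

```python
import pandas as pd

# Hypothetical frames that share index labels but are stored in a different row order
left = pd.DataFrame({"Name": ["Alex", "Bob", "Clarke"]}, index=[0, 1, 2])
right = pd.DataFrame({"distance0": [30.0, 10.0, 20.0]}, index=[2, 0, 1])

# concat with axis=1 matches rows by index label, not by position
joined = pd.concat([left, right], axis=1)

# Column assignment from a Series is aligned on the index in the same way
labels = pd.Series([1, 0, 0], index=[1, 2, 0])
joined["class"] = labels

print(joined)
```

If one of the frames had been re-indexed (for example with reset_index after a filter), the labels would no longer match and the join would introduce NaN rows, so this only holds while the index is left untouched.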
2 answers

I think this is what you want
import pandas as pd
import numpy as np

data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data, columns=['Name','Age','weight','class'])
# Cast the numeric columns to float; 'Name' stays a string column
df = df.astype({'Age': float, 'weight': float, 'class': float})
print(df)

df_numeric = df.select_dtypes(include='number')
# Make df_non_numeric a copy and not a view
df_non_numeric = df.select_dtypes(exclude='number').copy()

# Subtract mean (calculated using the transform function which preserves the
# number of rows) for each class to create distance to mean
df_dist_to_mean = df_numeric[['Age', 'weight']] - df_numeric[['Age', 'weight', 'class']].groupby('class').transform('mean')

# Finally calculate the euclidean distance (hypotenuse)
df_non_numeric['euc_dist'] = np.hypot(df_dist_to_mean['Age'], df_dist_to_mean['weight'])
df_non_numeric['class'] = df_numeric['class']

# If you want a separate dataframe named 'final' with the end result
df_final = df_non_numeric.copy()
print(df_final)
It is probably possible to write this even more densely, but this way you'll see what's going on.
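As a rough sketch of what a denser version could look like: the variant below is not taken from the code above, and the names features and deviation are introduced here only for illustration. It assumes every numeric column except class is a feature, and it replaces np.hypot (which takes exactly two inputs) with a square root of summed squares so that it works for any number of feature columns.

```python
import numpy as np
import pandas as pd

data = [['Alex', 10, 5, 0], ['Bob', 12, 4, 1], ['Clarke', 13, 6, 0], ['brke', 15, 1, 0]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'weight', 'class'])

# Feature columns: every numeric column except the class label (an assumption)
features = df.select_dtypes(include='number').columns.drop('class').tolist()

# Deviation of each row from the mean vector of its own class
deviation = df[features] - df.groupby('class')[features].transform('mean')

# Euclidean norm across the feature columns
df['euc_dist'] = np.sqrt((deviation ** 2).sum(axis=1))
print(df)
```

Like the longer version, this only measures the distance to the mean of the row's own class, not to the means of the other classes.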

I'm sure there is a better way to do this, but I iterated through the data by class and followed the exact steps.
 Assigned the 'class' as the index.
 Rotated so that the 'class' was in the columns.
 Subtracted the class means from the corresponding values in df_numeric.
 Squared the values.
 Summed the rows.
 Concatenated the dataframes back together.
import numpy as np
import pandas as pd

data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data, columns=['Name','Age','weight','class'])
df = df.astype({'Age': float, 'weight': float, 'class': float})
#print(df)
df_numeric = df.select_dtypes(include='number')
df_non_numeric = df.select_dtypes(exclude='number')
# One column of means per class
means = df_numeric.groupby('class').mean().T

# Changed index
df_numeric.index = df_numeric['class']
df_numeric = df_numeric.drop('class', axis=1)
# Rotated the numeric data sideways so the class was in the columns
df_numeric = df_numeric.T

# Iterated through the classes in means and picked the matching df_numeric columns
summed = []
for j in means:
    sto = df_numeric[j]
    if isinstance(sto, pd.Series):  # a single matching column comes out as a Series
        sto = sto.to_frame()        # need to convert to DataFrame type
    diff = sto.sub(means[j], axis=0)        # subtract the mean vector of class j
    summed.append((diff ** 2).sum(axis=0))  # square the values, then sum over the features

# One squared distance per data point, labelled by class
df_new = pd.concat(summed)
print(df_new)
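For the distance from every data point to every class mean (steps 5 to 8 of the question), a loop-free alternative is NumPy broadcasting. This is only a sketch under stated assumptions: the feature list is hard-coded to Age and weight, the names diff, sq_dist, dist_cols and result are made up here, and the output columns follow the question's distance0/distance1 naming. As in the listed steps, no square root is taken, so the values are squared distances.

```python
import numpy as np
import pandas as pd

data = [['Alex', 10, 5, 0], ['Bob', 12, 4, 1], ['Clarke', 13, 6, 0], ['brke', 15, 1, 0]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'weight', 'class'])

features = ['Age', 'weight']                  # assumed feature columns
means = df.groupby('class')[features].mean()  # one mean vector per class

# (n_rows, 1, n_features) minus (1, n_classes, n_features)
# broadcasts to (n_rows, n_classes, n_features)
diff = df[features].to_numpy()[:, None, :] - means.to_numpy()[None, :, :]

# Square and sum over the feature axis: one squared distance per (row, class) pair
sq_dist = (diff ** 2).sum(axis=2)

# Attach the distances to the original frame; the shared index keeps rows matched
dist_cols = [f'distance{c}' for c in means.index]
result = df.join(pd.DataFrame(sq_dist, index=df.index, columns=dist_cols))
print(result)
```

Applying np.sqrt to sq_dist before the join would turn these into Euclidean distances.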