The one-way ANOVA function I'm using keeps spitting out F values that don't make sense

I'm working on a project for college and it's kicking my ass.

I downloaded a data file from https://www.kaggle.com/datasets/majunbajun/himalayan-climbing-expeditions

I'm trying to use an ANOVA to see if there's a statistically significant difference in time taken to summit between the seasons.

The F value I'm getting back doesn't seem to make any sense. Any suggestions?

#import pandas
import pandas as pd

#import expeditions as csv file
exp = pd.read_csv('C:\\filepath\\expeditions.csv')

#extract only the data relating to everest
exp= exp[exp['peak_name'] == 'Everest']

#create a subset of the data containing only the columns of interest
exp_peaks = exp[['peak_name', 'member_deaths', 'termination_reason', 'hired_staff_deaths', 'year', 'season', 'basecamp_date', 'highpoint_date']]

#extract successful attempts
exp_peaks = exp_peaks[(exp_peaks['termination_reason'] == 'Success (main peak)')]

#drop missing values from basecamp_date & highpoint_date
exp_peaks = exp_peaks.dropna(subset=['basecamp_date', 'highpoint_date'])

#convert basecamp date to datetime
exp_peaks['basecamp_date'] = pd.to_datetime(exp_peaks['basecamp_date'])
#convert highpoint date to datetime
exp_peaks['highpoint_date'] = pd.to_datetime(exp_peaks['highpoint_date'])

#calculate the time taken between reaching base camp and reaching the high point
exp_peaks['time_taken'] = exp_peaks['highpoint_date'] - exp_peaks['basecamp_date']

#convert seasons from strings to ints
exp_peaks['season'] = exp_peaks['season'].replace('Spring', 1)
exp_peaks['season'] = exp_peaks['season'].replace('Autumn', 3)
exp_peaks['season'] = exp_peaks['season'].replace('Winter', 4)
#remove summer and unknown
exp_peaks = exp_peaks[(exp_peaks['season'] != 'Summer')]
exp_peaks = exp_peaks[(exp_peaks['season'] != 'Unknown')]

#subset the data according to the season
exp_peaks_spring = exp_peaks[exp_peaks['season'] == 1]
exp_peaks_autumn = exp_peaks[exp_peaks['season'] == 3]
exp_peaks_winter = exp_peaks[exp_peaks['season'] == 4]

#calculate the average time taken in spring
exp_peaks_spring_duration = exp_peaks_spring['time_taken']
mean_exp_peaks_spring_duration = exp_peaks_spring_duration.mean()

#calculate the average time taken in autumn
exp_peaks_autumn_duration = exp_peaks_autumn['time_taken']
mean_exp_peaks_autumn_duration = exp_peaks_autumn_duration.mean()

#calculate the average time taken in winter
exp_peaks_winter_duration = exp_peaks_winter['time_taken']
mean_exp_peaks_winter_duration = exp_peaks_winter_duration.mean()

# Turn the season column into a categorical
exp_peaks['season'] = exp_peaks['season'].astype('category')
exp_peaks['season'].dtypes


from scipy.stats import f_oneway

# One-way ANOVA
f_value, p_value = f_oneway(exp_peaks['season'], exp_peaks['time_taken'])
print("F-score: " + str(f_value))
print("p value: " + str(p_value))

1 answer

  • answered 2022-05-04 11:22 Stuart

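    f_oneway expects each group's sample of the continuous variable as a separate argument; it does not take a categorical variable plus a measurement column. In your call, the season codes and the times taken are treated as two samples to compare, which is why the F value looks meaningless. You can split the data by season using groupby.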

    f_oneway(*(group for _, group in exp_peaks.groupby("season")["time_taken"]))
    

    Or equivalently, since you have already created series for each season:

    f_oneway(exp_peaks_spring_duration, exp_peaks_autumn_duration, exp_peaks_winter_duration)
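
One more thing worth checking: time_taken is a timedelta column, and it may be safer to hand f_oneway plain numbers. Here is a minimal sketch on made-up data (the toy frame and its days column are my own illustration, not part of your dataset) that converts the durations to whole days and then passes one sample per season:

```python
import pandas as pd
from scipy.stats import f_oneway

# Made-up stand-in for exp_peaks: three seasons, four expeditions each.
toy = pd.DataFrame({
    "season": ["Spring"] * 4 + ["Autumn"] * 4 + ["Winter"] * 4,
    "time_taken": pd.to_timedelta(
        [40, 44, 42, 46, 30, 33, 31, 34, 50, 55, 52, 57], unit="D"
    ),
})

# Convert the timedelta column to an integer number of days.
toy["days"] = toy["time_taken"].dt.days

# Build one Series of durations per season and pass them as separate arguments.
groups = [group for _, group in toy.groupby("season")["days"]]
f_value, p_value = f_oneway(*groups)

print("F-score: " + str(f_value))
print("p value: " + str(p_value))
```

With your data the same idea would be exp_peaks['time_taken'].dt.days before the groupby.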
    

    I would have thought there would be an easier way to perform an ANOVA in this common case but can't find it.
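
If you want to sanity-check whatever F value comes back, the statistic is the between-group mean square divided by the within-group mean square, so it can be recomputed by hand. A rough sketch with invented numbers (the three arrays are placeholders, not values from the expeditions file):

```python
import numpy as np
from scipy.stats import f_oneway

# Placeholder samples, one array per group (e.g. days taken per season).
groups = [
    np.array([40.0, 44.0, 42.0, 46.0]),
    np.array([30.0, 33.0, 31.0, 34.0]),
    np.array([50.0, 55.0, 52.0, 57.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)       # number of groups
n = all_values.size   # total number of observations

# Between-group sum of squares: spread of the group means around the grand mean.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of each observation around its own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = (between-group mean square) / (within-group mean square)
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

f_scipy, p_value = f_oneway(*groups)
print(f_manual, f_scipy)
```

The two numbers should agree when each group is passed as its own argument.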
