t-values and p-values seem wrong?
I have a DataFrame built from the trip record data downloaded from http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml. My dataset covers January 2018. I keep these columns: trip_distance, fare_amount, PULocationID, and the pickup and dropoff timestamps (tpep_pickup_datetime, tpep_dropoff_datetime).
The goal is to calculate 'price_per_mile', take the mean of these values for each borough, and then apply a t-test to see whether the difference between each pair of boroughs is significant. The problem is that I end up with tvalue=0 and pvalue=1 for all the pairs (with just one exception). What should I recheck or change? 'taxi_zone_lookup.csv' is available from the same address: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
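To make the intended comparison concrete, here is a minimal sketch on made-up numbers (not the taxi data). The helper `pooled_t` and the `samples` dictionary are hypothetical names used only for illustration. The helper computes the pooled-variance Student's t statistic, which is what `scipy.stats.ttest_ind` computes with its default `equal_var=True`:

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Student's t statistic for two independent samples (pooled variance)."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n1 + 1 / n2))


# Made-up price_per_mile samples, purely for illustration.
samples = {
    "Bronx": [4.0, 5.0, 6.0],
    "Brooklyn": [5.0, 6.0, 7.0],
    "Queens": [4.5, 5.5, 6.5],
}

# Each unordered pair is compared once; a borough is never compared with itself.
t_values = {
    (x, y): pooled_t(samples[x], samples[y]) for x, y in combinations(samples, 2)
}
for pair, t in t_values.items():
    print(pair, round(t, 3))
```

Comparing a sample with itself gives t=0 (and p=1), so those values are expected wherever a borough is paired with itself. They are only suspicious for pairs of different boroughs.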
This is my code:
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv('yellow_tripdata_201801.csv',
                 usecols=['tpep_pickup_datetime', 'tpep_dropoff_datetime',
                          'trip_distance', 'PULocationID', 'fare_amount'])
#Data cleaning
df.drop(df[df['trip_distance']>3].index, inplace=True)
df.drop(df[df['trip_distance']<0.5].index, inplace=True)
df.drop(df[df['fare_amount']>10].index, inplace=True)
df.drop(df[df['fare_amount']<1].index, inplace=True)
df['trip_distance']=df['trip_distance'].astype(np.float16)
df['PULocationID']=df['PULocationID'].astype(np.uint16)
df['fare_amount']=df['fare_amount'].astype(np.float16)
df['price_per_mile'] = df['fare_amount']/df['trip_distance']
borough = pd.read_csv(r'taxi_zone_lookup.csv', usecols=['LocationID', 'Borough'])
result = pd.merge(df,
                  borough,
                  left_on='PULocationID',
                  right_on='LocationID',
                  how='inner')
result.drop(result[(result.Borough == 'EWR') | (result.Borough == 'Unknown')].index, inplace=True)
df['price_per_mile'].describe()
# Here I get mean=NaN. Why?
# t-test
# Create a DataFrame with a two-level index
boroughs = ['Bronx', 'Brooklyn', 'Manhattan', 'Staten Island', 'Queens']
iterables = [boroughs, ['tvalue', 'pvalue', 'H0 hypothesis']]
my_index = pd.MultiIndex.from_product(iterables)
dt = pd.DataFrame(index=my_index, columns=boroughs)
for i in boroughs:
    a = result.loc[result.Borough == i]["price_per_mile"]
    for j in boroughs:
        b = result.loc[result.Borough == j]["price_per_mile"]
        t2, p2 = stats.ttest_ind(a, b)
        dt.loc[(i, "tvalue"), j] = t2
        dt.loc[(i, "pvalue"), j] = p2
        if p2 > 0.05:
            dt.loc[(i, "H0 hypothesis"), j] = 'Fail to Reject H0'
        else:
            dt.loc[(i, "H0 hypothesis"), j] = 'Reject H0'
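
One thing that might be worth checking is the float16 cast, though I have not confirmed that it is the cause. Below is a minimal sketch on synthetic data (not the taxi data; `values16`, `total16`, and `mean64` are illustrative names). It shows how an accumulation carried out in float16 can overflow, which could plausibly lead to a NaN mean and distorted t-test results:

```python
import numpy as np

# Synthetic stand-in for price_per_mile: 100,000 identical values of 5.0.
values16 = np.full(100_000, 5.0, dtype=np.float16)

# float16 cannot represent values above 65504, so a sum that is kept
# in float16 overflows to infinity (NumPy may emit an overflow warning).
total16 = values16.sum(dtype=np.float16)
print(np.isinf(total16))

# Upcasting to float64 before aggregating keeps the sum and mean finite.
values64 = values16.astype(np.float64)
mean64 = values64.mean()
print(mean64)
```

If this applies to the real data, one option would be to keep 'fare_amount', 'trip_distance' and 'price_per_mile' as float32 or float64 and rerun describe() and the t-tests.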
