t-value and p-value seem wrong?
I have a dataframe downloaded from http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml. My dataset is the trip data for January 2018. I keep these columns: trip_distance, fare_amount, pickup_time and dropoff_time.
The goal is to calculate 'price_per_mile', take the mean of these values for each borough, and then apply a t-test to see whether the differences between each pair of boroughs are significant. The problem is that at the end I get t-values of 0 and p-values of 1 for all the pairs (with just one exception). I don't understand what I need to recheck or change. You can also get 'taxi_zone_lookup.csv' from the same address: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
this is my code:
import pandas as pd
import numpy as np
from scipy import stats

df = pd.read_csv('yellow_tripdata_2018-01.csv',
                 usecols=['tpep_pickup_datetime', 'tpep_dropoff_datetime',
                          'trip_distance', 'PULocationID', 'fare_amount'])

# Data cleaning
df.drop(df[df['trip_distance'] > 3].index, inplace=True)
df.drop(df[df['trip_distance'] < 0.5].index, inplace=True)
df.drop(df[df['fare_amount'] > 10].index, inplace=True)
df.drop(df[df['fare_amount'] < 1].index, inplace=True)

df['trip_distance'] = df['trip_distance'].astype(np.float16)
df['PULocationID'] = df['PULocationID'].astype(np.uint16)
df['fare_amount'] = df['fare_amount'].astype(np.float16)

df['price_per_mile'] = df['fare_amount'] / df['trip_distance']

borough = pd.read_csv(r'taxi_zone_lookup.csv', usecols=['LocationID', 'Borough'])
result = pd.merge(df, borough,
                  left_on='PULocationID',
                  right_on='LocationID',
                  how='inner')
result.drop(result[(result.Borough == 'EWR') | (result.Borough == 'Unknown')].index,
            inplace=True)

df['price_per_mile'].describe()
# here I get mean=NaN???
# t-test
# Creating a dataframe with two levels of index
boroughs = ['Bronx', 'Brooklyn', 'Manhattan', 'Staten Island', 'Queens']
iterables = [boroughs, ['tvalue', 'pvalue', 'H0 hypothesis']]
my_index = pd.MultiIndex.from_product(iterables)
dt = pd.DataFrame(index=my_index, columns=boroughs)

for i in boroughs:
    a = result.loc[result.Borough == i]['price_per_mile']
    for j in boroughs:
        b = result.loc[result.Borough == j]['price_per_mile']
        t2, p2 = stats.ttest_ind(a, b)
        dt.loc[(i, 'tvalue'), j] = t2
        dt.loc[(i, 'pvalue'), j] = p2
        if p2 > 0.05:
            dt.loc[(i, 'H0 hypothesis'), j] = 'Fail to Reject H0'
        else:
            dt.loc[(i, 'H0 hypothesis'), j] = 'Reject H0'
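One thing worth checking before the t-tests is the np.float16 downcast: float16 tops out around 65504, so summing a month's worth of trips overflows. A minimal sketch with synthetic data (not the taxi set) showing how the overflow can poison describe()-style statistics:

```python
import numpy as np

# float16 cannot represent values above ~65504, so summing a long
# column kept in float16 overflows to inf; downstream statistics
# (mean, std, describe) then come out as inf or NaN.
vals = np.ones(70000, dtype=np.float16)   # 70000 ones > float16 max
total = vals.sum()                        # accumulator stays float16

print(total)  # inf
```

Keeping price_per_mile in float64 (or casting it back before describe() and ttest_ind) is one way around this. Note also that the diagonal cells of the loop compare a borough with itself, which always yields t=0, p=1.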
See also questions close to this topic

How to display the contents of a text file one line at a time via a timer, using Python on Windows?
This is the code:
def wndProc(hWnd, message, wParam, lParam):
    if message == win32con.WM_PAINT:
        hdc, paintStruct = win32gui.BeginPaint(hWnd)
        dpiScale = win32ui.GetDeviceCaps(hdc, win32con.LOGPIXELSX) / 60.0
        fontSize = 36
        # http://msdn.microsoft.com/en-us/library/windows/desktop/dd145037(v=vs.85).aspx
        lf = win32gui.LOGFONT()
        lf.lfFaceName = "Times New Roman"
        lf.lfHeight = int(round(dpiScale * fontSize))
        # lf.lfWeight = 150
        # Use non-antialiased to remove the white edges around the text.
        # lf.lfQuality = win32con.NONANTIALIASED_QUALITY
        hf = win32gui.CreateFontIndirect(lf)
        win32gui.SelectObject(hdc, hf)
        rect = win32gui.GetClientRect(hWnd)
        # http://msdn.microsoft.com/en-us/library/windows/desktop/dd162498(v=vs.85).aspx
        win32gui.DrawText(
            hdc,
            'Glory be to the Father, and to the son and to the Holy Spirit.',
            -1,
            rect,
            win32con.DT_CENTER | win32con.DT_NOCLIP | win32con.DT_VCENTER)
        win32gui.EndPaint(hWnd, paintStruct)
        return 0
Where it says the "Glory be to the Father..." prayer, I would like that string to actually display a few different prayers on a timer. What I mean is: I want to save short prayers to a text file and have the line that says "Glory be..." change to a new prayer every 60 seconds, cycling through a few prayers such as the Serenity Prayer, etc.
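Independent of the Win32 plumbing, the rotation itself can be sketched as below (the prayer strings, the file layout, and the 60-second hookup are assumptions; in the window procedure you would handle a WM_TIMER message, advance the iterator, store the current string, and invalidate the window so WM_PAINT redraws it):

```python
import itertools

# One prayer per line in a text file would work the same way, e.g.:
# prayers = pathlib.Path('prayers.txt').read_text().splitlines()
prayers = [
    "Glory be to the Father, and to the son and to the Holy Spirit.",
    "God, grant me the serenity to accept the things I cannot change.",
]
rotation = itertools.cycle(prayers)   # endless round-robin over the lines

current = next(rotation)  # call this from a 60-second timer callback
print(current)
```

DrawText would then receive `current` instead of the hard-coded string on each repaint.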

How to plot the frequency of my data per day in a histogram?
I want to plot the number of occurrences of my data per day. y represents the ids of my data; x represents the timestamps, which I convert to a time and day. But I can't make the correct plot.
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import time

y = ['5914cce8fad645d1bec2e59e62823617', '1c2067e051734a1d8a75b18267ee4598',
     'db6830fffa9c4aa5b71ef6da9333f357', '672cc9d5360e4451bb7c03e3d0bd8f0d',
     'fb0f8122fffc47fea87ab2b749df173b', '558e96ca022240c7acc0e444f7663f53',
     'c3f86fd5eac348d3a44cb325f30b6139', '21dd849f895f4cf5a16845a4c1a9fbf9',
     'e3b4cd56e291467193b6d2226ee82ae7', '01346c48a8c443d1ac021efa33ca0f4e',
     '23b78b0f85be4ca799f41a5add76c12e', 'b1c036c00c2b4170a1708fd0add0dec2',
     '74737546e9c34126bcb24d34503421ca', '342991f5ec874c9d83eb9908f3e221aa',
     '4fdcd83aeb684e26b79b753c5e022a4e', 'b7fbeca9941643c49e909e71acc1eaba',
     '27c9d358a3ef4c69ba89eac16d8d3bdb', 'ef982c4ba11548a1aef12f672d7f1f00',
     'efedede29bb44c5298b18b03070df3fd', 'eb03ae1b4cde409c8d342a16a8be30d2']
x = ['1548143296750', '1548183033872', '1548346185194', '1548443373507',
     '1548446119319', '1548446239441', '1548446068267', '1548445962159',
     '1548446011209', '1548446259465', '1548446180380', '1548239985290',
     '1548240060367', '1548240045347', '1547627568993', '1548755333313',
     '1548673604016', '1548673443843', '1548673503914', '1548673563975']

date = []
for i in x:
    print(i)
    print()
    i = i[:10]
    print(i)
    readable = time.ctime(int(i))
    readable = readable[:10]
    date.append(readable)
print(date)

plt.hist(date, y)
plt.show()

mysql.connector.errors.ProgrammingError: Error in SQL Syntax
I'm using the Python MySQL connector to add data to a table by updating a row. A user enters a serial number, and then the row with that serial number is updated. I keep getting a SQL syntax error and I can't figure out what it is.
query = ("UPDATE `items` SET salesInfo = %s, shippingDate = %s, warrantyExpiration = %s, item = %s, WHERE serialNum = %s")
cursor.execute(query, (info, shipDate, warranty, name, sn, ))
conn.commit()
Error:
mysql.connector.errors.ProgrammingError: 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'WHERE serialNum = '1B0000021A974726'' at line 1
"1B0000021A974726" is a serial number inputted by the user and it is already present in the table.
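The parser is choking on the comma right before WHERE: after a comma, SQL expects another column = value pair. A sketch of the corrected statement (the execute call is left commented out because it needs a live connection):

```python
# Note: no comma after the final "item = %s".
query = ("UPDATE `items` "
         "SET salesInfo = %s, shippingDate = %s, "
         "warrantyExpiration = %s, item = %s "
         "WHERE serialNum = %s")

# cursor.execute(query, (info, shipDate, warranty, name, sn))
# conn.commit()
print(", WHERE" in query)  # False
```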

How to merge pandas dataframe into existing reportlab table?
example_df = [[1,2,3,4,5],[1,2,3,4,5],[1,2,3,4,5]]
I want to integrate the example_df pandas data frame into an existing Reportlab table, where the number of rows changes (it could be 3 as shown in the example, or it could be 20):
rlab_table([['Mean','Max','Min','TestA','TestB'], ['','','','',''], ['','','','',''], ['','','','','']])
I have tried:
np.array(example_df).tolist()
but I get this error (AttributeError: 'int' object has no attribute 'wrapOn')
I am able to manually add each row into the report lab table by doing:
rlab_table([['Mean','Max','Min','TestA','TestB'],
            np.array(example_df).tolist()[0],
            np.array(example_df).tolist()[1],
            np.array(example_df).tolist()[2]])
However, the issue is that the number of rows in the dataframe is constantly changing, so I am seeking a solution along the lines of:
rlab_table([['Mean','Max','Min','TestA','TestB']] + np.array(example_df).tolist()[0:X])  # where X is the number of rows in the dataframe
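Since rlab_table is the question's own helper, only the data construction is sketched here: building the whole table as one list of lists lets the row count vary freely.

```python
# Hypothetical stand-ins for the question's variables.
header = ['Mean', 'Max', 'Min', 'TestA', 'TestB']
example_df = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]

# One list of lists regardless of how many rows example_df has;
# for a real DataFrame this would be [header] + df.values.tolist().
data = [header] + [list(row) for row in example_df]

print(len(data))  # header row + 3 data rows -> 4
```

reportlab's own platypus Table accepts exactly this list-of-lists shape (Table(data)), so the same data list should slot into the existing table construction.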

bins - Categorize column values using bins for ages
I have a .CSV file a snippet of which looks like this:
ID,SN, Age,Gender,Item ID,Item Name, Price
0,Lisim78, 20, Male, 108, Extraction Quickblade, 3.53
1,Lisovynya38, 40, Male, 143, Frenzied Scimitar, 1.56
2,Ithergue48, 24, Male, 92, Final Critic, 4.88
3,Chamassasya86, 24, Male, 100, Blindscythe, 3.27
4,Iskosia90, 23, Male, 131, Fury, 1.44
5,Yalae81, 22, Male, 81, Dreamkiss, 3.61
6,Itheria73, 36, Male, 169, Interrogator, 2.18
7,Iskjaskst81, 20, Male, 162, Abyssal Shard, 2.67
8,Undjask33, 22, Male, 21, Souleater, 1.1
9,Chanosian48, 35, Other, 136, Ghastly Adamantite, 3.58
10,Inguron55, 23, Male, 95, Singed Onyx Warscythe, 4.74
I need to establish bins for the 'Age' column which I have done like so:
bins = [0, 10, 15, 20, 25, 30, 35, 40, 45]
names = ['<10', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39', '40+']
df_bins = pd.cut(df['Age'], bins, labels=names)
How do I use the bins to categorize other columns like the 'SN' column? I want to be able to get a count of all players in the 'SN' column who are <10, 10-14, 15-19 years old... and so on.
Any help is greatly appreciated!
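A sketch of one way to do the counting, with a tiny made-up frame in the shape of the CSV: attach the binned ages as a new column, then group on it. Note that pd.cut's bins are right-closed by default, so an age of exactly 20 lands in the '15-19' bucket.

```python
import pandas as pd

# Tiny sample in the shape of the question's CSV.
df = pd.DataFrame({'SN': ['Lisim78', 'Lisovynya38', 'Ithergue48', 'Iskosia90'],
                   'Age': [20, 40, 24, 23]})

bins = [0, 10, 15, 20, 25, 30, 35, 40, 45]
names = ['<10', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39', '40+']
df['AgeGroup'] = pd.cut(df['Age'], bins, labels=names)

# Players per age group: group on the binned column and count SN.
counts = df.groupby('AgeGroup', observed=False)['SN'].count()
print(counts['20-24'])  # 2  (ages 24 and 23)
```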

Recommendations for post-hoc transformations after normalisation, before generating PLS-DA and VIP plots
Can anyone here recommend any post-hoc normalisation steps, following DESeq2 normalisation, before generating PLS-DA and VIP plots?
I presume it's not appropriate to just run this code and use the resulting counts to generate a PLS-DA?
dds <- estimateSizeFactors(dds)
Normalisedcounts <- counts(dds, normalized=TRUE)
write.csv(Normalisedcounts, file="27_Norm.csv")
Apologies if this seems like an uneducated question, but I'm new to both 'big data' statistics and the DESeq2 package.

Is this correctly normalised?
The database will also be created with front-end access; I'm just wondering whether everything in the requirements can be accessed with this normalised ER diagram. Any help is appreciated, thanks.

SQL - Normalizing a table containing multiple choices
I'm trying to create a database that includes a table which will hold the answers to a multiple-choice quiz. My problem is: is it normal to create a column for every question?
I mean, if I have 100 questions, do I create a column for every one of them?
What is the best approach for this?
Thank you

Optimizing data cleaning with multiple conditions
I have a big dataframe with about 25000 reviews and I am trying to clean the data. The cleaning process is taking a long time and I am trying to optimize it. I have multiple statements checking for different conditions, but that means the code goes through the dataframe multiple times, which is probably why it is so slow. Here are my statements for cleaning the data:
data = data.str.replace('#EOF', '')
data = data.str.replace('<br />', '')
data = data.apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
data = data.apply(lambda x: x.lower())
data = data.str.replace('[^\w\w]', ' ')
data = data.apply(sentence_stem)
data = data.apply(lambda x: ' '.join([word for word in x.split() if len(word) > 1]))
I want to reduce these statements as much as I can, but I'm not sure how, since there are multiple conditions. I'm new to machine learning and Python, so the code is somewhat messy.
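One sketch of a single-pass version: fold all the per-row string work into one function so .apply walks the column once. Here stop and sentence_stem are stand-ins for the question's own stop list and stemmer, and [^\w\s] is used where the original had [^\w\w] (which repeats \w, so it only strips non-word characters).

```python
import re

stop = {'the', 'a', 'is'}        # stand-in stopword list

def sentence_stem(s):            # stand-in for the real stemmer
    return s

def clean(text):
    # All steps from the question, applied in one pass per string.
    text = text.replace('#EOF', '').replace('<br />', '')
    text = re.sub(r'[^\w\s]', ' ', text.lower())
    words = [w for w in text.split() if w not in stop and len(w) > 1]
    return sentence_stem(' '.join(words))

print(clean('The movie is GREAT!!! #EOF'))  # -> 'movie great'
# On the real frame: data = data.apply(clean)
```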

Pandas Filtering - Forcing Values to Zero
I have a Pandas DataFrame that looks like this:
    Date      Channel              Sessions
0   1/1/2018  Branded Paid Search  1057
1   1/1/2018  Direct Traffic       4039
2   1/1/2018  Display              474
3   1/1/2018  Email                801
4   1/1/2018  Generic Paid Search  195
5   1/1/2018  Organic Search       6617
6   1/1/2018  Referral             563
7   1/1/2018  Social               7752
8   1/2/2018  Branded Paid Search  2172
9   1/2/2018  Direct Traffic       10444
10  1/2/2018  Display              613
11  1/2/2018  Email                1674
12  1/2/2018  Generic Paid Search  291
13  1/2/2018  Organic Search       14752
14  1/2/2018  Referral             1412
15  1/2/2018  Social               7858
16  1/3/2018  Branded Paid Search  2150
17  1/3/2018  Direct Traffic       9883
18  1/3/2018  Display              1201
19  1/3/2018  Email                2575
20  1/3/2018  Generic Paid Search  284
21  1/3/2018  Organic Search       15424
22  1/3/2018  Referral             2122
23  1/3/2018  Social               8513
The actual df is much bigger.
What I would like to do is force values within a certain date range, for a certain channel, to zero; i.e. set everything between 01/03/2018 and 11/03/2018 that is 'Branded Paid Search' to zero [0].
The following works if I group the date and channel [and have the date as the index]:
sessions_df.loc[startdate:enddate, 'Branded Paid Search'] = 0
However, for the purpose of what I'm trying to achieve, I need the format to stay as above. I have tried the following:
df.loc[(df['Date'] >= startdate) & (df['Date'] <= enddate) & (df['Channel'] == 'Branded Paid Search')]['Sessions'] = 0
However I get the following error:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
I have tried the following, which partly works, but it forces the whole row to zero:
df.loc[(df['Date'] >= startdate) & (df['Date'] <= enddate) & (df['Channel'] == 'Branded Paid Search')] = 0
Any ideas on how I can force these values to zero?
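A minimal sketch of the .loc form: put the boolean mask on the row axis and the column label on the column axis of a single .loc call, so the assignment hits the original frame instead of a copy (made-up three-row frame):

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2018-01-02', '2018-01-05', '2018-01-02']),
                   'Channel': ['Branded Paid Search', 'Branded Paid Search', 'Email'],
                   'Sessions': [2172, 999, 1674]})
startdate, enddate = pd.Timestamp('2018-01-01'), pd.Timestamp('2018-01-03')

mask = ((df['Date'] >= startdate) & (df['Date'] <= enddate)
        & (df['Channel'] == 'Branded Paid Search'))
df.loc[mask, 'Sessions'] = 0   # one .loc call: rows by mask, column by name

print(df['Sessions'].tolist())  # [0, 999, 1674]
```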

How to create additional columns in a pandas dataframe inside a for loop
I am working with pandas and would like to add columns to my dataframe from a list. Ideally I would like to iterate through my list in a for loop creating a single column in each pass.
Example:
import pandas as pd

d = {'name': ['Ken', 'Bobby'],
     'age': [5, 6],
     'score': [1, 2]}
df = pd.DataFrame(d, columns=['name', 'age', 'score'])
new_columns = ['col1', 'col2']
Output:
name   age  score
Ken    5    1
Bobby  6    2
Desired output:
name   age  score  col1  col2
Ken    5    1      1     1
Bobby  6    2      2     2
Corrected solution:
for i in new_columns:
    df[i] = pd.Series([1, 2])
Edit:
I have corrected the code to fix a typo; however, there is a great additional solution that does not use for loops, which I intend to use in the future.
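A loop-free variant along the lines hinted at in the edit might look like this (the [1, 2] values are just the example's; assign adds every new column in one call):

```python
import pandas as pd

d = {'name': ['Ken', 'Bobby'], 'age': [5, 6], 'score': [1, 2]}
df = pd.DataFrame(d)
new_columns = ['col1', 'col2']

# Build all new columns in one assign() call; each one copies the
# example's [1, 2] values here.
df = df.assign(**{col: pd.Series([1, 2]) for col in new_columns})

print(df.columns.tolist())  # ['name', 'age', 'score', 'col1', 'col2']
```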

Parametric or non-parametric group test for 5 different groups
Problem statement: statistically establish whether 5 groups are the same or different.
I am working on a problem with dataset size ~600,000.
There are 5 groups, say [A, B, C, D, E], and corresponding salaries, with around ~100k observations per group.
df['Salary']
is slightly right-skewed. I tried ANOVA and the Kruskal-Wallis test.
ANOVA results:
If I use all the data, the p-value indicates that the groups are statistically different (p < 0.05).
If I use 10K random samples within each group, the p-value increases to ~0.002333.
If I use 1000 random samples within each group, the p-value exceeds 0.05 and is of the order of ~0.5.
I am not sure how to evaluate these results. What sample size should be considered, and what other methods should I consider?
Mean and SD of the 5 groups are below (when I consider a 100,000 random sample for each group):
Group 1 - (12.134831460674159, 5.1823701530849995)
Group 2 - (11.64860907759883, 5.092876703946831)
Group 3 - (11.660195118395315, 4.952100116921575)
Group 4 - (12.052747507535358, 5.091383288751849)
Group 5 - (11.468062169943916, 4.996349965883181)
KRUSKAL RESULTS
When sample size = 100:
KruskalResult(statistic=34.20564125753886, pvalue=6.762162830091762e-07)
When sample size = 10,000:
KruskalResult(statistic=179.39353155924363, pvalue=1.0064249109632168e-37)
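With ~100k observations per group, even a negligible difference produces a tiny p-value, so the p-values mostly reflect sample size. A scale-free complement is an effect size; a sketch computing Cohen's d for the two most different groups from the means/SDs listed above (pooled SD approximated as the root mean square of the two SDs, reasonable here since group sizes are equal):

```python
import math

m1, s1 = 12.134831460674159, 5.1823701530849995   # Group 1 (mean, SD)
m5, s5 = 11.468062169943916, 4.996349965883181    # Group 5 (mean, SD)

# Cohen's d: mean difference scaled by the pooled SD.
pooled_sd = math.sqrt((s1 ** 2 + s5 ** 2) / 2)
d = (m1 - m5) / pooled_sd

print(round(d, 2))  # ~0.13: a "small" effect by Cohen's rule of thumb
```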

Formatting exam results to perform a t-test in R
Question overview: I have a dataset containing the results of a 15-question pre-instructional and post-instructional exam. I am looking to run a t-test on the results to compare the overall means, but am having difficulty formatting the dataset properly. An example portion of the dataset is given below:
        1Pre 1Post 2Pre 2Post 3Pre 3Post 4Pre 4Post
Correct B    B     A    A     B    B     C    C
1       B    B     C    D     C    B     C    C
2       C    B     B    D     C    B     C    A
3       <NA> <NA>  <NA> <NA>  <NA> <NA>  <NA> <NA>
4       B    B     B    A     B    B     C    C
5       B    B     B    A     B    B     C    C
6       C    B     D    A     A    D     C    B
7       C    C     D    D     E    E     C    C
8       C    A     B    B     A    A     <NA> <NA>
Objective: I would like to match the "Correct" value to the values in the rows below for the test takers, such that a value of 1 is correct, and a value of 0 is incorrect. I have accomplished this using the following code:
for(j in 1:ncol(qDat)){
  for(i in 1:nrow(qDat)){
    if(qDat[i,j] == correctAns[1]){
      qDat[i,j] = 1
    }else{
      qDat[i,j] = 0
    }
  }
}
I would then like to run a t-test comparing the pre and post means, in addition to comparing the difference between the pre and post scores for each question. However, I need to omit any data points with NA. Currently, my method does not work with NA values and replaces them with zero. Is there any way of running these tests while simply omitting the NA values? Thank you!
The desired output:
        1Pre 1Post 2Pre 2Post 3Pre 3Post
Correct B    B     A    A     B    B
1       1    1     0    0     0    1
2       0    1     0    0     0    1
3       <NA> <NA>  <NA> <NA>  <NA> <NA>
4       1    1     0    0     1    1
5       1    1     0    0     1    1
6       0    1     0    1     0    0
7       0    0     0    0     0    0
8       0    0     0    0     0    0

t-test on multiple columns and one grouping variable
I have a question regarding multiple t-tests against one grouping variable.
I have a gene-expression data frame; just as an example, assume something like this:
library(reshape2)  # for melt()

df <- data.frame(a=runif(100), b=runif(100), c=runif(100)+0.5, d=runif(100)+0.5)
d <- melt(df)
Then I have a data frame with a grouping variable, like this:
dg <- data.frame(c(rep("A",50), rep("B",50)))
names(dg) <- "Group"
I found several quite similar questions, and I managed to run multiple t-tests with lapply, one for every column of the first data frame against the grouping variable.
However, I think it is necessary to adjust the p-values for the multiple comparisons (I have 15 columns overall). Here is my problem: how can I do this, for example with Bonferroni? Can I implement it in this code?
Further, I think it would be good to also test the assumptions of the t-test (normal distribution / equal variances) across these multiple data columns.
Does anyone here know how to combine all of this in a clean way? Ideally the statistics would also be collected into a data frame or something like that...
Thank you for any help!!