t-values and p-values seem wrong?
I have a dataframe, downloaded from http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml. My dataset is the yellow taxi trip data for January 2018. I keep these columns: trip_distance, fare_amount, pickup_time and dropoff_time.
The goal is to calculate 'price_per_mile', then the mean of these values for each borough, and then apply a t-test to see whether the differences between each pair of boroughs are significant. The problem is that at the end I get t-values of 0 and p-values of 1 for all pairs (with just one exception). I don't understand what I need to recheck or change. You can get 'taxi_zone_lookup.csv' from this address too: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
This is my code:
import pandas as pd
import numpy as np
from scipy import stats

df = pd.read_csv('yellow_tripdata_2018-01.csv',
                 usecols=['tpep_pickup_datetime', 'tpep_dropoff_datetime',
                          'trip_distance', 'PULocationID', 'fare_amount'])

# Data cleaning
df.drop(df[df['trip_distance'] > 3].index, inplace=True)
df.drop(df[df['trip_distance'] < 0.5].index, inplace=True)
df.drop(df[df['fare_amount'] > 10].index, inplace=True)
df.drop(df[df['fare_amount'] < 1].index, inplace=True)
df['trip_distance'] = df['trip_distance'].astype(np.float16)
df['PULocationID'] = df['PULocationID'].astype(np.uint16)
df['fare_amount'] = df['fare_amount'].astype(np.float16)
df['price_per_mile'] = df['fare_amount'] / df['trip_distance']

borough = pd.read_csv(r'taxi_zone_lookup.csv', usecols=['LocationID', 'Borough'])
result = pd.merge(df, borough,
                  left_on='PULocationID',
                  right_on='LocationID',
                  how='inner')
result.drop(result[(result.Borough == 'EWR') | (result.Borough == 'Unknown')].index, inplace=True)

df['price_per_mile'].describe()
# here I get mean=NaN???
# t-test
# Creating a dataframe with two levels of index
boroughs = ['Bronx', 'Brooklyn', 'Manhattan', 'Staten Island', 'Queens']
iterables = [boroughs, ['tvalue', 'pvalue', 'H0 hypothesis']]
my_index = pd.MultiIndex.from_product(iterables)
dt = pd.DataFrame(index=my_index, columns=boroughs)

for i in boroughs:
    a = result.loc[result.Borough == i]['price_per_mile']
    for j in boroughs:
        b = result.loc[result.Borough == j]['price_per_mile']
        t2, p2 = stats.ttest_ind(a, b)
        dt.loc[(i, 'tvalue'), j] = t2
        dt.loc[(i, 'pvalue'), j] = p2
        if p2 > 0.05:
            dt.loc[(i, 'H0 hypothesis'), j] = 'Fail to Reject H0'
        else:
            dt.loc[(i, 'H0 hypothesis'), j] = 'Reject H0'
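For reference, the pairwise-test pattern above can be exercised on synthetic data (the borough names are reused but every number below is made up). Two things worth noting: when i == j the loop compares a sample with itself, which yields t = 0 and p = 1 by construction, and float16 has so little range and precision that statistics computed from it can overflow to inf/NaN, so float64 is used here:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the merged taxi frame.
rng = np.random.default_rng(0)
result = pd.DataFrame({
    'Borough': np.repeat(['Bronx', 'Brooklyn', 'Manhattan'], 200),
    'price_per_mile': np.concatenate([
        rng.normal(5.0, 1.0, 200),
        rng.normal(5.5, 1.0, 200),
        rng.normal(7.0, 1.0, 200),
    ]),  # float64 on purpose: float16 statistics can overflow to inf/NaN
})

boroughs = ['Bronx', 'Brooklyn', 'Manhattan']
for i in boroughs:
    a = result.loc[result.Borough == i, 'price_per_mile']
    for j in boroughs:
        if i >= j:
            continue  # each unordered pair once; i == j would give t=0, p=1
        b = result.loc[result.Borough == j, 'price_per_mile']
        t2, p2 = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test
        print(f'{i} vs {j}: t={t2:.2f}, p={p2:.4g}')
```

With distinct float64 samples like these, every pair gives a non-trivial statistic, which suggests checking both the i == j comparisons and the dtype in the original code.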
See also questions close to this topic

reducer to find the most popular movie for each age group in python
I am trying to write a mapper and reducer for Hadoop to find the movies with a 5 rating ("the popular movies") for each age group.
I wrote this mapper.py to join the two data sets on the user ID, so I can get the age from the users data and the rating plus the movie ID from the ratings data:

#!/usr/bin/env python
import sys

for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    line = line.split("::")
    rating = "1"
    movie = "1"
    user = "1"
    age = "1"
    if len(line) == 4:  # ratings data
        rating = line[2]
        movie = line[1]
        user = line[0]
        # print '%s %s %s' % (user, movie, rating)
    else:  # users data
        user = line[0]
        age = line[2]
    print '%s\t%s\t%s\t%s' % (user, age, rating, movie)

This is the data structure:
ratings data: userid, movieid, rating, timestamp
users data: userid, gender, age, occupation

The reducer I wrote is not working at all; it gives me 0 results.
I want the result to be the top popular movies for each age group:
1  2254 4567
18 8732 0987 0986
25 7654 8765 7658
35 6543 7645 7654
45 7654 8765 5433
50 7652 1876 7654
56 3986 3956
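A reduce-side join along the lines the mapper implies might look like the minimal Python 3 sketch below (this is not the asker's reducer, and the sample records are made up). One caveat worth flagging: the mapper's "1" defaults make a users record indistinguishable from a genuine rating of 1 for movie 1, which is one plausible reason a reducer ends up matching nothing:

```python
from collections import defaultdict

def reduce_lines(lines):
    """Join each user's age record with their rating records, then
    collect the movies rated 5 for every age group."""
    age_by_user = {}
    ratings_by_user = defaultdict(list)          # user -> [(movie, rating), ...]
    for line in lines:
        user, age, rating, movie = line.strip().split('\t')
        if movie == '1' and rating == '1':       # mapper defaults => users record
            age_by_user[user] = age
        else:                                    # ratings record
            ratings_by_user[user].append((movie, rating))

    top_by_age = defaultdict(set)
    for user, pairs in ratings_by_user.items():
        age = age_by_user.get(user)
        if age is None:
            continue                             # rating with no joined age record
        for movie, rating in pairs:
            if rating == '5':
                top_by_age[age].add(movie)
    return top_by_age

# Made-up sample of mapper output: one users record, two ratings records.
sample = [
    '1\t25\t1\t1',       # users record: user 1 is in age group 25
    '1\t1\t5\t1193',     # user 1 rated movie 1193 with 5
    '1\t1\t3\t661',      # rating of 3, so not a "popular" movie
]
top = reduce_lines(sample)
for age in sorted(top, key=int):
    print(age, ' '.join(sorted(top[age])))
```

In real Hadoop streaming the function would consume sys.stdin, with all records for a user arriving contiguously after the shuffle.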

How to compare two columns from two DFs, keeping some columns constant, and print the row?
I'm working on a project where I have to find the changes made in a second sheet (in a specific column) compared to the primary/master sheet. After that I want to print or save the complete row in which changes are found. Here are more details. Both Excel sheets have many columns. My master sheet has data something like the following:
TID LOC HECI RR UNIT SUBD S EUSE INV ACT CAC FMT CKT DD SCID CUSTOMER F&E/SERVICE ID BVAP PORD AUTH RULE ST RGN
CHCGILDTO3P050101D CHCGILDTO3P M3MSA0S1RA 0501.01D 1A1 IE D STR3RA8 S CL/HFFS/688898 /LGT 20180721 BLOOMBERG LP DS316668545 WMS881282 E.485339 IL N
CHCGILDTO3P050101D CHCGILDTO3P M3MSA0S1RA 0501.01D 1A2 IE J DNA UNDER DECOM EID 2466 20190322 WMS881282 E.485339 IL N
CHCGILDTO3P050101D CHCGILDTO3P M3MSA0S1RA 0501.01D 1A3 IE J DNA UNDER DECOM EID 2466 20190322 WMS881282 E.485339 IL N
CHCGILDTO3P050101D CHCGILDTO3P M3MSA0S1RA 0501.01D 1A4 IE J DNA UNDER DECOM EID 2466 20190322 WMS881282 E.485339 IL N
CHCGILDTO3P050101D CHCGILDTO3P M3MSA0S1RA 0501.01D 1A5 IE J DNA UNDER DECOM EID 2466 20190322 WMS881282 E.485339 IL N
And my second sheet has data as follows:
HECI UNIT INV SUB ACT CKT PACT DD LOC RR
M3MSA0S1RA 1A1 IE $ CL/HFFS/688898 /LGT D 72118 CHCGILDTO3P 0501.01D
M3MSA0S1RA 1A2 IE J DNA UNDER DECOM EID 2466 32219 CHCGILDTO3P 0501.01D
M3MSA0S1RA 1A3 IE J DNA UNDER DECOM EID 2466 32219 CHCGILDTO3P 0501.01D
M3MSA0S1RA 1A4 IE J DNA UNDER DECOM EID 2466 32219 CHCGILDTO3P 0501.01D
M3MSA0S1RA 1A5 IE J DNA UNDER DECOM EID 2466 32219 CHCGILDTO3P 0501.01D
So first I want to check whether the values of LOC, HECI, RR and UNIT are the same in both sheets; if they are, I want to move forward, compare the ACT column, and print the difference as output.
For example, you can see in row #1 of the master data that ACT is 'D', whereas in the second sheet it has changed to '$'.
So I want output something like the related complete row, saying it changed from 'D' to '$'.
This seems very complicated to me as I'm at the beginning stage of Python and pandas.
I tried using loops but was unable to execute them; also, if I use too many loops, that's not the pandas way, I believe.
here is my code:
import pandas as pd

df1 = pd.read_excel("Master Database.xlsx")
df2 = pd.read_excel("CHCGILDTO3P_0501.01D.xlsx")
d1_act = df1['ACT']
d2_act = df2['ACT']

for index1, row1 in df1.iterrows():
    for index2, row2 in df2.iterrows():
        if (row1['LOC'], row1['HECI'], row1['RR']) == (row2['LOC'], row2['HECI'], row2['RR']):
            for x, y in zip(d1_act, d2_act):
                # print(x, y)
                if x != y:
                    print(x, y)  # not getting how to print complete respective row
I want output like:
M3MSA0S1RA 1A1 IE $ CL/HFFS/688898 /LGT D 72118 CHCGILDTO3P 0501.01D
changed from 'D' to '$'
Please assist! Thank you in advance!
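One vectorized way to do this without nested loops is to merge the two frames on the key columns and keep only the rows where ACT disagrees. A minimal sketch on tiny made-up frames (real sheets have more columns; only the join keys and ACT matter here):

```python
import pandas as pd

# Tiny made-up stand-ins for the two sheets.
master = pd.DataFrame({
    'LOC':  ['CHCGILDTO3P', 'CHCGILDTO3P'],
    'HECI': ['M3MSA0S1RA', 'M3MSA0S1RA'],
    'RR':   ['0501.01D', '0501.01D'],
    'UNIT': ['1A1', '1A2'],
    'ACT':  ['D', 'J'],
})
second = pd.DataFrame({
    'LOC':  ['CHCGILDTO3P', 'CHCGILDTO3P'],
    'HECI': ['M3MSA0S1RA', 'M3MSA0S1RA'],
    'RR':   ['0501.01D', '0501.01D'],
    'UNIT': ['1A1', '1A2'],
    'ACT':  ['$', 'J'],
})

keys = ['LOC', 'HECI', 'RR', 'UNIT']
# Inner merge pairs up rows that agree on all the key columns.
merged = master.merge(second, on=keys, suffixes=('_master', '_second'))
# Keep only rows where the ACT value changed between the sheets.
changed = merged[merged['ACT_master'] != merged['ACT_second']]

for _, row in changed.iterrows():
    print(row.to_dict())
    print(f"changed from {row['ACT_master']!r} to {row['ACT_second']!r}")
```

Because `changed` is a regular DataFrame, the complete row (all merged columns) is available for printing or saving with `to_excel`.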

merge duplicate cells of a column
My current Excel looks like:

Type  Val
A     1
A     2
B     3
B     4
B     5
C     6

This is the required Excel:

Type  Val  Sum
A     1    3
      2
B     3    12
      4
      5
C     6    6

Is it possible in Python using pandas or any other module?
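In pandas the per-Type total can be computed with a groupby transform; blanking the repeats then mimics the merged-cell look (true merged cells in the output file would need a writer such as openpyxl). A minimal sketch using the example table from the question:

```python
import pandas as pd

# The example table from the question.
df = pd.DataFrame({'Type': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'Val': [1, 2, 3, 4, 5, 6]})

# Per-Type total on every row (object dtype so repeats can be blanked)...
df['Sum'] = df.groupby('Type')['Val'].transform('sum').astype(object)
# ...then blank all but the first row of each group, mimicking the
# merged-cell look of the desired sheet.
df.loc[df['Type'].duplicated(), 'Sum'] = ''
print(df)
```

`df.to_excel(...)` would then write the frame out; cell merging proper is a feature of the Excel writer, not of the DataFrame.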

Pandas dataframe update keys
I'm unable to update a Pandas
Dataframe
usingpd.update()
function, I always get aNone
result. I'm using aDataframe
with keys which is the result of joining 2Dataframes
.I calculate the
z1 score
for onlyfloat32
columns, and then I update theDataframe
with the new values forfloat32
columns.class MySimpleScaler(object): def __init__(self): self._means = None self._stds = None def preprocess(self, data): """Calculate zscore for dataframe""" if self._means is None: # During training only self._means = data.select_dtypes('float32').mean() if self._stds is None: # During training only self._stds = data.select_dtypes('float32').std() if not self._stds.all(): raise ValueError('At least one column has standard deviation of 0.') z1 = (data.select_dtypes('float32')  self._means) / self._stds return data.update(z1)
all_x = pd.concat([train_x, eval_x], keys=['train', 'eval'])
scaler = MySimpleScaler()
all_x = scaler.preprocess(all_x)
train_x, eval_x = all_x.xs('train'), all_x.xs('eval')
When I run data.update(z1) it always returns None.
I need to reuse the scaler object later to calculate the z-score for new dataframes.
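For reference, DataFrame.update modifies the frame in place and returns None (much like list.sort), so returning its result discards the updated data. A minimal sketch of that behavior:

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0], 'b': ['x', 'y']})
z = pd.DataFrame({'a': [10.0, 20.0]})

ret = df.update(z)       # mutates df in place
print(ret)               # None: update() has no return value
print(df['a'].tolist())  # the new values live in df itself
```

So a preprocess method built on update would call `data.update(z1)` and then `return data` rather than returning the call itself.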

How to normalize open simplex noise values?
I want to make all the values generated by an OpenSimplex 2D noise method fit in a range from zero to one, inclusive. I have tried the (value - min)/(max - min) equation I found here, but while the maximum value is correct, the minimum value found in my tests is -2.6123198888582956E-9.
The OpenSimplexNoise class I use here can be found at its page on GitHub.
How can I normalize the noise values here to the range [0, 1]?
public class NatureGenerator {
    private static final SplittableRandom Rand = new SplittableRandom();

    public static double[][] generateNature() {
        double[][] worldMap = new double[900][1600];
        // iterate through every pixel in world map
        for (int y = 0; y < worldMap.length; y++) {
            for (int x = 0; x < worldMap[0].length; x++) {
                // get noise value
                double noiseVal = noise.eval(x, y);
                // noise values without normalizing here are..
                // maximum: 0.8643664287621713
                // minimum: -0.8643664332781745
                noiseVal = ((noiseVal + 0.8643664287621713)
                        / (0.8643664287621713 + 0.8643664332781745));
                // assign noise value to pixel
                worldMap[y][x] = noiseVal;
            }
        }
        // find minimum and maximum values in the 2D array worldMap
        double min = Double.MAX_VALUE;
        double max = Double.MIN_VALUE;
        for (double[] dd : worldMap) {
            for (double d : dd) {
                if (min > d) min = d;
                if (max < d) max = d;
            }
        }
        System.out.println("Maximum: " + max + " Minimum: " + min);
        return worldMap;
    }
}
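One robust approach is to normalize against the extremes actually observed in the generated grid (a second pass over the array) instead of hard-coded constants; by construction the result then spans exactly [0, 1]. A Python sketch of the idea, with a made-up uniform grid standing in for the noise values:

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for the noise grid (made-up values in roughly the range
# the question reports for the raw OpenSimplex output).
world = rng.uniform(-0.8643664332781745, 0.8643664287621713, size=(90, 160))

# Pass 1: find the extremes actually present in this array.
lo, hi = world.min(), world.max()
# Pass 2: rescale with the observed extremes, not hard-coded constants.
normalized = (world - lo) / (hi - lo)

print(normalized.min(), normalized.max())
```

Since (lo - lo)/(hi - lo) is exactly 0.0 and (hi - lo)/(hi - lo) is exactly 1.0 in floating point, tiny out-of-range residues like -2.6E-9 cannot occur with this two-pass scheme.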

How to normalize the features extracted from Kinect C#?
I'm doing a project to extract lip features with a Kinect in C#. My question is: how do I normalize these features?
I think I have to do a function like this:
Nd = ( Cd * Fd ) / dFN
where Nd is the normalized distance, Cd is the current distance to the Kinect, Fd is the depth value of the forehead, and dFN is the depth value by which every frame is normalized.
Also, a z-score normalization must be implemented:
z = (X - μ) / σ
where z is the result of the z-score normalization, X is the current feature value, μ is the mean obtained from one database and σ represents the standard deviation.
How can I do this normalization based on the features?
Here is the function where I receive the features to normalize:
private void FaceReader_FrameArrived(object sender, HighDefinitionFaceFrameArrivedEventArgs e)
{
    NewRecording();
    using (var frame = e.FrameReference.AcquireFrame())
    {
        if (frame != null)
        {
            frame.GetAndRefreshFaceAlignmentResult(_faceAlignment);
            UpdateFacePoints();
            var vertices = _faceModel.CalculateVerticesForAlignment(_faceAlignment);
            var lipLeftCorner = vertices[(int)HighDetailFacePoints.MouthLeftcorner];
            var lipRightCorner = vertices[(int)HighDetailFacePoints.MouthRightcorner];
            var lipWidth = Math.Abs(lipRightCorner.X - lipLeftCorner.X);
            var lipTopCorner = vertices[(int)HighDetailFacePoints.MouthUpperlipMidtop];
            var lipBottomCorner = vertices[(int)HighDetailFacePoints.MouthLowerlipMidbottom];
            var lipHeight = Math.Abs(lipTopCorner.Y - lipBottomCorner.Y);
            var chinCenter = vertices[(int)HighDetailFacePoints.ChinCenter];
            Console.WriteLine(lipLeftCorner.X.ToString());
            Console.WriteLine(lipRightCorner.X.ToString());
            Console.WriteLine(lipWidth.ToString());
            Console.WriteLine(lipTopCorner.Y.ToString());
            Console.WriteLine(lipBottomCorner.Y.ToString());
            Console.WriteLine(lipHeight.ToString());
            Console.WriteLine(lipTopCorner.Z.ToString());
            Console.WriteLine(chinCenter.X.ToString());
            Console.WriteLine(chinCenter.Y.ToString());
        }
    }
}

Device-volume-independent Android Visualizer measurement
I want to normalize the volume level of the PCM stream I decode, which arrives encoded over sockets, by grouping short bursts of samples. Instead of multiplying each sample by 1-(peakSample/32367.0)+1, I use Visualizer to get the peak value and use it in LoudnessEnhancer to add gain (so, when Visualizer reports a peak of 3200, I add 3200 as targetGain with LoudnessEnhancer). The problem is that Visualizer depends on the device volume, even if I change the measurement scale of Visualizer. So, is there a way to get a device-volume-independent measurement from Visualizer? I do not want to calculate RMS and peak myself if there is already well-tested, working Android code.
Map latitude and longitude fields by replacing the direction string with a "-" accordingly
I have a dataset that contains latitude and longitude values written like 20.55E and 30.11N. I want to replace these direction strings with an appropriate "-" where required. So basically, I'll map based on the condition and change the value.
Currently, I have a Schema and I'm trying to sort out the TransformProcess.
My Schema is like this:

new Schema.Builder()
        .addColumnTime("dt", DateTimeZone.UTC)
        .addColumnsDouble("AverageTemperature", "AverageTemperatureUncertainty")
        .addColumnsInteger("City", "Country")
        .addColumnsFloat("Latitude", "Longitude")
        .build();

And I'm stuck with my TransformProcess like this:

new TransformProcess.Builder(schema)
        .filter(new FilterInvalidValues("AverageTemperature", "AverageTemperatureUncertainty"))
        .stringToTimeTransform("dt", "yyyy-MM-dd", DateTimeZone.UTC)
        . // map current Latitude -> remove direction string and put sign
I am trying to follow this code from a tutorial, and after the TransformProcess I'll do the Spark stuff and save the data.
My question is: how can I perform the mapping? From the API docs of TransformProcess, I cannot make sense of anything that will help me solve my problem.
I am using the DataVec library in Deeplearning4j.
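Whatever DataVec transform ends up carrying it, the mapping itself is simple: the trailing letter selects the sign (S and W negative, N and E positive) and the rest is the magnitude. A language-neutral sketch of that logic in Python (not DataVec code; the function name is made up):

```python
def to_signed(value: str) -> float:
    """Turn a coordinate like '20.55E' or '30.11S' into a signed float:
    S and W become negative, N and E stay positive."""
    direction = value[-1].upper()
    magnitude = float(value[:-1])
    return -magnitude if direction in ('S', 'W') else magnitude

print(to_signed('20.55E'), to_signed('30.11N'), to_signed('30.11S'))
```

In DataVec the equivalent would be a custom column transform applying this string-to-float conversion to the Latitude and Longitude columns.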
Remove the version number from a data frame column in R
This is my original data.frame:
cell       counts  gene
TGCTACC-1  10      ALKBH5
TACACGA-1  20      KDM5C
TCCTTGG-1  30      EZH2
TACGGTC-1  30      PRMT2
I want to remove the trailing numbers and "-" from the cell column. How can I do this?
My desired output looks like this:

cell     counts  gene
TGCTACC  10      ALKBH5
TACACGA  20      KDM5C
TCCTTGG  30      EZH2
TACGGTC  30      PRMT2
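The pattern to strip is a trailing "-" followed by digits; in R itself something like sub("-[0-9]+$", "", df$cell) would do it. The same regex idea, sketched in pandas for consistency with the other questions on this page:

```python
import pandas as pd

# The example frame from the question.
df = pd.DataFrame({'cell': ['TGCTACC-1', 'TACACGA-1', 'TCCTTGG-1', 'TACGGTC-1'],
                   'counts': [10, 20, 30, 30],
                   'gene': ['ALKBH5', 'KDM5C', 'EZH2', 'PRMT2']})

# Drop a trailing '-<digits>' suffix from every barcode.
df['cell'] = df['cell'].str.replace(r'-\d+$', '', regex=True)
print(df)
```

Anchoring with `$` keeps internal dashes or digits intact; only the version-style suffix at the end is removed.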

How to extract unique rows from a python pandas groupby object and save it in another dataframe?
I have a dataset of Black Friday sales. The columns are User_ID, Product_ID, Gender, Occupation, Product_Category, Purchase, Marital_Status, etc. After analyzing the data, I found that the attribute User_ID has redundant entries (i.e. a single customer buying multiple goods). The total number of entries is 537377, and after I apply
df = df.groupby('User_ID')
the number of entries is reduced to 5891. I want to extract all the unique rows (i.e. one row per customer) from the pandas groupby object. Is there any way to do so? I tried summing up each purchase amount corresponding to a unique User_ID, but that does not help.
df = df.groupby('User_ID')
df['Purchase'].transform('sum')
for key, item in df:
    print(df.get_group(key), "\n\n")
After executing the above code, the result I get is:
      User_ID Gender   Age  Occupation City_Category  \
0     1000001      F  0-17          10             A
1     1000001      F  0-17          10             A
2     1000001      F  0-17          10             A
3     1000001      F  0-17          10             A
39180 1000001      F  0-17          10             A
4     1000002      M   55+          16             C
39181 1000002      M   55+          16             C
39182 1000002      M   55+          16             C
39183 1000002      M   55+          16             C
39184 1000002      M   55+          16             C
78147 1000002      M   55+          16             C

      Product_Category_2  Product_Category_3  Purchase
0                    0.0                 0.0      8370
1                    6.0                14.0     15200
2                    0.0                 0.0      1422
3                   14.0                 0.0      1057
39180                4.0                 8.0     12842
4                    0.0                 0.0      7969
39181               17.0                 0.0      6187
39182               16.0                 0.0     10074
39183                8.0                14.0      5260
39184               16.0                 0.0      7927
78147               16.0                 0.0      7791
What I actually want, after dropping Product_Category_2 and Product_Category_3, is the Purchase attribute containing the sum of the total money spent:

  User_ID Gender   Age  Occupation City_Category  Purchase
0 1000001      F  0-17          10             A     38891
1 1000002      M   55+          16             C     37239
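One way to get one row per customer is to aggregate per group instead of iterating: keep the first value of the columns that are constant per user and sum Purchase. A minimal sketch on a made-up miniature of the data:

```python
import pandas as pd

# Made-up miniature of the Black Friday data.
df = pd.DataFrame({
    'User_ID':  [1000001, 1000001, 1000002, 1000002, 1000002],
    'Gender':   ['F', 'F', 'M', 'M', 'M'],
    'Age':      ['0-17', '0-17', '55+', '55+', '55+'],
    'Purchase': [8370, 15200, 7969, 6187, 10074],
})

# One row per customer: per-user-constant columns keep their first
# value, Purchase is summed across that user's transactions.
summary = (df.groupby('User_ID', as_index=False)
             .agg({'Gender': 'first', 'Age': 'first', 'Purchase': 'sum'}))
print(summary)
```

The result is an ordinary DataFrame, so it can be assigned to another variable or saved directly, with no need to loop over get_group.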

Trying a t-test between two models but got an "out of bounds" error
I am trying to perform a k-fold paired t-test between a Multinomial Naive Bayes classifier and a Stochastic Gradient Descent classifier, but I do not get any output; I get an index error saying that index 3805 is out of bounds for axis 0.
from mlxtend.evaluate import paired_ttest_kfold_cv

t, p = paired_ttest_kfold_cv(estimator1=clf_MNB,
                             estimator2=clf_SGD,
                             X=X, y=y,
                             random_seed=1)
print('t statistic: %.3f' % t)
print('p value: %.3f' % p)

Randomization check
I feel a little stupid for asking this question, but somehow I can't figure it out!
I am trying to see whether participants are equally randomly assigned to two groups, so the contingency table looks like the one below:

Condition A  Condition B
30           40
The main data look like the ones below:

     Variable 1
P1   Condition A
P2   Condition B
P3   Condition A
P4   Condition A
P5   Condition B
P6   Condition A
.    .
.    .
.    .
What would be the best way/function to check (using R) that they were equally distributed between the two conditions?
Thank you a lot!
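Two standard checks for a 30/40 split against an expected 50/50 assignment are a chi-square goodness-of-fit test and an exact binomial test; in R these would be chisq.test(c(30, 40)) and binom.test(30, 70). The same idea sketched in Python (used for the other questions on this page), with the counts taken from the question:

```python
from scipy import stats

counts = [30, 40]      # observed group sizes from the question
n = sum(counts)

# Chi-square goodness-of-fit against an expected 50/50 split.
chi2, p_chi2 = stats.chisquare(counts)
# Exact binomial test of the same null hypothesis.
p_binom = stats.binomtest(counts[0], n, 0.5).pvalue

print(f'chi-square p = {p_chi2:.3f}, exact binomial p = {p_binom:.3f}')
```

With these counts both p-values come out well above 0.05, i.e. a 30/40 split is quite compatible with random 50/50 assignment.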

Choosing the correct statistical test
I have data from 2 groups, with 15 members in each group. Our dependent variables are time and score. I am sure that my data are not normally distributed. I am confused about which test to use to get a valid result. Could someone please help me with this problem?
When we tested for normality, we got the following significance values: time (Shapiro-Wilk result: group 1: .227; group 2: .0009) and score (Shapiro-Wilk result: group 1: .000; group 2: .001).
Could someone recommend a good statistical test to perform, and please point out if I missed anything in the testing.
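For two independent groups with non-normal data, a common non-parametric choice is the Mann-Whitney U test (a suggestion, not a prescription; a paired design or ties would change the picture). A sketch with made-up, deliberately non-normal samples of 15 per group:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical, clearly non-normal scores for two groups of 15.
group1 = rng.exponential(scale=1.0, size=15)
group2 = rng.exponential(scale=3.0, size=15)

u, p = stats.mannwhitneyu(group1, group2, alternative='two-sided')
print(f'U = {u}, p = {p:.4f}')
```

The same call would apply separately to the time and score variables; with samples this small, an exact rather than asymptotic p-value (scipy chooses automatically) is a further point in its favor.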