t-values and p-values seem wrong?
I have a dataframe, downloaded from http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml. My dataset is the yellow-taxi data for January 2018. I keep these columns: trip_distance, fare_amount, pickup_time and dropoff_time.
The goal is to calculate 'price_per_mile', then the mean of these values for each borough, and then apply a t-test to see whether the differences between each pair of boroughs are significant. The problem is that at the end I get t-values of 0 and p-values of 1 for all the pairs (with just one exception). I don't understand what I need to recheck or change. You can also get 'taxi_zone_lookup.csv' from the same address: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
this is my code:
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv('yellow_tripdata_201801.csv',
                 usecols=['tpep_pickup_datetime', 'tpep_dropoff_datetime',
                          'trip_distance', 'PULocationID', 'fare_amount'])
#Data cleaning
df.drop(df[df['trip_distance']>3].index, inplace=True)
df.drop(df[df['trip_distance']<0.5].index, inplace=True)
df.drop(df[df['fare_amount']>10].index, inplace=True)
df.drop(df[df['fare_amount']<1].index, inplace=True)
df['trip_distance']=df['trip_distance'].astype(np.float16)
df['PULocationID']=df['PULocationID'].astype(np.uint16)
df['fare_amount']=df['fare_amount'].astype(np.float16)
df['price_per_mile'] = df['fare_amount']/df['trip_distance']
borough = pd.read_csv(r'taxi_zone_lookup.csv', usecols = ['LocationID', 'Borough'])
result = pd.merge(df,
borough,
left_on='PULocationID',
right_on='LocationID',
how='inner'
)
result.drop(result[(result.Borough == 'EWR') | (result.Borough == 'Unknown')].index, inplace=True)
df['price_per_mile'].describe()
#here I get mean=NaN???
#ttest
#Creating a dataframe with twolevel of indexes
boroughs = ['Bronx', 'Brooklyn', 'Manhattan', 'Staten Island', 'Queens']
iterables = [['Bronx', 'Brooklyn', 'Manhattan', 'Staten Island', 'Queens'], ['tvalue', 'pvalue', "H0 hypothesis"]]
my_index = pd.MultiIndex.from_product(iterables)
dt = pd.DataFrame(index=my_index, columns=boroughs)
for i in boroughs:
    a = result.loc[result.Borough == i]["price_per_mile"]
    for j in boroughs:
        b = result.loc[result.Borough == j]["price_per_mile"]
        t2, p2 = stats.ttest_ind(a, b)
        dt.loc[(i, "tvalue"), j] = t2
        dt.loc[(i, "pvalue"), j] = p2
        if p2 > 0.05:
            dt.loc[(i, "H0 hypothesis"), j] = 'Fail to Reject H0'
        else:
            dt.loc[(i, "H0 hypothesis"), j] = 'Reject H0'
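Two things seem worth rechecking here. First, each borough is also tested against itself, and a group compared with itself always yields t=0 and p=1, so the diagonal of the table is expected to look "wrong". Second, float16 has very limited precision and a maximum value of about 65504, so sums over millions of rows can overflow to inf/NaN (a plausible cause of the NaN mean from describe()); casting back to float64 before testing avoids this. A minimal sketch of the pairwise loop on synthetic float64 data (the borough names and distributions below are made up, standing in for the real price_per_mile groups):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-borough price_per_mile samples; the real ones would be
# result.loc[result.Borough == name, 'price_per_mile'].astype(np.float64)
groups = {
    'Bronx': rng.normal(5.0, 1.0, 500),
    'Brooklyn': rng.normal(5.5, 1.0, 500),
    'Manhattan': rng.normal(6.5, 1.2, 500),
}

results = {}
for i in groups:
    for j in groups:
        t, p = stats.ttest_ind(groups[i], groups[j])
        results[(i, j)] = (float(t), float(p))

# A group tested against itself always gives t=0, p=1 -- only the
# off-diagonal cells of the table carry information.
print(results[('Bronx', 'Bronx')], results[('Bronx', 'Manhattan')])
```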
See also questions close to this topic

Selenium Python Unable to scroll down, while fetching google reviews
I am trying to fetch Google reviews with the help of Selenium in Python. I have imported webdriver from the selenium module. Then I initialized self.driver as follows:
self.driver = webdriver.Chrome(executable_path="./chromedriver.exe",chrome_options=webdriver.ChromeOptions())
After this I use the following code to type, on the Google homepage, the name of the company whose reviews I need. For now I am trying to fetch reviews for "STANLEY BRIDGE CYCLES AND SPORTS LIMITED":
company_name = self.driver.find_element_by_name("q")
company_name.send_keys("STANLEY BRIDGE CYCLES AND SPORTS LIMITED ")
time.sleep(2)
After this, to click on the Google search button, I use the following code:
self.driver.find_element_by_name("btnK").click()
time.sleep(2)
Then finally I am on the page where I can see results. Now I want to click on the "View all Google reviews" button, for which I use the following code:
self.driver.find_elements_by_link_text("View all Google reviews")[0].click()
time.sleep(2)
Now I am able to get reviews, but only 10. I need at least 20 reviews per company, so I am trying to scroll the page down using the following code:
self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)
Even after running the scroll code above, I still get only 10 reviews, although no error is raised. I need help scrolling the page down to get at least 20 reviews. Based on my online searches, people have mostly used driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") to scroll a page down, but for me this is not working: I checked the height of the page before and after the call and it is the same.
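The unchanged page height suggests the reviews list lives inside its own scrollable pane, so scrolling the window has no effect; the pane's own scrollTop has to be moved. A hedged sketch of a helper that builds the JavaScript for that (the CSS selector is hypothetical and should be taken from inspecting the actual page):

```python
def scroll_pane_js(css_selector):
    """JavaScript that scrolls a specific element, not the window.

    window.scrollTo() does nothing when the list sits in its own
    scrollable pane (which is why the page height never changes);
    instead, the pane's scrollTop must be set to its scrollHeight.
    The CSS selector is an assumption -- inspect the page to find it.
    """
    return ("var pane = document.querySelector('{0}');"
            "pane.scrollTop = pane.scrollHeight;").format(css_selector)

# Hypothetical usage with the existing driver:
# for _ in range(3):  # each scroll typically loads another batch of reviews
#     self.driver.execute_script(scroll_pane_js("div.review-dialog-list"))
#     time.sleep(3)
```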

Create a new column from iterated rows of datetime data
I am attempting to create a downward-velocity model for offshore drilling, which uses the variables Depth (which increases every 1 foot) and DateTime data, which is more intermittent and is only updated at every foot of depth:
Dept   DateTime
1141   5/24/2017 04:31
1142   5/24/2017 04:32
1143   5/24/2017 04:40
1144   5/24/2017 04:42
1145   5/25/2017 04:58
I am trying to get something like this, where Velocity is iterated down the rows as (Dept gap) / (DateTime gap).
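Row-over-row differencing can be sketched without any libraries: parse each timestamp, then divide the depth gap by the time gap in minutes (data taken from the table above):

```python
from datetime import datetime

# The rows from the question: depth increases by 1 ft, timestamps are uneven
rows = [
    (1141, "5/24/2017 04:31"),
    (1142, "5/24/2017 04:32"),
    (1143, "5/24/2017 04:40"),
    (1144, "5/24/2017 04:42"),
    (1145, "5/25/2017 04:58"),
]

def velocities(rows, fmt="%m/%d/%Y %H:%M"):
    """Velocity between consecutive rows: depth gap (ft) / time gap (min)."""
    out = [None]  # the first row has no previous row to difference against
    for (d0, t0), (d1, t1) in zip(rows, rows[1:]):
        minutes = (datetime.strptime(t1, fmt)
                   - datetime.strptime(t0, fmt)).total_seconds() / 60.0
        out.append((d1 - d0) / minutes if minutes else None)
    return out

print(velocities(rows))
```

With a pandas DataFrame the same idea is `df['Dept'].diff() / (df['DateTime'].diff().dt.total_seconds() / 60)`.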

one-to-one mapping in a shell script
I am in the process of a migration: moving from an old set of servers to a new set, where there is no logical relationship between the server names of the two sets. I have a script that runs on an old server and takes all the necessary backups, and then another script copies the backups to the new server and executes there.
I can combine both scripts (taking the backup and copying it to the new server) if I can include logic to map the old server to the new server. Is there a way I can do this?
Old server   New server
King         Queen
Bat          Ball
water        fire
sand         rock
What I am expecting is, if the script is run on server 'King', I want the script to identify that the corresponding new server is 'Queen' and copy the backups to Queen.
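One simple approach is a lookup function keyed on the current hostname; a sketch in plain sh using the example pairs (the copy step at the end is hypothetical, and bash 4 users could use a `declare -A` associative array instead):

```shell
#!/bin/sh
# One-to-one old-server -> new-server lookup (names from the table above).
map_new_server() {
    case "$1" in
        King)  echo "Queen" ;;
        Bat)   echo "Ball" ;;
        water) echo "fire" ;;
        sand)  echo "rock" ;;
        *)     echo "unknown"; return 1 ;;
    esac
}

# In the combined backup-and-copy script, the current host picks its target:
target=$(map_new_server "$(hostname)")
echo "Backups for $(hostname) go to: $target"
# e.g. scp backup.tar.gz "$target":/backups/   (hypothetical copy step)
```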

Hardware Requirements for Pandas Dataframe, Tensorflow and LightGBM
I would like information on the minimum hardware requirements, such as disk space, memory, processor and video card, for using the Pandas DataFrame, TensorFlow and LightGBM libraries.
Thank you!

select a table from variable explorer using Sqlite3 in python
I have a data frame I read into Python using pandas. I would like to use SQL to get information about this table. The name of the table is data and it is in the variable explorer. I tried the following code:
connection = sqlite3.connect("company.db")
cursor = connection.cursor()
cursor.execute("SELECT closing_rate from data")
I get the following error: "no such table: data". Is there a way to use SQL on tables saved in the variable explorer without creating them in the database?
I tried installing pandasql, which I read can help in this situation, but I get the error message: No module named 'pandasql'
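sqlite3 can only query tables that exist inside the database file, so a DataFrame that lives only in the variable explorer must first be written there with `to_sql`. A minimal sketch with made-up data (the column name mirrors the question):

```python
import sqlite3
import pandas as pd

# Stand-in for the DataFrame sitting in the variable explorer
data = pd.DataFrame({"closing_rate": [1.02, 0.98, 1.1]})

connection = sqlite3.connect(":memory:")  # or "company.db" on disk

# Register the DataFrame as a real table before querying it
data.to_sql("data", connection, index=False, if_exists="replace")

cursor = connection.cursor()
cursor.execute("SELECT closing_rate FROM data")
rows = cursor.fetchall()
print(rows)
```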

fill dataframe by matching its rows with the multi-level index of another dataframe
I have two dataframes: df1, a multi-level dataframe, and df2, which does not have any levels. I want to add the columns of df1 to df2 by matching the multi-level index of df1 with the rows of df2.
The code below works fine, but it converts df2 into a multi-level dataframe.
import pandas as pd

df1 = pd.DataFrame({
    'step 0': {('D1', 'E1', 'S1'): 0.372621, ('D1', 'E1', 'S2'): 0.10471400000000002, ('D1', 'E1', 'S3'): 0.0, ('D1', 'E1', 'S4'): 0.144627, ('D1', 'E1', 'Unknown'): 0.49122200000000005, ('D1', 'E2', 'S1'): 0.08583099999999999, ('D1', 'E2', 'S2'): 0.3366, ('D1', 'E2', 'S3'): 0.0, ('D1', 'E2', 'S4'): 0.0, ('D1', 'E2', 'Unknown'): 0.235332, ('D2', 'E1', 'S1'): 0.030488, ('D2', 'E1', 'S2'): 0.0, ('D2', 'E1', 'S3'): 0.0, ('D2', 'E1', 'S4'): 0.827896, ('D2', 'E1', 'Unknown'): 0.0, ('D2', 'E2', 'S1'): 0.061280999999999995, ('D2', 'E2', 'S2'): 0.124464, ('D2', 'E2', 'S3'): 0.0, ('D2', 'E2', 'S4'): 0.0, ('D2', 'E2', 'Unknown'): 0.0},
    'step 1': {('D1', 'E1', 'S1'): 0.21143499999999998, ('D1', 'E1', 'S2'): 0.10622899999999999, ('D1', 'E1', 'S3'): 0.270593, ('D1', 'E1', 'S4'): 0.065209, ('D1', 'E1', 'Unknown'): 0.18825799999999998, ('D1', 'E2', 'S1'): 0.328942, ('D1', 'E2', 'S2'): 0.18970499999999998, ('D1', 'E2', 'S3'): 0.448532, ('D1', 'E2', 'S4'): 0.0, ('D1', 'E2', 'Unknown'): 0.371369, ('D2', 'E1', 'S1'): 0.272635, ('D2', 'E1', 'S2'): 0.251659, ('D2', 'E1', 'S3'): 0.381712, ('D2', 'E1', 'S4'): 0.0, ('D2', 'E1', 'Unknown'): 0.189613, ('D2', 'E2', 'S1'): 0.223804, ('D2', 'E2', 'S2'): 0.252529, ('D2', 'E2', 'S3'): 0.045514, ('D2', 'E2', 'S4'): 0.034437999999999996, ('D2', 'E2', 'Unknown'): 0.239879},
    'step 2': {('D1', 'E1', 'S1'): 0.162299, ('D1', 'E1', 'S2'): 0.119725, ('D1', 'E1', 'S3'): 0.5406270000000001, ('D1', 'E1', 'S4'): 0.060129999999999996, ('D1', 'E1', 'Unknown'): 0.158279, ('D1', 'E2', 'S1'): 0.233738, ('D1', 'E2', 'S2'): 0.314877, ('D1', 'E2', 'S3'): 0.5514680000000001, ('D1', 'E2', 'S4'): 0.24836799999999998, ('D1', 'E2', 'Unknown'): 0.171224, ('D2', 'E1', 'S1'): 0.190137, ('D2', 'E1', 'S2'): 0.30941399999999997, ('D2', 'E1', 'S3'): 0.351985, ('D2', 'E1', 'S4'): 0.172104, ('D2', 'E1', 'Unknown'): 0.611961, ('D2', 'E2', 'S1'): 0.171979, ('D2', 'E2', 'S2'): 0.388104, ('D2', 'E2', 'S3'): 0.125909, ('D2', 'E2', 'S4'): 0.0, ('D2', 'E2', 'Unknown'): 0.25806399999999996},
    'step 3': {('D1', 'E1', 'S1'): 0.149502, ('D1', 'E1', 'S2'): 0.172926, ('D1', 'E1', 'S3'): 0.18878, ('D1', 'E1', 'S4'): 0.272958, ('D1', 'E1', 'Unknown'): 0.162242, ('D1', 'E2', 'S1'): 0.242986, ('D1', 'E2', 'S2'): 0.15881800000000001, ('D1', 'E2', 'S3'): 0.0, ('D1', 'E2', 'S4'): 0.751632, ('D1', 'E2', 'Unknown'): 0.22207399999999997, ('D2', 'E1', 'S1'): 0.153442, ('D2', 'E1', 'S2'): 0.43892700000000007, ('D2', 'E1', 'S3'): 0.266302, ('D2', 'E1', 'S4'): 0.0, ('D2', 'E1', 'Unknown'): 0.198426, ('D2', 'E2', 'S1'): 0.271795, ('D2', 'E2', 'S2'): 0.23490300000000003, ('D2', 'E2', 'S3'): 0.190519, ('D2', 'E2', 'S4'): 0.0, ('D2', 'E2', 'Unknown'): 0.502057},
    'step 4': {('D1', 'E1', 'S1'): 0.104143, ('D1', 'E1', 'S2'): 0.49640500000000004, ('D1', 'E1', 'S3'): 0.0, ('D1', 'E1', 'S4'): 0.45707600000000004, ('D1', 'E1', 'Unknown'): 0.0, ('D1', 'E2', 'S1'): 0.108503, ('D1', 'E2', 'S2'): 0.0, ('D1', 'E2', 'S3'): 0.0, ('D1', 'E2', 'S4'): 0.0, ('D1', 'E2', 'Unknown'): 0.0, ('D2', 'E1', 'S1'): 0.353298, ('D2', 'E1', 'S2'): 0.0, ('D2', 'E1', 'S3'): 0.0, ('D2', 'E1', 'S4'): 0.0, ('D2', 'E1', 'Unknown'): 0.0, ('D2', 'E2', 'S1'): 0.27114, ('D2', 'E2', 'S2'): 0.0, ('D2', 'E2', 'S3'): 0.638058, ('D2', 'E2', 'S4'): 0.965562, ('D2', 'E2', 'Unknown'): 0.0}})
df2 = pd.DataFrame({'DT': ['D1', 'D1', 'D2', 'D2', 'D1', 'D2'],
                    'RE': ['E1', 'E1', 'E1', 'E2', 'E1', 'E1'],
                    'DS': ['S1', 'S2', 'S2', 'S3', 'S1', 'S2']})
df2 = df2[['DT', 'RE', 'DS']]
print(df1)
print(df2)
m_idx = pd.MultiIndex.from_arrays(df2.T.values)
m = pd.DataFrame(index=m_idx, columns=df1.columns)
m.update(df1)
print(m)
output of print(m):

            step 0    step 1    step 2    step 3    step 4
D1 E1 S1  0.372621  0.211435  0.162299  0.149502  0.104143
      S2  0.104714  0.106229  0.119725  0.172926  0.496405
D2 E1 S2         0  0.251659  0.309414  0.438927         0
   E2 S3         0  0.045514  0.125909  0.190519  0.638058
D1 E1 S1  0.372621  0.211435  0.162299  0.149502  0.104143
D2 E1 S2         0  0.251659  0.309414  0.438927         0
I want to add the columns to df2 like this:

   DT  RE  DS    step 0    step 1    step 2    step 3    step 4
0  D1  E1  S1  0.372621  0.211435  0.162299  0.149502  0.104143
1  D1  E1  S2  0.104714  0.106229  0.119725  0.172926  0.496405
2  D2  E1  S2         0  0.251659  0.309414  0.438927         0
3  D2  E2  S3         0  0.045514  0.125909  0.190519  0.638058
4  D1  E1  S1  0.372621  0.211435  0.162299  0.149502  0.104143
5  D2  E1  S2         0  0.251659  0.309414  0.438927         0
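A flat result can be had by merging df2's key columns against df1's MultiIndex, which keeps df2's ordinary RangeIndex. A tiny sketch on reduced stand-in data (same key structure, fewer rows and columns):

```python
import pandas as pd

# Tiny stand-in: df1 carries a 3-level MultiIndex, df2 is flat with key columns
df1 = pd.DataFrame(
    {"step 0": [0.372621, 0.104714, 0.251659]},
    index=pd.MultiIndex.from_tuples(
        [("D1", "E1", "S1"), ("D1", "E1", "S2"), ("D2", "E1", "S2")]
    ),
)
df2 = pd.DataFrame(
    {"DT": ["D1", "D2", "D1"], "RE": ["E1", "E1", "E1"], "DS": ["S1", "S2", "S1"]}
)

# merge the key columns against df1's index; the result stays a flat frame
out = df2.merge(df1, left_on=["DT", "RE", "DS"], right_index=True, how="left")
out = out.reset_index(drop=True)
print(out)
```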

Normalising images before learning in pytorch
I want to apply a transform to standardise the images in my dataset before learning in PyTorch. I hear this improves learning dramatically. I think PyTorch by default divides all image pixel values by 255 before putting them into tensors; does this pose a problem for standardization? The online guide recommends proceeding in the following way:
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
However, the 0.5 here is just an example I found; it is not the mean or variance of my data's channels.
So my question is: how are the mean and standard deviation derived? Do we need to flatten all the green pixel values of the input pictures, calculate their mean and standard deviation, and then repeat for the other colours? Is that how it's done?
I heard there is another approach that calculates an "average picture" to standardise with. What is the difference in the result?
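The usual recipe is indeed per-channel statistics over every pixel of every image (after ToTensor's division by 255, so on values in [0, 1]). A sketch with random arrays standing in for a real dataset (shapes are made up: 8 RGB images of 4x4):

```python
import numpy as np

# Hypothetical dataset in (N, C, H, W) layout, values already scaled to [0, 1]
# (ToTensor() performs that scaling, dividing uint8 pixels by 255)
rng = np.random.default_rng(0)
images = rng.random((8, 3, 4, 4))

# One mean and one std per channel, pooled over all images and pixels:
mean = images.mean(axis=(0, 2, 3))
std = images.std(axis=(0, 2, 3))
print(mean, std)

# These per-channel numbers are what transforms.Normalize(mean, std) expects.
```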

Multiple Users in er diagrams
I need to do a library management database project. In this system I'll have students and faculty that can borrow books, but the catch is that the number of books a faculty member can borrow is different from the number a student can borrow, and the same goes for the duration they can borrow a book for. I don't have any idea how to design this. I figured that I need a book table, but the question is: do I need separate tables for students and faculty, or do I combine them into one table called members?

Stock data linear regression by sklearn
I am using sklearn to do linear regression on a set of stock price data. After I normalized the data, the MSE all becomes 0.
Why do I get an MSE of 0 everywhere? Somebody said it's a problem with the model, but I am a Python newbie and really need help. Thanks in advance! Here is one example row from the dataset:
tdate        stock_id  open  close  high  low  volume
04/01/2000   1         100   98     101   98   283100
The code:
from sklearn.linear_model import LinearRegression
from sklearn import cross_validation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Normalizer

stock1 = file[['open','close','high','low','volume']].where(file['stock_id'] == 1)
X_stock1 = stock1.drop(['close'], axis=1)
y_stock1 = stock1['close']
X_stock1_train, X_stock1_test, y_stock1_train, y_stock1_test = train_test_split(
    X_stock1, y_stock1, train_size=0.8, random_state=42)

# fill in missing values with the median
X_stock1_train = Imputer(missing_values='NaN', strategy='median', axis=0).fit_transform(X_stock1_train)
y_stock1_train = y_stock1_train.reshape(1, 1)
y_stock1_train = Imputer(missing_values='NaN', strategy='median', axis=0).fit_transform(y_stock1_train)

# normalize the stock data
transformer = Normalizer().fit(X_stock1_train, y_stock1_train)
X_stock1_train = transformer.transform(X_stock1_train)
y_stock1_train = transformer.transform(y_stock1_train)
LinearRegression = LinearRegression()
scores = cross_validation.cross_val_score(LinearRegression, X_stock1_train, y_stock1_train,
                                          scoring='neg_mean_squared_error', cv=10)
result:
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
Average accuracy for Linear Regression: 0.0
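One likely culprit is that the target was transformed along with the features: Normalizer rescales each row to unit norm, so after transforming y the "MSE" no longer measures anything about prices. A hedged sketch of the usual setup on synthetic data (not the asker's dataset): scale only X, inside a pipeline so the scaling is refit per CV fold, and leave y untouched. Note also that the `cross_validation` module has been removed from recent sklearn in favour of `model_selection`, and the scoring string must not contain spaces:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the stock features and target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, -2.0, 0.7, 0.0]) + rng.normal(scale=0.1, size=200)

# Scale the features inside the CV loop; the target y is left as-is
model = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=10)
print(-scores.mean())  # average MSE: small, but not exactly zero
```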

Text mining of online news data in tm shows different data in the final output
Why is this error occurring in text mining of online news data? I web-scraped some online news (downloaded as a text file) using the rvest package and cleaned the data using tm in R, but when I checked the result it was something different. Then I manually copy-pasted the content from the web into a txt file and cleaned it, but it showed the same thing. I attached a screenshot of the result of the data after cleaning in tm. Please, can anybody tell me where I am wrong?
pandas cleaning Dataframe
I am currently learning pandas and have an issue cleaning my DataFrame:
"TIMESTAMP","RECORD","WM1_u_ms","WM1_v_ms","WM1_w_ms","WM2_u_ms","WM2_v_ms","WM2_w_ms","WS1_u_ms","WS1_v_ms"
"20180406 14:31:11.5",29699805,2.628,4.629,0.599,3.908,7.971,0.47,2.51,7.18
"20180406 14:31:11.75",29699806,3.264,4.755,0.095,2.961,6.094,0.504,2.47,7.18
"20180406 14:31:12",29699807,1.542,5.793,0.698,4.95,4.91,0.845,2.18,7.5
"20180406 14:31:12.25",29699808,2.527,5.207,0.012,4.843,6.285,0.924,2.15,7.4
"20180406 14:31:12.5",29699809,3.511,4.528,1.059,2.986,5.636,0.949,3.29,5.54
"20180406 14:31:12.75",29699810,3.445,3.957,0.075,3.127,6.561,0.259,3.85,5.45
"20180406 14:31:13",29699811,2.624,5.238,0.166,3.451,7.199,0.242,3.94,6.24

df = pd.read_csv(FilePath, parse_dates=True)  # read the csv file and save it into a variable
df = df.drop(['RECORD'], axis=1)
I do not understand why pandas recognizes some columns as float64 and others as object. Do you have any clue? Because of this, I started trying to convert the columns on my own:
df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'])
df['WM1_u_ms':] = df.iloc[:, df.columns != 'TIMESTAMP'].values.astype(float)
But I get an error:
cannot do slice indexing on <class 'pandas.core.indexes.range.RangeIndex'> with these indexers [WM1_u_ms] of <class 'str'>
Why can't pandas read the .dat file correctly from the start, and what is my mistake in converting it? In the next step I want to interpolate via df.interpolate() to clear the NaNs.
thanks for any help!
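Columns usually come out as object when a single stray non-numeric value sneaks in; `pd.to_numeric(errors="coerce")` converts everything it can and turns the rest into NaN, which `interpolate()` can then fill. A sketch on a reduced sample of the file (the malformed "4,755" value is made up to trigger the object dtype):

```python
import io
import pandas as pd

# Small sample of the file, with one deliberately malformed value ("4,755")
raw = '''"TIMESTAMP","RECORD","WM1_u_ms","WM1_v_ms"
"20180406 14:31:11.5",29699805,2.628,4.629
"20180406 14:31:11.75",29699806,3.264,"4,755"
"20180406 14:31:12.25",29699807,1.542,5.793
'''

# parse_dates needs to be told WHICH column holds the dates
df = pd.read_csv(io.StringIO(raw), parse_dates=["TIMESTAMP"])
df = df.drop(["RECORD"], axis=1)

# One stray string makes the whole column object dtype;
# to_numeric(errors="coerce") turns such values into real NaN
num_cols = df.columns.drop("TIMESTAMP")
for col in num_cols:
    df[col] = pd.to_numeric(df[col], errors="coerce")

df[num_cols] = df[num_cols].interpolate()  # the NaNs can now be filled
print(df.dtypes)
```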

Dictionary not handling multiple values
I am trying to create a dataframe of states and cities.
Each state name in the table I am reading from ends with the letters [edit]; cities, on the other hand, end with (text)[number].
I have used regex to remove the text within the parentheses and square brackets, saved states in a list for states and cities in another list for cities.
I then converted these two lists into a dictionary with the state as the key and city as the value.
However, there are 517 cities, and when I do this I lose 467 of them. I'm guessing that's because, as it currently stands, I am not allowing my dictionary to handle multiple values per key. My goal is to create a dataframe of 517x2 dimensions, with a state column and a city column (each city matched to its state). If I create a dataframe from this dictionary, I therefore only get 50x2 instead of 517x2 dimensions.
My questions are: i) is my reasoning correct; ii) how should I think about solving this problem / how should I solve it; and iii) is the code that I have written the most efficient way of reaching my end goal?
import pandas as pd
import numpy as np
import re

state = []
city = []
with open("university_towns.txt", "r") as i:
    uni = i.readlines()
for st in uni:
    if "[edit]" in st:
        state.append(re.sub("[\\[].*?[\\]]\s", "", st))
    else:
        city.append(re.sub("[\(\[].*?[\)\]]\s", "", st))
city_st = dict(zip(state, city))
# need to take the key-value pairs/items from the dictionary
s = pd.Series(city_st, name='RegionName')
s.index.name = 'State'
s = s.reset_index()
s
ADD: not quite sure how to add the relevant data for this question
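On point i): yes, `dict(zip(state, city))` keeps only one city per state key, which is where the cities go missing. Building (state, city) pairs while walking the file preserves every city and feeds directly into a two-column DataFrame. A sketch on a few made-up lines in the file's format:

```python
from collections import defaultdict

# Sample lines mimicking the file: states end with [edit], cities follow
lines = [
    "Alabama[edit]",
    "Auburn (Auburn University)[1]",
    "Florence (University of North Alabama)",
    "Alaska[edit]",
    "Fairbanks (University of Alaska Fairbanks)[2]",
]

pairs = []                    # (state, city) rows -> a 517x2 DataFrame later
by_state = defaultdict(list)  # alternative: state -> list of cities
current = None
for line in lines:
    if "[edit]" in line:
        current = line.split("[")[0].strip()
    else:
        city = line.split(" (")[0].strip()
        pairs.append((current, city))
        by_state[current].append(city)

print(pairs)
```

From here, `pd.DataFrame(pairs, columns=['State', 'RegionName'])` gives the desired one-row-per-city frame.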

t-test for each factor level
I need to compare the median and IQR (or mean + SD) of a blood-sample value between two groups for each level of a time-interval factor. Ideally, it should ultimately look like the output from the tableone package. For instance:
                    Stratified by group
                    1                  2                  p      test
  n                 158                154
  time (mean (sd))  2015.62 (1094.12)  1996.86 (1155.93)  0.883
  Time interval
    A val_BS mean   2.4 (0.9)          1.3 (0.5)          !!!
    B val_BS mean   1.3 (0.3)          1.9 (0.8)          !!!
    C val_BS mean   2.8 (0.1)          2.9 (1.0)          !!!
Therefore, I need a method to split a vector (estra) by each level of a time-interval factor (gacat) and compare these data between the two levels of an outcome factor variable (EPL, yes/no).
I get sort of what I need with:
d_l %>%
  group_by(gacat, EPL) %>%
  summarise(mean = mean(na.omit(estra)),
            sd = sd(na.omit(estra)),
            n = n_distinct(patientid))
And for the respective tests by each level of gacat:
d_l %>% filter(gacat == "<6 weeks") %>% summarise(pval = t.test(estra ~ EPL)$p.value)
d_l %>% filter(gacat == "6-8 weeks") %>% summarise(pval = t.test(estra ~ EPL)$p.value)
d_l %>% filter(gacat == "8-10 weeks") %>% summarise(pval = t.test(estra ~ EPL)$p.value)
d_l %>% filter(gacat == "10-12 weeks") %>% summarise(pval = t.test(estra ~ EPL)$p.value)
d_l %>% filter(gacat == "12+ weeks") %>% summarise(pval = t.test(estra ~ EPL)$p.value)
However, as I have many samples to evaluate, I would like a more automatic output.
I've looked at multiple solutions on this forum but fail to find an elegant way to solve this. I'd suspect there is a one-liner for such comparisons, but I cannot seem to find it.

How to perform a paired t-test without the whole sample data?
I have several data summaries for a paired t-test from three different sources (hospitals), which means I don't have the whole sample data. What I have to perform the paired t-test with is the mean of the differences of each paired sample.
The data looks like the table below (each row is one patient), and I have the mean(diff) and sd(diff):

3 months   6 months   diff
1          3          2
2          1          1
5          9          4

Is there any function with which I could easily perform a paired t-test and get the 95% CI? There is just no way of getting the whole data set, due to patient privacy concerns.
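A paired t-test is equivalent to a one-sample t-test of the differences against zero, so mean(diff), sd(diff) and n are all that is needed; the raw data never enters the formulas t = mean / (sd/√n) and CI = mean ± t_crit · sd/√n. A sketch using SciPy's t distribution (the numbers fed in at the end are the differences 2, 1, 4 from the example table):

```python
import math
from scipy import stats

def paired_t_from_summary(mean_diff, sd_diff, n, alpha=0.05):
    """Paired t-test from summary statistics of the within-pair differences.

    Equivalent to a one-sample t-test of the differences against 0,
    so only mean(diff), sd(diff) and n are required -- not the raw data.
    """
    se = sd_diff / math.sqrt(n)
    t = mean_diff / se
    dof = n - 1
    p = 2 * stats.t.sf(abs(t), dof)                  # two-sided p-value
    crit = stats.t.ppf(1 - alpha / 2, dof)           # critical t for the CI
    ci = (mean_diff - crit * se, mean_diff + crit * se)
    return t, p, ci

# Differences 2, 1, 4 from the table: mean = 7/3, sd = sqrt(7/3), n = 3
print(paired_t_from_summary(7 / 3, (7 / 3) ** 0.5, 3))
```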

optimal sample size for control/test group for t-test
Recently we launched a feature on one of our website pages. I have six months of historical data about the page, including impressions and CTR. It has been 20 days since we launched the feature, and now we want to know whether there is any significant lift in CTR post-launch. Is there a way to determine how many impressions are needed for a statistically significant test of the lift in CTR, assuming page views before launch are the control group and post-launch views are the test group? How much historical data do I need to look at for the control group, and how do I evaluate the required sample size for the test group based on that? Any lead or different approach is highly appreciated.
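Since CTR is a proportion, a standard power-analysis formula for a two-proportion z-test gives the required impressions per group from the baseline CTR, the smallest lift worth detecting, the significance level, and the desired power. A sketch (the 2% baseline and 2.4% target CTR at the end are made-up illustration values):

```python
import math
from scipy.stats import norm

def n_per_group(p1, p2, alpha=0.05, power=0.8):
    """Impressions needed per group to detect a CTR change from p1 to p2
    with a two-proportion z-test (standard power-analysis formula)."""
    z_a = norm.ppf(1 - alpha / 2)          # critical value for significance
    z_b = norm.ppf(power)                  # critical value for power
    p_bar = (p1 + p2) / 2
    num = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p1 - p2) ** 2)

# e.g. baseline CTR of 2%, hoping to detect a lift to 2.4%
print(n_per_group(0.02, 0.024))
```

Smaller lifts need far more impressions, which also answers how much pre-launch history to pull: enough to pin down the baseline CTR and accumulate the computed sample size.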