How to get previous row value after a shift compare?
I have a dataframe like the one below; the desired outcome is what I am trying to compute.
I want the daily price difference, "price - estimate_price", but the "estimate_price" should be yesterday's estimate_price. For example, for id a, the "price difference" for 11/2/20 should be "11/2/20's price - 11/1/20's estimate_price", which is 7 - 7 = 0.
I have tried the code below to match the day. The match succeeds, but the "estimate_price" it picks up is from the current row, not the previous row. Is there any way to get the previous row's value after I complete a shift compare?
df['price'] - df[df['done_date'].eq((df['estimate_date'] + pd.Timedelta(days=1)).shift())]['estimate_price']
Thanks in advance for any help!
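For reference, one common way to reach the previous row's value without a date join is a per-id shift; here is a minimal sketch on made-up data (column names assumed from the description):

```python
import pandas as pd

# Toy frame mimicking the question's layout
df = pd.DataFrame({
    'id': ['a', 'a', 'a'],
    'done_date': pd.to_datetime(['2020-11-01', '2020-11-02', '2020-11-03']),
    'price': [6, 7, 8],
    'estimate_price': [7, 7, 9],
})

# Previous row's estimate within each id, then the difference;
# the first row of each id has no previous estimate, so it is NaN
df['price_diff'] = df['price'] - df.groupby('id')['estimate_price'].shift(1)
print(df['price_diff'].tolist())
```

For 11/2/20 this yields 7 - 7 = 0, matching the example in the question. The shift assumes the rows are already sorted by date within each id.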
See also questions close to this topic
-
Sparse Matrix Creation : KeyError: 579 for text datasets
I am trying to use the make_sparse_matrix function to create a sparse matrix for my text dataset, but it fails with KeyError: 579. Does anyone have any leads on the root cause of the error?
def make_sparse_matrix(df, indexed_words, labels):
    """
    Returns sparse matrix as dataframe.
    df: A dataframe with words in the columns with a document id as an index (X_train or X_test)
    indexed_words: index of words ordered by word id
    labels: category as a series (y_train or y_test)
    """
    nr_rows = df.shape[0]
    nr_cols = df.shape[1]
    word_set = set(indexed_words)
    dict_list = []

    for i in range(nr_rows):
        for j in range(nr_cols):
            word = df.iat[i, j]
            if word in word_set:
                doc_id = df.index[i]
                word_id = indexed_words.get_loc(word)
                category = labels.at[doc_id]

                item = {'LABEL': category, 'DOC_ID': doc_id,
                        'OCCURENCE': 1, 'WORD_ID': word_id}
                dict_list.append(item)

    return pd.DataFrame(dict_list)

make_sparse_matrix(X_train, word_index, y_test)
X_train is a DF that contains one single word in each cell, word_index contains all the index of words and y_test stores all labels.
The Key Error I am facing is:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\New folder\envs\geo_env\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3079         try:
-> 3080             return self._engine.get_loc(casted_key)
   3081         except KeyError as err:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 579

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
 in

 in make_sparse_matrix(df, indexed_words, labels)
     20             doc_id = df.index[i]
     21             word_id = indexed_words.get_loc(word)
---> 22             category = labels.at[doc_id]
     23
     24             item = {'LABEL': category, 'DOC_ID': doc_id,

~\New folder\envs\geo_env\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   2154             return self.obj.loc[key]
   2155
-> 2156         return super().__getitem__(key)
   2157
   2158     def __setitem__(self, key, value):

~\New folder\envs\geo_env\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   2101
   2102         key = self._convert_key(key)
-> 2103         return self.obj._get_value(*key, takeable=self._takeable)
   2104
   2105     def __setitem__(self, key, value):

~\New folder\envs\geo_env\lib\site-packages\pandas\core\series.py in _get_value(self, label, takeable)
    959
    960         # Similar to Index.get_value, but we do not fall back to positional
--> 961         loc = self.index.get_loc(label)
    962         return self.index._get_values_for_loc(self, loc, label)
    963

~\New folder\envs\geo_env\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3080             return self._engine.get_loc(casted_key)
   3081         except KeyError as err:
-> 3082             raise KeyError(key) from err
   3083
   3084         if tolerance is not None:

KeyError: 579
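The traceback points at `category = labels.at[doc_id]`: the doc id (579) comes from the index of the dataframe that was passed in, but is looked up in the labels series; note the call pairs `X_train` with `y_test`, whose indices need not overlap. A minimal reproduction of that lookup, on toy data with hypothetical labels:

```python
import pandas as pd

# Toy labels indexed by doc ids 0..2; doc id 579 is absent
labels = pd.Series(['spam', 'ham', 'spam'], index=[0, 1, 2])

try:
    labels.at[579]  # same lookup as `labels.at[doc_id]` in the traceback
except KeyError as err:
    print('KeyError:', err)

# Passing labels aligned with the same frame (e.g. y_train alongside X_train)
# keeps every doc_id resolvable and avoids the KeyError.
```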
-
Finding part of string in list of strings
GCM = ([519, 520, 521, 522, 533], [534, 525], [526, 527, 530, 531], [4404])
slice = int(str(df["CGM"][row_count])[:3])
I am looking through a row in a CSV file and taking out the number I want: the number that starts with one of the numbers I have in GCM, since those represent info I want in other columns. This has worked fine with the slice above, because all the numbers I wanted started with 3 digits. Now that I need to look for any number that starts with 4404, and later on probably 57052, the slice approach no longer works. Is there a way that, instead of slicing and comparing to the list, I can take a 5-digit number and check whether part of it is in the list, preferably matching on 3 or more leading digits? The real point of this part of the code is finding out which sub-list of GCM the number belongs to. It needs to take the number 44042 and know that the part I care about is in GCM[3], but on the other hand it should not say that 32519 is in GCM[0], since I only care about numbers that start with 519, not ones that end with it.
PS. I am Norwegian and have been teaching myself programming (it has been some long nights), so something here may be lost in translation.
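A prefix test over each sub-list sidesteps the fixed-width slice entirely; a minimal sketch (the helper name `find_group` is made up):

```python
GCM = ([519, 520, 521, 522, 533], [534, 525], [526, 527, 530, 531], [4404])

def find_group(number, groups):
    """Return the index of the sub-list containing a prefix of `number`,
    or None when nothing matches. Only leading digits count, so 32519
    does not match 519."""
    text = str(number)
    for idx, group in enumerate(groups):
        if any(text.startswith(str(prefix)) for prefix in group):
            return idx
    return None

print(find_group(44042, GCM))  # 3    (44042 starts with 4404)
print(find_group(51901, GCM))  # 0    (starts with 519)
print(find_group(32519, GCM))  # None (519 appears, but not at the start)
```

`str.startswith` makes the prefix length irrelevant, so 3-, 4-, and 5-digit prefixes like 519, 4404, and 57052 all work with the same code.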
-
How to forecast a time series out-of-sample using an ARIMA model in Python?
I have seen similar questions on Stack Overflow, but either the questions were different enough or, if similar, they were not actually answered. I gather it is something that modelers run into often and find challenging to solve.
In my case I am using two variables, one Y and one X, with 50 sequential time-series observations. Both are random numbers representing % changes (they could be anything; their true values do not matter, this is just to set up an example of my coding problem). Here is the basic code to build this ARIMAX(1,0,0) model.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_excel('/Users/gaetanlion/Google Drive/Python/Arima/df.xlsx', sheet_name = 'final')

from statsmodels.tsa.arima_model import ARIMA

endo = df['y']
exo = df['x']
Next, I build the ARIMA model, using the first 41 observations
modelho = sm.tsa.arima.ARIMA(endo.loc[0:40], exo.loc[0:40], order = (1,0,0)).fit()
print(modelho.summary())
So far everything works just fine.
Next, I attempt to forecast or predict the next 9 observations out-of-sample. Here I want to use the X values over these 9 observations to predict Y, and I just can't do it. Below is the one piece of code that I think gets me closest to where I need to go.
modelho.predict(exo.loc[41:49], start = 41, end = 49, dynamic = False)

TypeError: predict() got multiple values for argument 'start'
-
Create a new column based on one specific cell of data
I am posting to request help with formatting a list of Excel sheets, creating a new column based on one particular cell of data.
My DF looks similar to the following:
               1               2        3            4         5            6           7
NA             NA              NA       NA           NA        NA           NA          NA
NA             NA              Oct 2020 NA           NA        NA           NA          NA
NA             NA              Total    NA           Consumer  NA           Commercial  NA
Spending State Metro Area      Sales    Transaction  Sales     Transaction  Sales       Transaction
AK             Anchorage, AK   9000     120          2000      60           7000        60
AL             Montgomery, AL  8000     130          2000      30           6000        1000

I have a list of files which import similarly to this. So far I have processed them as follows to form a list:
# Copying files to R working directory

# OneDrive location (source)
DF_Onedrive <- "C:/Users/-----"

# R Project (working directory)
DF <- "C:/Users/-----"

# List of files to be copied
list_of_DF <- list.files(DF_Onedrive, "*.xls")

# Copying over to WD
file.copy(file.path(DF_Onedrive, list_of_DF), DF)

# Reading data from R Project inputs
data_DF <- list.files(path = "C:/Users/-----", pattern = '*.xls', full.names = TRUE)
I now want to compile the list together as one file. The source files are quarterly, but have tabs separated by months in the quarter such as for month 1, month 2, month 3.
The approach I was going for was similar to:
for (file in data_DF) {
  # read in the xls and clean
  M1_MSA <- read_excel(file, sheet = 10)
}
Here M1 represents month 1 and pulls from sheet 10; I would then run a subsequent loop for M2 (sheet 11) and M3 (sheet 12). I would have a single output file for each month, which I would later append together.
My question is about the cleaning I would need to do during this loop. In particular, for each file I need to place the date (here Oct 2020) in a column that repeats that value for the sheet being read in, looped for each sheet. I need something similar for Total, Consumer, and Commercial in a new "Segment" column, which merges the 'Sales' columns and 'Transaction' columns.
The data in the end should look like:
Date      Spending State  Metro Area     Segment     Sales  Transaction
Oct 2020  AK              Anchorage, AK  Total       9000   120
Oct 2020  AK              Anchorage, AK  Consumer    2000   60
Oct 2020  AK              Anchorage, AK  Commercial  7000   60
-
How to extract rows from a dataframe that contain only certain values
I have this data set:
| Country             | Languages Spoken
| Afghanistan         | Dari Persian, Pashtu (both official), other Turkic and minor languages
| Algeria             | Arabic (official), French, Berber dialects
| Andorra             | Catalán (official), French, Castilian, Portuguese
| Angola              | Portuguese (official), Bantu and other African languages
| Antigua and Barbuda | English (official), local dialects
| Australia           | English 79%, native and other languages
and I want to extract all the English-speaking countries. I think the easiest way would be to extract all the countries that have the word 'English' in the languages. Ideally I want a new dataframe with an "English speaking" column holding values True or False.
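A substring test with `str.contains` gives exactly that boolean column; a minimal sketch using a few rows from the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Afghanistan', 'Algeria', 'Antigua and Barbuda', 'Australia'],
    'Languages Spoken': [
        'Dari Persian, Pashtu (both official), other Turkic and minor languages',
        'Arabic (official), French, Berber dialects',
        'English (official), local dialects',
        'English 79%, native and other languages',
    ],
})

# True wherever 'English' appears anywhere in the languages string
df['English speaking'] = df['Languages Spoken'].str.contains('English')

# The boolean column doubles as a filter for the English-speaking rows
print(df[df['English speaking']]['Country'].tolist())
```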
-
The pandas value error still shows, but the code is totally correct and it loads normally the visualization
I could simply use pd.options.mode.chained_assignment = None, but I want code that is genuinely clean of the warning. My starting code:
import datetime
import altair as alt
import operator
import pandas as pd

s = pd.read_csv('../../data/aparecida-small-sample.csv', parse_dates=['date'])
city = s[s['city'] == 'Aparecida']
Based on @dpkandy's code:
city['total_cases'] = city['totalCases']
city['total_deaths'] = city['totalDeaths']
city['total_recovered'] = city['totalRecovered']

tempTotalCases = city[['date', 'total_cases']]
tempTotalCases["title"] = "Confirmed"

tempTotalDeaths = city[['date', 'total_deaths']]
tempTotalDeaths["title"] = "Deaths"

tempTotalRecovered = city[['date', 'total_recovered']]
tempTotalRecovered["title"] = "Recovered"

temp = tempTotalCases.append(tempTotalDeaths)
temp = temp.append(tempTotalRecovered)

totalCases = alt.Chart(temp).mark_bar().encode(alt.X('date:T', title = None), alt.Y('total_cases:Q', title = None))
totalDeaths = alt.Chart(temp).mark_bar().encode(alt.X('date:T', title = None), alt.Y('total_deaths:Q', title = None))
totalRecovered = alt.Chart(temp).mark_bar().encode(alt.X('date:T', title = None), alt.Y('total_recovered:Q', title = None))

(totalCases + totalRecovered + totalDeaths).encode(
    color=alt.Color('title',
                    scale = alt.Scale(range = ['#106466', '#DC143C', '#87C232']),
                    legend = alt.Legend(title="Legend colour"))
).properties(title = "Cumulative number of confirmed cases, deaths and recovered", width = 800)
This code works perfectly and loads the visualization normally, but pandas still shows the warning, asking me to try
.loc[row_indexer,col_indexer] = value instead
I then read the documentation section "Returning a view versus a copy" that the warning links to and tried the following, but it still shows the same warning. Here is the code with loc:

# 1st attempt
tempTotalCases.loc["title"] = "Confirmed"
tempTotalDeaths.loc["title"] = "Deaths"
tempTotalRecovered.loc["title"] = "Recovered"

# 2nd attempt
tempTotalCases["title"].loc = "Confirmed"
tempTotalDeaths["title"].loc = "Deaths"
tempTotalRecovered["title"].loc = "Recovered"
Here is the error message:
<ipython-input-6-f16b79f95b84>:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tempTotalCases["title"] = "Confirmed"

<ipython-input-6-f16b79f95b84>:9: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tempTotalDeaths["title"] = "Deaths"

<ipython-input-6-f16b79f95b84>:12: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tempTotalRecovered["title"] = "Recovered"
Jupyter and Pandas version:
$ jupyter --version
jupyter core     : 4.7.1
jupyter-notebook : 6.3.0
qtconsole        : 5.0.3
ipython          : 7.22.0
ipykernel        : 5.5.3
jupyter client   : 6.1.12
jupyter lab      : 3.1.0a3
nbconvert        : 6.0.7
ipywidgets       : 7.6.3
nbformat         : 5.1.3
traitlets        : 5.0.5

$ pip show pandas
Name: pandas
Version: 1.2.4
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: None
Author-email: None
License: BSD
Location: /home/gus/PUC/.env/lib/python3.9/site-packages
Requires: pytz, python-dateutil, numpy
Required-by: ipychart, altair
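For reference, SettingWithCopyWarning is usually silenced for real by taking each slice as an explicit copy, so that later column assignments operate on an independent frame rather than a possible view of the original. A minimal sketch on made-up data mirroring the question's variable names:

```python
import pandas as pd

s = pd.DataFrame({'city': ['Aparecida', 'Other'],
                  'date': pd.to_datetime(['2020-10-01', '2020-10-01']),
                  'totalCases': [10, 5]})

# .copy() makes the slice its own frame, not a view of `s`
city = s[s['city'] == 'Aparecida'].copy()
city['total_cases'] = city['totalCases']

# Same pattern for each temporary frame before adding the "title" column
tempTotalCases = city[['date', 'total_cases']].copy()
tempTotalCases['title'] = 'Confirmed'  # assigns on its own copy: no warning

print(tempTotalCases['title'].tolist())
```

Note that `.loc["title"] = ...` in the question's first attempt adds a row labeled "title" rather than a column, which is why that rewrite did not help; the warning is about the chained slice, not the assignment syntax itself.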