Python Pandas: Assign particular value of Group to All Entries of That Group in dataframe
I have a data frame with the columns below, from which I need to create a new column by grouping on Name. The status should be updated to active for the entire group if it is active for any one St.
data = {'Name':['Tom', 'Tom', 'krish', 'jack','jack','Nam','sue'],
'St':['S-123', 'S-290', 'S-123', 'S-147','S-98','S-123','S-38'],
'Status':['Inactive','active','active','inactive','active','Inactive','Inactive']}
df = pd.DataFrame(data)
The desired output is
Name St Status New Status
0 Tom S-123 Inactive active
1 Tom S-290 active active
2 krish S-123 active active
3 jack S-147 inactive active
4 jack S-98 active active
5 Nam S-123 Inactive Inactive
6 sue S-38 Inactive Inactive
1 answer
-
answered 2021-04-08 04:21
Marco
You don't need an explicit groupby; you can do it in one line with
df['New Status'] = df.apply(lambda row: "active" if any(df.query('Name == @row.Name').Status == 'active') else "inactive", axis=1)
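For larger frames, note that the apply/query approach re-scans the whole frame once per row. A grouped transform does the same thing in a single pass; here is a sketch on the sample data from the question:

```python
import pandas as pd

data = {'Name': ['Tom', 'Tom', 'krish', 'jack', 'jack', 'Nam', 'sue'],
        'St': ['S-123', 'S-290', 'S-123', 'S-147', 'S-98', 'S-123', 'S-38'],
        'Status': ['Inactive', 'active', 'active', 'inactive', 'active', 'Inactive', 'Inactive']}
df = pd.DataFrame(data)

# For each Name group, check whether any row is 'active' and broadcast
# that group-level answer back to every row of the group.
is_active = df['Status'].eq('active').groupby(df['Name']).transform('any')
df['New Status'] = is_active.map({True: 'active', False: 'inactive'})

print(df)
```

Note the comparison is exact, so 'Inactive' and 'inactive' both count as not active, matching the one-liner above.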
See also questions close to this topic
-
Sparse Matrix Creation : KeyError: 579 for text datasets
I am trying to use the make_sparse_matrix function below to create a sparse matrix for my text dataset, and I get KeyError: 579. Does anyone have any leads on the root cause of the error?
def make_sparse_matrix(df, indexed_words, labels):
    """
    Returns sparse matrix as dataframe.
    df: A dataframe with words in the columns with a document id as an index (X_train or X_test)
    indexed_words: index of words ordered by word id
    labels: category as a series (y_train or y_test)
    """
    nr_rows = df.shape[0]
    nr_cols = df.shape[1]
    word_set = set(indexed_words)
    dict_list = []

    for i in range(nr_rows):
        for j in range(nr_cols):
            word = df.iat[i, j]
            if word in word_set:
                doc_id = df.index[i]
                word_id = indexed_words.get_loc(word)
                category = labels.at[doc_id]

                item = {'LABEL': category, 'DOC_ID': doc_id,
                        'OCCURENCE': 1, 'WORD_ID': word_id}
                dict_list.append(item)

    return pd.DataFrame(dict_list)

make_sparse_matrix(X_train, word_index, y_test)
X_train is a DF that contains one single word in each cell, word_index contains all the index of words and y_test stores all labels.
The Key Error I am facing is:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\New folder\envs\geo_env\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3079         try:
-> 3080             return self._engine.get_loc(casted_key)
   3081         except KeyError as err:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 579

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input> in <module>

<ipython-input> in make_sparse_matrix(df, indexed_words, labels)
     20                 doc_id = df.index[i]
     21                 word_id = indexed_words.get_loc(word)
---> 22                 category = labels.at[doc_id]
     23
     24                 item = {'LABEL': category, 'DOC_ID': doc_id,

~\New folder\envs\geo_env\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   2154             return self.obj.loc[key]
   2155
-> 2156         return super().__getitem__(key)
   2157
   2158     def __setitem__(self, key, value):

~\New folder\envs\geo_env\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   2101
   2102         key = self._convert_key(key)
-> 2103         return self.obj._get_value(*key, takeable=self._takeable)
   2104
   2105     def __setitem__(self, key, value):

~\New folder\envs\geo_env\lib\site-packages\pandas\core\series.py in _get_value(self, label, takeable)
    959
    960         # Similar to Index.get_value, but we do not fall back to positional
--> 961         loc = self.index.get_loc(label)
    962         return self.index._get_values_for_loc(self, loc, label)
    963

~\New folder\envs\geo_env\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3080             return self._engine.get_loc(casted_key)
   3081         except KeyError as err:
-> 3082             raise KeyError(key) from err
   3083
   3084         if tolerance is not None:

KeyError: 579
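One thing that stands out in the call is that the rows come from X_train while the labels come from y_test. The traceback fails at labels.at[doc_id], which raises exactly this KeyError whenever a document id from the rows is missing from the labels' index. A minimal reproduction of that failure mode, on hypothetical data (the real frames are not shown in the question):

```python
import pandas as pd

# Rows indexed by document id 579, as X_train would be
X_train = pd.DataFrame({'w0': ['hello']}, index=[579])

# Labels indexed by *different* document ids, as happens when the
# test labels are paired with the training rows
y_test = pd.Series([1], index=[101])

try:
    y_test.at[579]            # same lookup as labels.at[doc_id]
except KeyError as err:
    print('KeyError:', err)   # KeyError: 579

# Labels from the matching split look up cleanly
y_train = pd.Series([1], index=[579])
print(y_train.at[579])        # 1
```

If this is the cause, calling make_sparse_matrix(X_train, word_index, y_train) would be the fix to try.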
-
Finding part of string in list of strings
GCM = ([519, 520, 521, 522, 533], [534, 525], [526, 527, 530, 531], [4404])
slice = int(str(df["CGM"][row_count])[:3])
I am looking through a row in a csv file and taking out the number I want. I want the number that starts with one of the numbers I have in GCM, since they represent info I want in other columns. This has worked fine with the slice above, because all the numbers I wanted started with 3 digits. Now that I need to look for any number that starts with 4404, and later on probably 57052 as well, the slice no longer works. Is there a way I can, instead of slicing and comparing to the list, take a 5-digit number and check whether part of it is in the list, preferably matching on its first 3 or more digits? The real point of this part of the code is finding out which list inside GCM the number is in. It needs to take the number 44042 and know that the part I care about is in GCM[3], but on the other hand it should not say that 32519 is in GCM[0], since I only care about numbers that start with 519, not ones that end with it.

PS. I am Norwegian and have been learning programming by myself (it has been some long nights), so something here may be lost in translation.
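One way to do this lookup without fixed-width slicing is to compare string prefixes with startswith, trying longer prefixes first. This is a sketch against the GCM tuple from the question; the helper name find_group is made up for illustration:

```python
GCM = ([519, 520, 521, 522, 533], [534, 525], [526, 527, 530, 531], [4404])

def find_group(number):
    """Return the index of the GCM sub-list containing a prefix of
    the number, or None if no prefix matches."""
    text = str(number)
    # Flatten GCM into (prefix, group_index) pairs and sort them
    # longest-prefix-first, so 4404 is tried before 3-digit prefixes.
    candidates = sorted(
        ((str(prefix), i) for i, group in enumerate(GCM) for prefix in group),
        key=lambda pair: len(pair[0]), reverse=True)
    for prefix, i in candidates:
        if text.startswith(prefix):
            return i
    return None

print(find_group(44042))  # 3     (starts with 4404)
print(find_group(51901))  # 0     (starts with 519)
print(find_group(32519))  # None  (contains 519 but does not start with it)
```

Because only startswith is used, a number that merely ends with 519 never matches.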
-
How to forecast a time series out-of-sample using an ARIMA model in Python?
I have seen similar questions on Stack Overflow, but either the questions were different enough or, if similar, they have not actually been answered. I gather it is something that modelers run into often and find challenging to solve.
In my case I am using two variables, one Y and one X, with 50 sequential time-series observations. They are both random numbers representing % changes (they could be anything you want; their true values do not matter, this is just to set up an example of my coding problem). Here is the basic code to build this ARIMAX(1,0,0) model.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_excel('/Users/gaetanlion/Google Drive/Python/Arima/df.xlsx', sheet_name='final')

from statsmodels.tsa.arima_model import ARIMA

endo = df['y']
exo = df['x']
Next, I build the ARIMA model, using the first 41 observations
modelho = sm.tsa.arima.ARIMA(endo.loc[0:40], exo.loc[0:40], order=(1, 0, 0)).fit()
print(modelho.summary())
So far everything works just fine.
Next, I attempt to forecast or predict the next 9 observations out-of-sample. Here I want to use the X values over these 9 observations to predict Y, and I just can't do it. I am showing below the one attempt that I think gets me closest to where I need to go.
modelho.predict(exo.loc[41:49], start = 41, end = 49, dynamic = False)

TypeError: predict() got multiple values for argument 'start'
-
How to download and silent install .exe file with given URL using Python 3
I have a URL which is a download link to a software .exe file (Dynamic but the url function fetches it correctly each time). The intended operation is to use Python 3 to download the said file and then do a silent installation.
url = get_internal_link(check_if_inserted)
url = str(url)
httpurl = re.sub("ftp://", "https://", url)
downloadurl = httpurl.replace("'", "").replace(',', '').replace(":2100/FTP Folders/Software", "")
downloadurl = downloadurl.strip("(").strip(")")
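With the cleaned-up URL in hand, the download-then-silent-install part can be sketched with the standard library alone. The silent flag depends on how the particular .exe was packaged: '/S' is the NSIS convention, other toolkits use /quiet or /s, so that part is an assumption to verify against the actual software. The sample URL below is hypothetical:

```python
import re
import subprocess
import urllib.request

def clean_url(raw):
    """Turn the raw ftp-style link into an https download URL,
    mirroring the string cleanup from the question."""
    url = str(raw)
    httpurl = re.sub("ftp://", "https://", url)
    downloadurl = (httpurl.replace("'", "")
                          .replace(",", "")
                          .replace(":2100/FTP Folders/Software", ""))
    return downloadurl.strip("(").strip(")")

def download_and_install(downloadurl, dest="installer.exe"):
    # Save the .exe to disk, then run it silently.
    # '/S' assumes an NSIS installer; adjust for the real package.
    urllib.request.urlretrieve(downloadurl, dest)
    subprocess.run([dest, "/S"], check=True)

print(clean_url("('ftp://host:2100/FTP Folders/Software/setup.exe',)"))
# https://host/setup.exe
```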
-
The pandas warning still shows, but the code is correct and the visualization loads normally
I could use
pd.options.mode.chained_assignment = None
to silence it, but I want code that is clean of the warning. My starting code:
import datetime
import altair as alt
import operator
import pandas as pd

s = pd.read_csv('../../data/aparecida-small-sample.csv', parse_dates=['date'])
city = s[s['city'] == 'Aparecida']
Based on @dpkandy's code:
city['total_cases'] = city['totalCases']
city['total_deaths'] = city['totalDeaths']
city['total_recovered'] = city['totalRecovered']

tempTotalCases = city[['date', 'total_cases']]
tempTotalCases["title"] = "Confirmed"
tempTotalDeaths = city[['date', 'total_deaths']]
tempTotalDeaths["title"] = "Deaths"
tempTotalRecovered = city[['date', 'total_recovered']]
tempTotalRecovered["title"] = "Recovered"

temp = tempTotalCases.append(tempTotalDeaths)
temp = temp.append(tempTotalRecovered)

totalCases = alt.Chart(temp).mark_bar().encode(alt.X('date:T', title=None), alt.Y('total_cases:Q', title=None))
totalDeaths = alt.Chart(temp).mark_bar().encode(alt.X('date:T', title=None), alt.Y('total_deaths:Q', title=None))
totalRecovered = alt.Chart(temp).mark_bar().encode(alt.X('date:T', title=None), alt.Y('total_recovered:Q', title=None))

(totalCases + totalRecovered + totalDeaths).encode(
    color=alt.Color('title',
                    scale=alt.Scale(range=['#106466', '#DC143C', '#87C232']),
                    legend=alt.Legend(title="Legend colour"))
).properties(title="Cumulative number of confirmed cases, deaths and recovered", width=800)
This code works and loads the visualization normally, but pandas still shows the warning, asking me to try
.loc[row_indexer,col_indexer] = value instead
I then read the documentation section "Returning a view versus a copy" that the warning links to, and also tried the code below, but it still shows the same warning. Here is the code with .loc:

# 1st attempt
tempTotalCases.loc["title"] = "Confirmed"
tempTotalDeaths.loc["title"] = "Deaths"
tempTotalRecovered.loc["title"] = "Recovered"

# 2nd attempt
tempTotalCases["title"].loc = "Confirmed"
tempTotalDeaths["title"].loc = "Deaths"
tempTotalRecovered["title"].loc = "Recovered"
Here is the error message:
<ipython-input-6-f16b79f95b84>:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tempTotalCases["title"] = "Confirmed"

<ipython-input-6-f16b79f95b84>:9: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tempTotalDeaths["title"] = "Deaths"

<ipython-input-6-f16b79f95b84>:12: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tempTotalRecovered["title"] = "Recovered"
Jupyter and Pandas version:
$ jupyter --version
jupyter core     : 4.7.1
jupyter-notebook : 6.3.0
qtconsole        : 5.0.3
ipython          : 7.22.0
ipykernel        : 5.5.3
jupyter client   : 6.1.12
jupyter lab      : 3.1.0a3
nbconvert        : 6.0.7
ipywidgets       : 7.6.3
nbformat         : 5.1.3
traitlets        : 5.0.5

$ pip show pandas
Name: pandas
Version: 1.2.4
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: None
Author-email: None
License: BSD
Location: /home/gus/PUC/.env/lib/python3.9/site-packages
Requires: pytz, python-dateutil, numpy
Required-by: ipychart, altair
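The warning comes from assigning into frames (tempTotalCases and friends) that are slices of city, which is itself a slice of s, so pandas cannot tell whether the assignment should reach the original. Taking an explicit .copy() at each slice point is the usual way to make the warning go away without suppressing it. A sketch on hypothetical data shaped like the CSV:

```python
import warnings
import pandas as pd

s = pd.DataFrame({'city': ['Aparecida', 'Aparecida'],
                  'date': pd.to_datetime(['2021-04-01', '2021-04-02']),
                  'totalCases': [10, 12]})

# .copy() makes each frame its own object instead of a view of s,
# so later column assignments are unambiguous.
city = s[s['city'] == 'Aparecida'].copy()
city['total_cases'] = city['totalCases']

tempTotalCases = city[['date', 'total_cases']].copy()
with warnings.catch_warnings():
    warnings.simplefilter('error')   # any warning here would raise
    tempTotalCases['title'] = 'Confirmed'

print(tempTotalCases)
```

The .loc attempts in the question fail because .loc["title"] addresses a row labelled "title", and assigning to the .loc attribute itself does nothing; .loc is for indexed assignment like df.loc[rows, 'title'] = value, not for creating a column on a slice.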
-
SQLAlchemy Core GROUP BY calculated field
I literally can't find this in the documentation anywhere, nor can I find a related question (my guess is because I don't know how to ask it), so I'm really hoping this is low-hanging fruit.
I'm trying to rewrite this in SQLAlchemy Core:
SELECT
    A.COLUMN1,
    CASE
        WHEN A.COLUMN2 = "IN" THEN "HELLO"
        WHEN A.COLUMN2 = "OUT" THEN "WORLD"
        ELSE "!"
    END AS MYCASE,
    SUM(A.COLUMN3)
FROM MY_FUN_TABLE A
GROUP BY A.COLUMN1, MYCASE;
This is all super simple except for the MYCASE part. How do I put the calculated field MYCASE into the GROUP BY clause?

sql = select(a.c.COLUMN1,
             case([(a.c.COLUMN2 == "IN", "HELLO"),
                   (a.c.COLUMN2 == "OUT", "WORLD")],
                  else_="!").label("MYCASE"),
             func.sum(a.c.COLUMN3))
sql = sql.groupby(a.c.COLUMN1, **?????**)
Thank you all in advance for your thoughts.
For reference:

python.__version__ == 3.9.4
sqlalchemy.__version__ == 1.4.7
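Two things seem to be going on here: the Core method is group_by (with an underscore), not groupby, and the usual way to group on the calculated field is to keep the labeled case() expression in a variable and pass that same object to group_by. A self-contained sketch against a hypothetical table with the question's column names, using the 1.4-style positional case():

```python
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        case, func, select)

metadata = MetaData()
a = Table('MY_FUN_TABLE', metadata,
          Column('COLUMN1', String),
          Column('COLUMN2', String),
          Column('COLUMN3', Integer))

# Keep the labeled expression in a variable so the same object can be
# reused in both the select list and the GROUP BY.
mycase = case((a.c.COLUMN2 == 'IN', 'HELLO'),
              (a.c.COLUMN2 == 'OUT', 'WORLD'),
              else_='!').label('MYCASE')

sql = (select(a.c.COLUMN1, mycase, func.sum(a.c.COLUMN3))
       .group_by(a.c.COLUMN1, mycase))

print(sql)  # compiled SQL with the CASE in both SELECT and GROUP BY
```

Whether the dialect renders the label name or the full CASE expression in the GROUP BY is handled by SQLAlchemy's compiler; either form is a valid translation of the original query.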