How to perform a difference-in-differences regression in R?
I have a data frame containing the week-over-week change in spending for 50 states (in percent), the date of reopening, and a boolean column indicating whether the state has reopened or not.
# A tibble: 1,377 x 5
# Groups:   state [51]
   state statefips day_month_year R2to1 changeVsLastWeekSpend
   <chr> <chr>     <date>         <lgl>                 <dbl>
 1 AK    02        2020-01-12     FALSE              NA
 2 AK    02        2020-01-19     FALSE               0.0219
 3 AK    02        2020-01-26     FALSE               0.0262
 4 AK    02        2020-02-02     FALSE               0.00165
 5 AK    02        2020-02-09     FALSE               0.0271
 6 AK    02        2020-02-16     FALSE               0.0258
 7 AK    02        2020-02-23     FALSE               0.0409
 8 AK    02        2020-03-01     FALSE               0.0517
 9 AK    02        2020-03-08     FALSE               0.0976
10 AK    02        2020-03-15     FALSE               0.0160
I would like to perform a DID regression on the data, but I am unsure if it is even possible, since each state reopened at a different time. If it were possible, how would you do it in R?
I was thinking of the following regression (should I use fixed effects, i.e. model = "within"?):
plm(changeVsLastWeekSpend ~ R2to1,
    data = filter(Affinity_State_Weekly.csv.p,
                  WeekAfterReopening2to1 > -4 & WeekAfterReopening2to1 < 4,
                  R2to1True == TRUE),
    model = "within")
But I am unsure if the output is truly a DID regression. I am sorry if this question is easy, but I am a novice at R and econometrics.
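Staggered reopening dates do not rule out DiD: one common approach is two-way fixed effects, i.e. state dummies plus week dummies plus the reopening indicator, which is what plm(changeVsLastWeekSpend ~ R2to1, model = "within", effect = "twoways") estimates. As a self-contained sanity check of that mechanic (sketched in Python/numpy rather than R; all sizes, reopening weeks, and the true effect here are made up), the treatment coefficient recovers a known effect from simulated staggered reopenings:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_weeks, effect = 6, 10, -0.05          # hypothetical sizes; true DiD effect

# Staggered reopening: each state switches to "treated" in a different week
open_week = np.array([3, 4, 5, 5, 6, 7])
rows = []
for s in range(n_states):
    for w in range(n_weeks):
        treated = float(w >= open_week[s])
        y = 0.02 * s + 0.01 * w + effect * treated + rng.normal(0, 0.001)
        rows.append((s, w, treated, y))
data = np.array(rows)

# Two-way fixed effects: state dummies, week dummies (one dropped), treatment flag
S = np.eye(n_states)[data[:, 0].astype(int)]
W = np.eye(n_weeks)[data[:, 1].astype(int)][:, 1:]
X = np.column_stack([S, W, data[:, 2]])
beta, *_ = np.linalg.lstsq(X, data[:, 3], rcond=None)
print(round(beta[-1], 3))                          # estimated effect, close to -0.05
```

The last coefficient is the DiD estimate; in R the analogous call would be the plm "within" model with twoways effects, with the usual caveats about heterogeneous treatment timing.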
See also questions close to this topic

How do I separate the columns imported from an excel spreadsheet in order to create multiple separate lists
So, I was transferring a spreadsheet into R and wound up with this:
read_excel("C:\\Users\\wsu\\Downloads\\Massachusetts Infections by County, Population Density, and Daily Temperature (Statistics Begin 3_9_20).xlsx", sheet = "Sheet1")
New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* ...
# A tibble: 51 x 15
   `Infections per~ ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10 ...11
 1 "Time (in Days)~ Suff~ Midd~ Norf~ Essex Bris~ Hamp~ Worc~ Plym~ Hamp~ Barn~
 2 "1"              10    15    10    0     0     0     1     0     0     0
 3 "2"              20    41    22    1     0     0     1     0     0     0
 4 "3"              19    44    23    1     0     0     1     0     0     0
 5 "4"              22    49    24    2     0     0     1     0     0     0
 6 "5"              26    60    24    2     0     0     2     0     0     0
 7 "6"              27    65    28    5     1     0     2     0     0     1
 8 "7"              31    75    31    6     1     1     6     1     0     1
 9 "8"              36    83    36    8     2     1     6     3     0     1
10 "9"              42    89    43    8     5     1     8     5     0     2
# ... with 41 more rows, and 4 more variables: ...12, ...13, ...14, ...15
Now, in order to create lists for use in an exponential regression, I need to split each column of this table into different lists saved under separate names. There are 50 numbers in each column to be put into a list. How might I go about programming this?
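Separate named lists are usually unnecessary: a single named collection keyed by column works just as well (in R, as.list() on the tibble gives exactly that, or df$Suffolk for one column). The idea, sketched in Python with a tiny hypothetical stand-in for the imported sheet:

```python
import pandas as pd

# Hypothetical stand-in for the imported spreadsheet (two counties, three days)
df = pd.DataFrame({"Suffolk": [10, 20, 19], "Middlesex": [15, 41, 44]})

# One list per column, keyed by column name
columns_as_lists = {name: df[name].tolist() for name in df.columns}
print(columns_as_lists["Suffolk"])  # → [10, 20, 19]
```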

Split string to make df out of parts of the string
I have the following string
text <- "Species\n9.1.1 Dog A2002 AKITA CHOW The Akita Chow is a mixed\n breed. Large/independent,\n strong and loyal\n A2003 AMERICAN BULLDOG (BULLDOG) The american Bulldog is\n stocky and musical, but also\n agile and built for chasing\n animals\n9.1.2.Flying (or gliding) B101 BIG EARED BAT Townsend’s bigeared bat\nanimals9.1.2.Flying (or (Corynorhinus townsendii) is a\ngliding) animals species of vesper bat.\n"
Which comes from reading a pdf that looks like
I wish to obtain a df like:
  Species                            Animal
1 9.1.1 Dog                          A2002 AKITA CHOW
2 9.1.1 Dog                          A2003 AMERICAN BULLDOG (BULLDOG)
3 9.1.2. Flying (or gliding) animals B101 BIG EARED BAT
The only thing that seems consistent/has no errors is the uppercase column (animal) for example A2002 AKITA CHOW, that's why I thought the most logical thing to do is to split everything before and after the uppercase part.
I tried
# split where whitespace is followed by a capital letter and digits
strsplit(text, "(?<=\\s)(?=[A-Z][0-9]+)", perl = TRUE)
Anybody have suggestions? Thanks in advance :)
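The lookbehind/lookahead split idea does work once the character classes read [A-Z] and [0-9]; since the pattern is PCRE-style, the same regex can be checked in Python (on a simplified stand-in text, not the real PDF extract):

```python
import re

# Simplified, hypothetical stand-in for the extracted PDF text
text = ("9.1.1 Dog A2002 AKITA CHOW mixed breed "
        "A2003 AMERICAN BULLDOG stocky "
        "B101 BIG EARED BAT vesper bat")

# Split where whitespace is followed by a capital letter plus digits (the codes)
parts = re.split(r"(?<=\s)(?=[A-Z][0-9]+)", text)
print(parts)
```

Each element after the first then starts with one animal code, which can be split once more into the code and the description.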

Scraping tables from Excel files into R
I have several excel files (*.xlsx) and I want to import them into R, but each file has 6 to 7 tables in a single sheet, separated by chunks of text, like the picture.
I know how to import several Excel files using a loop, but my issue is that I cannot figure out how to select each of the tables distributed along each sheet, avoid the rows with text, and bind them. Also, each table in each Excel file starts in a different cell, so I cannot just define a coordinate (a specific cell) to import the tables, and the files differ in their number of rows. I'll appreciate any help.

Converting a JSON Link into a Pandas DataFrame
I have a JSON data source: https://data.cdc.gov/api/views/x8jftxib/rows.json and I want to convert this data into a pandas DataFrame.
If you look at the JSON dataset, it consists of metadata followed by the actual data. I would like to store the metadata in one file and the dataset in another on my local system.
I have developed the method below, but I am not able to get it to work completely:
from urllib.request import urlopen
import json

# Get the dataset
url = "https://data.cdc.gov/api/views/x8jftxib/rows.json"
response = urlopen(url)

# Convert bytes to string type and string type to dict
string = response.read().decode('utf8')
json_obj = json.loads(string)
The step above parses the JSON file into a dictionary. When I then try to convert it into a pandas DataFrame using:
pd.DataFrame([json_obj.items()])
the output is not what I expect. Please help me with this! I appreciate it.
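pd.DataFrame([json_obj.items()]) puts the whole dict into one row, which is why the output looks wrong. On the assumption that rows.json follows the usual Socrata layout (column names under meta.view.columns, records under data; worth verifying against the real file), metadata and records can be split apart like this. The inline JSON below is a tiny stand-in, not the real payload:

```python
import json
import pandas as pd

# Tiny stand-in mimicking the assumed Socrata rows.json layout
raw = """{
  "meta": {"view": {"columns": [{"name": "state"}, {"name": "cases"}]}},
  "data": [["AK", 10], ["AL", 25]]
}"""
json_obj = json.loads(raw)

# Keep the metadata and the records separate, as asked
meta = json_obj["meta"]
columns = [c["name"] for c in meta["view"]["columns"]]
df = pd.DataFrame(json_obj["data"], columns=columns)
print(df.shape)  # → (2, 2)
```

meta can then be written to one file (e.g. with json.dump) and df to another (e.g. df.to_csv).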

R: Create sample with at least one element from each category
For a linear regression to predict house prices, I need to split the data into train and test samples in an 80%/20% proportion.
However, some of the variables are factors, a few of which have just one observation.
Because of this, after random sampling those factor levels can end up in the test sample but not in the train sample.
When predicting the sale price on the test set, I then get the error: "Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor Exterior1st has new levels ImStucc"
Here is the summary of the train sample of Exterior1st variable:
> summary(train$Exterior1st)
AsbShng AsphShn BrkComm BrkFace  CBlock CemntBd HdBoard ImStucc MetalSd Plywood   Stone  Stucco
     11       0       1      36       0      41     173       0     164      78       2      17
VinylSd Wd Sdng WdShing
    389     140      17
Here is summary of the test sample of Exterior1st variable:
> summary(test$Exterior1st)
AsbShng AsphShn BrkComm BrkFace  CBlock CemntBd HdBoard ImStucc MetalSd Plywood   Stone  Stucco
      4       0       0       8       1      11      37       1      37      22       0       4
VinylSd Wd Sdng WdShing
     97      43       3
As you can see, the ImStucc level is present in the test sample but not in the train sample, which is why the predict function throws the error mentioned above.
In my search for a solution, I came across a function called "stratified", but I could not get it to work in R.
There was another solution using dplyr's group_by, but there you have to specify the number of observations for each group. That is not suitable for this dataset, as it would require a calculation for every factor.
Another suggested solution samples a vector alone, not the data frame, so it does not help either:

t <- sample(c(filtered_data$Exterior1st, sample(filtered_data$Exterior1st, size = 1000, replace = TRUE)))
> table(t)
t
  1   3   4   5   6   7   8   9  10  11  12  13  14  15
 26   2  74   2  91 375   1 345 168   3  37 848 329  36

The sampling above gives a total of 2,337 entries even though the size given is 1000 (the full vector is concatenated with the resample), so this is perhaps not what I'm looking for.
Is there a method to create a sample of 80% of the data such that at least one observation of every factor level is present in it?
If there isn't, what is the workaround for this situation?
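One workaround (sketched in pandas rather than R, with a tiny hypothetical data frame): take one row from every factor level first, then top the sample up at random. The rare levels are then guaranteed to be in the training sample:

```python
import pandas as pd

# Hypothetical data: level "ImStucc" occurs only once
df = pd.DataFrame({"Exterior1st": ["VinylSd"] * 8 + ["ImStucc", "HdBoard"],
                   "SalePrice": range(10)})

# Guarantee coverage: one row per level first, then fill the rest at random
seed = pd.concat([g.sample(1, random_state=1)
                  for _, g in df.groupby("Exterior1st")])
rest = df.drop(seed.index).sample(frac=0.7, random_state=1)  # top up toward 80%
train = pd.concat([seed, rest])
print(sorted(train["Exterior1st"].unique()))
```

The same two-step idea works in R (e.g. take one row per level with dplyr's slice_sample per group, then sample the remainder); the exact 80% proportion has to be adjusted for the seeded rows.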

Downsampling in Pandas DataFrame
Given a DataFrame with a timestamp column (ts), I'd like to downsample these by the hour. Values that were previously indexed by ts should now be divided across hours in proportion to the number of minutes spent in each hour. [note: divide data in these ratios for the NaN columns while resampling]
                   ts event  produced
0 2020-09-09 21:01:00     a        12
1 2020-09-10 00:10:00     a        22
2 2020-09-10 01:31:00     a       130
3 2020-09-10 01:50:00     b        60
4 2020-09-10 01:51:00     b        50
5 2020-09-10 01:59:00     b        26
6 2020-09-10 02:01:00     c        72
7 2020-09-10 02:51:00     b        51
8 2020-09-10 03:01:00     b        63
9 2020-09-10 04:01:00     c        79

# Create Data
def create_dataframe():
    df = pd.DataFrame([{'a': 12,  'b': 'a', 'c': 'Hello',     'ts': '2020-09-09 21:01:00'},
                       {'a': 22,  'b': 'a', 'c': 'Hello1',    'ts': '2020-09-10 00:10:00'},
                       {'a': 130, 'b': 'a', 'c': 'Hello2',    'ts': '2020-09-10 00:31:00'},
                       {'a': 60,  'b': 'b', 'c': 'Hello3',    'ts': '2020-09-10 00:59:00'},
                       {'a': 50,  'b': 'b', 'c': 'Hello4',    'ts': '2020-09-10 01:01:00'},
                       {'a': 26,  'b': 'b', 'c': 'Hello5',    'ts': '2020-09-10 01:30:00'},
                       {'a': 72,  'b': 'c', 'c': 'Hello6',    'ts': '2020-09-10 02:01:00'},
                       {'a': 51,  'b': 'b', 'c': 'Hello4',    'ts': '2020-09-10 02:51:00'},
                       {'a': 63,  'b': 'b', 'c': 'Hello5',    'ts': '2020-09-10 03:01:00'},
                       {'a': 79,  'b': 'c', 'c': 'Hello6',    'ts': '2020-09-10 04:01:00'},
                       {'a': 179, 'b': 'c', 'c': 'EVENT_3.5', 'ts': '2020-09-10 06:05:00'}])
    df.ts = pd.to_datetime(df.ts)
    return df
I want to estimate produced based on the ratio of time spent. It can be compared to lines of code: how many lines have I completed, and how many actual lines per hour?
For example: at "2020-09-10 00:10:00" we have 22. Then over the period 21:01 - 00:10 (189 minutes) we produced, per hour:
59 min of the 21:00 hour -> 7  =ROUND(22/189*59, 0)
60 min of the 22:00 hour -> 7  =ROUND(22/189*60, 0)
60 min of the 23:00 hour -> 7  =ROUND(22/189*60, 0)
10 min of the 00:00 hour -> 1  =ROUND(22/189*10, 0)
the result should be something like.
                   ts event  produced
0 2020-09-09 20:00:00     a       NaN
1 2020-09-09 21:00:00     a         7
2 2020-09-09 22:00:00     a         7
3 2020-09-09 23:00:00     a         7
4 2020-09-10 00:00:00     a         1
5 2020-09-10 01:00:00     b        ..
6 2020-09-10 02:00:00     c        ..
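The per-hour allocation described above can be sketched as a small helper (the function name is hypothetical; pandas is used only for the timestamp arithmetic) that splits one value across clock hours in proportion to the minutes spent in each:

```python
import pandas as pd

def spread_by_hour(start, end, value):
    """Split `value` over clock hours in proportion to minutes spent in each."""
    total = (end - start).total_seconds() / 60          # 189 min in the example
    out = {}
    cur = start
    while cur < end:
        # End of the current clock hour, capped at the overall end timestamp
        nxt = min(cur.floor("h") + pd.Timedelta(hours=1), end)
        mins = (nxt - cur).total_seconds() / 60
        out[cur.floor("h")] = round(value * mins / total)
        cur = nxt
    return out

shares = spread_by_hour(pd.Timestamp("2020-09-09 21:01"),
                        pd.Timestamp("2020-09-10 00:10"), 22)
print(list(shares.values()))  # → [7, 7, 7, 1]
```

Applying this row by row (each row paired with the previous timestamp) and summing per hour would give the desired hourly frame; rounding each share independently can make the parts differ slightly from the total.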

How to set a coefficient at a particular value, and retain the predictor in the model summary?
I am running a linear regression of the type below:
y <- lm(x ~ z, data)
I want the coefficient on z fixed at 0.8, and then I want to be able to extract the resulting estimate for z from the model output using the tidy function. I have had a look at offset(), but with it I am unable to see the z estimate in the model output, which I need for a summary table. Does it suffice to simply include I(z*0.8)? That would give the code below:
y <- lm(x ~ I(z*0.8), data)
Any help would be much appreciated.
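One caveat worth checking: I(z*0.8) does not pin the coefficient at 0.8. Least squares simply rescales the estimate by 1/0.8 and the fitted values are unchanged; offset(0.8*z), which adds the term with its coefficient fixed at 1, is what actually fixes the contribution. A minimal numpy check of the rescaling claim, on made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=200)
y = 2.0 + 3.0 * z + rng.normal(scale=0.1, size=200)

# Ordinary fit: intercept + z
X = np.column_stack([np.ones_like(z), z])
b_z, *_ = np.linalg.lstsq(X, y, rcond=None)

# The I(z * 0.8) version: same fit, slope rescaled by 1/0.8
Xs = np.column_stack([np.ones_like(z), 0.8 * z])
b_s, *_ = np.linalg.lstsq(Xs, y, rcond=None)

print(round(b_s[1] / b_z[1], 3))  # → 1.25  (= 1/0.8)
```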

Is it possible to train a neural network for a constant target?
I saw a paper (https://arxiv.org/abs/1910.10147) where they learn L based on the following cost function:
D_1 L(q(k-1), q(k)) + D_2 L(q(k), q(k+1)) = 0
Here D_i is the derivative with respect to the ith argument of L. This can be obtained via automatic differentiation and is fully compatible with backpropagation. So the target is always zero, independently of the input of L, but how does the optimization method know the real values of L for a given input? I mean, as long as the cost function is satisfied with a zero value the model is learning, but it can be learning wrong values for L. For instance, if the real value of D_1 L(q(k-1), q(k)) is 10, then the real value of D_2 L(q(k), q(k+1)) has to be -10. However, the neural network may be learning wrong values for L as long as they cancel, for example 3 and -3. Am I missing something, or does the method in that paper simply not make sense? Is it possible to train a neural network on a regression task with a constant target?

GridSearchCV model tuning for RegressorChain and supported algorithms
I am working on a two-output regression problem and, because the two output features are correlated, I am using RegressorChain. I tried base_estimator = LinearSVR with repeated k-fold cross-validation and mean absolute error, and got an MAE of 991.290 with a standard deviation of 128.681:

LSVR = LinearSVR()
reg_model = RegressorChain(LSVR, order=[0, 1])
reg_model.fit(X_train, y_train)
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=42)
cv_r2_scores_lsvr = cross_val_score(estimator=reg_model, X=X_train, y=y_train,
                                    scoring='neg_mean_absolute_error', cv=cv)
abs_cv_r2_scores_lsvr = absolute(cv_r2_scores_lsvr)
print('MAE: %.3f (%.3f)' % (mean(abs_cv_r2_scores_lsvr), std(abs_cv_r2_scores_lsvr)))
Should I change scoring = 'neg_mean_absolute_error', or should I change the base_estimator for RegressorChain? I could not find a list of supported base estimators at https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.RegressorChain.html
Also, I want to use GridSearchCV for RegressorChain. Does anything change about the base estimator when grid searching, or is the code below suitable?
# model tuning
svr_params = {'base_estimator__tol': [000.1, 00.1, 0.1],
              'base_estimator__max_iter': [1000, 2000, 3000]}
SVR = LinearSVR()
svr_model = RegressorChain(SVR, order=[0, 1])
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=42)
svr_cv_model = GridSearchCV(svr_model, svr_params, cv=cv, n_jobs=1)
svr_cv_model.fit(X_train, y_train)
print("Best parameters are: " + str(svr_cv_model.best_params_))

Is there a way I can control for the time interval R uses for my panel regression?
I am working on a panel regression in R, using a difference-in-differences approach.
I have a balanced sample with 91 weeks of data for 5378 bonds.
However, when I run my regression, R only covers the weeks T = 76-91.
Do you know a reason for this anomaly? And is there a way to include all weeks manually?
Thanks!

Entity-specific effects in plm
I estimated a "within" model with plm() (from the package plm). Now I am looking for a way to see the entity-specific effects. In the "pooling" model it was possible to control for entities, but the "within" model just ignores the variable. Is there a way to see the entity-specific effects?

How can I count observations (no duplicates)?
I'm an R user.
I have this dataset (Panel Data): Image_Link
I want to know how many distinct values "BvD.ID.number" takes; that number represents the number of enterprises in my dataset.
Thanks in advance, L.
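Counting distinct IDs is a one-liner: in R, length(unique(df$BvD.ID.number)) or dplyr::n_distinct(df$BvD.ID.number). The same idea in pandas, with a small made-up panel:

```python
import pandas as pd

# Hypothetical panel: one row per firm-year
panel = pd.DataFrame({"BvD.ID.number": ["A1", "A1", "B2", "B2", "C3"],
                      "year": [2018, 2019, 2018, 2019, 2018]})

# Number of distinct firms, regardless of how many years each appears
print(panel["BvD.ID.number"].nunique())  # → 3
```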