How to handle Date field in linear regression
I have the data below and am planning to fit a linear regression to it.
I started scripting but got stuck: the code throws an error because of the Date field (the independent variable). Can someone help me modify the code to convert the date field?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import datetime as dt
%matplotlib inline
dataset = pd.read_excel(r"Data containing Date Field.xlsx")
X = dataset['Date'].values.reshape(1,1)
y = dataset['Value'].values.reshape(1,1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
print("Y intercept is : ", regressor.intercept_)
print("Coefficient or slope is : ", regressor.coef_)
y_pred = regressor.predict(X_train)
Error Message:
TypeError: invalid type promotion
Regards,
Bharath Vikas
1 answer

First, as stated in the comments by @yuRa, you'll need to predict on X_test and not X_train.
Second, there are several things to think about.
In a linear regression we create a model
Y = x*beta
where Y is our target (e.g. age), x is our independent variable (e.g. weight) and beta is a parameter (how much Y increases when we increase x by 1). The beta values are what we find when we "solve the linear regression".
What you have is normally known as a "time series", i.e. values that depend on time (roughly speaking). If you want to fit a linear regression right off the bat, you need to convert your dates to plain numbers such as [1, 2, 3, 4, ...] (this works because the dates are equally spaced). You then get a regression with an intercept and one slope (1D).
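A minimal sketch of that conversion, assuming the spreadsheet has a datetime column named Date and a numeric column named Value as in the question. The data here is synthetic (the real file is not available), and the "days since the first date" encoding is one reasonable choice, not the only one. The TypeError in the question most likely comes from passing raw datetime values to the regressor, so the dates are turned into integers first:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spreadsheet (assumed columns: Date, Value).
dates = pd.date_range("2020-01-01", periods=30, freq="D")
dataset = pd.DataFrame({"Date": dates, "Value": 300.0 + np.arange(30)})

# Replace each date with the number of days since the first observation,
# so equally spaced dates become 0, 1, 2, ...
days = (dataset["Date"] - dataset["Date"].min()).dt.days

# scikit-learn expects a 2D feature matrix: one row per sample, one column.
X = days.to_numpy().reshape(-1, 1)
y = dataset["Value"].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Evaluate on the held-out rows, not on the training rows.
y_pred = regressor.predict(X_test)

print("Intercept:", regressor.intercept_)
print("Slope per day:", regressor.coef_[0])
```

Note that reshape(-1, 1) is used instead of reshape(1, 1): the latter only works for a single value, while -1 lets NumPy infer the number of rows.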
What you would normally do when time is a variable is known as time series analysis. You can fit an ordinary linear regression to such data, but you then need to think about the following:
 How many values in the past does the current value depend on?
Let's ignore time series models like ARIMA and say you think the current value depends on the 3 previous days (that is what we call an AR(3) model). You would then need to construct a new data set where each row consists of the values from the three prior days followed by the current value, e.g.
x3    x2    x1        value
300   301   302       303
301   302   303       304
302   303   304       305
. . .
311   312   230.367   269.032
where x3 is the value three days back, x2 is the value two days back and x1 is yesterday's value. Your regression is then
Y = x0 + beta_1*x1 + beta_2*x2 + beta_3*x3
(here x0 denotes the intercept).
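The lagged data set and the resulting fit could be sketched as follows. This is only an illustration under stated assumptions: make_lagged is a hypothetical helper, the series is synthetic rather than the asker's data, and NumPy's least-squares solver stands in for scikit-learn's LinearRegression (both solve the same ordinary least-squares problem).

```python
import numpy as np

def make_lagged(values, n_lags=3):
    """Return (X, y): each row of X holds the n_lags values preceding y.

    Columns are ordered oldest first (x3, x2, x1 for n_lags=3),
    the same column order as the example rows above.
    """
    values = np.asarray(values, dtype=float)
    n_rows = len(values) - n_lags
    X = np.column_stack([values[i:i + n_rows] for i in range(n_lags)])
    y = values[n_lags:]
    return X, y

# Synthetic series (an assumption, not the asker's data): a noisy upward trend.
rng = np.random.default_rng(0)
values = 300.0 + np.arange(40) + rng.normal(scale=2.0, size=40)

X, y = make_lagged(values, n_lags=3)

# Prepend a column of ones so the first coefficient acts as the intercept.
design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimise ||design @ coef - y||^2.
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, beta_3, beta_2, beta_1 = coef

# One-step-ahead forecast from the three most recent values.
next_value = (
    intercept
    + beta_3 * values[-3]
    + beta_2 * values[-2]
    + beta_1 * values[-1]
)
print("intercept:", intercept)
print("beta_1, beta_2, beta_3:", beta_1, beta_2, beta_3)
print("forecast for the next step:", next_value)
```

Each additional lag costs one row at the start of the series, so with n observations and 3 lags there are n - 3 usable rows.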