How to handle Date field in linear regression
I have the data below and am planning to fit a linear regression to it.
I've started scripting, but the code stops with an error caused by the Date field (the independent variable). Can someone help me modify the code to convert the date field?

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    import datetime as dt
    %matplotlib inline

    dataset = pd.read_excel(r"Data containing Date Field.xlsx")

    X = dataset['Date'].values.reshape(-1,1)
    y = dataset['Value'].values.reshape(-1,1)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    regressor = LinearRegression()
    regressor.fit(X_train, y_train)

    print("Y intercept is : ", regressor.intercept_)
    print("Coefficient or slope is : ", regressor.coef_)

    y_pred = regressor.predict(X_train)

TypeError: invalid type promotion
First, as stated in the comments by @yuRa, you'll need to predict on
There are several things to think about.
In a linear regression we create a model in which y is our target (e.g. age), x is our independent variable (e.g. weight), and beta is a parameter (how much y increases when we increase x by 1). The betas are the ones we find when we "solve the linear regression".
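To make "solving the linear regression" concrete, here is a minimal sketch for a single independent variable. The numbers are made up for illustration, and the model is assumed to include an intercept in addition to beta; the formulas are the standard closed-form least-squares estimates.

```python
# Made-up data: x is the independent variable, y the target.
# The points follow y = 2 + 3*x exactly, so the fit should recover those numbers.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [5.0, 8.0, 11.0, 14.0, 17.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form least-squares estimates for one independent variable
covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
variance = sum((xi - mean_x) ** 2 for xi in x)
beta = covariance / variance           # change in y per unit increase in x
intercept = mean_y - beta * mean_x     # value of y when x is 0

print("beta (slope):", beta)
print("intercept:", intercept)
```

In practice you would let a library such as scikit-learn compute these values; the sketch only shows what is being estimated.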
What you have is normally known as a "time series", i.e. values that depend on time (roughly speaking). If you want to fit a linear regression right off the bat, you need to convert your dates to plain numbers [1, 2, 3, 4, ...] (this works because the dates are equally spaced). You then get a regression with an intercept and one slope (1D).
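As a rough sketch of that conversion, assuming the Date column holds proper datetimes with one row per day: the DataFrame below is made-up data standing in for the Excel sheet, and the column names Date and Value are taken from the question.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up daily data standing in for the Excel sheet in the question
dataset = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "Value": [300.0 + i for i in range(10)],
})

# Replace each date by its day number counted from the first observation: 1, 2, 3, ...
day_number = (dataset["Date"] - dataset["Date"].min()).dt.days + 1

X = day_number.to_numpy().reshape(-1, 1)  # numeric feature, shape (n_samples, 1)
y = dataset["Value"].to_numpy()

regressor = LinearRegression().fit(X, y)
slope = regressor.coef_[0]
intercept = regressor.intercept_

print("Y intercept is:", intercept)
print("Slope per day is:", slope)
```

Because the feature is now an integer column rather than a datetime column, the fit no longer has to mix datetime and float types. If the dates were not equally spaced, the same subtraction would still give a usable "days since start" feature.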
What you would normally do when time is a variable is known as time series analysis. You can fit an ordinary linear regression to such data, but you then need to think about the following:
- How many values in the past does the current value depend on?
Let's ignore time series models like ARIMA and say you think the current value depends on the 3 previous days (that is what we call an AR(3) model). You would then need to construct a new data set where each row consists of the values from the three previous days, e.g.

    x3    x2    x1        value
    ------------------------------
    300   301   302       303
    301   302   303       304
    302   303   304       305
    .
    .
    .
    311   312   230.367   269.032

Here x3 is the value three days back, x2 the value two days back, and x1 the value yesterday.
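One way to build such a table is with pandas `shift`. This is a sketch under assumptions: the series below is made up, it has exactly one observation per day with no gaps, and the column names x3, x2, x1 and value simply mirror the table above.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up daily series standing in for the Value column
values = pd.Series(
    [300.0, 301.0, 303.0, 302.0, 305.0, 307.0, 306.0,
     309.0, 311.0, 310.0, 313.0, 315.0, 314.0],
    name="value",
)

# shift(k) moves the series down by k rows, so each row sees the value k days earlier
lagged = pd.DataFrame({
    "x3": values.shift(3),  # three days back
    "x2": values.shift(2),  # two days back
    "x1": values.shift(1),  # yesterday
    "value": values,        # today, the target
}).dropna()                 # the first three rows have no complete history

# Ordinary linear regression of today's value on the three lagged values
model = LinearRegression().fit(lagged[["x3", "x2", "x1"]], lagged["value"])

print(lagged.head(3))
print("Intercept:", model.intercept_)
print("Coefficients (x3, x2, x1):", model.coef_)
```

Note that the lagged table has three fewer rows than the original series, because the first three days do not have three predecessors.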
Your regression is then