How to handle Date field in linear regression

I have the data below and am planning to fit a linear regression to it.


I started writing the script but got stuck: it throws an error because of the Date field (the independent variable). Can someone help me modify the code to convert the date field?

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import datetime as dt
%matplotlib inline

dataset = pd.read_excel(r"Data containing Date Field.xlsx")

X = dataset['Date'].values.reshape(-1,1)
y = dataset['Value'].values.reshape(-1,1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

regressor = LinearRegression()  
regressor.fit(X_train, y_train)

print("Y intercept is : ", regressor.intercept_)
print("Coefficient or slope is : ", regressor.coef_)

y_pred = regressor.predict(X_train)

Error Message:

TypeError: invalid type promotion

Regards,

Bharath Vikas

1 answer

  • answered 2020-09-24 11:30 CutePoison

    First, as stated in the comments by @yuRa, you'll need to predict on X_test, not X_train.
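
    A minimal sketch of that first point, using synthetic data as a stand-in for the spreadsheet (the numbers and the noise level are assumptions for illustration only). It also swaps accuracy_score, which is meant for classification, for mean_squared_error and r2_score, which suit a regression:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): y = 2x + 1 plus small noise
rng = np.random.default_rng(0)
X = np.arange(50, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.5, size=50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predict on the held-out rows, not on the rows used for fitting
y_pred = regressor.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Test MSE:", mse)
print("Test R^2:", r2)
```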

    Second, there are several things to think about.

    In a linear regression we create a model Y = x*beta, where Y is our target (e.g. age), x is our independent variable (e.g. weight) and beta is a parameter (how much Y increases when we increase x by 1). The betas are what we find when we "solve the linear regression".

    What you have is normally known as a "time series", i.e. values that depend on time (roughly speaking). If you want to fit a linear regression right off the bat, you would need to convert your dates to plain numbers [1,2,3,4...] (this works because the dates are equally spaced). You would then get a regression with an intercept and one slope (1D).
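
    One way to do that conversion, as a sketch: the DataFrame below is a made-up stand-in for the Excel sheet (only the column names Date and Value come from the question). It counts days since the first date, so the index starts at 0 rather than 1, which only shifts the intercept. Counting days also handles gaps in the dates, if there are any.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the Excel sheet: daily dates and made-up values
dataset = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "Value": [300, 301, 302, 303, 304, 305, 306, 307, 308, 309],
})

# Make sure the column is a datetime type, then count days since the first date
dates = pd.to_datetime(dataset["Date"])
day_index = (dates - dates.min()).dt.days

# scikit-learn expects a 2D numeric feature array
X = day_index.to_numpy(dtype=float).reshape(-1, 1)
y = dataset["Value"].to_numpy(dtype=float)

regressor = LinearRegression()
regressor.fit(X, y)

print("Intercept:", regressor.intercept_)
print("Slope per day:", regressor.coef_[0])
```

    To predict for a new date, convert it the same way (days since dates.min()) before calling regressor.predict.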

    What you would normally do when time is a variable is known as time series analysis. You can fit an ordinary linear regression to such data, but you then need to think about the following:

    1. How many values in the past does the current value depend on?

    Let's ignore time series models like ARIMA and say you think the current value depends on the 3 previous days (that is what we call an AR(3) model). You would then need to construct a new data set where each row contains the values from the three prior days, e.g.

    x3    x2    x1       value
    ------------------------------
    300   301   302      303
    301   302   303      304
    302   303   304      305
    .
    .
    .
    311   312   230.367  269.032
    

    where x3 is the value three days back, x2 two days back and x1 is the value yesterday.

    Your regression is then Y = beta_0 + beta_1*x1 + beta_2*x2 + beta_3*x3, where beta_0 is the intercept.
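
    A sketch of how such a lagged data set could be built with pandas shift and then fitted. The series here is made up for illustration; in the question it would be dataset["Value"], and the choice of three lags is the assumption from the example above.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up series for illustration; in practice this would be dataset["Value"]
values = pd.Series(
    [300, 301, 302, 303, 304, 305, 306, 307, 308, 309], dtype=float
)

# shift(k) moves the series down by k rows, so each row sees the value k days back
lagged = pd.DataFrame({
    "x3": values.shift(3),
    "x2": values.shift(2),
    "x1": values.shift(1),
    "value": values,
}).dropna()  # the first three rows have no complete history

X = lagged[["x1", "x2", "x3"]].to_numpy()
y = lagged["value"].to_numpy()

model = LinearRegression()
model.fit(X, y)

print(lagged.head(3))
print("Intercept (beta_0):", model.intercept_)
print("beta_1, beta_2, beta_3:", model.coef_)
```

    Each row of lagged corresponds to one row of the table above, and the fitted coefficients are the beta_1, beta_2 and beta_3 of the regression.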