Why would the lm predict function in R add a row to my output?
I am trying to use the predict function in R based on a basic linear model. My test set has 1459 rows, but when I use the predict function it creates 1460 predictions. I tried removing the NAs from the test set, and also tried keeping them in, but I do not know where the extra value is coming from.
Any help would be greatly appreciated. Thanks!
MODEL <- lm(train$SalePrice ~ train$LotArea * train$GarageArea *
            factor(train$FullBath) * train$YearBuilt * factor(train$OverallQual))
test_final <- read.csv("/Users/ERIC/Documents/HOUSING_PRICES/test.csv",
                       header = TRUE)
na.omit(test_final)
prediction <- data.frame(predict(MODEL, test_final))
Warning messages:
1: 'newdata' had 1459 rows but variables found have 1460 rows
2: In predict.lm(MODEL, test_final) :
  prediction from a rank-deficient fit may be misleading
Data via: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
1 answer

First, a note: you have to reassign the output of na.omit() to get rid of missing values. See here:
df <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))
df
  x  y
1 1  0
2 2 10
3 3 NA
na.omit(df)
  x  y
1 1  0
2 2 10
df
  x  y
1 1  0
2 2 10
3 3 NA
As you can see, the last call to df still shows the initial version, including the NAs. You will need to reassign using df <- na.omit(df).
The actual issue:
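As pointed out by @42 in the comments, using formulas correctly will resolve this issue, i.e. you will no longer get this warning. Because the original formula refers to train$LotArea and so on, predict() cannot find those variables in newdata and falls back to the 1460-row train data frame, which is where the extra row comes from. You will, however, get a different error. First, let me show you: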
# read in the data
testdf <- read.csv("test.csv")
train <- read.csv("train.csv")
# run initial model, and run model as suggested by 42
model_original <- lm(train$SalePrice ~ train$LotArea * train$GarageArea *
                     factor(train$FullBath) * train$YearBuilt * factor(train$OverallQual))
mod_42 <- lm(SalePrice ~ LotArea * GarageArea * factor(FullBath) *
             YearBuilt * factor(OverallQual), data = train)
Now, let us run predictions:
prediction <- data.frame(predict(model_original, testdf))
Warning messages:
1: 'newdata' had 1459 rows but variables found have 1460 rows
2: In predict.lm(model_original, testdf) :
  prediction from a rank-deficient fit may be misleading
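The row mismatch can be reproduced on a small scale. The following is a minimal sketch on made-up toy data; toy_train, toy_test, fit_dollar and fit_data are hypothetical names used only for illustration and are not part of the Kaggle files.

```r
# Toy data standing in for the Kaggle files (values are made up)
set.seed(1)
toy_train <- data.frame(x = 1:10, y = 2 * (1:10) + rnorm(10))
toy_test  <- data.frame(x = 11:14)

# Formula hard-codes the training data frame, as in the question
fit_dollar <- lm(toy_train$y ~ toy_train$x)
# Formula uses bare column names together with the data argument
fit_data <- lm(y ~ x, data = toy_train)

# toy_test has no variable called `toy_train$x`, so predict() evaluates
# toy_train$x from the workspace and warns about the row mismatch
pred_dollar <- suppressWarnings(predict(fit_dollar, newdata = toy_test))
pred_data   <- predict(fit_data, newdata = toy_test)

length(pred_dollar)  # one value per row of toy_train
length(pred_data)    # one value per row of toy_test
```

With the train$ style formula, newdata is effectively ignored, so the number of predictions follows the training set rather than the test set.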
Predicting from model_original produces the same warnings you got. Now, let us run the predictions using the second approach:
prediction <- data.frame(predict(mod_42, testdf))
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
  factor factor(FullBath) has new levels 4
Note that the error message is different now, and points to a more interesting problem.
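One way to see which values trigger this kind of error is to compare the values of the factor variable in the two data sets before predicting. This is only a sketch on made-up toy data (toy_train, toy_test, fit, new_levels and res are hypothetical names), not a statement about what the Kaggle files contain beyond what the error message above reports.

```r
# Toy data (made up): the test set contains a FullBath value absent from training
toy_train <- data.frame(FullBath  = c(0, 1, 2, 3, 1, 2),
                        SalePrice = c(100, 150, 200, 260, 155, 210))
toy_test  <- data.frame(FullBath = c(1, 2, 4))

fit <- lm(SalePrice ~ factor(FullBath), data = toy_train)

# Values present in the test set but never seen while fitting
new_levels <- setdiff(unique(toy_test$FullBath), unique(toy_train$FullBath))
new_levels

# predict() stops with an error because of the unseen level;
# capture the message instead of aborting the script
res <- tryCatch(predict(fit, newdata = toy_test),
                error = function(e) conditionMessage(e))
res
```

A check like this only identifies the unseen levels. How to handle the affected rows (for example, treating the variable as numeric or collapsing rare levels) is a modelling decision.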