Why would the LM Prediction function in R add a row to my output?

I am trying to use the predict function in R based on a basic linear model. My test set has 1459 rows, but when I use the predict function it returns 1460 values. I tried removing the NAs from the test set and also tried keeping them in, but I do not know where the extra value is coming from.

Any help would be greatly appreciated. Thanks!

    MODEL <- lm(train$SalePrice ~ train$LotArea * train$GarageArea *
                factor(train$FullBath) * train$YearBuilt * factor(train$OverallQual))

    test_final <- read.csv("/Users/ERIC/Documents/HOUSING_PRICES/test.csv",
                           header = TRUE)

    na.omit(test_final)

    prediction <- data.frame(predict(MODEL, test_final))


    Warning messages:
    1: 'newdata' had 1459 rows but variables found have 1460 rows 
    2: In predict.lm(MODEL, test_final) :
    prediction from a rank-deficient fit may be misleading

Data via: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

1 answer

  • answered 2018-11-08 00:08 coffeinjunky

    First, a note: you have to reassign the output of na.omit() to get rid of missing values.

    See here:

    > df <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))
    > df
      x  y
    1 1  0
    2 2 10
    3 3 NA
    > na.omit(df)
      x  y
    1 1  0
    2 2 10
    > df
      x  y
    1 1  0
    2 2 10
    3 3 NA
    

    As you can see, df is unchanged after the call: na.omit() returns a cleaned copy and does not modify its argument, so the last call to df still shows the row with the NA. To keep the result, reassign it using df <- na.omit(df).


    The actual issue:

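    As pointed out by @42 in the comments, the mismatch comes from how the formula is written. With terms such as train$LotArea, predict() looks for variables with exactly those names, does not find them in the test data, and falls back to the train object itself, which has 1460 rows, so the new data is effectively ignored. Writing the formula with bare column names and passing data = train resolves this, i.e. you will not get this warning any longer. You will, however, get a different error. First, let me show you: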

    # read in the data
    testdf <- read.csv("test.csv")
    train <- read.csv("train.csv")
    
    # run initial model, and run model as suggested by 42
    model_original <- lm(train$SalePrice ~ train$LotArea * train$GarageArea *
                           factor(train$FullBath) * train$YearBuilt * factor(train$OverallQual))
    
    mod_42 <- lm(SalePrice ~ LotArea * GarageArea * factor(FullBath) * YearBuilt * factor(OverallQual), data = train)
    

    Now, let us run predictions:

    prediction <- data.frame(predict(model_original, testdf))
    Warning messages:
    1: 'newdata' had 1459 rows but variables found have 1460 rows 
    2: In predict.lm(model_original, testdf) :
      prediction from a rank-deficient fit may be misleading
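
The row-count mismatch can be reproduced on a small made-up data set, which may make the mechanism easier to see. This is only a sketch: the names toy_train, toy_test, fit_dollar and fit_data are illustrative and not part of the original data or code.

```r
# Made-up data; toy_train and toy_test are hypothetical stand-ins
set.seed(1)
toy_train <- data.frame(x = 1:10, y = 2 * (1:10) + rnorm(10))
toy_test  <- data.frame(x = c(11, 12, 13))

# Formula written with `$`: the predictor is literally named "toy_train$x"
fit_dollar <- lm(toy_train$y ~ toy_train$x)

# Formula with bare column names plus data =
fit_data <- lm(y ~ x, data = toy_train)

# toy_test has no way to supply "toy_train$x", so predict() evaluates
# toy_train$x itself (with a warning, suppressed here) and returns one
# value per training row
pred_dollar <- suppressWarnings(predict(fit_dollar, newdata = toy_test))

# Here the column x is looked up in toy_test, as intended
pred_data <- predict(fit_data, newdata = toy_test)

length(pred_dollar)  # same as nrow(toy_train)
length(pred_data)    # same as nrow(toy_test)
```

In the first case the values returned are fitted values for the training rows, not predictions for the new data.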
    

    So model_original reproduces the warnings from your question. Now, let us run the predictions using the second approach:

    prediction <- data.frame(predict(mod_42, testdf))
    Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
      factor factor(FullBath) has new levels 4
    

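    Note that this is now an error rather than a warning, and it points to a more interesting problem: the test data contains a FullBath value (4) that never occurs in the training data, so the fitted model has no coefficient for that factor level.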
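
One way to find which values trigger this kind of error is to compare the values of the factor variable in the two data sets before predicting. The following is a minimal sketch on made-up data; toy_train, toy_test, new_levels and rows_affected are hypothetical names, and the FullBath values are invented for illustration.

```r
# Made-up stand-ins for the training and test data
toy_train <- data.frame(FullBath = c(0, 1, 2, 3, 2, 1))
toy_test  <- data.frame(FullBath = c(1, 2, 4))

# Values present in the test data but absent from the training data
new_levels <- setdiff(unique(toy_test$FullBath), unique(toy_train$FullBath))
new_levels

# Rows of the test data that carry such a value
rows_affected <- which(toy_test$FullBath %in% new_levels)
rows_affected
```

How to treat the affected rows is a modelling decision that this check does not make for you.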