Do I inverse transform my predictions and test dataset before measuring a model's performance?

I've created a toy example of time-series forecasting with the series [1, 2, 3, ..., 999, 1000]. I split the series into training (2/3) and testing (1/3) sets, and scaled the training set with scikit-learn's MinMaxScaler.

from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd
# define dataset
series = pd.DataFrame([i+1 for i in range(1000)])
train = series[:int(len(series)*0.67)]
test = series[int(len(series)*0.67):]
# Scale
scaler = MinMaxScaler()
trainNorm = scaler.fit_transform(train)
testNorm = scaler.transform(test)
# TimeseriesGenerator expects 1-D arrays of equal length for the
# data and targets, so flatten the (n, 1) output of the scaler
trainNorm = trainNorm.flatten()
testNorm = testNorm.flatten()
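As an aside, the windows that TimeseriesGenerator yields from a flattened series can be mirrored with plain NumPy, which makes the required shape easier to see (a small sketch using hypothetical toy data, not the series above):

```python
import numpy as np

# build the (samples, n_input) lag windows that TimeseriesGenerator
# produces from a 1-D series, plus the matching next-step targets
series = np.arange(10, dtype=float)
n_input = 3
X = np.array([series[i:i + n_input] for i in range(len(series) - n_input)])
y = series[n_input:]

print(X.shape, y.shape)  # (7, 3) (7,)
# X[0] is [0., 1., 2.] and its target y[0] is 3.0
```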

I use TimeseriesGenerator to convert the training set into lagged samples according to the number of time steps I want to look back. I also construct a simple neural network.

# Number of steps to "look back" for forecasting
n_input = 5

# define generator
generator = TimeseriesGenerator(trainNorm, trainNorm, length=n_input, batch_size=795)

# define model
model = Sequential()
model.add(Dense(64, activation='relu', input_dim=n_input))
model.add(Dense(1))  # single output unit to predict the next value
model.compile(optimizer='adam', loss='mse')

# fit model (fit_generator is deprecated; fit accepts generators)
model.fit(generator, steps_per_epoch=1, epochs=200, verbose=2)

I then create a list of predictions to compare against the test set, using walk-forward validation: each forecast is made from the last n_input observations, and the actual observation is appended to the history before the next step.

"""This section creates a list of predictions and performs walk-forward validation."""
preds = []
history = [x for x in trainNorm]

# step over each time-step in the test set
for i in range(len(testNorm)):  
    # Forecast for predictions
    x_input = np.array(history[-n_input:]).reshape((1, n_input))
    y_pred = model.predict(x_input, verbose=0)
    # store forecast in list of predictions
    # and add actual observation to history for the next loop

# Reverse normalization to original values for scoring
# (inverse_transform expects a 2-D array of shape (n, 1))
preds = scaler.inverse_transform(np.array(preds).reshape(-1, 1))
history = scaler.inverse_transform(np.array(history).reshape(-1, 1))
test = np.array(test)

# estimate prediction error
mse = mean_squared_error(test, preds)
rmse = np.sqrt(mse)
print(f"Mean Squared Error: {round(mse, 2)}")
print(f"Root Mean Squared Error: {round(rmse, 2)}")

I think my model trains properly and my scoring looks reasonable, but I'm not sure. My questions concern the latter part of the code: I don't know when to apply an inverse transformation before scoring my model, or whether I need to apply one at all.

Can I score my model without the inverse transformation? If I do need one, should it happen after the walk-forward validation loop but before the scoring code? Did I code the inverse transformation and the reshapes properly? I'd just like to know whether I'm on the right track with this toy model.
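For context on what I've tried to reason about: since MinMaxScaler is a linear transform, I'd expect the RMSE measured in scaled space to differ from the original-unit RMSE only by the data range the scaler was fitted on. A quick sanity check of that relationship (with made-up values, not the series above):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# made-up true/predicted values in the original units
y_true = np.array([[700.0], [710.0], [720.0]])
y_pred = np.array([[698.0], [712.0], [719.0]])

# scaler fitted on a hypothetical training range
scaler = MinMaxScaler().fit(np.arange(1, 671).reshape(-1, 1))

rmse_orig = np.sqrt(np.mean((y_true - y_pred) ** 2))
rmse_scaled = np.sqrt(np.mean(
    (scaler.transform(y_true) - scaler.transform(y_pred)) ** 2))

# the two scores differ exactly by the fitted data range
data_range = scaler.data_max_[0] - scaler.data_min_[0]  # 669.0
print(np.isclose(rmse_orig, rmse_scaled * data_range))  # True
```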
