Inconsistent training/validation loss when resuming training from snapshots with Caffe
As we know, Caffe supports resuming training when a snapshot is given. An explanation of Caffe's training-continuation scheme can be found here. However, I found that the training loss and validation loss are inconsistent after resuming. I give the following example to illustrate my point. Suppose I am training a neural network for a maximum of 1000 iterations, and every 100 training iterations it saves a snapshot. This is done using the following command:
caffe train -solver solver.prototxt
where the batch size is selected to be 64, and in solver.prototxt we have:
test_iter: 4
max_iter: 1000
snapshot: 100
display: 100
test_interval: 100
I chose test_iter: 4 carefully so that testing covers nearly the whole validation set (there are 284 validation samples, a little more than 4*64 = 256).
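For reference, the coverage arithmetic behind that choice can be sketched as follows (the batch size, test_iter, and validation-set size are the numbers from my setup; the variable names are just illustrative):

```python
# Sketch of how much of the validation set one test phase covers,
# using the values from this question.
batch_size = 64        # validation batch size in the net prototxt
test_iter = 4          # test_iter from solver.prototxt
num_val_samples = 284  # total validation samples

# Caffe runs test_iter forward passes per test phase, each on one batch.
samples_per_test = test_iter * batch_size
uncovered = num_val_samples - samples_per_test

print(samples_per_test)  # 256 samples evaluated per test phase
print(uncovered)         # 28 samples not seen in a single test phase
```

So a single test phase evaluates 256 of the 284 validation samples, which is why I describe it as covering "nearly all" of the validation set.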
This gives us a list of .caffemodel and .solverstate files. For example, we may get solver_iter_300.solverstate and solver_iter_300.caffemodel. At the moment these two files are generated, we can also see the training loss (13.7466) and validation loss (2.9385) in the log.
Now, if we use the snapshot solver_iter_300.solverstate to continue training:
caffe train -solver solver.prototxt -snapshot solver_iter_300.solverstate
we see that the training loss and validation loss are 12.6 and 2.99, respectively. These differ from the values logged before the restart. Any ideas? Thanks.