Implementing a loss function (MSVE) in Reinforcement learning
I am trying to build a temporal-difference learning agent for Othello. While the rest of my implementation seems to run as intended, I am wondering about the loss function used to train my network. In Sutton's book "Reinforcement Learning: An Introduction", the Mean Squared Value Error (MSVE) is presented as the standard loss function. It is basically a mean squared error weighted by the on-policy distribution: ( Sum over all states s ( onPolicyDistribution(s) * [V(s) - V'(s,w)]² ) )
My question is now: how do I obtain this on-policy distribution when my policy is an ε-greedy function of a learned value function? Is it even necessary, and what's the issue if I just use an MSELoss instead?
I'm implementing all of this in pytorch, so bonus points for an easy implementation there :)
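If a weight per state is available (for example estimated from visit counts), MSVE is just a weighted MSE, which is easy to write by hand in PyTorch. A minimal sketch — the tensor values below are made-up illustration numbers, and `mu` stands in for the on-policy distribution, which this snippet assumes has been precomputed:

```python
import torch

def msve_loss(v_pred, v_target, mu):
    # MSVE: sum over states of mu(s) * (V(s) - V_hat(s, w))^2,
    # where mu is the (assumed precomputed) on-policy distribution.
    return (mu * (v_target - v_pred) ** 2).sum()

v_pred = torch.tensor([0.0, 0.0])     # network estimates V_hat(s, w)
v_target = torch.tensor([1.0, 2.0])   # TD targets / true values
mu = torch.tensor([0.5, 0.5])         # weights summing to 1 over these states
loss = msve_loss(v_pred, v_target, mu)   # 0.5*1 + 0.5*4 = 2.5
```

Note that if the training states are sampled by actually following the policy, they are already drawn from the on-policy distribution, so an unweighted MSELoss over those samples approximates the MSVE in expectation — that is the usual justification for just using plain MSE.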
See also questions close to this topic

OpenAI gym player mode
Does anyone know how to run one of the OpenAI gym environments in player mode, as in letting a human player play a round of CartPole? I have seen that there is env.mode = 'human', but I have not been able to get it to run properly. I have tried to follow the example given here: https://www.pinchofintelligence.com/getting-started-openai-gym/ but it doesn't seem to work for me.
Any help you can give would be greatly appreciated.
Thanks

how reinforcement learning play SLG games?
I want to use reinforcement learning to play SLG games, and I find it hard to do. It's hard to define a reward, and there are too many actions (buildings to upgrade, resource constraints, recruiting, resource gathering, buying and using items, attacking, ...). It's also hard to train: SLG games take a lot of time, and there is no end state in such games. Any idea is welcome.

Hidden units saturate in a seq2seq model in PyTorch
I'm trying to write a very simple machine translation toy example in PyTorch. To simplify the question, I turn the machine translation task into this one: given a random sequence ([4, 8, 9, ...]), predict a sequence whose elements are the input elements plus 1 ([5, 9, 10, ...]). The ids 0, 1, 2 will be used as pad, bos, eos, respectively.

I observed the same problem in this toy task as in my machine translation task. To debug, I use a very small data size n_data = 50, and find that the model cannot even overfit these data. Looking into the model, I find that the hidden state of the encoder/decoder soon becomes saturated, namely, all units in the hidden state become very close to -1/1 due to the tanh:

```
0.8987  0.9634  0.9993  ...  0.8930  0.4822  0.9960
0.9673  1.0000  0.8007  ...  0.9929  0.9992  0.9990
0.9457  0.9290  0.9260  ...  0.9932  0.9851  0.9980
  ...                    ⋱                     ...
0.9995  0.9997  0.9350  ...  0.9820  0.9942  0.9913
0.9951  0.9488  0.8894  ...  0.9842  0.9895  0.9116
0.9991  0.9769  0.5871  ...  0.7557  0.9049  0.9881
```
Also, no matter how I adjust the learning rate, the loss value seems to have a lower bound, even with only 50 samples. With more data, the model does not seem to converge at all:

```
step: 0, loss: 2.313938
step: 10, loss: 1.435780
step: 20, loss: 0.779704
step: 30, loss: 0.395590
step: 40, loss: 0.281261
...
step: 480, loss: 0.231419
step: 490, loss: 0.231410
```
When I use tensorflow, I can easily overfit such a dataset with a seq2seq model and get a very small loss value. How can I get rid of this? Any help will be appreciated.
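For what it's worth, one common mitigation for tanh saturation (an assumption on my part, not a verified fix for this particular model) is gradient-norm clipping, applied between backward() and the optimizer step. A self-contained sketch on a toy linear model:

```python
import torch
from torch import nn
from torch.nn.utils import clip_grad_norm_

model = nn.Linear(4, 1)
x = torch.randn(8, 4)
loss = model(x).pow(2).mean()
loss.backward()

# Rescale all gradients in place so their global L2 norm is at most 1.0,
# which keeps single updates from pushing units hard into saturation.
clip_grad_norm_(model.parameters(), max_norm=1.0)
```

In older PyTorch versions the function is named clip_grad_norm (without the trailing underscore).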
Here's the code
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import numpy as np
import torch
import torch.nn.functional as F
from torch import nn
from torch.autograd import Variable

np.random.seed(0)
torch.manual_seed(0)

_RECURRENT_FN_MAPPING = {
    'rnn': torch.nn.RNN,
    'gru': torch.nn.GRU,
    'lstm': torch.nn.LSTM,
}


def get_recurrent_cell(n_inputs, num_units, num_layers, type_,
                       dropout=0.0, bidirectional=False):
    cls = _RECURRENT_FN_MAPPING.get(type_)
    return cls(n_inputs, num_units, num_layers,
               dropout=dropout, bidirectional=bidirectional)


class Recurrent(nn.Module):

    def __init__(self, num_units, num_layers=1, unit_type='gru',
                 bidirectional=False, dropout=0.0, embedding=None,
                 attn_type='general'):
        super(Recurrent, self).__init__()
        num_inputs = embedding.weight.size(1)
        self._num_inputs = num_inputs
        self._num_units = num_units
        self._num_layers = num_layers
        self._unit_type = unit_type
        self._bidirectional = bidirectional
        self._dropout = dropout
        self._embedding = embedding
        self._attn_type = attn_type
        self._cell_fn = get_recurrent_cell(num_inputs, num_units, num_layers,
                                           unit_type, dropout, bidirectional)

    def init_hidden(self, batch_size):
        direction = 1 if not self._bidirectional else 2
        h = Variable(
            torch.zeros(direction * self._num_layers, batch_size,
                        self._num_units))
        if self._unit_type == 'lstm':
            return (h, h.clone())
        else:
            return h

    def forward(self, x, h, len_x):
        # Sort by sequence lengths (descending, as pack_padded_sequence requires)
        sorted_indices = np.argsort(-len_x).tolist()
        unsorted_indices = np.argsort(sorted_indices).tolist()
        x = x[:, sorted_indices]
        h = h[:, sorted_indices, :]
        len_x = len_x[sorted_indices].tolist()

        embedded = self._embedding(x)
        packed = torch.nn.utils.rnn.pack_padded_sequence(embedded, len_x)

        if self._unit_type == 'lstm':
            o, (h, c) = self._cell_fn(packed, h)
            o, _ = torch.nn.utils.rnn.pad_packed_sequence(o)
            return (o[:, unsorted_indices, :],
                    (h[:, unsorted_indices, :], c[:, unsorted_indices, :]))
        else:
            o, hh = self._cell_fn(packed, h)
            o, _ = torch.nn.utils.rnn.pad_packed_sequence(o)
            return (o[:, unsorted_indices, :], hh[:, unsorted_indices, :])


class Encoder(Recurrent):
    pass


class Decoder(Recurrent):
    pass


class Seq2Seq(nn.Module):

    def __init__(self, encoder, decoder, num_outputs):
        super(Seq2Seq, self).__init__()
        self._encoder = encoder
        self._decoder = decoder
        self._out = nn.Linear(decoder._num_units, num_outputs)

    def forward(self, x, y, h, len_x, len_y):
        # Encode
        _, h = self._encoder(x, h, len_x)
        # Decode
        o, h = self._decoder(y, h, len_y)
        # Project
        o = self._out(o)
        return F.log_softmax(o)


def load_data(size, min_len=5, max_len=15, min_word=3, max_word=100,
              epoch=10, batch_size=64, pad=0, bos=1, eos=2):
    src = [
        np.random.randint(min_word, max_word - 1,
                          np.random.randint(min_len, max_len)).tolist()
        for _ in range(size)
    ]
    tgt_in = [[bos] + [xi + 1 for xi in x] for x in src]
    tgt_out = [[xi + 1 for xi in x] + [eos] for x in src]

    def _pad(batch):
        max_len = max(len(x) for x in batch)
        return np.asarray(
            [
                np.pad(x, (0, max_len - len(x)), mode='constant',
                       constant_values=pad) for x in batch
            ],
            dtype=np.int64)

    def _len(batch):
        return np.asarray([len(x) for x in batch], dtype=np.int64)

    for e in range(epoch):
        batch_start = 0
        while batch_start < size:
            batch_end = batch_start + batch_size
            s, ti, to = (src[batch_start:batch_end],
                         tgt_in[batch_start:batch_end],
                         tgt_out[batch_start:batch_end])
            lens, lent = _len(s), _len(ti)
            s, ti, to = _pad(s).T, _pad(ti).T, _pad(to).T
            yield (Variable(torch.LongTensor(s)),
                   Variable(torch.LongTensor(ti)),
                   Variable(torch.LongTensor(to)), lens, lent)
            batch_start += batch_size


def print_sample(x, y, yy):
    x = x.data.numpy().T
    y = y.data.numpy().T
    yy = yy.data.numpy().T
    for u, v, w in zip(x, y, yy):
        print('')
        print('S: ', u)
        print('T: ', v)
        print('P: ', w)


n_data = 50
min_len = 5
max_len = 10
vocab_size = 101
n_samples = 5
epoch = 100000
batch_size = 32
lr = 1e-2
emb_size = 50
hidden_size = 50
num_layers = 1
max_length = 15

src_embed = torch.nn.Embedding(vocab_size, emb_size)
tgt_embed = torch.nn.Embedding(vocab_size, emb_size)
enc = Encoder(hidden_size, num_layers, embedding=src_embed)
dec = Decoder(hidden_size, num_layers, embedding=tgt_embed)
net = Seq2Seq(enc, dec, vocab_size)
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
criterion = torch.nn.NLLLoss()
loader = load_data(n_data, min_len=min_len, max_len=max_len,
                   max_word=vocab_size, epoch=epoch, batch_size=batch_size)

net.train()
for i, (x, yin, yout, lenx, leny) in enumerate(loader):
    optimizer.zero_grad()
    logits = net(x, yin, enc.init_hidden(x.size()[1]), lenx, leny)
    loss = criterion(logits.view(-1, vocab_size), yout.contiguous().view(-1))
    loss.backward()
    optimizer.step()
    if i % 10 == 0:
        print('step: {}, loss: {:.6f}'.format(i, loss.data[0]))
    if i % 200 == 0 and i > 0:
        net.eval()
        x, yin, yout, lenx, leny = (x[:, :n_samples], yin[:, :n_samples],
                                    yout[:, :n_samples], lenx[:n_samples],
                                    leny[:n_samples])
        outputs = net(x, yin, enc.init_hidden(x.size()[1]), lenx, leny)
        _, preds = torch.max(outputs, 2)
        print_sample(x, yout, preds)
```

Efficiently find centroid of labelled image regions
I have a segmented image as a 2 dimensional matrix of unique labels 1 ... k. For example:
```
img = [[1 1 2 2 2 2 2 3 3]
       [1 1 1 2 2 2 2 3 3]
       [1 1 2 2 2 2 3 3 3]
       [1 4 4 4 2 2 2 2 3]
       [4 4 4 5 5 5 2 3 3]
       [4 4 4 5 5 6 6 6 6]
       [4 4 5 5 5 5 6 6 6]]
```
I am trying to determine the region centroids. That is, per label, what is the X,Y coordinate of the center of mass? For example, the centroid of label 1 is (1.25, 0.625): just average up the row numbers ((0 + 0 + 1 + 1 + 1 + 2 + 2 + 3) / 8 = 1.25) and the column numbers ((0 + 0 + 0 + 0 + 1 + 1 + 1 + 2) / 8 = 0.625). The only way I know how to do this is to use a for loop from 1 to k (or in the case of my example, 1 through 6), find the indices of the points for each label, and average their coordinates by indexing a meshgrid of the image.
However, I am looking to do this in a way optimized for GPU computations. Hence, the use of a for loop is less than ideal (takes about 1 sec per image on a nice GPU for a few hundred labels). I am using PyTorch, but really any numpy solution should suffice.
Is there a GPU-efficient solution for this task?
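One loop-free way to do this — a NumPy sketch of the idea; the same weighted-histogram trick should port to the GPU, e.g. with torch.bincount, which accepts a weights argument in recent PyTorch versions:

```python
import numpy as np

def label_centroids(img):
    """Per-label (row, col) centroids for an image of integer labels 1..k."""
    labels = img.ravel()
    rows, cols = np.indices(img.shape)
    counts = np.bincount(labels)[1:]                          # pixels per label
    row_sums = np.bincount(labels, weights=rows.ravel())[1:]  # sum of row indices
    col_sums = np.bincount(labels, weights=cols.ravel())[1:]  # sum of col indices
    return np.stack([row_sums / counts, col_sums / counts], axis=1)

img = np.array([[1, 1, 2, 2, 2, 2, 2, 3, 3],
                [1, 1, 1, 2, 2, 2, 2, 3, 3],
                [1, 1, 2, 2, 2, 2, 3, 3, 3],
                [1, 4, 4, 4, 2, 2, 2, 2, 3],
                [4, 4, 4, 5, 5, 5, 2, 3, 3],
                [4, 4, 4, 5, 5, 6, 6, 6, 6],
                [4, 4, 5, 5, 5, 5, 6, 6, 6]])
centroids = label_centroids(img)   # centroids[0] is (1.25, 0.625) for label 1
```

The whole computation is a handful of vectorized reductions, so there is no per-label Python loop regardless of k.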

PyTorch: How to get the shape of a Tensor as a list of int
In numpy, V.shape gives a tuple of ints with the dimensions of V. In tensorflow, V.get_shape().as_list() gives a list of integers with the dimensions of V. In pytorch, V.size() gives a size object, but how do I convert it to ints?
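For what it's worth, torch.Size is a subclass of Python's tuple, so it can be converted with plain list()/tuple() operations:

```python
import torch

t = torch.zeros(2, 3, 5)
dims = list(t.size())   # [2, 3, 5] -- a plain list of Python ints
```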
Loss implementation in ImprovedGAN
Recently I have been reading the paper Improved Techniques for Training GANs, in which the authors define the loss as follows: Then I checked the code of the article; the corresponding loss definition is:
```python
output_before_softmax_lab = ll.get_output(disc_layers[-1], x_lab, deterministic=False)
output_before_softmax_unl = ll.get_output(disc_layers[-1], x_unl, deterministic=False)
output_before_softmax_gen = ll.get_output(disc_layers[-1], gen_dat, deterministic=False)

l_lab = output_before_softmax_lab[T.arange(args.batch_size), labels]
l_unl = nn.log_sum_exp(output_before_softmax_unl)
l_gen = nn.log_sum_exp(output_before_softmax_gen)

loss_lab = -T.mean(l_lab) + T.mean(T.mean(nn.log_sum_exp(output_before_softmax_lab)))
loss_unl = -0.5*T.mean(l_unl) + 0.5*T.mean(T.nnet.softplus(l_unl)) + 0.5*T.mean(T.nnet.softplus(l_gen))
```
I recognize that l_lab is the classification loss on the labeled data, so it should be minimized, that l_unl is the loss with regard to unlabeled data, and that l_gen is the loss on generated images. My confusion is why the discriminator should minimize l_unl and l_gen, which is what the code 0.5*T.mean(T.nnet.softplus(l_unl)) + 0.5*T.mean(T.nnet.softplus(l_gen)) tells us. Thanks in advance.
How to solve “nan loss” in Keras NN, arising with high number of features?
I am training a neural network using Keras, but get a nan loss when adding more features. I have tried the following solutions mentioned in other discussions: lowering the learning rate, using a different optimizer, and adding the epsilon=1e-08 / clipnorm=1. arguments to the optimizer. However, this did not solve it.
The dataset is small, with 1000 elements. I use different feature sets. When using 25 features there is no problem and performance is good. But when using all 53 features, I get only nan losses. (When using the feature sets separately there is no problem either, so I think the problem is not in the feature sets themselves but in their high number.) The network is below. Any suggestions to solve this?
```python
embedding_layer = Embedding(nb_words, EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=300, trainable=False)
lstm_layer = LSTM(75, recurrent_dropout=0.2)

sequence_1_input = Input(shape=(300,), dtype="int32")
embedded_sequences_1 = embedding_layer(sequence_1_input)
x1 = lstm_layer(embedded_sequences_1)

sequence_2_input = Input(shape=(300,), dtype="int32")
embedded_sequences_2 = embedding_layer(sequence_2_input)
y1 = lstm_layer(embedded_sequences_2)

features_input = Input(shape=(f_train.shape[1],), dtype="float32")
features_dense = BatchNormalization()(features_input)
features_dense = Dense(200, activation="relu")(features_dense)
features_dense = Dropout(0.2)(features_dense)

addition = add([x1, y1])
minus_y1 = Lambda(lambda x: -x)(y1)
merged = add([x1, minus_y1])
merged = multiply([merged, merged])
merged = concatenate([merged, addition])
merged = Dropout(0.4)(merged)

merged = concatenate([merged, features_dense])
merged = BatchNormalization()(merged)
merged = GaussianNoise(0.1)(merged)
merged = Dense(150, activation="relu")(merged)
merged = Dropout(0.2)(merged)
merged = BatchNormalization()(merged)
out = Dense(1, activation="sigmoid")(merged)

optimizer = nadam(epsilon=1e-08, lr=0.00001)
model = Model(inputs=[sequence_1_input, sequence_2_input, features_input],
              outputs=out)
model.compile(loss="binary_crossentropy", optimizer=optimizer)

early_stopping = EarlyStopping(monitor="val_loss", patience=5)
best_model_path = "best_model" + str(model_count) + ".h5"
model_checkpoint = ModelCheckpoint(best_model_path, save_best_only=True,
                                   save_weights_only=True)

hist = model.fit([data_1_train, data_2_train, f_train], labels_train,
                 validation_data=([data_1_val, data_2_val, f_val], labels_val),
                 epochs=15, batch_size=BATCH_SIZE, shuffle=True,
                 callbacks=[early_stopping, model_checkpoint], verbose=1)
```

Comparing deviance plots for different loss functions
I'm using the GradientBoostingRegressor from scikit-learn and want to compare the deviance plots for different loss functions. Say I want to compare the least squares (LS) loss and the least absolute deviation (LAD) loss. I believe that it's necessary to square the LAD loss (or take the square root of the LS loss), otherwise the two would be losses of different powers. But when looking at the Huber loss in the source code, this is not taken into account. To me it seems strange to treat a squared loss the same way as a linear loss. Does anyone have thoughts on this?

How to prevent the eligibility trace in SARSA with lambda = 1 from exploding for state-action pairs that are visited a huge number of times?
I was testing SARSA with lambda = 1 on Windy Grid World, and if the exploration causes the same state-action pair to be visited many times before reaching the goal, the eligibility trace gets incremented each time without any decay; therefore it explodes and causes everything to overflow. How can this be avoided?
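A standard remedy, as I understand it from Sutton & Barto, is to use replacing traces instead of accumulating traces: on each visit the trace is reset to 1 rather than incremented, so it stays bounded even with lambda = 1. A sketch with a hypothetical dict-based trace table:

```python
def decay_and_visit(trace, sa, gamma, lam, replacing=True):
    """Decay all eligibility traces, then mark (state, action) pair sa visited."""
    for key in trace:
        trace[key] *= gamma * lam
    if replacing:
        trace[sa] = 1.0                        # replacing trace: bounded by 1
    else:
        trace[sa] = trace.get(sa, 0.0) + 1.0   # accumulating trace: can blow up
    return trace

trace = {}
for _ in range(1000):                          # revisit the same pair many times
    decay_and_visit(trace, ('s', 'a'), gamma=1.0, lam=1.0, replacing=True)
# trace[('s', 'a')] stays at 1.0 instead of growing toward 1000
```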

How to choose action in TD(0) learning
I am currently reading Sutton's Reinforcement Learning: An Introduction. After reading chapter 6.1, I wanted to implement a TD(0) RL algorithm for this setting. To do this, I tried to implement the pseudocode presented here. Doing this, I wondered how to perform the step A <- action given by π for S: how can I choose the optimal action A for my current state S? As the value function V(S) depends only on the state and not on the action, I do not really know how this can be done. I found this question (where I got the images from), which deals with the same exercise — but there the action is just picked randomly and not chosen by an action policy π.

Edit: Or is this pseudocode incomplete, so that I have to approximate the action-value function Q(s, a) in another way, too?
Alpha beta prune search algorithm only returns the next move, not the best
So, I have this reversi game where I am to implement an AI that a player can play against. To make it easier, the board is just 4x4. I am supposed to make the AI essentially unbeatable. The game kind of works, but the problem is that my algorithm always just returns the next possible move and not the best move.
When my makeMove method is called, it gets the current state of the board, which is a 2D array with enums representing the board. It then calls my tree builder, which builds a tree of all possible moves from that state down to a certain depth. Then I call my minimax method to search and prune it. It traverses a lot of nodes, but it always only returns the next possible move and not the best one. I've been at this for several days, but I just don't know what is causing it. Am I missing something obvious? I might add that the NodeWrapper that gets returned has an evaluation of INFINITY. That is correct in the sense that INFINITY is indeed the biggest possible number, so it should be the best move, but this can't be right, can it?
```java
private TreeBuilder treeBuilder;
private final int INFINITY = Integer.MAX_VALUE;
private final int MAXIMUMDEPTH = 7;

public AlphaBetaPruneSearch(Board board) {
    treeBuilder = new TreeBuilder(MAXIMUMDEPTH);
}

//When the player has made his/her move, this method is called.
public NodeWrapper makeMove(Board board) {
    System.out.println("AI making move...");
    nodesExamined = 0;
    Tree tree = treeBuilder.buildTree(board.getCells());
    NodeWrapper nodeWrapper = miniMaxSearch(tree.getRoot());
    return nodeWrapper;
}

//Entry point to start the minimax search.
private NodeWrapper miniMaxSearch(Node node) {
    return maxValue(node, 0, -INFINITY, INFINITY);
}

//Maximizes.
private NodeWrapper maxValue(Node node, int depth, int alpha, int beta) {
    nodesExamined++;
    if (terminalTest(node)) {
        this.depth = depth;
        return utility(node);
    }
    NodeWrapper maxNodeWrapper = new NodeWrapper(-INFINITY, node);
    for (Node child : node.getChildren()) {
        NodeWrapper minNodeWrapper = minValue(child, depth++, alpha, beta);
        maxNodeWrapper = max(maxNodeWrapper, minNodeWrapper);
        if (maxNodeWrapper.getEvaluation() >= beta) {
            return maxNodeWrapper;
        }
        alpha = Math.max(alpha, maxNodeWrapper.getEvaluation());
    }
    return maxNodeWrapper;
}

//Minimizes.
private NodeWrapper minValue(Node node, int depth, int alpha, int beta) {
    nodesExamined++;
    if (terminalTest(node)) {
        this.depth = depth;
        return utility(node);
    }
    NodeWrapper minNodeWrapper = new NodeWrapper(INFINITY, node);
    for (Node child : node.getChildren()) {
        NodeWrapper maxNodeWrapper = minValue(child, depth++, alpha, beta);
        minNodeWrapper = min(minNodeWrapper, maxNodeWrapper);
        if (minNodeWrapper.getEvaluation() <= alpha) {
            return minNodeWrapper;
        }
        beta = Math.min(beta, minNodeWrapper.getEvaluation());
    }
    return minNodeWrapper;
}

//Checks if a node is a leaf/terminal node.
private boolean terminalTest(Node node) {
    return node.isLeaf();
}

//Utility that returns the node and its score.
private NodeWrapper utility(Node node) {
    System.out.println("Util");
    Board board = new Board(node.getCells());
    int score = compareScore(board.getNumberOfWhiteCells(),
                             board.getNumberOfBlackCells());
    return new NodeWrapper(score, node);
}

private int compareScore(int first, int second) {
    return Integer.compare(first, second);
}

//Returns the wrapper with the maximum score.
private NodeWrapper max(NodeWrapper first, NodeWrapper second) {
    int firstScore = first.getEvaluation();
    int secondScore = second.getEvaluation();
    if (firstScore > secondScore) {
        return first;
    } else if (firstScore < secondScore) {
        return second;
    }
    return first;
}

//Returns the wrapper with the minimum score.
private NodeWrapper min(NodeWrapper first, NodeWrapper second) {
    int firstScore = first.getEvaluation();
    int secondScore = second.getEvaluation();
    if (secondScore > firstScore) {
        return second;
    } else if (secondScore < firstScore) {
        return first;
    }
    return first;
}

//Class that encapsulates the game state and its score.
private class NodeWrapper {

    private int evaluation;
    private final Node node;

    NodeWrapper(int evaluation, Node node) {
        this.evaluation = evaluation;
        this.node = node;
    }

    NodeWrapper(Node node) {
        this.node = node;
    }

    int getEvaluation() {
        return evaluation;
    }

    Node getNode() {
        return node;
    }
}
```

Minimax only goes down to leftmost leaf
So I've got a small game board for my Othello game. In this game the AI should decide what move to make with an alpha-beta prune search algorithm. I used the following pseudocode from GeeksforGeeks:
```
function minimax(node, depth, isMaximizingPlayer, alpha, beta):
    if node is a leaf node:
        return value of the node
    if isMaximizingPlayer:
        bestVal = -INFINITY
        for each child node:
            value = minimax(node, depth+1, false, alpha, beta)
            bestVal = max(bestVal, value)
            alpha = max(alpha, bestVal)
            if beta <= alpha:
                break
        return bestVal
    else:
        bestVal = +INFINITY
        for each child node:
            value = minimax(node, depth+1, true, alpha, beta)
            bestVal = min(bestVal, value)
            beta = min(beta, bestVal)
            if beta <= alpha:
                break
        return bestVal
```
This is how I implemented it:
```java
//Called when it's the AI's turn to make a move.
public Board makeMove(Board board) {
    setRoot(board);
    int alpha = Integer.MIN_VALUE;
    int beta = Integer.MAX_VALUE;
    int val = alphaBetaSearch(tree.getRoot(), 0, true, alpha, beta);
    Board resultBoard = //Should be AI Board/move
    setRoot(resultBoard);
    return resultBoard;
}

private int alphaBetaSearch(Node node, int depth, boolean maximizing, int alpha, int beta) {
    currentDepth = depth;
    if (node.isLeaf()) {
        return evaluateUtility(node.getBoard());
    }
    if (maximizing) {
        int bestValue = Integer.MIN_VALUE;
        for (Node child : node.getChildren()) {
            int value = alphaBetaSearch(child, depth + 1, false, alpha, beta);
            bestValue = Integer.max(bestValue, value);
            alpha = Integer.max(alpha, bestValue);
            if (beta <= alpha) {
                break;
            }
            return alpha;
        }
    } else {
        int bestValue = Integer.MAX_VALUE;
        for (Node child : node.getChildren()) {
            int value = alphaBetaSearch(child, depth + 1, true, alpha, beta);
            bestValue = Integer.min(bestValue, value);
            beta = Integer.min(beta, bestValue);
            if (beta <= alpha) {
                break;
            }
            return beta;
        }
    }
    return 0; //Not sure what should be returned here
}

private int evaluateUtility(Board board) {
    int whitePieces = board.getNumberOfWhiteCells();
    int blackPieces = board.getNumberOfBlackCells();
    int sum = (blackPieces - whitePieces);
    return sum;
}
```
As my Board is quite small (4x4), I have been able to calculate the complete search tree before the game starts, in about 20 seconds. This should improve my search, as I don't have to build anything while playing. Each Node in my tree contains a Board, which itself has a 2D array of cells. The root node/board looks like this:
```
EMPTY EMPTY EMPTY EMPTY
EMPTY WHITE BLACK EMPTY
EMPTY BLACK WHITE EMPTY
EMPTY EMPTY EMPTY EMPTY
```
Now when I start the game, this is the starting board, and I call the AI to make a move. When the minimax call has executed, it returns the value 2 at a depth of 12. Depth 12 is a leaf node/board in my tree. After running it with the debugger, it seems that my implementation doesn't traverse back up the tree. All it does is go down to the leftmost subtree and return its evaluation.