Implementing a loss function (MSVE) in Reinforcement learning
I am trying to build a temporal-difference learning agent for Othello. While the rest of my implementation seems to run as intended, I am wondering about the loss function used to train my network. In Sutton's book "Reinforcement Learning: An Introduction", the Mean Squared Value Error (MSVE) is presented as the standard loss function. It is basically a mean squared error weighted by the on-policy distribution: Sum over all states s ( onPolicyDistribution(s) * [V(s) - V'(s,w)]² )
My question is now: how do I obtain this on-policy distribution when my policy is an ε-greedy function of a learned value function? Is it even necessary, and what is the issue if I just use an MSELoss instead?
I'm implementing all of this in PyTorch, so bonus points for an easy implementation there :)
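A minimal sketch of how this could look in PyTorch. The key observation: if the training states are collected by running the ε-greedy policy itself, their empirical frequencies already approximate the on-policy distribution μ(s), so a plain MSE over those samples estimates the MSVE. The function and tensor names below are made up for illustration:

```python
import torch

def msve_loss(values_pred, values_target, state_weights=None):
    """Mean Squared Value Error sketch.

    If states are sampled by following the current policy, the plain mean
    squared error below already weights each state by its on-policy
    visitation frequency mu(s), so no explicit distribution is needed.
    `state_weights` allows weighting states explicitly instead.
    """
    sq_err = (values_pred - values_target) ** 2
    if state_weights is None:
        return sq_err.mean()  # implicit on-policy weighting
    return (state_weights * sq_err).sum() / state_weights.sum()

pred = torch.tensor([1.0, 2.0, 3.0])
target = torch.tensor([1.0, 1.0, 1.0])
loss = msve_loss(pred, target)  # (0^2 + 1^2 + 2^2) / 3
```

In other words, using `nn.MSELoss` on states gathered from on-policy trajectories is not "instead of" the MSVE; it is a sample-based estimate of it.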
See also questions close to this topic

mxnet: save list of tuples of arrays to file
I'm using mxnet to do deep reinforcement learning. I have a simple generator that yields observations from a random walk through a game (from OpenAI Gym):
    import mxnet as mx
    from mxnet import *
    from mxnet.ndarray import *
    import gym

    def random_walk(env_id):
        env, done = gym.make(env_id), True
        min_rew, max_rew = env.reward_range
        while True:
            if done:
                obs = env.reset()
            action = env.action_space.sample()
            obs, rew, done, info = env.step(action)
            # some preprocessing omitted...
            yield obs, rew, action  # all converted to ndarrays now
I want to be able to save this data to a big file containing rows of (observation, reward, action), so I can later easily load, shuffle, and batch them with mxnet. Is this possible using mxnet and, if so, how?
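One possible approach (a sketch, not an mxnet-specific API): since each mxnet NDArray can be converted with `.asnumpy()`, the rows can be stacked into three arrays and written to a single `.npz` file, then loaded and shuffled later. The file name and helper names here are made up:

```python
import numpy as np

def save_rows(rows, path):
    """Save a list of (observation, reward, action) tuples as three stacked
    arrays in one .npz file. mxnet NDArrays would first be converted
    with .asnumpy()."""
    obs, rew, act = zip(*rows)
    np.savez(path, obs=np.stack(obs), rew=np.array(rew), act=np.array(act))

def load_rows(path, shuffle=True):
    data = np.load(path)
    obs, rew, act = data["obs"], data["rew"], data["act"]
    if shuffle:
        idx = np.random.permutation(len(rew))
        obs, rew, act = obs[idx], rew[idx], act[idx]
    return obs, rew, act

rows = [(np.zeros((4,)), 1.0, 0), (np.ones((4,)), -1.0, 1)]
save_rows(rows, "walk.npz")
obs, rew, act = load_rows("walk.npz", shuffle=False)
```

The loaded numpy arrays can then be wrapped back into mxnet arrays for batching.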
Handling actions with delayed effect (Reinforcement learning)
I am working on a problem where the action that I learn (using DQN) can be executed 'now', but its effect on the environment is delayed by 'T' units of time.
The environment however is active in that time and there are other conditions based on which rewards are computed and returned. How is this handled?
I believe the Q value function (with gamma) handles 'delayed' effects of rewards but not actions.
This is similar to the inventory management use cases. As an analogy consider that I sell cakes. As customers walk into my shop I consume cakes off the shelf. I must reorder to stock my shelf BUT this reordering can take time to take effect.
I thought of just adding the reordered quantity to the shelf at a later time and letting the agent learn its effects. Will this suffice?
As another approach I thought of Experience and Replay as a mechanism to handle this delayed effect.
Appreciate the help.
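One common way to handle this, sketched below under the cake-shop analogy (all names and numbers are made up), is to make the delay part of the state: keep a queue of in-flight orders so the process stays Markov, and let the agent observe it:

```python
from collections import deque

class ShopWithLeadTime:
    """Toy inventory environment where a reorder arrives T steps later.
    Including the pipeline of pending orders in the observation keeps the
    problem Markov, so a DQN can learn the delayed effect of its action."""
    def __init__(self, lead_time=3, stock=5):
        self.stock = stock
        self.pipeline = deque([0] * lead_time, maxlen=lead_time)

    def step(self, order_qty, demand):
        self.stock += self.pipeline.popleft()    # order placed T steps ago arrives
        self.pipeline.append(order_qty)          # new order enters the pipeline
        sold = min(self.stock, demand)
        self.stock -= sold
        reward = sold - 0.1 * self.stock         # revenue minus holding cost
        obs = (self.stock, tuple(self.pipeline)) # agent sees pending orders too
        return obs, reward

shop = ShopWithLeadTime(lead_time=2, stock=1)
obs, r = shop.step(order_qty=4, demand=3)  # the new order arrives 2 steps later
```

With the pending orders visible in the state, the discounted return can credit the reorder action for rewards that only materialize T steps later.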

Curiosity reinforcement learning
Does anyone have an idea about curiosity-driven reinforcement learning and in which situations it is useful? As far as I know, curiosity-driven reinforcement learning is used when external rewards from the environment are sparse. If anyone could explain with an example, thanks in advance.

The only supported types are: double, float, int64, int32, and uint8
    import torch
    from torch import nn
    from torch.autograd import Variable
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd

    xy = pd.read_csv("all.csv", names=['pm2.5','pm10,s02','no2','co','o3','tem','hig_tem','low_tem','rain','qiya','shidu','zuixiaoshidu','fengsu_10min','fengsu','rizuidafengsu','rizhao'])
    x_data = xy.iloc[6:1, :1].values
    y_data = xy.iloc[:6, :1].values
    x_data = Variable(torch.from_numpy(x_data))
    y_data = Variable(torch.from_numpy(y_data))
How can I make this work?
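The usual cause of that error is that pandas produced an object- or string-typed array (for example because a column contains non-numeric entries or because of mixed types), which `torch.from_numpy` rejects. A sketch of the usual fix, casting to a supported dtype first; the column names and data here are made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["1.0", "2.5", "bad"], "b": [1, 2, 3]})

# Coerce to numeric (non-numeric entries become NaN), drop bad rows,
# then cast to float32 -- a dtype torch.from_numpy accepts.
clean = df.apply(pd.to_numeric, errors="coerce").dropna()
x = clean.values.astype(np.float32)
# torch.from_numpy(x) would now succeed, since float32 is supported.
```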

Understanding PyTorch Bernoulli distribution from the documention
So I was reading the PyTorch documentation, trying to learn and understand some things (because I'm new to machine learning), and I found torch.bernoulli(). I understood (I misunderstood it) that it rounds tensors whose values are between 0 and 1 to either 0 or 1 depending on the value (like in classic school: less than 0.5 becomes 0, greater than or equal to 0.5 becomes 1). After some experimentation on my own, it seemed to work as expected:
    >>> x = torch.Tensor([0.500])
    >>> x
     0.5000
    [torch.FloatTensor of size 1]
    >>> torch.bernoulli(x)
     1
    [torch.FloatTensor of size 1]
But when I looked at the documentation, I noticed something a bit weird:
    >>> a = torch.Tensor(3, 3).uniform_(0, 1)  # generate a uniform random matrix with range [0, 1]
    >>> a
     0.7544  0.8140  0.9842
    **0.5282**  0.0595  0.6445
     0.1925  0.9553  0.9732
    [torch.FloatTensor of size 3x3]
    >>> torch.bernoulli(a)
     1  1  1
    **0**  0  1
     0  1  1
    [torch.FloatTensor of size 3x3]
In the example, 0.5282 got mapped to 0. How did that happen? Or is it a fault in the documentation? Because when I tried it, 0.5282 got rounded up to 1 as I expected.
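For what it's worth, `torch.bernoulli` does not round: each entry p is treated as the probability of drawing a 1, so 0.5282 yields 1 about 52.8% of the time and 0 otherwise, which is why both outcomes appear across runs. A plain NumPy sketch of the same sampling rule:

```python
import numpy as np

def bernoulli(probs, rng=np.random):
    """Sample 1 with probability p for each entry, like torch.bernoulli.
    An input of 0.5282 is NOT rounded up; it still gives 0 roughly 47%
    of the time."""
    return (rng.random_sample(np.shape(probs)) < np.asarray(probs)).astype(float)

# Only p = 0 and p = 1 are deterministic: always 0 and always 1.
edge = bernoulli([0.0, 1.0])
```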

Implementing Adam in Pytorch
I'm trying to implement Adam by myself for learning purposes.
Here is my Adam implementation:
    import math

    class ADAMOptimizer(Optimizer):
        """
        implements ADAM Algorithm, as a preceding step.
        """
        def __init__(self, params, lr=1e-3, betas=(0.9, 0.99), eps=1e-8, weight_decay=0):
            defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
            super(ADAMOptimizer, self).__init__(params, defaults)

        def step(self):
            """
            Performs a single optimization step.
            """
            loss = None
            for group in self.param_groups:
                #print(group.keys())
                #print (self.param_groups[0]['params'][0].size()), First param (W) size: torch.Size([10, 784])
                #print (self.param_groups[0]['params'][1].size()), Second param (b) size: torch.Size([10])
                for p in group['params']:
                    grad = p.grad.data
                    state = self.state[p]

                    # State initialization
                    if len(state) == 0:
                        state['step'] = 0
                        # Momentum (Exponential MA of gradients)
                        state['exp_avg'] = torch.zeros_like(p.data)
                        #print(p.data.size())
                        # RMS Prop component. (Exponential MA of squared gradients). Denominator.
                        state['exp_avg_sq'] = torch.zeros_like(p.data)

                    exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
                    b1, b2 = group['betas']
                    state['step'] += 1

                    # L2 penalty. Gotta add to Gradient as well.
                    if group['weight_decay'] != 0:
                        grad = grad.add(group['weight_decay'], p.data)

                    # Momentum
                    exp_avg = torch.mul(exp_avg, b1) + (1 - b1) * grad
                    # RMS
                    exp_avg_sq = torch.mul(exp_avg_sq, b2) + (1 - b2) * (grad * grad)

                    denom = exp_avg_sq.sqrt() + group['eps']

                    bias_correction1 = 1 / (1 - b1 ** state['step'])
                    bias_correction2 = 1 / (1 - b2 ** state['step'])

                    adapted_learning_rate = group['lr'] * bias_correction1 / math.sqrt(bias_correction2)

                    p.data = p.data - adapted_learning_rate * exp_avg / denom

                    if state['step'] % 10000 == 0:
                        print("group:", group)
                        print("p: ", p)
                        print("p.data: ", p.data)  # W = p.data

            return loss
I think I implemented everything correctly; however, the loss graph of my implementation is very spiky compared to that of torch.optim.Adam.
My ADAM implementation's loss graph (below)
torch.optim.Adam loss graph (below)
If someone could tell me what I am doing wrong, I'll be very grateful.
For the full code including data, graph (super easy to run): https://github.com/byorxyz/AMS_pytorch/blob/master/AdamFails_1dConvex.ipynb
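As a sanity reference, here is a minimal NumPy version of the bias-corrected Adam update on a 1-D quadratic (this mirrors the textbook algorithm, not the class above; note that the moment estimates m and v must be carried over between steps rather than recreated):

```python
import numpy as np

def adam_minimize(grad_fn, x, steps=200, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Textbook Adam on a scalar parameter; m and v persist across steps."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = b1 * m + (1 - b1) * g        # first moment (momentum)
        v = b2 * v + (1 - b2) * g * g    # second moment (RMS term)
        m_hat = m / (1 - b1 ** t)        # bias correction
        v_hat = v / (1 - b2 ** t)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

# minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3)
x_opt = adam_minimize(lambda x: 2 * (x - 3), x=0.0)
```

Comparing a run of this against torch.optim.Adam on the same toy problem can help isolate whether the spikes come from the update rule itself or from how the optimizer state is stored between `step()` calls.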

Domain adaptation - Adversarial Dropout Regularization - Keras implementation of custom loss
I am trying to implement in Keras the domain adaptation method Adversarial Dropout Regularization from the paper https://arxiv.org/abs/1711.01575
The critic is a classifier with dropout, and it is updated based on losses from more than one batch. The idea is to get two outputs for one target batch (dropout stays enabled at test time) and measure the KL divergence between the outputs (they will not be the same because of the dropout). In addition, the critic loss also depends on the source batch loss. The final loss value is the source batch loss minus the discrepancy between the outputs of the target batch (Eq. 3 and 4, page 4).
I don't know how to implement this loss, because it depends on the outputs of three batches (two target and one source).
I tried to make one batch consisting of one source batch and two target batches (the same target batch twice), and then in the custom loss to split y_pred and calculate the measures. Here is my custom loss:
    def custom_loss(self, y_true, y_pred):
        pred_source = y_pred[:self.batchSize, :]
        pred_tgt_1 = y_pred[self.batchSize+1:2*self.batchSize, :]
        pred_tgt_2 = y_pred[2*self.batchSize+1:, :]
        discp = discrepancy(pred_tgt_1, pred_tgt_2)
        loss = keras.losses.categorical_crossentropy(y_true[:self.batchSize], pred_source)
        return loss - discp
I got the error Incompatible shapes: [64] vs. [192] (the batch size is 64).
Dropout layers are added like this :
cl = layers.Dropout(0.5)(cl,training=True)
Can someone help me with this?
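Two details in that sketch are worth checking (illustrated in plain NumPy below): contiguous slices of a stacked batch should not have the `+1` offset, and the per-sample loss Keras returns must match the full stacked batch's length, which is likely where the [64] vs [192] mismatch comes from. The slicing convention alone:

```python
import numpy as np

batch = 4
# A stacked prediction tensor: one source batch followed by two target batches.
y_pred = np.concatenate([np.full((batch, 2), 0.0),
                         np.full((batch, 2), 1.0),
                         np.full((batch, 2), 2.0)])

# Contiguous thirds -- note there is no +1: row `batch` already belongs
# to the second chunk, so [batchSize+1:...] would silently drop a row.
pred_source = y_pred[:batch]
pred_tgt_1 = y_pred[batch:2 * batch]
pred_tgt_2 = y_pred[2 * batch:]
```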

Keras backend Custom Loss Function
I am trying to calculate (tp + tn) / total_samples as my custom loss function. I know how to do this with lists and list comprehensions, but I guess there is no way to convert y_true and y_pred to lists. The code I have written so far is:
    def CustomLossFunction(y_true, y_pred):
        y_true_mask_less_zero = K.less(y_true, 0)
        y_true_mask_greater_zero = K.greater(y_true, 0)
        y_pred_mask_less_zero = K.less(y_pred, 0)
        y_pred_mask_greater_zero = K.greater(y_pred, 0)
        t_zeros = K.equal(y_pred_mask_less_zero, y_true_mask_less_zero)
        t_ones = K.equal(y_pred_mask_greater_zero, y_true_mask_greater_zero)
Now I need to count the TRUEs in t_zeros and t_ones, add them up, and divide by the total number of samples.
I got an error on this line:
sum_of_true_negatives = K.sum(t_zeros)
Value passed to parameter 'input' has DataType bool not in list of allowed values: float32, float64, int32, uint8, int16
Questions:
- Is there any built-in loss function for (tp+tn)/total_samples?
- If not, how can I calculate it using the Keras backend?
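The error says the sum will not accept booleans; the usual pattern is to cast the comparison result to float first (with the Keras backend that would be `K.cast(t_zeros, 'float32')` before `K.sum`). A NumPy illustration of the same computation:

```python
import numpy as np

def accuracy_from_masks(y_true, y_pred):
    """(tp + tn) / total_samples via a boolean mask cast to float.
    For nonzero labels, sign agreement covers both tp (both > 0)
    and tn (both < 0)."""
    agree = (y_pred > 0) == (y_true > 0)
    return agree.astype(np.float32).sum() / y_true.size  # cast bool -> float first

y_true = np.array([1.0, -1.0, 1.0, -1.0])
y_pred = np.array([0.5, -0.2, -0.3, -0.9])
acc = accuracy_from_masks(y_true, y_pred)  # 3 of 4 signs match -> 0.75
```

Note also that this quantity (accuracy) is not differentiable, so it is usable as a Keras metric but not as a training loss.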

How can I implement marginal loss?
I am trying to implement the marginal loss introduced in the paper [1]. So far this is what I have done.
    def marginal_loss(model1, model2, y, margin, threshold):
        margin_ = 1 / (tf.pow(margin, 2) - margin)
        tmp = (1. - y)
        euc_dist = tf.sqrt(tf.reduce_sum(tf.pow(model1 - model2, 2), 1, keep_dims=True))
        thres_dist = threshold - euc_dist
        mul_val = tf.multiply(tmp, thres_dist)
        sum_ = tf.reduce_sum(mul_val)
        return tf.multiply(margin_, sum_)
However, after some epochs, the value goes to NaN. I am not sure what mistake I made. Furthermore, I used 1 instead of the epsilon described in the paper because its value was not clear. Similarly, the exact threshold value is also unknown.
Thanks for any help.
[1] https://ibug.doc.ic.ac.uk/media/uploads/documents/deng_marginal_loss_for_cvpr_2017_paper.pdf
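A common source of NaNs in a distance-based loss like this is the gradient of `tf.sqrt` at exactly zero (when two embeddings coincide). A standard guard, sketched in NumPy here with made-up data, is to add a small constant inside the square root; the same idea applies to the `tf.sqrt` call above:

```python
import numpy as np

def safe_euclidean(a, b, eps=1e-12):
    """Pairwise Euclidean distance with an epsilon inside the sqrt, so the
    derivative stays finite when the two embeddings coincide."""
    return np.sqrt(np.sum((a - b) ** 2, axis=1) + eps)

a = np.array([[1.0, 2.0], [0.0, 0.0]])
b = np.array([[1.0, 2.0], [3.0, 4.0]])
d = safe_euclidean(a, b)  # first pair is identical, so its distance is ~0
```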

Value Function estimation with TD learning and Deep Neural Networks
I'm trying to predict a value function using TD Learning and Neural Networks.
Knowing that the TD target is r_{t+1} + γV(S_{t+1}), does it make sense to use it as the y target, and the value predicted for state S_{t}, V(S_{t}), as the y predicted, and use this setup to train a network? For example, using both of these terms in model.fit() in the Keras framework. Also, what kind of deep neural network would you recommend for this kind of problem?
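That is indeed the standard semi-gradient TD(0) setup: the network regresses V(S_t) toward the target r_{t+1} + γ·V(S_{t+1}), with the target held fixed (no gradient flows through V(S_{t+1})). A NumPy sketch of building the regression targets for a batch of transitions; the array names are made up:

```python
import numpy as np

def td_targets(rewards, next_values, dones, gamma=0.99):
    """TD(0) regression targets: r + gamma * V(s'), zeroed past terminal
    states. These become the `y` passed to model.fit(states, y); the target
    is treated as a constant."""
    return rewards + gamma * next_values * (1.0 - dones)

rewards = np.array([1.0, 0.0, 2.0])
next_values = np.array([0.5, 1.0, 0.0])
dones = np.array([0.0, 0.0, 1.0])  # third transition ends the episode
y = td_targets(rewards, next_values, dones)
```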

Is Monte Carlo Tree Search policy or value iteration (or something else)?
I am taking a Reinforcement Learning class and I didn't understand how to combine the concepts of policy iteration/value iteration with Monte Carlo (and also TD/SARSA/Q-learning). In the table below, how can the empty cells be filled: should/can it be a binary yes/no, some string description, or is it more complicated?

Markov Decision process with single actions
Hi guys, I am trying to solve some questions about an MRP (i.e. a Markov Decision Process with only one possible action at each state). The setup is as follows:
There are two states (a and b); stepping to a is terminal.
All rewards are zero; the discount for the step b -> b is 1, and the discount for all other steps is zero.
The possible transitions are: a -> b (probability 1), b -> a (probability p) and b -> b (probability 1 - p).
The first question I have is whether or not it's true that the optimal values for each state here are zero. If not, how do you derive them?
The second question: if we want a parameter lambda such that lambda * phi(a) and lambda * phi(b) approximate the optimal values at states a and b, and we attempt to approximate such a lambda by means of TD(0), how can I find the expected value of lambda after one episode of updating, in terms of p? I.e. once we go from state b back to a, the episode is over.
Thanks.

How to dynamically remove nodes in JavaFx
    @FXML
    AnchorPane gamePane;

    public void gameStart() {
        if (!Started) {
            board = new Board();
            stones = new Circle[8][8];
            newTurn();
            applyBoard();
            Started = true;
        } else {
            DestroyBoard(); // <- Erase all the stones
            gamePane.getChildren().remove(stones[3][3]);
            board = new Board();
            stones = new Circle[8][8];
            newTurn();
            applyBoard();
        }
    }

    public void applyBoard() {
        for (int i = 0; i < board.boardsize; i++) {
            for (int j = 0; j < board.boardsize; j++) {
                if (board.board[i][j] != board.EMPTY) {
                    if (board.board[i][j] == board.BLACK) {
                        gamePane.getChildren().remove(stones[i][j]);
                        stones[i][j] = new Circle(155 + 90 * j, 85 + 90 * i, 40);
                        stones[i][j].setFill(Color.BLACK);
                        gamePane.getChildren().add(stones[i][j]);
                    } else if (board.board[i][j] == board.WHITE) {
                        stones[i][j] = new Circle(155 + 90 * j, 85 + 90 * i, 40);
                        stones[i][j].setFill(Color.WHITE);
                        gamePane.getChildren().add(stones[i][j]);
                    }
                }
            }
        }
    }

    public void DestroyBoard() { // <- Test function, did not work!!
        gamePane.getChildren().remove(stones[3][3]);
    }
I tried to make it so that if the start button is pressed again, all stones on the board are erased and a new game starts. As a first step I tried to erase one stone, but I can't delete any stone on the board. What should I do to solve this?

Minimax Algorithm For Reversi Game From the Ground up
I've been working all day and night researching online how to program a minimax algorithm for Reversi (Othello) in Python. I found a lot of sources that don't explain some basics. Can anybody tell me how to program it as simply as possible? Assume the following: I already have functions that return: 1. the list of valid moves, 2. whether the next move is correct (T/F).
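Given those helpers, the core of minimax is short. A generic sketch follows; the game at the bottom is a made-up toy so the snippet runs standalone, while in Reversi the `valid_moves`, `apply_move`, and `evaluate` callables would come from your existing board code:

```python
def minimax(state, depth, maximizing, valid_moves, apply_move, evaluate):
    """Plain minimax: recursively pick the move that maximizes (or minimizes)
    the evaluation, alternating players each ply."""
    moves = valid_moves(state)
    if depth == 0 or not moves:
        return evaluate(state), None
    best_move = None
    best_score = float('-inf') if maximizing else float('inf')
    for move in moves:
        score, _ = minimax(apply_move(state, move), depth - 1,
                           not maximizing, valid_moves, apply_move, evaluate)
        if (maximizing and score > best_score) or (not maximizing and score < best_score):
            best_score, best_move = score, move
    return best_score, best_move

# Toy game: the state is a number, a move adds or subtracts 1, and the
# evaluation is the number itself. At depth 2 the maximizer picks +1,
# because the minimizer replies to cancel the gain either way.
score, move = minimax(0, 2, True,
                      valid_moves=lambda s: [1, -1],
                      apply_move=lambda s, m: s + m,
                      evaluate=lambda s: s)
```

For Reversi, `evaluate` would typically be a disc count or positional weight matrix, and adding alpha-beta pruning is the natural next step once this works.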

Tensorflow loss is already low
I'm doing an AI with reinforcement learning and I'm getting weird results; the loss looks like this: Tensorflow loss: https://imgur.com/a/Twacm
And while it's training, after each game it plays against a random player and then against a player with a weighted matrix, but the results go up and down: results: https://imgur.com/a/iGuu2
Basically I'm doing a reinforcement learning agent that learns to play Othello, using ε-greedy exploration, experience replay and deep networks with Keras on top of Tensorflow. I tried different activations such as sigmoid, relu and, in the images shown above, tanh. All of them have similar loss, but the results are a bit different. In this example the agent is learning from 100k professional games. Here is the architecture, with the default learning rate of 0.005:
    model.add(Dense(units=200, activation='tanh', input_shape=(64,)))
    model.add(Dense(units=150, activation='tanh'))
    model.add(Dense(units=100, activation='tanh'))
    model.add(Dense(units=64, activation='tanh'))
    optimizer = Adam(lr=lr, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
    model.compile(loss=LOSS, optimizer=optimizer)
Original code: https://github.com/JordiMD92/thellia/tree/keras
So, why do I get these results? My input is 64 neurons (an 8*8 matrix), with 0 for an empty square, 1 for a black square and -1 for a white square. Is it bad to use negative inputs?
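Negative inputs are fine for a tanh network, but a common alternative worth comparing is a one-hot / two-plane encoding, which avoids forcing black and white onto opposite ends of a single axis. A NumPy sketch of that encoding for a board with values {-1, 0, 1}:

```python
import numpy as np

def encode_planes(board):
    """Turn an 8x8 board of {-1, 0, 1} into two binary planes (black, white),
    flattened to a 128-long input vector, instead of one signed 64-long one."""
    black = (board == 1).astype(np.float32)
    white = (board == -1).astype(np.float32)
    return np.concatenate([black.ravel(), white.ravel()])

board = np.zeros((8, 8))
board[3, 3] = 1    # a black stone
board[3, 4] = -1   # a white stone
x = encode_planes(board)
```

With this encoding the input size doubles to 128, so the first Dense layer's `input_shape` would change accordingly.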