Implementing a loss function (MSVE) in Reinforcement learning
I am trying to build a temporal difference learning agent for Othello. While the rest of my implementation seems to run as intended, I am wondering about the loss function used to train my network. In Sutton's book "Reinforcement Learning: An Introduction", the Mean Squared Value Error (MSVE) is presented as the standard loss function. It is basically a mean squared error weighted by the on-policy distribution: Sum over all states s of ( onPolicyDistribution(s) * [V(s) - V'(s,w)]² )
My question is now: how do I obtain this on-policy distribution when my policy is an ε-greedy function of a learned value function? Is it even necessary, and what's the issue if I just use an MSELoss instead?
I'm implementing all of this in pytorch, so bonus points for an easy implementation there :)
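One observation worth noting: if the training targets are computed on states visited while following the current policy, then a plain MSE over those sampled states already weights errors by the on-policy distribution implicitly, because frequently visited states simply appear more often in the batch. A minimal PyTorch sketch of that idea (the network shape and all names here are illustrative, not from the question):

```python
import torch
import torch.nn as nn

# Hypothetical value network: 64-square Othello board encoding -> scalar value.
value_net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Pretend these states were collected by running the current epsilon-greedy
# policy: sampling states from trajectories is what realizes the on-policy
# weighting, so no explicit distribution needs to be computed.
states = torch.randn(32, 64)   # batch of visited board encodings (stand-in data)
targets = torch.randn(32, 1)   # e.g. bootstrapped TD targets (stand-in data)

optimizer.zero_grad()
loss = loss_fn(value_net(states), targets.detach())  # detach: targets are fixed
loss.backward()
optimizer.step()
```

The point is that MSELoss over on-policy samples is a sample-based estimate of the MSVE.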
See also questions close to this topic

How can there be a negative change in cumulative vehicle delay?
Currently I am doing a project on traffic signal control at an intersection using reinforcement learning, and I have been going through many research papers on the topic. Most papers use the change in cumulative vehicle delay between actions as the reward for the agent.
A negative change is used as a punishment and a positive change as a reward. The papers say that a positive change implies that applying the action reduced the total cumulative delay by that amount, and a negative change means it increased the total cumulative delay by the same amount.
I don't understand how a negative change in cumulative delay after applying an action can mean the total cumulative vehicle delay increased, and vice versa. Can anyone please help me understand the concept?
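The sign convention usually comes from defining the reward as the old delay minus the new delay, so a drop in delay yields a positive reward. A tiny sketch of that convention (variable names are illustrative, not taken from any particular paper):

```python
def delay_reward(prev_cumulative_delay, new_cumulative_delay):
    """Reward = previous delay - new delay.

    Positive when the action reduced cumulative delay,
    negative when the action made it grow."""
    return prev_cumulative_delay - new_cumulative_delay

# Delay dropped from 120s to 100s -> positive reward of 20.
assert delay_reward(120.0, 100.0) == 20.0
# Delay grew from 100s to 130s -> negative reward of -30.
assert delay_reward(100.0, 130.0) == -30.0
```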

Is it possible to use a neural network as an evaluation function in a board game like chess, and train it by reinforcement learning?
I am building an AI agent for Gomoku (a board game), and I want to use a neural network as the evaluation function for the minimax algorithm. The problem is that I cannot get enough data from the internet, so I want to use reinforcement learning to train the network. However, I am not an expert in reinforcement learning. I wonder if my idea is feasible, and how would I do it with reinforcement learning?

Describe state space in reinforcement learning
I'm doing a reinforcement learning task where I have an environment (consisting of grass, forest, dirt and water), a predator and a prey. My prey is trying to keep away from the predator for as long as possible, meanwhile consuming water and grass to survive. I have 2 functions I must edit,
getStateDesc <- function(simData, preyId)
and getReward <- function(oldstate, action, newstate)
. I already have some states and rewards implemented by default, and my state space keeps a record of c(distance to predator, direction to predator, whether prey is on border)
states for the Q-learning algorithm. In the reward function, my prey is penalized based on its distance to the predator and on trying to move onto the border. I now want to add a state to check if my prey is in the forest (so it can hide), for which I have implemented the function isPreyInForest. I want to keep two states for this: if isPreyInForest == TRUE => state <- 1, if not state <- 2,
and reward my agent later based on this. The problem is that I cannot change the dimension of the state space c(distance, direction, border)
, because when I try to add a state to this, c(distance, direction, border, state),
and later run the simulation with qlearning(c(30, 4, 5, 2), maxtrials=100)
(notice that 30 here represents the max distance from the predator, 4 is the number of directions, and 5 is the border component, where the first 4 values are borders and 5 means the agent is not on a border), I get: Error in apply(Q, len + 1, "[", n) : dim(X) must have a positive length
. So, any idea how to expand the state space and pass good arguments to the qlearning function?
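One common way to extend a tabular Q representation with an extra state component is to add a dimension to the Q array and index it with the new feature. The internals of the R qlearning function are not shown in the question, so here is only an analogy in Python with made-up names:

```python
import numpy as np

# Old table: (distance, direction, border) x actions
n_actions = 4
q_old = np.zeros((30, 4, 5, n_actions))

# New table: add a forest dimension of size 2 (state 1 = in forest, 2 = not).
q_new = np.zeros((30, 4, 5, 2, n_actions))

def state_index(distance, direction, border, in_forest):
    """Map 1-based state components to 0-based array indices."""
    forest_state = 1 if in_forest else 2   # encoding as in the question
    return (distance - 1, direction - 1, border - 1, forest_state - 1)

idx = state_index(distance=10, direction=2, border=5, in_forest=True)
q_new[idx + (0,)] += 0.1   # update action 0 in that state
```

The error in the question suggests the qlearning call and the state descriptor disagree on the number of dimensions, so both sides would need the extra component.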
Problems of Pytorch installation on Ubuntu 17.10 (GPU)
I would like to use PyTorch and its GPU computations on my computer.
I have a computer running with Ubuntu 17.10. The computer (Alienware m17x) has two graphic cards:
- An integrated Intel Ivybridge Mobile
- An NVIDIA GeForce 675M.
In order to install PyTorch, I followed the instructions on the PyTorch website pytorch.org
1) I installed CUDA 9 with the deb file: https://developer.nvidia.com/cuda-downloads
=> Linux/x86_64/Ubuntu/17.04/deb (local)
2) I installed Pytorch using the conda command line: conda install pytorch torchvision cuda90 -c pytorch
Neither of these two steps returned any errors.
I restarted my computer. Apparently the two cards are detected:
$ lspci | grep -i vga
00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)
01:00.0 VGA compatible controller: NVIDIA Corporation GF114M [GeForce GTX 675M] (rev a1)
But apparently there is something wrong with the drivers or CUDA itself. nvidia-detector does not return anything:
$ nvidia-detector
none
And pytorch can not use cuda:
In [1]: import torch
In [2]: torch.cuda.is_available()
Out[2]: False
Could you help me? I can provide additional information if necessary, but I am not sure what could be relevant.

pytorch with exception: TypeError: 'Linear' object does not support indexing
I want to learn pytorch and I wrote the code as below:
import torch.nn as nn
import torch
import random
import torch.nn.functional as F
import numpy as np
from torch.autograd import Variable
import torch.optim


# each line in xs contains a and b; z = 2*a*(a+b) - b
def getZ(xs):
    zs = torch.ones(len(xs), 1)
    for i in range(len(xs)):
        z = 2 * xs[i, 0] * (xs[i, 0] + xs[i, 1]) - xs[i, 1]
        zs[i, 0] = z
    return zs


xs = torch.ones(10, 2)
xs = xs + 10 * torch.rand(10, 2)
zs = getZ(xs)
print(zs)

xs = Variable(xs)
zs = Variable(zs)


# define the network
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.l1 = torch.nn.Linear(2, 10)
        self.l2 = torch.nn.Linear(10, 10)
        self.l3 = torch.nn.Linear(10, 1)

    def forward(self, x):
        x = self.l1(x)
        # x = F.relu(x)
        x = self.l2(x)
        # x = F.relu(x)
        x = self.l3
        return x


cnn = CNN()
print(cnn)

# define optimizer and cost function
optimizer = torch.optim.SGD(cnn.parameters(), lr=0.1)
cost_func = torch.nn.MSELoss()

# begin training
for i in range(1000):
    print("train times: ", i)
    predict = cnn(xs)
    loss = cost_func(predict, zs)
    print("loss:", loss)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
My computer runs Windows 8.1 x64, and my pytorch was installed via conda with Python version 3.6.
I tried to run this code, but the IDE tells me there is an exception: TypeError: 'Linear' object does not support indexing
I don't know how to fix this. Could anyone help me? Thanks a lot.

Detaching only specific indices of hidden states in LSTM
I am using an LSTM with the A2C algorithm. That means I have multiple instances of agents and environments, each one of them being reset (agent failed/succeeded) at different times. The problem is that I want to "repack" (as in the pytorch LM example) only specific indices of the hidden states. However, trying to repack only part of the hidden state did not go well.
Here is some simple code to illustrate the problem:
async_reset = True


class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.lstm = nn.LSTM(10, 5, num_layers=1)

    def forward(self, input, hidden):
        return self.lstm(input, hidden)


net = SimpleModel()
optimizer = torch.optim.RMSprop(net.parameters())
input = Variable(torch.ones((1, 4, 10)))
hidden = None
last_time = time.time()
odd = True
for i in range(20000):
    output, hidden = net(input, hidden)
    res = torch.sum(output)
    # hidden = Variable(hidden[0].data), Variable(hidden[1].data)
    c, h = hidden
    if async_reset:
        if odd:
            c[:, :2] = Variable(c[:, :2].data)
            h[:, :2] = Variable(h[:, :2].data)
        else:
            c[:, 2:] = Variable(c[:, 2:].data)
            h[:, 2:] = Variable(h[:, 2:].data)
    else:
        c = Variable(c.data)
        h = Variable(h.data)
    print(odd)
    odd = not odd
    hidden = (c, h)
    print("==")
    optimizer.zero_grad()
    loss = torch.sum(output)
    loss.backward(retain_graph=True)
    print("%d: Took %.2f second" % (i, time.time() - last_time))
    last_time = time.time()
Note that when async_reset is False, the model runs very fast (<0.01 seconds per iteration), since there aren't any previous timesteps to propagate through. When async_reset is True, a different half of the batch is repacked each time. This means there should always be only one timestep the model needs to look back to, so it should still be very fast. But the results are different: it takes a lot of time (0.3 seconds after 230 runs).
I understand from this that the variable still carries older timesteps, probably because the whole variable is not detached from the gradients. Other methods, such as starting with a zero hidden state and assigning only the part of the batch that should not be reset, didn't help.
Is there any way to prevent this?
Thanks so much!
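One relevant detail: in-place assignment into a slice of a tensor does not cut the autograd graph for the full tensor; rebuilding the hidden state by concatenating detached and non-detached slices avoids that. A sketch of the idea using the modern detach() API, assuming batch entries 0 and 1 are the ones to reset (the exact slicing is mine, not from the question):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(10, 5, num_layers=1)
x = torch.ones(1, 4, 10)
h = torch.zeros(1, 4, 5)
c = torch.zeros(1, 4, 5)

out, (h, c) = lstm(x, (h, c))

# Detach only batch entries 0 and 1: concatenate a detached slice with the
# still-attached remainder instead of assigning into the tensor in place.
h = torch.cat([h[:, :2].detach(), h[:, 2:]], dim=1)
c = torch.cat([c[:, :2].detach(), c[:, 2:]], dim=1)

# The rebuilt tensors carry gradients only through the non-detached slice.
out2, _ = lstm(x, (h, c))
out2.sum().backward()
```

The design point is that torch.cat produces a fresh tensor whose graph includes only the slices you chose to keep attached, which is exactly the "partial repack" the question asks for.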

tensorflow word2vec loss function doesn't decrease
Following the official example, I implemented word2vec in tensorflow using tf.nn.nce_loss() as the loss function.
During training, the loss fluctuates all the time and I don't see it decreasing. Is this normal or not?
Because my dataset is really big, I chose 1 as the number of epochs. Should I increase the epoch count to see a decreasing loss?
If not, what can I do to make the loss decrease, or is this not important for getting good word vectors?
Thank you!

Small negative loss values with keras binary_crossentropy on tensorflow backend (tf.sigmoid_cross_entropy_with_logits)
I am training a model for multi-label classification using keras binary_crossentropy. I am using the tensorflow backend, so it calls tf.nn.sigmoid_cross_entropy_with_logits. The model converges gradually with the loss approaching almost zero, and then on further training it goes into the negative range. I am training it on a Tesla K80 GPU on an Ubuntu machine. After googling, I found the following explanations:
Target values outside the [0, 1] interval: https://github.com/keras-team/keras/issues/1917 and Keras: Binary_crossentropy has negative values. I verified that my target values are either 0 or 1. In fact, they are generated as:
yTrue = (yTrueContinuous > yThreshold).astype(int)
Also, the above links show much larger negative values; I am getting negative values whose magnitude is under 0.001.
Suspecting numerical stability issues with integer targets, I changed them to np.float32, but I kept getting small negative loss values:
yTrue = (yTrueContinuous > yThreshold).astype(np.float32)
Does anyone know why this is happening?
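For reference, tf.nn.sigmoid_cross_entropy_with_logits uses the numerically stable form max(x, 0) - x*z + log(1 + exp(-|x|)), which is provably non-negative for any logit x whenever the label z is in [0, 1]. A quick numerical check of that property (a sketch using the formula from the TensorFlow docs, implemented in numpy):

```python
import numpy as np

def sigmoid_ce_with_logits(logits, labels):
    """Stable form used by tf.nn.sigmoid_cross_entropy_with_logits:
    max(x, 0) - x*z + log(1 + exp(-|x|))."""
    x, z = logits, labels
    return np.maximum(x, 0) - x * z + np.log1p(np.exp(-np.abs(x)))

rng = np.random.default_rng(0)
x = rng.normal(scale=10, size=10000)
for z in (0.0, 1.0):
    losses = sigmoid_ce_with_logits(x, z)
    assert (losses >= 0).all()  # never negative for labels in {0, 1}
```

Since the per-element loss cannot be negative for labels in [0, 1], a genuinely negative reported value suggests checking whether some labels fall outside [0, 1] somewhere in the pipeline, or whether the logged quantity is really this loss.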

Proving metrics for multi-class classification
I am unsure how to solve this problem. I am brand new to machine learning. Could someone help direct me down the right path?
Here are some of my definitions.
- Label space: the output of the function
- Metric ρ: defines how well the function did
- Multi-class classification: place an object into a specific group based on its properties
I have no idea how to prove that ρ is a metric. Further, I don't really know how to define a metric in general.
Suppose X is a set that we're using as our label space. Define a metric ρ on the set X that captures the loss function concept we described for multi-class classification (i.e., treat X as the set of classes and define ρ as zero/one loss) and prove that ρ is a metric.
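As a reminder of what has to be shown: a metric ρ on X must satisfy non-negativity, identity of indiscernibles, symmetry, and the triangle inequality. For the zero/one loss the checks are short; a sketch of the standard argument (not specific to any textbook):

```latex
\[
\rho(x, y) \;=\; \begin{cases} 0 & \text{if } x = y, \\ 1 & \text{if } x \neq y. \end{cases}
\]
% Non-negativity: \rho(x,y) \in \{0,1\} \subseteq [0,\infty).
% Identity of indiscernibles: \rho(x,y) = 0 \iff x = y, by definition.
% Symmetry: x = y \iff y = x, so \rho(x,y) = \rho(y,x).
% Triangle inequality: if \rho(x,z) = 0 it holds trivially; if \rho(x,z) = 1
% then x \neq z, so y cannot equal both x and z, hence
% \rho(x,y) + \rho(y,z) \geq 1 = \rho(x,z).
```

This ρ is the "discrete metric" on X, which is exactly the zero/one loss viewed as a distance between predicted and true class.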

Reinforcement Learning: Q-learning and Q(λ) speed difference on Windy Grid World environment
Preface:
I have attempted to solve the WindyGridWorld env. Having implemented both the Q-learning and Q(λ) algorithms, the results are pretty much the same (I am looking at steps per episode).
Problem:
From what I have read, I believe that a higher lambda parameter should update more of the states leading up to the reward; therefore, the number of steps per episode should decrease much more dramatically than with regular Q-learning. This image shows what I am talking about.
Is this normal for this environment or have I implemented it wrong?
Code:
import matplotlib.pyplot as plt
import numpy as np
from lib.envs.windy_gridworld import WindyGridworldEnv
from collections import defaultdict

env = WindyGridworldEnv()


def epsilon_greedy_policy(Q, state, nA, epsilon):
    '''
    Create a policy in which epsilon dictates how likely it will
    take a random action.

    :param Q: links state -> action value (dictionary)
    :param state: state character is in (int)
    :param nA: number of actions (int)
    :param epsilon: chance it will take a random move (float)
    :return: probability of each action to be taken (list)
    '''
    probs = np.ones(nA) * epsilon / nA
    best_action = np.argmax(Q[state])
    probs[best_action] += 1.0 - epsilon
    return probs


def Q_learning_lambda(episodes, learning_rate, discount, epsilon, _lambda):
    '''
    Learns to solve the environment using Q(λ)

    :param episodes: Number of episodes to run (int)
    :param learning_rate: How fast it will converge to a point (float [0, 1])
    :param discount: How much future events lose their value (float [0, 1])
    :param epsilon: chance a random move is selected (float [0, 1])
    :param _lambda: How much credit to give states leading up to reward (float [0, 1])
    :return: x, y points to graph
    '''
    # Link state to action values
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    # Eligibility trace
    e = defaultdict(lambda: np.zeros(env.action_space.n))

    # Points to plot
    # number of episodes
    x = np.arange(episodes)
    # number of steps
    y = np.zeros(episodes)

    for episode in range(episodes):
        state = env.reset()

        # Select action
        probs = epsilon_greedy_policy(Q, state, env.action_space.n, epsilon)
        action = np.random.choice(len(probs), p=probs)

        for step in range(10000):
            # Take action
            next_state, reward, done, _ = env.step(action)

            # Select next action
            probs = epsilon_greedy_policy(Q, next_state, env.action_space.n, epsilon)
            next_action = np.random.choice(len(probs), p=probs)

            # Get update value
            best_next_action = np.argmax(Q[next_state])
            td_target = reward + discount * Q[next_state][best_next_action]
            td_error = td_target - Q[state][action]

            e[state][action] += 1

            # Update all states
            for s in Q:
                for a in range(len(Q[s])):
                    # Update Q value based on eligibility trace
                    Q[s][a] += learning_rate * td_error * e[s][a]

                    # Decay eligibility trace if best action is taken
                    if next_action is best_next_action:
                        e[s][a] = discount * _lambda * e[s][a]
                    # Reset eligibility trace if random action taken
                    else:
                        e[s][a] = 0

            if done:
                y[episode] = step
                e.clear()
                break

            # Update action and state
            action = next_action
            state = next_state

    return x, y
You can check out my Jupyter Notebook here if you would like to see the whole thing.

Replicating n-tuple strategy for playing 2048 with significantly inferior results: any guidance? Full code inside
I have created an agent for playing 2048 using the n-tuple approach (mimicked from the paper Temporal Difference Learning of N-Tuple Networks for the Game 2048). I have replicated their approach to the best of my knowledge, but the results are extremely underwhelming. My agent has trained for over 130,000 games but is still only hitting the 1024 tile about 7% of the time. I know training takes time, but according to the paper their agent won the game >80% of the time after only 100,000 games using the strategy I am attempting to replicate. I am using the approach with the 2 3-tuples and 2 2-tuples with symmetric sampling, as shown in Figure 10 of the paper. Does anyone have any guidance as to what I may be doing wrong? Any and all help is appreciated.
Python Code:
# TD-Afterstate
from random import randint, random
import numpy as np
from copy import copy
import sys
import time
from datetime import timedelta
from openpyxl import Workbook, load_workbook
import pickle
import math
import os.path


class Game:
    def __init__(self, board):
        self.height = 4
        self.width = 4
        self.actions = [0, 1, 2, 3]
        self.total_score = 0
        if board is not None:
            self.board = board

    def empty_tiles(self):
        # Identifies the coordinates of empty tiles
        return (self.board == 0).nonzero()

    def add_tile(self):
        # Inserts a new tile into an empty space
        zero_coords = self.empty_tiles()
        if len(zero_coords[0]) > 0:
            # Determine new tile value
            new_tile = 2 if random() > .1 else 4
            # Determine index to insert new tile
            rand_idx = randint(0, len(zero_coords[0]) - 1)
            # Insert new tile
            self.board[zero_coords[0][rand_idx], zero_coords[1][rand_idx]] = new_tile

    def starting_board(self):
        self.board = np.zeros((4, 4), dtype=np.int32)
        self.add_tile()
        self.add_tile()

    def print_board(self, nl=True):
        # Displays the board
        print("+----" * self.width + "+")
        for i in range(self.height):
            for j in range(self.width):
                if self.board[i][j] == 0:
                    print("     ", end='')
                else:
                    print("%4d " % (self.board[i][j]), end='')
            print("")
        print("+----" * self.width + "+")
        if nl:
            print("")
        sys.stdout.flush()

    def swipe(self, direction):
        # The swipe method is designed to swipe left - other directions
        # must be rotated first
        rotated_board = np.rot90(self.board, direction)
        # Variable for tracking points to add to total score
        new_points = 0
        for row in rotated_board:
            # tile1 and tile2 are adjacent tiles
            tile1 = 0
            idx = 0  # Index
            nz = np.append(row[row != 0], -1)  # Nonzero values
            # Parse through nonzero tiles
            for tile2 in nz:
                if tile2 == -1:
                    if tile1 != 0:
                        row[idx] = tile1
                        idx += 1
                # Compare next tiles if one is empty
                elif tile1 == 0:
                    tile1 = tile2
                # Tiles are identical
                elif tile1 == tile2:
                    combined_tile = tile1 + tile2
                    row[idx] = combined_tile
                    new_points += combined_tile
                    self.total_score += combined_tile
                    idx += 1
                    tile1 = 0
                # Different tiles
                else:
                    row[idx] = tile1
                    tile1 = tile2
                    idx += 1
            # Mark empty tiles in the row
            while idx < 4:
                row[idx] = 0
                idx += 1
        return new_points

    def done(self):
        """
        Determines if the board admits further moves.
        """
        if (self.board == 0).any():
            return False
        if (self.board[1:] == self.board[:-1]).any():
            return False
        if (self.board[:, 1:] == self.board[:, :-1]).any():
            return False
        return True


def tups(board):
    # Returns the exponent of the tiles for each tuple for symmetric sampling
    all_tups = []
    exp_board = board.copy()
    nz = exp_board.nonzero()
    exp_board[nz] = np.log2(exp_board[nz])
    boards = []
    t_board = np.transpose(exp_board)
    for i in range(4):
        boards.append(np.rot90(exp_board, i))
        boards.append(np.rot90(t_board, i))
    for board in boards:
        tup1 = tuple(board[0:3, 0:2].flatten())
        tup2 = tuple(board[0:3, 1:3].flatten())
        tup3 = tuple(board[:, 2])
        tup4 = tuple(board[:, 3])
        all_tups.append((tup1, tup2, tup3, tup4))
    return all_tups


def f(state, V1, V2, V3, V4):
    # f approximates the value function V
    all_tups = tups(state)
    score = 0
    for tup_set in all_tups:
        score = score + V1[tup_set[0]]
        score = score + V2[tup_set[1]]
        score = score + V3[tup_set[2]]
        score = score + V4[tup_set[3]]
    return score


def choose_action(state, V1, V2, V3, V4):
    Vs = []
    for action in range(4):
        state_copy = copy(state)
        swiped_game = Game(state_copy)
        reward = swiped_game.swipe(action)
        n_state = copy(swiped_game.board)
        if (n_state != state).any():
            score = reward + f(n_state, V1, V2, V3, V4)
            Vs.append(score)
        else:
            Vs.append(-math.inf)
    action_order = list(reversed(np.argsort(Vs)))
    action = action_order[0]
    return action


def learn_evaluation(V1, V2, V3, V4, n_state, nn_state, learning_rate):
    n_action = choose_action(nn_state, V1, V2, V3, V4)
    temp_game = Game(nn_state)
    n_reward = temp_game.swipe(n_action)
    nnn_state = copy(temp_game.board)
    f_nnn_state = f(nnn_state, V1, V2, V3, V4)
    f_n_state = f(n_state, V1, V2, V3, V4)
    all_tups = tups(n_state)
    for tup_set in all_tups:
        V1[tup_set[0]] = V1[tup_set[0]] + learning_rate * (n_reward + f_nnn_state - f_n_state)
        V2[tup_set[1]] = V2[tup_set[1]] + learning_rate * (n_reward + f_nnn_state - f_n_state)
        V3[tup_set[2]] = V3[tup_set[2]] + learning_rate * (n_reward + f_nnn_state - f_n_state)
        V4[tup_set[3]] = V4[tup_set[3]] + learning_rate * (n_reward + f_nnn_state - f_n_state)


def create_workbook(path):
    wb = Workbook()
    ws = wb.create_sheet(title="Model Output")
    wb.save(filename=path)


# paths
path_model_output = "C:...\\2048\\N_tuples_symm_reward.xlsx"
path_base_V1 = "C:...\\2048\\V_symm\\V1r\\V1r_"
path_base_V2 = "C:...\\2048\\V_symm\\V2r\\V2r_"
path_base_V3 = "C:...\\2048\\V_symm\\V3r\\V3r_"
path_base_V4 = "C:...\\2048\\V_symm\\V4r\\V4r_"

if not os.path.isfile(path_model_output):
    create_workbook(path_model_output)

# Learning
old_time = time.time()
V1 = np.zeros(shape=(11, 11, 11, 11, 11, 11), dtype=np.int16)
V2 = np.zeros(shape=(11, 11, 11, 11, 11, 11), dtype=np.int16)
V3 = np.zeros(shape=(11, 11, 11, 11), dtype=np.int16)
V4 = np.zeros(shape=(11, 11, 11, 11), dtype=np.int16)
n_2048 = 0
n_1024 = 0
n_512 = 0
n_256 = 0
num_games = 1000000
learning_enabled = True
learning_rate = 0.0025

wb = load_workbook(filename=path_model_output)
ws = wb.active

for game_num in range(1, num_games + 1):
    game = Game(None)
    game.starting_board()
    moves = 0
    score = 0
    while not game.done() and np.amax(game.board) < 2048:
        state = copy(game.board)
        # Choose action
        action = choose_action(state, V1, V2, V3, V4)
        # Find reward, n_state, nn_state
        reward = game.swipe(action)
        n_state = copy(game.board)
        game.add_tile()
        nn_state = copy(game.board)
        # Update LUTs
        if learning_enabled:
            # symmetric_learning(V, all_tups, n_state, nn_state, learning_rate)
            learn_evaluation(V1, V2, V3, V4, n_state, nn_state, learning_rate)
        score = score + reward
        state = nn_state
    max_tile = np.amax(game.board)
    if max_tile == 256:
        n_256 = n_256 + 1
    elif max_tile == 512:
        n_512 = n_512 + 1
    elif max_tile == 1024:
        n_1024 = n_1024 + 1
    elif max_tile == 2048:
        n_2048 = n_2048 + 1
    if game_num % 100 == 0:
        t = round(time.time() - old_time, 2)
        old_time = time.time()
        print("Game Number: {}; 2048: {}; 1024: {}; 512: {}, 256: {}; Time Elapsed: {}".format(
            game_num, str(n_2048) + "%", str(n_1024) + "%", str(n_512) + "%",
            str(n_256) + "%", str(timedelta(seconds=t))))
        n_2048 = 0
        n_1024 = 0
        n_512 = 0
        n_256 = 0
    if game_num % 1000 == 0:
        path1 = path_base_V1 + str(game_num) + ".pickle"
        pickle.dump(V1, open(path1, "wb"))
        path2 = path_base_V2 + str(game_num) + ".pickle"
        pickle.dump(V2, open(path2, "wb"))
        path3 = path_base_V3 + str(game_num) + ".pickle"
        pickle.dump(V3, open(path3, "wb"))
        path4 = path_base_V4 + str(game_num) + ".pickle"
        pickle.dump(V4, open(path4, "wb"))
        results = (game_num, str(n_2048) + "%", str(n_1024) + "%", str(n_512) + "%", str(n_256) + "%")
        ws.append(results)
        wb.save(filename=path_model_output)

How do weights update for n-tuple networks?
I am currently reading through Temporal Difference Learning of N-Tuple Networks for the Game 2048. I am trying to implement my own n-tuple network, but I do not understand how the weights/values in the look-up table (LUT) are updated.
The paper states that V updates according to V(s) ← V(s) + α(r + V(s'') − V(s)), where V is approximated by V = sum(LUT_i), and LUT_i is the value of the i-th n-tuple from the look-up table given the current state. So V is a sum of values taken from a table; I don't understand how the values in the LUT itself are updated.
Any and all guidance is appreciated, thanks!
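One common scheme is that the scalar TD error is applied to every LUT entry that contributed to the sum, each scaled by the learning rate. A minimal sketch of that idea (the table layout, feature extractor, and all names are illustrative, not taken from the paper):

```python
import numpy as np

# Hypothetical setup: 2 tuples, each indexing a small LUT.
luts = [np.zeros(16), np.zeros(16)]

def active_indices(state):
    """Toy feature extractor: one LUT index per tuple (illustrative only)."""
    return [state % 16, (state * 7) % 16]

def value(state):
    # V(s) is the sum of the LUT entries the state activates.
    return sum(lut[i] for lut, i in zip(luts, active_indices(state)))

def td_update(state, reward, next_value, alpha=0.1):
    # One scalar TD error, added to every contributing LUT entry.
    delta = reward + next_value - value(state)
    for lut, i in zip(luts, active_indices(state)):
        lut[i] += alpha * delta

td_update(state=3, reward=1.0, next_value=0.0)
```

After this single update, each of the two contributing entries holds α·δ = 0.1, so V(3) has moved from 0 toward the target by 2·α·δ = 0.2.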

Tensorflow loss is already low
I'm doing an AI with reinforcement learning and I'm getting weird results; the loss looks like this: Tensorflow loss: https://imgur.com/a/Twacm
And while it's training, after each game it plays against a random player and then against a player with a weighted matrix, but the results go up and down: results: https://imgur.com/a/iGuu2
Basically I'm building a reinforcement learning agent that learns to play Othello, using ε-greedy exploration, experience replay and deep networks with Keras on top of Tensorflow. I tried different activations such as sigmoid, relu and, in the images shown above, tanh. They all have similar loss, but the results are a bit different. In this example the agent is learning from 100k professional games. Here is the architecture, with the default learning rate of 0.005:
model.add(Dense(units=200, activation='tanh', input_shape=(64,)))
model.add(Dense(units=150, activation='tanh'))
model.add(Dense(units=100, activation='tanh'))
model.add(Dense(units=64, activation='tanh'))

optimizer = Adam(lr=lr, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
model.compile(loss=LOSS, optimizer=optimizer)
Original code: https://github.com/JordiMD92/thellia/tree/keras
So, why do I get these results? My input is 64 neurons (an 8*8 matrix), with 0 for an empty square, 1 for a black square and -1 for a white square. Is it bad to use negative inputs?

Why does my implementation of negamax alpha-beta not work?
I am working on a C++ negamax implementation with alpha-beta pruning, but it seems to not be working properly, and I am stumped. It even returns different values from the starting position, which should have identical values for every move. The method is called by the following code:
score = alphaBeta(-INFINITY, INFINITY, depth, testBoard, White);
Here is my code:
float alphaBeta(float alpha, float beta, int depthleft, BitBoards currentBoard, int color) {
    // Negamax implementation of alpha-beta search:
    // Prepare copy of board
    currentBoards[depthleft] = currentBoard;
    // If at leaf node, return heuristic value
    if (depthleft == 0) {
        return evaluation(&currentBoard);
    }
    // Are there any moves that can be played?
    possibleMoves = FindMoves(color, &currentBoard);
    if (possibleMoves == 0x0ull) {
        // If there are no moves that can be played, pass for a turn.
        color = -color;
        possibleMoves = FindMoves(color, &currentBoard);
        // Can the opponent play any moves?
        if (possibleMoves == 0x0ull) {
            // If the opponent cannot play anything either, return outcome of game.
            if (Hammingweight(currentBoard.BlackPieces) > Hammingweight(currentBoard.WhitePieces)) {
                return 100000;
            }
            if (Hammingweight(currentBoard.BlackPieces) < Hammingweight(currentBoard.WhitePieces)) {
                return -100000;
            } else {
                return 0;
            }
        }
    }
    // Find total possible moves and loop through them.
    moves = Hammingweight(possibleMoves);
    for (uint64b j = 0; j < moves; j++) {
        // Choose next move to test.
        moveSpot = FindNextMove(possibleMoves);
        possibleMoves &= ~moveSpot;
        // Prepare scratch work board
        testBoard = currentBoards[depthleft];
        // Make the move being considered
        MakeMove(color, moveSpot, &testBoard);
        // Find alpha-beta heuristic value of next node
        score = -alphaBeta(-beta, -alpha, depthleft - 1, testBoard, -color);
        if (score >= beta) {
            // Fail hard beta cutoff
            return beta;
        }
        if (score > alpha) {
            alpha = score;
        }
    }
    return alpha;
}
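For comparison, the sign conventions of negamax with alpha-beta can be shown compactly on a toy game tree; this is a generic sketch in Python, with nothing taken from the bitboard code above:

```python
import math

# Toy game tree: leaves hold scores from the maximizing player's view.
tree = {"root": ["a", "b"], "a": [3, 5], "b": [6, 1]}

def negamax(node, alpha, beta, color):
    """Negamax with alpha-beta: each recursion flips sign, bounds and color."""
    if isinstance(node, int):          # leaf
        return color * node            # score from the current player's view
    best = -math.inf
    for child in tree[node]:
        # The three negations mirror: score = -alphaBeta(-beta, -alpha, ..., -color)
        score = -negamax(child, -beta, -alpha, -color)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:              # beta cutoff
            break
    return best

# Root is a max node over min nodes: max(min(3, 5), min(6, 1)) = 3
print(negamax("root", -math.inf, math.inf, 1))
```

The key invariant is that every recursive call negates the returned score and swaps and negates the (alpha, beta) window; dropping any one of those negations makes the search return inconsistent values, which matches the symptom described in the question.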

java reversi/othello AI: changed positions can't be saved in the array
The GUI is given, but its code can't be seen. According to the order parameter, I can be the KIPlayer or a human player. The KIPlayer (AI player) implements the interface Player. Inside Player there are two methods: init() and nextMove().
My question is about nextMove().
My logic for nextMove():
public Move nextMove(Move prevMove, long tOpponent, long t): else {
save prevMove from the rival in the array
save the changed positions of prevMove in the array // this works
save my own move in the array
save the changed positions of my own move in the array // here it doesn't work; in the code it is: change(bestMove.x, bestMove.y, rival);
} return bestMove;
Without the change(...) call here it works, but the changed positions are never saved, so this AI player will lose. If the change(...) call stays here, then from this step on no move works at all in the console. How do I fix it? Where should I put the change(...) call (for KIPlayer) in my code?
package ki;

import szte.mi.Player;
import szte.mi.Move;
import szte.mi.*;
import java.util.ArrayList;
import java.util.List;

public class KIPlayer implements Player {
    public int[][] array;
    public int myself;
    public int rival;
    int sum = 0;
    int bestX = -1;
    int bestY = -1;
    Move bestMove;

    public void init(int order, long t, java.util.Random rnd) {
        this.array = new int[8][8];
        array[3][3] = 2;
        array[3][4] = 1;
        array[4][3] = 1;
        array[4][4] = 2;
        if (order == 0) {
            myself = 1; // black
            rival = 2;
        } else if (order == 1) {
            myself = 2; // white
            rival = 1;
        }
        System.out.println("Init called" + "myself is" + myself);
    }

    public Move nextMove(Move prevMove, long tOpponent, long t) {
        if (prevMove == null) {
            for (int i = 0; i < 8; i++) {
                for (int j = 0; j < 8; j++) {
                    if (array[3][3] == 2 && array[3][4] == 1 && array[4][3] == 1
                            && array[4][4] == 2 && array[i][j] == 0) {
                        // the first move
                        myself = 1;
                        rival = 2;
                    } else {
                        switchPlayer(myself);
                    }
                }
            }
        } else {
            // rival
            array[prevMove.x][prevMove.y] = this.rival;
            change(prevMove.x, prevMove.y, this.rival);
            // KI / AI
            bestMove = legalMove(rival, myself);
            change(bestMove.x, bestMove.y, rival); // problem here! it doesn't work
            // here is the question
        }
        return bestMove;
        // return legalMove(this.rival, this.myself);
    }

    public void switchPlayer(int myself) {
        if (myself == 1) {
            myself = 2;
            rival = 1;
        } else if (myself == 2) {
            myself = 1;
            rival = 2;
        }
    }

    public int getWeight(int x, int y) {
        int weight[][] = new int[][]{
            {90, 60, 10, 10, 10, 10, 60, 90},
            {60, 80, 5, 5, 5, 5, 80, 60},
            {10, 5, 1, 1, 1, 1, 5, 10},
            {10, 5, 1, 1, 1, 1, 5, 10},
            {10, 5, 1, 1, 1, 1, 5, 10},
            {10, 5, 1, 1, 1, 1, 5, 10},
            {60, 80, 5, 5, 5, 5, 80, 60},
            {90, 60, 10, 10, 10, 10, 60, 90}};
        return weight[x][y];
    }

    public Move legalMove(int rival, int myself) {
        int Max = -500;
        for (int x = 0; x < 8; x++) {
            for (int y = 0; y < 8; y++) {
                if (this.array[x][y] == this.myself) { // position with same color as the player
                    System.out.println("myself in legalMove: " + x + "+" + y);
                    int neighbour[][] = new int[3][3];
                    neighbour[1][1] = array[x][y];         // myself
                    neighbour[1][0] = array[x][y - 1];     // up
                    neighbour[1][2] = array[x][y + 1];     // down
                    neighbour[0][1] = array[x - 1][y];     // left
                    neighbour[2][1] = array[x + 1][y];     // right
                    neighbour[0][0] = array[x - 1][y - 1]; // left upper
                    neighbour[2][0] = array[x + 1][y - 1]; // right upper
                    neighbour[0][2] = array[x - 1][y + 1]; // left down
                    neighbour[2][2] = array[x + 1][y + 1]; // right down
                    for (int j = 0; j < 3; j++) {
                        for (int i = 0; i < 3; i++) {
                            if (neighbour[i][j] != 20 && neighbour[i][j] != myself) { // 3x3
                                int NeighborX = x + (i - 1);
                                int NeighborY = y + (j - 1);
                                System.out.println("3x3 X:" + NeighborX + ",Y: " + NeighborY);
                                if (array[NeighborX][NeighborY] == rival) {
                                    while (array[NeighborX][NeighborY] == rival) {
                                        // neighbor extending -> edge
                                        NeighborX = NeighborX + (i - 1);
                                        NeighborY = NeighborY + (j - 1);
                                        System.out.println("neighborX" + NeighborX + "neighborY" + NeighborY
                                                + "array[X][Y]" + array[NeighborX][NeighborY]);
                                        if (array[NeighborX][NeighborY] == myself || NeighborX >= 7
                                                || NeighborY >= 7 || NeighborX <= 0 || NeighborY <= 0) {
                                            break;
                                        }
                                    }
                                    if (array[NeighborX][NeighborY] == 0) {
                                        // whether the extended neighbor is the same as myself (m, n)
                                        if (getWeight(NeighborX, NeighborY) >= Max) {
                                            Max = getWeight(NeighborX, NeighborY);
                                            bestMove = new Move(NeighborX, NeighborY);
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
        array[bestMove.x][bestMove.y] = this.myself;
        return bestMove;
    }

    public void change(int x, int y, int rival) {
        int neighborX = 10;
        int neighborY = 10;
        int extendedNeighborX = 10;
        int extendedNeighborY = 10;
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++) {
                neighborX = x + (i - 1);
                neighborY = y + (j - 1);
                if (neighborX >= 0 && neighborY >= 0 && neighborX < 8 && neighborY < 8) {
                    if (array[neighborX][neighborY] == myself) {
                        while (array[neighborX][neighborY] == myself) {
                            // neighbor extending -> edge
                            neighborX = neighborX + (i - 1);
                            neighborY = neighborY + (j - 1);
                            if (neighborX > 7 || neighborY > 7 || neighborX < 0 || neighborY < 0) {
                                break;
                            }
                        }
                        extendedNeighborX = neighborX;
                        extendedNeighborY = neighborY;
                        while (extendedNeighborX < 8 && extendedNeighborY < 8
                                && extendedNeighborX >= 0 && extendedNeighborY >= 0
                                && array[extendedNeighborX][extendedNeighborY] == array[x][y]
                                && !(extendedNeighborX == x && extendedNeighborY == y)) {
                            // if a player enters legally
                            extendedNeighborX = extendedNeighborX - (i - 1);
                            extendedNeighborY = extendedNeighborY - (j - 1);
                            this.array[extendedNeighborX][extendedNeighborY] = this.array[x][y];
                        }
                    }
                }
            }
        }
    }
}