Introduction.
The project is a Four In A Row agent based on the method named "reinforcement learning", which focuses on goal-directed learning from interaction. We inspect our environment: how it reacts to our actions, and which actions are more or less likely to help us achieve our goal. We then choose the most attractive action (e-greedy strategy) to reach the goal, i.e. to win the game. To do that we use a learning method: we learn how our opponent plays by inspecting the actions the opponent makes and, based on this information, compute a value function that helps us play more successfully.
Learning algorithm and game representation.
As the learning algorithm we chose the dynamic programming algorithm called Value Iteration. It felt natural to choose a method we were familiar with from previous courses, and natural to compute a value for every state. The algorithm is model-based: it searches for the optimal policy by iterating over state values and computing them with a transition function, which is defined by inspecting the opponent's turns. Value iteration uses an update of the following general form to compute the values:

V(s) <- R(s) + g * sum over s' of P(s' | s) * V(s')

where P(s' | s) comes from the transition function described below and g is the discount constant. The reward function is calculated in the following way: we try every possible turn and check which turn is the most successful by counting how many points we get from that turn, multiplied by two (to encourage improving our own position rather than worsening the opponent's), and subtracting from this the points that our opponent gets. Points are obtained by counting the number of adjacent circles of the same color (in a row, a column, or on a diagonal) and returning only the maximal count of adjacent circles; for example, if we counted two in a row and three in a column, we return three, because it is the maximum. There is a special case where the opponent reaches three points: in this case we multiply the opponent's points by three, encouraging our agent to prevent the opponent from winning.
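As a hedged illustration, here is a minimal sketch of this reward, reusing the ManageField methods documented in the API section. The class and helper names are hypothetical, and the exact weighting in the real code may differ.

class RewardSketch {

    // Copy the board so a turn can be simulated without changing the real field.
    static byte[][] copyField(byte[][] field) {
        byte[][] copy = new byte[field.length][];
        for (int r = 0; r < field.length; r++) {
            copy[r] = field[r].clone();
        }
        return copy;
    }

    // Reward of placing our circle in colNumber:
    // 2 * (our longest adjacency) - (opponent's longest adjacency),
    // with the opponent's points tripled once they reach three.
    static int rewardForTurn(byte[][] field, int colNumber) {
        byte[][] copy = copyField(field);
        ManageField.addTurn(copy, colNumber, true);        // simulate our turn
        int myPoints = ManageField.maxReturn(copy, 1);     // 1 = our circles
        int oppPoints = ManageField.maxReturn(copy, 2);    // 2 = opponent's circles
        if (oppPoints >= 3) {
            oppPoints *= 3;                                // urgent: discourage letting a near-win stand
        }
        return 2 * myPoints - oppPoints;
    }
}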
The transfer (transition) function is represented by a square matrix whose side equals the number of states. It holds the number of observed transitions from one state to another in the current game. Later, when we do bootstrapping, we use the values from this matrix to compute the probability of the transition from state s to state s' and use it in the calculations.
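As a sketch, assuming delta is the count matrix passed to ManageField.valueIteration, the probability of the transition from state s to state s' is obtained by normalizing the row of counts:

class TransitionSketch {
    // delta[s][sNext] is assumed to hold the number of observed transitions
    // from state s to state sNext. Returns the estimated P(s' | s).
    static float transitionProbability(float[][] delta, int s, int sNext) {
        float total = 0f;
        for (float count : delta[s]) {
            total += count;
        }
        if (total == 0f) {
            return 0f;        // state s was never visited: no estimate available
        }
        return delta[s][sNext] / total;
    }
}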
The discount constant g is chosen to be 0.1, which suited our purposes: when it was smaller, the influence of later turns was too small and the differences between values were too small, and we did not make it larger because, from our point of view, the influence of previous states would then be too large.
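A minimal value-iteration sketch under the model described above (state-to-state transition counts in delta, a per-state reward, and the discount g). The project's real implementation is ManageField.valueIteration, whose details may differ:

class ValueIterationSketch {
    // delta[s][sNext] holds transition counts, reward[s] is the per-state reward,
    // g is the discount constant (0.1 in our setup), sweeps is the number of passes.
    static float[] run(float[][] delta, float[] reward, float g, int sweeps) {
        int n = delta.length;
        float[] values = new float[n];
        for (int k = 0; k < sweeps; k++) {
            float[] next = new float[n];
            for (int s = 0; s < n; s++) {
                float total = 0f;
                for (float count : delta[s]) {
                    total += count;
                }
                float expected = 0f;
                if (total > 0f) {
                    for (int sNext = 0; sNext < n; sNext++) {
                        expected += (delta[s][sNext] / total) * values[sNext];
                    }
                }
                next[s] = reward[s] + g * expected;   // V(s) = R(s) + g * E[V(s')]
            }
            values = next;
        }
        return values;
    }
}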
We have 6561 states; they represent all game positions, joined into groups with one representative per group. We count the states in base 3, analogously to a binary representation: we take into account only the topmost cell of each column (or 0 if the column is empty). Each such cell holds 0, 1, or 2, where 0 is an empty column, 1 is our player, and 2 is the opponent's player. Every game position is then related to one of the states we defined, for example as in the encoding sketch below.
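A minimal sketch of such a base-3 encoding; the real mapping is ManageField.stateNumber and may differ in details. An 8-column board and the convention field[row][col] with row 0 at the top are assumed here:

class StateEncodingSketch {
    // Encode a position as a base-3 number from the topmost circle of each
    // of the 8 columns (0 = empty column, 1 = ours, 2 = opponent's).
    // This yields at most 3^8 = 6561 distinct state numbers.
    static int stateNumber(byte[][] field) {
        int state = 0;
        int columns = 8;
        for (int col = 0; col < columns; col++) {
            byte top = 0;                      // stays 0 if the column is empty
            for (int row = 0; row < field.length; row++) {
                if (field[row][col] != 0) {    // first filled cell from the top
                    top = field[row][col];
                    break;
                }
            }
            state = state * 3 + top;           // append one base-3 digit
        }
        return state;
    }
}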
The agent plays e-greedy: 10% of turns are random. We need this to explore successful turns that we would not normally choose if we played purely greedily. We use this strategy only for training; in the actual game we play e-greedy with a constraint: a random turn is chosen only if its value differs from the value of the greedy turn by less than 2.0 (a chosen threshold). We need this to avoid really bad turns.
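A sketch of the constrained e-greedy choice used in the actual game (during training the random turn is accepted unconditionally). The per-option values and the helper names are assumptions; the 10% exploration rate and the 2.0 threshold follow the text:

import java.util.List;
import java.util.Random;

class MoveChoiceSketch {
    // options are the legal columns, values[i] is the estimated value of
    // options.get(i), and the exploration rate is 0.1.
    static int chooseMove(List<Integer> options, double[] values, Random rng) {
        int greedy = 0;                        // index of the best-valued option
        for (int i = 1; i < options.size(); i++) {
            if (values[i] > values[greedy]) {
                greedy = i;
            }
        }
        // With probability 0.1 consider a random option, but accept it only
        // if it is less than 2.0 worse than the greedy one.
        if (rng.nextDouble() < 0.1) {
            int candidate = rng.nextInt(options.size());
            if (values[greedy] - values[candidate] < 2.0) {
                return options.get(candidate);
            }
        }
        return options.get(greedy);
    }
}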
One thing needs to be considered: we found that there are positions from which the opponent can easily win simply by knowing them. Two such positions were distinguished and handled. The first is the case that leads to three adjacent circles in the same row: we anticipate it by acting already when two circles are in the same row with one free position on one side and two free positions on the other. We add a circle on the side with one free position, which lets our agent handle the situation from a point where it still can. The second distinguished position is likewise detected from a specific board pattern and handled early.
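A hedged sketch of the first pattern check. It assumes the circles in question are the opponent's (value 2), uses 0-based column indices, and ignores whether the free cell is currently playable, so it is only an approximation of the rule above:

class ThreatPatternSketch {
    // Scan each row for two adjacent opponent circles with one free cell on
    // one side and two free cells on the other; return the column of the
    // single free cell so the agent can occupy it first (-1 if not found).
    static int patternBlockColumn(byte[][] field) {
        int rows = field.length;
        int cols = field[0].length;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c + 4 < cols; c++) {
                byte a = field[r][c];
                byte b = field[r][c + 1];
                byte d = field[r][c + 2];
                byte e = field[r][c + 3];
                byte f = field[r][c + 4];
                if (a == 0 && b == 2 && d == 2 && e == 0 && f == 0) {
                    return c;        // free, opp, opp, free, free -> block the left side
                }
                if (a == 0 && b == 0 && d == 2 && e == 2 && f == 0) {
                    return c + 4;    // free, free, opp, opp, free -> block the right side
                }
            }
        }
        return -1;
    }
}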
Unsuccessful methods.
We tried to run the Value Iteration algorithm with a lower g value (g = 0.05). The differences between the computed values became less noticeable, so we decided to raise g to 0.1.
Training.
We ran about 1000 games with a 10% chance of a random turn, where both sides were our implemented agent. We think this is enough to learn gameplay from the vast majority of situations. We filled two arrays, stored in separate files, policy_black.dat and policy_red.dat (one for black and one for red), with values from 0.0 to 1.0, where a higher value means the state participated in more successful games than a state with a lower value. We filled the arrays according to the following rule: if a turn participated in a successful game (one where the player wins), then new_value = old_value + (1.0 - old_value)/2.0, so the closer a value gets to 1.0, the smaller the increment. 1000 runs allowed some values to reach 1.0, which we decided was enough. We made the first turn random so that, in a real game, we can choose among more successful opening turns.
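A sketch of this update, using the turnsArr and succIndicator arrays mentioned in the API section; applying it only after wins and per visited state is our reading of the text:

class PolicyUpdateSketch {
    // After a won game, move each visited state's success value halfway
    // toward 1.0, so repeated wins converge to 1.0 with ever-smaller
    // increments (0 -> 0.5 -> 0.75 -> 0.875 -> ...).
    static void updateAfterWin(boolean[] turnsArr, double[] succIndicator) {
        for (int state = 0; state < succIndicator.length; state++) {
            if (turnsArr[state]) {   // state was visited during the winning game
                succIndicator[state] += (1.0 - succIndicator[state]) / 2.0;
            }
        }
    }
}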
Related background.
· Course lectures.
· Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto.
API
Object CircleDraw: extends the Frame class and implements its paint method for painting. It also contains a main method that is not used in the project.

Object ManageField: handles the field actions.
Enum: private enum Dir – enumeration that contains the possible directions for the checks.
Methods:
public static int retFromTurn(byte[][] field, int colNumber) – returns the mark of the planned turn. Arguments: byte[][] field – array that represents the playing field; int colNumber – column number from 1 to 8.
public static void addTurn(byte field[][], int colNumber, boolean isMe) – adds a turn to the playing field. Arguments: byte[][] field – playing field; int colNumber – column number; boolean isMe – true if it is my turn, false if the opponent's.
public static int stateNumber(byte[][] field) – evaluates the state and gives it a number (different positions may share a number, to save space). Arguments: byte[][] field – playing field.
public static float[] valueIteration(float[][] delta) – performs the value iteration algorithm and returns the array of final values. Arguments: float[][] delta – matrix of the transition function.
public static int findFour(int lookFor, byte[][] field) – looks for an occurrence of four adjacent circles of the same color; returns four if it finds one. Arguments: int lookFor – 0 = empty, 1 = me, 2 = opponent; byte[][] field – playing field.
public static int maxReturn(byte[][] field, int lookFor) – estimates the return from a playing position. Arguments: byte[][] field – playing field; int lookFor – 0 = empty, 1 = me, 2 = opponent.
public static int directionReturn(byte[][] field, int accPoints, Dir direction, int lookFor, int row, int col) – recursive function for estimating the return from one position in one direction. Arguments: byte[][] field – playing field; int accPoints – accumulated points; Dir direction – direction of the search; int lookFor – 0 = empty, 1 = me, 2 = opponent; int row – row number; int col – column number.

Object ImpFourInARow: implements the playing agent.
public void fourStart(String role) – runs at the beginning of the game. Arguments: String role – the role, red/black.
public void fourEnd(int score) – runs at the end of the game. Arguments: int score – total score.
public int fourMove(List<Integer> moved, List<Integer> options, long arg2) – calculates the next turn. Arguments: List<Integer> moved – last turn of the opponent; List<Integer> options – list of available turns; long arg2 – time limit.
public String getName() – returns the player's name, impPlayer in our case.
private void readPolicyFile(double succIndicator[], boolean is_black) – reads the file with values. Arguments: double succIndicator[] – the values are put into this array; boolean is_black – false if we play red, true otherwise.
private void writePolicyFile(boolean turnsArr[], double succIndicator[], boolean is_black) – used for training; writes the values from the succIndicator[] array to the file. Arguments: boolean turnsArr[] – array with true entries for the turns that were made; double succIndicator[] – array of values to write; boolean is_black – false if we play red, true otherwise.
public static void main(String[] args) – empty function.

*To run training, uncomment the blocks commented as "for training" and comment out the block that is marked in the comments as the one that needs to be commented. To initialize the *.dat files, uncomment all code in Test.java and, when finished, comment again what was commented.
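As a hedged illustration of how these pieces could fit together inside fourMove (this combination is an assumption, not the project's exact logic): simulate each legal column, map the resulting position to its state number, and score it with the learned success values.

import java.util.List;

class AgentUsageSketch {
    // Score each legal column by the learned success value of the state it
    // leads to, and return the best column. succIndicator is the array
    // filled by readPolicyFile.
    static int bestColumn(byte[][] field, List<Integer> options, double[] succIndicator) {
        int bestCol = options.get(0);
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int col : options) {
            byte[][] copy = new byte[field.length][];
            for (int r = 0; r < field.length; r++) {
                copy[r] = field[r].clone();
            }
            ManageField.addTurn(copy, col, true);          // simulate our turn
            int state = ManageField.stateNumber(copy);     // map position to a state
            double score = succIndicator[state];           // learned success value
            if (score > bestScore) {
                bestScore = score;
                bestCol = col;
            }
        }
        return bestCol;
    }
}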