Introduction.

            This project is a Four In A Row agent based on reinforcement learning, a method focused on goal-directed learning from interaction. We inspect our environment: how it reacts to our actions, and which actions are more or less likely to help us achieve our goal. We then choose the most attractive action (ε-greedy strategy) to reach the goal, which is to win the game. To do that we learn how our opponent plays by observing the actions it takes and computing, from this information, a value function that helps us play more successfully.

 

Learning algorithm and game representation.

            As the learning algorithm we chose a dynamic programming algorithm, Value Iteration. It felt natural to pick a method we were familiar with from previous courses and to compute a value for every state. The algorithm is model-based: it searches for an optimal policy by repeatedly recomputing state values using a transition function that is estimated by observing the opponent's turns. Value iteration uses the following update: V(s) = max_a [ R(s,a) + g * Σ_{s'} d(s,a,s') * V(s') ], where V(s) is the value of state s, R(s,a) is the reward for doing action a in state s, d(s,a,s') is the transition function and g is a constant parameter that defines how strongly a state's value depends on the values of the states that follow it.
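
To make the update concrete, the following is a minimal Java sketch of a value iteration sweep. It is a simplified, hypothetical variant and not the project's valueIteration method: it assumes a per-state reward array, an already-normalized state-to-state transition matrix, a fixed number of sweeps, and no explicit maximization over actions; the names prob, reward and NUM_SWEEPS are ours.

    // Simplified, hypothetical value iteration sweep (not the project's code).
    // prob[s][s2] is assumed to be an already-normalized transition probability,
    // reward[s] a per-state reward, and g the discount parameter (0.1 in this report).
    public static float[] valueIterationSketch(float[][] prob, float[] reward) {
        final float g = 0.1f;
        final int numStates = prob.length;
        final int NUM_SWEEPS = 100;                     // assumed fixed number of sweeps
        float[] v = new float[numStates];
        for (int sweep = 0; sweep < NUM_SWEEPS; sweep++) {
            float[] next = new float[numStates];
            for (int s = 0; s < numStates; s++) {
                float expected = 0.0f;
                for (int s2 = 0; s2 < numStates; s2++) {
                    expected += prob[s][s2] * v[s2];    // Σ_{s'} d(s,s') * V(s')
                }
                next[s] = reward[s] + g * expected;     // bootstrapped value update
            }
            v = next;
        }
        return v;
    }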

            The reward function is calculated in the following way: we try every possible turn and check which one is most successful by counting the points we get from this turn, multiplied by two (to encourage improving our own position rather than just worsening the opponent's), and subtracting the points the opponent gets. Points are obtained by counting the number of adjacent circles of the same color (in a row, a column, or a diagonal) and keeping only the maximal run; for example, if we counted two in a row and three in a column, we return three, because it is the maximum. There is a special case where the opponent gets three points: in that case we multiply the opponent's points by three, encouraging our agent to prevent the opponent from winning.
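
The combination of these scores can be sketched as follows; myPoints and oppPoints stand for the longest same-color runs measured after trying a candidate turn, and the function illustrates the rule described above rather than the actual source.

    // Hypothetical sketch of the reward combination described above (not project code).
    // myPoints and oppPoints are the longest same-color runs (row, column or diagonal)
    // measured after the candidate turn has been tried on a copy of the field.
    static int turnReward(int myPoints, int oppPoints) {
        if (oppPoints >= 3) {
            oppPoints *= 3;                 // an opponent three in a row must be blocked
        }
        return 2 * myPoints - oppPoints;    // our own progress is weighted twice as much
    }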

                The transition function is represented by a square matrix whose dimension equals the number of states. It holds the number of observed transitions from each state to each other state in the current game. Later, during bootstrapping, we use the counts in this matrix to compute the probability of a transition from state s to state s' and use it in the calculations.
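
A minimal sketch of how such a count matrix could be maintained and normalized, assuming the layout described above; the method names are illustrative and not part of the project.

    // Hypothetical helpers for the transition-count matrix (illustrative only).
    static void recordTransition(float[][] delta, int fromState, int toState) {
        delta[fromState][toState] += 1.0f;            // one more observed transition
    }

    static float transitionProbability(float[][] delta, int fromState, int toState) {
        float total = 0.0f;
        for (float count : delta[fromState]) {
            total += count;                           // all transitions leaving fromState
        }
        return total == 0.0f ? 0.0f : delta[fromState][toState] / total;
    }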

                The constant g was chosen to be 0.1, which suited our purposes: when it was smaller, the influence of other states was too weak and the differences between values were too small; we did not make it larger because then, from our point of view, that influence becomes too strong.

                We have 6561 states; they represent all positions, grouped into classes with one representative per class. We number the states in base 3, analogously to a binary representation: we take into account only the topmost cell of each column, where each cell contributes a digit 0, 1 or 2 (0 for an empty column, 1 for our player, 2 for the opponent). Every game position is mapped onto one of these states. For example, if we play black and the first four columns have topmost cells black, red, black, red (all columns to the right being empty), the state number is 1*3^0 + 2*3^1 + 1*3^2 + 2*3^3 = 70. Of course not every state actually occurs, because the encoding also covers impossible positions such as a row of 5 adjacent circles of the same color. We used this method rather than entering distinct states manually, because there are too many separate states and we could easily lose some from consideration, so the chosen method is more uniform.
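
A short sketch of this base-3 numbering, simplified to take only the array of topmost cells rather than the full field that the project's stateNumber method receives.

    // Hypothetical, simplified base-3 state numbering over the topmost cells only
    // (0 = empty column, 1 = us, 2 = opponent).
    static int stateNumberFromTops(byte[] topmost) {    // topmost[c] for each of the 8 columns
        int state = 0;
        int power = 1;                                  // 3^0, 3^1, 3^2, ...
        for (byte cell : topmost) {
            state += cell * power;
            power *= 3;
        }
        return state;                                   // 0 .. 3^8 - 1 = 6560
    }

    // Example: tops {1, 2, 1, 2, 0, 0, 0, 0} give 1 + 6 + 9 + 54 = 70, as above.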

                The agent plays ε-greedy: 10% of the turns are random. We need this to explore successful turns that we would not normally choose if we played purely greedily. We use this strategy only for training; in the actual game we play ε-greedy with a constraint: a random turn is chosen only if its value differs from the greedy turn's value by less than 2.0 (a chosen threshold). We need this to avoid really bad turns.
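
A sketch of this constrained choice, assuming every legal column already has a computed value; the helper name and the value array are hypothetical.

    import java.util.List;
    import java.util.Random;

    // Hypothetical constrained ε-greedy column choice (illustrative only).
    static int chooseColumn(List<Integer> options, double[] columnValue, Random rnd) {
        int greedy = options.get(0);
        for (int col : options) {
            if (columnValue[col] > columnValue[greedy]) greedy = col;   // best-valued column
        }
        if (rnd.nextDouble() < 0.1) {                                   // explore with ε = 0.1
            int candidate = options.get(rnd.nextInt(options.size()));
            if (columnValue[greedy] - columnValue[candidate] < 2.0) {   // not much worse than greedy
                return candidate;
            }
        }
        return greedy;                                                  // otherwise play greedily
    }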

                Something else needs to be considered: we found positions in which the opponent can easily win simply by knowing them. Three such positions were distinguished and handled explicitly. The first is the case of three adjacent circles in the same row; we anticipate it already when two circles are in the same row with one free position on one side and two free positions on the other, and add a circle on the side with the single free position, so that our agent can handle the situation from a point where it still can. The second is a position (we play black) where our agent could not see that it needed to put a black circle in the gap between two red circles, so we handled that manually (similar cases were also considered, and the situation was treated more generally). The third position is diagonal: when the opponent has three circles with one hole between them, we simply fill that hole.
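
As an illustration, here is a hypothetical sketch of one such hard-coded check: scanning a single row for three opponent circles and one gap inside a window of four cells. It is not the project's implementation and it ignores the gravity constraint (the cell below the gap must already be filled before the gap can be played).

    // Hypothetical check for an opponent "three with a gap" in one row (sketch only).
    static int findRowGap(byte[][] field, int row, byte opponent) {
        int cols = field[row].length;
        for (int start = 0; start + 3 < cols; start++) {      // every window of four cells
            int gap = -1, count = 0;
            for (int c = start; c < start + 4; c++) {
                if (field[row][c] == opponent) {
                    count++;
                } else if (field[row][c] == 0 && gap == -1) {
                    gap = c;                                   // remember the single empty cell
                } else {
                    count = 0; gap = -1; break;                // blocked, or a second empty cell
                }
            }
            if (count == 3 && gap != -1) {
                return gap;                                    // column that must be played
            }
        }
        return -1;                                             // no immediate threat in this row
    }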

 

Unsuccessful methods.

                We tried to run the Value Iteration algorithm with a lower g value (g = 0.05). The differences in the computed values became less noticeable, so we decided to raise g to 0.1.

 

Training.

                We ran approximately 1000 games with a 10% chance of a random turn, where both sides were our implemented agent. We think this is enough to learn from gameplay covering the vast majority of situations. We filled two arrays, stored in separate files (policy_black.dat for black and policy_red.dat for red), with values from 0.0 to 1.0, where a higher value means the state participated in more successful games than a state with a lower value. The arrays are updated according to the following rule: if a state participated in a successful game (one the player won), then new_value = old_value + (1.0 - old_value)/2.0, so the closer a value is to 1.0, the smaller the increment. 1000 runs allowed some values to reach 1.0, which we decided was enough. During training the first turn is random, which lets us choose among more of the successful turns in a real game.
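
A sketch of this update, reusing the parameter names documented for writePolicyFile below; the body itself is our assumption, not the project's code.

    // Hypothetical policy update after a won training game (sketch only).
    static void updateAfterWin(boolean[] turnsArr, double[] succIndicator) {
        for (int state = 0; state < succIndicator.length; state++) {
            if (turnsArr[state]) {                    // this state occurred in the winning game
                succIndicator[state] += (1.0 - succIndicator[state]) / 2.0;   // move halfway toward 1.0
            }
        }
    }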

 

Related background.

·         Course lectures.

·         Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto.

API

Object CircleDraw:

                It extends the Frame class and overrides its paint method for drawing.

                It also contains a main method that is not used in the project.

Object ManageField:

                Handles operations on the playing field; a usage sketch follows the method list below.

                Enum:

                private enum Dir – enumeration that contains the possible directions used for checks.

                Methods:

                public static int retFromTurn(byte[][] field, int colNumber) – returns the mark of the planned turn.

                                Arguments:

                                byte[][] field – array that represents the playing field

                                int colNumber – column number from 1 to 8

                public static void addTurn(byte field[][], int colNumber, boolean isMe) – adds a turn to the playing field.

                                Arguments:

                                byte[][] field – playing field

                                int colNumber – column number

                                boolean isMe – true if my turn, false if opponent’s

                public static int stateNumber(byte[][] field) – computes the state number of a position (different positions may share a number, to save space).

                                Arguments:

                                byte[][] field – playing field

                public static float[] valueIteration(float[][] delta) – performs the value iteration algorithm and returns the array of final values.

                                Arguments:

                                float[][] delta – transition function matrix

                public static int findFour(int lookFor, byte[][] field) – looks for a possible occurrence of four adjacent circles of the same color. Returns four if it finds one.

                                Arguments:

                                int lookFor – number from zero to two: 0 for empty, 1 for me, 2 for the opponent

                                byte[][] field – playing field

                public static int maxReturn(byte[][] field, int lookFor) – estimates the return from the playing position.

                                Arguments:

                                byte[][] field – playing field

                                int lookFor – number from zero to two: 0 for empty, 1 for me, 2 for the opponent

 

                public static int directionReturn(byte[][] field, int accPoints, Dir direction, int lookFor, int row, int col) – recursive function for estimating the return from one position in one direction.

                                Arguments:

                                byte[][] field – playing field

                                int accPoints – accumulating points

                                Dir direction – direction of the search

                                int lookFor – number from zero to two: 0 for empty, 1 for me, 2 for the opponent

                                int row – row number

                                int col – column number
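
A hypothetical usage example of the ManageField methods documented above; it requires the project's ManageField class, and the board dimensions are an assumption.

    // Hypothetical usage of the documented ManageField methods (illustrative only).
    public class ManageFieldExample {
        public static void main(String[] args) {
            byte[][] field = new byte[8][8];                // assumed board dimensions
            ManageField.addTurn(field, 4, true);            // we drop a circle in column 4
            ManageField.addTurn(field, 5, false);           // the opponent answers in column 5
            int state = ManageField.stateNumber(field);     // base-3 state number of the position
            int mark  = ManageField.retFromTurn(field, 4);  // mark of playing column 4 next
            int four  = ManageField.findFour(1, field);     // 4 if we already have four in a row
            System.out.println(state + " " + mark + " " + four);
        }
    }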

Object ImpFourInARow:

                Implements playing agent.

                public void fourStart(String role) – runs at the beginning of the game.

                                Arguments:

                                String role – the role, red/black.

                public void fourEnd(int score) – runs at the end of the game.

                                Arguments:

                                int score – total score.

                public int fourMove(List<Integer> moved, List<Integer> options, long arg2) – calculates the next turn.

                                Arguments:

                                List<Integer> moved – last turn of the opponent.

                                List<Integer> options – list of available turns.

                                long arg2 – time limit.

                public String getName() – returns player’s name, impPlayer in our case.

                private void readPolicyFile(double succIndicator[], boolean is_black) – reads the file with values.

                                Arguments:

                                double succIndicator[] – array into which the values are read.

                                boolean is_black – false if we play red, true otherwise.

                private void writePolicyFile(boolean turnsArr[], double succIndicator[], boolean is_black) – used for training. Writes the values from the succIndicator[] array to the file.

                                Arguments:

                                boolean turnsArr[] – array whose true entries mark the turns that were made.

                                double succIndicator[] – array whose values are written to the file.

                                boolean is_black – false if we play red, true otherwise.

                public static void main(String[] args) – empty function.

 

*To run training, uncomment the blocks marked in the comments as being for training, and comment out the block that the comments mark as the one that needs to be commented. To initialize the *.dat files, uncomment all the code in Test.java and, when finished, comment again what was previously commented.