Reinforcement Learning Workshop – 2012/3

 

Agent Nemesh

Project Report submitted by Shay Shalita, Matan Saranga and Netanel Haim

 

April 2013

 

The Blavatnik School of Computer Science

Raymond and Beverly Sackler Faculty of Exact Sciences

Tel-Aviv University

 

1. Project Preface

1.1 Game Introduction [1]

Four in a Row (also known as Captain’s Mistress, Connect Four, Four Up, Plot Four, Find Four, Fourplay, and Four in a Line) is a two-player game in which the players first choose a color and then take turns dropping colored discs from the top into a seven-column, six-row vertically suspended grid. The pieces fall straight down, occupying the next available space within the column. The object of the game is to connect four of one’s own discs next to each other vertically, horizontally, or diagonally before the opponent does so.

 

1.2 Overview of the project

This project was designed and implemented as part of a Reinforcement Learning workshop (0368.3500) at the Blavatnik School of Computer Science, the Raymond and Beverly Sackler Faculty of Exact Sciences, Tel-Aviv University. The workshop was taught and instructed by Prof. Yishay Mansour and Mr. Mariano Schain.

The main goal of this workshop is to design, implement and train a software agent that utilizes reinforcement learning concepts in order to play the Four in a Row board game. For reasons to be explained later on, we chose the SARSA learning pattern (on-policy model-free) as our main learning scheme.

As opposed to the traditional game specification, in which the game board can be viewed as a 6x7 matrix of cells, the agents in our workshop had to deal with a slightly different board representation of 6x8. Such a representation offers a symmetric view of the game board.

This paper is divided as follows: we start by introducing the learning method and all of its relevant algorithms (Section 2) and move on to the specific details of the game representation (Section 3). We then turn to describe the training process and bootstrapping methods (Section 4) followed by a discussion on important ideas and issues that arose throughout our work (Section 5). Afterwards, we present and discuss some of the main unsuccessful attempts and difficulties that were encountered and dealt with (Section 6). We proceed with a general overview of the architecture of the training program and our agent, Nemesh (Section 7), and then present some of our ideas regarding the project’s infrastructure and debugging schemes (Section 8). Appendix A contains some graphs that show the progress for each weight over training sessions.

 

2. Learning Method

As mentioned above, we implemented and trained our agent Nemesh according to the SARSA learning algorithm. In addition, we decided to implement and use the MiniMax algorithm, for the reasons explained below.

 

2.1 SARSA – an On-Policy Model-Free Learning Algorithm

2.1.1 The Algorithm

SARSA derives its name as an abbreviation of the State-Action-Reward-State-Action sequence, and indeed the main idea of the SARSA algorithm is to learn the action-value function (Q-function), which holds the expected return of starting from state s, performing action a, and then proceeding according to the current policy. As a member of the Temporal-Difference (TD) family of algorithms, learning in SARSA is done online and is model-free: the agent interacts with its environment and updates the Q-function according to the actions it actually takes. Since SARSA is on-policy, the action used in the update is the one the agent chooses next (for example via an ε-greedy policy), which gives us control over the agent’s behavior during learning. The heart of the algorithm lies in the following formula:

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α[r_t + γQ_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)]

Where:

  • Q_t(s_t, a_t) = the expected return of starting from state s_t and performing action a_t.
  • γ = the discount rate (discussed in detail in the following sections).
  • α = the learning rate (discussed in detail in the following sections).
  • r_t = the immediate reward received after performing a_t while in s_t.
  • α[r_t + γQ_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)] = a step towards reducing the error, which over time allows us to converge to the Q-values of the optimal policy.

 

2.1.2 Linear Approximation

Since the game at hand forces us to examine a very large state space, we turn to a linear approximation of the Q-function. In order to reduce the number of states, each board state is mapped into a vector of k pre-defined features (k is far smaller than the number of possible states), all of which are informative with respect to the Four in a Row game (e.g. the number of open threes in a row, the number of blocked threes in a row, etc.; see Section 3 – Game Representation). The main idea of the approximation is that the value of the Q-function can be treated as a linear combination of a weight vector ⟨θ_1, θ_2, …, θ_k⟩ with the feature vector of the current state. By doing so, we reduce the problem of finding the Q-function values to finding the values of the weight vector.

The values of the weight vector are updated by applying gradient-descent methods (which reduce the error by stepping in the direction opposite to the gradient) within the TD framework, as indicated by the following formula (a code sketch of this update appears at the end of this subsection):

θ_{t+1} = θ_t + α[r_t + γQ_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)]·vector(s_t)

Where:

  • θ_t = the weights to be updated.
  • α[r_t + γQ_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)] = as before, a step towards reducing the error.
  • vector(s_t) = the feature vector of state s_t, which is the gradient of the linear approximation with respect to the weights and therefore appears in the gradient-descent step.
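
To make the update concrete, the following minimal Java sketch computes the linear Q-value as a dot product and applies the gradient-descent SARSA step. It is an illustration only: the class name, method names and parameter list here are assumptions and do not necessarily match the actual signature of SarsaLearning.UpdateWeightVector.

    public final class SarsaUpdateSketch {

        /** Linear approximation of the Q-function: a dot product of weights and state features. */
        public static double qValue(double[] theta, double[] features) {
            double sum = 0.0;
            for (int i = 0; i < theta.length; i++) {
                sum += theta[i] * features[i];
            }
            return sum;
        }

        /**
         * In-place gradient-descent SARSA update:
         *   theta <- theta + alpha * [r + gamma * qNext - qCurrent] * features(s_t)
         * qNext should be passed as 0 when s_{t+1} is a terminal state.
         */
        public static void updateWeights(double[] theta, double[] features, double reward,
                                         double qCurrent, double qNext,
                                         double alpha, double gamma) {
            double tdError = reward + gamma * qNext - qCurrent;
            for (int i = 0; i < theta.length; i++) {
                theta[i] += alpha * tdError * features[i];
            }
        }
    }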

 

 

2.1.3 Motivation and Reasoning

We chose SARSA as our main learning algorithm for some intuitive reasons and for some more educated ones, most of which rely on Sutton and Barto’s work in the field of reinforcement learning [2]. The main reasons for choosing SARSA are:

  • Our intuition that a model-free algorithm best fits what was discussed in class, namely a game with a very large number of states. It seems like a waste of resources to first build a complete Markovian model of the game (learning only) and only then address the planning issue (finding an optimal policy), which is essentially what model-based algorithms do. It seems wiser to combine learning and planning concurrently, which is more or less what model-free learning algorithms do.
  • The desire to control our actions and resolve the exploration vs. exploitation issue for our agent. We regarded on-policy algorithms as the natural candidates for balancing these two competing forces, for example by using an ε-greedy policy.
  • Further support for an on-policy algorithm (SARSA) instead of an off-policy one (Q-Learning) is provided by Precup, Sutton and Dasgupta (2001) [3]. Their point of departure is that off-policy algorithms such as Q-Learning are proven to be unsound when combined with linear approximation methods, in the sense that there exist non-pathological Markov decision processes for which the algorithm diverges (the values diverge to infinity). While not ruling out off-policy algorithms right away, one of their suggested solutions is to turn to other RL methods, such as SARSA, which are more stable in this sense.
  • Sutton and Barto’s list of TD advantages over Monte-Carlo (MC) methods in their online book (chapter 6.2, Advantages of TD Prediction Methods) [2]. The most compelling argument is that TD algorithms are naturally implemented in an online fashion, whereas MC methods must wait for the end of an episode (only then is the return known). From our perspective, this advantage translates into a more stable and convenient implementation.

 

2.1.4 Related Implementation Mechanisms and Data Structures

In order to achieve the SARSA learning pattern, we implemented two mechanisms:

  • We created the SarsaLearning class, which handles all aspects of the learning algorithm. The class contains two static variables holding the learning rate (α) and the discount rate (γ), and a static method (UpdateWeightVector) which updates the weight vector according to the SARSA algorithm when provided with θ_t, r_t, Q_t(s_{t+1}, a_{t+1}), Q_t(s_t, a_t) and vector(s_t). This class is also responsible for running a training program for the agent, encapsulated within the TrainAgent static method, which simulates a number of training games for the agent.
  • The data to be passed as arguments to the UpdateWeightVector static method is collected within the GameSimulation class throughout the simulation. While the simulated game is not over (no one has reached four-in-a-row and the game is not a tie), we collect the relevant data for updating both players’ weight vectors (in case one of them is an agent).

 

2.2 The MiniMax Algorithm

2.2.1 The Algorithm

MiniMax is a decision algorithm which is quite common in game theory, especially in two-player games [6]. The main idea underlying the algorithm is that the game can be thought of as a decision tree for two players with opposite goals: a Max player who wants to maximize the score at a given level, and a Min player who wants to minimize it. The leaves of this tree contain a score indicating how worthwhile that specific path of decisions is.

The MiniMax algorithm simply traverses this decision tree (either the whole tree or down to a certain depth) and finds the best possible move for the current player, i.e. an action that leaves the player best off regardless of what its opponent does.

 

2.2.2 Motivation and Reasoning

Our foremost motivation for using the MiniMax algorithm is closely related to the bootstrapping process. We wanted to create a somewhat “smarter” trainer with a look-ahead of a few moves, instead of just a random trainer (although we did start training the agent against a random opponent, just so it could get a sense of the game; see Section 4 – The Training Process and Bootstrapping). In our view, this look-ahead not only speeds up the training process but also creates a training environment suitable for enhancing the performance of our agent.

In addition, we arrived at the conclusion that the MiniMax algorithm could be of use during the actual game itself: an agent with a decent look-ahead can perform much better than an agent with no look-ahead at all.

 

2.2.3 Related Implementation Mechanisms and Data Structures

Our implementation of the MiniMax algorithm follows an online YouTube lecture from UC Berkeley which explains the main principles of game trees and provides general pseudo-code for the algorithm (with and without alpha-beta pruning) [4]. We chose to implement the version with alpha-beta pruning in order to speed up the traversal of the tree.

The implementation relies mostly on the static recursive method chooseMove in the MiniMax class (in fact, this is the only method in the class). This method returns the Best data structure, which holds the best move for the current player and the score of that move.

The recursive algorithm that we implemented traverses the tree in DFS order. For each legal move in the list of possible moves, the algorithm plays that move, makes a recursive call with the proper arguments (the new board, the MIN/MAX player and a decreased depth), receives the reply and restores the board to its previous condition. Given the best reply, the algorithm checks whether that score is better than the best achieved so far, and if so updates the player’s best score and move.

When the recursion reaches a leaf (either the game is over or the required depth has been reached), the algorithm computes the score of that leaf as the linear combination of the leaf’s state vector with the player’s weight vector, and returns a Best data structure with the proper values. In this way, values bubble up the tree in DFS fashion. Whenever alpha >= beta at some inner node, an alpha/beta cutoff applies (an alpha-cutoff for a Min node and a beta-cutoff for a Max node).
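
The following compact Java sketch illustrates this traversal with alpha-beta pruning. The GameState interface and its method names (play, undo, evaluate) are stand-ins introduced for the illustration; only the chooseMove name and the Best pair mirror our actual classes.

    import java.util.List;

    public final class MiniMaxSketch {

        /** Minimal view of the board needed by the search; a stand-in for our GameBoard. */
        public interface GameState {
            boolean isOver();
            List<Integer> possibleMoves();
            void play(int column, boolean maxPlayer);    // drop a disc for the given side
            void undo(int column);                       // remove the top disc (restore the board)
            double evaluate();                           // weight vector x state vector at a leaf
        }

        /** Holds the best move found so far and its score. */
        public static final class Best {
            public final int move;
            public final double score;
            public Best(int move, double score) { this.move = move; this.score = score; }
        }

        /** Call with alpha = -infinity and beta = +infinity at the root. */
        public static Best chooseMove(GameState state, int depth, boolean maxPlayer,
                                      double alpha, double beta) {
            if (depth == 0 || state.isOver()) {
                return new Best(-1, state.evaluate());   // leaf: score the position
            }
            int bestMove = -1;
            double bestScore = maxPlayer ? Double.NEGATIVE_INFINITY : Double.POSITIVE_INFINITY;
            for (int move : state.possibleMoves()) {
                state.play(move, maxPlayer);             // play the move (DFS)
                double score = chooseMove(state, depth - 1, !maxPlayer, alpha, beta).score;
                state.undo(move);                        // restore the board
                if (maxPlayer ? score > bestScore : score < bestScore) {
                    bestScore = score;
                    bestMove = move;
                }
                if (maxPlayer) alpha = Math.max(alpha, bestScore);
                else           beta  = Math.min(beta, bestScore);
                if (alpha >= beta) break;                // alpha/beta cutoff
            }
            return new Best(bestMove, bestScore);
        }
    }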

 

3. Game Representation

As was mentioned before, the specification of the game board throughout the workshop was 6 rows by 8 columns, which can be represented as a 6x8 matrix. In order to maintain a game board with all its complexities, we created the GameBoard class, which holds all the properties related to the game board. We shall not go over all of GameBoard’s functionalities here (some of them are rather trivial), but we will cover its main fields and functions. Through these elements we gain a more coherent understanding of the board representation, the state features and their maintenance.

 

3.1 Board Representation

We chose to implement our board as a two-dimensional array of integers, where each pair of indices represents a cell of the board. Each cell can take one of the following values:

  • 0 = Empty Cell.
  • 1 = Red Player (player1).
  • 2 = Black Player (player2).

The 2D array is of size [NumOfColumns x NumOfRows]. This is a little counter-intuitive for a matrix representation, but since most of our searches traverse the board column by column, it seemed like a good choice.

In order to speed up performance, we maintain a few extra fields in addition to the 2D-array board:

  • Total Discs – an integer that holds the total number of discs currently present on the board for both players.
  • Discs per Column – an array of integers which holds the number of discs in each column. The number ranges from 0 to NumOfRows.
  • Highest Disc – an integer indicating the row of the highest disc on the board. This field is needed mainly to speed up the feature counts; for example, there is no need to search for row features in rows located above the highest disc, since surely we won’t find anything there.
  • Possible Moves – a list of all currently possible moves in the board.
  • Player1Four/Player2Four – holds the number of four-in-a-row’s that were counted for each player. This attribute is updated after each move and is needed to tell when the game has reached its end.
  • State Vector – holds the current state of the board, according to the features that are defined in the next subsection.

We also keep two more obvious fields: one holds the winner of the board and the other holds the current player of that board.
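
The following sketch summarizes this bookkeeping. The field names follow the report, but the types, sizes and exact layout are illustrative assumptions rather than the actual GameBoard code.

    import java.util.ArrayList;
    import java.util.List;

    public class GameBoardSketch {
        public static final int NUM_OF_COLUMNS = 8;
        public static final int NUM_OF_ROWS = 6;

        public static final int EMPTY = 0;
        public static final int RED_PLAYER = 1;      // player1
        public static final int BLACK_PLAYER = 2;    // player2

        // [column][row] layout: most searches traverse the board column by column.
        final int[][] board = new int[NUM_OF_COLUMNS][NUM_OF_ROWS];

        int totalDiscs;                                        // discs of both players
        final int[] discsPerColumn = new int[NUM_OF_COLUMNS];
        int highestDiscRow = -1;                               // lets feature counts skip empty rows
        final List<Integer> possibleMoves = new ArrayList<>(); // columns that still have room
        int player1Four, player2Four;                          // > 0 means someone has won
        final double[] stateVector = new double[36];           // 18 features per player
        int winner = EMPTY;
        int currentPlayer = RED_PLAYER;

        public GameBoardSketch() {
            for (int c = 0; c < NUM_OF_COLUMNS; c++) possibleMoves.add(c);
        }

        /** Recomputes the 36 feature counts (delegated to FeatureCount in the real code). */
        void updateStateVector() { /* see Section 3.2 */ }
    }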

 

3.2 State Representation

3.2.1 Defining the State Features

As explained in Section 2, we chose to represent a board state as a list of features in order to reduce the number of states. The features themselves, since they represent a state of the board, are defined as part of the board representation (in our case, in the GameBoard class). All the features were chosen carefully and are meaningful with respect to the game at hand.

Since the representation of these features is constant, we chose to implement them as static final integers, where each feature received a unique identifier.

We defined a total of 18 features per player, and built the state vector as a concatenation of these features for player1 and for player2, like so:

[p1, p2, p3…, p18, q1, q2, q3, …, q18] ; p=player1, q=player2

 

The reason we created such a long state vector (36 features) is to allow us to treat the counts of player1 and player2 as having different weights.

The 18 distinct features can be divided into row-related features, column-related features, diagonal-related features and special features, which are detailed below:

 

Row Features:

  • TwoSideThreeInARow – Includes all cases of three-in-a-row that are open on both sides and can be completed immediately to four-in-a-row.
  • OneSideThreeInARow – Includes all cases of three-in-a-row that are blocked on one side and can be completed immediately to four-in-a-row.
  • AllRowThrees – Counts all occurrences of three-in-a-row that can be completed to a four-in-a-row, possibly not right away (open threes count as two such cases).
  • ThreeInARowHole1 – Includes all cases of three-in-a-row which can be completed after one move, i.e. there is a hole of 1 that needs to be filled before we can put a disc that completes a four-in-a-row.
  • ThreeInARowHole2 – Same as the previous feature, only with a hole of 2.
  • TwoInARow – Includes all cases of two-in-a-row that can be completed immediately to four-in-a-row.

 

Column Features:

  • ThreeInACol – Includes all cases of three-in-a-column, with a possibility to complete a four-in-a-column (enough room).

 

Diagonal Features:

  • TwoSideThreeInADiagonal – Includes all cases of three-in-a-diagonal that are open on both sides and can be completed immediately to four-in-a-diagonal.
  • OneSideThreeInADiagonal – Includes all cases of three-in-a-diagonal that are blocked on one side and can be completed immediately to four-in-a-diagonal.
  • AllDiagonalThrees – Counts all occurrences of three-in-a-diagonal that can be completed to a four-in-a-diagonal, possibly not right away (open threes count as two such cases).
  • ThreeInADiagonalHole1 – Includes all cases of three-in-a-diagonal which can be completed after one move, i.e. there is a hole of 1 that needs to be filled before we can put a disc that completes a four-in-a-diagonal.
  • ThreeInADiagonalHole2 – Same as the previous feature, only with a hole of 2.
  • TwoInADiagonal – Includes all cases of two-in-a-diagonal that can be completed immediately to four-in-a-diagonal.

 

Special Features:

  • OneMoveToWin – Counts all columns in which placing the player’s disc will win the game. This feature ranges from 0 to NumOfColumns.
  • ProhibitedMoves – Counts all columns in which placing the player’s disc will allow the opponent to win the game. This feature ranges from 0 to NumOfColumns.
  • PotentialFour – Counts all cells of the board in which placing the player’s disc would win the game. The win does not have to be immediate.
  • WinColumns – Counts all columns in which either the player can place one disc and win, or the opponent can place one disc and the player still wins (the column holds two potential fours, one above the other).
  • OneMoveToWinHole – Counts all columns that can bring the player a win but need to be filled first (there is a hole).

 

For the OneMoveToWin and ProhibitedMoves features, we created dedicated list data structures which are yet another field of the game board (two for each player). The idea behind these data structures is to allow us to force the player to take certain actions: actions that will make it win the game, actions that it must take in order to keep the game going (not to lose), and prohibited actions that would cause it to lose if taken.
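
One possible layout of the 36-entry state vector is sketched below: each of the 18 per-player features receives a constant identifier, which is then used as an offset into the player’s half of the vector. The identifiers mirror the feature names above, but the concrete constant values and the indexOf helper are illustrative assumptions rather than our actual code.

    public final class FeatureIds {
        public static final int FEATURES_PER_PLAYER = 18;

        // Row features
        public static final int TWO_SIDE_THREE_IN_A_ROW = 0;
        public static final int ONE_SIDE_THREE_IN_A_ROW = 1;
        public static final int ALL_ROW_THREES = 2;
        public static final int THREE_IN_A_ROW_HOLE_1 = 3;
        public static final int THREE_IN_A_ROW_HOLE_2 = 4;
        public static final int TWO_IN_A_ROW = 5;
        // Column features
        public static final int THREE_IN_A_COL = 6;
        // Diagonal features
        public static final int TWO_SIDE_THREE_IN_A_DIAGONAL = 7;
        public static final int ONE_SIDE_THREE_IN_A_DIAGONAL = 8;
        public static final int ALL_DIAGONAL_THREES = 9;
        public static final int THREE_IN_A_DIAGONAL_HOLE_1 = 10;
        public static final int THREE_IN_A_DIAGONAL_HOLE_2 = 11;
        public static final int TWO_IN_A_DIAGONAL = 12;
        // Special features
        public static final int ONE_MOVE_TO_WIN = 13;
        public static final int PROHIBITED_MOVES = 14;
        public static final int POTENTIAL_FOUR = 15;
        public static final int WIN_COLUMNS = 16;
        public static final int ONE_MOVE_TO_WIN_HOLE = 17;

        /** Index in the 36-entry state vector: player1 occupies cells 0-17, player2 cells 18-35. */
        public static int indexOf(int featureId, int player) {
            return (player == 1) ? featureId : FEATURES_PER_PLAYER + featureId;
        }
    }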

 

3.2.2 Counting the State Features

In order to count all of the above features for a player, we created the FeatureCount class, which contains a few static methods that perform all the counts (a method for counting row features, a method for counting column features, etc.). We shall go over the general ideas underlying the counting methods for each feature; as for the specific details, the code is quite informative and well documented in this respect.

 

3.2.2.1 Counting Column Features

For each column we search for the longest sequence of top discs that belong to the player. When such a sequence is found, it is added to the relevant feature counter, provided there is enough room to complete a four-in-a-column. Otherwise, the column is not worth much.

 

3.2.2.2 Counting Row Features

We traverse the board in windows of four cells and search for each feature separately (three-in-a-row, four-in-a-row, etc.). As an optimization, we traverse only those rows at or below the row of the highest disc on the board (there is no need to check the empty rows above it). The most problematic cases are those in which a disc lies at an edge cell of the window, so we cannot be sure whether it participates in some other feature. For example, O _ _ _ could be a genuine one-in-a-row if the cell to the left of the disc is empty, or it could participate in a two-in-a-row if the cell on the left contains the player’s disc and the cell to its left is either empty or the edge of the board, as in O O _ _. So, in order to prevent duplicate and inaccurate counts, we simply check whether this position (to the left of the window’s left edge or to the right of its right edge) is legal, i.e. not located outside the board. If it is not legal, there is no problem; otherwise, we make sure that it does not contain the player’s disc.
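
The sliding-window idea can be sketched as follows. This simplified helper only counts horizontal windows of four cells that contain exactly k of the player's discs and no opponent discs; the real FeatureCount code additionally separates the open, blocked and hole variants described above.

    public final class RowScanSketch {

        /** Counts windows of 4 horizontal cells holding exactly k of the player's discs
         *  and no opponent discs, i.e. a k-in-a-row that can still become four-in-a-row. */
        public static int countRowWindows(int[][] board, int player, int k, int highestDiscRow) {
            int numCols = board.length;                       // [column][row] layout
            int numRows = board[0].length;
            int count = 0;
            // Optimization from the text: no need to scan rows above the highest disc.
            for (int row = 0; row <= Math.min(highestDiscRow, numRows - 1); row++) {
                for (int col = 0; col + 3 < numCols; col++) {
                    int mine = 0, empty = 0;
                    for (int d = 0; d < 4; d++) {
                        int cell = board[col + d][row];
                        if (cell == player) mine++;
                        else if (cell == 0) empty++;
                    }
                    if (mine == k && mine + empty == 4) count++;   // no opponent disc in the window
                }
            }
            return count;
        }
    }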

 

3.2.2.3 Counting Diagonal Features

The counts are conducted in a manner similar to the row counts. The only things we need to take into account are that we now count both diagonals leaning to the right and diagonals leaning to the left, and that the counts are done only up to and including the 3rd row (a diagonal starting any higher cannot be completed, since there is not enough room).

 

3.2.2.4 Counting Special Features

All the counts for the special features are based on a 2D array that tells us, for each cell of the board, whether the player can use that cell to win the game. We count Potential Fours that way, and use the same array to count all the other special features according to their nature (for example, for WinColumns we search for two empty cells, one above the other, both of which can bring the player victory).
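
A sketch of how such a grid of winning cells can be computed is given below: it marks every empty cell whose occupation by the player would complete a four in some direction. It is an illustration only and ignores details such as the order in which columns fill up.

    public final class SpecialFeatureSketch {

        // Direction deltas as (column delta, row delta): horizontal, vertical, two diagonals.
        private static final int[][] DIRS = {{1, 0}, {0, 1}, {1, 1}, {1, -1}};

        /** wins[c][r] is true if placing the player's disc at (c, r) would complete a four. */
        public static boolean[][] winningCells(int[][] board, int player) {
            int cols = board.length, rows = board[0].length;
            boolean[][] wins = new boolean[cols][rows];
            for (int c = 0; c < cols; c++) {
                for (int r = 0; r < rows; r++) {
                    if (board[c][r] != 0) continue;            // only empty cells can become wins
                    for (int[] d : DIRS) {
                        int run = 1                             // the hypothetically placed disc
                                + runLength(board, player, c, r, d[0], d[1])
                                + runLength(board, player, c, r, -d[0], -d[1]);
                        if (run >= 4) { wins[c][r] = true; break; }
                    }
                }
            }
            return wins;
        }

        /** Number of consecutive player discs starting next to (c, r) in direction (dc, dr). */
        private static int runLength(int[][] board, int player, int c, int r, int dc, int dr) {
            int cols = board.length, rows = board[0].length, len = 0;
            for (int cc = c + dc, rr = r + dr;
                 cc >= 0 && cc < cols && rr >= 0 && rr < rows && board[cc][rr] == player;
                 cc += dc, rr += dr) {
                len++;
            }
            return len;
        }
    }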

 

3.3 Main Functionalities of the Board

The GameBoard class contains all the functionality expected from a game board: disc insertion, determining whether the game is over, deciding the winner, switching players, etc. Each of these methods preserves the invariants of the board. The most important functionalities of the board are described below; a sketch of the insert/remove pair follows the list:

  • insertDisc – inserts a player’s disc into the specified column and updates all the relevant board fields (the 2D array, discsPerColumn, etc.). It also updates the state vector by calling the updateStateVector procedure.
  • removeDisc – removes the top disc from a given column. It is not a standard game-board operation (no player can actually remove a disc from the board); rather, it is needed as part of the MiniMax algorithm and its recursive calls: we play a move, make the recursive call and then restore the board to its previous condition in order to check another path.
  • updateStateVector – updates the state vector to hold the proper features, as explained above.
  • isOver – a predicate that checks whether the current game is over, either because some player has won (player1Four or player2Four is not 0) or because the game has reached a tie (total discs on the board = NumOfColumns x NumOfRows).
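
The insert/remove pair can be sketched as follows, written against the GameBoardSketch fields shown in Section 3.1. The bodies are simplified (for example, insertDisc does not verify that the column has room) and are not the actual GameBoard code.

    // Methods that would live inside the GameBoardSketch class from Section 3.1.
    public void insertDisc(int column, int player) {
        int row = discsPerColumn[column];                // next free row in that column
        board[column][row] = player;
        discsPerColumn[column]++;
        totalDiscs++;
        highestDiscRow = Math.max(highestDiscRow, row);
        if (discsPerColumn[column] == NUM_OF_ROWS) {
            possibleMoves.remove(Integer.valueOf(column));   // the column is now full
        }
        updateStateVector();                              // recount the features
    }

    public void removeDisc(int column) {
        int row = discsPerColumn[column] - 1;            // top disc of the column
        board[column][row] = EMPTY;
        discsPerColumn[column]--;
        totalDiscs--;
        // highestDiscRow is kept as an upper bound here; the real code would recompute it.
        if (!possibleMoves.contains(column)) {
            possibleMoves.add(column);                   // the column has room again
        }
        updateStateVector();
    }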

 

4. The Training Process and Bootstrapping

4.1 The Training Process

The training program that we built can handle the training of both player1 and player2 concurrently (AGENT), and also provides additional combinations, including a random player (RANDOM) and a trainer (TRAINER) that uses the MiniMax algorithm and always strives for the highest reward (pure exploitation).

 

4.1.1 Setting the Parameters

For reasons that are explained in Section 5, we chose the following settings for our parameters (a small sketch of these schedules follows the list):

  • The Learning Rate (α) – a decreasing function of the number of game turns, where a turn is one full circle of player1’s move and player2’s move. When calling the training program (the Training class), we can pass an argument that lets us start from an advanced turn count. The idea is to start each session with a somewhat lower learning rate, so that we can build on what has already been learned.
  • The Discount Rate (γ) – a constant value of 0.8 (approaching 1 for long-term learning).
  • ε-greedy Policy – with probability ε we make a random move (explore); otherwise we exploit what has been learned so far. ε changes as a function of the number of moves the player has made (sqrt(1/#moves)).
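
These schedules can be sketched as follows. The ε schedule follows the sqrt(1/#moves) rule above; the concrete decay formula used for the learning rate is an assumption made for the illustration, since only its decreasing behavior is specified above.

    public final class LearningSchedules {

        public static final double GAMMA = 0.8;          // constant discount rate

        /** Decreasing learning rate; startTurns lets a new session begin with a smaller value.
         *  The 1/(1 + turns) form is an illustrative choice, not the exact function we used. */
        public static double learningRate(int turn, int startTurns) {
            return 1.0 / (1.0 + turn + startTurns);
        }

        /** Exploration probability of the epsilon-greedy policy: sqrt(1 / #moves made). */
        public static double epsilon(int movesMade) {
            return Math.sqrt(1.0 / Math.max(1, movesMade));
        }
    }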

 

4.1.2 Weight Vector per Player

The training process was designed to find the optimal weights for our weight vector (for the Q-function approximation). We wanted to learn two different sets of weights: one set for player1 and the other for player2. The logic behind this choice is that the best set of weights may differ when playing as player1 rather than as player2, since the players have different starting points.

 

4.1.3 Training Details

We decided on a gradual training scheme so that Nemesh could learn the game thoroughly from its basics, and only then proceed to play against more advanced players. We wouldn’t want to frustrate it from the beginning… The training consisted of the following steps:

  • Running a short session of 1,000 games against a random player, for both player1 and player2. Each player has a depth of 3 in the MiniMax tree. The idea is that the players will get the flavor of the game. The number of turns will start from 0 (high learning rate).
  • A session of 100,000 games, in which the players will train against one another. Each player has a depth of 3 in the MiniMax tree. The number of turns will start from 0 (high learning rate).
  • A session of 100,000 games, in which the players will train against one another. Each player has a depth of 4 in the MiniMax tree. The number of turns will start from 10 (smaller learning rate).
  • A session of 100,000 games, in which the players will train against one another. Each player has a depth of 5 in the MiniMax tree. The number of turns will start from 20 (even smaller learning rate).

 

In order to verify that our player had learned well, we played against it ourselves and let it play against the GoodPlayer. We saw some progress over the sessions, although the weights had pretty much converged after the first serious session.

 

4.2 Bootstrapping

As was mentioned before, our agent started to train against itself, with some look-ahead provided by the MiniMax tree.

 

4.2.1 Initializing the Weight Vector

In order to further enhance the training process, we initialized the players’ weight vectors with non-arbitrary weights as we saw fit. Since the state vector is composed of player1’s features followed by player2’s features, and since we wanted to train two different sets of weights, player1’s weight vector was initialized with positive values for its own good features (most of its side of the vector) and with negative values for its opponent’s features. Player2’s weights were initialized with the same values, only reversed (what is good for player2 is bad for player1 and vice versa).

 

4.2.2 Coerced Moves

As a sort of guidance for our player, we used lists containing coerced moves: a list of forced moves that the player must make in order to win, and a list of prohibited moves that the player must not make since they would cost it the game. These lists were updated after each disc insertion, along with the feature counts. Before selecting an action using the MiniMax algorithm, the player first checked for its own winning moves and then for its opponent’s winning moves (in that order). If such a move exists, the player takes it, because it can thereby either win or prevent the opponent from winning, thus keeping the game going with a chance of winning.
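
The selection logic can be sketched as follows; the list parameters and the fallback supplier are illustrative stand-ins for the per-player lists kept by the board and for the call to the MiniMax search.

    import java.util.List;
    import java.util.function.Supplier;

    public final class CoercedMoveSketch {

        /** Returns a coerced move if one exists; otherwise falls back to the regular search. */
        public static int chooseMove(List<Integer> myWinningColumns,
                                     List<Integer> opponentWinningColumns,
                                     Supplier<Integer> miniMaxMove) {
            if (!myWinningColumns.isEmpty()) {
                return myWinningColumns.get(0);          // a winning move: take it immediately
            }
            if (!opponentWinningColumns.isEmpty()) {
                return opponentWinningColumns.get(0);    // block the opponent's winning column
            }
            return miniMaxMove.get();                    // no coerced move: run MiniMax as usual
        }
    }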

 

5. Project Thoughts and Issues

Throughout the project we had to deal with both technical and content-related issues. The way we handled them heavily influenced our working process and the way we decided to train our agent. In this section we list the problems we encountered and present our best reasoning and solutions (we could not find a solution for every problem, but we tried our best to achieve what we consider the optimal result).

 

5.1 Setting the Learning Parameters

There are three main parameters that can affect the learning progress and that we must consider: the learning rate (α), the discount rate (γ) and the ε-function (the ε-greedy policy).

 

5.1.1 Setting the Learning Rate α

The learning rate determines to what extent new information overrides old information. A factor of 0 makes the agent learn nothing (since the error term is multiplied by 0), while a factor of 1 makes the agent consider only the most recently learned information. [5]

Following Sutton and Barto’s online book (chapter 8.2, Gradient-Descent Methods) [2], the learning rate (or step-size parameter, as they call it) should decrease over time in order for the gradient-descent method to reach a local optimum. They elaborate by noting that we neither wish nor expect to find a function that has zero error in all states, but only an approximation that balances the errors across different states. That is why we take only a small step towards reducing the error: we could not balance the errors if we completely corrected the error at each state.

In addition, since our learning can be thought of as “batch learning” (chapter 6.3, Optimality of TD(0)), the values can converge independently of the learning-rate parameter, as long as it is chosen to be sufficiently small.

We completely agree with Sutton and Barto’s argument, and from our perspective decreasing the learning rate over time is the logical thing to do: at first we want to learn as much as we can, and over time we only want to make minor adjustments to what we have achieved so far.

 

5.1.2 Setting the Discount Rate γ

The discount rate determines the importance and influence of future rewards. A factor of 0 makes the agent opportunistic, i.e. it considers only current rewards, while a factor approaching 1 makes it strive for a long-term high reward. [2, 5]

Following Sutton and Barto’s online book (chapter 3.3, Returns) [2], since we want our agent to be as farsighted as possible, it is advised to use a discount rate approaching 1 (but not 1 itself, since the weight values may diverge [5]).

After a number of trial-and-error sessions, we fixed the value at 0.8.

 

5.1.3 Setting the ε-function

This is the online policy for selecting an action, which balances the exploration vs. exploitation trade-off. With probability ε we make a random move (explore), while with probability (1 - ε) we exploit what has been learned so far.

Following Sutton and Barto’s online book (chapter 6.4, Sarsa: On-Policy TD Control) [2], ε should be defined as a decreasing function of time; for example, ε = 1/t should make the values converge.

From our perspective this is indeed a logical choice, since we want to explore more at the beginning of each session and to utilize what we have learned so far towards its end.

In order to enhance the exploration stage, we tried using an epsilon=sqrt(1/t) function. It turned out to be a good choice indeed.

 

5.2 State Representation – More vs. Fewer Features

As previously mentioned, our final state vector was composed of 36 features: 18 features for player1 followed by 18 parallel features for player2. However, this was not our first choice. We started our implementation with a few extra features on both sides, features that were rejected as the work progressed. Those features are listed below:

  • AllRowTwos – Includes all cases of two-in-a-row that can be completed to four-in-a-row, possibly not right away.
  • OneInARow – Includes all cases of one-in-a-row that can be completed immediately to four-in-a-row.
  • AllRowOnes – Includes all cases of one-in-a-row that can be completed to four-in-a-row, possibly not right away.
  • TwoInACol – Includes all cases of two-in-a-column, with a possibility to complete a four-in-a-column (enough room).
  • OneInACol – Includes all cases of one-in-a-column, with a possibility to complete a four-in-a-column (enough room).
  • AllDiagonalTwos – Includes all cases of two-in-a-diagonal that can be completed to four-in-a-diagonal, possibly not right away.
  • FourInARow – Includes all cases of four-in-a-row.

 

Some of the features were rejected because they introduced too much noise and were not that important (such as AllRowOnes and OneInACol). The other features were rejected because they did not contribute to the main goal: after some trial-and-error sessions, we saw very little contribution from those features (weight values approaching 0). We could just as well have left them as is.

One interesting thing worth mentioning is the FourInARow feature: the weight corresponding to this feature never updated. At first we thought something was wrong with our implementation, but then we realized that this weight can never be updated under the SARSA learning scheme, since it is a terminal feature. Once some player reaches this feature the game is over, so the s_t vector (the previous state) used by the SARSA update will always have the value 0 for this feature, and its weight will never be updated.

 

5.3 Randomization

We thought of ways to insert some randomization into our game, so that two players would not play the same game over and over (deterministically). A nice solution was implemented as part of the MiniMax algorithm: in order to obtain different move choices when several actions are equally good, we shuffle the list of possible moves before we iterate over it. Since we implemented our tree with alpha-beta pruning, reaching an optimal score causes a cutoff, and the following paths are not traversed.

Thus, shuffling the possible moves effectively shuffles the paths we traverse in the tree, and may result in a different choice of action each time several moves are equally good.
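
A minimal sketch of this shuffling step is shown below; it simply returns a shuffled copy of the legal columns for the MiniMax loop to iterate over.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public final class MoveShuffleSketch {

        /** Returns a shuffled copy of the legal columns, so ties are broken differently each game. */
        public static List<Integer> shuffledMoves(List<Integer> possibleMoves) {
            List<Integer> copy = new ArrayList<>(possibleMoves);
            Collections.shuffle(copy);   // a different ordering changes which equally good move wins
            return copy;
        }
    }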

 

6. Unsuccessful Attempts

On our way to the final fully-trained agent, we had our share of unsuccessful attempts and not-so-productive choices. Below we detail some of the main problems we came across during the project.

 

6.1 Non-Converging Weights

As we tried to train Nemesh, we got many results that were not logical (the weights completely diverged). We believe this divergence was due to a number of factors:

  • Setting the learning rate too high – before we read Sutton and Barto’s notes on the learning rate (as explained in the previous section), we set it too high (above 0.5) for too long (10,000 games), and the weights diverged (we received a lot of NaNs).
  • Setting the initial weights too high – having no mechanism for normalizing the weights, we believe that setting the initial weight values too high was a contributing factor to the divergence.
  • Setting the new expected return at the end of the game to 1 – in their chapter about the SARSA algorithm, Sutton and Barto note that the new expected return in case of a terminal state should be set to 0. At the beginning we thought it would be a good idea to reward the winning player with 1 and fine the loser with -1 as the new expected return, in addition to the immediate reward (0 only at ties). The basic reasoning was that since the features of both player1 and player2 appear in the same state vector, we wanted to make sure that the winning player receives a good score and the loser a bad one (so that a player is not “punished” for winning or rewarded for losing). That turned out to be a mistake, since the terminal state does not really contribute to the learning; the immediate reward should reflect winning or losing.

 

In order to solve this problem, we tried to normalize the weight vector and the approximated value of the Q-function in two different schemes: (a) dividing each value of the dot product by the maximal value right after performing the dot product (state features times weights); (b) dividing the weight vector itself by its Euclidean norm, so that its norm was always 1.

While conducting empirical tests of both methods, we did not notice any significant improvement in our agent’s performance (although the weights no longer diverged). After some more testing with the learning rate, we found a scheme that works and does not make the weights diverge. The agent trained with the learning-rate manipulation scheme (as depicted in Section 4) yielded the best results among these tests.
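
For completeness, a sketch of the second normalization scheme (dividing the weight vector by its Euclidean norm) is shown below; the class and method names are illustrative.

    public final class WeightNormalizationSketch {

        /** Rescales the weight vector in place so that its Euclidean norm becomes 1. */
        public static void normalize(double[] theta) {
            double sumOfSquares = 0.0;
            for (double w : theta) sumOfSquares += w * w;
            double norm = Math.sqrt(sumOfSquares);
            if (norm == 0.0) return;                     // an all-zero vector cannot be rescaled
            for (int i = 0; i < theta.length; i++) theta[i] /= norm;
        }
    }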

 

6.2 Logical Problems with the SARSA learning algorithm

We had some trouble figuring out what s_{t+1} stands for in a two-player game. All the material on the SARSA algorithm that we came across treats the concept abstractly: an agent performs action a_t while in s_t, receives an immediate reward r_t, proceeds to state s_{t+1}, chooses an action a_{t+1} and updates the Q-function. This was a little confusing, since we could come up with two possible interpretations of s_{t+1}: it could refer to the state right after the player makes its move, or to the state after both the player's move and its opponent's move. Things get a little more complicated when one player can win with a move: how should s_{t+1} be determined then? We started off by implementing the algorithm completely wrong, and did not save the state before the player's action; instead, we updated the weights using the state right after the action as s_t (which is obviously wrong, since s_t must be the state right before the action).
After we corrected that logical error and saved the state just before the action took place, we updated the weights right after the player made its move. This was again a mistake: since we treated s_t as the state just before the player's action, it is the state reached after the player's previous move and its opponent's reply. The weight update, however, did not take that into account, since the weights were updated just before the next state was actually reached (so we updated the weights according to an intermediate Q_t(s_{t+1}, a_{t+1}), a value that does not take the opponent's reaction into consideration).
In our final implementation, the weights are updated only after the opponent has moved, so that the opponent's move is counted as well.

 

6.3 Interchanging state between player1 and player2

In our first implementation, a state was represented as a feature vector that held values for the current player and its opponent. For example, if player1 was the current player, the feature vector would look like [player1_features, player2_features] (the same features for both players). Every time the current player changed, so did the feature vector: [p1_features, p2_features] --changePlayer--> [p2_features, p1_features] --changePlayer--> [p1_features, p2_features] and so on.
We thought that this was the best way to realize two sets of weights for the two different players, player1 and player2. Although the implementation was coherent, we decided that a non-changing state vector (i.e. one that does not switch between the player1 and player2 views) would be better and would prevent future mishaps. So we changed our implementation so that the state vector always consists of player1's features followed by player2's features.

 

6.4 Optimizations for the MiniMax Algorithm

We tried to apply some optimizations to the basic MiniMax implementation, in order to improve performance and allow the algorithm to efficiently scan the decision tree to even deeper depths. Although these optimizations were not applied in the final training process, we detail the ideas behind them:

  • A scan performed only in the immediate surroundings of the opponent’s previous move. For example, if the opponent placed a disc in the first column from the left, only the first four columns from the left would be traversed in the tree. The idea was that an opponent’s move can only affect the decision for its immediate surroundings. This turned out to be false, since we cannot estimate how the move will affect the evaluation elsewhere on the board, and we may want to examine other paths even if they are not in the immediate surroundings (there may be a better move there).
  • Saving previous alpha-beta values in order to determine the scanning priority of future searches. This optimization was logically hard to code and even harder to debug and validate “beyond a reasonable doubt”. Since the main goal of the project was utilizing RL methods and not perfecting the MiniMax algorithm, this optimization was rejected.

7. High Level Architecture

7.1 Training Program Schematic Relations between the Classes

[Diagram: schematic relations between the classes of the training program, starting at the Training class; the arrow legend is given in the Graph Index below.]

Graph Index (start at Training Class):

 

             - Class instantiates another class.

                                         

             - Class invokes a method of another class.

 

             - Class contains a private field of another class.


7.2 Agent Schematic Relations between the Classes

[Diagram: schematic relations between the classes of the agent, starting at the Nemesh class; the arrow legend is given in the Graph Index below.]

Graph Index (Start at Nemesh Class):

 

             - Class instantiates another class.

                                         

             - Class invokes a method of another class.

 

             - Class contains a private field of another class.

 

             - Class implements an interface.

 


7.3 Classes overview

Training – A class whose sole purpose is to initialize the training program by calling the TrainAgent static method of the SarsaLearning class.

SarsaLearning – This class is the heart of the training process. It simulates game training sessions for two TrainPlayers by successive calls to GameSimulation. It also holds all the relevant parameters of the learning algorithm (learning rate and discount rate) and contains a static method for updating the weight vector.

GameSimulation – The GameSimulation class manages all the data required to simulate a game (GameBoard and Players), and is responsible for the flow of the game. Players take turns in the game (starting from player1), and the main game loop keeps on running until one of the players has won or the game ended in a tie. This class also holds specific player variables that are needed for updating the player’s weight vector (previous state of the board, old expected return, immediate reward and new expected return). The weight vector updating scheme is integrated all over the flow of the game for both players (we can train two players simultaneously).

TrainPlayer – A class that represents a training player. The player can be our learning agent (AGENT), a smart trainer (TRAINER) or a random player (RANDOM). The class holds the player’s weight vector and MiniMax tree depth (in case it is an agent or a trainer), and contains a chooseMove method that chooses the next move for the player according to its type: a random player chooses a random move; a trainer chooses the best possible move according to the MiniMax algorithm (always exploits); an agent chooses a move according to an ε-greedy policy.

MiniMax and Best classes – The two classes compose the MiniMax algorithm which is encapsulated within the static recursive chooseMove method. The method returns the Best data structure, which contains the best move for the calling player and its score.

GameBoard – A class that represents a game board in all its complexity. This class is elaborated in detail in Section 3 of this report.

Vector – A class that represents a vector, along with all of its functionality (initialization, add value to index, write/read vector to/from a file etc.).

FeatureCount – A class that is responsible for counting all the state features, as defined in the GameBoard class. The class is divided into a few static methods, each one of them can be used to count different type of features (column features, row features etc.).

InitializeStartingWeights – A class that initializes the weights according to what we saw fit.

Nemesh and GameGUI – The Nemesh class is our agent, and it implements the FourInARow interface as was defined in the workshop’s website. It reads the weights that were learned during the training process, and chooses its move according to the MiniMax algorithm. It is also accompanied by GUI software that presents the logical board in a much more convenient and user-friendly way.

 

8. Infrastructure

8.1 Watching a Full Simulated Training Game

Our main concern was debugging the training process, so we added an extra argument to the TrainAgent method that allows us to watch a full simulated game with printouts to the console. There is no need to watch more than one game per session, since the games are quite long and representative.

 

8.2 Testing the FeatureCount Class

In order to debug the feature counts, we implemented a few static functions in the GameBoard class: GenerateRandomBoard, GenerateRandomSizes and PrintBoard. The purpose of these functions is to generate a random (yet legitimate) board and print it nicely to the console, so that we can debug the feature counts.

 

8.3 MiniMax Debugging

Due to its recursive nature, MiniMax was the hardest class to debug. In order to debug it, we used the debugger (step by step) and drew the tree for ourselves. We tried inserting printouts inside the function and debugging for small depths, but stepping through with the debugger turned out to be quicker and more useful.

 

8.4 General Testing

We used some more general testing methods in order to check our agent’s implementation and performance (for example, the feature counts). This was done in two main ways:

  • We generated a board and “simulated” a game in which we can control each move done by the players (step by step). We wanted to bring the game to more interesting situations, and test our implementation there (i.e. the MiniMax algorithm for choosing the best move, not falling into traps etc.).
  • We played a lot of games against Nemesh as checkpoints. This way we could get a general impression of the playing level of Nemesh, and verify that it makes all of the needed coerced moves.

 

 

 


 

References

  1. Connect Four – Wikipedia: http://en.wikipedia.org/wiki/Connect_Four
  2. Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.
  3. Precup, D., Sutton, R. S. and Dasgupta, S. (2001). Off-Policy Temporal-Difference Learning with Function Approximation. Proceedings of the 18th International Conference on Machine Learning.
  4. Game Trees – online YouTube lecture, University of California, Berkeley: http://www.youtube.com/watch?v=Unh51VnD-hA
  5. SARSA – Wikipedia: http://en.wikipedia.org/wiki/SARSA
  6. Minimax – Wikipedia: http://en.wikipedia.org/wiki/Minimax

 

 

 

 

 

 

 

 

 

 

 

 


Appendix A

Player1 Weights:

[Graphs showing the progress of each player1 weight over the training sessions.]

Player2 Weights: