Reinforcement Learning Workshop – 2012/3
Agent Nemesh
Project Report submitted by Shay Shalita,
Matan Saranga and Netanel Haim
April 2013
The Blavatnik School of Computer Science
Raymond and Beverly Sackler Faculty of Exact Sciences
Tel-Aviv University
1. Project Preface
1.1 Game Introduction [1]
Four in a Row (also known as Captain’s Mistress, Connect Four, Four Up, Plot Four, Find Four, Fourplay, and Four in a Line) is a two-player game in which the players first choose a color and then take turns dropping colored discs from the top into a seven-column, six-row vertically suspended grid. The pieces fall straight down, occupying the next available space within the column. The object of the game is to connect four of one’s own discs next to each other vertically, horizontally, or diagonally before the opponent does so.
1.2 Overview of the Project
This project was designed and implemented as part of a Reinforcement Learning workshop (0368.3500) at the Blavatnik School of Computer Science, the Raymond and Beverly Sackler Faculty of Exact Sciences, Tel-Aviv University.
The main goal of this workshop is to design, implement and train a software agent that utilizes reinforcement learning concepts in order to play the Four in a Row board game. For reasons to be explained later on, we chose the SARSA learning pattern (on-policy model-free) as our main learning scheme.
As opposed to the traditional game specification, where the game board can be viewed as a 6x7 matrix of cells, the agents in our workshop had to deal with a slightly different game board of 6x8. Such a representation allows a symmetric view of the game board.
This paper is divided as follows: we start by introducing the learning method and all of its relevant algorithms (Section 2) and move on to the specific details of the game representation (Section 3). We then turn to describe the training process and bootstrapping methods (Section 4) followed by a discussion on important ideas and issues that arose throughout our work (Section 5). Afterwards, we present and discuss some of the main unsuccessful attempts and difficulties that were encountered and dealt with (Section 6). We proceed with a general overview of the architecture of the training program and our agent, Nemesh (Section 7), and then present some of our ideas regarding the project’s infrastructure and debugging schemes (Section 8). Appendix A contains some graphs that show the progress for each weight over training sessions.
2. Learning Method
As mentioned above, we implemented and trained our agent Nemesh according to the SARSA learning algorithm. In addition, we decided to implement and use the MiniMax algorithm, for the reasons explained below.
2.1 SARSA – an On-Policy Model-Free Learning Algorithm
2.1.1 The Algorithm
SARSA derives its name as an abbreviation of the State-Action-Reward-State-Action sequence, and indeed the main idea of the SARSA algorithm is to learn the action-value function (Q-function), which holds the expected return of starting from state s_i, performing action a_j, and then proceeding according to the current policy. As a member of the Temporal-Difference (TD) family of algorithms, the learning in SARSA is done online and is model-free: the agent interacts with its environment and updates the Q-function according to the actions actually taken, while we control the agent’s next action (on-policy), for example using an ε-greedy policy. The heart of the algorithm lies in the following formula:
Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α·[r_t + γ·Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)]
Where α is the learning rate, γ is the discount rate, r_t is the immediate reward received after taking action a_t in state s_t, and Q_t is the current estimate of the action-value function.
2.1.2 Linear Approximation
Since the game at hand forces us to examine a very large state space, we turn to a linear approximation of the Q-function. In order to reduce the number of states, each state of the board is mapped into a k-feature vector with pre-defined features (k < #possible states), all of which are informative with respect to the Four in a Row game (e.g. the number of open threes in a row, the number of blocked threes in a row, etc. See Section 3 – Game Representation). The main idea of the approximation is that the value of the Q-function can be thought of as a linear combination of some weight vector (θ1, θ2, …, θk) with the feature vector of the current state. By doing so, we reduce the problem of finding the Q-function values to finding the values of the weight vector.
The values of the weight vector are updated by applying gradient descent (which reduces the error by stepping in the opposite direction of the gradient) within the TD framework, as indicated by the following formula:
θ_{t+1} = θ_t + α·[r_t + γ·Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)]·φ(s_t)
Where θ_t is the weight vector at time t, φ(s_t) is the feature vector of state s_t (which is also the gradient of the linear approximation with respect to the weights), and α, γ and r_t are as above.
2.1.3 Motivation and Reasoning
We chose SARSA as our main learning algorithm for some intuitive reasons and for some more educated ones, most of which rely on Sutton and Barto’s work in the field of reinforcement learning [2]. The main reasons for choosing SARSA are:
2.1.4 Related Implementation Mechanisms and Data Structures
In order to achieve the SARSA learning pattern, we implemented two mechanisms:
2.2 The MiniMax Algorithm
2.2.1 The Algorithm
MiniMax is a decision algorithm that is quite common in game theory, especially in two-player games [6]. The main idea underlying the algorithm is that the game can be thought of as a decision tree for two players with opposite goals: a Max player who wants to maximize its profit at a given level, and a Min player who wants to minimize it. The leaves of this tree contain a score indicating the “worthwhile-ness” of that specific path of decisions.
The MiniMax algorithm simply traverses this decision tree (the whole tree, or down to a certain depth) and finds the best possible move for the current player, i.e. an action that guarantees the best outcome for the player regardless of what its opponent does.
2.2.2 Motivation and Reasoning
Our foremost motivation for using the MiniMax algorithm is closely related to the bootstrapping process. We wanted to create a somewhat “smarter” Trainer with a look-ahead of a few moves, instead of just a random trainer (although we did start training the agent against a random opponent, just so it could get a sense of the game. See Section 4 – The Training Process and Bootstrapping). From our perspective, not only does this look-ahead speed up the training process, it also creates a training environment suitable for enhancing the performance of our agent.
In addition, we arrived at the conclusion that the MiniMax algorithm could also be of use during the actual game itself: an agent with a decent look-ahead can perform much better than an agent with no look-ahead at all.
2.2.3 Related Implementation Mechanisms and Data Structures
Our implementation of the MiniMax algorithm follows an online YouTube lecture from UC Berkeley that explains the main principles of game trees and provides general pseudocode for the algorithm (with or without alpha-beta pruning) [4]. We chose to implement the version with alpha-beta pruning in order to speed up the traversal of the tree.
The implementation relies mostly on the static recursive method chooseMove in the MiniMax class (in fact, this is the only method in the class). This method returns the Best data structure, which holds the best move for the current player and the score of that move.
The recursive algorithm that we implemented traverses the tree in DFS order. For each legal move out of the list of possible moves, the algorithm plays that move, generates a recursive call with the proper arguments (the new board, the MIN/MAX player and a decreased depth), receives a reply, and restores the board to its previous condition. Given the best reply, the algorithm checks whether that score is better than what it has achieved so far, and if so it updates the player’s best score and move.
When the recursion reaches a leaf (either the game is over or the required depth has been reached), the algorithm calculates the score of that leaf as the linear combination of the leaf’s state vector with the player’s weight vector, and returns the Best data structure with the proper values. This way, values bubble up the tree in DFS fashion. Whenever alpha >= beta at some inner node, an alpha/beta cutoff can be applied (alpha-cutoff for a Min node and beta-cutoff for a Max node).
3. Game Representation
As was mentioned before, the specification of the game board throughout the whole workshop was that of [6 rows x 8 columns], which can be represented as a 6x8 matrix. In order to maintain a game board with all its complexities, we created the GameBoard class which holds all the properties related to the game board. We shall not cover all of GameBoard’s functionality here (since some of it is rather trivial), but we will go over its main fields and functions. Through these elements we will gain a more coherent understanding of the board representation, the state features and their maintenance.
3.1 Board Representation
We chose to implement our board as a two-dimensional array of integers, where each pair of indices represents a cell of the board. Each cell can receive one of the following values:
The 2D array is of size [NumOfColumns x NumOfRows]. This is a little counterintuitive for a matrix representation, but since most of our searches traverse the board column by column, it proved to be a good choice.
In order to speed up performance, we maintain a few extra fields in addition to the 2D board array:
We also keep two more obvious fields: one that holds the winner of the board and another that holds the board’s current player.
3.2 State Representation
3.2.1 Defining the State Features
As explained in Section 2, we chose to represent a state of the board as a list of features in order to reduce the number of states. The features themselves, as the representation of a board state, are defined as part of the board representation (in our case, in the GameBoard class). All the features were chosen carefully and are meaningful with respect to the game at hand.
Since the representation of these features is constant, we chose to implement them as static final integers, where each feature received a unique identifier.
We defined a total of 18 features per player and built the state vector as a concatenation of these features for player1 and for player2, like so: [p1, p2, p3, …, p18, q1, q2, q3, …, q18], where p denotes player1’s features and q denotes player2’s features.
The reason we created such a long state vector (36 features) is to allow us to treat the counts for player1 and the counts for player2 as having different weights.
The 18 distinct features can be divided into row-related features, column-related features, diagonal-related features and special features, which are detailed below:
Row Features:
Column Features:
Diagonal Features:
Special Features:
For the OneMoveToWin and ProhibitedMoves features, we specifically created list data structures which are yet more fields of the game board (two for each player). The idea behind these data structures is to allow us to force the player to take certain actions: actions that will make it win the game, actions it must take in order to keep the game going (i.e. not lose), and to prohibit actions that would cost it the game.
3.2.2 Counting the State Features
In order to count all of the above features for a player, we created the FeatureCount class which contains a few static methods that perform all the counts (method for counting row-features, method for counting column-features etc.). We shall go over the general ideas that underlie the counting methods for each feature – as for the specific details, the code is quite informative and well-documented in this respect.
3.2.2.1 Counting Column Features
For each column we search for the longest sequence of the player’s discs at the top of the column. When such a sequence is found, it is added to the relevant feature counter, provided there is enough room left in the column to complete a four-in-a-row. Otherwise, the column is not worth much.
3.2.2.2 Counting Row Features
We traverse the board in windows of four cells and search for each feature separately (three-in-a-row, four-in-a-row etc.). As an optimization, we traverse only those rows that are at or below the row of the highest disc on the board (there is no need to check empty rows above it). The most problematic cases are those in which a disc sits in an edge cell of the window, so we cannot be sure whether it participates in some other feature. For example, O _ _ _ could be a genuine one-in-a-row if the cell to the left of the disc is empty, or it could be part of a two-in-a-row if the cell on its left contains the player’s disc and the cell beyond that is either empty or the edge of the board (O O _ _). So, in order to prevent duplicate and inaccurate counts, we simply check whether that position (just left of the window’s left edge, or just right of its right edge) is legal, i.e. is not located outside the board. If it is not legal, there is no problem; otherwise, we make sure that it does not contain the player’s disc.
3.2.2.3 Counting Diagonal Features
The counts are conducted in a similar manner to the row counts. The only additional considerations are that we count both right-leaning and left-leaning diagonals, and that the counts are performed only up to and including the 3rd row (a diagonal starting higher does not have enough room for four discs).
3.2.2.4 Counting Special Features
All the counts for the special features are based upon a 2D array which tells us, for each cell of the board, whether that cell can be used by the player to win the game or not. We can count Potential Fours that way, and use the array to count all the other special features according to their nature (for example, for WinColumns we search for two empty cells, one above the other, both of which can bring the player victory).
3.3 Main Functionalities of the Board
The GameBoard class contains all the functionality expected from a game board: disc insertion, determining whether the game is over, deciding the winner, switching players, etc. Each of these methods preserves the invariants of the board. The most important functionalities of the board are described below:
4. The Training Process and Bootstrapping
4.1 The Training Process
The training program that we’ve built can handle the training of both player1 and player2 concurrently (AGENT), and provides much more neat combinations including a random player (RANDOM) and a trainer (TRAINER) which uses the MiniMax algorithm and always strives for the highest reward (exploitation).
4.1.1 Setting the Parameters
For reasons that are explained in section 5, we chose to use the following setting for our parameters:
4.1.2 Weight Vector per Player
The training process was designed to find the optimal weights for our weight vector (used in the Q-function approximation). We wanted to learn two different sets of weights: one set for player1 and another for player2. The logic behind this choice is that the appropriate weights can differ when playing as player1 as opposed to player2, since the players have different starting points.
4.1.3 Training Details
We decided on a gradual training scheme so that Nemesh could learn the game thoroughly from its basics, and only then proceed to play against more advanced players. We wouldn’t want to frustrate it from the beginning… The training consisted of the following steps:
In order to verify that our player had learned well, we played against it ourselves (via the human player) and let it play against the GoodPlayer. We saw some progress over sessions, although it had pretty much converged after the first serious session.
4.2 Bootstrapping
As was mentioned before, our agent started to train against itself, with some look-ahead provided by the MiniMax tree.
4.2.1 Initializing the Weight Vector
In order to enhance the training process even further, we initialized the players’ weight vectors with non-arbitrary weights as we saw fit. Since the state vector is composed of player1’s features followed by player2’s features, and since we wanted to train two different sets of weights, player1’s weight vector was initialized with positive values for its own features (its side of the vector) and with negative values for the features representing its opponent. Player2’s weights were initialized with the same values, only reversed (what is good for player2 is bad for player1 and vice versa).
4.2.2 Coerced Moves
As a sort of guidance, we used lists of coerced moves for our player: a list of forced moves that the player must make in order to win, and a list of prohibited moves that the player must not make since they would cost it the game. These lists were updated after each disc insertion, along with the feature counts. Before selecting an action using the MiniMax algorithm, the player first checked for its own winning moves and then for its opponent’s winning moves (in that order). If such a move exists, the player takes it, because it means that it can either win or prevent the opponent from winning, thus keeping the game going with a chance of winning.
5. Project Thoughts and Issues
Throughout the entire project, we had to deal with both technical and content-related issues. The way we handled those issues heavily influenced our working progress and the way we decided to train our agent. In this section we list the problems encountered and present our best reasoning and solutions (we could not find a solution for every problem, but we tried our best to achieve what we considered the optimal result).
5.1 Setting the Learning Parameters
There are three main parameters which can affect the learning progress and which we must consider: the learning rate (α), the discount rate (γ) and the ε-function (ε-greedy policy).
5.1.1 Setting the Learning Rate α
The learning rate determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing (since the error term is multiplied by 0), while a factor of 1 makes the agent consider only the most recently learned information. [5]
Following Sutton’s and Barto’s online book (chapter 8.2, Gradient Descent Methods) [2], the learning rate (or step-size parameter as they call it) should decrease over time in order for the gradient descent method to reach a local optimum. They elaborate their argument by saying that we do not wish or expect to find a function that has zero errors in all states, but only an approximation that balances the errors in different states. That is the reason why we take only small steps towards our error – we could not balance the errors if we completely corrected the error at each state.
In addition, since our learning can be thought of as “batch learning” (chapter 6.3, Optimality of TD(0)), the values can converge independently of the learning rate parameter, as long as it is chosen to be sufficiently small.
We completely agree with Sutton and Barto’s argument, and from our perspective decreasing the learning rate over time is the logical thing to do: at first we want to learn as much as we can, and over time we only want to make minor adjustments to what we have achieved so far.
5.1.2 Setting the Discount Rate γ
The discount rate determines the importance and influence of future rewards. A factor of 0 makes the agent opportunistic, i.e. it considers only current rewards, while a factor approaching 1 makes it strive for a long-term high reward. [2, 5]
Following Sutton’s and Barto’s online book (chapter 3.3, Returns) [2], since we want our agent to be as much farsighted as possible, it is advised to use a discount rate approaching the value of 1 (but not 1 itself, since the weight values may diverge [5]).
After a number of trial and error sessions, we fixed the value at 0.8.
5.1.3 Setting the ε-function
This is the online policy for selecting an action, which balances exploration against exploitation. With probability ε we make a random move (explore), while with probability (1−ε) we exploit what has been learned so far.
Following Sutton’s and Barto’s online book (chapter 6.4, Sarsa: On-Policy TD Control) [2], the epsilon should be defined as a function of time, so that epsilon=1/t should make us converge.
From our perspective this is a logical choice indeed, since we want to explore more at the beginning of each session and to utilize what we have learned so far toward the end of each session.
In order to enhance the exploration stage, we tried using ε = sqrt(1/t). It turned out to be a good choice indeed.
5.2 State Representation – More vs. Fewer Features
As was previously mentioned, our final state vector was composed of 36 features: 18 features for player1 followed by 18 parallel features for player2. However, this was not our first choice. We started our implementation with a few extra features for both sides, features that were rejected during the course of the work. Those features are listed below:
Some of the features were rejected because they introduced too much noise and were not so important (such as AllRowOnes and OneInACol). The other features were rejected because they did not contribute to the main goal. After some trial and error sessions, we saw very little contribution from those features (weight values that approached 0). We could just as well have left them in.
One interesting thing worth mentioning is the FourInARow feature. The weight corresponding to this feature was never updated. At first we thought that something was wrong with our implementation, but then we realized that this weight can never be updated under the SARSA learning scheme, since it is a terminal feature: once a player reaches it, the game is over, so the s_t vector (the previous state) used by the SARSA update always has the value 0 for this feature, and its weight is never updated.
5.3 Randomization
We thought of ways to insert some randomization into our game, so that two players would not play the same game over and over (deterministically). A nice solution was implemented as part of the MiniMax algorithm: in order to obtain different move choices when there are several equally good actions, we shuffle all possible moves before we iterate over them. Since we implemented the tree with alpha-beta pruning, reaching an optimal score causes a cutoff and the following paths are not traversed.
Thus, shuffling the possible moves actually shuffles the paths we traverse in the tree and may result in a different choice of action each time we have some equally good ones.
6. Unsuccessful Attempts
On our way to the final, fully trained agent, we had our share of unsuccessful attempts and unproductive choices. Below we detail some of the main problems we came across during the project.
6.1 Non-Converging Weights
As we tried to train Nemesh, we got many results that were not logical (the weights totally diverged). We believe that this divergence was due to a number of factors:
In order to solve this problem, we tried to normalize the weight vector and the approximated value of the Q-function in two different schemes: (a) dividing each value of the element-wise product of the state features and the weights by the maximal value, right after performing the product; (b) dividing the weight vector itself by its Euclidean norm, so that its norm was always 1.
While conducting empirical tests of both methods, we did not notice any significant improvement in our agent’s performance (although the weights did not diverge). After some more experimentation with the learning rate, we found a scheme that works and does not make the weights diverge. The agent trained with the learning rate manipulation scheme (as described in Section 4) yielded the best results among these tests.
6.2 Logical Problems with the SARSA Learning Algorithm
We had some trouble figuring out what s_{t+1} stands for in a game of two players. All the material regarding the SARSA algorithm that we came across treats SARSA in a more abstract way: there is an agent that performs action a_t while in state s_t, receives an immediate reward r_t, proceeds to state s_{t+1}, chooses an action a_{t+1} and updates the Q-function. It was a little confusing, since we could come up with two possible interpretations of s_{t+1}: it could refer to the state right after the player has made its move, or it could refer to the state after both the player’s move and its opponent’s move. Things get a little more complicated when one player can win after a move: how should s_{t+1} be determined then? We started off by implementing the algorithm completely wrong and did not save the state before the player’s action; rather, we updated the weights considering the state right after the action as s_t (which is obviously wrong, since we need to take s_t as the state right before the action).
After we corrected that logical error and saved the state just before the action took place, we updated the weights right after the player made its move. This was again a mistake: since s_t is the state just before the player’s action, the next state s_{t+1} should be the state reached after the player’s move and its opponent’s reply. However, our weight update did not reflect that, because the weights were updated before that next state was actually reached (so we updated the weights according to an intermediate Q_t(s_{t+1}, a_{t+1}), a value that does not take the opponent’s reaction into consideration). In our final implementation, the weights are updated so that the opponent’s move is taken into account as well.
6.3 Interchanging the State between Player1 and Player2
In our first implementation, a state was represented as a feature vector that held values for the current player and its opponent. For example, if player1 was the current player, the feature vector would look something like this: [player1_features, player2_features] (the same features for both players). Every time the current player changed, so did the feature vector: [p1_features, p2_features] --changePlayer--> [p2_features, p1_features] --changePlayer--> [p1_features, p2_features], and so on. We thought that this was the best way to realize two sets of weights for the two different players, player1 and player2. Although the implementation was coherent, we decided that a non-changing state vector (i.e. one that does not switch between the player1 and player2 views) would be better and would prevent future mishaps, so we changed our implementation so that the state vector always consists of player1’s features followed by player2’s features.
6.4 Optimizations for the MiniMax Algorithm
We tried to apply some optimizations to the basic MiniMax implementation, in order to enhance performance and allow the algorithm to efficiently scan the decision tree to greater depths. Although these optimizations were not applied in the final training process, we detail the ideas behind them below:
7. High-Level Architecture
7.1 Training Program – Schematic Relations between the Classes
Graph legend (start at the Training class):
- Class instantiates another class.
- Class invokes a method of another class.
- Class contains a private field of another class.
7.2 Agent – Schematic Relations between the Classes
Graph legend (start at the Nemesh class):
- Class instantiates another class.
- Class invokes a method of another class.
- Class contains a private field of another class.
- Class implements an interface.
7.3 Classes Overview
Training – A class whose sole purpose is to initialize the training program by calling the static TrainAgent method of the SarsaLearning class.
SarsaLearning – This class is the heart of the training process. It simulates game training sessions for two TrainPlayers by successive calls to GameSimulation. It also holds all the relevant parameters of the learning algorithm (learning rate and discount rate) and contains a static method for updating the weight vector.
GameSimulation – The GameSimulation class manages all the data required to simulate a game (GameBoard and Players), and is responsible for the flow of the game. Players take turns (starting from player1), and the main game loop keeps running until one of the players has won or the game has ended in a tie. This class also holds the per-player variables needed for updating each player’s weight vector (previous state of the board, old expected return, immediate reward and new expected return). The weight vector updating scheme is integrated throughout the flow of the game for both players (so we can train two players simultaneously).
TrainPlayer – A class that represents a training player. The player can be our learning agent (AGENT), a smart trainer (TRAINER) or a random player (RANDOM). The class holds the weight vector of the player and the MiniMax tree depth (in case it is an agent or a trainer), and contains a chooseMove method that chooses the next move for the player according to its type: a random player chooses a random move; a trainer chooses the best possible move according to the MiniMax algorithm (always exploiting); an agent chooses a move according to an ε-greedy policy.
MiniMax and Best classes – These two classes implement the MiniMax algorithm, which is encapsulated within the static recursive chooseMove method. The method returns the Best data structure, which contains the best move for the calling player and its score.
GameBoard – A class that represents a game board with all its complexities. This class is elaborated in detail in Section 3 of this report.
Vector – A class that represents a vector, along with all of its functionality (initialization, add value to index, write/read vector to/from a file etc.).
FeatureCount – A class that is responsible for counting all the state features, as defined in the GameBoard class. The class is divided into a few static methods, each of which counts a different type of feature (column features, row features etc.).
InitializeStartingWeights – A class that initializes the weights as we saw fit.
Nemesh and GameGUI – The Nemesh class is our agent; it implements the FourInARow interface as defined on the workshop’s website. It reads the weights learned during the training process and chooses its moves according to the MiniMax algorithm. It is accompanied by GUI software that presents the logical board in a much more convenient and user-friendly way.
8. Infrastructure
8.1 Watching a Full Simulated Training Game
Our main concern was to debug the training process, so we added an extra argument for the TrainAgent method that allows us to watch a full simulated game with printouts to the console. There is no need to watch more than one game per session, since the games are quite long and representative.
8.2 Testing the FeatureCount Class
In order to debug the counts of the features, we implemented a few static functions in the GameBoard class: GenerateRandomBoard, GenerateRandomSizes and PrintBoard. The purpose of these functions is to generate a random (yet legitimate) board and print it neatly to the console, so that we can verify the feature counts.
8.3 MiniMax Debugging
Due to its recursive nature, this was the hardest class to debug. In order to debug it, we stepped through with the debugger and traced the tree ourselves. We also tried inserting printouts inside the function and debugging at small depths, but stepping through with the debugger proved quicker and more useful.
8.4 General Testing
We used some more general testing methods in order to check our agent implementation and performance (for example, feature counts etc.). This was done in two main ways:
References
Appendix A
Player1 Weights:
Player2 Weights: