Saturday, June 06, 2009

Delayed N-Armed Bandit

I have been trying to more thoroughly understand concepts related to reinforcement learning; specifically temporal difference learning. I reread portions of the book by Barto and Sutton. One of the examples given in the book as a toy problem is that of the n-armed bandit. An interesting variant of this problem occurred to me – What if the rewards are not dispensed immediately, but rather after some fixed (or perhaps random) delay?

slot machines

Imagine walking into a casino and being given one hour to play a large group of slot machines. However, unlike regular slot machines, these ones will produce a fixed amount of coins per pull. The trick is to figure out which slot machine gives you the most reward, and then exploit it. Sounds simple, right? There are two caveats: First, the machines will only give you the coins after some unknown delay. Second, when you receive the coins, you will not know the identity of the machine that produced the reward.

I wrote a fairly simple program in java to do this experiment by making use of eligibility traces, and found that it worked far better than I had expected was possible. Given a large number of possible actions (say n = 30), the system was able to properly assign credit to actions such that for all actions in A, if the real reward R(An) < R(Am) then the expectation of reward E(An) < E(Am) as well - so the actions would be ranked correctly in terms of value. This ranking could be learned perfectly, even when the delay between action and reward was 100 time-steps or more. Granted, a large number of observations are required (in the millions), but it is still impresses me that such a thing can be done at all.

In order to understand how and why this worked, I attempted to characterize the result mathematically, and was able to do so in a way that very closely matched the empirical results:

Delayed N-Armed Bandit Eligibility Equation

Where:

r := eligibility trace decay rate (such that 0 < r < 1)
d := (fixed) delay between causal action and reward
E[Vall] := Expectation of reward averaged over all actions
E[Vcause] := Expectation of reward due to causal action

This rather simple equation appears to nearly perfectly describe the observed results in the limit of a great number of observations.

The main technique used is treating the “volume” of eligibility as the sum of an infinite geometric series, which is where the 1 / (1 – r) terms come from. Basically, we are assuming a randomized action policy, and therefore, an average over all possible rewards weighted by their relative likelihoods. These rewards will be (incorrectly) applied to our “causal” action, weighted by the eligibility. The one difference is the time-step at which the “actual” reward is emitted; this will be weighted by the eligibility at that delay, which is represented by the rd terms.


Additionally, I tried some variations on the basic experiment.

Successful circumstances:

  • Various delays, uniform across bandits (up to about 500 time-steps)

  • Online randomized delays (varies from observation to observation)

  • Heterogeneous delays across bandits

  • Skewed action frequencies (some actions taken far more frequently than others)

  • Online Gaussian noise added to rewards (up to variance of 10)

  • Various numbers of bandits (up to about 30)

  • Non-random sequences of action (e.g. 1, 2, …, n–1, n, 1, 2, …)


Unsuccessful circumstances:

  • Replacing traces

Wednesday, December 24, 2008

Evolving Mona Lisa

Basic idea: I evolved a feed-forward neural network that takes (x, y) pixel coordinates (each dimension normalized to fall within [0, 1]) as inputs, and outputs a scalar value in the range of [0, 1], representing the grayscale value of that pixel. For example, the image size used during evolution of "Mona" was 30 by 34 pixels, meaning that the neural network needed to be iterated 1020 times to generate a single image; each iteration of the neural network generates only a single pixel. Of course, the neural network will smoothly generalize over any input coordinates in the range [0, 1], therefore it is straightforward to have it generate larger images (as depicted below) simply by partitioning the 2D plane using finer granularity. It is evolved on the smaller images for sake of efficiency.

The results are by no means perfect, but still fun to watch.

Original Version MonaEvolved Version Mona
Original Version StripesEvolved Version Stripes

Tuesday, July 22, 2008

Visualizing high-dimensional fitness landscapes

I've recently been exploring various ways of improving my intuitive understanding of fitness landscapes. This is a difficult problem, because the parameter spaces I want to be able to think about are often composed of dozens or even hundreds of dimensions. I have found that most authors describing fitness landscapes do so using overly simplified diagrams that somewhat resemble hills or mountain ranges.

In some recent experiments, I've found some nice ways of visualizing the geometry of actual fitness landscapes. I produced the images below by evolving recurrent neural networks to optimize performance on various tasks (e.g. wandering through a virtual maze), and then examining the resulting fitness values of a large number of very nearby points to those evolved solutions. Of course it is impossible to visualize the entire space, because of the high dimensionality; so instead what we are looking at are randomly selected 2D slices through that space. The shading is representative of numeric fitness (dark = low-fitness, light = high-fitness).

Here are a few examples of the results:
















Thursday, April 19, 2007

2-Stage-Developmental Neural Networks

I'm experimenting with a two-stage process; evolving one neural network (we will call this the "genotypic network"), which is then used to produce a second neural network (we will call this the "phenotypic network"), which is in turn tested against the problem domain and evaluated (the fitness of the "genotypic network" is a direct function of the fitness of the "phenotypic network"). Note that the "genotypic network" is far simpler than the "phenotypic network", thus biasing the search to explore simple, patterned architectures.

Result of Direct-To-Phenotype approach:



Result of 2-Stage-Developmental approach:



As can easily be seen, the "phenotypic network" produced by the 2-Stage-Developmental approach has a far more patterned appearance to it, whereas with the Direct-To-Phenotype approach (as is standardly done in neuro-evolution), the results look completely random and disordered.

I have found that there are advantages and disadvantages to both approaches. In tasks that possess very simplistic and global symmetries (such as N-bit Parity task), the 2-Stage-Developmental approach appears to be far superior, outperforming the Direct-To-Phenotype approach by perhaps an order of magnitude in terms of speed of finding a solution. However, on other problems (e.g. Embedded Reber Grammar task), the 2-Stage-Developmental approach proves to be over-constrained, and is therefore unable to fully solve the problem in a reasonable period of time.

My next experiments will involve overcoming some of the limitations of the 2-Stage-Developmental approach. The hope is that techniques like this could be used to scale up to solving problems of much greater difficulty.

Saturday, March 31, 2007

Artificial Ants - Santa Fe Trail

I was just reading a bit about Genetic Programming and decided to test my program out against a standard benchmark - The Santa Fe Trail:


The Santa Fe Trail is a 32-by-32 Torroidal grid of 1024 squares. 89 of these squares possess "food". An "ant" is placed at the upper left corner of the grid and can explore around using the following actions:
1) Turn Left
2) Turn Right
3) Move Forward.

One action is taken at each time-step. Moving onto a square with food will cause the ant to consume it automatically, thus adding to fitness (after this the square will be empty). The ant can only "see" the square ahead of itself in the direction it is facing. Other than that it must rely on contextual memory to guide its wanderings. There is a time-limit of 400 time-steps. The fitness is the number of food items consumed during the time-limit. There are several large gaps in the path, adding considerably to the challenge.

It took about 2 hours to program up (half the time spent manually entering in the coordinate information for the trail). Rather than using GP, I am using a combination of evolutionary algorithms applied to Recurrent Neural Networks.

First time running it the program got stuck at about 85% of the food eaten. After about 5 minutes I tried it again, but from a different random seed. This time it came up with a perfect solution (all 89 food squares visited) in about 1 minute, and in the process evaluated about 10k individuals (this is similar to the results reported by Koza with Genetic Programming). I then tried again with another random seed, and it was able to discover a perfect solution in roughly 5 seconds, and only required evaluating roughly 700 individuals.

I was happy to see that my methods faired well on this task.

I think that this is an interesting problem to work with because it is very easy to analyze and I already know it can find a perfect solution rather easily and quickly, thus giving me a fast feedback loop.

Tuesday, March 27, 2007

Really sensitive learning algorithm

HILLCLIMBING

I just finished rereading a book called Stochastic Local Search, and was wondering if I could apply some of the techniques described therein to the analysis of fitness landscapes produced by neuro-evolution. As a first experiment, I decided to implement a very simple hill-climber. I defined the "neighborhood" as the set of vectors produced by transforming the current solution by a fixed amount along a single dimension at a time.

Therefore, a solution of {0.3, -2.4, 2.1} would have the following neighborhood (assuming a step-size of 0.1):

x += 0.1 -> {0.4, -2.4, 2.1}
x -= 0.1 -> {0.2, -2.4, 2.1}
y += 0.1 -> {0.3, -2.3, 2.1}
y -= 0.1 -> {0.3, -2.5, 2.1}
z += 0.1 -> {0.3, -2.4, 2.2}
z -= 0.1 -> {0.3, -2.4, 2.0}

For N parameters, the size of the neighborhood would be 2N.

So, as a first test, I wanted to see how well simple hill-climbing could train a neural network attempting to solve n-bit parity, a standard machine-learning benchmark. I tried 5 bits (this has 32 training exemplars), and it solved it very rapidly. This was a surprise as I was expecting it to get stuck; I merely wanted to see how far it would get before getting stuck.

I eventually got it up to about 94% correct on 12-bit parity. It should be noted that this is an MUCH larger version of the problem (4096 training exemplars), and this is also MUCH larger than what is standardly attempted. Most are 6 or fewer bits. (One interesting larger example - 21 bits - can be found here).

I compared this new method against Evolutionary Programming, and it proved far superior - EP never made it much past 65% accuracy.

NOISE IS BAD

I decided to see how this algorithm would fair in the presence of some noise applied to the fitness function. Incredibly, even the slightest amount of noise totally disrupts it! I have tried applying a gaussian noise with a standard deviation more than 5000 times less than the maximal fitness of the function, and strangely enough just this tiny "whisper" of noise is enough to cripple its performance.

I'm very intrigued, as my intuition has been severely violated twice by these experiments. Clearly more are in order.

Monday, January 22, 2007

miscellaneous stuff

I found a very interesting podcast on iTunes, called Talking Robots talking about both Robotics and AI.

Just read that a Genetic Algorithm has been used to solve a billion-variable problem! The particular algorithm used is something called a Compact Genetic Algorithm (cGA), which uses a statistical model to generate new candidate solutions rather than more standard operators like crossover and mutation. This algorithm falls in the class of Estimation of Distribution Algorithms.

I've been experimenting with an java based open source data mining tool called Weka. I've been using it to try to find patterns in neural network fitness landscapes. As a first experiment, I randomly generated several thousand neural networks and tested them against Exclusive-OR. The value of their parameters and the resulting fitness was used to generate an ARFF file recognizable by Weka. I then analyzed the data using quite a few different tools, including trees, bayesian methods, linear regression, multilayer perceptrons, etc... My conclusion? Looking for patterns in the fitness landscape of a neural net is HARD. None of the dozen or so classifiers I used were able to do any better than chance (most did worse) after performing 10-fold crossvalidation. And this is when dealing with a space of only 9 dimensions, much smaller than anything practical.

I've been reading about the Hierarchical Bayesian Optimization Algorithm (hBOA). The idea behind this is to infer bayesian directed acyclic graphs (DAGs) based on a selected set of successful candidates and use that to generate new candidates (this is also an Estimation of Distribution Algorithm, as described above). By using bayesian DAGs, it is able to sample conditional dependencies, rather than being limited to a univariate model (like the cGA). This allows it to discover "building-blocks" of progressively greater complexity, and it is able to solve some very difficult problems, including Hierarchical If-And-Only-If (HIFF), which is impossible for other evolutionary algorithms. While this is indeed impressive, I am not personally convinced that many real-world problems possess such a structure (however, one example they have found so far is in finding ground-states for Spin Glasses). My strong feeling is that such a technique would not translate well into Neuro-evolution, because small groups of link-weights in a neural network cannot impart fitness without reference to the whole - in other words, I don't think that, in general, building blocks in artificial neural networks even exist, certainly not recurrent ones. [One possible counter-example to this intuition would be Enforced SubPopulations]. I may be proven wrong later, but that's how it looks to me right now.