Programming puzzles are often designed in such a way that getting your solution to complete under the time limit is at least as challenging as getting the correct result. To create such a challenge, a problem setter can adjust the size of the input until sub-optimal solutions no longer run in under the time limit.

Many puzzles can be solved using both a slower but easier approach and a more sophisticated method that runs within the required time limit. Among the uHunt starred problems, 732: Anagrams by Stack is one in which people have found performance particularly challenging, at least based on the forum threads. As with previous anagram problems, and performance-sensitive problems in general, getting Java to perform at the required level can be especially difficult. While some online judges provide wiggle room for languages with more overhead than C/C++, UVa Online Judge doesn’t seem to use that policy.

## Yet Another Anagram Problem

732: Anagrams by Stack is conceptually simple: given two strings, specify a sequence of stack operations (character pushes and pops) that produce the second string from the first string. My first thought when I encounter this type of problem is to ask whether I can just try all possible sequences, and keep the ones that work. This approach will produce the correct solution for UVa 732, but it isn’t fast enough for the input size used by the online judge. On uHunt, a problem that is higher than Level 2 (UVa 732 is rated at Level 4) will usually not be solvable by a simple brute force approach. Nevertheless, I will often implement that approach anyway as a way of understanding the problem better. This is a reasonable strategy for developers who are getting started with competitive programming.

## Pseudocode

For this problem, the fast solution still searches many sequences of stack operations and simulates them, but it requires a few key insights that make it run efficiently. First let’s look at some pseudocode for the simple and slow solution. This code runs for each test case.

```
read the input and required output strings
print "["
for each possible sequence of pushes and pops
    for each push or pop in the sequence
        if push
            push the current letter onto a stack
            advance to the next letter
            append "i " to the output sequence
        else
            if stack is empty, clear output string and exit
            pop a letter from the stack
            append it to the output string
            append "o " to the output sequence
    if the stack is empty and the output string matches
        print the output sequence
print "]"
```

The nested loops in this solution provide a hint that performance will be a concern. The outer loop runs through the possible sequences of stack operations. A sequence for a string of length $n$ contains $2n$ operations ($n$ pushes and $n$ pops), and a naive enumeration considers both operations at every position, giving $2^{2n}$ candidate sequences. And that’s just the outer loop! The inner loop then performs up to $2n$ more operations to simulate each sequence. UVa 732 is a very old problem (published 2000-08-31) that doesn’t specify any input size information, but it’s safe to say that this algorithm will be slow.
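To make the pseudocode concrete, here is a minimal Java sketch of the brute-force search. It is an illustration under my own naming choices (`BruteForce`, `validSequences`, `simulate` are not from any official solution), not the original implementation:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class BruteForce {
    // Enumerate every bitmask of length 2n: bit = 1 means push ("i"),
    // bit = 0 means pop ("o"). This is the slow O(4^n) outer loop.
    static List<String> validSequences(String source, String target) {
        int n = source.length();
        List<String> results = new ArrayList<>();
        for (long mask = 0; mask < (1L << (2 * n)); mask++) {
            String seq = simulate(mask, n, source, target);
            if (seq != null) {
                results.add(seq);
            }
        }
        return results;
    }

    // Simulate one operation sequence; return it as "i o ..." if it
    // transforms source into target, or null if it fails.
    static String simulate(long mask, int n, String source, String target) {
        Deque<Character> stack = new ArrayDeque<>();
        StringBuilder ops = new StringBuilder();
        int in = 0;
        int out = 0;
        for (int step = 0; step < 2 * n; step++) {
            if (((mask >> step) & 1) == 1) {
                if (in == n) return null;         // no input left to push
                stack.push(source.charAt(in++));
                ops.append("i ");
            } else {
                if (stack.isEmpty()) return null; // illegal pop
                if (stack.pop() != target.charAt(out++)) return null;
                ops.append("o ");
            }
        }
        return ops.toString().trim();
    }
}
```

For the pair ("ab", "ab"), only the sequence "i o i o" survives the simulation, which matches what you would work out by hand.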

## Optimizations

There’s no obvious way to test a sequence of stack operations without actually doing the pushes and pops. So at some point we’ll have to pay the cost of the $2n$ operations from the inner loop. The real question is how small we can get the list of sequences that we need to test. There are a number of optimizations we can use to reduce this number. Here are some ideas:

- Exit immediately if the input and output strings are different lengths, or are not anagrams of each other. We’ll never be able to produce a valid sequence in these cases.
- Don’t generate any sequences that start with a pop, since that will always result in popping an empty stack, which is illegal. This immediately halves the number of sequences to generate and check.
- Only simulate sequences that have exactly $n$ pushes and $n$ pops, since that’s the only way to use all of the input characters and end with an empty stack.
- When generating the sequences, keep track of the number of o’s and i’s. If the number of o’s ever exceeds the number of i’s, discard that sequence, since it will result in trying to pop an empty stack.

Another category of optimization involves doing work once and caching it, rather than calculating it each time. In other words, trade memory for CPU time. For this algorithm, the appropriate data to cache is the list of good sequences. When you find a sequence that meets the criteria listed above, save it. Since every string of length $n$ uses the same list of sequences, this means you only have to generate sequences once for each $n$. This will save a significant amount of time for a dataset with many test cases and/or with long strings.
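One way to realize this cache is a map keyed by string length, populated lazily. This is a sketch under my own assumptions (a `boolean[]` per sequence, true = push; the class name `SequenceCache` is illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SequenceCache {
    private final Map<Integer, List<boolean[]>> cache = new HashMap<>();

    // Return all structurally valid sequences for strings of length n,
    // generating them only on the first request for that n.
    List<boolean[]> sequencesFor(int n) {
        return cache.computeIfAbsent(n, SequenceCache::generate);
    }

    private static List<boolean[]> generate(int n) {
        List<boolean[]> result = new ArrayList<>();
        for (long mask = 0; mask < (1L << (2 * n)); mask++) {
            boolean[] seq = new boolean[2 * n];
            int pushes = 0;
            int pops = 0;
            boolean ok = true;
            for (int step = 0; step < 2 * n && ok; step++) {
                seq[step] = ((mask >> step) & 1) == 1;
                if (seq[step]) pushes++; else pops++;
                if (pops > pushes) ok = false; // would pop an empty stack
            }
            if (ok && pushes == n) result.add(seq);
        }
        return result;
    }
}
```

A pleasant side effect of filtering before caching is that the cached lists are small: for n = 2 only two sequences survive ("i i o o" and "i o i o"), and in general the count is the nth Catalan number rather than $2^{2n}$.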

## Data Structures

The final category of optimizations has to do with adjusting the data structures that you use in various parts of the algorithm. To extract the most performance out of the algorithm, you’ll want to use arrays when possible. For example, you’ll need a data structure to store a list of sequences for each $n$ value, and you’ll have to store the sequences themselves. One possible choice in Java is a `HashMap<Integer, BitSet>`. A `BitSet` implements methods to generate and store sequences of bits. The sequences of 1’s and 0’s then naturally map to sequences of i’s and o’s. The `Integer` in this case is just the base-10 number whose binary representation is stored in the `BitSet`. If you create a `BitSet` for each natural number through $2^{2n}-1$, you’ll have all binary strings of length $2n$, which is what you need to represent $n$ i’s and $n$ o’s (plus a lot of invalid sequences, like $2n$ o’s).
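The number-to-`BitSet` mapping can be illustrated with a short sketch; the helper name `toOps` is mine, and I read the low-order bit as the first operation (the opposite convention would work just as well):

```java
import java.util.BitSet;

public class BitSetSequences {
    // Map a natural number to a 2n-operation sequence: a set bit is a
    // push ("i"), a clear bit is a pop ("o"), low-order bit first.
    static String toOps(long value, int n) {
        BitSet bits = BitSet.valueOf(new long[] { value });
        StringBuilder sb = new StringBuilder();
        for (int step = 0; step < 2 * n; step++) {
            if (step > 0) sb.append(' ');
            sb.append(bits.get(step) ? "i" : "o");
        }
        return sb.toString();
    }
}
```

Under this convention, 5 (binary 0101) becomes "i o i o" for n = 2, and 3 (binary 0011) becomes "i i o o".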

While the Java `BitSet` is an efficient data structure, and a `HashMap` provides constant-time storage and lookup, they’ll never be as fast as a plain `boolean[][]` lookup. But once you start replacing convenient data structures with arrays, you’ll need to ask yourself why you’re not just implementing your solution in C. In my experience, it’s not a good idea to take this approach with optimization. If the most efficient Java data structures are not fast enough, you probably need to re-evaluate the efficiency of your algorithm. It remains to be seen whether this is true for all uHunt starred problems. Check back with me in a few years.

## An Even Faster Algorithm

There is a fundamental performance problem with the designs considered so far. The problem is that they rely on generating and evaluating complete sequences of $n$ i’s and $n$ o’s. The design can be optimized in various ways. We can avoid generating sequences that start with a pop, or contain more than $n$ pops. We can discard a sequence as soon as the number of pops exceeds the number of pushes. But even after these optimizations, there are still too many sequences to evaluate. To finish under the time limit, we need an algorithm that can discard even more sequences without evaluating them individually. The key idea is *sequence prefixes*.

One of the first optimizations listed above is to discard sequences that start with o, since we can’t pop an empty stack. In a single step, this optimization halves the search space. What if we could do this with more sequence prefixes? For example, any sequence that starts with “i o o” is invalid, since it tries to pop an empty stack in the third step. And for a given string, when we generate a sequence that pops a character that doesn’t match the current output character, we can also discard all other sequences with that prefix. For example, consider the first sample input provided for this problem:

- input: “madam”
- output: “adamm”

No sequence that starts with “i o” can be a solution for this string pair, because it produces an output string that starts with *m*. Therefore, we can discard all sequences with that prefix. This cuts down the search space very quickly.

The natural way to prune the list of sequences by prefix is to think of them in the form of a binary tree. The root of the tree is i, since we always start with a push. Every node has two children, i and o. We don’t have to build the tree in advance. Instead, we want to find all paths through a kind of virtual tree, which we stop building as soon as we encounter an incorrect or invalid prefix.

Consider this pseudocode for a recursive method called buildPath. $n$ is the length of the input and output strings:

```
buildPath
    parameters: (
        the current path (a list of i's and o's)
        the current stack
        next operation (left or right child node)
        current input position
        current output position
        number of pushes in the current path
        number of pops in the current path
    )
    add nextOperation to the current path
    if nextOperation is push
        increment numPushes
        push the current input character on the stack
    else
        if stack is empty, exit (abandon this path prefix)
        increment numPops
        pop a character off the stack
        if it does not match the current output character, exit
    if the current path length is 2*n, print it
    else
        if numPushes < n
            recursively call buildPath with nextOperation=push
        if numPops < n
            recursively call buildPath with nextOperation=pop
```
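Here is one way this recursion might look in Java. As a simplification, instead of giving each call its own copy of the path and stack, this sketch undoes each operation after the recursive calls return (backtracking); the class name `PrefixSearch` and the exact shape of the pruning conditions are my own choices, not the original implementation:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class PrefixSearch {
    private final String source;
    private final String target;
    private final int n;
    private final Deque<Character> stack = new ArrayDeque<>();
    private final StringBuilder path = new StringBuilder();
    private final List<String> solutions = new ArrayList<>();
    private int inPos = 0, outPos = 0, pushes = 0, pops = 0;

    PrefixSearch(String source, String target) {
        this.source = source;
        this.target = target;
        this.n = source.length();
    }

    List<String> solve() {
        buildPath(true); // the root of the tree is always a push
        return solutions;
    }

    // Apply one operation, recurse while the prefix is still viable,
    // then undo the operation before returning to the caller.
    private void buildPath(boolean push) {
        if (push) {
            path.append("i ");
            pushes++;
            stack.push(source.charAt(inPos++));
        } else {
            char c = stack.pop();
            if (c != target.charAt(outPos)) {
                stack.push(c); // wrong letter: prune this whole prefix
                return;
            }
            path.append("o ");
            pops++;
            outPos++;
        }
        if (pushes + pops == 2 * n) {
            solutions.add(path.toString().trim());
        } else {
            if (pushes < n) buildPath(true);
            if (pops < n && !stack.isEmpty()) buildPath(false);
        }
        // backtrack: undo this operation
        path.setLength(path.length() - 2);
        if (push) {
            stack.pop();
            inPos--;
            pushes--;
        } else {
            pops--;
            outPos--;
            stack.push(target.charAt(outPos));
        }
    }
}
```

For ("ab", "ab") this finds the single solution "i o i o", and for ("ab", "ba") it finds "i i o o"; the "i i" branch of the first pair is abandoned as soon as the popped *b* fails to match the required *a*.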

There are a few implementation details to consider, mostly about ensuring that each recursive call gets its own copy of the current path and the current stack. In Java, I found that a `LinkedList` of `Boolean` worked well to keep track of the path.

## Performance

In practice, the recursive algorithm shown above is very fast, because it quickly eliminates most of the possible sequences. To compare it with the other implementations considered in this post, I created a ~92KB input file containing about 5000 word/anagram pairs generated by my implementation of UVa 195. I used that file as input to three different implementations of this “Anagrams by Stack” puzzle, each instrumented with `System.nanoTime` calls to track total elapsed time.

### Baseline

Implementation #1 uses the strategy of generating `BitSet`s to represent sequences of i’s and o’s. However, it doesn’t use a completely brute-force approach: it only generates bit strings that start with 1 (push), and it discards any generated bit strings that don’t have an equal number of 1’s and 0’s. The `BitSet.cardinality` method can be used to conveniently calculate the number of 1’s in a bit string. Besides those two optimizations, this implementation only considers performance as a side effect: it breaks out of its inner loop when it would otherwise have to pop an empty stack or run off the end of the input string, since there isn’t much else to do in these cases.
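The cardinality filter might look like this (the class and method names are illustrative, not from the actual implementation):

```java
import java.util.BitSet;

public class CardinalityCheck {
    // A 2n-bit candidate is worth simulating only if it has exactly n
    // set bits (pushes); BitSet.cardinality() counts the set bits.
    static boolean balanced(BitSet bits, int n) {
        return bits.cardinality() == n;
    }
}
```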

In my tests, the median runtime of multiple runs of implementation #1 on my local machine was 70.9 seconds. It turns out that this is rather slow for my sample dataset.

### Optimized

Implementation #2 is as optimized as reasonably possible without changing the underlying algorithm. In addition to the optimizations from implementation #1, it also uses the remaining optimizations suggested at the beginning of this post: check if the input and output are anagrams of each other; don’t use a sequence unless it has an equal number of pushes and pops; break when the number of pops exceeds the number of pushes; cache sequences once they are created; and use arrays when possible.

Implementation #2 had a median runtime of 0.994 seconds, a 98.6% improvement over the baseline. Clearly the optimizations made a big difference. However, on the UVa hardware and dataset, this implementation ran in over three seconds, and was therefore not accepted. To get under the time limit, a more sophisticated approach is required, at least for a Java implementation.

### Recursive

Implementation #3 is the recursive implementation that uses paths through a binary tree to generate sequences, and prunes parts of the tree that produce undesired sequence prefixes.

The median runtime for implementation #3 was 0.100 seconds, an 89.9% improvement over the optimized approach, and a 99.9% improvement over the baseline. On UVa, this implementation easily passed with a runtime of 1.477 seconds, well under the three-second limit. This is despite using a Java `LinkedList<Boolean>` and `Stack<Character>`, and no arrays.

## Performance Optimization

This UVa 732 example shows that despite the gains that can be accrued by making targeted optimizations to an algorithm, a new algorithm that uses a fundamentally more efficient approach can provide much better results. This is not a surprising statement, but when working on a programming puzzle solution, it can be tempting to try to squeeze extra performance out of an implementation that you have invested time in. Often it makes more sense to go back to the planning stage and try something new.

(Image credit: LadyDragonflyCC)