Solving UVa 11340 in Java

[Image: a 256-byte ISO-8859-1 file viewed in a hex editor]

UVa 11340: Newspaper, despite being ranked at only Level 2 difficulty on uHunt, turned out to be rather tricky. Apparently others thought so too, judging by the 11 pages (151+ posts) of discussion on the UVa OJ message board. Many of the message board posts focus on the characters used in the test input. The consensus is that they are 8-bit characters (with values from 0 to 255). At a minimum, these characters need to be stored in an unsigned char data type in C++. Nevertheless, people seemed to run into problems even after getting that hint. I solved the problem in Java, which has no unsigned char data type and presents its own set of difficulties. I’ll cover the two issues that I ran into when solving this problem.

Character Sets and Encoding

Character sets and encoding can be a complicated topic for programmers. Over 10 years ago, Joel Spolsky wrote a blog post entitled “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)” to try to clear up some of the confusion. To solve UVa 11340, you don’t even need to know everything in that post, since the problem doesn’t use Unicode. But you do need to know how to handle character values above decimal 127 (hex 7F).

Character sets and encoding are confusing partly because of the extensive terminology used when discussing these concepts. For the purpose of UVa 11340, we really only need a few terms:

  • A character encoding is a particular convention for mapping bytes to characters (e.g., letters, numbers, and symbols).
  • ASCII is a character encoding that uses only 7 bits per character. This means it can encode a total of 128 characters, since $2^7=128$. In base 10 terms, the integers 0 through 127 (inclusive) are used to encode these 128 characters.
  • Extended ASCII refers to a type of character encoding that maps 0 through 127 to the standard ASCII characters, and then uses one or more additional bits to encode characters beyond 127. The term “Extended ASCII” doesn’t refer to a particular encoding that defines what the additional characters are. Therefore, this term should be used with care. In particular, it should not be used unless accompanied by the name of a specific character encoding.
  • ISO 8859-1, which is also known as ISO Latin 1, is the most popular 8-bit extended ASCII character encoding, and one of the encodings that Java is required to support on all platforms.
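A property of ISO 8859-1 that makes it especially convenient here: each of the 256 byte values decodes to the Unicode character with the same numeric value. A quick sketch to verify this (class name is my own):

```java
import java.nio.charset.StandardCharsets;

public class Latin1RoundTrip {
    public static void main(String[] args) {
        byte[] bytes = new byte[256];
        for (int i = 0; i < 256; i++) bytes[i] = (byte) i;

        // Decode the bytes as ISO 8859-1; each byte should map to the
        // char with the same numeric value (00 -> U+0000, FF -> U+00FF).
        String s = new String(bytes, StandardCharsets.ISO_8859_1);
        boolean identity = true;
        for (int i = 0; i < 256; i++)
            if (s.charAt(i) != (char) i) identity = false;
        System.out.println(identity); // prints true
    }
}
```

This identity mapping is why ISO 8859-1 is a natural choice when you care about byte values rather than glyphs.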

UVa 11340 could probably be solved using any 8-bit character encoding. This is because the problem involves mapping characters to monetary values, not displaying the characters for anyone to read. As long as we use the same encoding to interpret the character value table section of the input as we do to interpret the article section of the input, we should be fine. Using an 8-bit encoding is important because the problem authors seem to have used characters beyond 127 in the source text. Therefore, using standard ASCII would result in ignoring one bit per character, which would lead to incorrect results.
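To ground the discussion, here is a minimal sketch of the accounting step at the heart of the problem: a 256-entry value table indexed by character. The class, method, and variable names are my own, not from the problem statement:

```java
public class ArticlePricing {
    // Hypothetical helper: total the value of an article, given a
    // 256-entry table of per-character values (stored as integers).
    static long priceArticle(int[] valueByChar, String article) {
        long total = 0;
        for (int i = 0; i < article.length(); i++) {
            // Mask to 8 bits so characters above 127 index the table
            // correctly, whatever chars the decoder handed us.
            total += valueByChar[article.charAt(i) & 0xFF];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] values = new int[256];
        values['a'] = 3;
        System.out.println(priceArticle(values, "aaa")); // prints 9
    }
}
```

The `& 0xFF` mask is the Java analogue of C++’s unsigned char: it guarantees an index in 0–255 no matter what value the character carries.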

Java Charset (Writing)

To properly test a UVa 11340 solution, we need to generate input text that contains characters beyond position 127. The simplest way to do this would appear to be:

for (int i=0; i<=255; i++) System.out.println((char)i);

On my system, this works for some extended characters. However, characters 128 through 159 appear as question mark characters. And they aren’t just being displayed as ?. If I redirect the output to a file and open it in a hex editor, each byte in those positions is stored as 3F, which is the ASCII question mark (a common replacement character). So this simple approach doesn’t work.
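A way to see why, independent of the console: characters 128 through 159 are the Unicode C1 control characters, and most of them have no mapping in windows-1252 (byte 0x80, for instance, is the euro sign U+20AC, not U+0080). When the encoder meets an unmappable character, `String.getBytes` substitutes the charset’s replacement byte, which is '?'. A small demonstration, assuming windows-1252 is the encoding in play:

```java
public class ReplacementDemo {
    public static void main(String[] args) throws Exception {
        // U+0080 has no mapping in windows-1252 (byte 0x80 decodes to
        // the euro sign, U+20AC), so the encoder substitutes '?' (0x3F).
        byte[] b = "\u0080".getBytes("windows-1252");
        System.out.println(Integer.toHexString(b[0] & 0xFF)); // prints 3f
    }
}
```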

Let’s see what facilities Java has that can help. There is a Charset class that implements methods for character encoding/decoding. To find the default charset for your system, import java.nio.charset.Charset and then run this code:

System.out.println(Charset.defaultCharset());

On my system, this returns:

windows-1252

I’d like to try using ISO 8859-1 instead of windows-1252 to print my extended ASCII file. I can do this as follows:

PrintStream out = null;
try {
    out = new PrintStream(System.out, true, "ISO-8859-1");
} catch (UnsupportedEncodingException uee) {
    // Won't happen: ISO 8859-1 is one of the standard charsets that
    // every Java platform is required to support.
}
for (int i=0; i<=255; i++) out.print((char)i);

If I redirect the output of this program to a file, I get what I’m looking for: a file containing exactly 256 bytes, encoding the numbers 0 to 255 (00 to FF) in consecutive order. The image at the top of this post shows what this file looks like in a hex editor.
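As an aside, shell redirection isn’t required: the same 256 bytes can be written straight to a file, still through an explicit ISO 8859-1 encoder. A sketch (the file name is my choice), using try-with-resources to close the stream:

```java
import java.io.FileOutputStream;
import java.io.PrintStream;

public class WriteAllBytes {
    public static void main(String[] args) throws Exception {
        // Write one of each character, 00 through FF, to a file.
        // The explicit ISO-8859-1 encoding means each char value i
        // is stored as the single byte i.
        try (PrintStream out = new PrintStream(
                new FileOutputStream("all-bytes.txt"), true, "ISO-8859-1")) {
            for (int i = 0; i <= 255; i++) out.print((char) i);
        }
    }
}
```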

Java Charset (Reading)

Now that we have an input file with one of each character, let’s try to read it. To verify that it’s being read properly, I’ll print the decimal value of each character that I read. Once again, here’s the simple approach first:

Scanner stdin = new Scanner(System.in);
while (stdin.hasNextLine()) {
    String line = stdin.nextLine();
    for (int i=0; i<line.length(); i++)
        System.out.println((int)line.charAt(i));
}

The results:

  • The 10 (line feed) and 13 (carriage return) characters are missing. This is expected, since Scanner.nextLine() doesn’t include these characters in the string that it returns.
  • Characters 128 through 159 are all interpreted as values above 255, including numbers of the form 3xx, 4xx, 7xx, 82xx, 83xx, 84xx, and 65533 (which is U+FFFD, the Unicode replacement character).
  • All other characters are returned with their correct values.

Remember that the source file was written using encoding ISO 8859-1. I then read it back using the default system encoding, windows-1252. Apparently they aren’t compatible when it comes to characters 128 through 159. To fix that, we can pass an encoding string to the Scanner constructor to use the same encoding for reading as we did for writing:

Scanner stdin = new Scanner(System.in, "ISO-8859-1");

With that fix, the results are almost perfect. 10 and 13 are missing as expected, but for some reason 133 is as well. It turns out that 133 (hex 85) is also a newline-type character. It corresponds to the Unicode character Next Line (NEL). So it makes sense that Scanner.nextLine() strips it out along with 10 and 13.
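This NEL behavior is easy to demonstrate in isolation: Scanner’s line-separator pattern includes \u0085, so a string containing it scans as two lines:

```java
import java.util.Scanner;

public class NelDemo {
    public static void main(String[] args) {
        // U+0085 (NEL) is one of the line terminators recognized by
        // Scanner, so "ab\u0085cd" scans as two lines: "ab" and "cd".
        Scanner sc = new Scanner("ab\u0085cd");
        int lines = 0;
        while (sc.hasNextLine()) { sc.nextLine(); lines++; }
        System.out.println(lines); // prints 2
    }
}
```

For what it’s worth, BufferedReader.readLine() recognizes only \n, \r, and \r\n as line terminators, so the switch to BufferedReader later in this post also changes how a stray 133 would be handled.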

We could of course read the input file as binary, which would allow us to extract the exact numerical values without any encoding concerns. But this doesn’t really make sense in the context of UVa problem statements, which describe inputs in terms of lines (e.g., “the next M lines contain an article.”) So for now we’ll assume that the problem author has not included any 133 characters in the secret tests.
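If you did want the binary route, the idea looks like this: InputStream.read() returns each byte as an int from 0 to 255 (or -1 at end of stream), with no charset anywhere in the loop. A self-contained sketch using an in-memory stream:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

public class BinaryRead {
    public static void main(String[] args) throws Exception {
        // Reading raw bytes sidesteps charset decoding entirely:
        // every value 0-255 comes back exactly as written, including
        // 0x85 (NEL) and the line-ending bytes.
        byte[] data = { (byte) 0x85, (byte) 0xFF, 0x0A };
        InputStream in = new ByteArrayInputStream(data);
        int b;
        StringBuilder sb = new StringBuilder();
        while ((b = in.read()) != -1) sb.append(b).append(' ');
        System.out.println(sb.toString().trim()); // prints 133 255 10
    }
}
```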

File Reading Performance

Just when you thought this deceptively simple problem had enough tricks, it throws in one more, at least for Java programmers. It’s actually made clear in the problem statement:

Be aware of a large input size, the whole input file is about 7MB.

The large file size (at least in the context of a UVa problem) combined with the one-second time limit for this problem means that the standard Scanner.nextLine approach won’t work. Even with a reasonably efficient algorithm, you’ll hit the time limit. Fortunately, there’s a standard Java solution. But before we get to that, let’s do some benchmarking.

I created a text file consisting of 1,000,000 lines of 100 random non-newline characters each (102 million bytes total including the CR and LF at the end of each line), encoded using ISO-8859-1. Then I looped through the file using Scanner.nextLine, read each line into a string, and calculated a cumulative total of the numeric values of each character. At the end of each run, I printed the total, which acts as a simple hash of the file contents (for use in the next test). Over 10 trials on my local machine, the median runtime was about 4.05 seconds.
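For reference, here is a scaled-down sketch of that benchmark loop, using a small in-memory stream instead of the 102 MB file so it runs anywhere; the structure of the loop is the same:

```java
import java.io.ByteArrayInputStream;
import java.util.Scanner;

public class ScanBenchSketch {
    public static void main(String[] args) throws Exception {
        // Build a small stand-in for the benchmark file: 1,000 lines
        // of four 'a' characters, CRLF-terminated, ISO-8859-1 encoded.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) sb.append("aaaa\r\n");
        byte[] data = sb.toString().getBytes("ISO-8859-1");

        long start = System.nanoTime();
        Scanner in = new Scanner(new ByteArrayInputStream(data), "ISO-8859-1");
        long total = 0;
        while (in.hasNextLine()) {
            String line = in.nextLine();
            for (int i = 0; i < line.length(); i++) total += line.charAt(i);
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // The total acts as a simple hash of the file contents.
        System.out.println(total + " in " + elapsedMs + " ms");
    }
}
```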

For comparison, I ran a test using BufferedReader instead of Scanner to read the file. For this simple scenario, the program logic was essentially the same, but the syntax is different. The examples below show the differences:

Constructors:

Scanner stdin = new Scanner(System.in, "ISO-8859-1");
BufferedReader stdin = new BufferedReader(
    new InputStreamReader(System.in, "ISO-8859-1"));

Loop while another line is available:

while (stdin.hasNextLine()) // Scanner
while (stdin.ready())       // BufferedReader

Read the next line into a string:

String line = stdin.nextLine(); // Scanner
String line = stdin.readLine(); // BufferedReader

The BufferedReader version also requires catching a couple more exceptions.

The results for the BufferedReader approach: for 10 trials, it had a median runtime of about 0.77 seconds. This is considerably faster than the Scanner approach, and easily fast enough for the 7MB input file used by UVa 11340.
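Putting the fragments together, here is a minimal end-to-end sketch of the BufferedReader approach. It reads from an in-memory stream so the example is self-contained; for the real problem, substitute System.in. Note that I use the `readLine() != null` idiom rather than `ready()`: readLine is specified to return null at end of stream, which is the more robust end-of-input test.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;

public class BufferedSum {
    public static void main(String[] args) throws Exception {
        byte[] data = "ab\r\ncd\r\n".getBytes("ISO-8859-1");
        BufferedReader in = new BufferedReader(
            new InputStreamReader(new ByteArrayInputStream(data), "ISO-8859-1"));
        long total = 0;
        String line;
        // readLine() returns null at end of stream.
        while ((line = in.readLine()) != null)
            for (int i = 0; i < line.length(); i++) total += line.charAt(i);
        System.out.println(total); // prints 394 (97+98+99+100)
    }
}
```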

Thoughts on UVa 11340

UVa 11340 was intended to be easy: the problem statement on UVa OJ indicates that it appeared in “Huge Easy Contest #1.” Nevertheless, the problem setters decided to mix things up a bit by adding features that would cause problems for contestants who used language features in only the most basic way: not specifying an encoding when reading a text file, and not using a buffered reader. This is another tiny bit of experience that could be useful in future problems.

(Image credit: UltraEdit screenshot)