I’m working on a project this year to build a competitive programming FAQ. This is one in a series of articles describing the research, writing, and tool creation process. To read the whole series, see my CPFAQ category page.
It would be useful to have a page in the FAQ for a glossary of competitive programming terms. The Q&A part of the FAQ and the associated wiki discuss terms in detail, but a glossary provides an easy way to look up short definitions of terms that appear in questions and answers. This week, I started to collect a list of terms.
One approach to building a competitive programming glossary is to look at the words that people use most often when discussing the topic, filter out simple words that everyone already knows, and define the rest. As a source of words that people are using, I used my master list of question titles. Terms appear in the question title, the question body, and the answer body. But using only the question titles reduces the quantity of text to analyze, and a term's appearance in a title is a signal that it's important to the question being analyzed.
So I started with a text file containing my current question list. I wrote a simple console app to read it line by line and extract the words as follows:
- Split the line at each space, and also at each period, question mark, comma, each Unicode character from last week, and quite a few others. This is a quick and dirty way to extract individual words while ignoring non-essential punctuation.
- Convert each word to lowercase and trim leading and trailing whitespace.
- Add each word to a (string, int) dictionary, where string is the word and int is a count of how many times it appears in the question list.
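The steps in this list can be sketched in a few lines of Python. This is only an illustration, not the console app itself: the function name count_words and the delimiter set are assumptions, since the full list of split characters isn't given here.

```python
import re
from collections import Counter

# Assumed delimiter set: whitespace plus some common punctuation.
# The real tool's list of split characters is longer and not reproduced here.
DELIMITERS = re.compile(r'[\s.?,!:;()"/]+')


def count_words(lines):
    """Return a Counter mapping each lowercased word to how often it appears."""
    counts = Counter()
    for line in lines:
        for token in DELIMITERS.split(line):
            word = token.strip().lower()
            if word:  # adjacent delimiters produce empty strings; skip them
                counts[word] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical question titles, standing in for the real question list.
    sample = [
        "What is competitive programming?",
        "How do programmers prepare for a programming contest?",
    ]
    for word, n in count_words(sample).most_common(3):
        print(word, n)
```

A Counter plays the role of the (string, int) dictionary: a missing key starts at zero, so there is no need to check whether a word has been seen before incrementing its count.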
I then made a pass through the dictionary, and merged singular and plural forms of words, using a simplistic process: for each word in the dictionary, append an s and check if that new word also appears in the dictionary. This algorithm is far from perfect, but it helps consolidate words that refer to the same thing (e.g., programmer and programmers).
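As a rough sketch of that merge pass in Python: the function name merge_plurals is hypothetical, and summing the two counts into the singular entry is an assumption about what merging means here.

```python
def merge_plurals(counts):
    """Fold each 'word + s' entry into its singular form.

    Assumption: merging means adding the plural's count to the singular
    entry and removing the plural entry.
    """
    merged = dict(counts)
    for word in list(counts):
        plural = word + "s"
        # Check against `merged`, so an entry that was already folded
        # into a shorter word is not processed a second time.
        if word in merged and plural in merged:
            merged[word] += merged.pop(plural)
    return merged


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(merge_plurals({"programmer": 3, "programmers": 2, "contest": 4}))
```

The imperfection is easy to see in a sketch like this. Unrelated pairs such as "a" and "as" get merged, while irregular plurals and words that add "es" are missed.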
The proper way to extract words from text is to use something like the Natural Language Toolkit, but the process I used is easy to implement and good enough to give me a list of candidate words for the glossary.
The purpose of the glossary is to provide a short definition of words as people use them in competitive programming. Some popular terms from the list, like CodeChef, TopCoder, and SPOJ, are only meaningful in that context. But others, like problem, contest, and ACM, have specific meanings when applied to competitive programming, which might differ from their general meaning.
- Multiple Table of Contents sections are included throughout the page to facilitate navigation.
- Titles of glossary entries can be links.
- There’s no arbitrary limit on the size of glossary entries.
- The whole glossary is on one page, which allows Ctrl-F searching.
(Image credit: Dave Worley)