CPFAQ: Fast Classification

[Screenshot: QuoraClassifier]

I’m working on a project this year to build a competitive programming FAQ. This is one in a series of articles describing the research, writing, and tool creation process. To read the whole series, see my CPFAQ category page.

Over the past few months, I’ve been classifying a list of Quora questions. Each question gets a primary category, the topic it is most relevant to. Classification helps organize the FAQ and also allows me to find duplicates to merge on Quora.

Until now, I’ve been classifying questions using a spreadsheet containing one question per row and a column for the primary category. But now that I’m familiar with the types of questions people ask, it only takes a few seconds to decide on a category, so the spreadsheet approach is slowing me down. This week, I wrote a simple classification program, QuoraClassifier, to optimize this process.

QuoraClassifier

The goal of QuoraClassifier is to make it as fast as possible to record a single decision: the primary category for a question. Recording the decision shouldn’t take much longer than deciding on the category.

QuoraClassifier is a simple WPF app. The UI language for WPF programs is XAML, which I used in last year’s project.

Like most of the tools I’ve built for the CPFAQ project, QuoraClassifier reads and writes tab-separated value (TSV) files. When it starts up, it reads a list of question data (titles, URLs, categories, etc.), which it keeps in memory. To save the classification results, it writes a TSV file in the same format. Since this is a quick and dirty tool, not a thoroughly tested application, it writes to a separate output file rather than overwriting the input file, which reduces the risk of data loss. I can then compare the two files to verify that the changes are what I expect. If everything looks good, I copy the output file over the input file before the next classification session.
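The two-file workflow can be sketched in a few lines. This is a Python illustration, not the tool's actual WPF code. It assumes the TSV has a header row, and the column names used here (`url`, `category`) and the function names are hypothetical.

```python
import csv

def load_questions(path):
    """Read a TSV file with a header row into a list of dicts, one per question."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t"))

def save_questions(path, questions, fieldnames):
    """Write the questions to a separate output TSV, keeping the same columns."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(questions)

def changed_rows(original, updated, key="url"):
    """Return the updated rows that differ from the original, matched on `key`.

    Assumes `key` (a hypothetical column name) uniquely identifies a question.
    """
    before = {row[key]: row for row in original}
    return [row for row in updated if before.get(row[key]) != row]
```

Calling `changed_rows(load_questions(input_path), load_questions(output_path))` after a session lists only the rows that were touched, which is one way to do the comparison step before copying the output file over the input file.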

Once the input data loads, QuoraClassifier displays its single screen. The purpose of this UI is to 1) display information about a question, and 2) accept a classification choice. For maximum efficiency, the program lets me select a category using a single keystroke, at which point it displays the next question.

As shown in the screenshot at the top of this post, the UI includes the following fields:

  • Previous category: The category assigned to the previous question. Since the program moves to the next question as soon as it receives the classification choice for the current question, this field lets me check that I classified the previous question the way I intended.
  • Question title: The question title (from Quora).
  • Question URL: The Quora URL (not clickable).
  • View in Browser: A clickable version of the URL. This is a quick way to open the Quora question page, for cases when the title doesn’t provide enough information to classify the question.
  • Primary category: The primary category classification for the current question. Blank if it still needs classification.
  • Canonical title: A combo box containing a list of canonical titles that I have previously assigned to questions in this category, sorted in descending order by how often I used them. I haven’t implemented this yet, but I plan to use it to assign canonical titles once I’m done assigning categories.
  • Statistics: Total number of questions; number of questions with primary categories assigned; number of questions with canonical titles.
  • Last saved timestamp: When the output file was last written.
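Two of these fields are derived from the in-memory question list: the canonical title ordering and the statistics. A minimal Python sketch of both follows, assuming each question is a dict with hypothetical `category` and `canonical_title` keys, where an empty string means "not assigned yet". The function names are also illustrative.

```python
from collections import Counter

def canonical_titles_by_frequency(questions, category):
    """Canonical titles already used in `category`, most frequently used first."""
    counts = Counter(
        q["canonical_title"]
        for q in questions
        if q["category"] == category and q["canonical_title"]
    )
    return [title for title, _count in counts.most_common()]

def statistics(questions):
    """Counts shown in the Statistics field."""
    return {
        "total": len(questions),
        "categorized": sum(1 for q in questions if q["category"]),
        "with_canonical_title": sum(1 for q in questions if q["canonical_title"]),
    }
```

Putting the most-used titles first means the likeliest choice is at the top of the combo box, which fits the goal of minimizing the time spent recording each decision.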

Instant Classification

To get through the questions in my list, I need to optimize the classification process. There’s a lower limit to how fast I can read a question title and decide how to classify it. For questions related to competitive programming, I’m probably already at the lower limit. So the only way to speed up the process is to optimize the act of recording a classification.

I decided that the fastest way to classify a question is to press a single alphabetic key on the keyboard. Clicking with the mouse or tapping the screen is intuitive for many applications, but for speed, it’s hard to beat a keyboard interface. I assigned a unique letter to each category name. Since multiple categories start with the same letter, I couldn’t always use the first letter of the category. But it doesn’t take long to learn the letter associations: for example, P for People and R for PRoblems. An alternative scheme, such as pressing P once for People and twice for Problems, would just slow things down.

Here’s the category list:

  • A: Algorithms and data structures for competitive programming
  • B: Getting Better at competitive programming
  • C: Programming Contests
  • E: Exclude this question from the list (for off-topic questions)
  • G: General questions about competitive programming
  • H: MatH for competitive programming
  • I: Interviews and jobs in programming fields
  • K: BooKs about competitive programming
  • L: Programming Languages for competitive programming
  • M: Time Management for competitive programming practice
  • N: OrganizatioNs that competitive programmers join
  • O: Online judges
  • P: People who practice competitive programming (a.k.a. Competitive Programmers)
  • R: Competitive programming PRoblems (programming puzzles)
  • S: Getting Started with competitive programming
  • T: Tools for competitive programming
  • U: CoUrses for learning competitive programming
  • V: Competitive programming Vs. other types of programming (professional, hobbyist, academic, etc.)
  • W: Websites for studying competitive programming (other than online judge sites)
  • Y: Competitive programming coaches
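The single-keystroke idea amounts to a lookup table plus an "assign and advance" step. The Python sketch below illustrates it under some assumptions: the real tool is a WPF app, the class and field names are hypothetical, and the table includes only the three key assignments stated explicitly in this post (P, R, and Y). The remaining categories would be added the same way.

```python
# Partial key table: only the letters named explicitly in the text.
KEY_TO_CATEGORY = {
    "P": "People who practice competitive programming",
    "R": "Competitive programming problems",
    "Y": "Competitive programming coaches",
}

class ClassifierSession:
    """Tracks the current question and records one category per keystroke."""

    def __init__(self, questions):
        self.questions = questions   # list of dicts with a "category" key
        self.index = 0               # position of the question on screen
        self.previous_category = ""  # shown so the last choice can be checked

    def handle_key(self, key):
        """Assign the category mapped to `key` and move to the next question.

        Returns False, changing nothing, if the key is not a category key.
        """
        category = KEY_TO_CATEGORY.get(key.upper())
        if category is None:
            return False
        self.questions[self.index]["category"] = category
        self.previous_category = category
        # Assumption: stay on the last question instead of wrapping around.
        if self.index < len(self.questions) - 1:
            self.index += 1
        return True
```

Because a dictionary lookup needs no confirmation step, the cost of recording a decision is a single key press, and unmapped keys are ignored rather than misfiled.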

And a few more keyboard commands:

  • Ctrl-F (Find): Navigate to the next uncategorized question.
  • Ctrl-S (Save): Write the output TSV.
  • Right/Left Arrow: Navigate to the next/previous question without changing a classification.
  • Spacebar: Navigate to the next question without changing a classification.
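The navigation commands reduce to two small index computations. Here is a hedged Python sketch, not the tool's WPF code. It assumes an empty `category` string marks an uncategorized question, that Ctrl-F searches forward from the current position without wrapping, and that the arrow keys stop at the ends of the list. The function names are hypothetical.

```python
def next_uncategorized(questions, start):
    """Index of the first uncategorized question after `start`, or None."""
    for i in range(start + 1, len(questions)):
        if not questions[i]["category"]:
            return i
    return None

def step(questions, index, delta):
    """Move `delta` questions forward (+1) or back (-1), clamped to the list."""
    return max(0, min(len(questions) - 1, index + delta))
```

Under these assumptions, Right Arrow and Spacebar both map to `step(questions, index, +1)`, and Left Arrow maps to `step(questions, index, -1)`. None of them modifies a category.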

With these classification commands, I can classify a question in an average of 6.7 seconds. That’s about 30 hours for the ~16,000 questions in my list, or roughly two to four weeks of calendar time at an hour or two per day.
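The estimate can be checked with a short calculation. The figures come from the paragraph above, and the function name is illustrative.

```python
def estimate(seconds_per_question, question_count, hours_per_day):
    """Return (total hours, calendar days) for classifying every question."""
    total_hours = seconds_per_question * question_count / 3600
    return total_hours, total_hours / hours_per_day

hours, days_at_one = estimate(6.7, 16_000, 1)
_, days_at_two = estimate(6.7, 16_000, 2)
print(f"{hours:.1f} hours; {days_at_one:.0f} days at 1 h/day, "
      f"{days_at_two:.0f} days at 2 h/day")
# prints "29.8 hours; 30 days at 1 h/day, 15 days at 2 h/day"
```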