CPFAQ: Fast Classification, Part 2


I’m working on a project this year to build a competitive programming FAQ. This is one in a series of articles describing the research, writing, and tool creation process. To read the whole series, see my CPFAQ category page.

I’m writing a tool called QuoraClassifier to speed up the process of organizing the list of Quora questions I’ve collected. Last week, I used an early version of the tool that let me select a primary category for each question with a single keypress. This week, I added a few enhancements to allow for more specific classification.

Secondary Category

Consider these two questions:

Using my classification process, I have assigned both questions the primary category algorithms and data structures. But to locate these questions in the FAQ, I need more than just that category. Although both questions ask about algorithms, the first one is about study techniques, while the second one is asking for a syllabus.

To capture this additional information, I created a secondary category field. The secondary category uses the same list of descriptors as the primary category, but it refers to a more specific aspect of each question. In this example, the first question has a secondary category of getting better, and the second question has contests.

On Quora, users can associate one question with many topics. If someone is following one of these topics, then they might see the question. My categories have a different purpose. The idea is to progressively narrow down the subject of the question: start with the most relevant topic (primary category), refine it with a secondary category, associate the question with a group of related questions (those with the same canonical title), and finally arrive at the question title itself, the most specific identifier. Given this goal, there are diminishing returns to including many categories. I think two will be sufficient.

One consideration when assigning categories using QuoraClassifier is whether to assign both categories when I first look at a question, or to assign the primary category for every question in the first pass, and then go back through the whole list again to assign the secondary category. I tried the former approach this week, and I found it less efficient, because it requires switching between two contexts (primary and secondary category) for every question. Another way to describe it: if I have 20 categories, then assigning both at once means keeping in mind $20 \times 20 = 400$ category pairings. If I consider the primary category and the secondary category separately, I only have to keep track of 20 options at a time.

So I think the better of the two approaches is to assign every question a primary category first, and then go through the questions in each category and assign their secondary categories. For example, take all the algorithms and data structures questions and decide what else each of those questions is about.
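The two-pass approach described above can be sketched in a few lines of Python. This is only an illustration, not QuoraClassifier's actual code: the `Question` class, the function names, and the `choose_*` callbacks (stand-ins for a keypress) are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Question:
    """Hypothetical record for one collected question."""
    title: str
    primary: Optional[str] = None
    secondary: Optional[str] = None


def first_pass(questions: List[Question],
               choose_primary: Callable[[Question], str]) -> None:
    """Pass 1: assign only the primary category to every question."""
    for q in questions:
        q.primary = choose_primary(q)


def second_pass(questions: List[Question],
                choose_secondary: Callable[[Question], str]) -> Dict[str, List[Question]]:
    """Pass 2: group by primary category, then assign secondary categories
    one group at a time, so only one context is active at once."""
    groups: Dict[str, List[Question]] = defaultdict(list)
    for q in questions:
        groups[q.primary].append(q)
    for group in groups.values():
        for q in group:
            q.secondary = choose_secondary(q)
    return dict(groups)
```

In each pass, the person classifying only has to choose from the single list of categories, rather than from every primary/secondary pairing.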

Canonical Title

The most specific level of categorization, other than the Quora question title itself, is the canonical title. A single category can have many more canonical titles than there are categories in total. (For example, I already have 62 unique canonical titles for algorithms and data structures, and there are many more questions to classify.) So following the argument described above, it wouldn’t make sense to assign a canonical title at the same time as a primary category.

But a few canonical titles are so common that it requires no special effort to associate them with a question. For example, many questions ask some form of “How do I get started with competitive programming?” When I see one of those questions, I’d like to assign that title right away rather than having to find it later. Because the relevant canonical title is so obvious, it doesn’t slow down the classification process much.

To accomplish this with QuoraClassifier, I added a Canonical Title field. The canonical titles appear in a drop-down box, sorted in descending order by popularity, so the ones I want to assign immediately are likely to be near the top of the list and quick to reach. Because there are so many titles, it isn’t practical to assign each of them a letter as I do with categories. But I’m considering assigning a hotkey to each of the top few titles, to make it even faster to assign them to questions.
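As a rough sketch of the popularity ordering and the hotkey idea, assuming popularity means the number of questions already assigned to a title: the function names below are hypothetical, and using digit keys (to avoid clashing with the category letters) is my assumption, not a decision described above.

```python
from collections import Counter
from typing import Dict, Iterable, List


def titles_by_popularity(assigned_titles: Iterable[str]) -> List[str]:
    """Return unique canonical titles, most frequently assigned first.
    Ties are broken alphabetically so the order is predictable."""
    counts = Counter(assigned_titles)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ranked]


def hotkeys_for_top(titles: List[str], n: int = 3) -> Dict[str, str]:
    """Map the keys '1', '2', ... to the first n titles (assumes n <= 9)."""
    return {str(i + 1): title for i, title in enumerate(titles[:n])}


# Made-up assignment history: one entry per classified question.
history = [
    "How do I get started with competitive programming?",
    "Hypothetical title B",
    "How do I get started with competitive programming?",
    "Hypothetical title C",
    "Hypothetical title B",
    "How do I get started with competitive programming?",
]
ranked = titles_by_popularity(history)
print(ranked[0])  # How do I get started with competitive programming?
print(hotkeys_for_top(ranked, n=2))
```

Recomputing the ranking as more questions are classified keeps the drop-down order current, though it also means a hotkey could shift to a different title over time.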