I’m working on a project this year to build a competitive programming FAQ. This is one in a series of articles describing the research, writing, and tool creation process. To read the whole series, see my CPFAQ category page.
In recent weeks, I’ve been using text mining techniques to analyze a set of Quora questions and look for patterns. This week, I’m taking a more manual approach to analyzing the question database.
Canonical Questions
For any topic, certain questions will come up repeatedly. That’s why FAQs were invented. A canonical question is a standard wording for one of these frequently-asked queries.
One of the challenges when authoring a canonical question is deciding on the optimal specificity. On Quora, this decision process happens through merging and unmerging. Unless you’re merging two questions with identical titles, deciding to merge is a decision to give up some specificity from at least one of the question titles.
For example, here’s a popular question: What is the best way to learn C++ STL for programming contests? Over time, a number of questions have been merged into it, including:
- How should I learn STL for topcoder?
- What is the best source for learning new STL features in C++11 and C++14, useful in competitive programming?
- How do I learn C/C++ programming from the basics to advanced for programming contests?
The canonical part of the question is How do I learn C++ STL for programming contests? Details about which contest (TopCoder), which version of the STL (C++11, C++14, some other version), and which level of instruction (basic, intermediate, advanced) are not as important as the core of the question, which is how to learn STL for competitive programming.
If a question already exists on Quora, one consideration for the merge decision is what answers have been written. In the example above, if people are willing to write answers that focus on details like learning the STL specifically for TopCoder questions, or specifically learning the C++14 language version for competitive programming, or specifically learning advanced C++ concepts for competitive programming, then it might be worth leaving those questions and answers alone. But more often than not, answers just address the core issue, maybe with a brief mention of some non-core aspect of the question. In that case, leaving multiple questions unmerged just makes core answers harder to find, encourages people to repeat their questions and answers, and frustrates people who have put effort into answering the original question.
Primary Category
The proliferation of duplicate questions means there are a lot of questions to go through when trying to comprehensively research competitive programming and related topics on Quora. I’m currently working with a list of 16.5k questions. Searching that list for keywords is easy, but reading each title and deciding how it should be written canonically is another story.
One challenge during this process is finding previously evaluated questions in order to group related questions together under a canonical question title. Keyword searching helps, but with thousands of questions to go through, it’s important to find techniques that make the process faster.
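As a rough illustration of the keyword searching mentioned above, here is a minimal Python sketch. The function name find_by_keywords and the tokenizing rule are assumptions made for this example, not a description of the tooling actually used on the question list. It matches whole words case-insensitively, so "stl" will not match inside an unrelated longer word.

```python
import re

def tokenize(title):
    """Lowercase a title and split it into word-like tokens (keeps '+' and '#' so 'c++' survives)."""
    return set(re.findall(r"[a-z0-9+#]+", title.lower()))

def find_by_keywords(titles, keywords):
    """Return the titles that contain every keyword as a whole token (hypothetical helper)."""
    wanted = {k.lower() for k in keywords}
    return [t for t in titles if wanted <= tokenize(t)]

# Sample titles taken from the merged-question example earlier in this post.
titles = [
    "How should I learn STL for topcoder?",
    "What is the best source for learning new STL features in C++11 and C++14, useful in competitive programming?",
    "How do I learn C/C++ programming from the basics to advanced for programming contests?",
]

stl_titles = find_by_keywords(titles, ["stl"])
topcoder_stl_titles = find_by_keywords(titles, ["STL", "TopCoder"])
```

Whole-token matching is a design choice here: it avoids false hits, but a search for "c++" will skip a title that only mentions "C++11", so a real workflow might want prefix matching as well.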
One technique I found useful is to select a primary category for each question. Quora questions can have multiple categories (topics), but there’s usually one that captures the main idea of the question. This is especially true because all of the questions on the list I’m using are about the same overall topic, competitive programming.
In addition to the primary category, I also write a canonical question title for each question. In some cases, this is a title that the question should be merged to. What is the best way to learn C++ STL for programming contests? from the example above would make a good canonical title. In other cases, it’s just an additional level of specificity in the hierarchy of Competitive Programming -> Primary Category -> Canonical Title -> Question Title.
Sorting question data by primary category and then by canonical title (within each primary category) is a quick way to evaluate the current list of categories and titles and make any required adjustments.
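The two-level sort can be sketched in a few lines of Python. This is only an illustration under assumptions: the field names, the category names ("Getting Started", "Languages"), and the getting-started question are hypothetical placeholders, since the real category list is not given here. The STL titles come from the example earlier in this post.

```python
STL_TITLE = "What is the best way to learn C++ STL for programming contests?"

# Hypothetical records; the category names are placeholders, not the real list.
questions = [
    {"primary_category": "Languages", "canonical_title": STL_TITLE,
     "question_title": "How should I learn STL for topcoder?"},
    {"primary_category": "Getting Started",
     "canonical_title": "How do I get started with competitive programming?",
     "question_title": "How do I start competitive programming?"},
    {"primary_category": "Languages", "canonical_title": STL_TITLE,
     "question_title": "What is the best source for learning new STL features "
                       "in C++11 and C++14, useful in competitive programming?"},
]

def sort_questions(rows):
    """Sort by primary category, then by canonical title within each category."""
    return sorted(
        rows,
        key=lambda r: (r["primary_category"].lower(), r["canonical_title"].lower()),
    )

def group_questions(rows):
    """Build the hierarchy Primary Category -> Canonical Title -> [Question Titles]."""
    grouped = {}
    for r in sort_questions(rows):
        by_title = grouped.setdefault(r["primary_category"], {})
        by_title.setdefault(r["canonical_title"], []).append(r["question_title"])
    return grouped

grouped = group_questions(questions)

# Print the hierarchy as an indented outline for quick review.
for category, by_title in grouped.items():
    print(category)
    for canonical, originals in by_title.items():
        print("  " + canonical)
        for original in originals:
            print("    - " + original)
```

Reading the outline top to bottom makes it easier to spot a canonical title that landed in the wrong category, or two canonical titles that should be one.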
Categories
I categorized the first 500 or so questions in my list (sorted in descending order by number of followers), and came up with a set of Primary Categories sufficient for those questions. During this process, I’ve been implicitly defining each category based on the questions I put in it. But it would be better to have actual category descriptions. That will be the topic of next week’s post.
(Image credit: Elliott Brown)