CPFAQ: Classifying Quora Questions


I’m working on a project this year to build a competitive programming FAQ. This is one in a series of articles describing the research, writing, and tool creation process. To read the whole series, see my CPFAQ category page.

On Quora, it’s common to see the same questions, or variations of the same questions, show up repeatedly. The competitive programming topic is no exception to this rule. Merging similar questions is an option, but it presents a couple of challenges: Quora’s reviewers may not want particular questions to be merged, and it can be hard to find an appropriate merge target.

A competitive programming FAQ that is independent of Quora can resolve these problems. If Quora Content Review (QCR) doesn’t want questions to be merged, they can still be listed under the same question in the FAQ. And the FAQ can categorize questions in a way that makes it easy to find appropriate merge targets.

Using some of the data collection from previous weeks, I have an initial list of categorized questions to use as a starting point for the FAQ categories.

Collecting Questions

In past weeks, I have collected a set of Quora topics related to competitive programming. I also have a tool that can extract the list of questions from a topic page. This week, I extended the tool to accept a list of topic pages and collect the questions from all of them. That gives me every question from my topic list in one master question list.
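As a rough illustration of the merge step, here is a minimal Python sketch. It is not the actual tool: the function name collect_questions, the fetch_topic_questions callback, and the dictionary layout are all assumptions made for this example. The idea is that a question listed under several topics should appear only once in the master list, so the sketch keys each question by its URL.

```python
from typing import Callable, Dict, Iterable, List


def collect_questions(
    topic_urls: Iterable[str],
    fetch_topic_questions: Callable[[str], List[dict]],
) -> Dict[str, dict]:
    """Merge the questions from several topic pages into one master list.

    fetch_topic_questions is a hypothetical callback that returns a list of
    {"url": ..., "title": ...} dictionaries for one topic page.
    """
    master: Dict[str, dict] = {}
    for topic_url in topic_urls:
        for question in fetch_topic_questions(topic_url):
            # Key by question URL so a question that appears under
            # several topics is stored only once.
            entry = master.setdefault(
                question["url"],
                {"title": question["title"], "topics": set()},
            )
            entry["topics"].add(topic_url)
    return master
```

Recording the topics each question came from is optional, but it keeps the link back to the Quora topic list for later use.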

One of the useful statistics on the all_questions page is a follower count for each question. Using XPath, this number can be extracted from the FollowSecondaryActionItem div. The advantage of follower count is that it indicates interest in a question, even if the question has few or no answers. My master question list currently contains over 17,000 questions. Categorizing all of those manually would take a long time, so it’s useful to sort by follower count and start with the questions that hundreds or thousands of people are interested in, rather than the long tail of questions that interest only a few people.
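The extract-and-sort step might look something like the sketch below. Only the FollowSecondaryActionItem class name comes from the page. The rest of the sample markup (the question and question_link classes, the span holding the count) is invented for illustration. The sketch uses the limited XPath support in Python’s standard xml.etree.ElementTree module, which only works on well-formed markup, so a real page would probably need an HTML-aware parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. Only the FollowSecondaryActionItem class
# name is taken from the real page; everything else is assumed.
SAMPLE_PAGE = """
<div class="all_questions">
  <div class="question">
    <a class="question_link" href="/question-1">First sample question</a>
    <div class="FollowSecondaryActionItem"><span class="count">1,204</span></div>
  </div>
  <div class="question">
    <a class="question_link" href="/question-2">Second sample question</a>
    <div class="FollowSecondaryActionItem"><span class="count">37</span></div>
  </div>
  <div class="question">
    <a class="question_link" href="/question-3">Third sample question</a>
    <div class="FollowSecondaryActionItem"><span class="count">5,310</span></div>
  </div>
</div>
"""


def questions_by_followers(page_source):
    """Return (followers, url, title) tuples, most-followed first."""
    root = ET.fromstring(page_source.strip())
    rows = []
    for question in root.findall(".//div[@class='question']"):
        link = question.find("a[@class='question_link']")
        if link is None:
            continue  # skip blocks that do not look like a question
        count = question.find("div[@class='FollowSecondaryActionItem']/span")
        text = count.text if count is not None and count.text else "0"
        # Counts are displayed with thousands separators, e.g. "1,204".
        followers = int(text.replace(",", ""))
        rows.append((followers, link.get("href"), link.text))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows


for followers, url, title in questions_by_followers(SAMPLE_PAGE):
    print(followers, url, title)
```

Treating a missing count as zero keeps unanswered, unfollowed questions in the list, at the bottom of the sort order.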

Question Categories

I’m taking two complementary approaches to reducing the thousands of related questions to a manageable list of FAQs: canonical question titles, and question tagging.

Canonical question titles

Popular questions are asked in different ways using slightly different wording. But the answers to these questions mostly ignore the subtle differences in question wording, and focus on a few key ideas. This is an argument for collecting answers under a canonical question title. Quora supports the idea of canonical questions, and they’re trying various techniques to make questions more canonical. But they’re doing it at the scale of hundreds of thousands of topics, and I’m organizing fewer than 200. So I think I can come up with better canonical titles than the various Quora content control processes, or generalist content gnomes.
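One simple way to represent this grouping is a mapping from each question to its canonical title. The Python sketch below is only an illustration: the function name, the URLs, and the titles are made up, not taken from the FAQ. Questions that have not been assigned a canonical title yet are listed under their own wording.

```python
from collections import defaultdict


def group_by_canonical(questions, canonical_for):
    """Group question URLs under canonical titles.

    questions: iterable of (url, title) pairs.
    canonical_for: dict mapping a question URL to its canonical title.
    """
    groups = defaultdict(list)
    for url, title in questions:
        # Fall back to the question's own title if it has not been
        # assigned a canonical title yet.
        groups[canonical_for.get(url, title)].append(url)
    return dict(groups)


# Hypothetical sample data, for illustration only.
questions = [
    ("/variant-1", "How can I begin with programming contests?"),
    ("/variant-2", "What is the best way to start contest programming?"),
    ("/variant-3", "Which keyboard do top competitors use?"),
]
canonical_for = {
    "/variant-1": "How do I get started?",
    "/variant-2": "How do I get started?",
}
groups = group_by_canonical(questions, canonical_for)
```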

Here’s the canonical wording I’m currently using for some popular questions. Each one is linked to one of the Quora questions that would be listed under that canonical title:

Question tagging

Another way to organize questions is to tag them. On Quora, tags are called topics, and I have been using them to collect questions. But as with Quora’s version of canonical questions, Quora’s topic ontology also has to work for millions of questions on every conceivable subject. An ontology designed specifically for competitive programming doesn’t have that requirement.

As I found a couple of weeks ago in my discussion with the topic gnomes, Quora’s competitive programming topic ontology needs some work. For now, I’m going to work on tagging questions independently of Quora, and then see how the result can be merged into the Quora ontology.

Here are a few tags to start with:

  • competitive-programming: Most questions in the FAQ will be tagged with this one, but perhaps not all. For example, a question like “What books should I read to learn about algorithms and data structures?” might not have that tag.
  • training: For questions about techniques to help practice competitive programming. There’s an equivalent Quora topic called Training for Competitive Programming. My tag ontology can be more concise, since the overall topic is a given and doesn’t need to be repeated in each tag. In other words, the competitive programming context is assumed for each tag.
  • algorithms-and-data-structures: I think it’s best to combine these two in a single tag, since in practice there’s not much point in having a data structure with no algorithm to operate on it, or an algorithm with no data structure for storing results.
  • online-judge: For questions about competitive programming competition sites in general.
  • specific-online-judge: For questions about specific competitive programming competition sites. I borrowed this topic organization from Quora, which has topics like Specific Competitive Programming Competitions and Specific SPOJ Problems (which currently just contains other topics).
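A tag scheme like this can be stored as a set of tags per question, plus an inverted index from each tag to its questions. The Python sketch below is a minimal version under assumptions: the tag names come from the list above, but the question identifiers and function names are hypothetical.

```python
from collections import defaultdict


def build_tag_index(question_tags):
    """Invert {question: set of tags} into {tag: set of questions}."""
    index = defaultdict(set)
    for question, tags in question_tags.items():
        for tag in tags:
            index[tag].add(question)
    return dict(index)


def questions_with_all(index, *tags):
    """Return the questions that carry every one of the given tags."""
    sets = [index.get(tag, set()) for tag in tags]
    return set.intersection(*sets) if sets else set()


# Hypothetical question identifiers, tagged with tags from the list above.
question_tags = {
    "q-books": {"algorithms-and-data-structures"},
    "q-practice-plan": {"competitive-programming", "training"},
    "q-choose-site": {"competitive-programming", "online-judge"},
    "q-site-ratings": {"competitive-programming", "specific-online-judge"},
}
index = build_tag_index(question_tags)
training = questions_with_all(index, "competitive-programming", "training")
```

Because a question can carry several tags, the same question can show up in more than one FAQ category without being duplicated in the underlying list.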

I think the combination of question tagging and canonical questions will help describe the complete set of questions that people ask about competitive programming.

(Image credit: Drew Stephens)