I’m working on a project this year to build a competitive programming FAQ. This is one in a series of articles describing the research, writing, and tool creation process. To read the whole series, see my CPFAQ category page.
In recent weeks, I have been experimenting with ways to collect Quora questions, especially those that don’t appear in standard views like search engine results and the All Questions page. But I also want to make sure that the questions I collect from alternative sources are relevant, since I eventually need to manually evaluate the best questions for a FAQ. Last week I started filtering on the Competitive Programming topic tag, but I realized that this can filter out relevant questions. To see why that is, I’m investigating how Quora topics work.
To categorize questions and help users find them, Quora allows questions to be tagged with one or more topics. When tagging questions, users can select existing topics or create new topics. A bot called the Quora Topic Bot also tags questions.
Each Quora topic is associated with a set of pages of the form
www.quora.com/topic/[TopicName]/[PageType], where
PageType can be:
- read: The topic home page, which contains a feed of questions related to that topic.
- all_questions: A list of all questions tagged with that topic. For large topics, the page uses “infinite scroll,” and you may never actually see all of the questions.
- followers: The Quora users who follow the topic.
- log: The edits that have been made to the topic itself.
- writers: The most viewed writers in the topic.
- faq: Up to 10 frequently asked questions for the topic.
- links: A new Quora feature that isn’t used much yet.
- top_questions: A view intended to help users find questions to answer.
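As a small illustration of the URL pattern above, here is a Python sketch that builds a topic page address from a topic name and a page type. The function names `topic_slug` and `topic_url` are my own, and two details are assumptions rather than documented Quora behavior: that the pages are served over https, and that the TopicName part is the topic title with spaces replaced by hyphens.

```python
# Page types listed above for Quora topic pages.
PAGE_TYPES = (
    "read", "all_questions", "followers", "log",
    "writers", "faq", "links", "top_questions",
)

def topic_slug(title):
    """Turn a topic title into a URL slug.

    Assumption: spaces become hyphens and nothing else changes.
    """
    return title.strip().replace(" ", "-")

def topic_url(topic_name, page_type="read"):
    """Build www.quora.com/topic/[TopicName]/[PageType] (https assumed)."""
    if page_type not in PAGE_TYPES:
        raise ValueError(f"unknown page type: {page_type!r}")
    return f"https://www.quora.com/topic/{topic_name}/{page_type}"

if __name__ == "__main__":
    print(topic_url(topic_slug("Competitive Programming"), "all_questions"))
```

Checking the page type up front keeps a typo from turning into a request for a page that doesn’t exist.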
Most of these topic page types contain these common elements, which can be extracted from the page HTML using XPath:
- The topic title,
- Statistics about the number of questions, followers, and edits,
- A list of related topics (which requires a more complex XPath expression to extract).
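To make the XPath idea concrete, here is a minimal sketch using Python’s standard `xml.etree.ElementTree`, which supports a limited XPath subset. The sample markup, its class names, the `data-kind` attribute, and the numbers in it are all invented for illustration. They are not Quora’s actual page structure, so the expressions would need to be rewritten against the real HTML. Real pages are also rarely well-formed XML, so in practice an HTML-aware parser would probably be needed.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed stand-in for a topic page (made-up values).
SAMPLE_PAGE = """
<html><body>
  <h1 class="TopicTitle">Competitive Programming</h1>
  <div class="TopicStats">
    <span data-kind="questions">3</span>
    <span data-kind="followers">2</span>
    <span data-kind="edits">1</span>
  </div>
  <ul class="RelatedTopics">
    <li><a href="/topic/TopCoder">TopCoder</a></li>
    <li><a href="/topic/CodeChef">CodeChef</a></li>
  </ul>
</body></html>
"""

def parse_topic_page(markup):
    """Extract title, statistics, and related topics from the sample markup."""
    root = ET.fromstring(markup.strip())
    # Topic title: a single element selected by class.
    title = root.find(".//h1[@class='TopicTitle']").text
    # Statistics: one <span> per counter, keyed by its data-kind attribute.
    stats = {
        span.get("data-kind"): int(span.text)
        for span in root.findall(".//div[@class='TopicStats']/span")
    }
    # Related topics: a longer path, since the names sit inside list links.
    related = [a.text for a in root.findall(".//ul[@class='RelatedTopics']/li/a")]
    return {"title": title, "stats": stats, "related": related}

if __name__ == "__main__":
    print(parse_topic_page(SAMPLE_PAGE))
```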
Some of the topic page types also contain a list of questions. My eventual goal in analyzing the topic page is to collect and organize these questions.
Just as the question page has a list of related questions, the topic page has a list of related topics. And just as we can use related questions to expand a small question list into a larger list, we can expand a small topic list in the same way. Since topic pages contain question lists, this can lead to more questions, which can lead to more topics, and so on.
The problem with recursively collecting related questions and topics is that it’s easy to collect a large number of irrelevant questions. However, the number of Quora topics is small compared to the number of Quora questions. So I’m starting to manually curate relevant topics, which I’ll then use to filter my question list.
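The filtering idea above can be sketched in a few lines of Python. The record layout (a dict with `title` and `topics` keys), the function names, and the sample questions are hypothetical. The point is only that a question is kept when it carries at least one curated topic, which is looser than requiring the Competitive Programming tag itself.

```python
def unique_topics(questions):
    """Collect every distinct topic that appears on any question."""
    topics = set()
    for question in questions:
        topics.update(question["topics"])
    return topics

def filter_by_topics(questions, relevant_topics):
    """Keep questions tagged with at least one relevant topic."""
    relevant = set(relevant_topics)
    return [q for q in questions if relevant & set(q["topics"])]

# Hypothetical sample data, for illustration only.
QUESTIONS = [
    {"title": "Q1", "topics": ["Competitive Programming", "Algorithms"]},
    {"title": "Q2", "topics": ["CodeChef"]},
    {"title": "Q3", "topics": ["Cooking"]},
]

if __name__ == "__main__":
    print(sorted(unique_topics(QUESTIONS)))
    # Filtering on the single main tag drops Q2 ...
    print([q["title"] for q in filter_by_topics(QUESTIONS, ["Competitive Programming"])])
    # ... while a curated topic set keeps it.
    curated = ["Competitive Programming", "CodeChef", "TopCoder"]
    print([q["title"] for q in filter_by_topics(QUESTIONS, curated)])
```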
As a first step, I extracted all of the unique topics from last week’s list of 5900 questions, and manually filtered them. Out of about 7000 topics, I decided that only about 150 were relevant for this project. Here is some trivia about that list of topics:
- Competitive Programming is by far the most popular topic in the list, with over 20k questions and over 290k followers.
- The next most popular topics are TopCoder (>4k questions, >128k followers), CodeChef (>4k questions, >134k followers), and Algorithms in Competitive Programming (>3k questions, >1k followers).
- There are a lot of creative ways to spell Competitive Programming, including Competitive Progarmming, Competive Programm, Compettive Programming, Compitative Programmin, and Compitative Programming (these are all actual topic names, which I’ll get around to merging soon).
- There are even more correctly-spelled topics that are so close to the main topic that it’s debatable whether it really makes sense to maintain them as separate topics. For example, do we really need Coding Competition, Competitive Coding, and Programming Competitions, in addition to the main topic?
Starting with my initial list of 150 topics, I’ll recursively collect the set of unique related topics and see if they converge to a reasonable list. If they do, that will become my official list of topics for the Quora portion of this research project. If they don’t, I’ll stop after a few iterations and manually filter the list. Either way, I’ll start using that list to collect and filter questions.
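Here is a rough sketch of that plan: expand the topic set round by round, stop early if a round finds nothing new (convergence), and otherwise give up after a fixed number of rounds. The function `expand_topics`, the `get_related` callback, the `max_rounds` default, and the toy related-topics graph are all assumptions for illustration. In practice `get_related` would fetch and parse a topic page.

```python
def expand_topics(seed_topics, get_related, max_rounds=3):
    """Grow a topic set by following related-topic links.

    Returns (topics, converged). converged is True when a round
    discovered no new topics before max_rounds ran out.
    """
    known = set(seed_topics)
    frontier = set(seed_topics)  # topics whose related lists are unread
    for _ in range(max_rounds):
        discovered = set()
        for topic in frontier:
            discovered.update(get_related(topic))
        frontier = discovered - known  # only genuinely new topics
        if not frontier:
            return known, True
        known |= frontier
    return known, False

# Made-up related-topics graph standing in for real topic pages.
TOY_GRAPH = {
    "Competitive Programming": ["TopCoder", "CodeChef"],
    "TopCoder": ["Competitive Programming"],
    "CodeChef": ["Competitive Programming", "Programming Competitions"],
    "Programming Competitions": ["Competitive Programming"],
}

def toy_related(topic):
    return TOY_GRAPH.get(topic, [])

if __name__ == "__main__":
    topics, converged = expand_topics(["Competitive Programming"], toy_related)
    print(sorted(topics), converged)
```

When the second value comes back `False`, the returned set is the partial list that would still need manual filtering.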