I’m building a webliography of competitive programming resources, and I’m currently focusing on Quora questions. So far, I have extracted questions from search engine results and from the All Questions page that Quora generates. But as I mentioned last week, only a small fraction of the available topic questions appear in those locations. Where are the rest?
Quora question pages have a Related Questions section, which lists questions that an algorithm has decided are similar to the current question. As explained previously, I have a process for parsing page markup. To find the related questions, I look for a set of anchor tags that are each identified by a
As I find useful parts of the question page, I’m building a C# `Question` class that exposes the information on the page in a more useful format. For example, the related questions are stored as a C# `List` of title/URL pairs.
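To make the shape of that data concrete, here is a minimal sketch of such a container. It is written in Python rather than the C# used for the actual tool, and every name in it (`Question`, `RelatedQuestion`, the field names, the sample title and URLs) is an assumption made for illustration, not the real class.

```python
from dataclasses import dataclass, field
from typing import List, NamedTuple


class RelatedQuestion(NamedTuple):
    """One entry from a page's Related Questions section (hypothetical type)."""
    title: str
    url: str


@dataclass
class Question:
    """Parsed view of one question page. Field names are illustrative only."""
    title: str
    url: str
    related: List[RelatedQuestion] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)


# Made-up sample data; the URLs are placeholders, not real question URLs.
q = Question(
    title="What are some good competitive programming books?",
    url="https://example.com/question-1",
)
q.related.append(
    RelatedQuestion("What book(s) are you currently reading?",
                    "https://example.com/question-2")
)
```

Keeping the related questions as plain title/URL pairs means they can be fed straight back into the collection step without loading each page first.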
Most question pages seem to have 25 related questions, though sometimes that number is larger or smaller. Starting with an initial set of questions (from search results and the All Questions page), it’s therefore possible to build up a much larger set of related questions. Using those questions, I can find even more related questions, until the list of questions (hopefully) converges on the full question list for the topic.
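The expansion described above is essentially a breadth-first traversal that stops when no new questions turn up. The following Python sketch (not the C# used for the real tool) shows the idea under simplifying assumptions: `expand_related` and `get_related` are hypothetical names, and a small in-memory dictionary stands in for fetching and parsing real pages.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Set


def expand_related(seeds: Iterable[str],
                   get_related: Callable[[str], List[str]]) -> Set[str]:
    """Follow related-question links from the seeds until no new URLs appear."""
    seen: Set[str] = set(seeds)
    queue = deque(seen)
    while queue:
        url = queue.popleft()
        for related_url in get_related(url):
            if related_url not in seen:   # de-duplicate as we go
                seen.add(related_url)
                queue.append(related_url)
    return seen


# Toy stand-in for the site: each key lists its related questions.
toy_graph: Dict[str, List[str]] = {
    "q1": ["q2", "q3"],
    "q2": ["q1", "q3"],
    "q3": ["q4"],
    "q4": [],
    "q5": ["q1"],  # links to q1, but nothing links to q5
}

found = expand_related(["q1"], lambda url: toy_graph.get(url, []))
```

In the toy data, `q5` is never found because no collected question lists it as related, which is one reason the result may fall short of the full topic list.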
If the goal is to collect all competitive programming questions on Quora, the related questions list isn’t as targeted as search engine results or the All Questions list. For example, a question about competitive programming books is seen by the algorithm as being related to the question “What book(s) are you currently reading?” That question may be related, but it’s a distant relative. If we followed every related question this way, we would eventually end up collecting every question on Quora. Even more relevant questions like “How can I learn C and C++?” are too general for our purposes. Quora has hundreds of programming topics, and collecting questions on all of them is not the goal.
So how do we separate the irrelevant questions and the general questions from the specific competitive programming questions? Topic tags. Quora users can add topic tags to their questions to help other users find them. Quora bots also add and remove tags (not always correctly). In the page markup, topic tags are contained in `span` elements with the `TopicNameSpan` class. While parsing the page, I can collect a list of these tags.
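As a rough illustration of collecting those tags, the sketch below uses Python's standard `html.parser` module instead of the C# parsing process used for the real tool. Only the `TopicNameSpan` class name comes from the page markup described here; the `TopicTagParser` name, the sample markup, and the other class names in it are assumptions.

```python
from html.parser import HTMLParser
from typing import List


class TopicTagParser(HTMLParser):
    """Collects the text of every <span> whose class list has TopicNameSpan."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: List[str] = []
        self._depth = 0                 # > 0 while inside a matching span
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth > 0:
            self._depth += 1            # nested span inside a topic span
        elif "TopicNameSpan" in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "span" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._buffer).strip()
                if text:
                    self.tags.append(text)

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


# Invented markup for illustration; real pages are far more complex.
sample = (
    '<div><span class="TopicNameSpan TopicName">Competitive Programming</span>'
    '<span class="Other">ignored</span>'
    '<span class="TopicNameSpan">Algorithms</span></div>'
)
parser = TopicTagParser()
parser.feed(sample)
parser.close()
topic_tags = parser.tags
```

Splitting the `class` attribute on whitespace lets the match work even when the span carries several classes.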
Having the list of tags for a question helps in two ways. First, I can discard questions that don’t have the `competitive programming` tag. Although these questions may be relevant in some way, there are too many to spend time on all of them. Inspecting the tag list helps narrow down the list of questions to study further.
Second, it’s possible to find questions that are about competitive programming (based on their title or answer text), but which are missing the `competitive programming` tag. After a quick manual check, the tag can be added to the question through the Quora UI. This helps improve the quality of Quora question data for this topic.
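The two uses of the tag list can be sketched as a single classification step. This is a simplified Python illustration, not the C# code behind the real tool: the `classify` function and its three labels are hypothetical, and it checks only the title, not the answer text.

```python
from typing import Iterable

TOPIC = "competitive programming"


def classify(title: str, tags: Iterable[str]) -> str:
    """Return 'tagged', 'needs_tag', or 'discard' for one question."""
    if any(tag.strip().lower() == TOPIC for tag in tags):
        return "tagged"
    if TOPIC in title.lower():
        return "needs_tag"    # candidate for a manual check, then tagging
    return "discard"


# Made-up examples of each outcome.
results = [
    classify("How do I get better at contests?", ["Competitive Programming"]),
    classify("Is competitive programming useful?", ["Computer Programming"]),
    classify("How can I learn C and C++?", ["C++ (programming language)"]),
]
```

Questions labeled `needs_tag` still get a manual look before the tag is added, since a title match alone can be misleading.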
Last week, I had a list of 2220 unique questions collected from search engine results and the All Questions page. Although most question pages list 25 related questions, many of the same related questions appear repeatedly. In all, I found about 18,000 unique questions by consolidating and de-duplicating the related questions from the original list.
However, not all of these 18,000 questions are relevant to competitive programming. Only 5900 of them had the “Competitive Programming” tag, and fewer than a hundred more had “competitive programming” in their titles but were missing the tag. So the combined count is still below the advertised total number of questions in the topic.
The 5900 results are a good start, and a proof of concept that related questions and tags can be used to collect questions. However, the data needs some cleanup, and there is still more value that can be extracted from the source data. In particular, I need to:
- Recursively collect questions, their related questions, the related questions of those related questions, and so on to see what the number of questions with a “Competitive Programming” tag or title converges to.
- Filter out questions whose URL ends with `no_redirect=1`. This indicates a read-only question whose answers have been merged with another question, so it’s a duplicate. A moderate number of these have snuck into the data, mostly from related question lists (which is strange given that these lists are created by Quora’s algorithms, so one would think that merged questions would be filtered out already).
- Look for questions with “competitive programming” in the title, but without the “Competitive Programming” tag. There are more than a few of these, and they can be fixed using the Quora UI.
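The merged-question filter from the list above is simple to sketch. The Python below (again, not the C# used for the real tool) treats any URL whose query string carries `no_redirect=1` as a merged duplicate. The function names and sample URLs are made up for illustration.

```python
from typing import Iterable, List
from urllib.parse import parse_qs, urlsplit


def is_merged_duplicate(url: str) -> bool:
    """True when the URL's query string contains no_redirect=1."""
    query = parse_qs(urlsplit(url).query)
    return query.get("no_redirect") == ["1"]


def drop_merged(urls: Iterable[str]) -> List[str]:
    """Keep only URLs that do not point at a merged, read-only question."""
    return [url for url in urls if not is_merged_duplicate(url)]


# Placeholder URLs, not real question pages.
sample_urls = [
    "https://example.com/What-is-competitive-programming",
    "https://example.com/What-is-competitive-coding?no_redirect=1",
    "https://example.com/How-do-I-start?ref=related",
]
kept = drop_merged(sample_urls)
```

Parsing the query string instead of testing the end of the URL also catches cases where other parameters follow `no_redirect=1`.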