I’m working on a project this year to build a competitive programming FAQ. This is one in a series of articles describing the research, writing, and tool creation process. To read the whole series, see my CPFAQ category page.
Quora doesn’t provide a page where a user can see every new question for the topics that they follow. Like other social media companies, Quora believes that the best way to present content to users is in the form of a “feed.” This feed is not just a reverse chronological list of new posts. Rather, it’s the output of an algorithm that considers multiple factors to determine what to show the user.
There’s an ongoing debate, which I won’t get into here, about the wisdom of allowing a secret algorithm to control what you see online. But regardless of the overall pros and cons of an algorithmic feed, there are definitely drawbacks to using it to maintain a canonical list of questions. This week, I’ll discuss an alternative process for Quora content.
Earlier this year, I created a tool called Webliographer whose purpose is to keep track of web links. One of the early features that I implemented is for assembling a list of unique links given a set of input files. For the CPFAQ project, I used Webliographer to consolidate a master list of Quora questions from multiple topics. But that list only represents questions available at a point in time. To update the list, I can run Webliographer again with newer source files, and it will add new links to the master list, discarding any duplicates.
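The merge-and-discard-duplicates idea can be sketched in a few lines of Python. This is only an illustration, not Webliographer’s actual code: the function names `normalize` and `merge_links`, the rule for deciding that two URLs are the same (ignore the query string, the fragment, and a trailing slash), and the example URLs are all assumptions made for this sketch.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Reduce a URL to a comparison key (an assumed rule, for illustration only)."""
    parts = urlsplit(url.strip())
    # Drop the query string and fragment, lowercase scheme and host,
    # and trim any trailing slash from the path.
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def merge_links(master: list[str], new_links: list[str]) -> list[str]:
    """Return the master list plus any links not already in it, keeping order."""
    seen = {normalize(url) for url in master}
    merged = list(master)
    for url in new_links:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            merged.append(url)
    return merged


# Made-up example URLs: the first new link duplicates the master entry.
master = ["https://www.quora.com/Example-question-A"]
new = [
    "https://www.quora.com/Example-question-A?share=1",
    "https://www.quora.com/Example-question-B",
]
merged = merge_links(master, new)
```

Running the merge again with the same source files leaves the list unchanged, which is what makes it safe to repeat the process as often as needed.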
Let’s look at some sources of new questions.
“All Questions” Pages
Just as each user has a customized feed (the list you see when you visit quora.com), each Quora topic page also has a feed, which you see when you click on a topic name or visit quora.com/topic/[topic-name]. Although new questions do show up here, they are mixed in with older content that the Quora algorithm thinks is interesting. So it’s not ideal for collecting only new questions.
The all_questions page, on the other hand, is a less filtered version of the questions for a topic. This page, which is accessible via the Questions link on the top right side of the topic home page, is theoretically a list of all questions that have been tagged with a given topic. The all_questions page uses “infinite scroll,” and for large topics I haven’t been able to collect as many questions from the page as the topic says it contains. But the questions at the top of the page are often new questions.
To build my master list of questions, I collected a set of relevant topics, scrolled to the bottom of the all_questions page for each one, and collected the questions from the full page. But to keep up to date with new questions for each topic, it might be sufficient to just visit the page without scrolling (which would be convenient for an automated scraper that doesn’t know how to scroll) and collect the questions at the top of the list. By doing this often enough (according to the rate of incoming questions for each topic), a scraper could keep up with all incoming questions for a set of topics.
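How often is “often enough”? A rough way to reason about it, assuming the unscrolled page shows a fixed number of questions and that new questions arrive at a steady rate, is to visit before that many new questions could have arrived. The function name, the safety factor, and the numbers in the example are all hypothetical.

```python
def max_poll_interval_days(
    visible_questions: int, new_per_day: float, safety: float = 0.5
) -> float:
    """Longest gap between visits, in days, that should not miss questions.

    visible_questions: how many questions appear before scrolling (assumed fixed).
    new_per_day: estimated rate of new questions for the topic.
    safety: fraction of the theoretical limit to use, since arrivals are bursty
            and the top of the page is not guaranteed to hold only new questions.
    """
    if visible_questions <= 0 or new_per_day <= 0:
        raise ValueError("visible_questions and new_per_day must be positive")
    return safety * visible_questions / new_per_day


# Hypothetical topic: 20 questions visible without scrolling, 4 new per day.
interval = max_poll_interval_days(20, 4)  # 2.5 days
```

A busy topic would need more frequent visits than a quiet one, so a scraper could keep one interval per topic instead of a single global schedule.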
User Topic Answer Pages
As described over the past two weeks, Quora provides a page showing answers that any user has written to questions on any topic. These answers are shown in strictly reverse chronological order with no algorithmic funny business, so they’re especially useful for collecting new answers (though not necessarily new questions). Because they show answers, user topic answer pages are also good for collecting statistics, like number of upvotes, that would otherwise be inconvenient to get in an automated way.
I sometimes give examples of Quora questions that I find using various collection techniques. This week, that isn’t as useful, since “newness” isn’t a very interesting quality for questions unless the topic is a current event. The reason to look at new questions is primarily to make sure that my question set is complete, and to merge duplicate questions. Occasionally someone even comes up with a unique question that deserves its own entry in the FAQ.
Since I won’t be linking to questions this week, I’ll focus on process and statistics. Currently my process has a few manual steps, which I’ll need to automate if I want to run the process frequently. Here are the steps for the all_questions process:
- Download: Using my master list of topics, download the HTML for the top of the all_questions pages (the part that appears before scrolling starts).
- Scrape: Using XPath, extract question titles and URLs.
- De-duplicate: Using Webliographer, merge new question titles and links with the question master list, discarding any duplicates.
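The Scrape step above might look something like the following sketch. It uses the limited XPath support in Python’s standard library on a made-up, well-formed page fragment. The `question_link` class name, the sample markup, and the `extract_questions` helper are assumptions for illustration, not Quora’s actual page structure. Real pages are rarely well-formed XML, so a real scraper would likely need a lenient HTML parser with fuller XPath support.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the top of an all_questions page.
SAMPLE_PAGE = """
<html><body>
  <div class="feed">
    <a class="question_link" href="/Example-question-one"><span>Example question one?</span></a>
    <a class="other" href="/about">About</a>
    <a class="question_link" href="/Example-question-two"><span>Example question two?</span></a>
  </div>
</body></html>
"""

BASE_URL = "https://www.quora.com"


def extract_questions(page: str) -> list[tuple[str, str]]:
    """Return (title, absolute URL) pairs for each question link in the page."""
    root = ET.fromstring(page.strip())
    results = []
    # ElementTree supports a subset of XPath, including attribute predicates.
    for link in root.findall(".//a[@class='question_link']"):
        title = "".join(link.itertext()).strip()
        results.append((title, BASE_URL + link.get("href")))
    return results


questions = extract_questions(SAMPLE_PAGE)
```

The resulting titles and URLs are what the De-duplicate step would then merge into the master list.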
And here are some statistics:
- My master question list, which I last updated about three months ago, has about 16.5k questions.
- Adding the Most Viewed Writers results from last week gave me about 900 more unique questions.
- Using the all_questions process described above, I got about 500 more unique questions.
- Since the above process doesn’t capture all of the data on the all_questions page (due to the scrolling issue), I manually downloaded the all_questions page for the Competitive Programming topic (which is by far the largest one in my master topic list). Running it through the same process, I got an additional 400 or so unique questions.
It has been about three months since I updated my master list, so using a very rough estimate, I should expect about 300 new questions per month, or 10 per day, if I use this process to keep up with new questions.
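One way to arrive at those numbers, assuming the estimate counts only the two all_questions results (about 500 plus about 400) and leaves out the Most Viewed Writers questions, which are not necessarily new:

```python
def estimate_rate(
    new_questions: int, months: float, days_per_month: int = 30
) -> tuple[float, float]:
    """Return (questions per month, questions per day) for a rough estimate."""
    per_month = new_questions / months
    per_day = per_month / days_per_month
    return per_month, per_day


# About 500 from the top-of-page run plus about 400 from the full
# Competitive Programming page, over roughly three months.
per_month, per_day = estimate_rate(500 + 400, 3)
print(per_month, per_day)  # 300.0 10.0
```

If the Most Viewed Writers questions were counted as well, the same arithmetic would give a higher rate, so the figure is best read as an order of magnitude.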