CPFAQ: Scraping with Selenium

I’m working on a project this year to build a competitive programming FAQ. This is one in a series of articles describing the research, writing, and tool creation process. To read the whole series, see my CPFAQ category page.

When you’re logged in to Quora, you see more information than an anonymous user does. For example, on the all_questions page for a topic, logged-in users see a title for each question along with how many answers it has, when it was last followed or requested, how many followers it has, and various available actions. Anonymous users just see the question titles.

When I started collecting Quora questions for the FAQ, I noticed this discrepancy between the anonymous and logged-in experiences. To collect as much information as possible, I often manually saved pages while logged in, and then ran my tools on the saved HTML. But for individual question pages this wasn’t practical, since I’m tracking over 15,000 questions. For those, I wrote a program to download pages automatically. And since that program did not log in, some useful information was not available.

It would be ideal to combine the convenience of automation with the extra data provided to logged-in users. This week, I experimented with using Selenium, a browser automation framework commonly used for testing, to achieve this. It turned out to be a simple process.

« Continue »

CPFAQ: Canonical Question Statistics, Part 2

I have now classified over 1000 Quora questions, using 552 canonical titles, and I think that’s a good spot to move on to some other CPFAQ tasks. But first, here’s how the numbers look compared with my previous checkpoint a few weeks ago.

« Continue »

CPFAQ: Good Answers to Bad Questions

As I mentioned at the end of last week’s post, it’s hard to write a good canonical question title. Fortunately, there’s a Quora tradition in which answer writers provide high-quality answers to questions that might be less than stellar. I can take advantage of such questions in CPFAQ by reading all the related questions (and associated comments and answers), writing a canonical title that clearly expresses what the question writers want to know, and including links to the original questions.

Here are some categories of bad question/good answer pairs I have observed on Quora.

« Continue »