CPFAQ: Scraping with Selenium

Selenium

I’m working on a project this year to build a competitive programming FAQ. This is one in a series of articles describing the research, writing, and tool creation process. To read the whole series, see my CPFAQ category page.

When you’re logged in to Quora, you see more information than an anonymous user does. For example, on the all_questions page for a topic, logged-in users see a title for each question along with how many answers it has, when it was last followed or requested, how many followers it has, and various available actions. Anonymous users just see the question titles.

When I started collecting Quora questions for the FAQ, I noticed this discrepancy between the anonymous and logged-in experiences. To collect as much information as possible, I often manually saved pages while logged in, and then ran my tools on the saved HTML. But for individual question pages this wasn’t practical, since I’m tracking over 15,000 questions. For those, I wrote a program to download pages automatically. Because that program did not log in, some useful information was not available.

It would be ideal to combine the convenience of automation with the extra data provided to logged-in users. This week, I experimented with using the Selenium testing framework to achieve this. It turned out to be a simple process.

CPFAQ: Canonical Question Statistics, Part 2

Question Stats

I have now classified over 1000 Quora questions, using 552 canonical titles, and I think that’s a good spot to move on to some other CPFAQ tasks. But first, here’s how the numbers look compared with my previous checkpoint a few weeks ago.

CPFAQ: Good Answers to Bad Questions

42

As I mentioned at the end of last week’s post, it’s hard to write a good canonical question title. Fortunately, there’s a Quora tradition in which answer writers provide high-quality answers to questions that might be less than stellar. I can take advantage of such questions in CPFAQ by reading all the related questions (and associated comments and answers), writing a canonical title that clearly expresses what the question writers want to know, and including links to the original questions.

Here are some categories of bad question/good answer pairs I have observed on Quora.

CPFAQ: The Value of Canonical Questions

Canonical

Last week I discussed how question merging works for Quora and CPFAQ. Related to question merging is the idea of canonical questions. Although I have written about canonical questions in the past, I haven’t explained why they’re critical for CPFAQ. That’s the topic for this week.

CPFAQ: Merging Questions

Merge

If a CPFAQ page has a canonical title and contains a list of Quora questions that all relate to the title, why not just merge all the Quora questions into one canonical Quora question? Good question.

CPFAQ: Canonical Question Statistics

Numbers

As I mentioned last week, I’m currently creating FAQ pages, and those FAQ pages rely on canonical question titles. This week I’ll discuss some observations about the set of titles I have so far.

CPFAQ: Creating a FAQ Page

Pages

For at least the next few weeks, I’ll be creating competitive programming FAQ pages for the most frequently asked competitive programming questions, according to my analysis of Quora content. That set of pages will give me a foundation on which to add more specialized questions over time. This week, I’ll explain the page creation process that I’m currently using.

CPFAQ: Adding Wiki Pages

Pages

We’re officially halfway through the year, as measured by weekly blog posts. That means I’m also halfway through the CPFAQ project. As I mentioned last week, I’m building the Competitive Programming FAQ inside a MediaWiki site. This week, I added a few more pages to the wiki. My plan is first to focus on the questions, and later in the year to work on the answers. So the FAQ pages will initially just contain pointers to Quora questions (along with their answers), and will later include answer text in the wiki itself.

CPFAQ: CPWiki

MediaWiki

With the halfway point of 2018 approaching, it’s time to focus on the website that will host the content for the CPFAQ. I decided a few months ago that I would use MediaWiki software to host the FAQ. The advantage of a wiki is that it will allow me to write encyclopedia-style pages to supplement the main FAQ pages. This week, I have been thinking about how I want to organize the wiki, and I’ve created a few pages to get things started.

CPFAQ: Document Classification

MonkeyLearn

To organize my list of Quora questions, I have started giving each one a primary category that indicates what it is mainly about. For example, the primary category for “How can I sharpen my mathematical skills in the context of competitive programming?” is Mathematics (with Competitive Programming as the implicit overall topic for all questions).

On Quora, categories are known as topics, and they are assigned to questions by (1) Quora users, and (2) the Quora Topic Bot (QTB), an automated process. But there’s a lot of inaccuracy in topic assignments. For topics assigned by users, two factors contribute to the inaccuracy. First, most question askers don’t think much about correct topic assignment. They are just trying to get their question answered. Second, they often spam the question with as many topics as possible because they think it will increase the probability of it being answered. For topics assigned by QTB, the main problem is that machine learning algorithms still aren’t perfect at assigning topics, and they can be misled by users’ topic assignment behavior.

Using a set of Quora questions that I categorized myself, I thought it would be interesting to see what kind of auto-categorization results I could get using some free text classifiers.
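To make the idea of auto-categorization concrete, here is a minimal sketch of one common text classification technique, multinomial Naive Bayes with add-one smoothing, written in plain Python. It is only an illustration and not the classifier or service used in this post. The class name, the function names, the Languages category, and every training title except the Mathematics example quoted above are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.category_counts = Counter()         # training titles per category
        self.word_counts = defaultdict(Counter)  # word frequencies per category
        self.vocabulary = set()                  # every word seen in training

    def train(self, labeled_questions):
        """Learn counts from (title, category) pairs."""
        for title, category in labeled_questions:
            self.category_counts[category] += 1
            for word in tokenize(title):
                self.word_counts[category][word] += 1
                self.vocabulary.add(word)

    def classify(self, title):
        """Return the category with the highest log-probability score."""
        total = sum(self.category_counts.values())
        words = tokenize(title)
        best_category, best_score = None, -math.inf
        for category, count in self.category_counts.items():
            # Start from the log prior, then add one smoothed
            # log-likelihood term per word in the title.
            score = math.log(count / total)
            denom = sum(self.word_counts[category].values()) + len(self.vocabulary)
            for word in words:
                score += math.log((self.word_counts[category][word] + 1) / denom)
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hypothetical hand-labeled training set. Only the first title comes from
# the text above; the rest are made up for illustration.
TRAINING = [
    ("How can I sharpen my mathematical skills in the context of "
     "competitive programming?", "Mathematics"),
    ("Which number theory topics matter for programming contests?", "Mathematics"),
    ("Which programming language is best for contests?", "Languages"),
    ("Should I use C++ or Java for competitive programming?", "Languages"),
]

classifier = NaiveBayesClassifier()
classifier.train(TRAINING)
print(classifier.classify("How do I improve my mathematical skills?"))
```

A real experiment would need far more labeled questions per category, and a held-out set of questions to measure how often the predicted category matches the manual one.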
