CPFAQ: Canonical Question Statistics

Numbers

I’m working on a project this year to build a competitive programming FAQ. This is one in a series of articles describing the research, writing, and tool creation process. To read the whole series, see my CPFAQ category page.

As I mentioned last week, I’m currently creating FAQ pages, and those FAQ pages rely on canonical question titles. This week I’ll discuss some observations about the set of titles I have so far.

Canonical Questions

The FAQ depends on the idea of taking all of the competitive programming question titles that people are writing in their own words, and mapping them to a smaller set of questions with canonical titles. The canonical questions are slightly more general, and ideally are written more clearly than questions found in the wild. The people who run Quora would like all Quora questions to be written like these canonical questions, but they haven’t found a way to make it happen at scale.

Here’s an example canonical question that I haven’t added to the FAQ yet:

Will competitive programming success help me get a programming job?

And here are the five Quora questions that I have classified under that canonical question (so far):

Each of these Quora questions expresses the same idea — the relationship between competitive programming practice and getting a programming job — in a slightly different way. The Quora questions also mention specific online judge and contest names, which the canonical question does not.

Statistics

When I let the Quora algorithm present me with content, I see many competitive programming questions that are near duplicates of popular questions that have been asked and answered many times. I suspect this happens because people tend to view, answer, and upvote these popular questions. The algorithm then picks up on these signals, and assumes the questions are interesting.

For my canonical question project, I’m taking popularity (followers and answer upvotes) into account. So I do see these same popular questions in my list. But there’s also a long tail of other questions.

So far, I have assigned 382 canonical titles to 743 Quora questions. The top ten titles categorize about 25% of these Quora questions, and they have more than ten Quora questions each. Here are the wiki pages that list those questions:

Canonical titles with fewer than ten but more than two Quora questions each make up the next 25% of the current set. And the remaining 50% is the long tail of canonical titles with only one or two Quora questions each.
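
As a rough illustration of how these shares could be tallied, here is a minimal Python sketch. It assumes the classification is available as a mapping from canonical title to the number of Quora questions filed under it, and that each percentage is measured as a share of Quora questions rather than of titles. The function name `bucket_shares`, the bucket labels, and the sample counts are all hypothetical, not taken from the actual project data. A title with exactly ten questions falls between the thresholds described above, so the sketch places it in the middle bucket as an assumption.

```python
from collections import Counter


def bucket_shares(question_counts):
    """Return the share of Quora questions that falls in each bucket.

    question_counts maps a canonical title to the number of Quora
    questions classified under it (hypothetical input format).
    """
    total = sum(question_counts.values())
    buckets = ("head", "middle", "long_tail")
    if total == 0:
        return {name: 0.0 for name in buckets}

    totals = Counter()
    for count in question_counts.values():
        if count > 10:
            totals["head"] += count        # more than ten questions
        elif count > 2:
            totals["middle"] += count      # three to ten (ten is an assumption)
        else:
            totals["long_tail"] += count   # one or two questions
    return {name: totals[name] / total for name in buckets}


# Made-up counts, for illustration only.
sample = {"title A": 12, "title B": 6, "title C": 3, "title D": 2, "title E": 1}
shares = bucket_shares(sample)
```

Rerunning a tally like this as more questions are classified would show whether the split between the popular titles and the long tail stays roughly stable.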

I have only classified a small portion of all Quora competitive programming questions, so these numbers will change as I make more progress. But it will be interesting to see if the percentages change drastically. For example, will the long tail of unique or almost unique questions always make up about half of the total set of canonical titles?

It’s worth pointing out that, since I’m doing the classification process, I’m influencing the results of this experiment by deciding how I want to match Quora questions to canonical titles. But I am continually evaluating whether the classification makes sense for the goal of a useful FAQ.

At some point, I’ll need to decide how to present the long tail of questions in the FAQ: I’ll either create many FAQ pages with a small number of questions each, or a few FAQ pages holding all of the long-tail questions. I’m inclined to use the latter strategy. I think people will find the FAQ hard to navigate if the number of FAQ pages gets too large.

(Image credit: Bernard Spragg. NZ)