CPFAQ: Patterns in Question Titles, Part 2

I’m working on a project this year to build a competitive programming FAQ. This is one in a series of articles describing the research, writing, and tool creation process. To read the whole series, see my CPFAQ category page.

Last week, I did some simple text mining to classify Quora Competitive Programming questions based on the first word (How, What, Why, etc.) of the question title. This week, I’m extending that a bit by looking at starting phrases containing 3-4 words each.

Text Mining

As before, I started with my set of 16.5k questions and ran the question titles through a simple program that split each title into a word list. I then counted how many times the same $n$ words occurred at the beginning of a title. Last week I used only $n = 1$. This week, I’m focusing on $n \geq 3$, mainly the lower values of $n$, since they produce more matches. (For example, very few questions in the set start with the same $8$ words.)
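The counting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual program used for the series: the function name count_starting_phrases and the sample_titles list are hypothetical, titles are split on whitespace only, and no case or punctuation normalization is applied.

```python
from collections import Counter


def count_starting_phrases(titles, n):
    """Count how many titles begin with each distinct n-word phrase.

    Assumption: words are whatever str.split() produces, so case and
    punctuation are left untouched.
    """
    counts = Counter()
    for title in titles:
        words = title.split()
        # Titles shorter than n words cannot contribute an n-word prefix.
        if len(words) >= n:
            counts[" ".join(words[:n])] += 1
    return counts


# A tiny stand-in for the real question set (titles quoted in this article).
sample_titles = [
    "How do I learn competitive programming as a beginner?",
    "How do I strengthen my knowledge of data structures and algorithms?",
    "How do I solve the ACODE problem on SPOJ?",
    "Is there any use of binary search trees in competitive programming?",
]

three_word = count_starting_phrases(sample_titles, 3)
four_word = count_starting_phrases(sample_titles, 4)
print(three_word.most_common(1))    # [('How do I', 3)]
print(four_word["How do I solve"])  # 1
```

Running the same function with larger values of $n$ shows why the counts thin out: every extra word splits a group of matching titles into smaller groups.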

The list below contains all of the phrases that start at least 160 questions (roughly 1% of the total set per phrase). For example, the phrase How do you is used at the beginning of 163 questions. In all, the phrases discussed below start about 36% of the questions in the set.
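One detail matters when adding up the share of questions covered by a list of phrases: a four-word phrase such as How do I solve is nested inside the three-word phrase How do I, so summing the per-phrase percentages would count some titles twice. The sketch below, which uses hypothetical helper names (frequent_phrases, starts_with_phrase, coverage) and a toy title list rather than the real data, filters phrases by a minimum share and counts each title at most once.

```python
from collections import Counter


def frequent_phrases(titles, n, min_share=0.01):
    """Return {phrase: share of titles} for n-word starting phrases
    whose share of the whole set is at least min_share."""
    total = len(titles)
    counts = Counter(
        " ".join(words[:n])
        for words in (title.split() for title in titles)
        if len(words) >= n
    )
    return {p: c / total for p, c in counts.items() if c / total >= min_share}


def starts_with_phrase(title, phrase):
    """True if the title's first words are exactly the words of the phrase."""
    phrase_words = phrase.split()
    return title.split()[:len(phrase_words)] == phrase_words


def coverage(titles, phrases):
    """Share of titles starting with at least one phrase (each title counted once)."""
    if not titles:
        return 0.0
    matched = sum(
        1 for title in titles
        if any(starts_with_phrase(title, p) for p in phrases)
    )
    return matched / len(titles)


# Toy data only; the real set has about 16.5k titles.
titles = [
    "How do I learn competitive programming as a beginner?",
    "How do I strengthen my knowledge of data structures and algorithms?",
    "How do I solve the ACODE problem on SPOJ?",
    "Is there any use of binary search trees in competitive programming?",
]

print(frequent_phrases(titles, 3, min_share=0.5))       # {'How do I': 0.75}
print(coverage(titles, ["How do I", "How do I solve"]))  # 0.75
```

In the toy data, How do I solve adds nothing to the coverage because its one title is already counted under How do I.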

Starting Phrases

How do I (9.7% of set)

By far the most common three-word starting phrase in the set is How do I. Questions starting with this phrase include the very popular How do I learn competitive programming as a beginner? and How do I strengthen my knowledge of data structures and algorithms?

How do I solve (2.7% of set)

The most common four-word starting phrase in the set is How do I solve. Most of these are asking how to solve specific online judge questions, often on SPOJ. E.g., How do I solve the ACODE problem on SPOJ?

How can I (4.2% of set)
How should I (1.0% of set)
How do you (1.0% of set)

These starting phrases are often used interchangeably with How do I, so popular questions in all of these categories resemble each other. Here are a few on the subject of training and practice:

What is the (6.8% of set)
What are the (3.5% of set)
What are some (3.4% of set)

The What is/are the phrases can be used in the same way as the How phrases. For example, What is the learning path in competitive programming to be in top 100 ranks of competitive programming sites? is another way to ask How do I become a top-100 competitive programmer?

But What is the can also be used to refer to a specific target, like an algorithm or even competitive programming in general:

This usage is even clearer with What are some questions. Those differ from How questions in that they always refer to specific targets. For example:

What is the best (1.4% of set)

What is the best is the second most popular four-word phrase in the overall list. It’s a standard way to ask for recommendations:

Where can I (1.3% of set)

Where can I questions are usually asking for an online location for something, and they often start with the four-word phrase Where can I find (0.8% of set):

But sometimes the request is for a physical location, as with Where can I find a competitive programming coach in Latin America or via the internet?

Is there any (1.2% of set)

There’s no question type that requires Is there any. Any question that starts with this phrase can be more directly stated using one of the other options. For example, Is there any use of binary search trees in competitive programming? could be expressed as Where can I find a list of competitive programming problems that use binary search trees? or What are some competitive programming problems that use binary search trees?