I’m working on a project this year to build a competitive programming FAQ. This is one in a series of articles describing the research, writing, and tool creation process. To read the whole series, see my CPFAQ category page.
I’m mining my Quora question corpus to find patterns and collect data to help write a list of canonical questions. In recent weeks, I’ve been looking at the words used to start question titles. This week, I’m analyzing the full text of the question titles in the list.
I modified my program from last week to split each question title into words and count each word using a hash table that maps the word to the number of times it occurs in the question list. For better grouping quality, I did a bit of processing on each word before counting it:
- Expand contractions into separate words.
- Trim punctuation and whitespace.
- Convert the word to lowercase (though when I discuss the words below, I use uppercase letters for words that require them).
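The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the program used for the article: the `CONTRACTIONS` table and the `TRIM` character set are hypothetical and deliberately small. The sketch trims and lowercases before looking up contractions, which gives the same result as the order listed above. The trim set leaves out `+` and `#` so that words like C++ and C# survive intact.

```python
from collections import Counter

# Hypothetical, deliberately tiny contraction table; a real one would be longer.
CONTRACTIONS = {
    "i'm": ["i", "am"],
    "what's": ["what", "is"],
    "don't": ["do", "not"],
    "can't": ["can", "not"],
}

# Assumed trim set: common sentence punctuation, but not '+' or '#',
# so that "C++" and "C#" are not reduced to "c".
TRIM = " \t\n.,?!:;\"'()[]"

def normalize(token):
    """Return the list of cleaned words that one raw token expands to."""
    # Treat curly apostrophes like straight ones, lowercase, trim the edges.
    word = token.replace("\u2019", "'").lower().strip(TRIM)
    if not word:
        return []
    return CONTRACTIONS.get(word, [word])

def count_words(titles):
    """Map each normalized word to the number of times it occurs."""
    counts = Counter()
    for title in titles:
        for token in title.split():
            counts.update(normalize(token))
    return counts
```

`Counter` is a hash table under the hood, so `count_words(titles).most_common()` gives the ranked list directly.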
The 16.5k question titles in the set expand to about 255k total words and 11k distinct words. However, the distribution is heavily skewed: the top 54 words make up half of the total, and the top 292 words make up 75% of the total.
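One way to compute coverage figures like these is to walk the ranked list and keep a running total. The sketch below assumes the counts are held in a `collections.Counter`; the function name `words_needed` is made up for illustration.

```python
from collections import Counter

def words_needed(counts, fraction):
    """Smallest number of top-ranked words whose counts add up to
    at least `fraction` of all word occurrences."""
    total = sum(counts.values())
    running = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running >= fraction * total:
            return rank
    return len(counts)
```

Calling `words_needed(counts, 0.5)` and `words_needed(counts, 0.75)` on the full word counts would produce the two ranks quoted above.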
Stop words are words that are often filtered out during textual analysis because they appear so frequently in all types of text that they don’t say much about a particular corpus. One exception is when these words appear at the beginning of a sentence, as I discussed in the last two weeks. But for the purpose of this week’s analysis, they don’t provide much value.
In the question list, the top 29 words by frequency include these 23 common words. Each word makes up the indicated proportion of all question title words: the (3.7%), I (3.3%), to (2.6%), in (2.6%), a (2.2%), how (2.2%), what (2.1%), is (2.0%), for (1.7%), of (1.7%), do (1.5%), and (1.5%), can (1.2%), are (1.1%), on (1.1%), should (0.7%), it (0.7%), or (0.7%), my (0.5%), be (0.5%), you (0.5%), that (0.5%), and from (0.5%).
Other English stop words appear throughout the ranked list, concentrated near the top.
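Filtering these words out of the ranked list is a short step once the counts exist. The sketch below is a hedged example: `STOP_WORDS` is only a subset of the common words listed above, and `ranked_content_words` is a hypothetical helper name. Each share is still computed against all question title words, stop words included, to match the percentages in this article.

```python
from collections import Counter

# Illustrative subset only: a few of the common words listed above.
STOP_WORDS = {"the", "i", "to", "in", "a", "how", "what", "is", "for", "of"}

def ranked_content_words(counts, stop_words=STOP_WORDS):
    """Return (word, share of all words) pairs in rank order,
    skipping any word in `stop_words`."""
    total = sum(counts.values())
    return [
        (word, n / total)
        for word, n in counts.most_common()
        if word not in stop_words
    ]
```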
The Main Topic
The other six words in the top 29 (counting problem and problems separately) describe the subject matter of these questions:
- programming (1.37%)
- problem/problems (1.35%)
- competitive (1.01%)
- solve (0.63%)
- code (0.51%)
At ranks 30 and 31 are two online judges, CodeChef and SPOJ. Here are the top six online judge names in the question list:
- CodeChef (0.5%)
- SPOJ (0.5%)
- TopCoder (0.4%)
- HackerRank (0.4%)
- CodeForces (0.3%)
- HackerEarth (0.2%)
Top programming contests on the list (other than those held by the online judges) are:
- ACM-ICPC (0.66%)
- IOI (0.20%)
- Google Code Jam (0.16%), identified by the word jam
- Facebook Hacker Cup (0.13%), identified by the word cup
Google (0.33%), at position 51, is the first company in the list that doesn’t focus exclusively on competitive programming. Other such companies in the list include:
- Facebook (0.10%)
- TCS (Tata Consultancy Services) (0.08%), popular on the list because of the TCS CodeVita competition
- Amazon (0.04%)
- Microsoft (0.03%)
The top few programming languages on the list:
- C++ (0.28%)
- Java (0.16%)
- C (0.15%)
- Python (0.09%)
- C# (0.01%)
I found other potential word categories on the list, including People, Countries/Regions, Algorithms/Data Structures, and Tools. But manually categorizing them from a list of 11k unique words isn’t scalable, so I’ll stop with the examples above.
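For the handful of categories shown above, the grouping can be expressed as a hand-built lookup from a normalized word to a category and a display label. This is only a sketch: `CATEGORY_OF` is a hypothetical table with a few entries taken from the lists above, and it still has to be maintained by hand, so it does not solve the scaling problem.

```python
from collections import Counter, defaultdict

# Hypothetical hand-built lookup: normalized word -> (category, label).
CATEGORY_OF = {
    "codechef": ("Online judge", "CodeChef"),
    "spoj": ("Online judge", "SPOJ"),
    "jam": ("Contest", "Google Code Jam"),
    "cup": ("Contest", "Facebook Hacker Cup"),
    "c++": ("Language", "C++"),
    "java": ("Language", "Java"),
}

def share_by_category(counts, lookup=CATEGORY_OF):
    """Group known words by category, with each word's share of all words."""
    total = sum(counts.values())
    shares = defaultdict(dict)
    for word, n in counts.items():
        if word in lookup:
            category, label = lookup[word]
            shares[category][label] = n / total
    return dict(shares)
```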
Using the Ranked List
Analyzing the ranked list of words from the question titles has a few benefits:
- It’s another way to determine the topics that people are interested in, and quantify the level of interest in each topic.
- Conversely, it’s a way to verify that the list covers the topics that it’s intended to cover.
- It can help clean up the question list by highlighting misspellings and other inconsistencies.
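As an example of the last point, rare words that closely resemble a frequent word are good candidates for misspellings. The sketch below uses `difflib.get_close_matches` from the Python standard library. The function name and the three thresholds (`common_min`, `rare_max`, `cutoff`) are assumptions chosen for illustration, and any match it reports still needs a manual check.

```python
import difflib

def likely_misspellings(counts, common_min=50, rare_max=2, cutoff=0.85):
    """Map each rare word to the frequent word it most resembles, if any.

    counts: dict-like mapping word -> number of occurrences.
    The thresholds are illustrative guesses, not tuned values.
    """
    common = [w for w, n in counts.items() if n >= common_min]
    suspects = {}
    for word, n in counts.items():
        if n <= rare_max:
            # Best single match among frequent words, above the similarity cutoff.
            matches = difflib.get_close_matches(word, common, n=1, cutoff=cutoff)
            if matches:
                suspects[word] = matches[0]
    return suspects
```

This compares every rare word against every frequent word, so on a list of 11k distinct words it helps to keep `common_min` high enough that the frequent list stays short.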
(Word cloud by WordClouds.com, using the question text as a source document.)