I’m working on a project this year to build a competitive programming FAQ. This is one in a series of articles describing the research, writing, and tool creation process. To read the whole series, see my CPFAQ category page.
When you’re logged into Quora, there’s a link at the top of your question feed that says: “What is your question?” Some people interpret this to mean: “What are your questions?” So we end up with questions like this one: What is ACM-ICPC? Is it necessary to have a team to participate in ACM?
As with most simple questions, the two sub-questions that make up this question have been asked repeatedly on Quora. For example:
So it would be useful to use question merging to clean things up. But we can’t merge one question into two separate questions. Quora policy in this situation suggests choosing the more specific question as a merge target. In this example, that would be the second sub-question, Is it necessary to have a team to participate in ACM? But this is a compromise. There will be a mismatch between the merged question and the answers, since some of the answers will address both sub-questions.
Having seen many of these multi-question questions in the CPFAQ question set, I decided to analyze my question title data to find out more about them.
Parsing Question Titles
To analyze the composition of question titles, we need to look at the characters that make up the titles. A period (.) in a question title can be used for purposes other than ending a sentence. For example, it can be used in an abbreviation. A question mark (?) is more likely to indicate the end of a question, though it too can be used in other contexts, such as in a code snippet. Fortunately, it’s easy to use a more robust approach than just counting periods and question marks: The Natural Language Toolkit (NLTK) is designed to work with sentences as sentences, not just as strings of characters.
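To make the weakness of character counting concrete, here is a minimal sketch of the naive approach. The helper name count_marks is hypothetical, and the second and third titles are made up for illustration rather than taken from my question set.

```python
def count_marks(title):
    """Naively count periods and question marks in a question title."""
    return {"periods": title.count("."), "question_marks": title.count("?")}


titles = [
    # Two real sub-questions: the count of question marks happens to be right.
    "What is ACM-ICPC? Is it necessary to have a team to participate in ACM?",
    # Hypothetical title: one question, but the code snippet adds a second "?".
    "What does the ?: operator do in C++?",
    # Hypothetical title: one question, but the abbreviation adds two periods.
    "Is C++ better than Java, e.g. for ICPC?",
]

for title in titles:
    print(count_marks(title), title)
```

In the last two titles, the raw counts suggest more sentences than are actually there, which is the kind of miscount a sentence-aware tokenizer is meant to reduce.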
I used this Python code to analyze my question set:
import nltk.data

# Load the pre-trained Punkt sentence tokenizer for English.
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# questions.txt holds one question title per line.
with open("questions.txt", "r") as file:
    for line in file:
        title = line.rstrip('\n')
        sentences = tokenizer.tokenize(title)
        # Count only the sentences that end with a question mark.
        num_questions = sum(1 for s in sentences if s.endswith("?"))
        # Output: sentence count, question count, and title, tab-separated.
        print(str(len(sentences)) + "\t" + str(num_questions) + "\t" + title)
Given a text file with one question per line, this code uses the NLTK tokenizer to split each question title into sentences and count the number of sentences that end with a question mark. For each question title, it then prints the total sentence count, the question count, and the full question title.
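Once the tab-separated output exists, picking out the multi-question titles is a small filtering step. The following is a sketch of one way to do it, not the exact process I used. The function name multi_question_titles, the min_questions parameter, and the sample lines (including their counts) are assumptions made for illustration.

```python
def multi_question_titles(lines, min_questions=2):
    """Return (question_count, title) pairs for titles with several questions.

    Each input line is expected to look like:
        <sentence count>\t<question count>\t<title>
    Lines that do not have three tab-separated fields are skipped.
    """
    results = []
    for line in lines:
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) != 3:
            continue
        _num_sentences, num_questions, title = parts
        if int(num_questions) >= min_questions:
            results.append((int(num_questions), title))
    # Titles with the most sub-questions come first; ties keep input order.
    results.sort(key=lambda item: item[0], reverse=True)
    return results


# Sample lines with assumed counts, for illustration only.
sample_output = [
    "2\t2\tWhat is ACM-ICPC? Is it necessary to have a team to participate in ACM?\n",
    "1\t1\tWhat is dynamic programming?\n",
]

for count, title in multi_question_titles(sample_output):
    print(count, title)
```

Sorting by question count puts the most heavily overloaded titles at the top, which makes them easier to review by hand.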
Here are a few of the multi-question questions I found when I ran my NLTK program on my question set:
Who are the people who solve many Project Euler problems (200+)? How old are they? What is their job? Are they able to solve them consistently? How long do they spend on a typical problem? What is their motivation for doing them?
This is a popular and interesting question, which currently has 28 answers. Due to the nature of this question, there will never be a need to merge it into another question, so the multiple sub-questions aren’t a problem. It’s the only question in my set with six sub-questions.
This is not a bad question. It currently has ten answers, an answer wiki, and 229 followers. Considering the popularity of questions about dynamic programming, it could be a useful merge target.
This is another good question, with five detailed answers.
Unlike the others, this question illustrates the problem with having multiple sub-questions. It currently has zero uncollapsed answers. Although its sub-questions (about the ACM-ICPC and the pros and cons of C++ vs. Java) are well covered in other questions, the Quora merge process will make it difficult to merge this question into a more useful one.
(Image credit: Steve)