For each of the past two years, I’ve been working on year-long projects and writing about them here. In 2017, the topic was designing and coding a time-tracking app. This year, my project was a competitive programming FAQ. Like most FAQs, it’s a work in progress, but it now contains a set of popular questions, and is set up for me to add more.
In the process of creating the FAQ, I learned a few things about how Quora and Google work, wrote some research tools, experimented with MediaWiki, classified a lot of Quora questions, tried to merge some of those Quora questions (and had many of them unmerged by the Quora Content Review bot), and even wrote a few FAQ answers.
Collecting Questions and Answers
For a competitive programming FAQ project, the obvious source of questions and answers is Quora. Although questions and answers on this topic come up on Reddit, Stack Overflow, the online judges (Topcoder, Codeforces, etc.), and other sites, Quora easily wins on volume of Q&A activity.
Despite this, I spent a few weeks at the beginning of the project collecting references from the Web in general, without focusing specifically on Quora. To keep track of these references, I wrote a simple tool called Webliographer whose purpose is to maintain a list of links and associated metadata. The main Webliographer workflow is to extract links from an input file, remove duplicates, and output a new master list.
To create my initial input data file, I used two tools, GoogleScraper and Linkclump, to save Google search results both automatically and manually. I then fed these results into Webliographer to get a list of unique links.
Google search limits the number of results returned from each domain (for example, from quora.com) in a default search. If you use the site:
syntax to return results from a single domain, you can get many more results, but still nowhere close to the full list for domains with many pages. So I needed another search technique.
For Quora, the search technique that worked was to find pages that listed questions and/or answers, and scrape them myself rather than relying on Google. The page to start with is all_questions
, which shows a list of questions for a particular topic (though not all questions for large topics). Another source is the Related Questions list, which appears on a question page and shows questions that are similar to the current question. Finally, I collected questions from the Most Viewed Writers page for the Competitive Programming topic and other related topics. This page type is especially useful because it lists questions that are likely to have at least one good answer.
Initially, I scraped pages by downloading them programmatically and using a library called OpenScraping to parse the HTML. This limits what data I could extract from the page because Quora hides some information from unauthenticated users and loads parts of the page dynamically through JavaScript. Later in the year, I switched to using Selenium, which supports logging in, scrolling, and clicking page elements. This allowed me to get all the information that a typical Quora user sees on the page.
Besides collecting all the questions that are out there, I also want to keep up to date with new questions. The competitive programming topics are active on Quora, so there are always new questions and answers coming in. To keep the FAQ up to date and improve Quora content quality, it’s important to keep scanning for new questions to add to the FAQ as new pages, add to the question lists for existing FAQs, or merge with older Quora questions.
Exploring How Quora Works
With so much competitive programming content available on Quora, there’s no way to create the CPFAQ without learning the details and quirks of how Quora content is created and organized.
One area of interest is how Quora topics work. Topics are a way of categorizing questions. Users and bots can tag questions with one or more topics, and it’s possible to get Quora to list all the questions categorized with a particular topic (up to some unspecified maximum number of questions).
Early in the process of collecting questions, I also collected topics related to competitive programming. There’s the top-level topic, which has by far the most questions, but also related topics like CodeChef, Topcoder, and 140 or so others. For FAQ collection, the two main ways to use topics are: 1) Checking whether someone has tagged a question with a CP-related topic, which implies that the question is on-topic for the FAQ; and 2) Collecting on-topic questions using the all_questions
page for a topic. There’s always the possibility of mis-tagged questions, so these steps aren’t perfect, but they’re a good prelude to manual classification.
Just as anyone can post a question or answer on Quora, anyone can create and manipulate Quora topics, except if an administrator has locked the topic in question. For unlocked topics, cleanup tasks include merging duplicates, fixing misspellings, and fixing the topic ontology (the graph that relates topics to each other).
Because topics have a wider scope than individual questions, administrators have locked many popular topics. This means regular users can’t change them. The only option for correcting problems is to make a request through the Topic Gnomery blog and hope that someone with the right permissions has time to make the change.
A key to creating a good ontology is writing effective topic definitions. As I was collecting topics, I created the Competitive Programming Wiki as a place for topic definitions and later for the FAQ itself.
One last thing to know about Quora is its strategy for maintaining content quality. It’s not the same strategy used by Wikipedia, which has notability criteria and engaged editors, or Stack Overflow, which has a core set of users who help the official moderators enforce site rules. Instead, Quora’s plan appears to be accept all content, delete only the most egregious violations of policy, and rely on algorithms to show the best quality content to users.
Question Classification
To keep track of 17,000+ Quora questions, I needed a way to classify questions for retrieval and inclusion in the FAQ. I settled on two types of classification:
- A primary category, distinct from Quora’s topics (which are often unreliable). The primary category specifies the main competitive programming topic covered by the question. As I classified more questions, I updated my category definitions based on how I was using them during classification.
- A canonical title. This is as clear an expression of the question as I can write. Questions with the same canonical title are often merge candidates, but not always.
So in order of increasing specificity, we can classify competitive programming questions using competitive programming 🡆 primary category 🡆 canonical title 🡆 question title. A FAQ page contains multiple questions grouped under a single canonical title. For example: How do I prepare for (contest)?
I experimented with a few ways to classify, store, and retrieve questions, including:
- A SQL Server question database that allowed me to analyze the question data using SELECT queries.
- Document classification tools that I trained with manually classified questions to see if they could predict the primary category for unclassified questions.
- A simple tool called QuoraClassifier that I created to help me quickly pick primary categories and canonical titles for unclassified questions.
Ultimately, I settled on QuoraClassifier as the best option. The total number of competitive programming questions on Quora is large, but not so large that I couldn’t classify them manually. The tool made recording the classification as simple as possible, which allowed me to focus on reading the questions and deciding how I should classify them.
Question Merging
One side-effect of classifying thousands of Quora questions is being exposed to the duplicate question problem. Quora support question merging, but it’s not as simple a process as it might first appear to be. Quora has a merge policy, but it is necessarily general. Actual merge rules are defined by Quora bots, Quora employees, and Quora users, with the first two groups having the most power. To help with merging, I came up with a merge process that documents things to look for when merging, and a merge tracking tool for verifying that previous merges haven’t been reverted by a Quora user or the Quora Content Review bot (which often reverts merges).
It’s worth repeating why questions should be merged at all. As Quora has stated in official blog posts, merging duplicate questions into a single canonical question gives writers one place to write answers to that question, and gives readers one place to read them. CPFAQ articles can have even better canonical titles than Quora questions, since they can take advantage of the experience provided by multiple people writing question titles expressing what they want to know about a topic. Even when a question is poorly written, Quora answer writers can often deduce what the question writer is asking, and write a useful answer. The FAQ can then include a link to the question in the article with the relevant canonical title.
Patterns in Question Titles
Besides classification, I also used my question title database to see how people write questions. I analyzed the question text by start word (e.g., How, What, Is, Why, I, which, Can Where), start phrase (e.g., How do I, How do I solve, How can I, How should I, How do you, What is the, What are the, What are some, What is the best, Where can I, Is there any), and overall word frequency (e.g., programming, problem(s), competitive, solve, code).
One way to use information about Quora question titles is for writing better canonical titles. The same question can be written using different starting phrases — for example, I would consider How do I prepare for ACM-ICPC?, How can I prepare for ACM-ICPC?, How should I prepare or ACM-ICPC?, and other similar variations to represent the same question, despite the subtle differences in wording. One way to improve canonical title consistency is to pick the best starting phrase to express a question type and use it consistently, even if other starting phrases would also be appropriate. This helps the reader focus on the unique part of the canonical title.
The overall word frequency results can also help with title consistency, for example by highlighting misspellings, which will have lower frequencies than correct spellings.
At two points during the year, I counted how many Quora questions were associated with each canonical question that I wrote. In both cases, the pattern showed a large number of Quora questions (about 25% of the total) associated with the top 10 canonical titles, and a long tail of less popular questions. This count helped prioritize which questions I should add to the FAQ first.
Quora question authors often include multiple sub-questions in their question titles, which makes merging difficult since the more specific a question is, the harder it is to find a close match. I analyzed my question database to find questions with the most sub-questions.
I noticed in some of my tool output that question titles had some strange characters in them, because of how Unicode characters are used in Quora question titles. The solution was for my tools to enforce a single character for each typographical purpose (e.g., left double quote).
Using MediaWiki
I chose MediaWiki as the content management system for hosting the CPFAQ and related content. Page types in the Wiki include FAQ pages with links to Quora questions, category pages for organizing related questions, and other pages created as necessary.
One of the other pages I created is a glossary of competitive programming terms. It’s useful to have a single page of short definitions to refer to when reading Quora and other sources. I based the page design on Wikipedia glossaries, and making it display correctly required updates to my MediaWiki installation.
Where other appropriate pages exist, I don’t plan to duplicate them in the CPWiki. For example, there was an entry in Wikipedia for the Codeforces online judge that I undeleted and updated. A few months later someone deleted it again, and recently someone else recreated it. So we’ll see where it ends up in the long term. If it doesn’t survive on Wikipedia, I may host it on CPWiki.
FAQs
In the last weeks of the year, I wrote answers to some popular questions from my question database:
- Will Competitive Programming Success Help Me Get a Programming Job?
- Is Competitive Programming Only For Smart People?
- How Long Should I Work on a Programming Problem Before Looking at the Answer?
- How Did X Become a Top Competitive Programmer? (Part 1, Part 2)
- How Do Coding Interviews Differ from Coding Contests?
And for answers directly from an expert, here’s an edited version of a classic Topcoder chat with Petr Mitrichev.
(Image credit: Hiking this week in the hills of New Mexico)