CPFAQ: Making a Wiki Look Like Wikipedia

Lua

I’m working on a project this year to build a competitive programming FAQ. This is one in a series of articles describing the research, writing, and tool creation process. To read the whole series, see my CPFAQ category page.

I mentioned last week that I was creating a glossary of competitive programming terms, in the format used by glossaries on Wikipedia. This week, I made the necessary changes to CPWiki to properly render the glossary.

CPFAQ: Defining Competitive Programming Terms

Dictionary

It would be useful to have a page in the FAQ for a glossary of competitive programming terms. The Q&A part of the FAQ and the associated wiki discuss terms in detail, but a glossary provides an easy way to look up short definitions of terms that appear in questions and answers. This week, I started to collect a list of terms.

CPFAQ: Unicode in Quora Question Titles

Quotation Marks

As I have mentioned in the past, I often use Excel as a quick way to manipulate tables of data, even when that data doesn’t involve numbers and formulas. My Quora tools output data in TSV format, which is easy to import into Excel. But I noticed when importing those files that some question titles have strange characters mixed in with the valid ones, due to an encoding issue. I have been ignoring it until now, but I’d like to fix it.

CPFAQ: Multi-Question Questions

Multiple Shreks

When you’re logged into Quora, there’s a link at the top of your question feed that says: “What is your question?” Some people interpret this to mean: “What are your questions?” So we end up with questions like this one: What is ACM-ICPC? Is it necessary to have a team to participate in ACM?

As with most simple questions, the two sub-questions that make up this question have been asked repeatedly on Quora. For example:

So it would be useful to use question merging to clean things up. But we can’t merge one question into two separate questions. Quora policy in this situation suggests choosing the more specific question as a merge target. In this example, that would be the second sub-question, Is it necessary to have a team to participate in ACM? But this is a compromise. There will be a mismatch between the merged question and the answers, since some of the answers will address both sub-questions.

Having seen many of these multi-question questions in the CPFAQ question set, I decided to analyze my question title data to find out more about them.

CPFAQ: Quora Merge Tracker

Merge

The Competitive Programming topic on Quora, and related topics, contain thousands of examples of what people want to know about that subject. So they are the definitive source for deciding what qualifies as a frequently asked question for CPFAQ. But many of these questions are duplicates, which makes it difficult to find the best answers to a question. As I mentioned last week, I have a process for merging some of these duplicates, but Quora automation often works against the process, despite Quora’s stated opposition to duplicate questions. This week, I worked on a basic tool to help me keep track of merges.

CPFAQ: Merging Questions, Part 2

Merge

One result of classifying many Quora questions is finding many duplicates. Quora knows about this and provides a Merge function. But as I have written about before, there’s also a content review bot that unmerges questions it thinks are not similar enough. I did some more investigation into this bot’s behavior, which I’ll describe this week.

CPFAQ: Question Categories, Part 2

Books

Using my QuoraClassifier tool, I’ve gotten about 25% of the way through my question list. So I thought it was the right time to revisit the criteria I’m using to assign primary categories.

CPFAQ: Fast Classification, Part 2

Quora classifier

I’m writing a tool called QuoraClassifier to speed up the process of organizing the list of Quora questions I’ve collected. Last week, I used an early version of the tool that let me select a primary category for each question with a single keypress. This week, I added a few enhancements to allow for more specific classification.

CPFAQ: Fast Classification

QuoraClassifier

Over the past few months, I’ve been classifying a list of Quora questions. Each question gets a primary category, the topic it is most relevant to. Classification helps organize the FAQ and also allows me to find duplicates to merge on Quora.

Until now, I’ve been classifying questions using a spreadsheet containing one question per row and a column for the primary category. But now that I’m familiar with the types of questions people ask, it only takes a few seconds to decide on a category, so the spreadsheet approach is slowing me down. This week, I wrote a simple classification program, QuoraClassifier, to optimize this process.

QuoraClassifier

The goal of QuoraClassifier is to make it as fast as possible to record a single decision: the primary category for a question. Recording the decision shouldn’t take much longer than deciding on the category.

QuoraClassifier is a simple WPF app. The UI language for WPF programs is XAML, which I used in last year’s project.

Like most of the tools I’ve built for the CPFAQ project, QuoraClassifier reads and writes tab-separated value (TSV) files. When it starts up, it reads a list of question data (titles, URLs, categories, etc.), which it keeps in memory. To save the classification results, it writes a TSV file in the same format. Since this is a quick and dirty tool, not a thoroughly tested application, it never overwrites the input file. Instead, it writes to a separate output file, which reduces the risk of data loss and lets me compare the two files to verify that the changes are what I expect. If everything looks good, I can copy the output file onto the input file before the next classification session.
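The two-file approach can be sketched roughly as follows. This is an illustration in Python, not the tool's actual WPF code, and the column names (title, url, category) and function names are assumptions made up for the example.

```python
import csv

def load_questions(path):
    """Read a TSV file with a header row into a list of dicts (kept in memory)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t"))

def save_questions(path, rows, fieldnames):
    """Write rows to a separate TSV file, leaving the input file untouched."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)

def changed_rows(before, after):
    """Return (old, new) pairs for rows that differ between the two files."""
    return [(old, new) for old, new in zip(before, after) if old != new]
```

Comparing the loaded input against the reloaded output with changed_rows gives a quick check that only the expected category cells changed before the output is copied over the input.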

Once the input data loads, QuoraClassifier displays its single screen. The purpose of this UI is to 1) display information about a question and 2) accept a classification choice. For maximum efficiency, the program lets me select a category using a single keystroke, at which point it displays the next question.

As shown in the screenshot at the top of this post, the UI includes the following fields:

  • Previous category: The category assigned to the previous question. Since the program moves to the next question as soon as it receives the classification choice for the current question, this field provides a way for me to check that I classified the previous question the way I wanted.
  • Question title: The question title (from Quora).
  • Question URL: The Quora URL (not clickable).
  • View in Browser: A clickable version of the URL. This is a quick way to open the Quora question page, for cases when the title doesn’t provide enough information to classify the question.
  • Primary category: The primary category classification for the current question. Blank if it still needs classification.
  • Canonical title: A combo box containing a list of canonical titles that I have previously assigned to questions in this category, sorted in descending order by how often I used them. I haven’t implemented this yet, but I plan to use it to assign canonical titles once I’m done assigning categories.
  • Statistics: Total number of questions; number of questions with primary categories assigned; number of questions with canonical titles.
  • Last saved timestamp: When the output file was last written.
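As a rough illustration of how the Statistics and Canonical title fields could be computed, here is a Python sketch. The canonical title list is not implemented in the tool yet, so this is only one possible approach, and the dictionary keys (category, canonical_title) are hypothetical names.

```python
from collections import Counter

def canonical_titles_by_usage(questions, category):
    """List canonical titles already used in a category, most frequent first."""
    counts = Counter(
        q["canonical_title"]
        for q in questions
        if q["category"] == category and q["canonical_title"]
    )
    return [title for title, _ in counts.most_common()]

def statistics(questions):
    """Count total questions and how many have each field filled in."""
    return {
        "total": len(questions),
        "categorized": sum(1 for q in questions if q["category"]),
        "with_canonical_title": sum(1 for q in questions if q["canonical_title"]),
    }
```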

Instant Classification

To get through the questions in my list, I need to optimize the classification process. There’s a lower limit to how fast I can read a question title and decide how to classify it. For questions related to competitive programming, I’m probably already at the lower limit. So the only way to speed up the process is to optimize the act of recording a classification.

I decided that the fastest way to classify a question is to press a single alphabetic key on the keyboard. Clicking with the mouse or tapping the screen is intuitive for many applications, but for speed, it’s hard to beat a keyboard interface. I assigned a unique letter to each category name. Since multiple categories start with the same letter, I couldn’t always use the first letter of the category. But it doesn’t take long to learn the letter associations: for example, P for People and R for PRoblems. An alternative, like pressing P once for People and twice for Problems, would just slow things down.

Here’s the category list:

  • A: Algorithms and data structures for competitive programming
  • B: Getting Better at competitive programming
  • C: Programming Contests
  • E: Exclude this question from the list (for off-topic questions)
  • G: General questions about competitive programming
  • H: MatH for competitive programming
  • I: Interviews and jobs in programming fields
  • K: BooKs about competitive programming
  • L: Programming Languages for competitive programming
  • M: Time Management for competitive programming practice
  • N: OrganizatioNs that competitive programmers join
  • O: Online judges
  • P: People who practice competitive programming (a.k.a. Competitive Programmers)
  • R: Competitive programming PRoblems (programming puzzles)
  • S: Getting Started with competitive programming
  • T: Tools for competitive programming
  • U: CoUrses for learning competitive programming
  • V: Competitive programming Vs. other types of programming (professional, hobbyist, academic, etc.)
  • W: Websites for studying competitive programming (other than online judge sites)
  • Y: Competitive programming coaches
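The single-keystroke idea amounts to a lookup table from letters to categories. The Python sketch below shows only a few entries: P and R come from the text above, while H and K are inferred from the capitalized letters in the list and should be treated as assumptions. The function name and the short category labels are also made up for illustration.

```python
# Partial key map; letters other than P and R are assumptions.
CATEGORY_KEYS = {
    "P": "People",
    "R": "Problems",
    "H": "Math",
    "K": "Books",
}

def classify(questions, index, key):
    """Record the category for questions[index] and return the next index.

    Keys with no assigned category are ignored and the index does not move.
    """
    category = CATEGORY_KEYS.get(key.upper())
    if category is None:
        return index
    questions[index]["category"] = category
    # Advance immediately, but stay on the last question at the end of the list.
    return min(index + 1, len(questions) - 1)
```

Because each letter maps to exactly one category, one keypress is enough to both record the decision and advance to the next question.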

And a few more keyboard commands:

  • Ctrl-F (Find): Navigate to the next uncategorized question.
  • Ctrl-S (Save): Write the output TSV.
  • Right/Left Arrow: Navigate to the next/previous question without changing a classification.
  • Spacebar: Navigate to the next question without changing a classification.
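The navigation commands can be sketched in the same spirit. This Python fragment is a guess at the logic, not the tool's code: it assumes questions are dictionaries with a category key that is blank until classified, and that Find stays on the current question when nothing uncategorized remains.

```python
def next_uncategorized(questions, current):
    """Ctrl-F: index of the first question after `current` with a blank category.

    Returns `current` unchanged if every later question is already categorized.
    """
    for i in range(current + 1, len(questions)):
        if not questions[i]["category"]:
            return i
    return current

def step(questions, current, delta):
    """Arrow keys or Spacebar: move by delta (+1 or -1), clamped to the list bounds."""
    return max(0, min(len(questions) - 1, current + delta))
```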

With these classification commands, I can classify a question in an average of 6.7 seconds. That’s about 30 hours for the ~16,000 questions in my list, or less than a month of calendar time at an hour or two per day.
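The estimate is easy to check. The short Python calculation below uses the 6.7-second average and the approximate 16,000-question count from the paragraph above; the daily session lengths are just the "hour or two" range.

```python
SECONDS_PER_QUESTION = 6.7
QUESTION_COUNT = 16_000  # approximate size of the question list

total_hours = SECONDS_PER_QUESTION * QUESTION_COUNT / 3600
print(f"{total_hours:.1f} hours")  # 29.8 hours

for hours_per_day in (1, 1.5, 2):
    days = total_hours / hours_per_day
    print(f"{hours_per_day} h/day -> {days:.0f} days")  # 30, 20, and 15 days
```

At one hour per day the work takes about a month, and at two hours per day about half that.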

CPFAQ: Scraping with Selenium

Selenium

When you’re logged in to Quora, you see more information than an anonymous user does. For example, on the all_questions page for a topic, logged-in users see a title for each question along with how many answers it has, when it was last followed or requested, how many followers it has, and various available actions. Anonymous users just see the question titles.

When I started collecting Quora questions for the FAQ, I noticed this discrepancy between the anonymous and logged-in experiences. To collect as much information as possible, I often manually saved pages while logged in, and then ran my tools on the saved HTML. But for individual question pages this wasn’t practical, since I’m tracking over 15,000 questions. For those, I wrote a program to download pages automatically. And since that program did not log in, some useful information was not available.

It would be ideal to combine the convenience of automation with the extra data provided to logged-in users. This week, I experimented with using the Selenium testing framework to achieve this. It turned out to be a simple process.
