Red-Green-Code

Deliberate practice techniques for software developers

CPFAQ: Fast Classification, Part 2

By Duncan Smith, Aug 29

I’m working on a project this year to build a competitive programming FAQ. This is one in a series of articles describing the research, writing, and tool creation process. To read the whole series, see my CPFAQ category page.

I’m writing a tool called QuoraClassifier to speed up the process of organizing the list of Quora questions I’ve collected. Last week, I used an early version of the tool that let me select a primary category for each question with a single keypress. This week, I added a few enhancements to allow for more specific classification.

Secondary Category

Consider these two questions:

  • How can I start preparing notes for Data Structures and Algorithms?
  • What are the algorithms required to solve all problems (using C++) in any competitive coding contest?

Using my classification process, I have assigned both questions the primary category algorithms and data structures. But to locate these questions in the FAQ, I need more than just that category. Although both questions ask about algorithms, the first one is about study techniques, while the second one is asking for a syllabus.

To capture this additional information, I created a secondary category field. The secondary category uses the same list of descriptors as the primary category, but it refers to a more specific aspect of each question. In this example, the first question has a secondary category of getting better, and the second question has contests.
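As a rough illustration of the two fields, here is a minimal Python sketch. It is not QuoraClassifier's actual code: the `Question` class, its field names, and the three-entry `CATEGORIES` list are assumptions made for this example, using only the category names mentioned above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical category list: only these three names appear in the article.
CATEGORIES = ["algorithms and data structures", "getting better", "contests"]


@dataclass
class Question:
    """One Quora question with its two category fields (illustrative names)."""
    title: str
    primary: Optional[str] = None
    secondary: Optional[str] = None

    def assign(self, primary: str, secondary: Optional[str] = None) -> None:
        # Both fields draw from the same list of descriptors.
        for category in (primary, secondary):
            if category is not None and category not in CATEGORIES:
                raise ValueError(f"unknown category: {category}")
        self.primary = primary
        self.secondary = secondary


q1 = Question("How can I start preparing notes for Data Structures and Algorithms?")
q1.assign("algorithms and data structures", "getting better")

q2 = Question(
    "What are the algorithms required to solve all problems (using C++) "
    "in any competitive coding contest?"
)
q2.assign("algorithms and data structures", "contests")
```

Both questions share a primary category, and the secondary field is what tells them apart.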

On Quora, users can associate one question with many topics. If someone is following one of these topics, then they might see the question. My categories have a different purpose. The idea is to progressively narrow down the subject of the question: start with the most relevant topic (the primary category), then associate the question with a group of related questions (those with the same canonical title), and finally arrive at the question title itself, the most specific identifier. Given this goal, there is a diminishing return to including many categories. I think two will be sufficient.

One consideration when assigning categories using QuoraClassifier is whether to assign both categories when I first look at a question, or to assign the primary category for every question in the first pass, and then go back through the whole list again to assign the secondary category. I tried the former approach this week, and I think it’s less efficient. The reason is that it requires switching between two contexts (primary and secondary category) for every question. Another way to describe it: if I have 20 categories, then I have to keep in mind $20 \times 20 = 400$ category pairings. If I only consider the primary category or the secondary category individually, I only have to keep track of 20 options at a time.

So I think the better of the two approaches is to categorize every question with a primary category and then go through each set of questions in a category and assign the secondary categories. For example, take all the algorithms and data structures questions and decide what else each of those questions is about.
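The two-pass idea can be sketched in Python as follows. This is an illustration under assumptions, not the tool's implementation: the function names and the `(title, primary)` tuple format are hypothetical. The first function groups questions by primary category so the second pass can work through one group at a time, and the second function reproduces the arithmetic above.

```python
from collections import defaultdict


def group_by_primary(questions):
    """Group (title, primary_category) pairs produced by the first pass.

    The second pass can then walk one group at a time and choose a
    secondary category for each question in that group.
    """
    groups = defaultdict(list)
    for title, primary in questions:
        groups[primary].append(title)
    return dict(groups)


def options_in_view(num_categories, both_at_once):
    """Count the choices to keep in mind per question.

    Assigning both categories at once means considering every
    (primary, secondary) pairing; one field at a time means
    considering only the category list itself.
    """
    if both_at_once:
        return num_categories * num_categories
    return num_categories


first_pass = [
    ("How can I start preparing notes for Data Structures and Algorithms?",
     "algorithms and data structures"),
    ("What are the algorithms required to solve all problems (using C++) "
     "in any competitive coding contest?",
     "algorithms and data structures"),
]
groups = group_by_primary(first_pass)
```

With 20 categories, `options_in_view(20, both_at_once=True)` gives 400, while `options_in_view(20, both_at_once=False)` gives 20.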

Canonical Title

The most specific level of categorization, other than the Quora question title itself, is the canonical title. A single category can have many more canonical titles than there are categories in total. (For example, I already have 62 unique canonical titles for algorithms and data structures, and there are many more questions to classify.) So following the argument described above, it wouldn’t make sense to assign a canonical title at the same time as a primary category.

But a few canonical titles are so common that it requires no special effort to associate them with a question. For example, many questions ask some form of “How do I get started with competitive programming?” When I see one of those questions, I’d like to assign that title right away rather than having to find it later. Because the relevant canonical title is so obvious, it doesn’t slow down the classification process much.

To accomplish this with QuoraClassifier, I added a Canonical Title field. The canonical titles appear in a drop-down box, sorted in descending order by popularity. So the ones I want to assign immediately are likely to be at the top of the list, where they are quick to access. Due to the number of titles, it isn’t practical to assign each of them a letter as I do with categories. But I’m considering assigning a hotkey to each of the top few titles, to make it even faster to assign them to questions.
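Here is a small Python sketch of the popularity ordering and the hotkey idea. It is a guess at one way to do it, not the tool's actual code: the function names, the use of digit keys, the alphabetical tie-break, and every sample title except the getting-started one are assumptions for illustration.

```python
from collections import Counter


def titles_by_popularity(assigned_titles):
    """Return unique canonical titles, most frequently assigned first.

    Ties are broken alphabetically so the drop-down order is stable.
    """
    counts = Counter(assigned_titles)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ranked]


def hotkeys_for_top(titles, n=3):
    """Map the digit keys "1".."n" (a hypothetical choice) to the top n titles."""
    return {str(i + 1): title for i, title in enumerate(titles[:n])}


# Sample assignments; only the first title comes from the article.
assigned = [
    "How do I get started with competitive programming?",
    "Hypothetical title B",
    "How do I get started with competitive programming?",
    "Hypothetical title C",
    "How do I get started with competitive programming?",
    "Hypothetical title B",
]
dropdown = titles_by_popularity(assigned)
hotkeys = hotkeys_for_top(dropdown, n=2)
```

Recomputing the order as titles are assigned would keep the most common ones at the top of the list, and the hotkey map covers only the first few entries, leaving the rest to the drop-down.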

Categories: CPFAQ

Copyright © 2021 Duncan Smith