Red-Green-Code

Deliberate practice techniques for software developers

  • Home
  • About
  • Contact
  • Project 462
  • CP FAQ
  • Newsletter

CPFAQ: Document Classification

By Duncan Smith Jun 13 0

MonkeyLearn

I’m working on a project this year to build a competitive programming FAQ. This is one in a series of articles describing the research, writing, and tool creation process. To read the whole series, see my CPFAQ category page.

To organize my list of Quora questions, I have started giving each one a primary category that indicates what it is primarily about. For example, the primary category for How can I sharpen my mathematical skills in the context of competitive programming? is Mathematics (with Competitive Programming as the implicit overall topic for all questions).

On Quora, categories are known as topics, and they are assigned to questions by (1) Quora users, and (2) the Quora Topic Bot (QTB), an automated process. But there’s a lot of inaccuracy in topic assignments. For topics assigned by users, there are a few contributors to inaccuracy: First, most question askers don’t think much about correct topic assignment. They are just trying to get their question answered. Secondly, they often just spam the question with as many topics as possible because they think it will increase the probability of it being answered. For topics assigned by QTB, the main problem is that machine learning algorithms still aren’t perfect at assigning topics, and they can be misled by users’ topic assignment behavior.

Using a set of Quora questions that I categorized myself, I thought it would be interesting to see what kind of auto-categorization results I could get using some free text classifiers.

Document Classification

Document classification, or text classification, is the general term for what I described above. More specifically, I’m looking at supervised automatic document classification, where the document is a Quora question title, automatic means I’m using a classification algorithm, and supervised means I’ll be telling the algorithm what the classification should be for some questions, and then seeing how it performs on other questions.

There are many tools that do automated document classification. I tried out two of them, one commercial and one open-source.

MonkeyLearn

MonkeyLearn is a commercial online machine learning system that provides customer classifiers that can be trained with user data.

Creating a classifier using MonkeyLearn is a fairly simple process. To train the classifier on my Quora question data, I used their manual training process: I uploaded my list of about 500 questions, and then the system prompted me to associate one or more categories with each question. After a few rounds of categorization, the classifier started to suggest categories for each question, which I could accept or modify. When I finished categorizing about 70 questions, the system said the classifier was sufficiently trained and I could either test it or keep classifying.

Even with only a few questions classified, the system started to make some good recommendations. Here are some of its results using a training set of 70 questions:

  • How should you prepare for Google Code Jam? (Contests)
  • How does one start with TopCoder? (Getting Started)
  • What are the standard puzzles asked in programming interviews? (Interviews and Jobs)

However, the classifier can only work with the data it has been trained with. Here are some examples where the training data seems to be misleading it:

  • What is it like to win IOI? (classified as Problems; should be Contests)
  • How do I progress from blue to yellow on TopCoder? (classified as Problems; should be Getting Better or Online Judges)
  • What should I do to improve my CodeForces rating? (classified as Interviews and Jobs; should be Getting Better or Online Judges)

Despite promising early results, the main problem with MonkeyLearn for my project is cost. For the free account plan, classification is limited to 300 queries per month and the classifier can only have 300 texts. With these restrictions, I would barely be able to get started classifying my master question list of 16.5k questions.

Increasing the query allotment and model size requires a paid plan. But the lowest paid MonkeyLearn tier is $299 per month, so it’s clearly targeted only at businesses that somehow derive income from document classification.

TextBlob

To avoid the restrictions of a commercial web service, I tried out an open-source text processing system called TextBlob, which I ran on my local machine. Using a simple classification tutorial from the TextBlob documentation, I was able to quickly import my entire list of 500 manually-classified questions (in TSV format) using the following Python commands:

from textblob.classifiers import NaiveBayesClassifier
with open('train.tsv', 'r', encoding='utf8') as fp:
    cl = NaiveBayesClassifier(fp, format="tsv")

The classifier can then be called to classify any input text. For example:

>>> cl.classify("What are some good ways to approach
    a dynamic programming question?")
'Algorithms and data structures'

One interesting test result is that the classifier doesn’t always product the correct classification even when it is given the exact text that was used to train it, since it relies on the training set as a whole. For example, I manually classified “How can I get good at dynamic programming?” as “Algorithms and data structures.” But the classifier classified it as “Interviews and Jobs.” Since I classified dynamic programming-related questions in multiple categories depending on the context (e.g., “What are the top 10 most popular dynamic programming problems among interviewers?” –> “Interview and Jobs”), the classifier presumably found the question more similar to Interviews and Jobs questions than to Algorithms and data structures questions.

When I ran the classifier on my entire training set, it attained 69.5% accuracy when compared with the training data. Since I used 20 categories, this is much better than random chance. But it would still require manual fixing if I wanted to use the output directly for CPFAQ. It will be interesting to see how much additional training is required to get the accuracy into the >90% range. One possibility to increase accuracy is to include answer text in the training set. This would provide much more text for the classifier to work with.

Due to the limitations of the free MonkeyLearn plan, I can’t run a similar test on my entire training set. However, for the single example question discussed above, MonkeyLearn did correctly classify it as Algorithms and data structures.

Next Steps

Text classification is a big area of study that I don’t want to get sidetracked with. However, the TextBlob results are promising enough after a minimal time investment that I may continue testing it with manual training data to see how its accuracy improves.

Categories: CPFAQ

Prev
Next

Stay in the Know

I'm trying out the latest learning techniques on software development concepts, and writing about what works best. Sound interesting? Subscribe to my free newsletter to keep up to date. Learn More
Unsubscribing is easy, and I'll keep your email address private.

Getting Started

Are you new here? Check out my review posts for a tour of the archives:

  • 2023 in Review: 50 LeetCode Tips
  • 2022 in Review: Content Bots
  • 2021 in Review: Thoughts on Solving Programming Puzzles
  • Lessons from the 2020 LeetCode Monthly Challenges
  • 2019 in Review
  • Competitive Programming Frequently Asked Questions: 2018 In Review
  • What I Learned Working On Time Tortoise in 2017
  • 2016 in Review
  • 2015 in Review
  • 2015 Summer Review

Archives

Recent Posts

  • Do Coding Bots Mean the End of Coding Interviews? December 31, 2024
  • Another Project for 2024 May 8, 2024
  • Dynamic Programming Wrap-Up May 1, 2024
  • LeetCode 91: Decode Ways April 24, 2024
  • LeetCode 70: Climbing Stairs April 17, 2024
  • LeetCode 221: Maximal Square April 10, 2024
  • Using Dynamic Programming for Maximum Product Subarray April 3, 2024
  • LeetCode 62: Unique Paths March 27, 2024
  • LeetCode 416: Partition Equal Subset Sum March 20, 2024
  • LeetCode 1143: Longest Common Subsequence March 13, 2024
Red-Green-Code
  • Home
  • About
  • Contact
  • Project 462
  • CP FAQ
  • Newsletter
Copyright © 2025 Duncan Smith