I’m working on a project this year to build a competitive programming FAQ. This is one in a series of articles describing the research, writing, and tool creation process. To read the whole series, see my CPFAQ category page.
So far this year, I’ve been building tools that operate on text files in tab-separate value (TSV) format. The advantage of this format is that it’s easy to read from and write to in code, and it imports directly into Excel for manual processing. For example, last week I worked on a TSV file in which each line contains one Quora question title, link, and follower count. I extracted this information automatically using one of my tools. I then imported the TSV file into Excel so I could manually edit each row to add a canonical question and set of tags.
As I classify each question, I find it useful to review how previous questions are classified. That helps ensure that question classification is consistent. For example, I classified one question as follows:
- Title: What are some good questions on CodeChef from which I will learn more algorithms?
- Canonical title: What are some good competitive programming problems?
I later came across the question What are some must-do problems on Codeforces?, which I thought should have the same classification. So I looked through the current list of classified questions to copy the classification decisions I made earlier.
If I only had a few questions classified, it would be easy enough to scan through the list and find a similar one. But as the list grows longer, that becomes impractical. So I decided this week that it’s time to upgrade my storage technology.
But the goal of Webliographer is to produce an artifact (the FAQ), so I’m putting less emphasis on the design purity of the tool itself. With that in mind, I’m choosing a different data access and storage mechanism.
Here’s the strategy:
- Design a simple data model that is reasonably normalized (avoids data duplication), while allowing new tables and columns to be added as needed.
- Write SELECT queries to extract data in a way that makes it easy to export to TSV and edit in Excel.
- Update my C# web data extraction tools to take as input the TSV files that they currently process, and INSERT or UPDATE the database tables.
The goal is to avoid spending too much time fiddling with UI or data access code, and concentrate on getting the data into a useful format for producing a FAQ.
For my database, I decided to use SQL Server 2017. It has a full-featured Developer Edition that is free for non-production use, and the SQL Server Management Studio tool makes it easy to experiment with ad-hoc queries. Not surprisingly, data access from a C# app is also simple.
The initial version of the data model has the following tables, based on the classification process from last week:
- Tag stores a name and description for each question tag.
- Question stores the text, URL, and number of followers for each question.
- CanonicalQuestion stores the question text for each canonical question.
The tables are related as follows:
- Many tags to many questions: A question has one or more tags. Each tag is associated with one or more questions.
- Many question to one canonical question: A question is associated with exactly one canonical question. Each canonical question groups together one or more questions.
Using this initial data model, I’ll import a set of questions and start building a library of queries to help classify them.