CPFAQ: Building a Webliography Tool


Last week I proposed, as a project for this year, a competitive programming FAQ (CPFAQ). As a first step, I suggested a tool to facilitate research on the Web. This week I’m starting to design that tool.

Webliographer

A webliography is a list of web resources, just as a bibliography lists sources printed on paper. So if a bibliographer creates bibliographies, what we need to manage Web research is a webliographer (though this word isn’t in the dictionary yet). The type of webliographer I have in mind is a software tool, though the word could also refer to the person using the tool.

Consider a webliographer tool that does the following:

Import a list of web references from a tab-separated values (TSV) file

Let’s assume that we start with a TSV file containing web references. TSV is a good format because it’s simple, standard (supported by other programs), and allows commas within fields. In the simplest case, a TSV could be created manually by entering a list of references into a spreadsheet and exporting to TSV. Or the webliographer tool itself could extract references from another source (like a web page), and create an output TSV.

Here’s a preliminary list of columns to capture in the TSV:

  • URL/Link: This is the primary key, meaning that it uniquely identifies a row in our dataset. The URL points to a single web reference. Example URL: https://www.quora.com/topic/Competitive-Programming.

  • Domain: This is a way to identify the set of URLs that belong to the same site. URLs from a single domain might benefit from specialized processing that knows about the format of pages in that domain. Example domain: www.quora.com.

  • Rank: An indicator of the popularity of a link, where a lower number means a more popular link. This helps distinguish between higher- and lower-quality links. Example: a link with rank 10 is more popular than one with rank 15.

  • Sources: Where the link came from (who points to it). A link will often have multiple sources. Example sources: Google Search, Quora answer.

  • Title: The title of the page (from the HTML source). Example title: Competitive Programming – Quora.

  • Tag list: A way to categorize pages. Example tag list: competitive-programming, q-and-a.

  • Content summary: A summary of the page. This becomes useful when writing the FAQ or Wiki entry that uses this page as a reference.
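To make the import step concrete, here is a rough Python sketch of reading such a file. The column header names (url, domain, rank, sources, title, tags, summary), the Reference class, and the choice to store sources and tags as comma-separated lists inside a single cell are all my own assumptions for illustration, not a settled format.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Reference:
    """One row of the webliography (field names are assumptions)."""
    url: str
    domain: str
    rank: int
    sources: list[str] = field(default_factory=list)
    title: str = ""
    tags: list[str] = field(default_factory=list)
    summary: str = ""


def split_list(cell: str) -> list[str]:
    """Split a comma-separated cell into trimmed, non-empty items."""
    return [item.strip() for item in cell.split(",") if item.strip()]


def load_references(tsv_file) -> dict[str, Reference]:
    """Read TSV rows into Reference objects keyed by URL (the primary key)."""
    references = {}
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        url = row["url"]
        if url in references:
            # The URL must uniquely identify a row.
            raise ValueError(f"duplicate URL: {url}")
        references[url] = Reference(
            url=url,
            domain=row["domain"],
            rank=int(row["rank"]),
            sources=split_list(row["sources"]),
            title=row["title"],
            tags=split_list(row["tags"]),
            summary=row["summary"],
        )
    return references


# A one-row sample built from the example values listed above.
SAMPLE_TSV = (
    "url\tdomain\trank\tsources\ttitle\ttags\tsummary\n"
    "https://www.quora.com/topic/Competitive-Programming\twww.quora.com\t10\t"
    "Google Search, Quora answer\tCompetitive Programming – Quora\t"
    "competitive-programming, q-and-a\t\n"
)

refs = load_references(io.StringIO(SAMPLE_TSV))
```

Keying the result by URL gives a cheap check that the primary key really is unique, which seems worth having if the file is assembled by hand in a spreadsheet.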

Display results in response to queries

Once the data is imported, it can be stored in a database and queried. I expect tags to be useful in this step. Example query: display all references tagged competitive-programming and java.
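As a sketch of what the example query might look like, here is one possible approach using SQLite from Python. The table names (refs, ref_tags), the one-row-per-tag layout, and the example.com rows are assumptions I made up for illustration; a different database or schema would work just as well.

```python
import sqlite3

# In-memory database with an assumed two-table schema:
# one row per reference, and one row per (reference, tag) pair.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE refs (url TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE ref_tags (
        url TEXT REFERENCES refs(url),
        tag TEXT,
        PRIMARY KEY (url, tag)
    );
""")

# Sample rows; the example.com entries are invented for this sketch.
conn.executemany("INSERT INTO refs VALUES (?, ?)", [
    ("https://www.quora.com/topic/Competitive-Programming",
     "Competitive Programming – Quora"),
    ("https://example.com/java-contest-tips", "Java contest tips"),
    ("https://example.com/java-basics", "Java basics"),
])
conn.executemany("INSERT INTO ref_tags VALUES (?, ?)", [
    ("https://www.quora.com/topic/Competitive-Programming", "competitive-programming"),
    ("https://www.quora.com/topic/Competitive-Programming", "q-and-a"),
    ("https://example.com/java-contest-tips", "competitive-programming"),
    ("https://example.com/java-contest-tips", "java"),
    ("https://example.com/java-basics", "java"),
])


def find_by_tags(conn, tags):
    """Return (url, title) for references carrying ALL of the given tags.

    Assumes at least one tag is given.
    """
    wanted = sorted(set(tags))
    placeholders = ", ".join("?" for _ in wanted)
    query = f"""
        SELECT r.url, r.title
        FROM refs AS r
        JOIN ref_tags AS t ON t.url = r.url
        WHERE t.tag IN ({placeholders})
        GROUP BY r.url
        HAVING COUNT(DISTINCT t.tag) = ?
        ORDER BY r.url
    """
    return conn.execute(query, [*wanted, len(wanted)]).fetchall()


matches = find_by_tags(conn, ["competitive-programming", "java"])
```

Storing tags in their own table, rather than as a comma-separated string, is what makes the "tagged with both" condition easy to express: the query counts how many of the wanted tags each reference has and keeps only those that have all of them.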

Parse custom fields based on domain

Some web sites, like Medium, might contain a few pages on the topic of competitive programming. Others, like Quora, might have thousands of pages devoted to the topic. For the latter type of site, it may be worth writing custom parsing code to extract data specific to that site. For example, we could extract the author of a question or an answer, the number of people following a question, or the number of upvotes on an answer.

It’s tricky to get information from a site that doesn’t expose an API, since the site may change its page layout at any time, which can break code that attempts to extract information from the page. So it’s best not to spend too much time on this step.
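One way to keep the cost of this step low is to register custom parsers by domain and fall back to generic fields when a custom parser stops working. The sketch below assumes that structure. The data-upvotes attribute is invented purely for illustration and is not how Quora (or any real site) marks up its pages, and the function names are hypothetical.

```python
import re
from urllib.parse import urlparse

# Maps a domain to the function that knows that site's page format.
CUSTOM_PARSERS = {}


def parser_for(domain):
    """Decorator that registers a custom parser for one domain."""
    def register(func):
        CUSTOM_PARSERS[domain] = func
        return func
    return register


def parse_generic(html):
    """Fields available on any page: here, just the HTML title."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html,
                      re.IGNORECASE | re.DOTALL)
    return {"title": match.group(1).strip() if match else ""}


@parser_for("www.quora.com")
def parse_quora(html):
    # Hypothetical markup: this attribute is made up for the example.
    match = re.search(r'data-upvotes="(\d+)"', html)
    if match is None:
        raise ValueError("upvote marker not found; layout may have changed")
    return {"upvotes": int(match.group(1))}


def extract_fields(url, html):
    """Generic fields first, then domain-specific ones if a parser exists."""
    fields = parse_generic(html)
    custom = CUSTOM_PARSERS.get(urlparse(url).netloc)
    if custom is not None:
        try:
            fields.update(custom(html))
        except ValueError:
            # A layout change should cost us the extra fields, not the row.
            fields["custom_parse_failed"] = True
    return fields


page = '<html><title>Some answer</title><div data-upvotes="42"></div></html>'
ok = extract_fields("https://www.quora.com/some-question", page)
broken = extract_fields("https://www.quora.com/some-question",
                        "<html><title>Some answer</title></html>")
other = extract_fields("https://example.com/post", page)
```

With this arrangement, a site redesign only sets a flag on the affected rows, so the rest of the tool keeps running while I decide whether the custom parser is worth repairing.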

Search Results

The standard way to collect a preliminary list of web references on a topic is to look at search results. That approach provides a list of links, along with some of the additional columns described above (e.g., domain, rank). Each result can then be examined further to write a summary and decide whether it’s worth writing custom code to further explore the associated domain.
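Here is a rough sketch of turning an ordered list of result URLs into TSV rows. It assumes the URLs have already been collected somehow (no search API is called here), that position in the results is an acceptable first stand-in for rank, and that the column names match whatever the import step expects. The example.com URL is a placeholder.

```python
import csv
import io
from urllib.parse import urlparse


def results_to_tsv(result_urls, source, out):
    """Write one TSV row per unique URL, ranked by first appearance."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["url", "domain", "rank", "sources"])
    seen = set()
    rank = 0
    for url in result_urls:
        if url in seen:
            continue  # the URL is the primary key, so skip repeats
        seen.add(url)
        rank += 1
        writer.writerow([url, urlparse(url).netloc, rank, source])


# Placeholder result list; the second URL is invented for this sketch.
results = [
    "https://www.quora.com/topic/Competitive-Programming",
    "https://example.com/contests",
    "https://www.quora.com/topic/Competitive-Programming",
]

buffer = io.StringIO()
results_to_tsv(results, "Google Search", buffer)
tsv_lines = buffer.getvalue().splitlines()
```

The title, tag list, and content summary columns are left out on purpose, since those come from examining each result rather than from the result list itself.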

Search results are what I’m going to start with to get an overview of competitive programming resources on the Web.

(Image credit: POP)