Red-Green-Code

Deliberate practice techniques for software developers


CPFAQ: Building a Webliography Tool

By Duncan Smith Jan 10

[Image: Bibliographer]

Last week I proposed, as a project for this year, a competitive programming FAQ (CPFAQ). As a first step, I suggested a tool to facilitate research on the Web. This week I’m starting to design that tool.

Webliographer

A webliography is a list of web resources, just as a bibliography lists sources printed on paper. So if a bibliographer creates bibliographies, what we need to manage Web research is a webliographer (though this word isn’t yet in the dictionary). The type of webliographer I have in mind is a software tool, though it could also refer to the person using the tool.

Consider a webliographer tool that does the following:

Import a list of web references from a tab-separated values (TSV) file

Let’s assume that we start with a TSV file containing web references. TSV is a good format because it’s simple, widely supported by other programs, and allows commas within fields. In the simplest case, a TSV file could be created manually by entering a list of references into a spreadsheet and exporting it as TSV. Or the webliographer tool itself could extract references from another source (like a web page) and write them to a TSV file.

Here’s a preliminary list of columns to capture in the TSV:

  • URL/Link: This is the primary key, meaning that it uniquely identifies a row in our dataset. The URL points to a single web reference. Example URL: https://www.quora.com/topic/Competitive-Programming.

  • Domain: This is a way to identify the set of URLs that belong to the same site. URLs from a single domain might benefit from specialized processing that knows about the format of pages in that domain. Example domain: www.quora.com.

  • Rank: An indicator of the popularity of a link, where a lower number means a more popular link. This helps distinguish between higher- and lower-quality links. Example: a link with rank 10 is more popular than one with rank 15.

  • Sources: Where the link came from (who points to it). A link will often have multiple sources. Example sources: Google Search, Quora answer.

  • Title: The title of the page (from the HTML source). Example title: Competitive Programming – Quora.

  • Tag list: A way to categorize pages. Example tag list: competitive-programming, q-and-a.

  • Content summary: A summary of the page. This becomes useful when writing the FAQ or Wiki entry that uses this page as a reference.
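
As a rough illustration of the import step, here’s a minimal Python sketch that reads rows with the columns listed above. The header names (url, domain, rank, sources, title, tags, summary), the Reference class, and the choice to store multi-valued fields as comma-separated text inside one cell are all assumptions made for this example, not a settled file format.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Reference:
    """One row of the TSV. Field names are assumptions for this sketch."""
    url: str  # primary key
    domain: str
    rank: int
    sources: list[str] = field(default_factory=list)
    title: str = ""
    tags: list[str] = field(default_factory=list)
    summary: str = ""


def split_list(cell: str) -> list[str]:
    """Split a comma-separated cell into trimmed, non-empty items."""
    return [item.strip() for item in cell.split(",") if item.strip()]


def load_references(tsv_file) -> dict[str, Reference]:
    """Read references from an open TSV file, keyed by URL."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    references: dict[str, Reference] = {}
    for row in reader:
        ref = Reference(
            url=row["url"],
            domain=row["domain"],
            rank=int(row["rank"]),
            sources=split_list(row["sources"]),
            title=row["title"],
            tags=split_list(row["tags"]),
            summary=row["summary"],
        )
        # The URL is the primary key, so a repeated URL is an error.
        if ref.url in references:
            raise ValueError(f"duplicate URL: {ref.url}")
        references[ref.url] = ref
    return references


# Sample data built from the example values in the column list.
SAMPLE_TSV = (
    "url\tdomain\trank\tsources\ttitle\ttags\tsummary\n"
    "https://www.quora.com/topic/Competitive-Programming\twww.quora.com\t10\t"
    "Google Search, Quora answer\tCompetitive Programming – Quora\t"
    "competitive-programming, q-and-a\t\n"
)

references = load_references(io.StringIO(SAMPLE_TSV))
```

Keying the result by URL makes the primary-key rule explicit: a second row with the same URL raises an error instead of silently replacing the first.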

Display results in response to queries

Once the data is imported, it can be stored in a database and queried. I expect tags to be useful in this step. Example query: display all references tagged competitive-programming and java.
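
One possible way to support the example query is sketched below using SQLite through Python’s sqlite3 module. The table names (reference, reference_tag), the find_by_tags function, and the example.com URLs are hypothetical placeholders. Storing one row per (url, tag) pair is just one reasonable design, chosen here because it makes "has all of these tags" easy to express.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reference (url TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE reference_tag (
        url TEXT NOT NULL REFERENCES reference(url),
        tag TEXT NOT NULL,
        PRIMARY KEY (url, tag)
    );
""")

# Placeholder data: these URLs and titles are made up for the example.
sample_rows = {
    "https://example.com/a": ("Page A", ["competitive-programming", "java"]),
    "https://example.com/b": ("Page B", ["competitive-programming", "q-and-a"]),
    "https://example.com/c": ("Page C", ["java"]),
}
for url, (title, tags) in sample_rows.items():
    conn.execute("INSERT INTO reference VALUES (?, ?)", (url, title))
    conn.executemany(
        "INSERT INTO reference_tag VALUES (?, ?)",
        [(url, tag) for tag in tags],
    )


def find_by_tags(conn, tags):
    """Return the URLs of references that carry every tag in `tags`."""
    tags = sorted(set(tags))
    if not tags:
        return []
    placeholders = ", ".join("?" for _ in tags)
    query = f"""
        SELECT url FROM reference_tag
        WHERE tag IN ({placeholders})
        GROUP BY url
        HAVING COUNT(DISTINCT tag) = ?
        ORDER BY url
    """
    return [row[0] for row in conn.execute(query, (*tags, len(tags)))]


matches = find_by_tags(conn, ["competitive-programming", "java"])
```

Only the tag values are passed as query parameters. The f-string inserts nothing but the right number of ? placeholders, so user-supplied tags never become part of the SQL text.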

Parse custom fields based on domain

Some web sites, like Medium, might contain a few pages on the topic of competitive programming. Others, like Quora, might have thousands of pages devoted to the topic. For the latter type of site, it may be worth writing custom parsing code to extract data specific to that site. For example, we could extract the author of a question or an answer, the number of people following a question, or the number of upvotes on an answer.

It’s tricky to get information from a site that doesn’t expose an API, since the site may change its page layout at any time, which can break code that extracts information from the page. So it’s best not to spend too much time on this step.
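
To keep that risk contained, per-domain parsing could be isolated behind a small registry, as in the sketch below. Everything here is hypothetical: the qa.example.com domain, the data-upvotes attribute, and the function names are invented for illustration and don’t describe the markup of Quora or any other real site.

```python
import re
from urllib.parse import urlsplit

# Maps a domain to the function that knows that domain's page format.
PARSERS = {}


def parser_for(domain):
    """Decorator that registers a parsing function for one domain."""
    def register(func):
        PARSERS[domain] = func
        return func
    return register


@parser_for("qa.example.com")
def parse_example_qa(html):
    # Hypothetical markup: an upvote count stored in a data attribute.
    match = re.search(r'data-upvotes="(\d+)"', html)
    return {"upvotes": int(match.group(1))} if match else {}


def extract_custom_fields(url, html):
    """Return domain-specific fields, or {} when none can be extracted."""
    domain = urlsplit(url).hostname or ""
    parser = PARSERS.get(domain)
    if parser is None:
        return {}  # no custom parser for this domain
    try:
        return parser(html)
    except Exception:
        # A layout change should cost us the custom fields, not the import.
        return {}


fields = extract_custom_fields(
    "https://qa.example.com/question/1",
    '<span data-upvotes="42"></span>',
)
```

With this shape, a page layout change on one site only empties that site’s custom fields. The generic columns (URL, domain, rank, and so on) are unaffected.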

Search Results

The standard way to collect a preliminary list of web references on a topic is to look at search results. That approach provides a list of links, along with some of the additional columns described above (e.g., domain, rank). Each result can then be examined further to write a summary and decide whether it’s worth writing custom code to further explore the associated domain.
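
Here’s a small sketch of how an ordered list of result URLs might be turned into rows with the URL, domain, rank, and sources columns. It assumes that rank is simply the position of the link in the result list (1 for the first result), which is one possible interpretation of rank. The rows_from_results function and the example.com URL are hypothetical.

```python
from urllib.parse import urlsplit


def rows_from_results(urls, source):
    """Build one row per unique URL, ranked by position in the results."""
    rows = []
    seen = set()
    for url in urls:
        if url in seen:
            continue  # the URL is the primary key, so keep the first hit only
        seen.add(url)
        rows.append({
            "url": url,
            "domain": urlsplit(url).hostname or "",
            "rank": len(rows) + 1,
            "sources": source,
        })
    return rows


# Placeholder result list, with one duplicate to show de-duplication.
results = [
    "https://www.quora.com/topic/Competitive-Programming",
    "https://example.com/cp-guide",
    "https://www.quora.com/topic/Competitive-Programming",
]
rows = rows_from_results(results, "Google Search")
```

The domain is derived from the URL instead of being entered by hand, so the two columns can’t drift out of sync.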

Search results are what I’m going to start with to get an overview of competitive programming resources on the Web.

(Image credit: POP)

Categories: CPFAQ

Copyright © 2025 Duncan Smith