The purpose of Webliographer is to collect and manage web references (URLs). A good way to get a baseline set of references on a topic is to import the results of a web search. But if you use a search engine this way, you’ll find some quirks that don’t appear when you’re searching interactively.
Search engine companies want people to search for things and look at the results (and the ads). But they want people to search through the official interface, such as the web site, so that the ads actually get seen. They definitely don’t want people extracting large amounts of search results data with automated tools.
Webliographer is designed so that the user needs to look at each web reference and add information like a summary and a set of tags. So it wouldn’t do much good to import more references than a person could ever check. On the other hand, it’s tedious to enter every web reference manually. So I tried a few ways of importing search data.
GoogleScraper is a Python module that extracts search results from Google and other search engines. Although it is no longer maintained, I was able to get it running; checking the project’s recent unmerged pull requests for fixes was useful.
With an automated tool like this, it’s easy to make Google think that you’re trying to abuse their service. GoogleScraper has a few settings that help make sure you’re playing nicely. The most effective one I found is configuring the tool to use long wait times between search result pages. This keeps Google happy.
GoogleScraper returns results in JSON format. Webliographer then uses JSON.NET to convert the JSON results to a collection of C# objects. The most useful fields can then be extracted and written to a TSV file.
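Webliographer does this conversion in C# with JSON.NET, but the same extraction step can be sketched in Python. This is a minimal illustration with a made-up JSON layout — the real GoogleScraper output uses its own field names, so treat the structure below as an assumption:

```python
import csv
import io
import json

# Hypothetical GoogleScraper-style output; the actual schema may differ.
sample_json = """
{
  "results": [
    {"rank": 1, "title": "Competitive Programming - Wikipedia",
     "link": "https://en.wikipedia.org/wiki/Competitive_programming",
     "domain": "en.wikipedia.org", "snippet": "Competitive programming is..."},
    {"rank": 2, "title": "Codeforces",
     "link": "https://codeforces.com/",
     "domain": "codeforces.com", "snippet": "Programming contests..."}
  ]
}
"""

def results_to_tsv(json_text: str) -> str:
    """Pull the most useful fields out of a JSON result set and emit TSV."""
    data = json.loads(json_text)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["rank", "title", "link", "domain"])
    for result in data["results"]:
        writer.writerow([result["rank"], result["title"],
                         result["link"], result["domain"]])
    return out.getvalue()

tsv = results_to_tsv(sample_json)
print(tsv)
```

The idea is the same in any language: deserialize, pick the fields you care about, and write one tab-separated row per result.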
The advantage of GoogleScraper is that once it’s set up, there’s almost no manual work: running a single command in a console window sends the results to a JSON file, which Webliographer can import. But I wanted to check the results by hand to make sure I was getting what I expected. To do that, I ran a Google search in the usual way, but using a Chrome incognito window (to avoid being logged in and getting personalized results).
Manually copying all of the results to a file would be tedious, so I used a Chrome extension called Linkclump, as suggested by this article. Linkclump makes it easy to copy and paste a page of results into a text document in TSV format. With the page size set to 100, it only takes a few minutes to create the file and get Webliographer to determine which links are new compared to the GoogleScraper results.
One detail: This technique only returns the link and the title, while GoogleScraper returns other fields like domain, rank, and snippet. But I found that I could derive the important fields from the available data, including the order in which the results appeared.
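The comparison step can be sketched as follows. Given the Linkclump output (a TSV of title and URL, in page order) and the set of links already imported from GoogleScraper, find which manually collected links are new; rank falls out of the order in which the results appeared. The file contents here are invented for illustration:

```python
# Made-up Linkclump-style TSV: one "title<TAB>url" line per result, in page order.
linkclump_tsv = (
    "Codeforces\thttps://codeforces.com/\n"
    "Competitive Programming - Wikipedia\t"
    "https://en.wikipedia.org/wiki/Competitive_programming\n"
    "CP-Algorithms\thttps://cp-algorithms.com/\n"
)

# URLs already imported from GoogleScraper (also made up).
scraper_links = {
    "https://codeforces.com/",
    "https://en.wikipedia.org/wiki/Competitive_programming",
}

def new_links(tsv_text, known_links):
    """Return (rank, title, url) tuples for links not already known.

    Rank is derived from the order in which the results appeared.
    """
    found = []
    for rank, line in enumerate(tsv_text.splitlines(), start=1):
        title, url = line.split("\t")
        if url not in known_links:
            found.append((rank, title, url))
    return found

print(new_links(linkclump_tsv, scraper_links))
# The only new link in this toy data is cp-algorithms.com, at rank 3.
```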
Analyzing the Results
Over the years, Google has become so good at search relevance that the first page is all you need for most search tasks. But Webliographer is designed for research, not just casual searching, so I dug a bit deeper into the results. Here are some things I found.
Consider the following Google search query:

"competitive programming"

That’s a two-word query enclosed in quotes, which means we’re asking for results that include those two words next to each other, in that order. That’s the query I’ll be focusing on.
How many results?
At the top of the search results page is a statement like this:
About 504,000 results (0.38 seconds)

Guess what? There aren’t really that many results, as we’ll see later. Google has been reporting inflated counts for years. There’s an article from 2010 called Why Google Can’t Count Results Properly, which quotes an article from 2006 pointing out the same problem. Now it’s 2018, and the numbers are still wildly off. Why don’t they fix it? Apparently an accurate count isn’t worth the trouble to compute, and no one looks through thousands of result pages anyway. So don’t pay much attention to that number.
The real result count
To find the actual number of results, just click through the result pages. For this query, it didn’t take long to converge on an answer:
- Page 1: “about 504,000 results”
- Page 2: “about 399,000 results”
- Page 3: “about 213 results”
Even that last number is an estimate. Using Linkclump, I copied exactly 216 links into my TSV file.
And that number still isn’t completely accurate. Webliographer detected 8 exact duplicate links, for a final count of 208 unique links.
We’re not done yet. At the bottom of the last page of search results is this message:
In order to show you the most relevant results, we have omitted some entries very similar to the 216 already displayed. If you like, you can repeat the search with the omitted results included.
(Note that the result count of 216 in that message is consistent with the actual number of links returned.)
Accepting the offer to repeat the search produces a total of 500 results, of which 418 are new compared to the previous search. Webliographer combines the two result sets into a master list of 626 unique URLs. That is, there are no repeated URLs. The tool doesn’t currently check for duplicate content, where a site posts an article that has already appeared elsewhere. For example, this famous Quora answer was also published in Forbes, but they both appear in the final Webliographer list.
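The merge logic is straightforward set arithmetic: deduplicate each result set by exact URL, then combine them into a master list of unique URLs. A minimal sketch (with placeholder URLs — the real sets had 208 and 500 entries):

```python
# Placeholder result sets; note the exact duplicate in `original` and the
# overlap between the two sets.
original = ["https://a.example/", "https://b.example/", "https://a.example/"]
repeated = ["https://b.example/", "https://c.example/", "https://d.example/"]

def merge_unique(*result_sets):
    """Combine result sets into a master list of unique URLs, in first-seen order."""
    seen = set()
    master = []
    for results in result_sets:
        for url in results:
            if url not in seen:        # drop exact duplicate links
                seen.add(url)
                master.append(url)
    return master

master = merge_unique(original, repeated)
print(len(master))   # 4 unique URLs in this toy example
```

As the article notes, this only catches exact duplicate URLs, not duplicate content published at two different addresses.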
What does the “repeat with omitted results” search option actually do besides include “similar” results? One answer is that it returns more results from some domains. For example, the original search returns just one result from the quora.com domain, while the repeated search returns 100 results from that domain.
But there is a more puzzling difference between the two types of searches: the number of different domains included in the results. For this query, the original search returns results from 200 unique domains, while the repeated search returns results from only 110 unique domains. Not only that, but the repeated search, despite including fewer unique domains, returns 28 domains that don’t appear in the original search. These include high-ranked domains like stackoverflow.com, and domains like codeforces.com that are entirely focused on competitive programming.
So it’s important not to think of the repeated search as a superset of the original search, which Google’s message implies when it says “repeat the search with the omitted results included.” Instead, think of it as a different type of search that focuses on more in-depth results from some domains, at the expense of other domains.
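The domain comparison above is a set-difference computation. Here is a sketch with invented URLs, using Python’s standard `urllib.parse` to pull the domain out of each URL:

```python
from urllib.parse import urlparse

# Invented result sets standing in for the original and repeated searches.
original = ["https://quora.com/q1",
            "https://en.wikipedia.org/wiki/X"]
repeated = ["https://quora.com/q1",
            "https://quora.com/q2",
            "https://stackoverflow.com/questions/1"]

def domains(urls):
    """Return the set of unique domains appearing in a list of URLs."""
    return {urlparse(u).netloc for u in urls}

orig_domains = domains(original)
rep_domains = domains(repeated)

# Domains the repeated search surfaces that the original search missed.
only_in_repeated = rep_domains - orig_domains
print(sorted(only_in_repeated))   # ['stackoverflow.com'] in this toy data
```

Because neither set difference is empty in the article’s real data, neither search is a superset of the other — which is exactly why the two result sets are worth merging.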
Fortunately, Webliographer is designed to merge search results from multiple sources, so it’s not necessary to rely on just one search type. Access to the repeated search function is a good reason to use a manual tool like Linkclump and not rely entirely on an automated tool like GoogleScraper.
Search One Site
Compared to the original search, the repeated search described above includes a larger number of results from a smaller number of unique domains. But notice that even the repeated search only returns 100 results from quora.com. And as we know, there are a lot more than 100 competitive programming questions on Quora. That’s where the ability to search a specific site comes in handy.
Here’s the new search query:

"competitive programming" site:quora.com

The same two-word query in quotes, but now restricted to one domain.
As before, Google offers both a standard search and a “repeat with omitted results included” search. Since we’re just dealing with one domain, the repeated search can’t just include more results from fewer domains, as it did before. Here’s what I got:
- Standard search: 500 results, 499 unique URLs, 14 unique URLs not in the repeated version.
- Repeated search: 598 results, 595 unique URLs, 110 unique URLs not in the standard version.
- Webliographer merged result: 609 unique URLs.
So the purpose of the repeated search function is even less clear in this scenario: it doesn’t seem to offer much more than additional results. And as before, it excludes some results that appear in the standard search. So it still pays to merge both result sets if you’re looking for a comprehensive view of a topic.
Finally, it’s clear that not all results are being returned even in the repeated search, since Quora has on the order of 20 thousand competitive programming questions. So web search is obviously not the way to get comprehensive results. But that’s a topic for a future discussion.