CPFAQ: Scraping with Selenium


I’m working on a project this year to build a competitive programming FAQ. This is one in a series of articles describing the research, writing, and tool creation process. To read the whole series, see my CPFAQ category page.

When you’re logged in to Quora, you see more information than an anonymous user does. For example, on the all_questions page for a topic, logged-in users see a title for each question along with how many answers it has, when it was last followed or requested, how many followers it has, and various available actions. Anonymous users just see the question titles.

When I started collecting Quora questions for the FAQ, I noticed this discrepancy between the anonymous and logged-in experiences. To collect as much information as possible, I often manually saved pages while logged in, and then ran my tools on the saved HTML. But for individual question pages this wasn't practical, since I'm tracking over 15,000 questions. For those, I wrote a program to download pages automatically. And since that program did not log in, some useful information was not available.

It would be ideal to combine the convenience of automation with the extra data provided to logged-in users. This week, I experimented with using the Selenium testing framework to achieve this. It turned out to be a simple process.

Selenium

Selenium is an open-source framework for testing web applications. Developers and testers use it to automate actions in a web browser and verify that the results are what they expect. It's also useful for non-testing applications, like collecting data.

Here are the steps for creating a basic Selenium web page scraper:

Create a driver

To take actions in a browser, a Selenium test uses a driver. There are drivers for various browsers. I used Chrome and wrote my test program in C#, so I instantiated a driver like this:

var driver = new ChromeDriver();

Navigate to a web page

From previous work, I have a list of the all_questions pages for every Quora topic related to competitive programming. I used this Selenium command to navigate to the URL for each of these pages:

driver.Navigate().GoToUrl(url);

Log in

As I mentioned, one of the main reasons I wanted to use Selenium is to get the logged-in version of a Quora page. Selenium can automate a login page just like any other page. Here’s how a script can start the login process:

var signInButton = 
    driver.FindElementByClassName("header_signin_with_search_bar");
signInButton.Click();

FindElementByClassName looks for a web element with a particular class name. (There are similar methods for finding elements by other criteria, like Id.) It returns an object of type OpenQA.Selenium.IWebElement. Calling methods on that object causes actions to occur in the browser.

Using this pattern repeatedly, it's possible to enter a user name and password and complete the login process. (To avoid any risk to my real Quora account from automation bugs, I created a throwaway account for automation.)
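As a sketch, the rest of the login flow might look like the following. The field names here (email, password) are illustrative assumptions; Quora's actual markup changes over time, so the real selectors have to be taken from the live page:

```csharp
// Illustrative sketch of filling in the login form. The element
// names below are assumptions; inspect the live page for the real
// ones, since Quora's markup changes frequently.
var emailField = driver.FindElementByName("email");       // assumed field name
var passwordField = driver.FindElementByName("password"); // assumed field name

emailField.SendKeys("throwaway@example.com");
passwordField.SendKeys("throwaway-password");
passwordField.SendKeys(Keys.Enter); // submit the form
```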

Execute JavaScript

Quora uses an infinite scroll technique, rather than pagination, to display content that doesn't fit on a single page. So there's no web element that an automated script can click to get to the next page of data. Fortunately, Selenium allows executing arbitrary JavaScript on the current page. For example:

driver.ExecuteScript("window.scrollTo(0, document.body.scrollHeight);");

This command scrolls the page zero pixels in the horizontal direction and the full height of the page body in the vertical direction. In other words, it scrolls to the bottom of the page.

Running this JavaScript causes Quora to load more content, if there is any. We can run it repeatedly to keep scrolling. But on an infinite scroll page, how do we know when to stop? (There’s not an infinite amount of content, even on Quora). A few Stack Overflow answers suggest checking the value of document.body.scrollHeight after every scroll and stopping once that value stops increasing. This sometimes works, but I found it to be unreliable on Quora. Large all_questions pages (like the one for the main Competitive Programming topic) would reach a point where scrollHeight would keep increasing (while Quora’s “still loading” animation played), but no additional content would load. So I needed a better approach.

Count questions

FindElementByClassName returns a single element. But some Selenium Find methods return a collection of elements. For example:

var questions = driver.FindElementsByClassName("ui_qtext_rendered_qtext");

This call returns a collection of every element with that class name, which on an all_questions page means every question title. If the size of this collection doesn't increase even when the page is scrolled, that's a good sign that we're at the bottom of the page.
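Putting the pieces together, the scroll loop can be sketched like this. The pass limit and sleep interval are tuning assumptions, not values from the original program:

```csharp
// Keep scrolling until the number of question titles stops growing.
// MaxStablePasses and the sleep interval are arbitrary tuning values.
const int MaxStablePasses = 3;
int previousCount = 0;
int stablePasses = 0;

while (stablePasses < MaxStablePasses)
{
    driver.ExecuteScript("window.scrollTo(0, document.body.scrollHeight);");
    Thread.Sleep(2000); // give Quora time to load more content

    int currentCount =
        driver.FindElementsByClassName("ui_qtext_rendered_qtext").Count;
    if (currentCount > previousCount)
    {
        previousCount = currentCount;
        stablePasses = 0; // new content arrived; keep going
    }
    else
    {
        stablePasses++;   // no new content; we may be at the bottom
    }
}
```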

Click the More button

Sometimes an all_questions page will temporarily stop scrolling, and instead render a More button that has to be clicked before infinite scrolling resumes. A find element / click element approach can be used to deal with this button, as with any other page element.
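For example, the button check can be folded into the scroll loop like this. The class name more_link is an assumption for illustration; the real selector has to come from the live page:

```csharp
// If a More button is present, click it so infinite scrolling resumes.
// "more_link" is an assumed class name, not Quora's actual markup.
try
{
    var moreButton = driver.FindElementByClassName("more_link");
    if (moreButton.Displayed)
        moreButton.Click();
}
catch (NoSuchElementException)
{
    // No More button on the page right now; keep scrolling normally.
}
```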

Save the page

Although Selenium has commands to extract information from a page, I prefer to have a local HTML copy that I can experiment with offline. So when the page finishes scrolling, I run this:

File.WriteAllText(ConvertToWindowsFileName(url), driver.PageSource);

ConvertToWindowsFileName is a custom method that creates a legal Windows filename from the page URL. driver.PageSource is the entire HTML for the page. I can then parse the page to get details for each question: question URL, number of followers, number of answers, etc. I describe this process in my earlier notes on question collection.
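A minimal version of such a helper might look like this. It's a sketch of the idea, not the exact method used here:

```csharp
using System.IO;
using System.Linq;
using System.Text;

// Turn a URL into a legal Windows file name by replacing any
// characters that Windows forbids in file names with underscores.
static string ConvertToWindowsFileName(string url)
{
    var invalid = Path.GetInvalidFileNameChars();
    var builder = new StringBuilder();
    foreach (char c in url)
    {
        builder.Append(invalid.Contains(c) ? '_' : c);
    }
    return builder.ToString() + ".html";
}
```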

Troubleshooting

The process described above works well enough. I collected thousands of relevant questions. But I ran into a few problems that I didn’t find ideal solutions for.

Question count mismatch

As I mentioned in my previous question collection post, Quora’s large all_questions pages promise many more questions than the infinite scrolling results list ever delivers. It’s still not clear what accounts for the discrepancy, but given how many different ways I have collected questions, I find it hard to believe there’s a huge set of missing questions. My best guess is that the number, like Google’s results count, is just a wild estimate.

Selenium exceptions

For long pages, Selenium would sometimes throw one of the following exceptions:

OpenQA.Selenium.StaleElementReferenceException: “stale element reference: element is not attached to the page document.”

This usually happened while counting the question titles. The solution was to send another JavaScript scroll command and try again. Whether or not there was any more data to scroll into view, the exception went away.
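The retry can be wrapped directly around the counting step, as in this sketch:

```csharp
// Count question titles, retrying after a scroll if the DOM mutated
// underneath us and Selenium reports a stale element.
int CountQuestions()
{
    while (true)
    {
        try
        {
            return driver
                .FindElementsByClassName("ui_qtext_rendered_qtext")
                .Count;
        }
        catch (StaleElementReferenceException)
        {
            // The page re-rendered mid-count; nudge it and try again.
            driver.ExecuteScript(
                "window.scrollTo(0, document.body.scrollHeight);");
            Thread.Sleep(1000);
        }
    }
}
```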

OpenQA.Selenium.WebDriverException: “A exception with a null response was thrown sending an HTTP request to the remote WebDriver server. The status of the exception was ConnectFailure, and the message was: Unable to connect to the remote server.”

This happened when I tried to save the page at the end of the process. As a workaround, I sometimes just saved the content manually, which I could do because I was using a visible Chrome browser to run the automation.

I also tried using this syntax when constructing the driver:

var driver = 
    new ChromeDriver(ChromeDriverService.CreateDefaultService(),
    new ChromeOptions(),
    new TimeSpan(1, 0, 0, 0));

This is supposed to set the driver's command timeout to one day, so that long-running operations don't time out, but it didn't seem to solve the problem.

Collecting More Questions

Despite a few quirks, the Selenium experiment was a success. Collecting questions while logged in is preferable to my previous approach. I plan to use this process to collect questions from other pages, like user answer pages.