CPFAQ: Unicode in Quora Question Titles

Quotation Marks

I’m working on a project this year to build a competitive programming FAQ. This is one in a series of articles describing the research, writing, and tool creation process. To read the whole series, see my CPFAQ category page.

As I have mentioned in the past, I often use Excel as a quick way to manipulate tables of data, even when that data doesn’t involve numbers and formulas. My Quora tools output data in TSV format, which is easy to import into Excel. But I noticed when importing those files that some question titles have strange characters mixed in with the valid ones, due to an encoding issue. I have been ignoring it until now, but I’d like to fix it.

Quotation Marks on Quora

Consider this question:

How can I configure “Microsoft Visual Studio 2017” for competitive programming?

If you look closely, you can see that the double-quote character before Microsoft differs from the one after 2017. That’s also true if you look at the title on Quora. This question title uses typographic quotation marks, which distinguish between the opening and closing quotation marks.

What about this question?

What were the “16 standard algorithms” that Neal Wu’s coaches “drilled into his brain” in preparation for the International Olympiad in Informatics?

Here on the blog, this question title also uses typographical quotation marks, because WordPress rendering takes care of that detail. But if you look at the question on Quora, you’ll see typewriter-style straight quotation marks.

It isn’t unusual to see this inconsistency between quotation mark styles in Quora question titles, though the straight style seems to be more common. Quora rendering doesn’t enforce one or the other style.

Quotation Mark Characters

Since straight quotation marks are the same at the beginning and end of quoted text, they only require one character: ASCII character 34 (hex 22), the one that the computer emits when you press the " key.

For typographical quotation marks, the convention is to use two Unicode characters: 8220 (hex 201C) for the left double quotation mark, and 8221 (hex 201D) for the right double quotation mark.
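The code point numbers above are easy to check. Here is a minimal Python sketch; the constant names and the `describe` helper are made up for illustration and are not part of my Quora tools.

```python
# The three double-quote characters discussed above.
STRAIGHT = '"'      # typewriter-style straight quote
LEFT = "\u201c"     # left double quotation mark
RIGHT = "\u201d"    # right double quotation mark


def describe(ch: str) -> str:
    """Return a character's code point in decimal and hex."""
    return f"{ord(ch)} (hex {ord(ch):X})"


# describe(STRAIGHT) -> '34 (hex 22)'
# describe(LEFT)     -> '8220 (hex 201C)'
# describe(RIGHT)    -> '8221 (hex 201D)'
```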

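When a program (e.g., the default Excel TSV importer) reads a UTF-8 file containing Unicode quotation marks and interprets it using a non-Unicode encoding, it munges each quote character into multiple other characters. In the Excel example on English Windows, the importer uses Windows-1252 encoding, a Windows default. UTF-8 stores each typographical quote as three bytes, and Windows-1252 treats every byte as a separate character. Under this encoding, the left double quote appears as “ and the right double quote as †followed by a byte (hex 9D) that Windows-1252 leaves undefined, so the third character may show up as a blank or a placeholder.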
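The effect can be reproduced outside Excel. The sketch below assumes the TSV file is saved as UTF-8 (which matches the garbled output), and the function name is hypothetical. It encodes each quote as UTF-8 and then decodes the bytes as Windows-1252, which is roughly what the importer does.

```python
def misread_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as Windows-1252.

    errors="replace" is needed because Python's cp1252 codec has no
    mapping for byte 0x9D and would otherwise raise UnicodeDecodeError.
    """
    return text.encode("utf-8").decode("cp1252", errors="replace")


# Left double quote: bytes E2 80 9C become three separate characters.
LEFT_MOJIBAKE = misread_as_cp1252("\u201c")   # 'â€œ'

# Right double quote: bytes E2 80 9D; the 0x9D byte is undefined in
# Windows-1252, so Python substitutes U+FFFD for it.
RIGHT_MOJIBAKE = misread_as_cp1252("\u201d")  # 'â€' + '\ufffd'
```

Excel may handle the undefined byte differently from Python, so the last character of the right quote is only an approximation of what the importer shows.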

Quora Tools

My Quora tools are for analyzing and interpreting Quora questions, not for rendering them (that’s Quora’s job). So my approach is to convert non-ASCII Unicode characters to their closest ASCII equivalents. The typographical double-quote characters both become ASCII 34. Similarly, left and right single quote characters (Unicode 8216 and 8217) both become apostrophes (ASCII 39). The en dash (Unicode 8211) becomes a hyphen (ASCII 45). I’ll add other conversions as I encounter more characters. There are libraries to do this conversion, but the number of distinct Unicode characters in my Quora question title list is small enough that it’s easy to write custom code in my tool to convert each one separately.
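For illustration, here is one way such a conversion could look in Python. This is a sketch, not the code from my tools: the table name `ASCII_REPLACEMENTS` and the function `to_ascii` are hypothetical, and the table contains only the five characters listed above.

```python
# Map each non-ASCII code point to its ASCII replacement.
ASCII_REPLACEMENTS = {
    0x201C: '"',  # left double quotation mark (8220)
    0x201D: '"',  # right double quotation mark (8221)
    0x2018: "'",  # left single quotation mark (8216)
    0x2019: "'",  # right single quotation mark (8217)
    0x2013: "-",  # en dash (8211)
}


def to_ascii(title: str) -> str:
    """Replace the known typographical characters in a question title.

    Characters that are not in the table pass through unchanged.
    """
    return title.translate(ASCII_REPLACEMENTS)
```

Calling `to_ascii` on the Visual Studio question title from earlier returns the same title with straight quotes around the product name. A dictionary passed to `str.translate` keeps the mapping in one place, so supporting a new character only takes one more entry.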

(Image credit: Kyle Van Horn)