I’m working on a project this year to build a competitive programming FAQ. This is one in a series of articles describing the research, writing, and tool creation process. To read the whole series, see my CPFAQ category page.
Data in a relational database is often not arranged in a way that makes sense to the end user. Instead, it’s optimized to reduce duplication and improve query performance. So when it’s retrieved from the database, it needs to be converted into a more optimal format through a combination of SQL queries and application logic. As an example, I’ll use data from my question database.
Outer Joins
In this example, the user interface is a spreadsheet. I’m using a spreadsheet for editing since it saves the time required to build my own UI on top of the database. In the spreadsheet, each row contains a question and its associated properties. Some properties, like Link (URL) and Number of Followers are unique to each question. But other properties, like Canonical Question Title, contain values that can be the same for multiple questions (since multiple questions can ask about different aspects of a single canonical question).
In the spreadsheet, if two questions have the same canonical question title, it’s convenient to show the full canonical question title for each one. Since the two questions may appear in different parts of the spreadsheet, we don’t want to have to scroll around to find the canonical title.
But in a database, storing an identical title multiple times would be wasteful. Instead, I have a CanonicalQuestion
table that stores the unique titles. Then each question record has a foreign key reference to the corresponding canonical question record. So returning questions and canonical questions just requires a join:
SELECT q.Title, q.Url, q.NumFollowers, cq.Title AS [CanonicalTitle]
FROM Question q
LEFT JOIN CanonicalQuestion cq ON cq.CanonicalQuestionId = q.CanonicalQuestionId
The LEFT JOIN
(short for LEFT OUTER JOIN
) returns all rows from the left table (the one mentioned first, Question
) and matches them where possible with rows from the right table (the one mentioned second, CanonicalQuestion
). If no match is possible, it shows NULL
for CanonicalTitle
.
This is the appropriate join to use in this case because canonical titles are created manually, so they won’t be available for every question. A NULL
just means that a canonical title needs to be written or chosen for that question.
The same query could be written with RIGHT JOIN
just by reversing the tables:
SELECT q.Title, q.Url, q.NumFollowers, cq.Title AS [CanonicalTitle]
FROM CanonicalQuestion cq
RIGHT JOIN Question q ON q.CanonicalQuestionId = cq.CanonicalQuestionId
But I find it simpler to stick with LEFT JOIN
and order the tables accordingly.
Changing Rows to Columns with STUFF and PIVOT
For the one-to-many relationship between a single canonical question and multiple questions, it’s easy enough to join the two tables and have the canonical question title appear as a column in the result set. The many-to-many relationship between tags and questions is trickier.
To allow one question to have multiple tags, and one tag to be associated with multiple questions, I’m using an associative table, QuestionTag
. The only data in this table is foreign key data: each row has a question ID and a tag ID, representing a single tag associated with a single question.
To retrieve all questions with their associated tags, it’s possible to use a join query like those in the previous section:
SELECT q.QuestionId, q.Title, t.Alias
FROM Question q
LEFT JOIN QuestionTag qt on qt.QuestionId = q.QuestionId
LEFT JOIN Tag t on t.TagId = qt.TagId
Each row in the result set for this query contains a question and an associated tag, or NULL if a question has no tags (which can happen since I’m adding tags manually, not importing them from Quora). The problem is that if a question has five tags, five rows are returned. This doesn’t match the design of the spreadsheet I’m trying to create, which has only one row per question.
Since my target spreadsheet has a column per tag, but my result set has a row per tag, I need a way to convert rows to columns. The Stack Overflow question Efficiently convert rows to columns in SQL server has some useful techniques.
The T-SQL STUFF command is used to insert a string into another string, or just concatenate strings. Here’s how it can be used to create a comma-separated list of tags for each question:
SELECT QuestionId, Alias =
STUFF((SELECT ',' + Alias
FROM Tag t2
JOIN QuestionTag qt2 ON qt2.TagId = t2.TagId
WHERE qt2.QuestionId = qt.QuestionId
FOR XML PATH('')), 1, 1, '')
FROM QuestionTag qt
GROUP By QuestionId
FOR XML PATH converts the result of the subquery into a single value, so it can be assigned to the Alias column. Passing the zero-length string ('')
means we don’t want any XML tags, just a regular string.
I could stop here and just parse the comma-separated list in C#, but there are a few more SQL tricks available in the Stack Overflow answer. Consider this adaption of the example from the answer:
select @cols = STUFF((SELECT ',' + QUOTENAME(t.Alias)
FROM Question q
CROSS JOIN Tag t
LEFT JOIN QuestionTag qt ON qt.QuestionId = q.QuestionId
AND qt.TagId = t.TagId
WHERE q.QuestionId = @qid
FOR XML PATH(''), TYPE
).value('.', 'NVARCHAR(MAX)')
,1,1,'')
set @query = N'SELECT ' + @cols + N' from
(
SELECT CASE q.Title WHEN NULL THEN NULL ELSE ''x'' END
AS [Title], t.Alias
From Question q
JOIN QuestionTag qt on qt.QuestionId = q.QuestionId
JOIN Tag t on t.TagId = qt.TagId
WHERE q.QuestionId = ' + @qid + N'
) x
PIVOT
(
MAX(Title)
for Alias in (' + @cols + N')
) p '
exec sp_executesql @query;
For a single question @qid
, this produces the tag portion of the spreadsheet row: a column for each tag name, marked with an x
if the tag applies to the question.
The idea is to use the STUFF
command as described above to generate a comma-separated list of column names. Since T-SQL queries uses comma-separated lists in this format, the list can then be concatenated into a SELECT
query and executed as dynamic SQL. The PIVOT command is used at the end to turn rows into columns.
The SQL way of doing things can seem complicated to developers who are used to thinking procedurally (i.e., most of them). But it can perform better than the equivalent procedural solution, assuming good database design and indexes.