The title of this post is very cryptic, but you don't worry! I'm just trying to look smart. Now I'll tell you what this is about. I think it's easier if I start by creating a few simple tables and talk by illustration. For it's very hard -- and usually pointless -- to talk abstractally. Abstract talk usually results in text such as the title I've created for this post :P
Tables
- people
- A field id, which is an integer
- A field name, which is a string
- books
- A field id, which is an integer
- A field name, which is a string
- A field person_id, which is a foreign key to person
Relationships
- One person has many books
The relationship described is an one-to-many relationship. One person has many books. That's the idea.
A very pertinent question is: who have read harry potter and lord of the rings? Although the question is quite simple, the answer is not.
If the question was who have read harry potter?, then the answer would be straightforward:
SELECT *
FROM people INNER JOIN books ON people.id=books.person_id
WHERE books.name = 'harry potter'
What about our first question? Who have read harry potter and lord of the rings? Turns out it's not that simple. At first we may be tempted to write
SELECT *
FROM people INNER JOIN books ON people.id=books.person_id
WHERE books.name = 'harry potter' AND books.name = 'lord of the rings'
However, that won't work. That condition will never be true. How can a single row have two values? It can't. We must remember where is applied to each row of the query. We can't refer to different rows in the same where clause.
One possibility is yet another join
SELECT *
FROM people INNER JOIN books as b1 ON people.id=b1.person_id INNER JOIN books as b2 ON people.id=b2.person_id
WHERE b1.name = 'harry potter' AND b2.name = 'lord of the rings'
I'm not sure the syntax is exactly right, but I think you can get the idea (and possibly give me feedback on how the proper syntax goes). For each new conditional you'd need a new join. I find that solution very inellegant. Don't you think?
Search engines such as solr have a prettier syntax for such a query. It is as simple as: book:harry+potter AND book:lord+of+the+rings. That tells the system you want documents containing both harry potter and lord of the rings on book field. That's one of those cases where denormalized data makes it easier (and more efficient) to make a query.
Thinking about denormalization, if you always query for people who have two books (but never for people who have three or more books), then you could denormalize your people table. If people table has two columns: book1 and book2, then the query could take the following form:
SELECT *
FROM people
WHERE (people.book1 = 'harry potter' AND people.book2 = 'lord of the rings') OR (people.book2 = 'harry potter' AND people.book1 = 'lord of the rings')
a little uglier than solr's counterpart, but better than the join chain.
There are better solutions than the multiple joins I proposed before. One that I find particularly elegant is to use the INTERSECT operation:
SELECT *
FROM people INNER JOIN books ON people.id=books.person_id
WHERE books.name = 'harry potter'
INTERSECT
SELECT *
FROM people INNER JOIN books ON people.id=books.person_id
WHERE books.name = 'lord of the rings'
I find that query as elegant as the search engine approach (although probably slower). Unfortunately, it does not work on mysql (which doesn't have INTERSECT). A solution for intersect is given by Carsten in his Random Ramblings blog. It's ugly, but it's what we've got. The idea is to substitute INTERSECT by an inner join:
SELECT *
FROM
(SELECT *
FROM people INNER JOIN books ON books.id=person_id
WHERE books.name = 'harry potter')
INNER JOIN
(SELECT *
FROM people INNER JOIN books ON id=person_id
WHERE books.name = 'lord of the rings')
USING (id)
not really pretty, is it?
Finally, there's one very cryptic approch. I like it because it seems like the fastest solution and, also, because I'm a cryptic man.
SELECT *
FROM people INNER JOIN books ON people.id=books.person_id
WHERE books.name in ('harry potter', 'lord of the rings')
GROUP BY people.id
HAVING count(*) = 2
It's hard to understand, isn't it? It's hard to explain too. The idea is that only people having both entries, harry potter and lord of the rings, will have two rows.
SELECT person_id, COUNT(*) FROM books GROUP BY person_id
will give us the number of books each person has.
SELECT person_id, COUNT(*) FROM books WHERE name='harry potter' GROUP BY person_id
will give us the number of books named harry potter each person has. Notice that, for the cryptic trick to work, we assume that each person have exactly one copy of each book. If that's not true, then we are back to the intersection deal.
Assuming a person can't have two books with the same name, then here's a query that will return 2 for each person who have both harry potter and lord of the rings books, 1 if the person has either harry potter or lord of the rings, but not both, and 0 if the person has neither.
SELECT person_id, COUNT(*) FROM books WHERE name in ('harry potter', 'lord of the rings') GROUP BY person_id
Now it should be easy to understand the cryptic approach.
Notes on disjunction (OR)
Although using conjunction is so difficult, disjunction is much easier. The following query will return the set of people that either have harry potter or lord of the rings books:
SELECT DISTINCT people.*
FROM people INNER JOIN books ON people.id=books.person_id
WHERE books.name in ('harry potter', 'lord of the rings')
much simpler, huh? I guess SQL likes to unite, but not to intersect.