Joins
A query that can
access all the data it needs in a single table is a pretty rare one. John Donne
said "no
man is an
island," and, in relational databases, no table is, either. Usually, a
query will have to go to two
or more tables to
find all the information it requires. This is the way of things with relational
databases.
Data is intentionally
spread out to keep it as modular as possible. There are lots of good reasons
for this
modularization
(formally known as normalization) that I won't go into here, but one of
its downsides is
that what might be a
single conceptual entity (an invoice, for example) is often split into
multiple physical
entities when
constructed in a relational database.
Dealing with this
fragmentation is where joins come in. A join consolidates the data in two
tables into a
Single result set.
The tables aren't actually merged; they just appear to be in the rows returned
by the
query. Multiple joins
can consolidate multiple tables—it's quite common to see joins that are
multiple
levels deep
involving scads of tables.
A join between two
tables is established by linking a column or columns in one table with those in
another
(CROSS JOINs are an
exception, but more on them later). The expression used to join the two tables
constitutes the join
condition or join criterion. When the join is successful, data in
the second table is
combined with the
first to form a composite result set—a set of rows containing data from both
tables. In
short, the two tables
have a baby, albeit an evanescent one.
There are two basic
types of joins, inner joins and outer joins. The key difference
between them is that
Outer joins include
rows in the result set even when the join condition isn't met, while an inner
join
doesn't. How is this?
What data ends up in the result set when the join condition fails? When the
join
criteria in an outer
join aren't met, columns in the first table are returned normally, but columns
from the
second table are
returned with no value—as NULLs. This is handy for finding missing values and
broken
links between
tables.
There are two
families of syntax for constructing joins—legacy and ANSI/ISO SQL-92 compliant.
The
Legacy syntax dates
back to SQL Server's days as a joint venture between Sybase and Microsoft. It's
more succinct than
the ANSI syntax and looks like this:
SELECT
customers.CustomerNumber, orders.Amount
FROM customers,
orders
WHERE
customers.CustomerNumber=orders.CustomerNumber
CustomerNumber
Amount
--------------
--------
1 123.45
2 678.90
3 86753.09
1 678.90
Note the use of the
WHERE clause to join the customers and orders tables together. This is an inner
join.
If an order doesn't
exist for a given customer, that customer is omitted completely from the list.
Here's
the ANSI version of
the same query:
SELECT
customers.CustomerNumber, orders.Amount
FROM customers JOIN
orders ON (customers.CustomerNumber=orders.CustomerNumber)
This one's a bit
loquacious, but the end result is the same: customers and orders are merged
using their
respective
CustomerNumber columns.
As I mentioned
earlier, it's common for queries to construct multilevel joins. Here's an
example of a
Multilevel join that
uses the legacy syntax:
SELECT
customers.CustomerNumber, orders.Amount, items.Description
FROM customers, orders,
items
WHERE
customers.CustomerNumber=orders.CustomerNumber
AND
orders.ItemNumber=items.ItemNumber
CustomerNumber
Amount Description
--------------
-------- -----------
1 123.45 WIDGET A
2 678.90 WIDGET B
3 86753.09 WIDGET C
1 678.90 WIDGET B
This query joins the
composite of the customers table and the orders table with the items table.
Note that
The exact ordering
of the WHERE clause is unimportant. In order to allow servers to fully optimize
queries,
SQL requires that
the ordering of the predicates in a WHERE clause must not affect the result
set. They
must be associative—the
query must return the same result regardless of the order in which they're
processed.
As with the
two-table join, the ANSI syntax for multitable inner joins is similar to the legacy
syntax. Here's
The ANSI syntax for
the multitable join above:
SELECT
customers.CustomerNumber, orders.Amount, items.Description
FROM customers JOIN
orders ON (customers.CustomerNumber=orders.CustomerNumber)
JOIN items ON
(orders.ItemNumber=items.ItemNumber)
Again, it's a bit
wordier, but it performs the same function.
Outer Joins
Thus far, there
hasn't been a stark contrast between the ANSI and legacy join syntaxes. Though
not
syntactically
identical, they seem to be functionally equivalent.
This all changes
with outer joins. The ANSI outer join syntax addresses ambiguities inherent in
using the
WHERE clause—whose
terms are by definition associative—to perform table joins. Here's an example
of
The legacy syntax
that contains such ambiguities:
-- Bad SQL - Don't
run
SELECT
customers.CustomerNumber, orders.Amount, items.Description
FROM customers,
orders, items
WHERE
customers.CustomerNumber*=orders.CustomerNumber
AND
orders.ItemNumber*=items.ItemNumber
Don't bother trying
to run this—SQL Server won't allow it. Why? Because WHERE clause terms are
Required to be
associative, but these aren't. If customers and orders are joined first, those
rows where a
Customer exists but
has no orders will be impossible to join with the items table since their
ItemNumber
column will be NULL.
On the other hand, if orders and items are joined first, the result set will
include
ITEM records it
likely would have otherwise missed. So the order of the terms in the WHERE
clause is
significant when
constructing multilevel joins using the legacy syntax.
It's precisely
because of this ambiguity—whether the ordering of WHERE clause predicates is
significant—
that the SQL-92
standard moved join construction to the FROM clause. Here's the above query
rewritten
using valid ANSI
join syntax:
SELECT
customers.CustomerNumber, orders.Amount, items.Description
FROM customers LEFT
OUTER JOIN orders ON
(customers.CustomerNumber=orders.CustomerNumber)
LEFT OUTER JOIN
items ON (orders.ItemNumber=items.ItemNumber)
CustomerNumber
Amount Description
--------------
-------- -----------
1 123.45 WIDGET A
1 678.90 WIDGET B
2 678.90 WIDGET B
3 86753.09 WIDGET C
Here, the
ambiguities are gone, and it's clear that the query is first supposed to join
the customers and
orders tables, then
join the result with the items table. (Note that the OUTER keyword is
optional.)
To understand how
this shortcoming in the legacy syntax can affect query results, consider the
following
query. We'll set it
up initially so that the outer join works as expected:
SELECT customers.CustomerNumber,
orders.Amount
FROM customers,
orders
WHERE
customers.CustomerNumber*=orders.CustomerNumber
AND
orders.Amount>600
CustomerNumber
Amount
--------------
--------
1 678.90
2 678.90
3 86753.09
Since every row in
customers finds a match in orders, the problem isn't obvious. Now let's change
the
query so that there
are a few mismatches between the tables, like so:
SELECT
customers.CustomerNumber+2, orders.Amount
FROM customers,
orders
WHERE
customers.CustomerNumber+2*=orders.CustomerNumber
AND orders.Amount>600
This version simply
adds 2 to CustomerNumber to ensure that at least a few of the joins will fail
and the
columns in orders
will be returned as NULLs. Here's the result set:
CustomerNumber
Amount
--------------
--------
3 86753.09
4 NULL
5 NULL
See the problem?
Those last two rows shouldn't be there. Amount is NULL in those rows (because
there
are no orders for
customers4 and5), and whether it exceeds $600 is unknown. The query is supposed
to
return only those
rows whose Amount column is known to exceed $600, but that's not the case.
Here's
the ANSI version of
the same query:
SELECT
customers.CustomerNumber+2, orders.Amount
FROM customers LEFT
OUTER JOIN orders ON
(customers.CustomerNumber+2=orders.CustomerNumber)
WHERE
orders.Amount>600
CustomerNumber
Amount
--------------
--------
3 86753.09
The SQL-92 syntax
correctly omits the rows with a NULL Amount. The reason the legacy query fails
here is
that the predicates
in its WHERE clause are evaluated together. When Amount is checked against the >600
predicate, it has not yet been returned as NULL, so it's erroneously
included in the result set.
By the time it's set
to NULL, it's already in the result set, effectively negating the >600 predicate.
Though the inner
join syntax you choose is largely a matter a preference, you should still
use the SQL-92
syntax whenever
possible. It's hard enough keeping up with a single way of joining tables, let
alone two
different ways. And,
as we've seen, there are some real problems with the legacy outer join syntax.
Moreover, Microsoft
strongly recommends the use of the ANSI syntax and has publicly stated that the
legacy outer join syntax
will be dropped in a future release of the product. Jumping on the ANSI/ISO
bandwagon also makes
sense from another perspective: interoperability. Given the way in which the
DBMS world—like the
real world—is shrinking, it's not unusual for an application to communicate
with or
rely upon more than
one vendor's DBMS. Heterogeneous joins, passthrough queries, and vendor-to
vendor replication
are now commonplace. Knowing this, it makes sense to abandon proprietary syntax
elements in favor of
those that play well with others.
Other Types of Joins
Thus far, we've
explored only left joins—both inner and outer. There are a few others that are
worth
mentioning as well.
Transact-SQL also supports RIGHT OUTER JOINs, CROSS JOINs, and FULL OUTER
JOINs.
A RIGHT OUTER JOIN
isn't really that different from a LEFT OUTER JOIN. In fact, it's really just a
LEFT
OUTER JOIN with the
tables reversed. It's very easy to restate a LEFT OUTER JOIN as a RIGHT OUTER
JOIN. Here's the
earlier LEFT OUTER JOIN query restated:
SELECT
customers.CustomerNumber+2, orders.Amount
FROM orders RIGHT
OUTER JOIN customers ON
(customers.CustomerNumber+2=orders.CustomerNumber)
Amount
------ --------
3 86753.09
4 NULL
5 NULL
A RIGHT JOIN returns
the columns in the first table as NULLs when the join condition fails. Since
you
Decide which table
is the first table and which one's the second, whether you use a LEFT JOIN or a
RIGHT
JOIN is largely a
matter a preference.
A CROSS JOIN, by
contrast, is an intentional Cartesian product. The size of a Cartesian product
is the
number of rows in
one table multiplied by those in the other. So for two tables with three rows
each, their
CROSS JOIN or
Cartesian product would consist of nine rows. By definition, CROSS JOINs don't
need or
support the use of
the ON clause that other joins require. Here's a CROSS JOIN of the customers
and
orders tables:
SELECT
customers.CustomerNumber, orders.Amount
FROM orders CROSS
JOIN customers
Guru’s Guide to
Transact-SQL
14
CustomerNumber
Amount
--------------
--------
1 123.45
1 678.90
1 86753.09
1 678.90
2 123.45
2 678.90
2 86753.09
2 678.90
3 123.45
3 678.90
3 86753.09
3 678.90
(12 row(s) affected)
A FULL OUTER JOIN
returns rows from both tables regardless of whether the join condition
succeeds.
When a join column
in the first table fails to find a match in the second, the values from the
second table
are returned as NULL,
just as they are with a LEFT OUTER JOIN. When the join column in the second
table
fails to find a
matching value in the first table, columns in the first table are returned as
NULL, as they are
in a RIGHT OUTER
JOIN. You can think of a FULL OUTER JOIN as the combination of a LEFT JOIN and
a
RIGHT JOIN.
Here's the earlier
LEFT OUTER JOIN restated as a FULL OUTERJOIN:
SELECT
customers.CustomerNumber+2, orders.Amount
FROM customers FULL
OUTER JOIN orders ON
(customers.CustomerNumber+2=orders.CustomerNumber)
Amount
------ --------
3 86753.09
4 NULL
5 NULL
NULL 123.45
NULL 678.90
NULL 678.90
TSQL Tutorial..
No comments:
Post a Comment