SQL Server Interview Questions

Joins

A query that can access all the data it needs in a single table is a pretty rare one. John Donne said "no

man is an island," and, in relational databases, no table is, either. Usually, a query will have to go to two

or more tables to find all the information it requires. This is the way of things with relational databases.

Data is intentionally spread out to keep it as modular as possible. There are lots of good reasons for this

modularization (formally known as normalization) that I won't go into here, but one of its downsides is

that what might be a single conceptual entity (an invoice, for example) is often split into multiple physical

entities when constructed in a relational database.

Dealing with this fragmentation is where joins come in. A join consolidates the data in two tables into a

Single result set. The tables aren't actually merged; they just appear to be in the rows returned by the

query. Multiple joins can consolidate multiple tables—it's quite common to see joins that are multiple

levels deep involving scads of tables.

A join between two tables is established by linking a column or columns in one table with those in another

(CROSS JOINs are an exception, but more on them later). The expression used to join the two tables

constitutes the join condition or join criterion. When the join is successful, data in the second table is

combined with the first to form a composite result set—a set of rows containing data from both tables. In

short, the two tables have a baby, albeit an evanescent one.

There are two basic types of joins, inner joins and outer joins. The key difference between them is that

Outer joins include rows in the result set even when the join condition isn't met, while an inner join

doesn't. How is this? What data ends up in the result set when the join condition fails? When the join

criteria in an outer join aren't met, columns in the first table are returned normally, but columns from the

second table are returned with no value—as NULLs. This is handy for finding missing values and broken

links between tables.

There are two families of syntax for constructing joins—legacy and ANSI/ISO SQL-92 compliant. The

Legacy syntax dates back to SQL Server's days as a joint venture between Sybase and Microsoft. It's

more succinct than the ANSI syntax and looks like this:

SELECT customers.CustomerNumber, orders.Amount

FROM customers, orders

WHERE customers.CustomerNumber=orders.CustomerNumber

CustomerNumber Amount

-------------- --------

1 123.45

2 678.90

3 86753.09

1 678.90

Note the use of the WHERE clause to join the customers and orders tables together. This is an inner join.

If an order doesn't exist for a given customer, that customer is omitted completely from the list. Here's

the ANSI version of the same query:

SELECT customers.CustomerNumber, orders.Amount

FROM customers JOIN orders ON (customers.CustomerNumber=orders.CustomerNumber)

This one's a bit loquacious, but the end result is the same: customers and orders are merged using their

respective CustomerNumber columns.

As I mentioned earlier, it's common for queries to construct multilevel joins. Here's an example of a

Multilevel join that uses the legacy syntax:

SELECT customers.CustomerNumber, orders.Amount, items.Description

FROM customers, orders, items

WHERE customers.CustomerNumber=orders.CustomerNumber

AND orders.ItemNumber=items.ItemNumber

CustomerNumber Amount Description

-------------- -------- -----------

1 123.45 WIDGET A

2 678.90 WIDGET B

3 86753.09 WIDGET C

1 678.90 WIDGET B

This query joins the composite of the customers table and the orders table with the items table. Note that

The exact ordering of the WHERE clause is unimportant. In order to allow servers to fully optimize queries,

SQL requires that the ordering of the predicates in a WHERE clause must not affect the result set. They

must be associative—the query must return the same result regardless of the order in which they're

processed.

As with the two-table join, the ANSI syntax for multitable inner joins is similar to the legacy syntax. Here's

The ANSI syntax for the multitable join above:

SELECT customers.CustomerNumber, orders.Amount, items.Description

FROM customers JOIN orders ON (customers.CustomerNumber=orders.CustomerNumber)

JOIN items ON (orders.ItemNumber=items.ItemNumber)

Again, it's a bit wordier, but it performs the same function.

Outer Joins

Thus far, there hasn't been a stark contrast between the ANSI and legacy join syntaxes. Though not

syntactically identical, they seem to be functionally equivalent.

This all changes with outer joins. The ANSI outer join syntax addresses ambiguities inherent in using the

WHERE clause—whose terms are by definition associative—to perform table joins. Here's an example of

The legacy syntax that contains such ambiguities:

-- Bad SQL - Don't run

SELECT customers.CustomerNumber, orders.Amount, items.Description

FROM customers, orders, items

WHERE customers.CustomerNumber*=orders.CustomerNumber

AND orders.ItemNumber*=items.ItemNumber

Don't bother trying to run this—SQL Server won't allow it. Why? Because WHERE clause terms are

Required to be associative, but these aren't. If customers and orders are joined first, those rows where a

Customer exists but has no orders will be impossible to join with the items table since their ItemNumber

column will be NULL. On the other hand, if orders and items are joined first, the result set will include

ITEM records it likely would have otherwise missed. So the order of the terms in the WHERE clause is

significant when constructing multilevel joins using the legacy syntax.

It's precisely because of this ambiguity—whether the ordering of WHERE clause predicates is significant—

that the SQL-92 standard moved join construction to the FROM clause. Here's the above query rewritten

using valid ANSI join syntax:

SELECT customers.CustomerNumber, orders.Amount, items.Description

FROM customers LEFT OUTER JOIN orders ON

(customers.CustomerNumber=orders.CustomerNumber)

LEFT OUTER JOIN items ON (orders.ItemNumber=items.ItemNumber)

CustomerNumber Amount Description

-------------- -------- -----------

1 123.45 WIDGET A

1 678.90 WIDGET B

2 678.90 WIDGET B

3 86753.09 WIDGET C

Here, the ambiguities are gone, and it's clear that the query is first supposed to join the customers and

orders tables, then join the result with the items table. (Note that the OUTER keyword is optional.)

To understand how this shortcoming in the legacy syntax can affect query results, consider the following

query. We'll set it up initially so that the outer join works as expected:

SELECT customers.CustomerNumber, orders.Amount

FROM customers, orders

WHERE customers.CustomerNumber*=orders.CustomerNumber

AND orders.Amount>600

CustomerNumber Amount

-------------- --------

1 678.90

2 678.90

3 86753.09

Since every row in customers finds a match in orders, the problem isn't obvious. Now let's change the

query so that there are a few mismatches between the tables, like so:

SELECT customers.CustomerNumber+2, orders.Amount

FROM customers, orders

WHERE customers.CustomerNumber+2*=orders.CustomerNumber

AND orders.Amount>600

This version simply adds 2 to CustomerNumber to ensure that at least a few of the joins will fail and the

columns in orders will be returned as NULLs. Here's the result set:

CustomerNumber Amount

-------------- --------

3 86753.09

4 NULL

5 NULL

See the problem? Those last two rows shouldn't be there. Amount is NULL in those rows (because there

are no orders for customers4 and5), and whether it exceeds $600 is unknown. The query is supposed to

return only those rows whose Amount column is known to exceed $600, but that's not the case. Here's

the ANSI version of the same query:

SELECT customers.CustomerNumber+2, orders.Amount

FROM customers LEFT OUTER JOIN orders ON

(customers.CustomerNumber+2=orders.CustomerNumber)

WHERE orders.Amount>600

CustomerNumber Amount

-------------- --------

3 86753.09

The SQL-92 syntax correctly omits the rows with a NULL Amount. The reason the legacy query fails here is

that the predicates in its WHERE clause are evaluated together. When Amount is checked against the >600 predicate, it has not yet been returned as NULL, so it's erroneously included in the result set.

By the time it's set to NULL, it's already in the result set, effectively negating the >600 predicate.

Though the inner join syntax you choose is largely a matter a preference, you should still use the SQL-92

syntax whenever possible. It's hard enough keeping up with a single way of joining tables, let alone two

different ways. And, as we've seen, there are some real problems with the legacy outer join syntax.

Moreover, Microsoft strongly recommends the use of the ANSI syntax and has publicly stated that the

legacy outer join syntax will be dropped in a future release of the product. Jumping on the ANSI/ISO

bandwagon also makes sense from another perspective: interoperability. Given the way in which the

DBMS world—like the real world—is shrinking, it's not unusual for an application to communicate with or

rely upon more than one vendor's DBMS. Heterogeneous joins, passthrough queries, and vendor-to

vendor replication are now commonplace. Knowing this, it makes sense to abandon proprietary syntax

elements in favor of those that play well with others.

Other Types of Joins

Thus far, we've explored only left joins—both inner and outer. There are a few others that are worth

mentioning as well. Transact-SQL also supports RIGHT OUTER JOINs, CROSS JOINs, and FULL OUTER

JOINs.

A RIGHT OUTER JOIN isn't really that different from a LEFT OUTER JOIN. In fact, it's really just a LEFT

OUTER JOIN with the tables reversed. It's very easy to restate a LEFT OUTER JOIN as a RIGHT OUTER

JOIN. Here's the earlier LEFT OUTER JOIN query restated:

SELECT customers.CustomerNumber+2, orders.Amount

FROM orders RIGHT OUTER JOIN customers ON

(customers.CustomerNumber+2=orders.CustomerNumber)

Amount

------ --------

3 86753.09

4 NULL

5 NULL

A RIGHT JOIN returns the columns in the first table as NULLs when the join condition fails. Since you

Decide which table is the first table and which one's the second, whether you use a LEFT JOIN or a RIGHT

JOIN is largely a matter a preference.

A CROSS JOIN, by contrast, is an intentional Cartesian product. The size of a Cartesian product is the

number of rows in one table multiplied by those in the other. So for two tables with three rows each, their

CROSS JOIN or Cartesian product would consist of nine rows. By definition, CROSS JOINs don't need or

support the use of the ON clause that other joins require. Here's a CROSS JOIN of the customers and

orders tables:

SELECT customers.CustomerNumber, orders.Amount

FROM orders CROSS JOIN customers

Guru’s Guide to Transact-SQL

CustomerNumber Amount

-------------- --------

1 123.45

1 678.90

1 86753.09

1 678.90

2 123.45

2 678.90

2 86753.09

2 678.90

3 123.45

3 678.90

3 86753.09

3 678.90

(12 row(s) affected)

A FULL OUTER JOIN returns rows from both tables regardless of whether the join condition succeeds.

When a join column in the first table fails to find a match in the second, the values from the second table

are returned as NULL, just as they are with a LEFT OUTER JOIN. When the join column in the second table

fails to find a matching value in the first table, columns in the first table are returned as NULL, as they are

in a RIGHT OUTER JOIN. You can think of a FULL OUTER JOIN as the combination of a LEFT JOIN and a

RIGHT JOIN.

Here's the earlier LEFT OUTER JOIN restated as a FULL OUTERJOIN:

SELECT customers.CustomerNumber+2, orders.Amount

FROM customers FULL OUTER JOIN orders ON

(customers.CustomerNumber+2=orders.CustomerNumber)

Amount

------ --------

3 86753.09

4 NULL

5 NULL

NULL 123.45

NULL 678.90

TSQL Tutorial..

Pages

cross

Saturday, November 08, 2014

TSQL Tutorial : Joins

No comments:

Post a Comment