The HASH
Distinct
operator does not require that its
input set be sorted on the distinct key columns. It is a nonblocking
operator. Rows are read from the child operator and are hashed on
the distinct key columns. This determines the row’s bucket position. The
corresponding bucket is searched to see if the key already exists.
The row is discarded if it contains a duplicate key, and another row
is fetched from the child operator. The row is added to the bucket
if no duplicate distinct key already exists and the row is passed
up to the parent operator for further processing. Rows are not returned
sorted on the distinct key columns.
The HASH Distinct
operator
is generally used when the input set is not already sorted on the
distinct key columns or when the optimizer cannot use the ordering
coming out of the distinct processing later in the plan.
select distinct au_lname, au_fname from authors where city = "Oakland" go QUERY PLAN FOR STATEMENT 1 (at line 1). STEP 1 The type of query is SELECT. 2 operator(s) under root ROOT:EMIT Operator (VA = 2) |HASH DISTINCT Operator (VA = 1) | Using Worktable1 for internal storage. | | | SCAN Operator (VA = 0) | | FROM TABLE | | authors | | Table Scan. | | Forward Scan. | | Positioning at start of table. | | Using I/O Size 2 Kbytes for data pages. | | With LRU Buffer Replacement Strategy for data pages.
In this example, the output of the authors table
scan is not sorted. The optimizer can choose either a SORT
Distinct
or HASH Distinct
operator
strategy. The ordering provided by a SORT Distinct
strategy
is not useful anywhere else in the plan, so the optimizer will probably
choose a HASH Distinct
strategy.
The optimizer’s decision is ultimately based on cost estimates.
The HASH Distinct
is typically
less expensive for unsorted input streams can eliminate rows on
the fly for resident partitions. The SORT Distinct
operator cannot
eliminate any rows until the entire data set has been sorted.
The showplan output for the HASH
Distinct
operator reports that Worktable1 will
be used. A worktable is needed in case the distinct row result set
cannot fit in memory. In that case, partially processed groups are
written to disk.