The HashDistinctOp operator does not require its input set to be sorted on the distinct key columns, and it is a non-blocking operator. Rows are read from the child operator and hashed on the distinct key columns; the hash value determines the row's bucket position. The corresponding bucket is searched to see whether the key already exists. If the row contains a duplicate key, it is discarded and another row is fetched from the child operator. If no duplicate distinct key exists, the row is added to the bucket and passed up to the parent operator for further processing. Rows are not returned sorted on the distinct key columns.
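The per-row control flow can be summarized with a short sketch. The following illustrative Python is not ASE source code; child_rows, key_cols, and the bucket count of 1024 are made-up names and values. It shows each incoming row being hashed to a bucket, checked for a duplicate key, and either discarded or passed up immediately:

# Minimal sketch of streaming hash-distinct, assuming child_rows is an
# iterable of rows and key_cols names the distinct key columns
# (both hypothetical).
def hash_distinct(child_rows, key_cols):
    buckets = {}                       # hash table keyed on bucket position
    for row in child_rows:             # non-blocking: one row at a time
        key = tuple(row[c] for c in key_cols)
        bucket = buckets.setdefault(hash(key) % 1024, set())
        if key in bucket:              # duplicate key: discard the row
            continue
        bucket.add(key)                # new key: remember it ...
        yield row                      # ... and pass the row up at once

Because each unique row is yielded as soon as it is seen, the parent operator can begin consuming results before the child's input is exhausted, which is what makes the operator non-blocking.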
The HashDistinctOp operator is generally used when the input set is not already sorted on the distinct key columns, or when the optimizer cannot exploit, later in the plan, the ordering that distinct processing would provide.
1> select distinct au_lname, au_fname
2> from authors a
3> where city = "Oakland"
4> go

QUERY PLAN FOR STATEMENT 1 (at line 1).

2 operator(s) under root

The type of query is SELECT.

ROOT:EMIT Operator

|HASH DISTINCT Operator
| Using Worktable1 for internal storage.
|
| |SCAN Operator
| |  FROM TABLE
| |  authors
| |  a
| |  Table Scan.
| |  Forward Scan.
| |  Positioning at start of table.
| |  Using I/O Size 2 Kbytes for data pages.
| |  With LRU Buffer Replacement Strategy for data pages.
In this example, the output of the authors table scan is not sorted, so the optimizer can choose either a SortOp (Distinct) or a HashDistinctOp strategy. Because the ordering provided by a SortOp (Distinct) strategy is not useful anywhere else in this plan, the optimizer will probably choose a HashDistinctOp strategy. The decision is ultimately based on cost estimates. The HashDistinctOp is typically less expensive for unsorted input streams because it acts as a sieve, eliminating duplicate rows on the fly for resident partitions, whereas the SortOp (Distinct) operator cannot eliminate any rows until the entire data set has been sorted.
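To make the contrast concrete, here is a sketch of the blocking sort-based alternative, again in illustrative Python with the same hypothetical names; note that no row can be discarded or returned until the entire input has been read and sorted, after which duplicate keys are adjacent:

# Sketch of blocking sort-based distinct, for contrast with the
# streaming hash_distinct sketch above (not ASE source code).
def sort_distinct(child_rows, key_cols):
    # Blocking step: the whole input must be materialized and sorted
    # before the first row can be produced.
    rows = sorted(child_rows, key=lambda r: tuple(r[c] for c in key_cols))
    prev = object()                    # sentinel unequal to any key
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key != prev:                # first row of each key group
            prev = key
            yield row                  # output is sorted on the key columns

The sort-based strategy does deliver its rows sorted on the distinct key columns, which is why the optimizer prefers it only when that ordering can be reused elsewhere in the plan.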
The showplan output for the HashDistinctOp operator reports that Worktable1 will be used. The worktable is needed in case the distinct result set does not fit in memory; in that case, partially processed groups are spilled to disk.
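A rough sketch of this spill behavior follows, under the assumption of a fixed in-memory key budget (max_keys is a hypothetical parameter, and the temporary file merely stands in for Worktable1). For brevity the spilled rows are deduplicated in a single second pass; a real operator would partition the spilled rows by hash value and recurse so that each partition fits in memory:

# Sketch of hash-distinct with spilling (illustrative, not ASE source).
import pickle
import tempfile

def hash_distinct_with_spill(child_rows, key_cols, max_keys=10000):
    seen = set()                      # in-memory distinct keys
    spill = tempfile.TemporaryFile()  # stands in for Worktable1
    for row in child_rows:
        key = tuple(row[c] for c in key_cols)
        if key in seen:
            continue                  # duplicate of a resident key
        if len(seen) < max_keys:
            seen.add(key)
            yield row                 # resident partition: emit on the fly
        else:
            pickle.dump(row, spill)   # overflow: defer to the spill pass
    spill.seek(0)
    while True:                       # second pass over the spilled rows
        try:
            row = pickle.load(spill)
        except EOFError:
            break
        key = tuple(row[c] for c in key_cols)
        if key not in seen:
            seen.add(key)
            yield row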