Overview
Distribution Functions
Inverse Distribution Functions
Top-N Queries
Oracle has made analytic functions available starting with Oracle 8i (8.1.6). This article documents the usage and syntax surrounding the new analytic functions found in Oracle9i. Analytic functions are designed to address problems like:
Although most of these types of problems can be solved in other programming languages like PL/SQL and Java, the performance is often not very effiecient. Analytic functions add extensions to the SQL language to not only make these operations easier to code; they make them faster than could be achieved with pure SQL or PL/SQL. These extensions are currently under review by the ANSI SQL committee for inclusion in the SQL specification.
- Calculate a running total
- Find percentages within a group
- Top-N queries
- Compute moving averages
- and so many more...
Analytic functions enable you to compute aggregate values for a specific group of rows. Groups are formed using the newWITHIN GROUP
clause. Enhancements related to analytic functions new to Oracle9i are explained below. In this section of the document, all examples use the all famous EMP / DEPT tables. I included the DDL script (scott.sql
) to create these tables in the SQL Scripts repository.
As mentioned above, analytic functions compute an aggregate value based on a group of rows. They differ from aggregate functions in that they return multiple rows for each group. The group of rows is called a window and is defined by an analytic clause. For each row, a sliding window of rows is defined. This window determines the range of rows used to perform the calculations for the current row. Window sizes can be based on either a physical number of rows or a logical interval such as time.
Analytic functions are the last set of operations performed in a query except for the final ORDER BY clause. All joins and all WHERE, GROUP BY, and HAVING clauses are completed BEFORE the analytic functions are processed. Therefore, analytic functions can appear only in the SELECT list or ORDER BY clause.
The Syntax
Analytic-Function(, ,...) OVER ( )
- Analytic-Function Specify the name of an analytic function. Oracle provides many analytic functions such as:
- AVG
- CORR
- COVAR_POP
- COVAR_SAMP
- COUNT
- CUME_DIST
- DENSE_RANK
- FIRST
- FIRST_VALUE
- LAG
- LAST
- LAST_VALUE
- LEAD
- MAX
- MIN
- NTILE
- PERCENT_RANK
- PERCENTILE_CONT
- PERCENTILE_DISC
- RANK
- RATIO_TO_REPORT
- STDDEV
- STDDEV_POP
- STDDEV_SAMP
- SUM
- VAR_POP
- VAR_SAMP
- VARIANCE
- Arguments Analytic functions take 0 to 3 arguments.
- Query-Partition-Clause The PARTITION BY clause logically breaks a single result set into N groups, according to the criteria set by the partition expressions. The words "PARTITION" and "GROUP" are used synonymously here. The analytic functions are applied to each group independently, they are reset for each group.
- Order-By-Clause The ORDER BY clause specifies how the data is sorted within each group (PARTITION). This will definitely affect the outcome of any analytic function.
- Windowing-Clause The windowing clause gives us a way to define a sliding or anchored window of data, on which the analytic function will operate, within a group. This clause can be used to have the analytic function compute its value based on any arbitrary sliding or anchored window within a group.
Distribution Functions
The first analytical function to look at is the cumulative distribution function (CUME_DIST). Cumulative distribution is an analytic function that computes the relative position of a specified value in a group of values. For a row R, assuming ascending ordering, the CUME_DIST of R is the number of rows with values lower than or equal to the value of R divided by the number of rows being evaluated (the entire query result set or a partition). The range of values returned by CUME_DIST is >0 to <=1. Tie values always evaluate to the same cumulative distribution value. You can use CUME_DIST as an aggregate function or as an analytic function. When you use CUME_DIST as an aggregate function, it determines the relative position of an apparent row of information within an existing group of rows. CUME_DIST as an analytic function determines the relative position of a specific value within a group of values.
The following example calculates the salary percentile for each employee within each job category excluding job categories PRESIDENT and MANAGER. In the examle below, 50% of CLERKS have salaries less than or equal to JAMES.
SQL> SELECT job, ename, sal, CUME_DIST() OVER (PARTITION BY job ORDER BY sal) AS cume_dist FROM emp WHERE job NOT IN ('MANAGER', 'PRESIDENT'); JOB ENAME SAL CUME_DIST --------- ---------- ---------- ---------- ANALYST SCOTT 3000 1 ANALYST FORD 3000 1 CLERK SMITH 800 .25 CLERK JAMES 950 .5 CLERK ADAMS 1100 .75 CLERK MILLER 1300 1 SALESMAN WARD 1250 .5 SALESMAN MARTIN 1250 .5 SALESMAN TURNER 1500 .75 SALESMAN ALLEN 1600 1 10 rows selected.
Inverse Distribution Functions
Oracle9i introduces the PERCENTILE_CONT and PERCENTILE_DIST which are both inverse distribution functions. The PERCENTILE_CONT function is an inverse distribution function that assumes a continuous distribution model. It takes a percentile value and a sort specification, and returns an interpolated value that would fall into that percentile value with respect to the sort specification. Nulls are ignored in the calculation.
The PERCENTILE_DISC function is an inverse distribution function that assumes a discrete distribution model. It takes a percentile value and a sort specification and returns an element from the set. Nulls are ignored in the calculation.
Both functions, given a percentile and a sequence, will determine the interpolated value at that percentile. In both functions, interpolation ignores NULL values. All interpolated values will be between 0 and 1.
Aggregate Example
The following example computes the median salary in each department:
SQL> SELECT deptno, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sal DESC) "Median cont" , PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY sal DESC) "Median disc" FROM emp GROUP BY deptno; DEPTNO Median cont Median disc ---------- ----------- ----------- 10 2450 2450 20 2975 2975 30 1375 1500Analytic Example - (PERCENTILE_CONT)
In the following example, the median for Department 30 is 1375, which has a corresponding percentile (Percent_Rank) of 0.5. None of the salaries in Department 30 have a percentile of 0.5, so the median value must be interpolated between 2400 (percentile 0.4) and 2500 (percentile 0.6), which evaluates to 2450.
SQL> SELECT ename , sal , deptno , PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sal DESC) OVER (PARTITION BY deptno) "Percentile_Cont" , PERCENT_RANK() OVER (PARTITION BY deptno ORDER BY sal DESC) "Percent_Rank" FROM emp WHERE deptno IN (10, 30); ENAME SAL DEPTNO Percentile_Cont Percent_Rank ---------- ---------- ---------- --------------- ------------ KING 5000 10 2450 0 CLARK 2450 10 2450 .5 MILLER 1300 10 2450 1 BLAKE 2850 30 1375 0 ALLEN 1600 30 1375 .2 TURNER 1500 30 1375 .4 WARD 1250 30 1375 .6 MARTIN 1250 30 1375 .6 JAMES 950 30 1375 1 9 rows selected.Analytic Example - (PERCENTILE_DISC)
The following example calculates the median discrete percentile of the salary of each employee in the sample table emp.
The median value for Department 10 is 2450, which is the value whose corresponding percentile (Cume_Dist) is the smallest value greater than or equal to 0.5. The median value for Department 30 is 1500, which is the value whose corresponding percentile is the smallest value greater than or equal to 0.5.
SQL> SELECT ename , sal , deptno , PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY sal DESC) OVER (PARTITION BY deptno) "Percentile_Disc" , CUME_DIST() OVER (PARTITION BY deptno ORDER BY sal DESC) "Cume_Dist" FROM emp where deptno in (10, 30); ENAME SAL DEPTNO Percentile_Disc Cume_Dist ---------- ---------- ---------- --------------- ---------- KING 5000 10 2450 .333333333 CLARK 2450 10 2450 .666666667 MILLER 1300 10 2450 1 BLAKE 2850 30 1500 .166666667 ALLEN 1600 30 1500 .333333333 TURNER 1500 30 1500 .5 WARD 1250 30 1500 .833333333 MARTIN 1250 30 1500 .833333333 JAMES 950 30 1500 1 9 rows selected.
Top-N Queries
Using Top-N Queries, you are able to get the Top-N records by some set of fields. Let's take a look at several examples.
Example 1
This example will sort the sales people by salary from greatest to least. Give the first three rows. If there are less then three people in a department, this will return less than three records.
set autotrace on explain break on deptno skip 1 SELECT * FROM ( SELECT deptno , ename , sal , ROW_NUMBER() OVER ( PARTITION BY deptno ORDER BY sal DESC ) Top3 FROM emp ) WHERE Top3 <= 3; DEPTNO ENAME SAL TOP3 ---------- ---------- ---------- ---------- 10 KING 5000 1 CLARK 2450 2 MILLER 1300 3 20 SCOTT 3000 1 FORD 3000 2 JONES 2975 3 30 BLAKE 2850 1 ALLEN 1600 2 TURNER 1500 3 9 rows selected. Execution Plan ---------------------------------------------------------- 0 SELECT STATEMENT Optimizer=CHOOSE 1 0 VIEW 2 1 WINDOW (SORT PUSHED RANK) 3 2 TABLE ACCESS (FULL) OF 'EMP'This query works by sorting each partition (or group, which is the deptno), in a descending order, based on the salary column and then assigning a sequential row number to each row in the group as it is processed. The use of a WHERE clause after doing this to get just the first three rows in each partition. Example 2
In this example, I want to see the set of sales people who make the top 3 salaries - that is, find the set of distinct salary amounts, sort them, take the largest three, and list everyone who makes one of those values.
SELECT * FROM ( SELECT deptno , ename , sal ,DENSE_RANK() OVER ( PARTITION BY deptno ORDER BY sal desc ) TopN FROM emp ) WHERE TopN <= 3 ORDER BY deptno, sal DESC; DEPTNO ENAME SAL TOPN ---------- ---------- ---------- ---------- 10 KING 5000 1 CLARK 2450 2 MILLER 1300 3 20 SCOTT 3000 1 <--- --------------------------------------------="" 0="" 10="" 1100="" 1500="" 1600="" 1="" 2850="" 2975="" 2="" 3000="" 30="" 3="" access="" adams="" allen="" blake="" execution="" ford="" jones="" of="" optimizer="CHOOSE" plan="" pre="" pushed="" rank="" rows="" select="" selected.="" statement="" table="" turner="" view="" window=""> In the example above, the DENSE_RANK function was used to get the top three salaries. We assigned the dense rank to the salary column and sorted it in a descending order. The DENSE_RANK() function computes the rank of a row in an ordered group of rows. The ranks are consecutive integers beginning with 1. The largest rank value is the number of unique values returned by the query. Rank values are not skipped in the event of ties. Rows with equal values for the ranking criteria receive the same rank.
The DENSE_RANK function does not skip numbers and will assign the same number to those rows with the same value. Hence, after the result set is built in the inline view, we can simply select all of the rows with a dense rank of three or less, this gives us everyone who makes the top three salaries by department number. --->
No comments:
Post a Comment