Wednesday 23 October 2013

Using Analytic Functions in Oracle9i

Overview
Oracle has made analytic functions available starting with Oracle 8i (8.1.6). This article documents the usage and syntax surrounding the new analytic functions found in Oracle9i. Analytic functions are designed to address problems like:

  • Calculate a running total
  • Find percentages within a group
  • Top-N queries
  • Compute moving averages
  • and so many more...
Although most of these types of problems can be solved in other programming languages like PL/SQL and Java, the performance is often not very effiecient. Analytic functions add extensions to the SQL language to not only make these operations easier to code; they make them faster than could be achieved with pure SQL or PL/SQL. These extensions are currently under review by the ANSI SQL committee for inclusion in the SQL specification.
Analytic functions enable you to compute aggregate values for a specific group of rows. Groups are formed using the new WITHIN GROUP clause. Enhancements related to analytic functions new to Oracle9i are explained below. In this section of the document, all examples use the all famous EMP / DEPT tables. I included the DDL script (scott.sql) to create these tables in the SQL Scripts repository.
As mentioned above, analytic functions compute an aggregate value based on a group of rows. They differ from aggregate functions in that they return multiple rows for each group. The group of rows is called a window and is defined by an analytic clause. For each row, a sliding window of rows is defined. This window determines the range of rows used to perform the calculations for the current row. Window sizes can be based on either a physical number of rows or a logical interval such as time.
Analytic functions are the last set of operations performed in a query except for the final ORDER BY clause. All joins and all WHERE, GROUP BY, and HAVING clauses are completed BEFORE the analytic functions are processed. Therefore, analytic functions can appear only in the SELECT list or ORDER BY clause.
The Syntax

Analytic-Function(,,...)
OVER (
  
  
  
)

  • Analytic-Function Specify the name of an analytic function. Oracle provides many analytic functions such as:
    • AVG
    • CORR
    • COVAR_POP
    • COVAR_SAMP
    • COUNT
    • CUME_DIST
    • DENSE_RANK
    • FIRST
    • FIRST_VALUE
    • LAG
    • LAST
    • LAST_VALUE
    • LEAD
    • MAX
    • MIN
    • NTILE
    • PERCENT_RANK
    • PERCENTILE_CONT
    • PERCENTILE_DISC
    • RANK
    • RATIO_TO_REPORT
    • STDDEV
    • STDDEV_POP
    • STDDEV_SAMP
    • SUM
    • VAR_POP
    • VAR_SAMP
    • VARIANCE
  • Arguments Analytic functions take 0 to 3 arguments.
  • Query-Partition-Clause The PARTITION BY clause logically breaks a single result set into N groups, according to the criteria set by the partition expressions. The words "PARTITION" and "GROUP" are used synonymously here. The analytic functions are applied to each group independently, they are reset for each group.
  • Order-By-Clause The ORDER BY clause specifies how the data is sorted within each group (PARTITION). This will definitely affect the outcome of any analytic function.
  • Windowing-Clause The windowing clause gives us a way to define a sliding or anchored window of data, on which the analytic function will operate, within a group. This clause can be used to have the analytic function compute its value based on any arbitrary sliding or anchored window within a group.



Distribution Functions
The first analytical function to look at is the cumulative distribution function (CUME_DIST). Cumulative distribution is an analytic function that computes the relative position of a specified value in a group of values. For a row R, assuming ascending ordering, the CUME_DIST of R is the number of rows with values lower than or equal to the value of R divided by the number of rows being evaluated (the entire query result set or a partition). The range of values returned by CUME_DIST is >0 to <=1. Tie values always evaluate to the same cumulative distribution value. You can use CUME_DIST as an aggregate function or as an analytic function. When you use CUME_DIST as an aggregate function, it determines the relative position of an apparent row of information within an existing group of rows. CUME_DIST as an analytic function determines the relative position of a specific value within a group of values.
The following example calculates the salary percentile for each employee within each job category excluding job categories PRESIDENT and MANAGER. In the examle below, 50% of CLERKS have salaries less than or equal to JAMES.
SQL> SELECT job, ename, sal, CUME_DIST()
     OVER (PARTITION BY job ORDER BY sal) AS cume_dist
     FROM emp
     WHERE job NOT IN ('MANAGER', 'PRESIDENT'); 


JOB       ENAME             SAL  CUME_DIST
--------- ---------- ---------- ----------
ANALYST   SCOTT            3000          1
ANALYST   FORD             3000          1
CLERK     SMITH             800        .25
CLERK     JAMES             950         .5
CLERK     ADAMS            1100        .75
CLERK     MILLER           1300          1
SALESMAN  WARD             1250         .5
SALESMAN  MARTIN           1250         .5
SALESMAN  TURNER           1500        .75
SALESMAN  ALLEN            1600          1

10 rows selected.



Inverse Distribution Functions
Oracle9i introduces the PERCENTILE_CONT and PERCENTILE_DIST which are both inverse distribution functions. The PERCENTILE_CONT function is an inverse distribution function that assumes a continuous distribution model. It takes a percentile value and a sort specification, and returns an interpolated value that would fall into that percentile value with respect to the sort specification. Nulls are ignored in the calculation.
The PERCENTILE_DISC function is an inverse distribution function that assumes a discrete distribution model. It takes a percentile value and a sort specification and returns an element from the set. Nulls are ignored in the calculation.
Both functions, given a percentile and a sequence, will determine the interpolated value at that percentile. In both functions, interpolation ignores NULL values. All interpolated values will be between 0 and 1.
Aggregate Example
The following example computes the median salary in each department:
SQL> SELECT deptno,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sal DESC)
          "Median cont"
      , PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY sal DESC)
          "Median disc"
    FROM emp
        GROUP BY deptno;


    DEPTNO Median cont Median disc
---------- ----------- -----------
        10        2450        2450
        20        2975        2975
        30        1375        1500
Analytic Example - (PERCENTILE_CONT)
In the following example, the median for Department 30 is 1375, which has a corresponding percentile (Percent_Rank) of 0.5. None of the salaries in Department 30 have a percentile of 0.5, so the median value must be interpolated between 2400 (percentile 0.4) and 2500 (percentile 0.6), which evaluates to 2450.
SQL> SELECT
         ename
       , sal
       , deptno
       , PERCENTILE_CONT(0.5)
             WITHIN GROUP (ORDER BY sal DESC)
             OVER (PARTITION BY deptno) "Percentile_Cont"
       , PERCENT_RANK()
             OVER (PARTITION BY deptno 
                   ORDER BY sal DESC)   "Percent_Rank"
      FROM emp WHERE deptno IN (10, 30);


ENAME             SAL     DEPTNO Percentile_Cont Percent_Rank
---------- ---------- ---------- --------------- ------------
KING             5000         10            2450            0
CLARK            2450         10            2450           .5
MILLER           1300         10            2450            1
BLAKE            2850         30            1375            0
ALLEN            1600         30            1375           .2
TURNER           1500         30            1375           .4
WARD             1250         30            1375           .6
MARTIN           1250         30            1375           .6
JAMES             950         30            1375            1

9 rows selected.
Analytic Example - (PERCENTILE_DISC)
The following example calculates the median discrete percentile of the salary of each employee in the sample table emp.
The median value for Department 10 is 2450, which is the value whose corresponding percentile (Cume_Dist) is the smallest value greater than or equal to 0.5. The median value for Department 30 is 1500, which is the value whose corresponding percentile is the smallest value greater than or equal to 0.5.
SQL> SELECT
         ename
       , sal
       , deptno
       , PERCENTILE_DISC(0.5)
             WITHIN GROUP (ORDER BY sal DESC)
             OVER (PARTITION BY deptno) "Percentile_Disc"
       , CUME_DIST()
             OVER (PARTITION BY deptno 
                   ORDER BY sal DESC) "Cume_Dist"
     FROM emp where deptno in (10, 30);


ENAME             SAL     DEPTNO Percentile_Disc  Cume_Dist
---------- ---------- ---------- --------------- ----------
KING             5000         10            2450 .333333333
CLARK            2450         10            2450 .666666667
MILLER           1300         10            2450          1
BLAKE            2850         30            1500 .166666667
ALLEN            1600         30            1500 .333333333
TURNER           1500         30            1500         .5
WARD             1250         30            1500 .833333333
MARTIN           1250         30            1500 .833333333
JAMES             950         30            1500          1

9 rows selected.



Top-N Queries
Using Top-N Queries, you are able to get the Top-N records by some set of fields. Let's take a look at several examples.
Example 1
This example will sort the sales people by salary from greatest to least. Give the first three rows. If there are less then three people in a department, this will return less than three records.
set autotrace on explain
break on deptno skip 1

SELECT *
FROM (
  SELECT
      deptno
    , ename
    , sal
    , ROW_NUMBER()
  OVER (
    PARTITION BY deptno ORDER BY sal DESC
  ) Top3 FROM emp
)
WHERE Top3 <= 3;


    DEPTNO ENAME             SAL       TOP3
---------- ---------- ---------- ----------
        10 KING             5000          1
           CLARK            2450          2
           MILLER           1300          3

        20 SCOTT            3000          1
           FORD             3000          2
           JONES            2975          3

        30 BLAKE            2850          1
           ALLEN            1600          2
           TURNER           1500          3

9 rows selected.


Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE
   1    0   VIEW
   2    1     WINDOW (SORT PUSHED RANK)
   3    2       TABLE ACCESS (FULL) OF 'EMP'
This query works by sorting each partition (or group, which is the deptno), in a descending order, based on the salary column and then assigning a sequential row number to each row in the group as it is processed. The use of a WHERE clause after doing this to get just the first three rows in each partition. Example 2
In this example, I want to see the set of sales people who make the top 3 salaries - that is, find the set of distinct salary amounts, sort them, take the largest three, and list everyone who makes one of those values.
SELECT *
FROM (
  SELECT
      deptno
    , ename
    , sal
    ,DENSE_RANK()
  OVER (
    PARTITION BY deptno ORDER BY sal desc
  ) TopN FROM emp
)
WHERE TopN <= 3
ORDER BY deptno, sal DESC;

    DEPTNO ENAME             SAL       TOPN
---------- ---------- ---------- ----------
        10 KING             5000          1
           CLARK            2450          2
           MILLER           1300          3

        20 SCOTT            3000          1  <--- --------------------------------------------="" 0="" 10="" 1100="" 1500="" 1600="" 1="" 2850="" 2975="" 2="" 3000="" 30="" 3="" access="" adams="" allen="" blake="" execution="" ford="" jones="" of="" optimizer="CHOOSE" plan="" pre="" pushed="" rank="" rows="" select="" selected.="" statement="" table="" turner="" view="" window="">


    In the example above, the DENSE_RANK function was used to get 
    the top three salaries. We assigned the dense rank to the salary 
    column and sorted it in a descending order.
    
    The DENSE_RANK() function computes the rank of a row in an ordered 
    group of rows. The ranks are consecutive integers beginning with 1. 
    The largest rank value is the number of unique values returned by the 
    query. Rank values are not skipped in the event of ties. Rows with equal 
    values for the ranking criteria receive the same rank.
    
The DENSE_RANK function does not skip numbers and will assign the same number to those rows with the same value. Hence, after the result set is built in the inline view, we can simply select all of the rows with a dense rank of three or less, this gives us everyone who makes the top three salaries by department number.

No comments:

Post a Comment