**What is?**
*SQL* is a very capable language and there are very few questions that it cannot answer. I find that I can come up with some convoluted SQL query to answer virtually any question you could ask from the data.
However, the performance of some of these queries is not what it should be -
nor is the query itself easy to write in the first place. Some of the things that are hard to do in straight SQL are actually very commonly requested operations, including:
*:Calculate a running total - Show the cumulative salary within a department row
by row, with each row including a summation of the prior rows' salary.
*:Find percentages within a group - Show the percentage of the total salary paid
to an individual in a certain department. Take their salary and divide it by the sum of the salary in the department.
*:Top-N queries - Find the top N highest-paid people or the top N sales by
region.
*:Compute a moving average - Average the current row's value and the previous N
rows values together.
*:Perform ranking queries - Show the relative rank of an individual's salary
within their department.
Both the "Top-N" and ranking queries could perhaps be implemented by simply having a result row number be returned (after sorting). The row number can then be used to calculate positional-based info. It would be similar to Oracle's "rownum" pseudo-column, but without the limits of Oracle's take.
Analytic functions, are designed to address these issues. They add extensions
to the SQL language that not only make these operations easier to code; they make them faster than could be achieved with the pure SQL approach. These extensions are currently under review by the ANSI SQL committee for inclusion in the SQL specification.
The syntax of the analytic function is rather straightforward in appearance,
but looks can be deceiving.
It starts with:
FUNCTION_NAME(,,?)
OVER
( )
*::The PARTITION BY clause logically breaks a single result set into N groups,
according to the criteria set by the partition expressions. The words 'partition' and 'group' are used synonymously.
*::The ORDER BY clause specifies how the data is sorted within each group
(partition).
*::The windowing clause gives us a way to define a sliding or anchored window of
data, on which the analytic function will operate, within a group. This clause can be used to have the analytic function compute its value based on any arbitrary sliding or anchored window within a group.
_:Example:
_:This example shows how to use the analytical function *SUM* to perform a
cumulative sum. First, we fill some values in a table. The table is very simple and consists of the field _dt_ and _xy_ only. Note, that for a given date it is possible to insert multiple rows which is exactly what I do here. What I am interested is to extract the cumulative sum for each day in the table. That is, if I have three entries for the same date, for example 3, 4 and 5, I don't want the sum to only be 3+4+5 for each row, but 3 for the first row, 3+4 for the second row and 3+4+5 for the third row.
create table sum_example (
dt date,
xy number
);
insert into sum_example values (to_date('27.08.1970','DD.MM.YYYY'),4);
insert into sum_example values (to_date('02.09.1970','DD.MM.YYYY'),1);
insert into sum_example values (to_date('09.09.1970','DD.MM.YYYY'),5);
insert into sum_example values (to_date('26.08.1970','DD.MM.YYYY'),3);
insert into sum_example values (to_date('28.08.1970','DD.MM.YYYY'),4);
insert into sum_example values (to_date('26.08.1970','DD.MM.YYYY'),6);
insert into sum_example values (to_date('29.08.1970','DD.MM.YYYY'),9);
insert into sum_example values (to_date('30.08.1970','DD.MM.YYYY'),2);
insert into sum_example values (to_date('12.09.1970','DD.MM.YYYY'),7);
insert into sum_example values (to_date('23.08.1970','DD.MM.YYYY'),2);
insert into sum_example values (to_date('27.08.1970','DD.MM.YYYY'),5);
insert into sum_example values (to_date('09.09.1970','DD.MM.YYYY'),9);
insert into sum_example values (to_date('01.09.1970','DD.MM.YYYY'),3);
insert into sum_example values (to_date('07.09.1970','DD.MM.YYYY'),1);
insert into sum_example values (to_date('12.09.1970','DD.MM.YYYY'),4);
insert into sum_example values (to_date('03.09.1970','DD.MM.YYYY'),5);
insert into sum_example values (to_date('03.09.1970','DD.MM.YYYY'),8);
insert into sum_example values (to_date('07.09.1970','DD.MM.YYYY'),7);
insert into sum_example values (to_date('04.09.1970','DD.MM.YYYY'),8);
insert into sum_example values (to_date('09.09.1970','DD.MM.YYYY'),1);
insert into sum_example values (to_date('29.08.1970','DD.MM.YYYY'),3);
insert into sum_example values (to_date('30.08.1970','DD.MM.YYYY'),7);
insert into sum_example values (to_date('24.08.1970','DD.MM.YYYY'),7);
insert into sum_example values (to_date('07.09.1970','DD.MM.YYYY'),9);
insert into sum_example values (to_date('26.08.1970','DD.MM.YYYY'),2);
insert into sum_example values (to_date('09.09.1970','DD.MM.YYYY'),8);
select dt,
sum(xy) over (partition by trunc(dt) order by dt rows between
unbounded preceding and current row) s,
xy
from sum_example;
drop table sum_example;
_:The the analytical function:
sum(xy) over (partition by trunc(dt) order by dt rows between unbounded preceding and current row)
_:The select statement will return:
|23.08.70| 2| 2|
|24.08.70| 7| 7|
|26.08.70| 3| 3|
|26.08.70| 5| 2|
|26.08.70|11| 6|
|27.08.70| 4| 4|
|27.08.70| 9| 5|
|28.08.70| 4| 4|
|29.08.70| 9| 9|
|29.08.70|12| 3|
|30.08.70| 2| 2|
|30.08.70| 9| 7|
|01.09.70| 3| 3|
|02.09.70| 1| 1|
|03.09.70| 5| 5|
|03.09.70|13| 8|
|04.09.70| 8| 8|
|07.09.70| 1| 1|
|07.09.70| 8| 7|
|07.09.70|17| 9|
|09.09.70| 5| 5|
|09.09.70|14| 9|
|09.09.70|15| 1|
|09.09.70|23| 8|
|12.09.70| 7| 7|
|12.09.70|11| 4|
_:The third column correspondents to xy (the values inserted with the insert
into ... above). The interesting column is the second. For example on the 26th of August in 1970,
the first row for that date
is 3 (equals xy), the second is 5 (equals xy+3) and the third is 11 (equals
xy+3+5).
*A list of analytic functions* we could find in Oracle Express 10:
*:AVG ( expression )
Used to compute an average of an expression within a group and window.
Distinct may be used to find the average of the values in a group after duplicates have been removed.
*:CORR (expression, expression)
Returns the coefficient of correlation of a pair of expressions that return
numbers. It is shorthand for:
*:COVAR_POP(expr1, expr2) /
*:STDDEV_POP(expr1) * STDDEV_POP(expr2)).
Statistically speaking, a correlation is the strength of an association between
variables. An association between variables means that the value of one variable can be predicted, to some extent, by the value of the other. The correlation coefficient gives the strength of the association by returning a number
between -1 (strong inverse correlation) and 1 (strong correlation). A value of
0 would indicate no correlation.
*:COUNT ( <*> )
This will count occurrences within a group. If you specify * or some non-null
constant, count will count all rows. If you specify an expression, count returns the count of non-null evaluations of expression.
You may use the DISTINCT modifier to count occurrences of rows in a group after
duplicates have been removed.
*:COVAR_POP (expression, expression)
This returns the population covariance of a pair of expressions that return numbers.
*:COVAR_SAMP (expression, expression)
This returns the sample covariance of a pair of expressions that return
numbers.
*:CUME_DIST
This computes the relative position of a row in a group. CUME_DIST will always
return a number greater then 0 and less then or equal to 1. This number represents the 'position' of the row in the group of N rows. In a group of three rows, the cumulate distribution values returned would be 1/3, 2/3, and 3/3 for example.
*:DENSE_RANK
This function computes the relative rank of each row returned from a query
with respect to the other rows, based on the values of the expressions in the ORDER BY clause. The data within a group is sorted by the ORDER BY clause and then a numeric ranking is assigned to each row in turn starting with 1 and
continuing on up. The rank is incremented every time the values of the ORDER BY
expressions change.
Rows with equal values receive the same rank (nulls are considered equal in
this comparison). A dense rank returns a ranking number without any gaps. This is in comparison to RANK below.
*:FIRST_VALUE
This simply returns the first value from a group.
*:LAG (expression, , )
LAG gives you access to other rows in a resultset without doing a self-join.
It allows you to treat the cursor as if it were an array in effect. You can reference rows that come before the current row in a given group. This would allow you to select 'the previous rows' from a group along with the current
row. See LEAD for how to get 'the next rows'.
Offset is a positive integer that defaults to 1 (the previous row). Default is
the value to be returned if the index is out of range of the window (for the first row in a group, the default will be returned)
*:LAST_VALUE
This simply returns the last value from a group.
*:LEAD (expression, , )
LEAD is the opposite of LAG. Whereas LAG gives you access to the a row preceding yours in a group - LEAD gives you access to the a row that comes after your row.
Offset is a positive integer that defaults to 1 (the next row). Default is the
value to be returned if the index is out of range of the window (for the last row in a group, the default will be returned).
*:MAX(expression)
Finds the maximum value of expression within a window of a group.
*:MIN(expression)
Finds the minimum value of expression within a window of a group.
*:NTILE (expression)
Divides a group into 'value of expression' buckets.
For example; if expression = 4, then each row in the group would be assigned a
number from 1 to 4 putting it into a percentile. If the group had 20 rows in it, the first 5 would be assigned 1, the next 5 would be assigned 2 and so on. In the event the cardinality of the group is not evenly divisible by the
expression, the rows are distributed such that no percentile has more than 1
row more then any other percentile in that group and the lowest percentiles are the ones that will have 'extra' rows. For example, using expression = 4 again and the number of rows = 21, percentile = 1 will have 6 rows, percentile =
2 will have 5, and so on.
*:PERCENT_RANK
This is similar to the CUME_DIST (cumulative distribution) function. For a
given row in a group, it calculates the rank of that row minus 1, divided by 1 less than the number of rows being evaluated in the group. This function will always return values from 0 to 1 inclusive.
*:RANK
This function computes the relative rank of each row returned from a query
with respect to the other rows, based on the values of the expressions in the ORDER BY clause. The data within a group is sorted by the ORDER BY clause and then a numeric ranking is assigned to each row in turn starting with 1 and
continuing on up. Rows with the same values of the ORDER BY expressions receive
the same rank; however, if two rows do receive the same rank the rank numbers will subsequently 'skip'. If two rows are number 1, there will be no number 2 - rank will assign the value of 3 to the next row in the group.
This is in contrast to DENSE_RANK, which does not skip values.
*:RATIO_TO_REPORT (expression)
This function computes the value of expression / (sum(expression)) over the
group. This gives you the percentage of the total the current row contributes to the sum(expression).
*:REGR_ xxxxxxx (expression, expression)
These linear regression functions fit an ordinary-least-squares regression
line to a pair of expressions. There are different regression functions available for use.
*:ROW_NUMBER
Returns the offset of a row in an ordered group. Can be used to sequentially
number rows, ordered by certain criteria.
*:STDDEV (expression)
Computes the standard deviation of the current row with respect to the group.
*:STDDEV_POP (expression)
This function computes the population standard deviation and returns the
square root of the population variance. Its return value is same as the square root of the VAR_POP function.
*:STDDEV_SAMP (expression)
This function computes the cumulative sample standard deviation and returns
the square root of the sample variance. This function returns the same value as the square root of the VAR_SAMP function would.
*:SUM(expression)
This function computes the cumulative sum of expression in a group.
*:VAR_POP (expression)
This function returns the population variance of a non-null set of numbers
(nulls are ignored). VAR_POP function makes the following calculation for us:
(SUM(expr*expr) - SUM(expr)*SUM(expr) / COUNT(expr)) / COUNT(expr)
*:VAR_SAMP (expression)
This function returns the sample variance of a non-null set of numbers (nulls
in the set are ignored). This function makes the following calculation for us:
(SUM(expr*expr) - SUM(expr)*SUM(expr) / COUNT(expr)) / (COUNT(expr) - 1)
*:VARIANCE (expression)
This function returns the variance of expression.
This list of function is often used to improve performance :
_: SUM, RANK, DENSE_RANK, LAG, LEAD, FIRST_VALUE, LAST_VALUE)