That Guy From Delhi: Basic OLAP Support in PostgreSQL

While reviewing an existing application, I thought it'd be worthwhile to review how good / bad PostgreSQL is in terms of OLAP. This (growing) post is going to be my (un)learning of how ready is PostgreSQL.

Row Numbering

Support: Yes.
Use: Row_Number() function numbers rows generated in a result-set.
Example:

SELECT
row_number() OVER (ORDER BY marks DESC) AS rn,
name
FROM x;
Review: Some databases have different variants that accomplish this (for e.g. Oracle has a pseudo column called ROWNUM), but PostgreSQL fully supports the SQL Compliant syntax.

Rank()

Support: Yes.
Use: Rank() and Dense_Rank() functions number the rank of the compared item.
Example:

SELECT
rank() OVER (ORDER BY marks DESC) AS rn,
dense_rank() OVER (ORDER BY marks DESC) AS drn,
name
FROM x;
Review: Its useful and fully supported.

Window Clause

Support

OVER (PARTITION BY): Yes
OVER (ORDER BY): Yes
OVER (RANGE): Yes

Use: Read more here.
Example:
Review: These are extremely helpful for people serious about data-extraction / reporting and fully supported.

NTile

Support: . Yes
Use: Ntile().
Example:

SELECT
ntile(4) OVER (ORDER BY marks DESC) AS quartile,
ntile(10) OVER (ORDER BY marks DESC) AS decile,
ntile(100) OVER (ORDER BY marks DESC) AS percentile,
name
FROM x;
Review: Versatile and fully supported.

Nested OLAP Aggregations

Support: No

But doable with alternative SQL? : Yes

Is that as Performant? : Mostly No

Description: Allow something like

SELECT
subject,
AVG(SUM(marks) GROUP BY class)
FROM marks
GROUP BY subject;
Alternative: This could be done with Sub-Selects like this:

SELECT
subject,
AVG(sum_marks) AS avg
FROM (
SELECT
subject
class,
SUM(marks) AS sum_marks
FROM marks
GROUP BY subject, class
) mrk
GROUP BY subject;
Review: In the two examples we are trying to calculate the Per-Subject-Average of (Total marks obtained in different classes). Although PostgreSQL doesn't support this form of nested-aggregates, it clearly is a neat form of doing things. The alternative, acceptably looks like a kludge, and it would be a nice to have feature.

GROUPING SETS

Support: Yes (in 9.5)

Alternative: This could be alternatively done with UNION ALL like this:
SELECT SubjectID,
NULL AS StudentID,
AVG(marks)FROM marksGROUP BY SubjectID
UNION ALL

SELECT NULL AS SubjectID,
StudentID,
AVG(marks)FROM marksGROUP BY StudentID;
Review: Popular databases (Oracle / MSSQL) support this well. PostgreSQL has had this has on the ToDo list from at least a decade ! Looking at the alternative, one can see that this is not just lengthy (and repetitive .. thus error-prone), but also non-performant (simply because it requires multiple-runs of the same data-set).
History:

Already in PostgreSQL TODO list
Discussions started (at least) way back in 2003.
Patch:

A 2008 patch that didn't make it.
A 2014 patch was heavily in discussion since and finally just got through to PostgreSQL 9.5.

ROLLUP

Description: An obvious extension to GROUPING BY (explained above), ROLLUP could be explained with a simple example:

GROUP BY ROLLUP (Year, SubjectID, StudentID)

is equivalent to

GROUP BY GROUPING SETS
(Year, SubjectID, StudentID)
(Year, SubjectID)(Year)
();
Support: Yes (in 9.5)
Alternative: This could be alternatively done with CTEs.

WITH x AS (
SELECT Year, SubjectID, StudentID, marks
FROM marks
WHERE passed
AND NOT inactive
)
SELECT *
FROM x

UNION ALL

SELECT
Year, SubjectID, StudentID, AVG(marks)
FROM x
GROUP BY Year, SubjectID, StudentID

UNION ALL

SELECT
Year, SubjectID, NULL AS StudentID, AVG(marks)
FROM x
GROUP BY Year, SubjectID

UNION ALL

SELECT
Year, NULL AS SubjectID, NULL AS StudentID,
AVG(marks)
FROM marks
GROUP BY Year;
Review: ROLLUPs are ideal to generate things like Sub-totals, which at times form key performance factors when generating large Reports. The alternative essentially uses a CTE, which is subsequently used to calculate subtotals and totals. For multiple-reasons, this is sub-optimal and can be sped up, if only for in-built support. Besides, the alternative is lengthy & repetitive (thus error-prone).
History:

Discussions started (at least) way back in 2003.
Patches submitted

The 2010 patch seemingly didn't make it.
The 2014 attempt finally got through.

CUBE

Support: Yes (in 9.5)
Description: Just like ROLLUP (was an extension of GROUPING SETS), CUBEs are an extension of ROLLUP (and thereby GROUPING SETS) and could be explained with the following example:

GROUP BY CUBE (Year, SubjectID, StudentID)
is equivalent to

GROUP BY GROUPING SETS
(Year, SubjectID, StudentID)
(Year, SubjectID)
(Year, StudentID)
(Year)
(SubjectID, StudentID)
(SubjectID)
(StudentID)
();
Review: The alternative (not provided for obvious reasons) is not just lengthy & repetitive (thus error-prone) but primarily not as performant as is otherwise possible.

MERGE INTO / UPSERT

Support: Yes (in 9.5)
Doable with alternative SQL? : Yes (for 9.4 and below)

Is the alternative as Performant?

No: This is because the alternative (URL given below) is a BEGIN/EXCEPTION based solution which is (time-wise) costly and an in-built support would certainly be faster.

Description: For those new to the complexity of MERGE (or UPSERT) please read this first.

TLDR: In the face of Concurrent Use, MERGE is difficult when a trying to balance Performance vs Integrity.

Unlike some other Database engines (that are sometimes okay with trading-off Integrity when it conflicts with Performance), PostgreSQL consistently prioritizes Data Integrity. The 'best' solution seems to have taken longer than expected, but considering that when a complicated open-source development model needs to coherently agree upon core feature additions, it really takes a few falling stars to get this piece-of-code in, with most people in support of it.
Example (SQL to create scenario + below SQL taken from here) MERGE INTO bonuses B

USING (

SELECT employee_id, salary

FROM employee

WHERE dept_no =20) E

ON (B.employee_id = E.employee_id)

WHEN MATCHED THEN

UPDATE SET B.bonus = E.salary * 0.1

WHEN NOT MATCHED THEN

INSERT (B.employee_id, B.bonus)
VALUES (E.employee_id, E.salary * 0.05);
Alternative: The PostgreSQL documentation mentions one recommended way of doing UPSERT / MERGE here. But again, this is non-performant and 9.5 based support for INSERT .. ON CONFLICT (a.k.a. UPSERT).
History:

MySQL / Oracle / MSSQL support this very well.
Long-pending requirement as per Wiki and now finally has made through!

1 comment:

Unknown said...: Some features are not supported, for example LEAD, LAG and NTH Value can't ignore/respect NULL, so it will be interesting if exists workaround for it.
Also, in windows partition, right now there is no support for ROWS BETWEEN and CURRENT ROW (but there is patch for PostgreSQL 11).; 6:12 pm, November 28, 2017

That Guy From Delhi

17 May 2015

Basic OLAP Support in PostgreSQL

1 comment:

Installing pg_tle on Ubuntu: A Quick Guide

Search This Blog