17 May 2015

Basic OLAP Support in PostgreSQL

While reviewing an existing application, I thought it'd be worthwhile to review how good / bad PostgreSQL is in terms of OLAP. This (growing) post is going to be my (un)learning of how ready is PostgreSQL.

  1. Row Numbering
    1. Support: Yes. 
    2. Use: Row_Number() function numbers rows generated in a result-set.
    3. Example:

      SELECT
        row_number() OVER (ORDER BY marks DESC) AS rn,
        name
      FROM x;
    4. Review: Some databases have different variants that accomplish this (for e.g. Oracle has a pseudo column called ROWNUM), but PostgreSQL fully supports the SQL Compliant syntax.
  2. Rank()
    1. Support: Yes. 
    2. Use: Rank() and Dense_Rank() functions number the rank of the compared item. 
    3. Example:

      SELECT 
        rank() OVER (ORDER BY marks DESC) AS rn,
        dense_
      rank() OVER (ORDER BY marks DESC) AS drn,
        
      name

      FROM x;
       
    4. Review: Its useful and fully supported.
  3. Window Clause
    1. Support 
      1. OVER (PARTITION BY): Yes
      2. OVER (ORDER BY): Yes
      3. OVER (RANGE): Yes
    2. Use:  Read more here.
    3. Example:  
    4. Review: These are extremely helpful for people serious about data-extraction / reporting and fully supported.
  4. NTile
    1. Support: . Yes
    2. UseNtile().
    3. Example:

      SELECT 
        ntile(4) OVER (ORDER BY marks DESC) AS quartile,

        ntile(10) OVER (ORDER BY marks DESC) AS decile,

        ntile(100) OVER (ORDER BY marks DESC) AS percentile,

        
      name

      FROM x;
       
    4. Review: Versatile and fully supported.
  5. Nested OLAP Aggregations
    1. Support: No
      1. But doable with alternative SQL? : Yes
        1. Is that as Performant? : Mostly No
    2. Description: Allow something like

      SELECT
        subject,
        AVG(SUM(marks) GROUP BY class)
      FROM marks
      GROUP BY subject;
    3. Alternative:  This could be done with Sub-Selects like this:

      SELECT
        subject,
        AVG(sum_marks) AS avg
      FROM (
        SELECT
         subject
         class,
         SUM(marks) AS sum_marks   
        FROM marks   
        GROUP BY subject, class
        ) mrk
      GROUP BY subject;
    4. Review: In the two examples we are trying to calculate the Per-Subject-Average of (Total marks obtained in different classes). Although PostgreSQL doesn't support this form of nested-aggregates, it clearly is a neat form of doing things. The alternative, acceptably looks like a kludge, and it would be a nice to have feature. 
  6. GROUPING SETS
    1. SupportYes (in 9.5)
    2. Alternative:  This could be alternatively done with UNION ALL like this:
      SELECT
       SubjectID,
       NULL AS StudentID,
       AVG(marks)
      FROM marksGROUP BY SubjectID
      UNION ALL

      SELECT
       NULL AS SubjectID,
       StudentID,
       AVG(marks)
      FROM marksGROUP BY StudentID;
    3. Review: Popular databases (Oracle / MSSQL) support this well. PostgreSQL has had this has on the ToDo list from at least a decade ! Looking at the alternative, one can see that this is not just lengthy (and repetitive .. thus error-prone), but also non-performant (simply because it requires multiple-runs of the same data-set).
    4. History:
      1. Already in PostgreSQL TODO list
      2. Discussions started (at least) way back in 2003.
      3. Patch:
        1. 2008 patch that didn't make it.
        2. 2014 patch was heavily in discussion since and finally just got through to PostgreSQL 9.5.
  7. ROLLUP
    1. Description: An obvious extension to GROUPING BY (explained above), ROLLUP could be explained with a simple example:

       GROUP BY ROLLUP (Year, SubjectID, StudentID)

      is equivalent to

       GROUP BY GROUPING SETS
       (Year, SubjectID, StudentID)
       (Year, SubjectID)(Year)
       ();
    2. Support: Yes (in 9.5)
    3. Alternative:  This could be alternatively done with CTEs.

      WITH x AS (
        SELECT Year, SubjectID, StudentID, marks
        FROM marks
        WHERE passed
          AND NOT inactive
      )
      SELECT *
        FROM x

      UNION ALL

      SELECT
          Year, SubjectID, StudentID, AVG(marks)
        FROM x
        GROUP BY Year, SubjectID, StudentID

      UNION ALL

      SELECT
          Year, SubjectID, NULL AS StudentID, AVG(marks)
        FROM x
        GROUP BY Year, SubjectID

      UNION ALL

      SELECT
          Year, NULL AS SubjectID, NULL AS StudentID,
      AVG(marks)
        FROM marks
        GROUP BY Year;
    4. Review: ROLLUPs are ideal to generate things like Sub-totals, which at times form key performance factors when generating large Reports. The alternative essentially uses a CTE, which is subsequently used to calculate subtotals and totals. For multiple-reasons, this is sub-optimal and can be sped up, if only for in-built support. Besides, the alternative is lengthy & repetitive (thus error-prone).
    5. History:
      1. Discussions started (at least) way back in 2003.
      2. Patches submitted
        1. The 2010 patch seemingly didn't make it.
        2. The 2014 attempt finally got through.
  8. CUBE
    1. SupportYes (in 9.5)
    2. Description: Just like ROLLUP (was an extension of GROUPING SETS), CUBEs are an extension of ROLLUP (and thereby GROUPING SETS) and could be explained with the following example:

       GROUP BY CUBE (Year, SubjectID, StudentID)
      is equivalent to

       GROUP BY GROUPING SETS
        (Year, SubjectID, StudentID)
        (Year, SubjectID)
        (Year, StudentID)
        (Year)
        (SubjectID, StudentID)
        (SubjectID)
        (StudentID)
        ();
    3. Review: The alternative (not provided for obvious reasons) is not just lengthy & repetitive (thus error-prone) but primarily not as performant as is otherwise possible.
  9. MERGE INTO / UPSERT
    1. SupportYes (in 9.5)
    2. Doable with alternative SQL? : Yes (for 9.4 and below)
      1. Is the alternative as Performant?
        1. No: This is because the alternative (URL given below) is a BEGIN/EXCEPTION based solution which is (time-wise) costly and an in-built support would certainly be faster.
    3. Description: For those new to the complexity of MERGE (or UPSERT) please read this first.

      TLDR: In the face of
      Concurrent Use, MERGE is difficult when a trying to balance Performance vs Integrity.

      Unlike some other Database engines (that are sometimes okay with trading-off Integrity when it conflicts with Performance), PostgreSQL consistently prioritizes Data Integrity. The 'best' solution seems to have taken longer than expected, but considering that when a complicated open-source development model needs to coherently agree upon core feature additions, it really takes a few falling stars to get this piece-of-code in, with most people in support of it.
    4. Example (SQL to create scenario + below SQL taken from here)
      MERGE INTO bonuses B
      USING (
       SELECT employee_id, salary
       FROM employee
       WHERE dept_no =20) E
      ON (B.employee_id = E.employee_id)
      WHEN MATCHED THEN
       UPDATE SET B.bonus = E.salary * 0.1
      WHEN NOT MATCHED THEN
       INSERT (B.employee_id, B.bonus)
       VALUES (E.employee_id, E.salary * 0.05);
    5. Alternative: The PostgreSQL documentation mentions one recommended way of doing UPSERT / MERGE here. But again, this is non-performant and 9.5 based support for INSERT .. ON CONFLICT (a.k.a. UPSERT).
    6. History
      1. MySQL / Oracle / MSSQL support this very well.
      2. Long-pending requirement as per Wiki and now finally has made through!

1 comment:

Unknown said...

Some features are not supported, for example LEAD, LAG and NTH Value can't ignore/respect NULL, so it will be interesting if exists workaround for it.
Also, in windows partition, right now there is no support for ROWS BETWEEN and CURRENT ROW (but there is patch for PostgreSQL 11).

Find Database DNS / Endpoint via SQL

How to get Database identifier using SQL Often there is a need for client programs to find "where am I logged into?". This blog po...