Showing posts with label olap. Show all posts
Showing posts with label olap. Show all posts

28 Apr 2024

Boost Database Security: Restrict Users to Read Replicas

Only Allow Login to Read-Replicas and Standbys

When you're working with large databases in production, it is incredibly common to use read-replicas to improve performance. These read-replicas are a copy of your primary (main) database and let your applications offload read-heavy queries, which in-turn reduces strain on your primary database, effectively making the application faster and snappier.

Sometimes, you may want to restrict specific database users so they can connect ONLY to these read-replicas, and not to the primary database server. This can be tricky to implement, since any permissions configured for this use-case, whether on the user-level, the database level, the schema-level or even the table level would be quickly replicated to the read-replicas and thus would not work as expected.

This guide will show how to configure a database user to only login successfully on a read-replica. The only requirement is to enable the pg_tle extension [3] on your PostgreSQL database. This is simple to do on your Ubuntu based Laptop (see how to do that here [2]) or virtual-machines offered by your favourite cloud-provider. Furthermore, you could apply your login rules using Pl/PgSQL, PL/v8 or even PL/Rust - See here[1].

Why Restrict Access?

There are many good reasons for restricting users to read-replicas:

  • Performance: You can dedicate your primary database server to handling write operations (like updating data), ensuring those operations happen as fast as possible.

  • Reporting / Analytics: Production environments often have dedicated users for ancillary tasks, such as monitoring, reporting dashboards, read-only tenants etc. Restricting these database users to read-replica helpsreducing extra load on the primary database.

  • Security: In some cases, granting direct access to the primary database might be considered a security risk. Further, you may not be able to force login hygeine for all your database users, and then having a lockdown system to reject those database users to login to primary is crucial for application rollout.

Prerequisites

  • An existing PostgreSQL database instance with at least one read-replica.
    • You could also try this on your own Postgres database with pg_tle extension. Read here [2] for more on how to install pg_tle on your Ubuntu system.
  • Basic understanding of users and permissions within a database.

Steps

  1. Identify Target Database and Users: First we need to define how to implement the restriction. i.e. Which users (and database) are to be restricted to login only to read-replica. In the example below, we would restrict the user standby_only_user to only be able to login to Standbys / Read-Replicas on database prod_db.
psql <<SQL
  \c prod_db
  CREATE EXTENSION pg_tle;
SQL 
  1. Ensure that shared_preload_libraries is properly set to allow pg_tle. Also make sure that the pgtle.clientauth_db_name is appropriately set to the desired database (here prod_db):
cat <<EOL >> data/postgresql.conf
  shared_preload_libraries='pg_tle'
  pgtle.enable_clientauth=require
  pgtle.clientauth_db_name=prod_db
  pgtle.clientauth_users_to_skip=robins
  pgtle.clientauth_databases_to_skip=''
EOL
  1. Secret Sauce:

Next we create the key pg_tle function that restricts the user standby_only_user to login successfully only if this is a standby / read-replica:

SELECT pgtle.install_extension (
  'standbyusercheck',
  '1.0',
  'Allow some users to login only to standby / read-replicas',
$_pgtle_$
  CREATE SCHEMA standbycheck_schema;

  REVOKE ALL ON SCHEMA standbycheck_schema FROM PUBLIC;
  GRANT USAGE ON SCHEMA standbycheck_schema TO PUBLIC;

  CREATE OR REPLACE FUNCTION standbycheck_schema.standbycheck_hook(port pgtle.clientauth_port_subset, status integer)
  RETURNS void AS $$
    DECLARE
      is_standby bool := TRUE;
    BEGIN
      IF port.user_name = 'standby_only_user' THEN
        SELECT pg_is_in_recovery()
          INTO is_standby;
        IF is_standby THEN
          RAISE NOTICE 'User allowed to login';
        ELSE
          RAISE EXCEPTION 'User can only login to Standby / Read-Replicas';
        END IF;
      END IF;
    END
  $$ LANGUAGE plpgsql SECURITY DEFINER;

  GRANT EXECUTE ON FUNCTION standbycheck_schema.standbycheck_hook TO PUBLIC;
  SELECT pgtle.register_feature('standbycheck_schema.standbycheck_hook', 'clientauth');
  REVOKE ALL ON SCHEMA standbycheck_schema FROM PUBLIC;
$_pgtle_$
);

And now that the function is defined,CREATE EXTENSION would install the function and bind it to future login attempts.

CREATE EXTENSION standbyusercheck;
SHOW pgtle.clientauth_db_name;
  1. Test Connection:
  • Attempting to connect as a privileged user (here robins) to either of primary or read-replica should succeed.
Logging into Replica as robins
 login  | current_database | pg_is_in_recovery
--------+------------------+-------------------
 robins | prod_db          | t
(1 row)

Logging into Primary as robins
 login  | current_database | pg_is_in_recovery
--------+------------------+-------------------
 robins | prod_db          | f
(1 row)
  • However, the user standby_only_user should NOT be able to login to the primary.
Logging into Primary as standby_only_user
psql: error: connection to server at "localhost" (127.0.0.1), port 6432 failed: FATAL:  User can only login to Standby / Read-Replicas
  • While the user (standby_only_user) should only be able to login to any read-replica.
Logging into Replica as standby_only_user
       login       | current_database | pg_is_in_recovery
-------------------+------------------+-------------------
 standby_only_user | prod_db          | t
(1 row)

Other important aspects of this feature

  • You could force clientauth for all logins by setting the parameter pgtle.enable_clientauth = require

  • You could configure some users to always be allowed to login to either of Primary / Read-replica in cases of emergency, by adding that user to the pgtle.clientauth_users_to_skip. Ideally you would want your admin database roles to this list.

  • Orthogonally, you could configure some databases to always allow users to skip clientauth by setting the pgtle.clientauth_databases_to_skip feature.

  • Note, that both clientauth_databases_to_skip and clientauth_databases_to_skip can be utilised together. This is a good way to ensure that some set of database users (and some databases) are exempt from such a login restriction.

  • If pgtle.enable_clientauth is set to on or require and if the database mentioned in pgtle.clientauth_db_name is not configured correctly, postgres would complain with the messsage FATAL: pgtle.enable_clientauth is set to require, but pg_tle is not installed or there are no functions registered with the clientauth feature. This is a good engine check, helping us avoid basic misconfigurations.

  • If you're anticipating connection storms, you can also increase the workers (that would help enforce the login restriction) by setting the pgtle.clientauth_num_parallel_workers parameter to greater than 1.

Conclusion

By following the above steps, you've now successfully configured your PostgreSQL environment to restrict certain users to only login to the read-replicas. This helps not just optimize your database performance, but also bolster security.

Let me know if you'd like to explore more advanced scenarios or discuss IAM integration for fine-grained access control!

Reference

  1. Clientauth Hook Documentation - https://github.com/aws/pg_tle/blob/main/docs/04_hooks.md'
  2. Install pg_tle On Ubuntu - https://www.thatguyfromdelhi.com/2024/04/installing-pgtle-on-ubuntu-quick-guide.html
  3. Unlock PostgreSQL Super Powers with pg_tle - https://www.thatguyfromdelhi.com/2024/04/unlock-postgresql-superpowers-with-pgtle.html

2 Sept 2017

psql \d now supports Interleaved / Compound SORTKEYs (in Redshift)

In continuation of support for Redshift series, now Describe Table (for e.g. \d tbl) shows SORTKEY details. This resolves Issue #6 and shows both COMPOUND / INTERLEAVED variations along with all the column names.

This change was complicated because Redshift doesn't natively support LISTAGG() function on System / Catalog tables, which meant that I had to resort to a pretty verbose workaround. This in-turn meant that this patch shows only the first ten COMPOUND SORTKEYs of a table. Seriously speaking, it would really take an extreme corner-case, for someone to genuinely require a SORTKEY with 10+ columns.

This is not a limitation for INTERLEAVED SORTKEY since it only supports a maximum of 8 Columns.


db=# CREATE TABLE tbl_pk(custkey SMALLINT PRIMARY KEY);
CREATE TABLE
db=# \d tbl_pk
                                           Table "public.tbl_pk"
 Column  |   Type   | Encoding | DistKey | SortKey | Preload | Encryption | Collation | Nullable | Default
---------+----------+----------+---------+---------+---------+------------+-----------+----------+---------
 custkey | smallint | lzo      | f       | 0       | f       | none       |           | not null |
Indexes:
 PRIMARY KEY, btree (custkey)

db=# CREATE TABLE tbl_compound(
db(#   custkey   SMALLINT                ENCODE delta NOT NULL,
db(#   custname  INTEGER DEFAULT 10      ENCODE raw NULL,
db(#   gender    BOOLEAN                 ENCODE RAW,
db(#   address   CHAR(5)                 ENCODE LZO,
db(#   city      BIGINT identity(0, 1)   ENCODE DELTA,
db(#   state     DOUBLE PRECISION        ENCODE Runlength,
db(#   zipcode   REAL,
db(#   tempdel1  DECIMAL                 ENCODE Mostly16,
db(#   tempdel2  BIGINT                  ENCODE Mostly32,
db(#   tempdel3  DATE                    ENCODE DELTA32k,
db(#   tempdel4  TIMESTAMP               ENCODE Runlength,
db(#   tempdel5  TIMESTAMPTZ             ENCODE DELTA,
db(#   tempdel6  VARCHAR(MAX)            ENCODE text32k,
db(#   start_date VARCHAR(10)            ENCODE TEXT255
db(# )
db-# DISTSTYLE KEY
db-# DISTKEY (custname)
db-# COMPOUND SORTKEY (custkey, custname, gender, address, city, state, zipcode, tempdel1, tempdel2, tempdel3, tempdel4, tempdel5, start_date);
CREATE TABLE
db=#
db=# \d tbl_compound
                                                                 Table "public.tbl_compound"
   Column   |            Type             | Encoding  | DistKey | SortKey | Preload | Encryption | Collation | Nullable |              Default
------------+-----------------------------+-----------+---------+---------+---------+------------+-----------+----------+------------------------------------
 custkey    | smallint                    | delta     | f       | 1       | f       | none       |           | not null |
 custname   | integer                     | none      | t       | 2       | f       | none       |           |          | 10
 gender     | boolean                     | none      | f       | 3       | f       | none       |           |          |
 address    | character(5)                | lzo       | f       | 4       | f       | none       |           |          |
 city       | bigint                      | delta     | f       | 5       | f       | none       |           |          | "identity"(494055, 4, '0,1'::text)
 state      | double precision            | runlength | f       | 6       | f       | none       |           |          |
 zipcode    | real                        | none      | f       | 7       | f       | none       |           |          |
 tempdel1   | numeric(18,0)               | mostly16  | f       | 8       | f       | none       |           |          |
 tempdel2   | bigint                      | mostly32  | f       | 9       | f       | none       |           |          |
 tempdel3   | date                        | delta32k  | f       | 10      | f       | none       |           |          |
 tempdel4   | timestamp without time zone | runlength | f       | 11      | f       | none       |           |          |
 tempdel5   | timestamp with time zone    | delta     | f       | 12      | f       | none       |           |          |
 tempdel6   | character varying(65535)    | text32k   | f       | 0       | f       | none       |           |          |
 start_date | character varying(10)       | text255   | f       | 13      | f       | none       |           |          |
Indexes:
 COMPOUND SORTKEY (address,tempdel2,start_date,custkey,zipcode,tempdel4,city,state,tempdel3,custname)

db=# CREATE TABLE tbl_interleaved(custkey SMALLINT) INTERLEAVED SORTKEY (custkey);
CREATE TABLE
db=# \d tbl_interleaved
                                      Table "public.tbl_interleaved"
 Column  |   Type   | Encoding | DistKey | SortKey | Preload | Encryption | Collation | Nullable | Default
---------+----------+----------+---------+---------+---------+------------+-----------+----------+---------
 custkey | smallint | none     | f       | 1       | f       | none       |           |          |
Indexes:
 INTERLEAVED SORTKEY (custkey)

As a side-note, there is a consideration as to whether this should be on a separate section of its own (and not under Indexes, which it clearly isn't). May be another day. Happy Redshifting :) !

Update (15th Sep 2017):
This project has now been named PsqlForks!

17 May 2015

Basic OLAP Support in PostgreSQL

While reviewing an existing application, I thought it'd be worthwhile to review how good / bad PostgreSQL is in terms of OLAP. This (growing) post is going to be my (un)learning of how ready is PostgreSQL.

  1. Row Numbering
    1. Support: Yes. 
    2. Use: Row_Number() function numbers rows generated in a result-set.
    3. Example:

      SELECT
        row_number() OVER (ORDER BY marks DESC) AS rn,
        name
      FROM x;
    4. Review: Some databases have different variants that accomplish this (for e.g. Oracle has a pseudo column called ROWNUM), but PostgreSQL fully supports the SQL Compliant syntax.
  2. Rank()
    1. Support: Yes. 
    2. Use: Rank() and Dense_Rank() functions number the rank of the compared item. 
    3. Example:

      SELECT 
        rank() OVER (ORDER BY marks DESC) AS rn,
        dense_
      rank() OVER (ORDER BY marks DESC) AS drn,
        
      name

      FROM x;
       
    4. Review: Its useful and fully supported.
  3. Window Clause
    1. Support 
      1. OVER (PARTITION BY): Yes
      2. OVER (ORDER BY): Yes
      3. OVER (RANGE): Yes
    2. Use:  Read more here.
    3. Example:  
    4. Review: These are extremely helpful for people serious about data-extraction / reporting and fully supported.
  4. NTile
    1. Support: . Yes
    2. UseNtile().
    3. Example:

      SELECT 
        ntile(4) OVER (ORDER BY marks DESC) AS quartile,

        ntile(10) OVER (ORDER BY marks DESC) AS decile,

        ntile(100) OVER (ORDER BY marks DESC) AS percentile,

        
      name

      FROM x;
       
    4. Review: Versatile and fully supported.
  5. Nested OLAP Aggregations
    1. Support: No
      1. But doable with alternative SQL? : Yes
        1. Is that as Performant? : Mostly No
    2. Description: Allow something like

      SELECT
        subject,
        AVG(SUM(marks) GROUP BY class)
      FROM marks
      GROUP BY subject;
    3. Alternative:  This could be done with Sub-Selects like this:

      SELECT
        subject,
        AVG(sum_marks) AS avg
      FROM (
        SELECT
         subject
         class,
         SUM(marks) AS sum_marks   
        FROM marks   
        GROUP BY subject, class
        ) mrk
      GROUP BY subject;
    4. Review: In the two examples we are trying to calculate the Per-Subject-Average of (Total marks obtained in different classes). Although PostgreSQL doesn't support this form of nested-aggregates, it clearly is a neat form of doing things. The alternative, acceptably looks like a kludge, and it would be a nice to have feature. 
  6. GROUPING SETS
    1. SupportYes (in 9.5)
    2. Alternative:  This could be alternatively done with UNION ALL like this:
      SELECT
       SubjectID,
       NULL AS StudentID,
       AVG(marks)
      FROM marksGROUP BY SubjectID
      UNION ALL

      SELECT
       NULL AS SubjectID,
       StudentID,
       AVG(marks)
      FROM marksGROUP BY StudentID;
    3. Review: Popular databases (Oracle / MSSQL) support this well. PostgreSQL has had this has on the ToDo list from at least a decade ! Looking at the alternative, one can see that this is not just lengthy (and repetitive .. thus error-prone), but also non-performant (simply because it requires multiple-runs of the same data-set).
    4. History:
      1. Already in PostgreSQL TODO list
      2. Discussions started (at least) way back in 2003.
      3. Patch:
        1. 2008 patch that didn't make it.
        2. 2014 patch was heavily in discussion since and finally just got through to PostgreSQL 9.5.
  7. ROLLUP
    1. Description: An obvious extension to GROUPING BY (explained above), ROLLUP could be explained with a simple example:

       GROUP BY ROLLUP (Year, SubjectID, StudentID)

      is equivalent to

       GROUP BY GROUPING SETS
       (Year, SubjectID, StudentID)
       (Year, SubjectID)(Year)
       ();
    2. Support: Yes (in 9.5)
    3. Alternative:  This could be alternatively done with CTEs.

      WITH x AS (
        SELECT Year, SubjectID, StudentID, marks
        FROM marks
        WHERE passed
          AND NOT inactive
      )
      SELECT *
        FROM x

      UNION ALL

      SELECT
          Year, SubjectID, StudentID, AVG(marks)
        FROM x
        GROUP BY Year, SubjectID, StudentID

      UNION ALL

      SELECT
          Year, SubjectID, NULL AS StudentID, AVG(marks)
        FROM x
        GROUP BY Year, SubjectID

      UNION ALL

      SELECT
          Year, NULL AS SubjectID, NULL AS StudentID,
      AVG(marks)
        FROM marks
        GROUP BY Year;
    4. Review: ROLLUPs are ideal to generate things like Sub-totals, which at times form key performance factors when generating large Reports. The alternative essentially uses a CTE, which is subsequently used to calculate subtotals and totals. For multiple-reasons, this is sub-optimal and can be sped up, if only for in-built support. Besides, the alternative is lengthy & repetitive (thus error-prone).
    5. History:
      1. Discussions started (at least) way back in 2003.
      2. Patches submitted
        1. The 2010 patch seemingly didn't make it.
        2. The 2014 attempt finally got through.
  8. CUBE
    1. SupportYes (in 9.5)
    2. Description: Just like ROLLUP (was an extension of GROUPING SETS), CUBEs are an extension of ROLLUP (and thereby GROUPING SETS) and could be explained with the following example:

       GROUP BY CUBE (Year, SubjectID, StudentID)
      is equivalent to

       GROUP BY GROUPING SETS
        (Year, SubjectID, StudentID)
        (Year, SubjectID)
        (Year, StudentID)
        (Year)
        (SubjectID, StudentID)
        (SubjectID)
        (StudentID)
        ();
    3. Review: The alternative (not provided for obvious reasons) is not just lengthy & repetitive (thus error-prone) but primarily not as performant as is otherwise possible.
  9. MERGE INTO / UPSERT
    1. SupportYes (in 9.5)
    2. Doable with alternative SQL? : Yes (for 9.4 and below)
      1. Is the alternative as Performant?
        1. No: This is because the alternative (URL given below) is a BEGIN/EXCEPTION based solution which is (time-wise) costly and an in-built support would certainly be faster.
    3. Description: For those new to the complexity of MERGE (or UPSERT) please read this first.

      TLDR: In the face of
      Concurrent Use, MERGE is difficult when a trying to balance Performance vs Integrity.

      Unlike some other Database engines (that are sometimes okay with trading-off Integrity when it conflicts with Performance), PostgreSQL consistently prioritizes Data Integrity. The 'best' solution seems to have taken longer than expected, but considering that when a complicated open-source development model needs to coherently agree upon core feature additions, it really takes a few falling stars to get this piece-of-code in, with most people in support of it.
    4. Example (SQL to create scenario + below SQL taken from here)
      MERGE INTO bonuses B
      USING (
       SELECT employee_id, salary
       FROM employee
       WHERE dept_no =20) E
      ON (B.employee_id = E.employee_id)
      WHEN MATCHED THEN
       UPDATE SET B.bonus = E.salary * 0.1
      WHEN NOT MATCHED THEN
       INSERT (B.employee_id, B.bonus)
       VALUES (E.employee_id, E.salary * 0.05);
    5. Alternative: The PostgreSQL documentation mentions one recommended way of doing UPSERT / MERGE here. But again, this is non-performant and 9.5 based support for INSERT .. ON CONFLICT (a.k.a. UPSERT).
    6. History
      1. MySQL / Oracle / MSSQL support this very well.
      2. Long-pending requirement as per Wiki and now finally has made through!

16 May 2015

Postgres finally has CUBE / ROLLUP / GROUPING SETS !

Finally !

A *much* awaited feature, this attempt at adding the GROUPING SETS / ROLLUP / CUBE feature to PostgreSQL has been in the works for about a year (besides the so many in the past decade and a half that didn't get through), and thankfully this has finally got the approval of the powers that be, so the upcoming Postgres 9.5 would finally have this long pending SQL feature.

MSSQL and Oracle have had this for a while and then its time that PostgreSQL sport this as well. A big boon for Report generating SQLs this feature basically makes (what was earlier possible with lots of unmanageable hack of SQL), now possible with much cleaner code, and with much better (at times single pass) performance.

Read here to know more about OLAP support in PostgreSQL.

Thanks a ton Andrew Gierth and Atri Sharma and so many others who directly or indirectly assisted in getting this patch out of the door!

Andrew / Atri... take that long pending break... one look at that mail thread and it seems you deserve it :D !

What's in an empty table?

How much storage does an empty table in Postgres take? This is a post about Postgres tables that store ... well basically ...  Nothing . The...