That Guy From Delhi: olap

28 Apr 2024

Boost Database Security: Restrict Users to Read Replicas

When you're working with large databases in production, it is very common to use read-replicas to improve performance. These read-replicas are copies of your primary (main) database and let your applications offload read-heavy queries, which in turn reduces strain on the primary and makes the application faster and snappier.

Sometimes, you may want to restrict specific database users so they can connect ONLY to these read-replicas, and not to the primary database server. This can be tricky to implement, since any permissions configured for this use-case, whether at the user level, the database level, the schema level or even the table level, are quickly replicated to the read-replicas. They then apply there too, which defeats the purpose.

This guide shows how to configure a database user so that it can log in successfully only on a read-replica. The only requirement is to enable the pg_tle extension [3] on your PostgreSQL database. This is simple to do on your Ubuntu-based laptop (see how to do that here [2]) or on virtual machines offered by your favourite cloud provider. Furthermore, you could write your login rules in PL/pgSQL, PL/v8 or even PL/Rust (see here [1]).

Why Restrict Access?

There are many good reasons for restricting users to read-replicas:

Performance: You can dedicate your primary database server to handling write operations (like updating data), ensuring those operations happen as fast as possible.
Reporting / Analytics: Production environments often have dedicated users for ancillary tasks, such as monitoring, reporting dashboards, read-only tenants etc. Restricting these database users to read-replicas helps reduce extra load on the primary database.
Security: In some cases, granting direct access to the primary database might be considered a security risk. Further, you may not be able to enforce login hygiene for all your database users, in which case a lockdown mechanism that rejects their logins on the primary is crucial for application rollout.

Prerequisites

An existing PostgreSQL database instance with at least one read-replica.
- You could also try this on your own Postgres database with pg_tle extension. Read here [2] for more on how to install pg_tle on your Ubuntu system.
Basic understanding of users and permissions within a database.

Steps

Identify Target Database and Users: First we need to define the scope of the restriction, i.e. which users (and which database) are to be restricted to logging in only on read-replicas. In the example below, we restrict the user standby_only_user so that it can only log in to Standbys / Read-Replicas on database prod_db. We start by creating the pg_tle extension in that database:

psql <<SQL
  \c prod_db
  CREATE EXTENSION pg_tle;
SQL
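
The steps here assume the role standby_only_user already exists. If it doesn't, something along these lines could create it. This is only a sketch: the password is a placeholder, and the GRANT is an assumption about how access to prod_db is managed in your setup.

```sql
-- Run on the primary: role definitions are replicated to every read-replica
CREATE ROLE standby_only_user LOGIN PASSWORD 'change-me';  -- placeholder password

-- Only needed if CONNECT on prod_db has been revoked from PUBLIC
GRANT CONNECT ON DATABASE prod_db TO standby_only_user;
```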

Ensure that shared_preload_libraries is properly set to allow pg_tle. Also make sure that the pgtle.clientauth_db_name is appropriately set to the desired database (here prod_db):

cat <<EOL >> data/postgresql.conf
  shared_preload_libraries='pg_tle'
  pgtle.enable_clientauth=require
  pgtle.clientauth_db_name=prod_db
  pgtle.clientauth_users_to_skip=robins
  pgtle.clientauth_databases_to_skip=''
EOL
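
Once PostgreSQL has been restarted (shared_preload_libraries is only read at server start), it is worth confirming that the settings were picked up. A minimal check, assuming the parameter names above and using the standard pg_settings view:

```sql
-- Should include pg_tle after the restart
SHOW shared_preload_libraries;

-- Every pg_tle parameter, its current value, where it was set,
-- and (via context) whether changing it needs a reload or a restart
SELECT name, setting, source, context
FROM pg_settings
WHERE name LIKE 'pgtle.%'
ORDER BY name;
```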

Secret Sauce:

Next we create the key pg_tle function, which lets the user standby_only_user log in only if the server is a standby / read-replica:

SELECT pgtle.install_extension (
  'standbyusercheck',
  '1.0',
  'Allow some users to login only to standby / read-replicas',
$_pgtle_$
  CREATE SCHEMA standbycheck_schema;

  REVOKE ALL ON SCHEMA standbycheck_schema FROM PUBLIC;
  GRANT USAGE ON SCHEMA standbycheck_schema TO PUBLIC;

  CREATE OR REPLACE FUNCTION standbycheck_schema.standbycheck_hook(port pgtle.clientauth_port_subset, status integer)
  RETURNS void AS $$
    DECLARE
      is_standby bool := TRUE;
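    BEGIN
      -- Only the restricted user is checked; every other user passes through
      IF port.user_name = 'standby_only_user' THEN
        -- pg_is_in_recovery() returns true on a standby / read-replica
        SELECT pg_is_in_recovery()
          INTO is_standby;
        IF is_standby THEN
          RAISE NOTICE 'User allowed to login';
        ELSE
          -- Raising an exception rejects the login with this message
          RAISE EXCEPTION 'User can only login to Standby / Read-Replicas';
        END IF;
      END IF;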
    END
  $$ LANGUAGE plpgsql SECURITY DEFINER;

  GRANT EXECUTE ON FUNCTION standbycheck_schema.standbycheck_hook TO PUBLIC;
  SELECT pgtle.register_feature('standbycheck_schema.standbycheck_hook', 'clientauth');
  REVOKE ALL ON SCHEMA standbycheck_schema FROM PUBLIC;
$_pgtle_$
);

And now that the function is defined, CREATE EXTENSION installs it and binds it to future login attempts.

CREATE EXTENSION standbyusercheck;
SHOW pgtle.clientauth_db_name;
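
To double-check that the hook really got registered, you can look at pg_tle's own bookkeeping. This is a sketch based on my reading of the pg_tle documentation: pgtle.available_extensions() and the pgtle.feature_info table are taken from there, so verify them against your installed version.

```sql
-- Extensions installed through pg_tle; standbyusercheck should be listed
SELECT * FROM pgtle.available_extensions();

-- Functions registered against pg_tle hooks; expect a clientauth row
-- pointing at standbycheck_schema.standbycheck_hook
SELECT * FROM pgtle.feature_info;
```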

Test Connection:

Attempting to connect as a privileged user (here robins) to either the primary or a read-replica should succeed.

Logging into Replica as robins
 login  | current_database | pg_is_in_recovery
--------+------------------+-------------------
 robins | prod_db          | t
(1 row)

Logging into Primary as robins
 login  | current_database | pg_is_in_recovery
--------+------------------+-------------------
 robins | prod_db          | f
(1 row)

However, the user standby_only_user should NOT be able to login to the primary.

Logging into Primary as standby_only_user
psql: error: connection to server at "localhost" (127.0.0.1), port 6432 failed: FATAL:  User can only login to Standby / Read-Replicas

The same user (standby_only_user) should, however, be able to log in to any read-replica.

Logging into Replica as standby_only_user
       login       | current_database | pg_is_in_recovery
-------------------+------------------+-------------------
 standby_only_user | prod_db          | t
(1 row)
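
The outputs above don't show the query that was run. One that produces the same three columns, offered as a guess at what was used rather than the exact original, is:

```sql
-- Run after connecting as the user under test, e.g.
--   psql -h localhost -p 6432 -U standby_only_user -d prod_db
SELECT current_user AS login,
       current_database(),
       pg_is_in_recovery();
```

pg_is_in_recovery() returns t on a standby / read-replica and f on the primary, which is exactly what the hook function keys on.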

Other important aspects of this feature

You can force clientauth for all logins by setting the parameter pgtle.enable_clientauth = require.
You can configure some users to always be allowed to log in to either the Primary or a Read-replica in cases of emergency, by adding them to pgtle.clientauth_users_to_skip. Ideally you would add your admin database roles to this list.
Orthogonally, you can exempt some databases from clientauth by listing them in the pgtle.clientauth_databases_to_skip parameter.
Note that both clientauth_users_to_skip and clientauth_databases_to_skip can be used together. This is a good way to ensure that some set of database users (and some databases) are exempt from such a login restriction.
If pgtle.enable_clientauth is set to on or require and the database named in pgtle.clientauth_db_name is not configured correctly, postgres complains with the message FATAL: pgtle.enable_clientauth is set to require, but pg_tle is not installed or there are no functions registered with the clientauth feature. This is a good engine check, helping us avoid basic misconfigurations.
If you're anticipating connection storms, you can also increase the number of workers that enforce the login restriction by setting the pgtle.clientauth_num_parallel_workers parameter to a value greater than 1.
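
As an illustration of the skip list, here is how an extra admin role could be exempted without editing postgresql.conf by hand. Treat it as a sketch: admin_role is a hypothetical role name, and I'm assuming (per the pg_tle docs) that this parameter can be changed with a reload. The context column in pg_settings will tell you for sure.

```sql
-- robins comes from the configuration above; admin_role is hypothetical
ALTER SYSTEM SET pgtle.clientauth_users_to_skip = 'robins,admin_role';

-- Ask the server to re-read its configuration
SELECT pg_reload_conf();

-- Confirm the new value is in effect
SHOW pgtle.clientauth_users_to_skip;
```

Configuration files are not replicated, so a change like this has to be made on each server where it should apply.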

Conclusion

By following the above steps, you've now successfully configured your PostgreSQL environment to restrict certain users to only login to the read-replicas. This helps not just optimize your database performance, but also bolster security.

Let me know if you'd like to explore more advanced scenarios or discuss IAM integration for fine-grained access control!

Reference

[1] Clientauth Hook Documentation - https://github.com/aws/pg_tle/blob/main/docs/04_hooks.md
[2] Install pg_tle On Ubuntu - https://www.thatguyfromdelhi.com/2024/04/installing-pgtle-on-ubuntu-quick-guide.html
[3] Unlock PostgreSQL Super Powers with pg_tle - https://www.thatguyfromdelhi.com/2024/04/unlock-postgresql-superpowers-with-pgtle.html

2 Sept 2017

psql \d now supports Interleaved / Compound SORTKEYs (in Redshift)

As a side-note, there is a consideration as to whether this should be in a separate section of its own (and not under Indexes, which it clearly isn't). Maybe another day. Happy Redshifting :) !

Update (15th Sep 2017):
This project has now been named PsqlForks!

17 May 2015

Basic OLAP Support in PostgreSQL

While reviewing an existing application, I thought it'd be worthwhile to review how good / bad PostgreSQL is in terms of OLAP. This (growing) post is going to be my (un)learning of how ready PostgreSQL is.

Row Numbering

Support: Yes.
Use: Row_Number() function numbers rows generated in a result-set.
Example:

SELECT
row_number() OVER (ORDER BY marks DESC) AS rn,
name
FROM x;
Review: Some databases have different variants that accomplish this (for e.g. Oracle has a pseudo column called ROWNUM), but PostgreSQL fully supports the SQL Compliant syntax.
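
To try this out, a throwaway table could be set up as below. The table layout and the sample rows are made up for illustration; the post only tells us that x has name and marks columns.

```sql
-- Hypothetical sample data for the examples in this post
CREATE TABLE x (name text, marks integer);
INSERT INTO x (name, marks) VALUES
  ('Asha', 92), ('Bilal', 85), ('Chen', 85), ('Divya', 71);

-- Adding name as a tie-breaker makes the numbering deterministic
-- when two rows have the same marks
SELECT
  row_number() OVER (ORDER BY marks DESC, name) AS rn,
  name
FROM x;
```

Without the tie-breaker, the order in which Bilal and Chen get numbered is not guaranteed.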

Rank()

Support: Yes.
Use: Rank() and Dense_Rank() functions number the rank of the compared item.
Example:

SELECT
rank() OVER (ORDER BY marks DESC) AS rn,
dense_rank() OVER (ORDER BY marks DESC) AS drn,
name
FROM x;
Review: It's useful and fully supported.
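
The difference between the two only shows up when there are ties. A small self-contained illustration, with made-up inline data instead of a real table:

```sql
SELECT name, marks,
       rank()       OVER (ORDER BY marks DESC) AS rn,
       dense_rank() OVER (ORDER BY marks DESC) AS drn
FROM (VALUES ('Asha', 92), ('Bilal', 85), ('Chen', 85), ('Divya', 71))
     AS x(name, marks)
ORDER BY marks DESC, name;
-- rn  : 1, 2, 2, 4   (rank leaves a gap after the tie)
-- drn : 1, 2, 2, 3   (dense_rank does not)
```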

Window Clause

Support

OVER (PARTITION BY): Yes
OVER (ORDER BY): Yes
OVER (RANGE): Yes

Use: Read more here.
Example:
Review: These are extremely helpful for people serious about data-extraction / reporting and fully supported.
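
Since the example above is empty, here is a small sketch that combines all three clauses. The inline data and column names are made up for illustration.

```sql
SELECT class, name, marks,
       -- one average per class, repeated on every row of that class
       avg(marks) OVER (PARTITION BY class) AS class_avg,
       -- running total within the class, highest marks first
       sum(marks) OVER (PARTITION BY class
                        ORDER BY marks DESC
                        RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM (VALUES ('A', 'Asha', 92), ('A', 'Bilal', 85), ('A', 'Chen', 85),
             ('B', 'Divya', 71), ('B', 'Esha', 64))
     AS t(class, name, marks)
ORDER BY class, marks DESC, name;
```

With RANGE, rows that tie on the ORDER BY value are peers and fall into the same frame, so Bilal and Chen both show a running_total of 262 (92 + 85 + 85).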

NTile

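Support: Yes
Use: Ntile() function splits the ordered rows into the given number of (as equal as possible) buckets and returns the bucket number of each row.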
Example:

SELECT
ntile(4) OVER (ORDER BY marks DESC) AS quartile,
ntile(10) OVER (ORDER BY marks DESC) AS decile,
ntile(100) OVER (ORDER BY marks DESC) AS percentile,
name
FROM x;
Review: Versatile and fully supported.
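
When the row count doesn't divide evenly, the earlier buckets get the extra rows. A quick way to see that, using generate_series in place of a real table:

```sql
SELECT n, ntile(4) OVER (ORDER BY n) AS bucket
FROM generate_series(1, 10) AS g(n);
-- 10 rows in 4 buckets: buckets 1 and 2 hold three rows each,
-- buckets 3 and 4 hold two rows each
```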

Nested OLAP Aggregations

Support: No

But doable with alternative SQL? : Yes

Is that as Performant? : Mostly No

Description: Allow something like

SELECT
subject,
AVG(SUM(marks) GROUP BY class)
FROM marks
GROUP BY subject;
Alternative: This could be done with Sub-Selects like this:

SELECT
  subject,
  AVG(sum_marks) AS avg
FROM (
  SELECT
    subject,
    class,
    SUM(marks) AS sum_marks
  FROM marks
  GROUP BY subject, class
) mrk
GROUP BY subject;
Review: In the two examples we are trying to calculate the Per-Subject-Average of (Total marks obtained in different classes). Although PostgreSQL doesn't support this form of nested aggregates, it clearly is a neat way of doing things. The alternative admittedly looks like a kludge, and this would be a nice-to-have feature.
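
One more way to write this, offered as a sketch rather than a recommendation, is to nest the aggregate inside a window function. PostgreSQL does allow that, because window functions are evaluated after GROUP BY. The same marks(subject, class, marks) table is assumed.

```sql
SELECT DISTINCT
       subject,
       -- SUM(marks) is computed per (subject, class) group first,
       -- then averaged across all groups of the same subject
       AVG(SUM(marks)) OVER (PARTITION BY subject) AS avg_of_sums
FROM marks
GROUP BY subject, class;
```

The DISTINCT collapses the one-row-per-class output back to one row per subject. I haven't compared its performance against the sub-select version.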

GROUPING SETS

Support: Yes (in 9.5)

Alternative: This could alternatively be done with UNION ALL like this:

SELECT SubjectID,
NULL AS StudentID,
AVG(marks)
FROM marks
GROUP BY SubjectID

UNION ALL

SELECT NULL AS SubjectID,
StudentID,
AVG(marks)
FROM marks
GROUP BY StudentID;
Review: Popular databases (Oracle / MSSQL) support this well. PostgreSQL has had this on the ToDo list for at least a decade! Looking at the alternative, one can see that it is not just lengthy (and repetitive, thus error-prone), but also non-performant (simply because it requires multiple passes over the same data-set).
History:

Already in PostgreSQL TODO list
Discussions started (at least) way back in 2003.
Patch:

A 2008 patch that didn't make it.
A 2014 patch was heavily in discussion since and finally just got through to PostgreSQL 9.5.
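
For comparison, the UNION ALL alternative could be written with the 9.5 syntax roughly as follows, assuming the same marks table (the avg_marks alias is mine):

```sql
-- One pass over marks, two groupings:
-- rows grouped by SubjectID have StudentID = NULL, and vice versa
SELECT SubjectID,
       StudentID,
       AVG(marks) AS avg_marks
FROM marks
GROUP BY GROUPING SETS ((SubjectID), (StudentID));
```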

ROLLUP

Description: An obvious extension to GROUPING SETS (explained above), ROLLUP could be explained with a simple example:

GROUP BY ROLLUP (Year, SubjectID, StudentID)

is equivalent to

GROUP BY GROUPING SETS (
(Year, SubjectID, StudentID),
(Year, SubjectID),
(Year),
()
);
Support: Yes (in 9.5)
Alternative: This could be alternatively done with CTEs.

WITH x AS (
SELECT Year, SubjectID, StudentID, marks
FROM marks
WHERE passed
AND NOT inactive
)
SELECT *
FROM x

UNION ALL

SELECT
Year, SubjectID, StudentID, AVG(marks)
FROM x
GROUP BY Year, SubjectID, StudentID

UNION ALL

SELECT
Year, SubjectID, NULL AS StudentID, AVG(marks)
FROM x
GROUP BY Year, SubjectID

UNION ALL

SELECT
Year, NULL AS SubjectID, NULL AS StudentID,
AVG(marks)
FROM x
GROUP BY Year;
Review: ROLLUPs are ideal for generating things like sub-totals, which at times are a key performance factor when generating large reports. The alternative essentially uses a CTE, which is subsequently used to calculate subtotals and totals. For multiple reasons this is sub-optimal, and in-built support can do it faster. Besides, the alternative is lengthy & repetitive (thus error-prone).
History:

Discussions started (at least) way back in 2003.
Patches submitted

The 2010 patch seemingly didn't make it.
The 2014 attempt finally got through.
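
With 9.5, the whole CTE construction shrinks to a single query. A sketch, assuming the same marks table and filters as in the alternative above:

```sql
SELECT Year, SubjectID, StudentID,
       AVG(marks) AS avg_marks,
       -- bitmask of rolled-up columns: 0 = full detail, 1 = StudentID rolled up,
       -- 3 = SubjectID and StudentID rolled up, 7 = grand total
       GROUPING(Year, SubjectID, StudentID) AS grouping_level
FROM marks
WHERE passed
  AND NOT inactive
GROUP BY ROLLUP (Year, SubjectID, StudentID)
ORDER BY Year, SubjectID, StudentID;
```

GROUPING() is handy for telling a sub-total row apart from a row where the column is genuinely NULL in the data.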

CUBE

Support: Yes (in 9.5)
Description: Just like ROLLUP (was an extension of GROUPING SETS), CUBEs are an extension of ROLLUP (and thereby GROUPING SETS) and could be explained with the following example:

GROUP BY CUBE (Year, SubjectID, StudentID)
is equivalent to

GROUP BY GROUPING SETS (
(Year, SubjectID, StudentID),
(Year, SubjectID),
(Year, StudentID),
(Year),
(SubjectID, StudentID),
(SubjectID),
(StudentID),
()
);
Review: The alternative (not provided for obvious reasons) is not just lengthy & repetitive (thus error-prone) but primarily not as performant as is otherwise possible.
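
A smaller two-column example makes the pattern easier to see. This assumes the same marks table as earlier.

```sql
SELECT SubjectID, StudentID, AVG(marks) AS avg_marks
FROM marks
GROUP BY CUBE (SubjectID, StudentID);
-- equivalent to GROUPING SETS (
--   (SubjectID, StudentID), (SubjectID), (StudentID), ()
-- )
```

In general, CUBE over n columns expands to 2^n grouping sets, which is why the three-column version above lists eight.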

MERGE INTO / UPSERT

Support: Yes (in 9.5, as INSERT .. ON CONFLICT)
Doable with alternative SQL? : Yes (for 9.4 and below)

Is the alternative as Performant?

No: This is because the alternative (URL given below) is a BEGIN/EXCEPTION based solution which is (time-wise) costly and an in-built support would certainly be faster.

Description: For those new to the complexity of MERGE (or UPSERT) please read this first.

TLDR: In the face of Concurrent Use, MERGE is difficult when trying to balance Performance vs Integrity.

Unlike some other database engines (that are sometimes okay with trading off Integrity when it conflicts with Performance), PostgreSQL consistently prioritizes Data Integrity. The 'best' solution has taken longer than expected, but given that a complicated open-source development model needs to coherently agree upon core feature additions, it really takes a few falling stars to get this piece of code in with most people in support of it.
Example (SQL to create scenario + below SQL taken from here):

MERGE INTO bonuses B
USING (
  SELECT employee_id, salary
  FROM employee
  WHERE dept_no = 20) E
ON (B.employee_id = E.employee_id)
WHEN MATCHED THEN
  UPDATE SET B.bonus = E.salary * 0.1
WHEN NOT MATCHED THEN
  INSERT (B.employee_id, B.bonus)
  VALUES (E.employee_id, E.salary * 0.05);
Alternative: The PostgreSQL documentation mentions one recommended way of doing UPSERT / MERGE here. But again, this is non-performant, and 9.5 replaces it with in-built support for INSERT .. ON CONFLICT (a.k.a. UPSERT).
History:

MySQL / Oracle / MSSQL support this very well.
Long-pending requirement as per the Wiki, and it has now finally made it through!
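
For reference, the MERGE example above could be expressed with the 9.5 syntax roughly as below. This is a sketch that assumes bonuses.employee_id has a unique constraint (ON CONFLICT needs one) and that employee_id is unique within employee. A standard MERGE statement only arrived much later, in PostgreSQL 15.

```sql
INSERT INTO bonuses (employee_id, bonus)
SELECT e.employee_id, e.salary * 0.05      -- the NOT MATCHED branch
FROM employee e
WHERE e.dept_no = 20
ON CONFLICT (employee_id)
-- the MATCHED branch: EXCLUDED holds the row that failed to insert,
-- so EXCLUDED.bonus is salary * 0.05 and doubling it gives salary * 0.1
DO UPDATE SET bonus = EXCLUDED.bonus * 2;
```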

16 May 2015

Postgres finally has CUBE / ROLLUP / GROUPING SETS !

Finally !

A *much* awaited feature: this attempt at adding GROUPING SETS / ROLLUP / CUBE to PostgreSQL has been in the works for about a year (besides the many attempts over the past decade and a half that didn't get through). Thankfully it has finally got the approval of the powers that be, so the upcoming Postgres 9.5 will finally have this long-pending SQL feature.

MSSQL and Oracle have had this for a while, and it's about time that PostgreSQL sported it as well. A big boon for report-generating SQL, this feature makes what was earlier possible only with lots of unmanageable SQL hacks now possible with much cleaner code, and with much better (at times single-pass) performance.

Read here to know more about OLAP support in PostgreSQL.

Thanks a ton Andrew Gierth and Atri Sharma and so many others who directly or indirectly assisted in getting this patch out of the door!

Andrew / Atri... take that long pending break... one look at that mail thread and it seems you deserve it :D !