Postgresql

Avoiding subquery pyramids of doom

Something I found tricky with SQL is avoiding “write once, read never” code with subqueries nested in subqueries nested in subqueries…

WHEN subqueries

These subqueries are predicates used to filter rows from a table. They are limited to one column per subquery, making some of the examples in SQLite fairly convoluted. In plain vanilla SQL, they are all that is available in DELETE and UPDATE, but Postgresql allows DELETE to have an UPDATE clause and USING to have FROM subqueries just like SELECT make moving subqueries out of WHERE often a good idea.

The documentation splits the ways to build “WHEN” subqueries between Subquery Expressions and Row and Array Comparisons

Something I initially found a bit weird is that “simple comparison” such as filtering for maximum or minimum values require a “SELECT subquery” (ie aggregate function) used in a WHEN subquery, eg:

SELECT city 
FROM weather
WHERE temp_lo = (SELECT max(temp_lo) FROM weather);

IN

Stanford’s SQL MooC illustrates this as a way to find all students who applied to CS. In my view, the correct way is

SELECT DISTINCT sID, sName
FROM Student NATURAL JOIN APPLY
WHERE major = 'CS';

But this can also be done

SELECT sID, sName
FROM Student
WHERE sID IN (SELECT sID FROM Apply WHERE major = 'CS');

Note in the above we no longer need DISTINCT since each sID in the Student table is unique.

A snag I find with this is the subquery must return exactly one column, which resulted in some very ugly code in my version of DELETE examples in the course using SQLite which does not allow Postgresql’s USING workaround.

While my bias is towards JOIN, according to EXPLAIN the WHERE subquery is quicker:

course=> EXPLAIN SELECT DISTINCT sID, sName
FROM Student NATURAL JOIN APPLY
WHERE major = 'CS';
                                  QUERY PLAN                                  
------------------------------------------------------------------------------
 Unique  (cost=43.97..44.09 rows=17 width=36)
   ->  Sort  (cost=43.97..44.01 rows=17 width=36)
         Sort Key: student.sid, student.sname
         ->  Hash Join  (cost=17.91..43.62 rows=17 width=36)
               Hash Cond: (student.sid = apply.sid)
               ->  Seq Scan on student  (cost=0.00..21.30 rows=1130 width=36)
               ->  Hash  (cost=17.88..17.88 rows=3 width=4)
                     ->  Seq Scan on apply  (cost=0.00..17.88 rows=3 width=4)
                           Filter: (major = 'CS'::text)
(9 rows)

course=> EXPLAIN SELECT sID, sName
FROM Student
WHERE sID IN (SELECT sID FROM Apply WHERE major = 'CS');
                            QUERY PLAN                            
------------------------------------------------------------------
 Hash Semi Join  (cost=17.91..42.37 rows=17 width=36)
   Hash Cond: (student.sid = apply.sid)
   ->  Seq Scan on student  (cost=0.00..21.30 rows=1130 width=36)
   ->  Hash  (cost=17.88..17.88 rows=3 width=4)
         ->  Seq Scan on apply  (cost=0.00..17.88 rows=3 width=4)
               Filter: (major = 'CS'::text)
(6 rows)

I thought maybe moving the subquery to FROM might help

SELECT DISTINCT sID, sName
FROM (SELECT sID FROM Apply WHERE major = 'CS') NATURAL JOIN Student;

but according to explain the cost remains the same as joining the complete tables.

This offers an alternative way of solving “students who applied to CS but not EE”.

SELECT sID, sName
FROM Student NATURAL JOIN APPLY
WHERE major = 'CS'
EXCEPT
SELECT sID, sName
FROM Student NATURAL JOIN APPLY
WHERE major = 'EE';

SELECT sID, sName
FROM Student
WHERE sID IN (SELECT sID FROM Apply WHERE major = 'CS')
AND NOT sID IN (SELECT sID FROM Apply WHERE major = 'EE');

NOT IN

The above can alternatively be written

SELECT sID, sName
FROM Student
WHERE sID IN (SELECT sID FROM Apply WHERE major = 'CS')
AND sID NOT IN (SELECT sID FROM Apply WHERE major = 'EE');

EXISTS

Colleges such that some other college is in the same state:

SELECT cName, state
FROM College C1
WHERE EXISTS (SELECT * FROM College C2
              WHERE C2.state = C1.state AND C2.cName <> C1.cName);

This allows an alternative (and mind bending) way to write “Biggest college”.

Instead of

SELECT cName
FROM College
WHERE enrollment = (SELECT max(enrollment) FROM College);

it can be done

SELECT cName
FROM College C1
WHERE NOT EXISTS (SELECT * FROM College C2
                  WHERE C2.enrollment > C1.enrollment);

course=> EXPLAIN SELECT cName
FROM College   
WHERE enrollment = (SELECT max(enrollment) FROM College);
                                    QUERY PLAN                                  
  
--------------------------------------------------------------------------------
--
 Seq Scan on college  (cost=20.64..41.26 rows=4 width=32)
   Filter: (enrollment = $0)
   InitPlan 1 (returns $0)
     ->  Aggregate  (cost=20.62..20.64 rows=1 width=4)
           ->  Seq Scan on college college_1  (cost=0.00..18.50 rows=850 width=4
)
(5 rows)

course=> EXPLAIN SELECT cName
FROM College C1
WHERE NOT EXISTS (SELECT * FROM College C2
                  WHERE C2.enrollment > C1.enrollment);
                               QUERY PLAN                                
-------------------------------------------------------------------------
 Nested Loop Anti Join  (cost=0.00..7276.86 rows=567 width=32)
   Join Filter: (c2.enrollment > c1.enrollment)
   ->  Seq Scan on college c1  (cost=0.00..18.50 rows=850 width=36)
   ->  Materialize  (cost=0.00..22.75 rows=850 width=4)
         ->  Seq Scan on college c2  (cost=0.00..18.50 rows=850 width=4)
(5 rows)

ALL

Yet another way to write “Biggest college”:

SELECT cName
FROM College S1
WHERE enrollment > ALL (SELECT enrollment FROM College S2
                        WHERE S2.cName <> S1.cName);

ANY/SOME

And yet another way to write “Biggest college”:

SELECT cName
FROM College S1
WHERE NOT enrollment <= ANY (SELECT enrollment FROM College S2
                             WHERE S2.cName <> S1.cName);

FROM subqueries

These can return multiple rows and columns, what Postgresql’s documentation calls table expressions.

Subqueries specifying a derived table must be enclosed in parentheses. According to the SQL standard, a table alias name must be supplied for a subquery. PostgreSQL allows AS and the alias to be omitted, but writing one is good practice in SQL code that might be ported to another system.

SELECT subqueries

These have to return exactly one row and column, ie, be Aggregate Functions.

I’ve put examples in a separate section.

Style tips from manual

It is widely considered good style to qualify all column names in a join query, so that the query won’t fail if a duplicate column name is later added to one of the tables.

Client Applications

Utilities

ecpg embedded SQL C preprocessor
pgbench run a benchmark test on PostgreSQL
pg_amcheck checks for corruption in one or more PostgreSQL databases
pg_basebackup take a base backup of a PostgreSQL cluster
pg_config retrieve information about the installed version of PostgreSQL
pg_dump extract a PostgreSQL database into a script file or other archive file
pg_dumpall extract a PostgreSQL database cluster into a script file
pg_isready check the connection status of a PostgreSQL server
pg_receivewal stream write-ahead logs from a PostgreSQL server
pg_recvlogical control PostgreSQL logical decoding streams
pg_restore restore a PostgreSQL database from an archive file created by pg_dump
pg_verifybackup verify the integrity of a base backup of a PostgreSQL cluster
psql PostgreSQL interactive terminal

Wrappers

Server Applications

initdb create a new PostgreSQL database cluster
pg_archivecleanup clean up PostgreSQL WAL archive files
pg_checksums enable, disable or check data checksums in a PostgreSQL database cluster
pg_controldata display control information of a PostgreSQL database cluster pg_controldata /var/lib/postgres/data
pg_ctl initialize, start, stop, or control a PostgreSQL server
pg_resetwal reset the write-ahead log and other control information of a PostgreSQL database cluster
pg_rewind synchronize a PostgreSQL data directory with another data directory that was forked from it
pg_test_fsync determine fastest wal_sync_method for PostgreSQL
pg_test_timing measure timing overhead
pg_upgrade upgrade a PostgreSQL server instance
pg_waldump display a human-readable rendering of the write-ahead log of a PostgreSQL database cluster
postgres PostgreSQL database server

with-recursive

The Postgresql documentation has various code examples.

The basic structure of a recursive query is

WITH RECURSIVE recursive_table(init, next, ...) AS (
  SELECT init, next, ... 
  FROM graph_table
    UNION
  SELECT recursive_table.init, graph_table.next, ... 
  FROM graph_table JOIN recursive_table ON recursive_table.next = graph_table.init 
)
SELECT ...
FROM recursive_table
WHERE init = 'Start'
AND next = 'End';

The base query can’t reference mytable while the recursive query joins mytable and graph_table to make a transitive closure.

These examples are taken from edX’s Databases: OLAP and Recursion free online course.

commands

aggregate-functions

Lowest GPA of students applying to CS

There are two places the WHERE clause could go:

SELECT min(gpa)
FROM Student INNER JOIN Apply USING (sID)
WHERE major = 'CS';

SELECT min(gpa) FILTER (WHERE major = 'CS')
FROM Student INNER JOIN Apply USING (sID);

If we wanted the overall min to compare to the CS min:

SELECT min(gpa) AS overall_min, min(gpa) FILTER (WHERE major = 'CS') AS cs_min
FROM Student INNER JOIN Apply USING (sID);

Number of students applying to Cornell

There’s a trap here in that some students made several applications. Here using DISTINCT inside the aggregate function helps:

configuration

Manual

In my system (Arch Linux), the file is /var/lib/postgres/data/postgresql.conf.

FILE LOCATIONS
data_directory = /var/lib/postgres/data
hba_file = /var/lib/postgres/data/pg_hba.conf
ident_file = /var/lib/postgres/data/pg_ident.conf
external_pid_file = ''
CONNECTIONS AND AUTHENTICATION
RESOURCE USAGE (except WAL)
shared_buffers = 128MB

Resource Usage

The default shared_buffers = 128MB should be raised to between 15% and 25% of available RAM according to How to tune PostgreSQL for memory

UNION ALL vs UNION

The result of UNION does not contain any duplicate rows unless the ALL option is specified. ALL prevents elimination of duplicates. (Therefore, UNION ALL is usually significantly quicker than UNION; use ALL when you can.) DISTINCT can be written to explicitly specify the default behavior of eliminating duplicate rows. — UNION Clause

Sometimes, using UNION instead of UNION ALL can accomplish this by discarding rows that duplicate previous output rows. — Cycle Detection

types

The CAST syntax conforms to SQL; the syntax with :: is historical PostgreSQL usage. — Type Casts

manual

Numeric

Monetary

Character

Binary

Date/Time

Boolean

AND, OR, NOT

Comparison

Enumerated

Geometric

Network Address

Bit String

Text Search

UUID

XML

JSON

Functions

Arrays

Composite

Range

Domain

Object Identifier

pg_lsn

Pseudo

plpgsql

CREATE OR REPLACE FUNCTION

My Style

RETURN vs OUT parameters

Postgresql functions can either be written
CREATE OR REPLACE FUNCTION add(a integer, b integer) RETURNS integer
LANGUAGE plpgsql
AS $code$
BEGIN
RETURN a + b;
END;
$code$;
Alternatively, it can be written in a more Prologish style with parameters labeled IN or OUT (or INOUT).
CREATE OR REPLACE FUNCTION add(IN a integer, IN b integer, OUT c integer)
LANGUAGE plpgsql
AS $code$
BEGIN
c = a + b;
END;
$code$;
Since most of the functions for my project involve message broker patterns message_broker(IN json_in JSONB, OUT json_out JSONB), I find the Prologish style better.

Frontier Software