Postgresql
Avoiding subquery pyramids of doom
Something I found tricky with SQL is avoiding “write once, read never” code with subqueries nested in subqueries nested in subqueries…
WHEN subqueries
These subqueries are predicates used to filter rows from a table. They are limited to one column per subquery, making some of the examples in SQLite fairly convoluted. In plain vanilla SQL, they are all that is available in DELETE and UPDATE, but Postgresql allows DELETE to have an UPDATE clause and USING to have FROM subqueries just like SELECT make moving subqueries out of WHERE often a good idea.
The documentation splits the ways to build “WHEN” subqueries between Subquery Expressions and Row and Array Comparisons
Something I initially found a bit weird is that “simple comparison” such as filtering for maximum or minimum values require a “SELECT subquery” (ie aggregate function) used in a WHEN subquery, eg:
SELECT city
FROM weather
WHERE temp_lo = (SELECT max(temp_lo) FROM weather);
IN
Stanford’s SQL MooC illustrates this as a way to find all students who applied to CS. In my view, the correct way is
SELECT DISTINCT sID, sName
FROM Student NATURAL JOIN APPLY
WHERE major = 'CS';
But this can also be done
SELECT sID, sName
FROM Student
WHERE sID IN (SELECT sID FROM Apply WHERE major = 'CS');
Note in the above we no longer need DISTINCT since each sID in the Student table is unique.
A snag I find with this is the subquery must return exactly one column, which resulted in some very ugly code in my version of DELETE examples in the course using SQLite which does not allow Postgresql’s USING workaround.
While my bias is towards JOIN, according to EXPLAIN the WHERE subquery is quicker:
course=> EXPLAIN SELECT DISTINCT sID, sName
FROM Student NATURAL JOIN APPLY
WHERE major = 'CS';
QUERY PLAN
------------------------------------------------------------------------------
Unique (cost=43.97..44.09 rows=17 width=36)
-> Sort (cost=43.97..44.01 rows=17 width=36)
Sort Key: student.sid, student.sname
-> Hash Join (cost=17.91..43.62 rows=17 width=36)
Hash Cond: (student.sid = apply.sid)
-> Seq Scan on student (cost=0.00..21.30 rows=1130 width=36)
-> Hash (cost=17.88..17.88 rows=3 width=4)
-> Seq Scan on apply (cost=0.00..17.88 rows=3 width=4)
Filter: (major = 'CS'::text)
(9 rows)
course=> EXPLAIN SELECT sID, sName
FROM Student
WHERE sID IN (SELECT sID FROM Apply WHERE major = 'CS');
QUERY PLAN
------------------------------------------------------------------
Hash Semi Join (cost=17.91..42.37 rows=17 width=36)
Hash Cond: (student.sid = apply.sid)
-> Seq Scan on student (cost=0.00..21.30 rows=1130 width=36)
-> Hash (cost=17.88..17.88 rows=3 width=4)
-> Seq Scan on apply (cost=0.00..17.88 rows=3 width=4)
Filter: (major = 'CS'::text)
(6 rows)
I thought maybe moving the subquery to FROM might help
SELECT DISTINCT sID, sName
FROM (SELECT sID FROM Apply WHERE major = 'CS') NATURAL JOIN Student;
but according to explain the cost remains the same as joining the complete tables.
This offers an alternative way of solving “students who applied to CS but not EE”.
SELECT sID, sName
FROM Student NATURAL JOIN APPLY
WHERE major = 'CS'
EXCEPT
SELECT sID, sName
FROM Student NATURAL JOIN APPLY
WHERE major = 'EE';
SELECT sID, sName
FROM Student
WHERE sID IN (SELECT sID FROM Apply WHERE major = 'CS')
AND NOT sID IN (SELECT sID FROM Apply WHERE major = 'EE');
NOT IN
The above can alternatively be written
SELECT sID, sName
FROM Student
WHERE sID IN (SELECT sID FROM Apply WHERE major = 'CS')
AND sID NOT IN (SELECT sID FROM Apply WHERE major = 'EE');
EXISTS
Colleges such that some other college is in the same state:
SELECT cName, state
FROM College C1
WHERE EXISTS (SELECT * FROM College C2
WHERE C2.state = C1.state AND C2.cName <> C1.cName);
This allows an alternative (and mind bending) way to write “Biggest college”.
Instead of
SELECT cName
FROM College
WHERE enrollment = (SELECT max(enrollment) FROM College);
it can be done
SELECT cName
FROM College C1
WHERE NOT EXISTS (SELECT * FROM College C2
WHERE C2.enrollment > C1.enrollment);
course=> EXPLAIN SELECT cName
FROM College
WHERE enrollment = (SELECT max(enrollment) FROM College);
QUERY PLAN
--------------------------------------------------------------------------------
--
Seq Scan on college (cost=20.64..41.26 rows=4 width=32)
Filter: (enrollment = $0)
InitPlan 1 (returns $0)
-> Aggregate (cost=20.62..20.64 rows=1 width=4)
-> Seq Scan on college college_1 (cost=0.00..18.50 rows=850 width=4
)
(5 rows)
course=> EXPLAIN SELECT cName
FROM College C1
WHERE NOT EXISTS (SELECT * FROM College C2
WHERE C2.enrollment > C1.enrollment);
QUERY PLAN
-------------------------------------------------------------------------
Nested Loop Anti Join (cost=0.00..7276.86 rows=567 width=32)
Join Filter: (c2.enrollment > c1.enrollment)
-> Seq Scan on college c1 (cost=0.00..18.50 rows=850 width=36)
-> Materialize (cost=0.00..22.75 rows=850 width=4)
-> Seq Scan on college c2 (cost=0.00..18.50 rows=850 width=4)
(5 rows)
ALL
Yet another way to write “Biggest college”:
SELECT cName
FROM College S1
WHERE enrollment > ALL (SELECT enrollment FROM College S2
WHERE S2.cName <> S1.cName);
ANY/SOME
And yet another way to write “Biggest college”:
SELECT cName
FROM College S1
WHERE NOT enrollment <= ANY (SELECT enrollment FROM College S2
WHERE S2.cName <> S1.cName);
FROM subqueries
These can return multiple rows and columns, what Postgresql’s documentation calls table expressions.
Subqueries specifying a derived table must be enclosed in parentheses. According to the SQL standard, a table alias name must be supplied for a subquery. PostgreSQL allows AS and the alias to be omitted, but writing one is good practice in SQL code that might be ported to another system.
SELECT subqueries
These have to return exactly one row and column, ie, be Aggregate Functions.
I’ve put examples in a separate section.
Style tips from manual
- It is widely considered good style to qualify all column names in a join query, so that the query won’t fail if a duplicate column name is later added to one of the tables.
Client Applications
Utilities
- ecpg embedded SQL C preprocessor
- pgbench run a benchmark test on PostgreSQL
- pg_amcheck checks for corruption in one or more PostgreSQL databases
- pg_basebackup take a base backup of a PostgreSQL cluster
- pg_config retrieve information about the installed version of PostgreSQL
- pg_dump extract a PostgreSQL database into a script file or other archive file
- pg_dumpall extract a PostgreSQL database cluster into a script file
- pg_isready check the connection status of a PostgreSQL server
- pg_receivewal stream write-ahead logs from a PostgreSQL server
- pg_recvlogical control PostgreSQL logical decoding streams
- pg_restore restore a PostgreSQL database from an archive file created by pg_dump
- pg_verifybackup verify the integrity of a base backup of a PostgreSQL cluster
- psql PostgreSQL interactive terminal
Wrappers
- clusterdb
CLUSTER
- createdb
CREATE DATABASE
- createuser CREATE ROLE
- dropdb DROP DATABASE
- dropuser DROP ROLE
- reindexdb REINDEX
- vacuumdb VACUUM
Server Applications
- initdb create a new PostgreSQL database cluster
- pg_archivecleanup clean up PostgreSQL WAL archive files
- pg_checksums enable, disable or check data checksums in a PostgreSQL database cluster
- pg_controldata display control information of a PostgreSQL database cluster
pg_controldata /var/lib/postgres/data
- pg_ctl initialize, start, stop, or control a PostgreSQL server
- pg_resetwal reset the write-ahead log and other control information of a PostgreSQL database cluster
- pg_rewind synchronize a PostgreSQL data directory with another data directory that was forked from it
- pg_test_fsync determine fastest wal_sync_method for PostgreSQL
- pg_test_timing measure timing overhead
- pg_upgrade upgrade a PostgreSQL server instance
- pg_waldump display a human-readable rendering of the write-ahead log of a PostgreSQL database cluster
- postgres PostgreSQL database server
backups
https://techdocs.akamai.com/cloud-computing/docs/manage-block-storage-volumes-with-the-api
Backup and Restore
with-recursive
The Postgresql documentation has various code examples.
The basic structure of a recursive query is
WITH RECURSIVE recursive_table(init, next, ...) AS (
SELECT init, next, ...
FROM graph_table
UNION
SELECT recursive_table.init, graph_table.next, ...
FROM graph_table JOIN recursive_table ON recursive_table.next = graph_table.init
)
SELECT ...
FROM recursive_table
WHERE init = 'Start'
AND next = 'End';
The base query can’t reference mytable while the recursive query joins mytable and graph_table to make a transitive closure.
These examples are taken from edX’s Databases: OLAP and Recursion free online course.
commands
aggregate-functions
Lowest GPA of students applying to CS
There are two places the WHERE clause could go:
SELECT min(gpa)
FROM Student INNER JOIN Apply USING (sID)
WHERE major = 'CS';
SELECT min(gpa) FILTER (WHERE major = 'CS')
FROM Student INNER JOIN Apply USING (sID);
If we wanted the overall min to compare to the CS min:
SELECT min(gpa) AS overall_min, min(gpa) FILTER (WHERE major = 'CS') AS cs_min
FROM Student INNER JOIN Apply USING (sID);
Number of students applying to Cornell
There’s a trap here in that some students made several applications. Here using DISTINCT inside the aggregate function helps:
configuration
Manual
In my system (Arch Linux), the file is /var/lib/postgres/data/postgresql.conf
.
- FILE LOCATIONS
- data_directory = /var/lib/postgres/data
- hba_file = /var/lib/postgres/data/pg_hba.conf
- ident_file = /var/lib/postgres/data/pg_ident.conf
- external_pid_file = ''
- CONNECTIONS AND AUTHENTICATION
- RESOURCE USAGE (except WAL)
- shared_buffers = 128MB
Resource Usage
The default shared_buffers = 128MB
should be raised to between 15% and 25% of available RAM according to How to tune PostgreSQL for memory
Logging
Reporting deadlocks and resource exhaustion. — PostgreSQL Logging
The defaults are:
other-commands
concurrency
graphs
Postgresql uses WITH RECURSIVE to traverse graphs stored as databases.
UNION ALL vs UNION
The result of UNION does not contain any duplicate rows unless the ALL option is specified. ALL prevents elimination of duplicates. (Therefore, UNION ALL is usually significantly quicker than UNION; use ALL when you can.) DISTINCT can be written to explicitly specify the default behavior of eliminating duplicate rows. — UNION Clause
Sometimes, using UNION instead of UNION ALL can accomplish this by discarding rows that duplicate previous output rows. — Cycle Detection
types
The CAST syntax conforms to SQL; the syntax with :: is historical PostgreSQL usage. — Type Casts
- Numeric
- Monetary
- Character
- Binary
- Date/Time
- Boolean
- Enumerated
- Geometric
- Network Address
- Bit String
- Text Search
- UUID
- XML
- JSON
- Arrays
- Composite
- Range
- Domain
- Object Identifier
- pg_lsn
- Pseudo
plpgsql
CREATE OR REPLACE FUNCTION
My Style
RETURN vs OUT parameters
Postgresql functions can either be written
CREATE OR REPLACE FUNCTION add(a integer, b integer) RETURNS integer LANGUAGE plpgsql AS $code$ BEGIN RETURN a + b; END; $code$;
Alternatively, it can be written in a more Prologish style with parameters labeled IN or OUT (or INOUT).
CREATE OR REPLACE FUNCTION add(IN a integer, IN b integer, OUT c integer) LANGUAGE plpgsql AS $code$ BEGIN c = a + b; END; $code$;
Since most of the functions for my project involve message broker patterns
message_broker(IN json_in JSONB, OUT json_out JSONB)
, I find the Prologish style better.