Frontier Software

Postgresql

Documentation

Avoiding subquery pyramids of doom

Something I found tricky with SQL is avoiding “write once, read never” code with subqueries nested in subqueries nested in subqueries…

WHEN subqueries

These subqueries are predicates used to filter rows from a table. They are limited to one column per subquery, making some of the examples in SQLite fairly convoluted. In plain vanilla SQL, they are all that is available in DELETE and UPDATE, but Postgresql allows DELETE to have an UPDATE clause and USING to have FROM subqueries just like SELECT make moving subqueries out of WHERE often a good idea.

The documentation splits the ways to build “WHEN” subqueries between Subquery Expressions and Row and Array Comparisons

Something I initially found a bit weird is that “simple comparison” such as filtering for maximum or minimum values require a “SELECT subquery” (ie aggregate function) used in a WHEN subquery, eg:

SELECT city 
FROM weather
WHERE temp_lo = (SELECT max(temp_lo) FROM weather);

IN

Stanford’s SQL MooC illustrates this as a way to find all students who applied to CS. In my view, the correct way is

SELECT DISTINCT sID, sName
FROM Student NATURAL JOIN APPLY
WHERE major = 'CS';

But this can also be done

SELECT sID, sName
FROM Student
WHERE sID IN (SELECT sID FROM Apply WHERE major = 'CS');

Note in the above we no longer need DISTINCT since each sID in the Student table is unique.

A snag I find with this is the subquery must return exactly one column, which resulted in some very ugly code in my version of DELETE examples in the course using SQLite which does not allow Postgresql’s USING workaround.

While my bias is towards JOIN, according to EXPLAIN the WHERE subquery is quicker:

course=> EXPLAIN SELECT DISTINCT sID, sName
FROM Student NATURAL JOIN APPLY
WHERE major = 'CS';
                                  QUERY PLAN                                  
------------------------------------------------------------------------------
 Unique  (cost=43.97..44.09 rows=17 width=36)
   ->  Sort  (cost=43.97..44.01 rows=17 width=36)
         Sort Key: student.sid, student.sname
         ->  Hash Join  (cost=17.91..43.62 rows=17 width=36)
               Hash Cond: (student.sid = apply.sid)
               ->  Seq Scan on student  (cost=0.00..21.30 rows=1130 width=36)
               ->  Hash  (cost=17.88..17.88 rows=3 width=4)
                     ->  Seq Scan on apply  (cost=0.00..17.88 rows=3 width=4)
                           Filter: (major = 'CS'::text)
(9 rows)

course=> EXPLAIN SELECT sID, sName
FROM Student
WHERE sID IN (SELECT sID FROM Apply WHERE major = 'CS');
                            QUERY PLAN                            
------------------------------------------------------------------
 Hash Semi Join  (cost=17.91..42.37 rows=17 width=36)
   Hash Cond: (student.sid = apply.sid)
   ->  Seq Scan on student  (cost=0.00..21.30 rows=1130 width=36)
   ->  Hash  (cost=17.88..17.88 rows=3 width=4)
         ->  Seq Scan on apply  (cost=0.00..17.88 rows=3 width=4)
               Filter: (major = 'CS'::text)
(6 rows)

I thought maybe moving the subquery to FROM might help

SELECT DISTINCT sID, sName
FROM (SELECT sID FROM Apply WHERE major = 'CS') NATURAL JOIN Student;

but according to explain the cost remains the same as joining the complete tables.

This offers an alternative way of solving “students who applied to CS but not EE”.

SELECT sID, sName
FROM Student NATURAL JOIN APPLY
WHERE major = 'CS'
EXCEPT
SELECT sID, sName
FROM Student NATURAL JOIN APPLY
WHERE major = 'EE';
SELECT sID, sName
FROM Student
WHERE sID IN (SELECT sID FROM Apply WHERE major = 'CS')
AND NOT sID IN (SELECT sID FROM Apply WHERE major = 'EE');

NOT IN

The above can alternatively be written

SELECT sID, sName
FROM Student
WHERE sID IN (SELECT sID FROM Apply WHERE major = 'CS')
AND sID NOT IN (SELECT sID FROM Apply WHERE major = 'EE');

EXISTS

Colleges such that some other college is in the same state:

SELECT cName, state
FROM College C1
WHERE EXISTS (SELECT * FROM College C2
              WHERE C2.state = C1.state AND C2.cName <> C1.cName);

This allows an alternative (and mind bending) way to write “Biggest college”.

Instead of

SELECT cName
FROM College
WHERE enrollment = (SELECT max(enrollment) FROM College);

it can be done

SELECT cName
FROM College C1
WHERE NOT EXISTS (SELECT * FROM College C2
                  WHERE C2.enrollment > C1.enrollment);
course=> EXPLAIN SELECT cName
FROM College   
WHERE enrollment = (SELECT max(enrollment) FROM College);
                                    QUERY PLAN                                  
  
--------------------------------------------------------------------------------
--
 Seq Scan on college  (cost=20.64..41.26 rows=4 width=32)
   Filter: (enrollment = $0)
   InitPlan 1 (returns $0)
     ->  Aggregate  (cost=20.62..20.64 rows=1 width=4)
           ->  Seq Scan on college college_1  (cost=0.00..18.50 rows=850 width=4
)
(5 rows)

course=> EXPLAIN SELECT cName
FROM College C1
WHERE NOT EXISTS (SELECT * FROM College C2
                  WHERE C2.enrollment > C1.enrollment);
                               QUERY PLAN                                
-------------------------------------------------------------------------
 Nested Loop Anti Join  (cost=0.00..7276.86 rows=567 width=32)
   Join Filter: (c2.enrollment > c1.enrollment)
   ->  Seq Scan on college c1  (cost=0.00..18.50 rows=850 width=36)
   ->  Materialize  (cost=0.00..22.75 rows=850 width=4)
         ->  Seq Scan on college c2  (cost=0.00..18.50 rows=850 width=4)
(5 rows)

ALL

Yet another way to write “Biggest college”:

SELECT cName
FROM College S1
WHERE enrollment > ALL (SELECT enrollment FROM College S2
                        WHERE S2.cName <> S1.cName);

ANY/SOME

And yet another way to write “Biggest college”:

SELECT cName
FROM College S1
WHERE NOT enrollment <= ANY (SELECT enrollment FROM College S2
                             WHERE S2.cName <> S1.cName);

FROM subqueries

These can return multiple rows and columns, what Postgresql’s documentation calls table expressions.

Subqueries specifying a derived table must be enclosed in parentheses. According to the SQL standard, a table alias name must be supplied for a subquery. PostgreSQL allows AS and the alias to be omitted, but writing one is good practice in SQL code that might be ported to another system.

SELECT subqueries

These have to return exactly one row and column, ie, be Aggregate Functions.

I’ve put examples in a separate section.

Style tips from manual

Client Applications

Utilities

Wrappers

Server Applications

with-recursive

The Postgresql documentation has various code examples.

The basic structure of a recursive query is

WITH RECURSIVE recursive_table(init, next, ...) AS (
  SELECT init, next, ... 
  FROM graph_table
    UNION
  SELECT recursive_table.init, graph_table.next, ... 
  FROM graph_table JOIN recursive_table ON recursive_table.next = graph_table.init 
)
SELECT ...
FROM recursive_table
WHERE init = 'Start'
AND next = 'End';

The base query can’t reference mytable while the recursive query joins mytable and graph_table to make a transitive closure.

These examples are taken from edX’s Databases: OLAP and Recursion free online course.

commands

aggregate-functions

Lowest GPA of students applying to CS

There are two places the WHERE clause could go:

SELECT min(gpa)
FROM Student INNER JOIN Apply USING (sID)
WHERE major = 'CS';

SELECT min(gpa) FILTER (WHERE major = 'CS')
FROM Student INNER JOIN Apply USING (sID);

If we wanted the overall min to compare to the CS min:

SELECT min(gpa) AS overall_min, min(gpa) FILTER (WHERE major = 'CS') AS cs_min
FROM Student INNER JOIN Apply USING (sID);

Number of students applying to Cornell

There’s a trap here in that some students made several applications. Here using DISTINCT inside the aggregate function helps:

configuration

Manual

In my system (Arch Linux), the file is /var/lib/postgres/data/postgresql.conf.

Resource Usage

The default shared_buffers = 128MB should be raised to between 15% and 25% of available RAM according to How to tune PostgreSQL for memory

Logging

Reporting deadlocks and resource exhaustion. — PostgreSQL Logging

The defaults are:

other-commands

concurrency

Concurrency Control

graphs

Postgresql uses WITH RECURSIVE to traverse graphs stored as databases.

UNION ALL vs UNION

The result of UNION does not contain any duplicate rows unless the ALL option is specified. ALL prevents elimination of duplicates. (Therefore, UNION ALL is usually significantly quicker than UNION; use ALL when you can.) DISTINCT can be written to explicitly specify the default behavior of eliminating duplicate rows. — UNION Clause

Sometimes, using UNION instead of UNION ALL can accomplish this by discarding rows that duplicate previous output rows. — Cycle Detection

types

The CAST syntax conforms to SQL; the syntax with :: is historical PostgreSQL usage. — Type Casts

manual

plpgsql

CREATE OR REPLACE FUNCTION

My Style

RETURN vs OUT parameters

Postgresql functions can either be written

CREATE OR REPLACE FUNCTION add(a integer, b integer) RETURNS integer
LANGUAGE plpgsql
AS $code$
BEGIN
RETURN a + b;
END;
$code$;

Alternatively, it can be written in a more Prologish style with parameters labeled IN or OUT (or INOUT).

CREATE OR REPLACE FUNCTION add(IN a integer, IN b integer, OUT c integer)
LANGUAGE plpgsql
AS $code$
BEGIN
c = a + b;
END;
$code$;

Since most of the functions for my project involve message broker patterns message_broker(IN json_in JSONB, OUT json_out JSONB), I find the Prologish style better.