Simulating Rule 30 With Recursive SQL and Window Functions
Databases are commonly used as dumb storage bins for CRUD application data. Such narrow use cases don’t do them justice, however: Relational database systems such as PostgreSQL are much more capable and versatile than that, supporting a wide range of operations across many different data types.
The standard way of interfacing with Postgres is SQL – you know, that language you may have briefly learned in webdev class where you SELECT
stuff FROM
a set of tables WHERE
some condition is met. But that’s a far cry from what you can achieve when taking advantage of the full feature set of this declarative and – surprise! – Turing complete programming language.
In this post, I’ll demonstrate1 how to use the WITH RECURSIVE
construct and some basic window functions (neither of which you need any experience with to follow along – knowledge of basic SQL concepts should suffice) to implement the evolution of elementary cellular automata.
This post is loosely based on what I learned in the Advanced SQL lecture held by Torsten Grust in the summer of 2017 at the University of Tübingen. Take a look at the lecture slides for in-depth explanations and a wide array of examples. As a rare positive side-effect of the COVID-19 pandemic, an updated iteration of the course has been released on YouTube.
Cellular automata 101
(Know all about cellular automata? Click to skip straight to the SQL.)
If you haven’t encountered cellular automata before, try not to fall too deep down the rabbit hole. If you do, you might end up helping millions of undergrad students cheat on their math assignments – that’s what Stephen Wolfram did, after all, by creating Wolfram Alpha. Previously, in the early and mid-80s, he had been heavily involved in research of cellular automata. One might conclude that he’s a bit obsessed with them.
With good reason, to be fair: Cellular automata can be used as analogues for certain kinds of complex systems – in fact, they’re one of the prime synthetic examples of complex behavior emerging from simple rules. Even elementary cellular automata, the simplest kind, exhibit very interesting properties: Rule 110, which works by exactly the same mechanism as Rule 30 but with a different rule set, has been proven capable of supporting universal computation.
Apart from their multitude of scientific applications, cellular automata make for pretty images – a while ago, I’ve written a cellular automata poster generator which these visualizations of Rule 30, Rule 106, and Rule 73 have been created2 with:
Rule 30 even features on the exterior cladding of a train station in Cambridge. All that ought to be sufficient motivation for playing with cellular automata using SQL!
Quoting from Wolfram’s first published paper on the concept, “Statistical mechanics of cellular automata”, this is how he defines them (emphasis and circly numbery annotations mine):
Cellular automata are mathematical idealizations of physical systems in which space and time are discrete, and physical quantities take on a finite set of discrete values. A cellular automaton consists of a ① regular uniform lattice (or “array”), usually infinite in extent, with a ② discrete variable at each site (“cell”). The state of a cellular automaton is completely specified by the values of the variables at each site. A cellular automaton ③ evolves in discrete time steps, with the value of the variable at one site being affected by the values of variables at sites in its “neighborhood” on the previous time step. The neighborhood of a site is typically taken to be the site itself and all immediately adjacent sites. The variables at each site are ④ updated simultaneously (“synchronously”), based on the values of the variables in their neighborhood at the preceding time step, and according to a definite set of “local rules.”
Let’s translate this prose into a few graphics!
For elementary cellular automata, the ① array can be imagined as an one-dimensional row of square cells (other flavors of cellular automata may use, say, 2D grids).
Each ② cell (hence, cellular automata) of the array houses a discrete variable. Again, elementary cellular automata cover the simplest possible case where cells represent boolean variables: They’re either true
or false
, or , black or white.
As Wolfram notes, the ③ evolution of a cellular automaton takes place across discrete time steps. Since elementary cellular automata operate on a one-dimensional array, it’s convenient to draw these generations below each other, with the initial state of the array at the top and successive generations downwards.
Note that fitting an infinite array into memory is considered nontrivial, so by convention, cellular automata are simulated on a fixed-length array where the edges wrap around: The leftmost and rightmost cells of the image above, shown in a semitransparent manner, are in fact the same cell. As a result, the array takes the shape of a circle rather than a straight line (but my meager Inkscape skill keeps me from visualizing that properly).
What are the ④ rules, then, according to which all cells are updated in each generation? For any given elementary cellular automaton, there are eight deterministic rules, one for each possible configuration of a triplet of cells: the cell in question and its immediate neighbors to either side. Each rule maps its triplet to a new value of the center cell. Such a rule set is commonly specified visually:
In each evolution step of the cellular automaton, the new generation is determined by sliding a three-wide “window” across the array one step at a time, finding the matching rule (indicated by the three below), and accumulating what it maps to ():
The three cells at the top of the eight rules simply enumerate all possibly three-cell states. If you interpret the black squares as s and the white squares as s, you can also write the rules above as:
(Foreshadowing Alert: Doesn’t that look almost tabular?)
What becomes apparent here is that the left side of the “” operator is ordered based on interpreting the 3-tuples as 3-digit numbers in base 2: .
Given this order, the right side, read from top to bottom, can also be interpreted as a binary number: . That’s where the name of this specific cellular automaton, Rule 30, comes from! If we accept this way of formatting the rules as a convention, then that number, the so-called Wolfram code, completely defines the evolutionary behavior of the associated elementary cellular automaton. The SQL implementation will indeed require nothing3 but the Wolfram code as a parameter.
Before jumping into the database, let me point out that elementary cellular automata can be defined in various ways. Notably, coming up with a minimal boolean form that encapsulates the eight rules above, e.g.,
for Rule 30, or similarly, an algebraic form, e.g.,
both courtesy of Wolfram Alpha, would dispense with the need to check which of the eight separate rules matches a given neighborhood. However, deriving any such representation on the fly with SQL is a greater task than I’m willing to tackle in this post.
As to why I’ve chosen Rule 30 as the running example: There’s a number of as-of-yet-unanswered questions about it, and it seems to be the most well-known to boot.
Tick-tock, it’s SQL o’clock!
If you’ve skimmed the previous section, now’s the time4 to pay a bit more attention again.
In my mind, the most elegant way of going about this is to start with just the rule’s Wolfram code, say , and have the database take care of all the required conversions and simulations, with the end result being an ASCII art representation of multiple generations of the automaton.
Dividing that into easily conquerable sub-problems, I’ll be walking you through my process of writing a series of SQL statements5 that…
-
…read the Wolfram code from a
psql
variable and build up a lookupTABLE
that maps neighborhoods to next-generation values. -
…set up the array with an initial state, either randomly or with a single black cell in the middle. The array’s width and which initial state to set is specified via
psql
variables. This will also be aTABLE
, because what else would it be, we’re doing SQL, after all. -
…simulate the cellular automaton for a certain number of generations – this number, you’ve guessed it, must also be given in a
psql
variable. -
…draw the generations as ASCII art6 so we can verify that the previous step actually works as intended: For each cell of each generation, we’ll output either a space or a black rectangle.
Sound reasonable?
Initially, the .sql
file we’ll successively build up throughout this post (which is also included at the very bottom in a TL;DR section) should contain the following variable definitions:
\set rule 30
\set size 14
\set generations 7
\set random TRUE
If you’re following along in a psql
session, make sure to execute these commands before running any SQL snippets that reference the variables. And know that, since they’re psql
variables, you’ll indeed need to run the SQL statements in this post in psql
– if you prefer a different client, you might have to substitute the variables manually. Also, none of the statements modify the state of your database, you could probably get away with running them on your employer’s production cluster. But, maybe, don’t.
First up:
Building a lookup table for the specified rule
As foreshadowed in the cellular automata explainer above, this lookup table will encode the eight separate transition rules implied by the specified Wolfram code. Its schema reflects what we’re trying to do here: Storing a mapping from a binary triplet (the neighborhood) to a single bit (the next-generation result). Luckily, Postgres includes a bit
type7 intended for encoding bit strings.
CREATE TEMPORARY TABLE rule (
neighborhood bit(3), -- bit string of length 3
result bit -- just a single bit
);
Notice how this table is declared as TEMPORARY
– this way, we need not worry about cleaning up after ourselves, it’ll be gone8 once the psql
session is terminated. Omitting this keyword would require an additional TRUNCATE
step if you were to, say, adjust the variables and rerun the INSERT INTO
statement we’re about to formulate.
Populating this table solely based on the Wolfram code stored in the :rule
variable admittedly involves some trickery, but the query solving this part of the puzzle (whose result we can simply INSERT INTO
the rule
table at the end) relies on just a few “advanced” SQL concepts.
-
Crucially, to “rebase” the Wolfram code from its decimal form to a binary representation, we can have Postgres cast9 the integer value of
:rule
to a bit string of the desired length:SELECT :rule :: bit(8);
This yields
00011110
. -
But that’s still a string-like thing – it needs to be broken up into bits, each of which will constitute a value within the
result
column of therule
table. Casting our bit string to atext
value, then splitting that into characters (or, more accurately, single-character strings) using theregexp_split_to_array
function, then casting thetext[]
array back tobit[]
, gets us part of the way there:SELECT regexp_split_to_array(B'00011110' :: text, '') :: bit[];
The result of this chain of conversions is an array10
{0,0,0,1,1,1,1,0}
, which doesn’t seem like a whole lot of progress – we’ve turned a bit string into an array of bits, but we really want a table… -
…which is what the
unnest
function is for!SELECT * FROM unnest('{0,0,0,1,1,1,1,0}' :: bit[]);
This expands our array into a table-valued thing, which is why a
FROM
clause is now required.But we’ve lost something along the way: Since tables in SQL are conceptually unordered sets of rows – unless you employ an
ORDER BY
clause, any similarities between insertion order and output order are the result of implementation details that could change in a future version – it’s now impossible to reconstruct which bit is supposed to be the next generation for which neighborhood. This crucial info had been implicitly encoded by the positions of the bits in the array.Some means of bringing the order of the array along into the tabular representation is needed – and the
WITH ORDINALITY
clause, introduced in Postgres 9.4, does exactly that.SELECT * FROM unnest('{0,0,0,1,1,1,1,0}' :: bit[]) WITH ORDINALITY AS rule(result, neighborhood);
This query now returns an additional column, which I’ve taken the liberty of naming
neighborhood
, denoting the index of each element, starting at 1 and slotted in to the right of the single column created byunnest
. Execute this query and you’ll see:+--------+--------------+ | result | neighborhood | +--------+--------------+ | 0 | 1 | | 0 | 2 | | 0 | 3 | | 1 | 4 | | 1 | 5 | | 1 | 6 | | 1 | 7 | | 0 | 8 | +--------+--------------+
-
There’s not much work left – for one, we’ll need to convert the new
neighborhood
column from its present decimal representation to a three-element bit string. We’ll also switch the column order to match therule
table’s schema.SELECT neighborhood :: bit(3), result FROM ...
Is that it, are we done?
Not quite! Looking back at the binary notation of the eight rules in the cellular automata explainer earlier, let’s annotate the result of our query with what we’re expecting and compare:
+---------------------+--------+ | neighborhood | result | +---------------------+--------+ | 001 [should be 111] | 0 | | 010 [should be 110] | 0 | | 011 [should be 101] | 0 | | 100 [should be 100] | 1 | | 101 [should be 011] | 1 | | 110 [should be 010] | 1 | | 111 [should be 001] | 1 | | 000 [should be 000] | 0 | +---------------------+--------+
Uh-oh. After a minute of contemplation, it’s apparent that the
neighborhood
column is “counting” up instead of down and it’s offset by 1, which is due toWITH ORDINALITY
beginning its count at 1 instead of 0. Nothing some arithmetic can’t fix – before casting tobit(3)
, we’ll simply subtract 1 to remove the offset, then subtract that from 7 to invert the “counting” order:SELECT (7 - (neighborhood - 1)) :: bit(3), result FROM ...
So there we have it! Putting all of these components together, the following query correctly computes the lookup table for any elementary cellular automaton based solely on its Wolfram code, the result of which we can immediately INSERT INTO
the rule
table.
INSERT INTO rule
SELECT (7 - (neighborhood - 1)) :: bit(3) AS neighborhood, result
FROM unnest(regexp_split_to_array(:rule :: bit(8) :: text, '') :: bit[])
WITH ORDINALITY AS rule(result, neighborhood);
After executing that statement, you can inspect things via TABLE rule;
. For Rule 30, it should look like this:
+--------------+--------+
| neighborhood | result |
+--------------+--------+
| 111 | 0 |
| 110 | 0 |
| 101 | 0 |
| 100 | 1 |
| 011 | 1 |
| 010 | 1 |
| 001 | 1 |
| 000 | 0 |
+--------------+--------+
Generating the initial state
At this point, there’s a decision to make, namely how to represent the cellular automaton’s state throughout the simulation. This, after all, determines how we’ll encode the initial state.
Luckily, it’s not a difficult decision. In addition to each cell’s value, we must store an additional indexing column for the same reason we just had to use WITH ORDINALITY
– y’know, tables being unordered sets of rows and insertion order being meaningless. Hence:
CREATE TEMPORARY TABLE initial_state (
pos int,
value bit
);
The query we’ll write to fill the initial_state
table is a bit simpler that what was required to derive the lookup table.
We’re going to use generate_series(1, :size)
to generate a single-column table of ascending indexes, which we’ll call pos
in accordance with the table definition. Then we’re going to take advantage of SQL’s CASE WHEN
construct to – depending on the value of the boolean :random
variable – either generate a random value for each row or set all values to 0
, except for a single 1
for the index that’s located in the middle of the array.
INSERT INTO initial_state
SELECT pos, CASE
WHEN :random THEN round(random()) :: int -- random
ELSE CASE WHEN pos = :size / 2 + 1 THEN 1 ELSE 0 END -- single 1 in the middle
END :: bit
FROM generate_series(1, :size) AS pos;
Since the random
function returns a double
in the range instead of the desired bit, this query round
s it and then casts that to an integer (because Postgres doesn’t know how to cast the double
-valued result of the round
function to bit
). We could, in fact, skip the rounding step and cast to int
immediately – most programming languages round towards zero in this situation, but Postgres rounds to the closest integer, which is what we want. Explicitly rounding, however, communicates the intent more clearly, which beats brevity in my book.
Simulating the cellular automaton
Enough setup work, let’s get a-simulatin’!
Since elementary cellular automata are fundamentally iterative processes – each generation results from the application of the eighth rules to its precursor, rinse and repeat – a simple SELECT
-FROM
-WHERE
-style query isn’t powerful enough to get the job done. It takes heavier machinery – namely, recursive CTEs.
Recursion (or, in this case, effectively iteration) is a powerful and somewhat complex tool, which is why I won’t explain how the WITH RECURSIVE
construct works in SQL - that’s been done before by much more gifted teachers than me, and I’ve also previously taken a stab at it in my post about drawing the Sierpiński triangle with SQL.
So, assuming you’re now familiar with their general semantics, let’s start with some scaffolding.
WITH RECURSIVE ca(gen, pos, value) AS (
...
UNION ALL
...
)
SELECT json_agg(c) FROM ca c;
(The SELECT json_agg(c) FROM ca c
bit is just for debuggin’, it’ll be replaced later.)
Notice that the schema of the table we’ll successively build up is the one of initial_state
plus a gen
eration column. This will allow us to 1. tell the generations apart and 2. pretty-print them in the correct order later on. Since we’ll be incrementing gen
in each generation, we know that each row produced by the recursive part of the query will be necessarily distinct from any previous rows. This means there’s no need to try and filter out duplicate rows, so we’re telling Postgres to recurse with UNION ALL
semantics to avoid burning cycles on trying to maintain uniqueness.
As is commonly the case, the non-recursive term is trivial – we just “load” the initial state and call it the 0th generation:
WITH RECURSIVE ca(gen, pos, value) AS (
SELECT 0, *
FROM initial_state
UNION ALL
...
)
SELECT json_agg(c) FROM ca c;
The recursive term ...
will be significantly more complex – it’s going to be doing all the work, after all. Let’s think about what needs to happen here for each new generation:
-
The neighborhood of every cell of the previous-generation array must be extracted, i.e., a bit string containing the values of the left neighbor, the cell itself, and the right neighbor. At either end of the array, the neighborhood should wrap around the edge as described previously: the leftmost cell’s left neighbor should be the rightmost cell and vice versa.
-
Once that’s done, the new generation can be assembled through lookups in the
rule
table – for each cell, we’ll find the match for its just-computed neighborhood. (And, lest we forget,gen
needs incrementing in each generation, plus we should cease producing new rows oncegen
eclipses the:generations
variable, but that’s easy.)
Let’s tackle these two tasks in isolation, keeping recursion at the back of our minds but not bothering with it for now – to make things simpler to begin with, we’ll write a query the extracts the required neighborhoods from the initial state, then we’ll compute the next generation based on that, and finally we’ll go back and transfer our solution into the recursive CTE context.
Constructing the neighborhood for each cell – Take 1
There’s multiple ways we could go about this. If you aren’t familiar with window functions, you might come up with an implementation based on a three-way self-join11 where you’d declare three “copies” of initial_state
in the FROM
clause – one for each left neighbor, another for each cell in question, and a third for each right neighbor. Then, in the WHERE
clause, you’d handle the standard case and the two wrap-around cases separately, in each case doing some arithmetic to select only the direct neighbors of the central cell. Something like this:
SELECT c.pos, l.value || c.value || r.value AS neighborhood
FROM initial_state l, initial_state c, initial_state r
WHERE (l.pos = c.pos - 1 AND c.pos + 1 = r.pos) -- standard case
OR (c.pos = 1 AND l.pos = :size AND 2 = r.pos) -- for leftmost cell
OR (c.pos = :size AND l.pos = 1 AND :size - 1 = r.pos); -- for rightmost cell
(As you can see, the string concatenation operator ||
is conveniently overloaded for use with bit strings.)
This works – the query output is reproduced below – but the query is somewhat inelegant, has potential to slow things down due to the three-way join, and when you’d try to integrate this into the recursive CTE, Postgres would gently let you know that the “recursive reference to query ca
[which would take the place of initial_state
] must not appear more than once”. There are methods to work around this limitation, but they’d turn this solution even less elegant.
Let’s say the initial_state
table is populated according to the example in the “Cellular Automata 101” section…
+-----+-------+
| pos | value |
+-----+-------+
| 1 | 0 |
| 2 | 0 |
| 3 | 1 |
| 4 | 0 |
| 5 | 1 |
| 6 | 0 |
| 7 | 1 |
| 8 | 1 |
| 9 | 0 |
| 10 | 1 |
| 11 | 1 |
| 12 | 1 |
| 13 | 0 |
| 14 | 0 |
+-----+-------+
…then this query yields:
+-----+--------------+
| pos | neighborhood |
+-----+--------------+
| 2 | 001 |
| 14 | 000 |
| 3 | 010 |
| 4 | 101 |
| 5 | 010 |
| 6 | 101 |
| 7 | 011 |
| 8 | 110 |
| 9 | 101 |
| 10 | 011 |
| 11 | 111 |
| 12 | 110 |
| 13 | 100 |
| 1 | 000 |
+-----+--------------+
Go ahead and compare the constructed neighborhoods with the state – despite the order being a bit of a mess, the query’s result is indeed correct, but we can improve upon how it’s computed.
Constructing the neighborhood for each cell – Take 2
I’d say window functions are the way to go here: By taking advantage of them, we can assemble the neighborhoods entirely within the SELECT
clause. The following first stab at implementing a window-function-powered variant doesn’t take the (literal!) edge cases into account – we’ll fix that in a bit. For now, the neighborhoods of the cells at either end of the array will end up being NULL
.
SELECT pos,
lag(value) OVER (ORDER BY pos) || value || lead(value) OVER (ORDER BY pos) AS neighborhood
FROM initial_state;
This query produces12 the following output:
+-----+--------------+
| pos | neighborhood |
+-----+--------------+
| 1 | <NULL> |
| 2 | 001 |
| 3 | 010 |
| 4 | 101 |
| 5 | 010 |
| 6 | 101 |
| 7 | 011 |
| 8 | 110 |
| 9 | 101 |
| 10 | 011 |
| 11 | 111 |
| 12 | 110 |
| 13 | 100 |
| 14 | <NULL> |
+-----+--------------+
“What’s going on here,” you ask?
By the time the SELECT
clause is evaluated, the database system already knows which rows will feature in the result as long as the query doesn’t contain aggregate functions like sum
, count
, or json_agg
(which would lead to implicit grouping) or a number of other edge cases apply. In most queries, the SELECT
clause merely determines which columns (or values computed from the available columns) are included in the result. Window functions like lag
and lead
take advantage of this semi-known result set, allowing access to other rows within it given a window definition like ORDER BY pos
; they allow a query to peek at column values in a window of rows.
Postgres defines a whole family of window functions, some13 of which are rather specialized, and window definitions can be complex, nuanced and extremely powerful. Seriously, if you’re going to read their documentation from top to bottom, better brew a cup of strong coffee.
You’ll be glad to know that asking some cellular automaton cells “Won’t you be my neighbor?” doesn’t take us too far into the weeds:
-
The
lag(value)
window function returns thevalue
attribute of the preceding row according to theORDER
specified in the window definition.lead(value)
works analogously for the following row.For your ed(u|ifi)cation: There are other handy window functions such as
row_number
(self-explanatory),percent_rank
(essentially returning the position of the current row in the window as a percentage),first_value
(first value in the window),last_value
(analogous), and many others. Postgres extensions can supply their own window functions, as well. -
The
ORDER BY pos
window definition, uh, defines the ordering that our window functionslag
andlead
operate on. Omitting an explicit window frame definition, it’s a shorthand forORDER BY pos RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
, i.e., the defined window includes all rows starting from the first row of the accordingly-ordered result set until the current row.For our two candidates
lag
andlead
, the window frame definition doesn’t matter much; they can look beyond the frame aslead
indeed does in the query above: It extracts the value “after” the current row, which lies outside the window frame.For your continued ed(u|ifi)cation: In some cases, you can omit the
ORDER BY
part. The documentation notes that if you do so, “rows are processed in an unspecified order,” which can be a source of miscalculation if you’re not 100% sure the order is immaterial. Also, instead ofRANGE BETWEEN
, you can also specifyROWS BETWEEN
– there’s subtle differences between these two variants if theORDER BY
column contains duplicates, but since that’s not the case for us (if there’s two cells at the samepos
ition, something’s very wrong), we can disregard this difference.Make sure to read up on details like this if you’re planning to write complex queries taking advantage of window functions.
Having learned all this, I hope you now understand how the query shown earlier functions. But it’s not finished – we’ve yet to handle the two wrap-around cases at the edges!
Because lag
and lead
yield NULL
s here, we can turn to the good ol’ COALESCE
function to detect these edge cases and, if they occur for any given neighborhood, turn our gaze towards the other end of the “array” using the last_value
and first_value
window functions. Since they, unlike lag
and lead
, are constrained by the window frame definition RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
, we need to extend it to include the whole result set. No problem: The window definition can just as easily be ORDER BY pos RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
, that won’t impact lag
and lead
in the slightest.
Finally, since this all requires reusing an identical window definition four times, we might as well define it just once, for readability’s sake, via a WINDOW
clause at the bottom of the query.
SELECT pos,
COALESCE(lag(value) OVER w,
last_value(value) OVER w)
|| value
|| COALESCE(lead(value) OVER w,
first_value(value) OVER w)
AS neighborhood
FROM initial_state
WINDOW w AS (ORDER BY pos RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
For the initial state from earlier, this query yields
+-----+--------------+
| pos | neighborhood |
+-----+--------------+
| 1 | 000 |
| 2 | 001 |
| 3 | 010 |
| 4 | 101 |
| 5 | 010 |
| 6 | 101 |
| 7 | 011 |
| 8 | 110 |
| 9 | 101 |
| 10 | 011 |
| 11 | 111 |
| 12 | 110 |
| 13 | 100 |
| 14 | 000 |
+-----+--------------+
which, as it happens, is correct: This table matches the output of the triple-self-join variant (modulo row order, which could be remedied by adding ORDER BY pos
to both variants). But you don’t have to take my word for it, feel free to check and compare for yourself.
Phew! You’ll be glad to know that we’re over the hump now – the next part will be almost trivial in comparison. Let’s get it done, then there will be just an tiny bit more work required to merge things back into our recursive CTE.
Assembling the next generation though neighborhood lookups
Having extracted the bit(3)
-shaped neighborhood of each cell of the current generation, it’s time to assemble the next generation by querying the rule
table, matching neighborhoods and retrieving next-generation result
s during the process.
Here it is, barely any explanation required:
SELECT c.pos,
(SELECT r.result
FROM rule r
WHERE r.neighborhood = c.neighborhood)
FROM (...)
AS c(pos, neighborhood);
The ...
part stands for the neighborhood construction query we’ve worked so hard on – reproducing it here, in my mind, would distract from how little logic is actually required for the lookups.
The next generation, then, is:
+-----+--------+
| pos | result |
+-----+--------+
| 1 | 0 |
| 2 | 1 |
| 3 | 1 |
| 4 | 0 |
| 5 | 1 |
| 6 | 0 |
| 7 | 1 |
| 8 | 0 |
| 9 | 0 |
| 10 | 1 |
| 11 | 0 |
| 12 | 0 |
| 13 | 1 |
| 14 | 0 |
+-----+--------+
If you scroll back up to the cellular automata explainer at the beginning of this post, you’ll find that this output precisely matches the second generation shown there. Looks like we’re onto something!
Merging all this into the recursive CTE
Back to the soon-not-to-be-incomplete-anymore recursive CTE. We had previously left it in this state before wandering off to figure out the ...
part in isolation:
WITH RECURSIVE ca(gen, pos, value) AS (
SELECT 0, *
FROM initial_state
UNION ALL
...
)
SELECT json_agg(c) FROM ca c;
Meanwhile, the ...
part we came up with as a stand-alone query, computing the second generation only, looks like this:
SELECT c.pos,
(SELECT r.result
FROM rule r
WHERE r.neighborhood = c.neighborhood)
FROM (SELECT pos,
COALESCE(lag(value) OVER w,
last_value(value) OVER w)
|| value
|| COALESCE(lead(value) OVER w,
first_value(value) OVER w)
AS neighborhood
FROM initial_state
WINDOW w AS (ORDER BY pos RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING))
AS c(pos, neighborhood);
We can’t just paste this in – there’s a few knobs and toggles we’ll need to adjust:
-
In the neighborhood construction subquery, we obviously have to reference
ca
instead ofinitial_state
– each generation should be computed based on the directly previous one, not the very first one. -
At the same time, we shouldn’t forget to pass the
gen
attribute ofca
along to the outer query, where we can then increment it as part of theSELECT
clause, thereby marking the newly computed cells as belonging to the next generation. -
Finally, to avoid having the query run forever (I hear that’s bad for performance metrics), we require a termination condition: Once
gen > :generations
, the recursive part of the CTE should stop producing new rows, which terminates the process. Flipping the comparison operator, we should only produce new rowsWHERE gen < :generations
.If you’re new to
WITH RECURSIVE
, let me point something out to stave off confusion: Since the database system permits access only to rows ofca
that have been freshly added in the previous recursion step (i.e., generation), there’s no need to explicitly ignore previous generations.
That’s it? That’s indeed, finally, it:
WITH RECURSIVE ca(gen, pos, value) AS (
SELECT 0, *
FROM initial_state
UNION ALL
SELECT c.gen + 1,
c.pos,
(SELECT r.result
FROM rule r
WHERE r.neighborhood = c.neighborhood)
FROM (SELECT gen,
pos,
COALESCE(lag(value) OVER w,
last_value(value) OVER w)
|| value
|| COALESCE(lead(value) OVER w,
first_value(value) OVER w)
AS neighborhood
FROM ca
WHERE gen < :generations
WINDOW w AS (ORDER BY pos RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
)
AS c(gen, pos, neighborhood)
)
SELECT json_agg(c) FROM ca c;
Upon running this query, we’re presented with a JSON value that enables us to kinda-sorta-verify the result’s correctness even after many generations. While this would be great for shipping off to some kind of frontend (if we’d built one), it’s not exactly pretty:
+-----------------------------------+
| json_agg |
+-----------------------------------+
| [{"gen":0,"pos":1,"value":"0"}, +|
| {"gen":0,"pos":2,"value":"0"}, +|
| {"gen":0,"pos":3,"value":"0"}, +|
| {"gen":0,"pos":4,"value":"0"}, +|
| {"gen":0,"pos":5,"value":"0"}, +|
| {"gen":0,"pos":6,"value":"1"}, +|
| {"gen":0,"pos":7,"value":"1"}, +|
| {"gen":0,"pos":8,"value":"0"}, +|
| {"gen":0,"pos":9,"value":"1"}, +|
| {"gen":0,"pos":10,"value":"1"}, +|
| {"gen":0,"pos":11,"value":"0"}, +|
| {"gen":0,"pos":12,"value":"0"}, +|
| {"gen":0,"pos":13,"value":"1"}, +|
| {"gen":0,"pos":14,"value":"0"}, +|
| {"gen":1,"pos":1,"value":"0"}, +|
| {"gen":1,"pos":2,"value":"1"}, +|
| {"gen":1,"pos":3,"value":"1"}, +|
| {"gen":1,"pos":4,"value":"0"}, +|
| {"gen":1,"pos":5,"value":"0"}, +|
| {"gen":1,"pos":6,"value":"0"}, +|
| {"gen":1,"pos":7,"value":"1"}, +|
┆ ┆
| {"gen":7,"pos":8,"value":"1"}, +|
| {"gen":7,"pos":9,"value":"1"}, +|
| {"gen":7,"pos":10,"value":"1"}, +|
| {"gen":7,"pos":11,"value":"0"}, +|
| {"gen":7,"pos":12,"value":"1"}, +|
| {"gen":7,"pos":13,"value":"0"}, +|
| {"gen":7,"pos":14,"value":"1"}] |
+-----------------------------------+
Let’s fix that before calling it quits.
Pretty-printing
Alright, now all the hard work’s done, but we’d really like something pretty to feast our eyes upon. So instead of SELECT json_agg(c) FROM ca c
, we can GROUP BY
the gen
eration and output either a space or a black rectangle for each cell, then string_agg
the whole shebang…
...
SELECT gen,
string_agg(CASE WHEN value = 0 :: bit THEN ' ' ELSE '█' END, '' ORDER BY pos) AS state
FROM ca
GROUP BY gen
ORDER BY gen;
…et14 voila:
+-----+----------------+
| gen | state |
+-----+----------------+
| 0 | █ █ ██ ███ |
| 1 | ██ █ █ █ █ |
| 2 | ██ █ ████████ |
| 3 | ███ █ |
| 4 | ██ ██ |
| 5 | ██ █ ██ █ |
| 6 | █ █ █ ██ █ |
| 7 | ███ ████ █ ██ |
+-----+----------------+
Neat, right? Now go back to the beginning and try a different rule, or a wider array, or more generations, or something else entirely – three examples ready for pastin’ into psql
:
\set rule 105
\set size 60
\set generations 30
\set random TRUE
\set rule 75
\set size 60
\set generations 30
\set random FALSE
\set rule 60
\set size 60
\set generations 30
\set random TRUE
Rest assured that despite the ridiculousness of doing all within a database and while neither defining any indexes not attempting to optimize the queries, this implementation is reasonably performant – on my aging 2015-era laptop, the recursive CTE seems to terminate within less than a second for
or, in human terms, anything that’s gonna fit on your screen will be generated in the blink of an eye.
I’m sure a well-optimized C implementation could easily handle millions and millions of cells per second, but where’s the fun in that?
TL;DR
Here’s the completed, only-the-good-bits SQL code – feel free to use it under the terms of the MIT License.
\set rule 30
\set size 14
\set generations 7
\set random TRUE
CREATE TEMPORARY TABLE rule (
neighborhood bit(3), -- bit string of length 3
result bit -- just a single bit
);
INSERT INTO rule
SELECT (7 - (neighborhood - 1)) :: bit(3) AS neighborhood, result
FROM unnest(regexp_split_to_array(:rule :: bit(8) :: text, '') :: bit[])
WITH ORDINALITY AS rule(result, neighborhood);
CREATE TEMPORARY TABLE initial_state (
pos int,
value bit
);
INSERT INTO initial_state
SELECT pos, CASE
WHEN :random THEN round(random()) :: int -- random
ELSE CASE WHEN pos = :size / 2 + 1 THEN 1 ELSE 0 END -- single 1 in the middle
END :: bit
FROM generate_series(1, :size) AS pos;
WITH RECURSIVE ca(gen, pos, value) AS (
SELECT 0, *
FROM initial_state
UNION ALL
SELECT c.gen + 1,
c.pos,
(SELECT r.result
FROM rule r
WHERE r.neighborhood = c.neighborhood)
FROM (SELECT gen,
pos,
COALESCE(lag(value) OVER w,
last_value(value) OVER w)
|| value
|| COALESCE(lead(value) OVER w,
first_value(value) OVER w)
AS neighborhood
FROM ca
WHERE gen < :generations
WINDOW w AS (ORDER BY pos RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
)
AS c(gen, pos, neighborhood)
)
SELECT gen,
string_agg(CASE WHEN value = 0 :: bit THEN ' ' ELSE '█' END, '' ORDER BY pos) AS state
FROM ca
GROUP BY gen
ORDER BY gen;
-
Note that a small number of PostgreSQL-specific features will be used, however due to SQL’s Turing completeness, everything we do here is theoretically possible in any standards-compliant RDBMS. Any recent version of Postgres should work – I’ve tested the code presented in this blog post using Postgres 10.17. ↩
-
More recently, I’ve taken the poster generation code, rejiggered its color scheme selection part, and turned it into a Twitter bot – feel free to follow @sundryautomata. ↩
-
And the length of the array, and how many generations should be simulated, and what kind of initial state to set – but that’s all window dressing and doesn’t determine the inherent properties of the cellular automaton. ↩
-
One might say that it’s SQL time. ↩
-
Note that my implementation isn’t golfed at all – it could be shortened significantly, but I’ve gone for comprehensibility over making myself feel smart. Neither is it optimized for performance, mind you. ↩
-
It’s also conceivable to generate an SVG image similar to those in the “Cellular automata 101” section above, but I’ve done something like that for a previous blog post when I drew the Sierpiński Triangle with recursive SQL, so there’s no novelty in this for me. But for you, there might be – sounds like a good thing to leave as an exercise to the reader, then! ↩
-
Along with a selection of handy bitwise operators – we won’t need them here, but they’re important tools to know about. ↩
-
Gone in the sense of “not cluttering the database” rather than “incognito mode”, of course. ↩
-
We can count ourselves lucky that Postgres just does this conversion for us – otherwise, we’d need to formulate a recursive CTE or a stored procedure that could do the job. That could actually be a fun exercise! ↩
-
The design choice to denote arrays with curly braces will continue to confound me to no end until the day I die. ↩
-
Sounds oddly more lonely than a standard self-join, doesn’t it? ↩
-
I’ve configured
psql
to indicateNULL
s with<NULL>
instead of blank table cells by adding the following line to my~/.psqlrc
:\pset null <NULL>
The
\pset null
option is only the tip of the iceberg – there’s a whole host of useful configuration commands. For a complete list, search for “Adjustable printing options are:” in thepsql
documentation. Some favorites:-
You can configure
psql
to maintain a separate command history for each database.\set HISTFILE ~/.psql_history_ :DBNAME
This can be very handy when trying to retrace your steps a few weeks after the fact. If you’re experiencing déjà vu right now, that might be because, as noted in the documentation, “[t]his feature was shamelessly plagiarized from Bash.” Other Bash-like options like
HISTCONTROL
,HISTSIZE
, andIGNOREEOF
exist, as well. -
You might prefer prettier table borders using Unicode box drawing characters.
\pset border 2 \pset linestyle unicode \pset unicode_header_linestyle double
(I used only the first of the three options for the examples in this post.)
-
Finally:
\timing on
This compels
psql
to tell you how much time each command has taken. This can help you to catch slow queries in a size-constrained testing environment.
…and that barely scratches the surface –
psql
is highly configurable indeed. ↩ -
-
It’s really an extended family – you can also use aggregate functions in the place of window functions, although the semantics might feel somewhat unusual at first: The documentation notes that “[a]n aggregate used with
ORDER BY
and the default window frame definition produces a “running sum” type of behavior.” You can observe this when usingarray_agg
as a window function, for instance. ↩ -
Of course, you won’t see quite the same unless your system happens to pick the very same initial state from possible variations – use
INSERT INTO initial_state SELECT pos, value FROM unnest(regexp_split_to_array('00101011011100', '') :: bit[]) WITH ORDINALITY AS initial_state(value, pos);
instead of the query described earlier to start off with the same state as the examples throughout this post. ↩