Simulating Rule 30 With Recursive SQL and Window Functions

Databases are commonly used as dumb storage bins for CRUD application data. Such narrow use cases don’t do them justice, however: Relational database systems such as PostgreSQL are much more capable and versatile than that, supporting a wide range of operations across many different data types.

The standard way of interfacing with Postgres is SQL – you know, that language you may have briefly learned in webdev class where you SELECT stuff FROM a set of tables WHERE some condition is met. But that’s a far cry from what you can achieve when taking advantage of the full feature set of this declarative and – surprise! – Turing complete programming language.

In this post, I’ll demonstrate¹ how to use the WITH RECURSIVE construct and some basic window functions (neither of which you need any experience with to follow along – knowledge of basic SQL concepts should suffice) to implement the evolution of elementary cellular automata.

This post is loosely based on what I learned in the Advanced SQL lecture held by Torsten Grust in the summer of 2017 at the University of Tübingen. Take a look at the lecture slides for in-depth explanations and a wide array of examples. As a rare positive side-effect of the COVID-19 pandemic, an updated iteration of the course has been released on YouTube.

Cellular automata 101

(Know all about cellular automata? Click to skip straight to the SQL.)

If you haven’t encountered cellular automata before, try not to fall too deep down the rabbit hole. If you do, you might end up helping millions of undergrad students cheat on their math assignments – that’s what Stephen Wolfram did, after all, by creating Wolfram Alpha. Previously, in the early and mid-80s, he had been heavily involved in research of cellular automata. One might conclude that he’s a bit obsessed with them.

With good reason, to be fair: Cellular automata can be used as analogues for certain kinds of complex systems – in fact, they’re one of the prime synthetic examples of complex behavior emerging from simple rules. Even elementary cellular automata, the simplest kind, exhibit very interesting properties: Rule 110, which works by exactly the same mechanism as Rule 30 but with a different rule set, has been proven capable of supporting universal computation.

Apart from their multitude of scientific applications, cellular automata make for pretty images – a while ago, I’ve written a cellular automata poster generator which these visualizations of Rule 30, Rule 106, and Rule 73 have been created² with:

Rule 30 even features on the exterior cladding of a train station in Cambridge. All that ought to be sufficient motivation for playing with cellular automata using SQL!

Quoting from Wolfram’s first published paper on the concept, “Statistical mechanics of cellular automata”, this is how he defines them (emphasis and circly numbery annotations mine):

Cellular automata are mathematical idealizations of physical systems in which space and time are discrete, and physical quantities take on a finite set of discrete values. A cellular automaton consists of a ① regular uniform lattice (or “array”), usually infinite in extent, with a ② discrete variable at each site (“cell”). The state of a cellular automaton is completely specified by the values of the variables at each site. A cellular automaton ③ evolves in discrete time steps, with the value of the variable at one site being affected by the values of variables at sites in its “neighborhood” on the previous time step. The neighborhood of a site is typically taken to be the site itself and all immediately adjacent sites. The variables at each site are ④ updated simultaneously (“synchronously”), based on the values of the variables in their neighborhood at the preceding time step, and according to a definite set of “local rules.”

Let’s translate this prose into a few graphics!

For elementary cellular automata, the ① array can be imagined as an one-dimensional row of square cells (other flavors of cellular automata may use, say, 2D grids).

Each ② cell (hence, cellular automata) of the array houses a discrete variable. Again, elementary cellular automata cover the simplest possible case where cells represent boolean variables: They’re either true or false, $1$ or $0$ , black or white.

As Wolfram notes, the ③ evolution of a cellular automaton takes place across discrete time steps. Since elementary cellular automata operate on a one-dimensional array, it’s convenient to draw these generations below each other, with the initial state of the array at the top and successive generations downwards.

Note that fitting an infinite array into memory is considered nontrivial, so by convention, cellular automata are simulated on a fixed-length array where the edges wrap around: The leftmost and rightmost cells of the image above, shown in a semitransparent manner, are in fact the same cell. As a result, the array takes the shape of a circle rather than a straight line (but my meager Inkscape skill keeps me from visualizing that properly).

What are the ④ rules, then, according to which all cells are updated in each generation? For any given elementary cellular automaton, there are eight deterministic rules, one for each possible configuration of a triplet of cells: the cell in question and its immediate neighbors to either side. Each rule maps its triplet to a new value of the center cell. Such a rule set is commonly specified visually:

In each evolution step of the cellular automaton, the new generation is determined by sliding a three-wide “window” across the array one step at a time, finding the matching rule (indicated by the three $=$ below), and accumulating what it maps to ( $\mapsto$ ):

The three cells at the top of the eight rules simply enumerate all possibly three-cell states. If you interpret the black squares as $1$ s and the white squares as $0$ s, you can also write the rules above as:

(1,1,1) \mapsto 0\\ (1,1,0) \mapsto 0\\ (1,0,1) \mapsto 0\\ (1,0,0) \mapsto 1\\ (0,1,1) \mapsto 1\\ (0,1,0) \mapsto 1\\ (0,0,1) \mapsto 1\\ (0,0,0) \mapsto 0\\

(Foreshadowing Alert: Doesn’t that look almost tabular?)

What becomes apparent here is that the left side of the “ $\mapsto$ ” operator is ordered based on interpreting the 3-tuples as 3-digit numbers in base 2: $111_2 = 7, 110_2 = 6, \dots, 000_2 = 0$ .

Given this order, the right side, read from top to bottom, can also be interpreted as a binary number: $00011110_2 = 30$ . That’s where the name of this specific cellular automaton, Rule 30, comes from! If we accept this way of formatting the rules as a convention, then that number, the so-called Wolfram code, completely defines the evolutionary behavior of the associated elementary cellular automaton. The SQL implementation will indeed require nothing³ but the Wolfram code as a parameter.

Before jumping into the database, let me point out that elementary cellular automata can be defined in various ways. Notably, coming up with a minimal boolean form that encapsulates the eight rules above, e.g.,

(p,q,r) \mapsto p \text{\scriptsize{ XOR }} (q \text{\scriptsize{ OR }} r),

for Rule 30, or similarly, an algebraic form, e.g.,

(p,q,r) \mapsto (p + q + r + q \times r) \bmod 2,

both courtesy of Wolfram Alpha, would dispense with the need to check which of the eight separate rules matches a given neighborhood. However, deriving any such representation on the fly with SQL is a greater task than I’m willing to tackle in this post.

As to why I’ve chosen Rule 30 as the running example: There’s a number of as-of-yet-unanswered questions about it, and it seems to be the most well-known to boot.

Tick-tock, it’s SQL o’clock!

If you’ve skimmed the previous section, now’s the time⁴ to pay a bit more attention again.

In my mind, the most elegant way of going about this is to start with just the rule’s Wolfram code, say $30$ , and have the database take care of all the required conversions and simulations, with the end result being an ASCII art representation of multiple generations of the automaton.

Dividing that into easily conquerable sub-problems, I’ll be walking you through my process of writing a series of SQL statements⁵ that…

…read the Wolfram code from a psql variable and build up a lookup TABLE that maps neighborhoods to next-generation values.
…set up the array with an initial state, either randomly or with a single black cell in the middle. The array’s width and which initial state to set is specified via psql variables. This will also be a TABLE, because what else would it be, we’re doing SQL, after all.
…simulate the cellular automaton for a certain number of generations – this number, you’ve guessed it, must also be given in a psql variable.
…draw the generations as ASCII art⁶ so we can verify that the previous step actually works as intended: For each cell of each generation, we’ll output either a space or a black rectangle.

Sound reasonable?

Initially, the .sql file we’ll successively build up throughout this post (which is also included at the very bottom in a TL;DR section) should contain the following variable definitions:

\set rule 30
\set size 14
\set generations 7
\set random TRUE

If you’re following along in a psql session, make sure to execute these commands before running any SQL snippets that reference the variables. And know that, since they’re psql variables, you’ll indeed need to run the SQL statements in this post in psql – if you prefer a different client, you might have to substitute the variables manually. Also, none of the statements modify the state of your database, you could probably get away with running them on your employer’s production cluster. But, maybe, don’t.

First up:

Building a lookup table for the specified rule

As foreshadowed in the cellular automata explainer above, this lookup table will encode the eight separate transition rules implied by the specified Wolfram code. Its schema reflects what we’re trying to do here: Storing a mapping from a binary triplet (the neighborhood) to a single bit (the next-generation result). Luckily, Postgres includes a bit type⁷ intended for encoding bit strings.

CREATE TEMPORARY TABLE rule (
  neighborhood bit(3),  -- bit string of length 3
  result bit            -- just a single bit
);

Notice how this table is declared as TEMPORARY – this way, we need not worry about cleaning up after ourselves, it’ll be gone⁸ once the psql session is terminated. Omitting this keyword would require an additional TRUNCATE step if you were to, say, adjust the variables and rerun the INSERT INTO statement we’re about to formulate.

Populating this table solely based on the Wolfram code stored in the :rule variable admittedly involves some trickery, but the query solving this part of the puzzle (whose result we can simply INSERT INTO the rule table at the end) relies on just a few “advanced” SQL concepts.

Crucially, to “rebase” the Wolfram code from its decimal form to a binary representation, we can have Postgres cast⁹ the integer value of :rule to a bit string of the desired length:
```
SELECT :rule :: bit(8);
```
This yields 00011110.
But that’s still a string-like thing – it needs to be broken up into bits, each of which will constitute a value within the result column of the rule table. Casting our bit string to a text value, then splitting that into characters (or, more accurately, single-character strings) using the regexp_split_to_array function, then casting the text[] array back to bit[], gets us part of the way there:
```
SELECT regexp_split_to_array(B'00011110' :: text, '') :: bit[];
```
The result of this chain of conversions is an array¹⁰ {0,0,0,1,1,1,1,0}, which doesn’t seem like a whole lot of progress – we’ve turned a bit string into an array of bits, but we really want a table…
…which is what the unnest function is for!
```
SELECT *
FROM   unnest('{0,0,0,1,1,1,1,0}' :: bit[]);
```
This expands our array into a table-valued thing, which is why a FROM clause is now required.

But we’ve lost something along the way: Since tables in SQL are conceptually unordered sets of rows – unless you employ an ORDER BY clause, any similarities between insertion order and output order are the result of implementation details that could change in a future version – it’s now impossible to reconstruct which bit is supposed to be the next generation for which neighborhood. This crucial info had been implicitly encoded by the positions of the bits in the array.

Some means of bringing the order of the array along into the tabular representation is needed – and the WITH ORDINALITY clause, introduced in Postgres 9.4, does exactly that.
```
SELECT *
FROM   unnest('{0,0,0,1,1,1,1,0}' :: bit[])
       WITH ORDINALITY AS rule(result, neighborhood);
```
This query now returns an additional column, which I’ve taken the liberty of naming neighborhood, denoting the index of each element, starting at 1 and slotted in to the right of the single column created by unnest. Execute this query and you’ll see:
```
+--------+--------------+
| result | neighborhood |
+--------+--------------+
| 0      |            1 |
| 0      |            2 |
| 0      |            3 |
| 1      |            4 |
| 1      |            5 |
| 1      |            6 |
| 1      |            7 |
| 0      |            8 |
+--------+--------------+
```
There’s not much work left – for one, we’ll need to convert the new neighborhood column from its present decimal representation to a three-element bit string. We’ll also switch the column order to match the rule table’s schema.
```
SELECT neighborhood :: bit(3), result
FROM   ...
```
Is that it, are we done?

Not quite! Looking back at the binary notation of the eight rules in the cellular automata explainer earlier, let’s annotate the result of our query with what we’re expecting and compare:
```
+---------------------+--------+
| neighborhood        | result |
+---------------------+--------+
| 001 [should be 111] | 0      |
| 010 [should be 110] | 0      |
| 011 [should be 101] | 0      |
| 100 [should be 100] | 1      |
| 101 [should be 011] | 1      |
| 110 [should be 010] | 1      |
| 111 [should be 001] | 1      |
| 000 [should be 000] | 0      |
+---------------------+--------+
```
Uh-oh. After a minute of contemplation, it’s apparent that the neighborhood column is “counting” up instead of down and it’s offset by 1, which is due to WITH ORDINALITY beginning its count at 1 instead of 0. Nothing some arithmetic can’t fix – before casting to bit(3), we’ll simply subtract 1 to remove the offset, then subtract that from 7 to invert the “counting” order:
```
SELECT (7 - (neighborhood - 1)) :: bit(3), result
FROM   ...
```

So there we have it! Putting all of these components together, the following query correctly computes the lookup table for any elementary cellular automaton based solely on its Wolfram code, the result of which we can immediately INSERT INTO the rule table.

INSERT INTO rule
  SELECT (7 - (neighborhood - 1)) :: bit(3) AS neighborhood, result
  FROM   unnest(regexp_split_to_array(:rule :: bit(8) :: text, '') :: bit[])
         WITH ORDINALITY AS rule(result, neighborhood);

After executing that statement, you can inspect things via TABLE rule;. For Rule 30, it should look like this:

+--------------+--------+
| neighborhood | result |
+--------------+--------+
| 111          | 0      |
| 110          | 0      |
| 101          | 0      |
| 100          | 1      |
| 011          | 1      |
| 010          | 1      |
| 001          | 1      |
| 000          | 0      |
+--------------+--------+

Generating the initial state

At this point, there’s a decision to make, namely how to represent the cellular automaton’s state throughout the simulation. This, after all, determines how we’ll encode the initial state.

Luckily, it’s not a difficult decision. In addition to each cell’s value, we must store an additional indexing column for the same reason we just had to use WITH ORDINALITY – y’know, tables being unordered sets of rows and insertion order being meaningless. Hence:

CREATE TEMPORARY TABLE initial_state (
  pos int,
  value bit
);

The query we’ll write to fill the initial_state table is a bit simpler that what was required to derive the lookup table.

We’re going to use generate_series(1, :size) to generate a single-column table of ascending indexes, which we’ll call pos in accordance with the table definition. Then we’re going to take advantage of SQL’s CASE WHEN construct to – depending on the value of the boolean :random variable – either generate a random value for each row or set all values to 0, except for a single 1 for the index that’s located in the middle of the array.

INSERT INTO initial_state
  SELECT pos, CASE
                WHEN :random THEN round(random()) :: int              -- random
                ELSE CASE WHEN pos = :size / 2 + 1 THEN 1 ELSE 0 END  -- single 1 in the middle
              END :: bit
  FROM   generate_series(1, :size) AS pos;

Since the random function returns a double in the range $0.0 \leq x < 1.0$ instead of the desired bit, this query rounds it and then casts that to an integer (because Postgres doesn’t know how to cast the double-valued result of the round function to bit). We could, in fact, skip the rounding step and cast to int immediately – most programming languages round towards zero in this situation, but Postgres rounds to the closest integer, which is what we want. Explicitly rounding, however, communicates the intent more clearly, which beats brevity in my book.

Simulating the cellular automaton

Enough setup work, let’s get a-simulatin’!

Since elementary cellular automata are fundamentally iterative processes – each generation results from the application of the eighth rules to its precursor, rinse and repeat – a simple SELECT-FROM-WHERE-style query isn’t powerful enough to get the job done. It takes heavier machinery – namely, recursive CTEs.

Recursion (or, in this case, effectively iteration) is a powerful and somewhat complex tool, which is why I won’t explain how the WITH RECURSIVE construct works in SQL - that’s been done before by much more gifted teachers than me, and I’ve also previously taken a stab at it in my post about drawing the Sierpiński triangle with SQL.

So, assuming you’re now familiar with their general semantics, let’s start with some scaffolding.

WITH RECURSIVE ca(gen, pos, value) AS (
  ...

    UNION ALL

  ...
)
SELECT json_agg(c) FROM ca c;

(The SELECT json_agg(c) FROM ca c bit is just for debuggin’, it’ll be replaced later.)

Notice that the schema of the table we’ll successively build up is the one of initial_state plus a generation column. This will allow us to 1. tell the generations apart and 2. pretty-print them in the correct order later on. Since we’ll be incrementing gen in each generation, we know that each row produced by the recursive part of the query will be necessarily distinct from any previous rows. This means there’s no need to try and filter out duplicate rows, so we’re telling Postgres to recurse with UNION ALL semantics to avoid burning cycles on trying to maintain uniqueness.

As is commonly the case, the non-recursive term is trivial – we just “load” the initial state and call it the 0^th generation:

WITH RECURSIVE ca(gen, pos, value) AS (
  SELECT 0, *
  FROM   initial_state

    UNION ALL

  ...
)
SELECT json_agg(c) FROM ca c;

The recursive term ... will be significantly more complex – it’s going to be doing all the work, after all. Let’s think about what needs to happen here for each new generation:

The neighborhood of every cell of the previous-generation array must be extracted, i.e., a bit string containing the values of the left neighbor, the cell itself, and the right neighbor. At either end of the array, the neighborhood should wrap around the edge as described previously: the leftmost cell’s left neighbor should be the rightmost cell and vice versa.
Once that’s done, the new generation can be assembled through lookups in the rule table – for each cell, we’ll find the match for its just-computed neighborhood. (And, lest we forget, gen needs incrementing in each generation, plus we should cease producing new rows once gen eclipses the :generations variable, but that’s easy.)

Let’s tackle these two tasks in isolation, keeping recursion at the back of our minds but not bothering with it for now – to make things simpler to begin with, we’ll write a query the extracts the required neighborhoods from the initial state, then we’ll compute the next generation based on that, and finally we’ll go back and transfer our solution into the recursive CTE context.

Constructing the neighborhood for each cell – Take 1

There’s multiple ways we could go about this. If you aren’t familiar with window functions, you might come up with an implementation based on a three-way self-join¹¹ where you’d declare three “copies” of initial_state in the FROM clause – one for each left neighbor, another for each cell in question, and a third for each right neighbor. Then, in the WHERE clause, you’d handle the standard case and the two wrap-around cases separately, in each case doing some arithmetic to select only the direct neighbors of the central cell. Something like this:

SELECT c.pos, l.value || c.value || r.value AS neighborhood
FROM   initial_state l, initial_state c, initial_state r
WHERE  (l.pos = c.pos - 1 AND c.pos + 1 = r.pos)                -- standard case
       OR (c.pos = 1 AND l.pos = :size AND 2 = r.pos)           -- for leftmost cell
       OR (c.pos = :size AND l.pos = 1 AND :size - 1 = r.pos);  -- for rightmost cell

(As you can see, the string concatenation operator || is conveniently overloaded for use with bit strings.)

This works – the query output is reproduced below – but the query is somewhat inelegant, has potential to slow things down due to the three-way join, and when you’d try to integrate this into the recursive CTE, Postgres would gently let you know that the “recursive reference to query ca [which would take the place of initial_state] must not appear more than once”. There are methods to work around this limitation, but they’d turn this solution even less elegant.

Let’s say the initial_state table is populated according to the example in the “Cellular Automata 101” section…

+-----+-------+
| pos | value |
+-----+-------+
|   1 | 0     |
|   2 | 0     |
|   3 | 1     |
|   4 | 0     |
|   5 | 1     |
|   6 | 0     |
|   7 | 1     |
|   8 | 1     |
|   9 | 0     |
|  10 | 1     |
|  11 | 1     |
|  12 | 1     |
|  13 | 0     |
|  14 | 0     |
+-----+-------+

…then this query yields:

+-----+--------------+
| pos | neighborhood |
+-----+--------------+
|   2 | 001          |
|  14 | 000          |
|   3 | 010          |
|   4 | 101          |
|   5 | 010          |
|   6 | 101          |
|   7 | 011          |
|   8 | 110          |
|   9 | 101          |
|  10 | 011          |
|  11 | 111          |
|  12 | 110          |
|  13 | 100          |
|   1 | 000          |
+-----+--------------+

Go ahead and compare the constructed neighborhoods with the state – despite the order being a bit of a mess, the query’s result is indeed correct, but we can improve upon how it’s computed.

Constructing the neighborhood for each cell – Take 2

I’d say window functions are the way to go here: By taking advantage of them, we can assemble the neighborhoods entirely within the SELECT clause. The following first stab at implementing a window-function-powered variant doesn’t take the (literal!) edge cases into account – we’ll fix that in a bit. For now, the neighborhoods of the cells at either end of the array will end up being NULL.

SELECT pos,
       lag(value) OVER (ORDER BY pos) || value || lead(value) OVER (ORDER BY pos) AS neighborhood
FROM   initial_state;

This query produces¹² the following output:

+-----+--------------+
| pos | neighborhood |
+-----+--------------+
|   1 | <NULL>       |
|   2 | 001          |
|   3 | 010          |
|   4 | 101          |
|   5 | 010          |
|   6 | 101          |
|   7 | 011          |
|   8 | 110          |
|   9 | 101          |
|  10 | 011          |
|  11 | 111          |
|  12 | 110          |
|  13 | 100          |
|  14 | <NULL>       |
+-----+--------------+

“What’s going on here,” you ask?

By the time the SELECT clause is evaluated, the database system already knows which rows will feature in the result as long as the query doesn’t contain aggregate functions like sum, count, or json_agg (which would lead to implicit grouping) or a number of other edge cases apply. In most queries, the SELECT clause merely determines which columns (or values computed from the available columns) are included in the result. Window functions like lag and lead take advantage of this semi-known result set, allowing access to other rows within it given a window definition like ORDER BY pos; they allow a query to peek at column values in a window of rows.

Postgres defines a whole family of window functions, some¹³ of which are rather specialized, and window definitions can be complex, nuanced and extremely powerful. Seriously, if you’re going to read their documentation from top to bottom, better brew a cup of strong coffee.

You’ll be glad to know that asking some cellular automaton cells “Won’t you be my neighbor?” doesn’t take us too far into the weeds:

The lag(value) window function returns the value attribute of the preceding row according to the ORDER specified in the window definition. lead(value) works analogously for the following row.

For your ed(u|ifi)cation: There are other handy window functions such as row_number (self-explanatory), percent_rank (essentially returning the position of the current row in the window as a percentage), first_value (first value in the window), last_value (analogous), and many others. Postgres extensions can supply their own window functions, as well.
The ORDER BY pos window definition, uh, defines the ordering that our window functions lag and lead operate on. Omitting an explicit window frame definition, it’s a shorthand for ORDER BY pos RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, i.e., the defined window includes all rows starting from the first row of the accordingly-ordered result set until the current row.

For our two candidates lag and lead, the window frame definition doesn’t matter much; they can look beyond the frame as lead indeed does in the query above: It extracts the value “after” the current row, which lies outside the window frame.

For your continued ed(u|ifi)cation: In some cases, you can omit the ORDER BY part. The documentation notes that if you do so, “rows are processed in an unspecified order,” which can be a source of miscalculation if you’re not 100% sure the order is immaterial. Also, instead of RANGE BETWEEN, you can also specify ROWS BETWEEN – there’s subtle differences between these two variants if the ORDER BY column contains duplicates, but since that’s not the case for us (if there’s two cells at the same position, something’s very wrong), we can disregard this difference.

Make sure to read up on details like this if you’re planning to write complex queries taking advantage of window functions.

Having learned all this, I hope you now understand how the query shown earlier functions. But it’s not finished – we’ve yet to handle the two wrap-around cases at the edges!

Because lag and lead yield NULLs here, we can turn to the good ol’ COALESCE function to detect these edge cases and, if they occur for any given neighborhood, turn our gaze towards the other end of the “array” using the last_value and first_value window functions. Since they, unlike lag and lead, are constrained by the window frame definition RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, we need to extend it to include the whole result set. No problem: The window definition can just as easily be ORDER BY pos RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, that won’t impact lag and lead in the slightest.

Finally, since this all requires reusing an identical window definition four times, we might as well define it just once, for readability’s sake, via a WINDOW clause at the bottom of the query.

SELECT pos,
       COALESCE(lag(value) OVER w,
                last_value(value) OVER w)
         || value
         || COALESCE(lead(value) OVER w,
                     first_value(value) OVER w)
       AS neighborhood
FROM   initial_state
WINDOW w AS (ORDER BY pos RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);

For the initial state from earlier, this query yields

+-----+--------------+
| pos | neighborhood |
+-----+--------------+
|   1 | 000          |
|   2 | 001          |
|   3 | 010          |
|   4 | 101          |
|   5 | 010          |
|   6 | 101          |
|   7 | 011          |
|   8 | 110          |
|   9 | 101          |
|  10 | 011          |
|  11 | 111          |
|  12 | 110          |
|  13 | 100          |
|  14 | 000          |
+-----+--------------+

which, as it happens, is correct: This table matches the output of the triple-self-join variant (modulo row order, which could be remedied by adding ORDER BY pos to both variants). But you don’t have to take my word for it, feel free to check and compare for yourself.

Phew! You’ll be glad to know that we’re over the hump now – the next part will be almost trivial in comparison. Let’s get it done, then there will be just an tiny bit more work required to merge things back into our recursive CTE.

Assembling the next generation though neighborhood lookups

Having extracted the bit(3)-shaped neighborhood of each cell of the current generation, it’s time to assemble the next generation by querying the rule table, matching neighborhoods and retrieving next-generation results during the process.

Here it is, barely any explanation required:

SELECT c.pos,
       (SELECT r.result
        FROM   rule r
        WHERE r.neighborhood = c.neighborhood)
FROM   (...)
       AS c(pos, neighborhood);

The ... part stands for the neighborhood construction query we’ve worked so hard on – reproducing it here, in my mind, would distract from how little logic is actually required for the lookups.

The next generation, then, is:

+-----+--------+
| pos | result |
+-----+--------+
|   1 | 0      |
|   2 | 1      |
|   3 | 1      |
|   4 | 0      |
|   5 | 1      |
|   6 | 0      |
|   7 | 1      |
|   8 | 0      |
|   9 | 0      |
|  10 | 1      |
|  11 | 0      |
|  12 | 0      |
|  13 | 1      |
|  14 | 0      |
+-----+--------+

If you scroll back up to the cellular automata explainer at the beginning of this post, you’ll find that this output precisely matches the second generation shown there. Looks like we’re onto something!

Merging all this into the recursive CTE

Back to the soon-not-to-be-incomplete-anymore recursive CTE. We had previously left it in this state before wandering off to figure out the ... part in isolation:

WITH RECURSIVE ca(gen, pos, value) AS (
  SELECT 0, *
  FROM   initial_state

    UNION ALL

  ...
)
SELECT json_agg(c) FROM ca c;

Meanwhile, the ... part we came up with as a stand-alone query, computing the second generation only, looks like this:

SELECT c.pos,
       (SELECT r.result
        FROM   rule r
        WHERE  r.neighborhood = c.neighborhood)
FROM (SELECT pos,
             COALESCE(lag(value) OVER w,
                      last_value(value) OVER w)
               || value
               || COALESCE(lead(value) OVER w,
                           first_value(value) OVER w)
             AS neighborhood
      FROM   initial_state
      WINDOW w AS (ORDER BY pos RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING))
     AS c(pos, neighborhood);

We can’t just paste this in – there’s a few knobs and toggles we’ll need to adjust:

In the neighborhood construction subquery, we obviously have to reference ca instead of initial_state – each generation should be computed based on the directly previous one, not the very first one.
At the same time, we shouldn’t forget to pass the gen attribute of ca along to the outer query, where we can then increment it as part of the SELECT clause, thereby marking the newly computed cells as belonging to the next generation.
Finally, to avoid having the query run forever (I hear that’s bad for performance metrics), we require a termination condition: Once gen > :generations, the recursive part of the CTE should stop producing new rows, which terminates the process. Flipping the comparison operator, we should only produce new rows WHERE gen < :generations.

If you’re new to WITH RECURSIVE, let me point something out to stave off confusion: Since the database system permits access only to rows of ca that have been freshly added in the previous recursion step (i.e., generation), there’s no need to explicitly ignore previous generations.

That’s it? That’s indeed, finally, it:

WITH RECURSIVE ca(gen, pos, value) AS (
  SELECT 0, *
  FROM   initial_state

    UNION ALL

  SELECT c.gen + 1,
         c.pos,
         (SELECT r.result
          FROM   rule r
          WHERE r.neighborhood = c.neighborhood)
  FROM   (SELECT gen,
                 pos,
                 COALESCE(lag(value) OVER w,
                          last_value(value) OVER w)
                   || value
                   || COALESCE(lead(value) OVER w,
                               first_value(value) OVER w)
                 AS neighborhood
          FROM   ca
          WHERE  gen < :generations
          WINDOW w AS (ORDER BY pos RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
         )
         AS c(gen, pos, neighborhood)
)
SELECT json_agg(c) FROM ca c;

Upon running this query, we’re presented with a JSON value that enables us to kinda-sorta-verify the result’s correctness even after many generations. While this would be great for shipping off to some kind of frontend (if we’d built one), it’s not exactly pretty:

+-----------------------------------+
|             json_agg              |
+-----------------------------------+
| [{"gen":0,"pos":1,"value":"0"},  +|
|  {"gen":0,"pos":2,"value":"0"},  +|
|  {"gen":0,"pos":3,"value":"0"},  +|
|  {"gen":0,"pos":4,"value":"0"},  +|
|  {"gen":0,"pos":5,"value":"0"},  +|
|  {"gen":0,"pos":6,"value":"1"},  +|
|  {"gen":0,"pos":7,"value":"1"},  +|
|  {"gen":0,"pos":8,"value":"0"},  +|
|  {"gen":0,"pos":9,"value":"1"},  +|
|  {"gen":0,"pos":10,"value":"1"}, +|
|  {"gen":0,"pos":11,"value":"0"}, +|
|  {"gen":0,"pos":12,"value":"0"}, +|
|  {"gen":0,"pos":13,"value":"1"}, +|
|  {"gen":0,"pos":14,"value":"0"}, +|
|  {"gen":1,"pos":1,"value":"0"},  +|
|  {"gen":1,"pos":2,"value":"1"},  +|
|  {"gen":1,"pos":3,"value":"1"},  +|
|  {"gen":1,"pos":4,"value":"0"},  +|
|  {"gen":1,"pos":5,"value":"0"},  +|
|  {"gen":1,"pos":6,"value":"0"},  +|
|  {"gen":1,"pos":7,"value":"1"},  +|
┆                                   ┆
|  {"gen":7,"pos":8,"value":"1"},  +|
|  {"gen":7,"pos":9,"value":"1"},  +|
|  {"gen":7,"pos":10,"value":"1"}, +|
|  {"gen":7,"pos":11,"value":"0"}, +|
|  {"gen":7,"pos":12,"value":"1"}, +|
|  {"gen":7,"pos":13,"value":"0"}, +|
|  {"gen":7,"pos":14,"value":"1"}]  |
+-----------------------------------+

Let’s fix that before calling it quits.

Pretty-printing

Alright, now all the hard work’s done, but we’d really like something pretty to feast our eyes upon. So instead of SELECT json_agg(c) FROM ca c, we can GROUP BY the generation and output either a space or a black rectangle for each cell, then string_agg the whole shebang…

...
SELECT gen,
       string_agg(CASE WHEN value = 0 :: bit THEN ' ' ELSE '█' END, '' ORDER BY pos) AS state
FROM   ca
GROUP BY gen
ORDER BY gen;

…et¹⁴ voila:

+-----+----------------+
| gen |     state      |
+-----+----------------+
|   0 |   █ █ ██ ███   |
|   1 |  ██ █ █  █  █  |
|   2 | ██  █ ████████ |
|   3 |   ███ █        |
|   4 |  ██   ██       |
|   5 | ██ █ ██ █      |
|   6 | █  █ █  ██   █ |
|   7 |  ███ ████ █ ██ |
+-----+----------------+

Neat, right? Now go back to the beginning and try a different rule, or a wider array, or more generations, or something else entirely – three examples ready for pastin’ into psql:

\set rule 105
\set size 60
\set generations 30
\set random TRUE

\set rule 75
\set size 60
\set generations 30
\set random FALSE

\set rule 60
\set size 60
\set generations 30
\set random TRUE

Rest assured that despite the ridiculousness of doing all within a database and while neither defining any indexes not attempting to optimize the queries, this implementation is reasonably performant – on my aging 2015-era laptop, the recursive CTE seems to terminate within less than a second for

\text{size} \times \text{generations} < 100\,000,

or, in human terms, anything that’s gonna fit on your screen will be generated in the blink of an eye.

I’m sure a well-optimized C implementation could easily handle millions and millions of cells per second, but where’s the fun in that?

TL;DR

Here’s the completed, only-the-good-bits SQL code – feel free to use it under the terms of the MIT License.

\set rule 30
\set size 14
\set generations 7
\set random TRUE

CREATE TEMPORARY TABLE rule (
  neighborhood bit(3),  -- bit string of length 3
  result bit            -- just a single bit
);

INSERT INTO rule
  SELECT (7 - (neighborhood - 1)) :: bit(3) AS neighborhood, result
  FROM   unnest(regexp_split_to_array(:rule :: bit(8) :: text, '') :: bit[])
         WITH ORDINALITY AS rule(result, neighborhood);

CREATE TEMPORARY TABLE initial_state (
  pos int,
  value bit
);

INSERT INTO initial_state
  SELECT pos, CASE
                WHEN :random THEN round(random()) :: int              -- random
                ELSE CASE WHEN pos = :size / 2 + 1 THEN 1 ELSE 0 END  -- single 1 in the middle
              END :: bit
  FROM   generate_series(1, :size) AS pos;

WITH RECURSIVE ca(gen, pos, value) AS (
  SELECT 0, *
  FROM   initial_state

    UNION ALL

  SELECT c.gen + 1,
         c.pos,
         (SELECT r.result
          FROM   rule r
          WHERE r.neighborhood = c.neighborhood)
  FROM   (SELECT gen,
                 pos,
                 COALESCE(lag(value) OVER w,
                          last_value(value) OVER w)
                   || value
                   || COALESCE(lead(value) OVER w,
                               first_value(value) OVER w)
                 AS neighborhood
          FROM   ca
          WHERE  gen < :generations
          WINDOW w AS (ORDER BY pos RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
         )
         AS c(gen, pos, neighborhood)
)
SELECT gen,
       string_agg(CASE WHEN value = 0 :: bit THEN ' ' ELSE '█' END, '' ORDER BY pos) AS state
FROM   ca
GROUP BY gen
ORDER BY gen;

Note that a small number of PostgreSQL-specific features will be used, however due to SQL’s Turing completeness, everything we do here is theoretically possible in any standards-compliant RDBMS. Any recent version of Postgres should work – I’ve tested the code presented in this blog post using Postgres 10.17. ↩
More recently, I’ve taken the poster generation code, rejiggered its color scheme selection part, and turned it into a Mastodon bot – feel free to follow @sundryautomata@botsin.space. ↩
And the length of the array, and how many generations should be simulated, and what kind of initial state to set – but that’s all window dressing and doesn’t determine the inherent properties of the cellular automaton. ↩
One might say that it’s SQL time. ↩
Note that my implementation isn’t golfed at all – it could be shortened significantly, but I’ve gone for comprehensibility over making myself feel smart. Neither is it optimized for performance, mind you. ↩
It’s also conceivable to generate an SVG image similar to those in the “Cellular automata 101” section above, but I’ve done something like that for a previous blog post when I drew the Sierpiński Triangle with recursive SQL, so there’s no novelty in this for me. But for you, there might be – sounds like a good thing to leave as an exercise to the reader, then! ↩
Along with a selection of handy bitwise operators – we won’t need them here, but they’re important tools to know about. ↩
Gone in the sense of “not cluttering the database” rather than “incognito mode”, of course. ↩
We can count ourselves lucky that Postgres just does this conversion for us – otherwise, we’d need to formulate a recursive CTE or a stored procedure that could do the job. That could actually be a fun exercise! ↩
The design choice to denote arrays with curly braces will continue to confound me to no end until the day I die. ↩
Sounds oddly more lonely than a standard self-join, doesn’t it? ↩
I’ve configured psql to indicate NULLs with <NULL> instead of blank table cells by adding the following line to my ~/.psqlrc:
```
\pset null <NULL>
```
The \pset null option is only the tip of the iceberg – there’s a whole host of useful configuration commands. For a complete list, search for “Adjustable printing options are:” in the psql documentation. Some favorites:
- You can configure psql to maintain a separate command history for each database.
```
\set HISTFILE ~/.psql_history_ :DBNAME
```
  This can be very handy when trying to retrace your steps a few weeks after the fact. If you’re experiencing déjà vu right now, that might be because, as noted in the documentation, “[t]his feature was shamelessly plagiarized from Bash.” Other Bash-like options like HISTCONTROL, HISTSIZE, and IGNOREEOF exist, as well.
- You might prefer prettier table borders using Unicode box drawing characters.
```
\pset border 2
\pset linestyle unicode
\pset unicode_header_linestyle double
```
  (I used only the first of the three options for the examples in this post.)
- Finally:
```
\timing on
```
  This compels psql to tell you how much time each command has taken. This can help you to catch slow queries in a size-constrained testing environment.
…and that barely scratches the surface – psql is highly configurable indeed. ↩
It’s really an extended family – you can also use aggregate functions in the place of window functions, although the semantics might feel somewhat unusual at first: The documentation notes that “[a]n aggregate used with ORDER BY and the default window frame definition produces a “running sum” type of behavior.” You can observe this when using array_agg as a window function, for instance. ↩
Of course, you won’t see quite the same unless your system happens to pick the very same initial state from $2^{14} = 16\,384$ possible variations – use
```
INSERT INTO initial_state
  SELECT pos, value
  FROM   unnest(regexp_split_to_array('00101011011100', '') :: bit[])
         WITH ORDINALITY AS initial_state(value, pos);
```
instead of the query described earlier to start off with the same state as the examples throughout this post. ↩