Representing, Multiplying, and Transposing (Sparse) Matrices in SQL
Databases are commonly used as dumb storage bins for CRUD application data. However, this doesn’t do them justice: database systems such as PostgreSQL are much more capable and versatile than that, supporting a wide range of operations across many different data types.
The standard way of interfacing with Postgres is SQL – you know, that thing you may have briefly learned in webdev class where you SELECT
stuff FROM
a set of tables WHERE
some condition is met. But that’s a far cry from what you can achieve when taking advantage of the full feature set of this declarative and – surprise! – Turing complete programming language.
In this post, I’ll describe how to idiomatically represent matrices in a PostgreSQL1 database and how to multiply and transpose them using fairly straightforward SQL queries. I’ll also show how to make those queries work with an alternative, array-based representation of matrices.
This post is loosely based on what I learned in the Advanced SQL lecture held by Torsten Grust in the summer of 2017 at the University of Tübingen. Take a look at the lecture slides for in-depth explanations and a wide array of examples.
Representation matters
To find a good way to represent matrices in a database table, let’s consider the following matrix:
One possibility is what I’ll call a literal translation, i.e. translating the rows and columns of directly to rows and columns of a database table.
However, this approach would quickly prove to be inflexible, requiring a table definition tailored to the matrix’s column count and providing no way of uniquely identifying2 rows. It’s also worth considering that the product of two non-square matrices has a different dimensionality than both factors, so computing a result table correctly in a query targeting this representation would be complex and likely involve quite a bit of trickery. The final nail in the coffin of this representation stems from its inability to store more than one matrix per table, making it impractical for any real-world use case.
After careful consideration of these drawbacks, we might come up with a table schema that assigns a unique identifier to our matrix and represents its contents as a set of coordinates (row and column indices) and associated values:
CREATE TABLE matrix (
id INTEGER,
row INTEGER,
col INTEGER,
val INTEGER,
PRIMARY KEY(id, row, col)
);
The PRIMARY KEY
constraint3 ensures that we won’t insert more than one value at the same coordinate for any matrix. Now we can store4 our matrix from above as follows:
INSERT INTO matrix (id, row, col, val) VALUES
(1, 0, 0, 1), (1, 0, 1, 2), (1, 0, 2, 3),
(1, 1, 0, 4), (1, 1, 1, 5), (1, 1, 2, 6);
Storing multiple, differently sized matrices is no problem when using this encoding. Adding another matrix
to our table is as simple as executing the following statement (note that we’ve chosen a different value for the id
column):
INSERT INTO matrix (id, row, col, val) VALUES
(2, 0, 0, 1),
(2, 1, 0, 2),
(2, 2, 0, 3);
Now we’re ready to compute the matrix product , which we’re then able to transpose!
Matrix multiplication
We’ll begin by quickly recapping how the product of our two matrices and can be calculated. To show the process more clearly, let’s temporarily replace their elements with placeholders. The first component of each placeholder’s index indicates which row the entry originates from, the second part accordingly signifies the column index.
The product can then be computed using component-wise multiplication of each row from with each column from and subsequent addition of the products.
In order for this to work, the column count of our matrix must match the row count of the matrix , which happens to be the case here – so matrix multiplication is not5 associative. The result matrix then has the same number of rows as and the same number of columns as – and indeed, it’s a matrix.
Notice that the indexes shown in the result matrix above follow a pattern: In the term for each element of the result matrix, the second component of each index (corresponding to column of ) is always equal to the first component of each index (corresponding to row of ). Expressing this idea with math notation, it holds that
This observation will come in handy once it’s query writing time – but first, let’s try this procedure with the actual entries of our matrices:
With this in mind, let’s get started on writing a query that multiplies our two protagonists.
It’s clear that we need to somehow match each row of with its partner columns from . Since both matrices reside in the same table, the best way to do this is via a self join – this, as the name suggests, means joining6 our table matrix
with itself, which pairs up all table rows of matrix
with each other table row of matrix
before executing the WHERE
clause.
SELECT m1.*, m2.*
FROM matrix m1, matrix m2
WHERE m1.id = 1 AND m2.id = 2
ORDER BY m1.id, m1.row, m1.col, m2.id, m2.row, m2.col;
(The ORDER BY
clause isn’t strictly required, but it makes sure that the rows are returned in the order shown below – tables are fundamentally unordered multisets of rows. The fact that Postgres will return them in insertion order if we omit the ORDER BY
clause of basic queries like this is merely an implementation detail that may change in future.)
Taking a look at an annotated and shortened version of the query result will hopefully answer any questions you might have about basic joins7 at this point:
+----------------------+----------------------+
| matrix m1 | matrix m2 |
+----+-----+-----+-----+----+-----+-----+-----+
| id | row | col | val | id | row | col | val |
+----+-----+-----+-----+----+-----+-----+-----+
| 1 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | <- a₁,₁ paired with b₁,₁
| 1 | 0 | 0 | 1 | 2 | 1 | 0 | 2 | <- a₁,₁ paired with b₂,₁
| 1 | 0 | 0 | 1 | 2 | 2 | 0 | 3 | <- a₁,₁ paired with b₃,₁
| 1 | 0 | 1 | 2 | 2 | 0 | 0 | 1 | <- a₁,₂ paired with b₁,₁
| 1 | 0 | 1 | 2 | 2 | 1 | 0 | 2 | <- a₁,₂ paired with b₂,₁
| 1 | 0 | 1 | 2 | 2 | 2 | 0 | 3 | <- a₁,₂ paired with b₃,₁
| 1 | 0 | 2 | 3 | 2 | 0 | 0 | 1 | <- a₁,₃ paired with b₁,₁
| 1 | 0 | 2 | 3 | 2 | 1 | 0 | 2 | <- a₁,₃ paired with b₂,₁
| 1 | 0 | 2 | 3 | 2 | 2 | 0 | 3 | <- a₁,₃ paired with b₃,₁
| 1 | 1 | 0 | 4 | 2 | 0 | 0 | 1 | <- a₂,₁ paired with b₁,₁
| 1 | 1 | 0 | 4 | 2 | 1 | 0 | 2 | <- a₂,₁ paired with b₂,₁
| 1 | 1 | 0 | 4 | 2 | 2 | 0 | 3 | <- a₂,₁ paired with b₃,₁
| 1 | 1 | 1 | 5 | 2 | 0 | 0 | 1 | <- a₂,₂ paired with b₁,₁
| 1 | 1 | 1 | 5 | 2 | 1 | 0 | 2 | <- a₂,₂ paired with b₂,₁
| 1 | 1 | 1 | 5 | 2 | 2 | 0 | 3 | <- a₂,₂ paired with b₃,₁
| 1 | 1 | 2 | 6 | 2 | 0 | 0 | 1 | <- a₂,₃ paired with b₁,₁
| 1 | 1 | 2 | 6 | 2 | 1 | 0 | 2 | <- a₂,₃ paired with b₂,₁
| 1 | 1 | 2 | 6 | 2 | 2 | 0 | 3 | <- a₂,₃ paired with b₃,₁
+----+-----+-----+-----+----+-----+-----+-----+
(18 rows)
Notice this:
-
The left half of the result table contains all table rows corresponding to elements of . This has been achieved by assigning the alias
m1
to a “copy” of thematrix
table and filtering it based on the predicate8m1.id = 1
. Notice that each element of is repeated three times: once for each element of . -
The right half analogously contains multiple copies of each element of , once for each element of .
A good first step! Now, before we can accomplish what we’ve set out to do with this self join – matching each row of with its partner columns from – it makes sense to augment the WHERE
clause of our query to constrain the working set of rows such that only the matching pairs of values (recall: the ones that share the same ) within each row-column pair are retained.
SELECT m1.*, m2.*
FROM matrix m1, matrix m2
WHERE m1.id = 1 AND m2.id = 2
AND m1.col = m2.row
ORDER BY m1.id, m1.row, m1.col, m2.id, m2.row, m2.col;
Adding this predicate to the WHERE
clause has decreased9 the size of the query result considerably – take a look (again, this is an annotated result table):
+----------------------+----------------------+
| matrix m1 | matrix m2 |
+----+-----+-----+-----+----+-----+-----+-----+
| id | row | col | val | id | row | col | val |
+----+-----+-----+-----+----+-----+-----+-----+
| 1 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | <- a₁,₁ paired with b₁,₁
| 1 | 0 | 1 | 2 | 2 | 1 | 0 | 2 | <- a₁,₂ paired with b₂,₁
| 1 | 0 | 2 | 3 | 2 | 2 | 0 | 3 | <- a₁,₃ paired with b₃,₁
| 1 | 1 | 0 | 4 | 2 | 0 | 0 | 1 | <- a₂,₁ paired with b₁,₁
| 1 | 1 | 1 | 5 | 2 | 1 | 0 | 2 | <- a₂,₂ paired with b₂,₁
| 1 | 1 | 2 | 6 | 2 | 2 | 0 | 3 | <- a₂,₃ paired with b₃,₁
+----+-----+-----+-----+----+-----+-----+-----+
(6 rows)
As you might intuit from on this query result, in order to (finally!) match each row of with its partner columns from that we’ll then be able to component-wise multiply and sum up, we need to group them.
Luckily, SQL’s GROUP BY
construct will do this grouping work for us – we just need to direct it to group the data by m1.row
and m2.col
. For now, we’ll aggregate the group contents into arrays so we can inspect them:
SELECT m1.row AS "m1.row", m2.col AS "m2.col", array_agg(m1.val) AS "m1.vals", array_agg(m2.val) AS "m2.vals"
FROM matrix m1, matrix m2
WHERE m1.id = 1 AND m2.id = 2
AND m1.col = m2.row
GROUP BY m1.row, m2.col
ORDER BY m1.row, m2.col;
And the result:
+--------+--------+---------+---------+
| m1.row | m2.col | m1.vals | m2.vals |
+--------+--------+---------+---------+
| 0 | 0 | {1,2,3} | {1,2,3} |
| 1 | 0 | {4,5,6} | {1,2,3} |
+--------+--------+---------+---------+
(2 rows)
Let’s compare this with the intermediate result we got when multiplying the two matrices by hand earlier:
It appears to match. Based on this comparison, we can surmise that for each table row, all that’s left to do is
- multiplying each element of
m1.vals
with the corresponding element ofm2.vals
and sum
ming these products up.
We’ll do just that while giving our new matrix a unique id
at the same time:
SELECT 3 AS id, m1.row, m2.col, sum(m1.val * m2.val) AS val
FROM matrix m1, matrix m2
WHERE m1.id = 1 AND m2.id = 2
AND m1.col = m2.row
GROUP BY m1.row, m2.col
ORDER BY m1.row, m2.col;
Which yields:
+----+-----+-----+-----+
| id | row | col | val |
+----+-----+-----+-----+
| 3 | 0 | 0 | 14 |
| 3 | 1 | 0 | 32 |
+----+-----+-----+-----+
(2 rows)
Checking back in with our manually computed result from above…
…we notice that they match, so our query works! Of course, we’ve tried this with only a single test case. If you remain unconvinced, run a couple more matrices through the query.
Pro tip: You can insert this result back into the table by simply putting INSERT INTO matrix
in front of the query.
From zero to hero
A significant advantage of our chosen matrix representation and the multiplication query is that it handles sparse matrices – ones where most of the entries are zero, e.g. diagonal matrices, adjacency matrices, or some kinds of matrices that crop up in machine learning contexts – extremely gracefully: matrix entries that are zero need not be encoded.
Any such “missing” entries are correctly handled by our query because if the m1.col = m2.row
predicate fails to “find” a partner for one of the matrix elements, that element not processed further, which is equivalent10 to muliplying it with zero.
But you don’t have to take my word for it – let’s multiply these11 two sparse matrices:
INSERT INTO matrix (id, row, col, val) VALUES
(4, 0, 1, -1), (4, 1, 0, 1), (4, 2, 2, 1),
(5, 0, 0, 1);
SELECT 6 AS id, m1.row, m2.col, sum(m1.val * m2.val) AS val
FROM matrix m1, matrix m2
WHERE m1.id = 4 AND m2.id = 5
AND m1.col = m2.row
GROUP BY m1.row, m2.col
ORDER BY m1.row, m2.col;
This correctly yields:
+----+-----+-----+-----+
| id | row | col | val |
+----+-----+-----+-----+
| 6 | 1 | 0 | 1 |
+----+-----+-----+-----+
As a result, when dealing with truly massive sparse matrices, our chosen encoding significantly benefits both storage space and multiplication performance.
Matrix transposition
Transposing a matrix is a whole lot simpler than multiplying two of them: All it is is mirroring a matrix along a line from the top left to the bottom right. A quick demo using our running example will make this clear:
Implementing this is almost trivial, which is a sign that we’ve chosen a well-suited representation: All we need to do is swap the row and column indices for each element of the matrix. As above, the ORDER BY
clause is optional, it merely serves to improve readability.
SELECT 4 AS id, col AS row, row AS col, val
FROM matrix
WHERE id = 1
ORDER BY row, col;
And we get:
+----+-----+-----+-----+
| id | row | col | val |
+----+-----+-----+-----+
| 4 | 0 | 0 | 1 |
| 4 | 0 | 1 | 4 |
| 4 | 1 | 0 | 2 |
| 4 | 1 | 1 | 5 |
| 4 | 2 | 0 | 3 |
| 4 | 2 | 1 | 6 |
+----+-----+-----+-----+
Addition and subtraction of matrices can be implemented even more easily, so I won’t discuss them in this post.
Representation matters, redux
Modern SQL implementations include support for a data type that may strike you as a good candidate for representing matrices atomically (sort of, anyway), i.e. with one matrix per table row: arrays. They can be multidimensional, meaning that you could represent our two matrices and as an array12 of row-representing arrays:
CREATE TABLE matrix_array (
id INTEGER PRIMARY KEY,
matrix INTEGER[][]
);
INSERT INTO matrix_array (id, matrix) VALUES
(1, '{{1,2,3},{4,5,6}}'),
(2, '{{1},{2},{3}}');
This new representation has several advantages:
-
It may strike you as neater – one row per matrix instead of many. It’s certainly more compact.
-
If you choose this representation, Postgres will actually do some error-checking for you. Consider the following
INSERT
statement:INSERT INTO matrix_array (id, matrix) VALUES (3, '{{1,2},{3}}');
An attempt to execute it will lead to Postgres complaining about the inconsistent sub-array dimensions.
ERROR: malformed array literal: "{{1,2},{3}}" LINE 2: (3, '{{1,2},{3}}'); ^ DETAIL: Multidimensional arrays must have sub-arrays with matching dimensions.
-
You don’t need to worry about storing row and column indices for each matrix element – they’re encoded implicitly.
-
The dimensions of the arrays correspond to the size of each matrix, thereby explicitly encoding this metadata. The table-based representation does not necessarily provide a clear indication of the size of (sparse) matrices: Consider a matrix whose bottom-most row(s) or right-most column(s) are all zeros – such rows and columns would simply not be stored, making it impossible to ascertain whether and how many of such rows and columns should logically be present.
Disadvantages include:
- Since zeros cannot be omitted, this representation delivers no benefits for storing or operating on sparse matrices – the table-based variant is clearly superior in this regard.
- Even though it’s more compact, it’s not idiomatic SQL. In fact, one might argue that it violates 1NF since arrays semantically aren’t atomic, they’re collections13 of atomic values.
- Arrays aren’t as widely implemented amongst relational databases systems as, you know, tables.
- Unpacking arrays has complexity .
Let’s explore this representation anyway! To this end, we could write two wholly new queries for multiplying and transposing matrices represented in this manner, or14 we could take the lazy (and more interesting) route: Setting up one VIEW
that converts the array-based encoding into the table-based one that our queries already operate on, so basically an adapter.
Translating arrays into tables
As the name suggests, a VIEW
allows for a different view into your data: It provides the same querying interface as a table, but internally executes a query and provides its results to the “caller”. What’s particularly cool about views is that they automatically15 update whenever the underlying data is modified.
In order to set up a view that takes the array-based encoding and translates it into the tabular one that our queries can operate on, we’ll need a couple of shiny tools on our belt:
-
The
unnest
function converts an arbitrary-dimensional array into a flat table. For example, the query16SELECT * FROM matrix_array m, LATERAL unnest(m.matrix);
yields one row per array element, irrespective of its nesting depth:
+----+-------------------+--------+ | id | matrix | unnest | +----+-------------------+--------+ | 1 | {{1,2,3},{4,5,6}} | 1 | | 1 | {{1,2,3},{4,5,6}} | 2 | | 1 | {{1,2,3},{4,5,6}} | 3 | | 1 | {{1,2,3},{4,5,6}} | 4 | | 1 | {{1,2,3},{4,5,6}} | 5 | | 1 | {{1,2,3},{4,5,6}} | 6 | | 2 | {{1},{2},{3}} | 1 | | 2 | {{1},{2},{3}} | 2 | | 2 | {{1},{2},{3}} | 3 | +----+-------------------+--------+
Notice that the
unnest
column of the result does not give any clues regarding the dimensions of the matrices – if we were to discard thematrix
column, we could interpret the data as representing column vectors, row vectors or anything in-between. To avoid having to guess, we’ll need to access the array dimensions in order to compute row and column indices later on.But another – more immediate! – issue is that the
unnest
output is not in any particular order, however we’ll require an explicitly-encoded ordering when it comes to computing the row and column indices later on. Hence: -
The
WITH ORDINALITY
keywords can be tacked on to any table-valued function. If specified, the result is enumerated (beginning with 1). As previously mentioned, this will help us with calculating the row and column indices.SELECT m.id, u.ord, u.val FROM matrix_array m, LATERAL unnest(m.matrix) WITH ORDINALITY AS u(val, ord);
And the output17:
+----+-----+-----+ | id | ord | val | +----+-----+-----+ | 1 | 1 | 1 | | 1 | 2 | 2 | | 1 | 3 | 3 | | 1 | 4 | 4 | | 1 | 5 | 5 | | 1 | 6 | 6 | | 2 | 1 | 1 | | 2 | 2 | 2 | | 2 | 3 | 3 | +----+-----+-----+
-
array_length(arr, dim)
returns the length ofarr
along dimensiondim
. For our chosen representation,array_length(..., 1)
provides the row count andarray_length(..., 2)
yields the column count. -
This can be used in conjunction with integer division
/
, the good old modulo operator%
and ourord
column to accurately reconstruct our matrices’ row and column indices. -
Finally, we can of course drop any matrix entries that are zero to reduce storage overhead.
Throw these ingredients18 into a dutch oven, let them stew for a while and this view emerges:
CREATE VIEW matrix_arraytable AS
SELECT m.id,
(u.ord-1) / array_length(m.matrix, 2) AS row,
(u.ord-1) % array_length(m.matrix, 2) AS col,
u.val
FROM matrix_array m,
LATERAL unnest(m.matrix) WITH ORDINALITY AS u(val, ord)
WHERE u.val <> 0;
And indeed, running TABLE matrix_arraytable ORDER BY id, row, col;
yields:
+----+-----+-----+-----+
| id | row | col | val |
+----+-----+-----+-----+
| 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 2 |
| 1 | 0 | 2 | 3 |
| 1 | 1 | 0 | 4 |
| 1 | 1 | 1 | 5 |
| 1 | 1 | 2 | 6 |
| 2 | 0 | 0 | 1 |
| 2 | 1 | 0 | 2 |
| 2 | 2 | 0 | 3 |
+----+-----+-----+-----+
Now we can reuse our multiplication and transposition queries – by simply changing matrix
to matrix_arraytable
in their FROM
lists.
Going full circle
Given that we manged to set up an automated translation of the array-based representation into the table-based one, you might be inclined to wonder whether this mapping can be reversed – I certainly did! Essentially, this is asking whether the two representations are isomorphic.
For matrices that don’t contain any zeros, the answer is a resounding yes. The following query does the job:
SELECT id, array_agg(vals ORDER BY row)
FROM (SELECT id, row, array_agg(val ORDER BY col) AS vals
FROM matrix
GROUP BY id, row) AS _
GROUP BY id;
Notice how the query does the work in two stages corresponding to the doubly-nested nature of our array-based representation:
- Matrix entries are aggregated in a subquery into arrays representing rows.
- These row arrays are aggregated once more, yielding full matrices.
Looking at the query’s result, however, we notice that the sparse matrices and defined earlier aren’t looking right:
+----+-------------------+
| id | array_agg |
+----+-------------------+
| 1 | {{1,2,3},{4,5,6}} |
| 2 | {{1},{2},{3}} |
| 4 | {{-1},{1},{1}} |
| 5 | {{1}} |
+----+-------------------+
All the zeros are missing!
This is because we haven’t stored the zeros – after all, our table-based representation doesn’t need them to be present – so the array_agg
function was of course unable to aggregate them. While we could fix this by employing some generate_series
magic based on the maximum column and row indices given in the tabular representation of each matrix, this would leave cases where the bottom row(s) or the right column(s) of a matrix are all zeros out in the cold – as mentioned previously, matrices could be implicitly zero-padded to infinity.
Summary
Phew, you made it all the way to the end! Let’s recap what we’ve learned today. In this post, I’ve examined…
- how to efficiently represent matrices in database tables,
- how to multiply and transpose them with SQL,
- what the advantages of our chosen representation are for sparse matrices,
- a different, array-based encoding of matrices,
- how to mechanically translate this encoding back into the representation our queries expect,
- and whether our two representations are isomorphic.
Along the way, you’ve hopefully picked up some basics of joins, grouping, arrays and views in SQL!
-
Make sure to use PostgreSQL 9.4 or newer – we’ll be relying on some features that weren’t yet available in older versions. Other DBMSes may or may not work. ↩
-
The rows could appear shuffled upon retrieval (making correct multiplication impossible altogether) because tables in relational databases are unordered – they’re multisets of rows. As you’ll likely recall, sets are always unordered and their semantics ignore duplicates, e.g. holds. The “multi” prefix means that identical rows can appear more than once in a table – if tables were proper sets, such duplicates would be filtered out. ↩
-
Of course, we could define further constraints – for example, a
CHECK
constraint excluding negative numbers from the range of admissible column and row indices comes to mind. ↩ -
I’ve arbitrarily chosen to have row and column indexes begin from 0 instead of 1, you could do it the other way (although my way makes some of the array-specific stuff later on a bit more elegant). ↩
-
But you can always multiply a matrix with its transpose: Suppose is a matrix, then is a matrix, and – this might surprise you – the equality holds universally. ↩
-
In math terms, you can think of a join as computing the cross product of two tables (which makes sense because they’re really just (multi-)sets of rows). Plus, “cross product” already sounds like multiplication – we’re clearly on the right track. ↩
-
Check out this tweet by Julia Evans for a quick overview of different kinds of Joins, and this one for an explanation of left joins. Also check out this article on how to illustrate joins. ↩
-
A predicate is a boolean-valued expression or function, nothing fancy. ↩
-
Our
WHERE
clause has thus proved to be highly selective, which matters when it comes to query optimization (not that a query over this amount of data requires any). See slide 9 of this slide deck for more information. ↩ -
To verify this, try mentally “stepping” through the query – that’s a useful skill to build, anyway. ↩
-
If you’re curious: describes a 90° rotation around the -axis in three-dimensional space, is a three-dimensional unit vector aligned with the -axis, and the multiplication applies the rotation to the vector, yielding a unit vector aligned with the -axis. ↩
-
Array literals are written with curly braces for god knows what reason. Arrays within SQL’s JSON type, meanwhile, are delimited by the same square brackets as literally everywhere else. ↩
-
Of course, strings aren’t really atomic either as they can be split into characters. And an utter pedant might remark that anything but raw bits isn’t atomic. ↩
-
No prizes for guessing. ↩
-
Unless you configure your view to be materialized, i.e. persisted to disk. This can have query performance benefits, especially since you can set up indexes for materialized views. ↩
-
The
LATERAL
keyword is not actually required here, but it’s good style to specify it anyway. ↩ -
ord = val
holds here merely due to our choice of example matrices. ↩ -
Don’t forget to add a Knorr stock pot. ↩