Displaying an Accurate Reading Time Estimate in Jekyll

When reading an article online, it’s always nice¹ to be provided with a reading time estimate somewhere near the top of the page. The widely-used static site generator Jekyll doesn’t ship with this feature, which is why I decided to implement it as a snippet you can pop into your template without installing any plugins.

A basic variant

Of course, I’m not the first person to take a crack at this – others have come before. Their solutions can be all be implemented more or less like this:

<span class="ert">
    <abbr title="Estimated reading time">ERT</abbr>
    {% assign words = page.content | strip_html | number_of_words %}
    {% assign ert = words | divided_by:250 | at_least: 1 %}
    {{ ert }} minute{% if ert != 1 %}s{% endif %}
</span>

That snippet would go into _layouts/post.html if you’re using a standard Jekyll setup. It works as follows:

Remove HTML tags² from a copy of the page content and count the remaining words. The number_of_words filter is implemented in the simplest³ possible way, splitting its input on spaces and returning the length of the resulting list.
Compute the estimated reading time by dividing the word count by 250, which approximates the reading speed expressed in words per minute (WPM) of an average adult. Of course, reading speed differs wildly based on education, language and other factors, but there’s very little⁴ that could be done about this within Jekyll. The at_least: 1 filter makes sure that no obviously-incorrect “0 minutes” reading time estimate is displayed⁵ for very short posts.
Finally, the estimated reading time is printed with a possibly-pluralized label.

What’s wrong with it

On this blog, I exclusively write in English, so my Jekyll setup doesn’t suffer from CJK-related word count or variable-language WPM issues. Rather, to me, the main downsides of this approach are the following:

Code blocks, i.e. <pre> tags, aren’t treated any differently from normal text.

This inflates the reading time estimate since code blocks are more frequently skimmed than text, and some of them may contain verbose example outputs that take much less time to comprehend than Jekyll’s naïve word count metric accounts for.
The strip_html filter, in addition to removing HTML tags, also removes the contents of <script> and <style> tags. As a result, the reading time estimate happens to ignore LaTeX math as it’s wrapped in <script type="math/tex"> tags.

Since mathematical notation tends to be fairly dense and takes longer to comprehend than the same length of prose, this behavior leads to a significant reading time underestimate for math-heavy posts.

Improving things

Based on these observations, I’ve modified the snippet shown above to compute a weighted reading time estimate where

the impact of code snippets is halved, while
the word count of math zones is doubled.

In my testing, these heuristics work well in practice, providing a more accurate estimate than the naïve approach for the kinds of articles I write. The improved⁶ snippet heavily relies on the replace filter and the fact that the strip_html filter removes HTML comments:

<span class="ert">
    <abbr title="Estimated reading time">ERT</abbr>
    {% assign words_total = page.content | replace: '<script type="math/tex">', '' | replace: '<script type="math/tex; mode=display">', '' | replace: '</script>', '' | strip_html | number_of_words %}
    {% assign words_without_code = page.content | replace: '<pre class="highlight">', '<!--' | replace: '</pre>', '-->' | replace: '<script type="math/tex">', '' | replace: '<script type="math/tex; mode=display">', '' | replace: '</script>', '' | strip_html | number_of_words %}
    {% assign words_without_math = page.content | strip_html | number_of_words %}
    {% assign words_without_either = page.content | replace: '<pre class="highlight">', '<!--' | replace: '</pre>', '-->' | strip_html | number_of_words %}

    {% assign words_code = words_total | minus: words_without_code | divided_by: 2.0 %}
    {% assign words_math = words_total | minus: words_without_math | times: 2.0 %}
    {% assign words = words_without_either | plus: words_code | plus: words_math | round %}

    {% assign ert = words | divided_by:250 | at_least: 1 %}
    {{ ert }} minute{% if ert != 1 %}s{% endif %}
</span>

I’d even argue that Medium’s reading time estimate was chief among the little details that made the service popular (you know, before they ruined the reading experience with excessive growth hacking banners and popups). ↩
Some prior solutions even leave out the strip_html filter, thereby accidentally inflating their reading time estimate. ↩
Note that this approach doesn’t work at all for languages like Chinese and Japanese which don’t use spaces to separate words, and might yield misleading reading time estimates for fairly “dense” languages like Korean. At the time of writing, a pull request which addresses this deficiency (but overshoots its goal for Korean) has been in limbo for several months. ↩
Apart from detecting the article’s language (or having the author specify it as post metadata) and applying a different base WPM based on this information. Another option would be to make the WPM configurable by the reader, but that clashes with the goal of providing a seamless, glanceable reading time estimate (plus it would involve a bunch of JavaScript). ↩
Some of the prior solutions don’t implement this little fix. ↩
Improved in its function, but yes, the code is really awkward and unintuitive. There’s no way around this since Jekyll doesn’t provide a regex replacement filter. Also note that reading time estimates could be way off if you include executable JavaScript code in your posts – the snippet replaces closing </script> tags with empty strings during construction of the words_total variable, thus causing their contents to be counted alongside the rest of the post. ↩