Excessively Adequate

Converting Numbers to Strings in C11

While working on my bachelors thesis this past weekend, I came across an interesting problem: Given an int or float of unknown size, how can I write it to a string with no bytes to spare?

Conventionally, you’d probably write something like this when having to compute the string representation of an int:

char *inttostr_conventional(int n) {
    char *s = (char*) malloc(sizeof(char) * 42);
    sprintf(s, "%d", n);
    return s;
}

However, that’s not quite what I wanted. While 42 bytes is certainly sufficient for an arbitrary 32-bit integer1, some of this space will go unused more often than not.

One way to improve upon this is to use the log\log function. When represented as a string of characters, the “length” of a positive integer nn is log10(n)+1\left\lfloor\log_{10}(n)\right\rfloor+1. Except that log(0)\log(0) is not defined2, so you’ll have to add a special case for that. And there’s these pesky negative numbers too, for which you’ll have to add another byte to account for the minus sign. And then, in all three cases, another byte for the null terminator:

char *inttostr_log(int n) {
    int len;
    if (n > 0) {
        len = (int) floor(log10(n)) + 1;
    } else if (n == 0) {
        len = 1;
    } else if (n < 0) {
        len = (int) floor(log10(-n)) + 2;
    }
    len += 1;

    char *s = (char*) malloc(sizeof(char) * len);
    sprintf(s, "%i", n);
    return s;
}

That’s fairly verbose, and I’m sure there’s an edge case that breaks this implementation. Plus this approach only works for ints, we haven’t even gotten started with floats yet!

Meet snprintf

After some research, I came up with the folling three-line function that seems to magically work in all cases (even for floats if you change the parameter and format string accordingy):

char *inttostr_magic(int n) {
    int len = snprintf((char*) 0, 0, "%i", n) + 1;
    char *s = (char*) malloc(sizeof(char) * len);
    snprintf(s, len, "%i", n);
    return s;
}

The function snprintf was added as a part of the C11 standard and has an additional argument over the good old sprintf:

int sprintf(char *str, const char *format, ...);
int snprintf(char *str, size_t size, const char *format, ...);

The functions snprintf() and vsnprintf() do not write more than size bytes (including the terminating null byte ('\0')). If the output was truncated due to this limit then the return value is the number of characters (excluding the terminating null byte) which would have been written to the final string if enough space had been available.

We can exploit this behavior by passing a size of 0. That results in snprintf not even trying to write to the pointer given in its first argument (so we can safely pass a null pointer) and instead returning the number of characters that would have been written if we hadn’t been so sneaky.

Now we can simply malloc this number of characters (plus one, for the null terminator), use snprintf again to actually write our number to the string we’ve just created and return it!

Analysis

When considering time efficiency as well, things look a bit different than what we saw when only looking at how much memory was allocated. Here’s a simple benchmark, with – spoiler alert – an averaged runtime measurement next to each function call:

void inttostr_benchmark() {
    char *s = (char*) 0;
    for (int i = -1000000; i < 1000000; i++) {
        s = inttostr_conventional(i); // => 0.365s
        //s = inttostr_log(i);        // => 0.379s
        //s = inttostr_magic(i);      // => 0.544s
    }
}

Depending on your needs, it might be your best bet to go with the conventional wisdom. If performance isn’t too important to you, go with the snprintf approach. As a tradeoff between runtime and memory usage3, manually computing the required string length is an option too – just make sure to cover every edge case!

The code is available as a Gist.

Background

This problem arose when I was faced with the task of serializing data structures4 for logging purposes. I chose to implement a tostr function for each data type, which recursively calls the tostr functions of its children and concatenates the results (with some syntactic glue in-between).

At some point, even highly nested data structures – which is the primary use case – boil down to atomic values: ints, floats and bools (which are really just tiny ints), which led me to look into what I’ve covered in this article.

  1. Actually, it’s even enough for 128-bit (unsigned) integers: log10(21281)38.5\log_{10}(2^{128}-1) \approx 38.5. My point is that while it’s guaranteed to be enough space, it’s a bit wasteful for small numbers, which tend to occur more frequently

  2. Its limit is -\infty, which is what’s commonly implemented: On my machine, running printf("%f", log10(0)); prints -inf, and printf("%i", (int) log10(0)); yields -2147483648

  3. And when dealing with bases 10\neq 10

  4. Lists, dictionaries and a special flavor of dictionary used to represent database rows.