# Improving math performance in glibc

**Update:** Vincent Lefèvre helpfully pointed out that I had linked to the incorrect Worst Cases paper. That link is now fixed.

**Update 2:** Dan Courcy pointed out that my equation in the “Multiplying zeroes” section had an error, which I have now fixed.

Mathematical function implementations usually have to trade off between speed of computation and the accuracy of the result. This is especially true for transcendentals (i.e. the exponential and trigonometric functions), where results often have to be computed to a fairly large precision to get last bit accuracy in a result that is to be stored in an IEEE-754 double variable.

Transcendental functions in glibc are implemented as multiple phases. The first phases of computation use a lookup table of pre-computed values and a polynomial approximation, a combination of which gives an accurate result for a majority of inputs. If it is found that the lookup table may not give an accurate result the next, slower phase is employed. This phase uses a multiple precision representation to compute results to precisions of up to 768 bits before rounding the result to double. As expected, this kills performance; the slowest path for `pow`

for example is a few thousand times slower than the table lookup phase.

I looked into this problem not very long ago and tried to improve performance of the multiple precision phase so that it is not as bad as it currently is. The result of that was an improvement of about 8 times in the performance of the slowest path of the `pow`

function. Other transcendentals got similar improvements since the fixes were mostly in the generic multiple precision code. These improvements went into glibc-2.18 upstream. We’ll take a look at what these changes were but first, a quick look at the multiple precision number for context on the changes.

## The multiple precision number

The multiple precision number is implemented in glibc as a structure with an integral exponent and an array of numbers for the mantissa.

typedef struct { int e; double d[40]; } mp_no;

The exponent e can be any integer value and the array d represents the mantissa. Each number in d is a non-negative integer less than 2^{24}. Only 32 of the 40 digits are used (the rest being there for additional precision for rounding), giving a maximum precision of 768 bits. The first question one may have is why the mantissa digits are double and not int if their values are going to be only integral. We’ll get to that point later, when we look at one of the relevant improvements.

The multiple precision bits implement addition, subtraction and multiplication as primitives – division is implemented as multiplication of one number with the inverse of the other. A simple analysis of test programs using oprofile or gprof show that the slow path performance is heavily dependent on the performance of the multiplication function. As much as 97% of the time in computation of `pow`

is spent in the multiplication function! So it is obvious that the first place to look at to get improvements is the multiplication routine.

The multiplication routine is actually quite simple, but that is its bane. It emulates the long multiplication algorithm (the method most of us must have learned in school) using a nested loop on the mantissa digits. The inner loop adds the results of the multiplication of individual digits along the column, divides the result by 2^{24}, keeps the remainder and adds the quotient to the higher digit. The outer loop traverses across the digits. To make the algorithm simple, it is assumed that all of the 32 digits are in use and hence the loop runs over all of the 32 digits.

## Multiplying zeroes is obviously a waste

When a number is converted from double to the mp_no format, only 3 mantissa digits are in use. Initial computations don’t use much more than that and in fact for smaller multiple precision phases (240 bits for example), only 10 digits are in use. Despite that, the multiplication algorithm iterates over all of the 32 digits all the time, which is wasteful. So the first optimization was to find the longest digit sequence among the two multiplicands and iterate only up to that point. This is an extra iteration, but for every run of length m on digit sequence length n, we save n^{2} – m^{2} – 2 * m operations.

## Squaring is a special case of multiplication

The long form multiplication algorithm has a neat sequence, which allows us to loop through the digits in nested loops to come up with the result. The sequence in the inner loop is of the form:

R[k] = SUM(X[i] * Y[j])

where i goes from 1 to k – 1 and j goes the other way, from k – 1 to 1. It is easy to see that if X and Y were the same, i.e. we were squaring a number, we just have to run the inner loop halfway and then double the result, adding in the remaining digit if there are an odd number of digits. Since squaring was not an uncommon operation, I wrote a separate function for it that was a special case of multiplication.

## Karatsuba-influenced multiplication

Let us look at the SUM operation above a bit closer. It can be written as:

R[k] = X[1] * Y[k - 1] + X[2] * Y[K - 2] + ... + X[k - 1] * Y[1]

rearranging the terms and grouping them, we can write the above as:

R[k] = (X[1]*Y[k - 1] + X[k-1]*Y[1]) + (X[2]*Y[k-2] + X[k-2]*Y[2]) + ...

Now, we know that `(X[i] + Y[i]) * (X[j] + Y[j])`

is `X[i]*X[j] + Y[i]*Y[j] + X[i]*Y[j] + X[j]*Y[i]`

, so we can write the above sum as:

R[k] = (X[1] + Y[1])*(X[k-1] + Y[k-1]) + (X[2] + Y[2])*(X[k-2] + Y[k-2]) + ... - SUM(X[i]*Y[i])

where i goes from 1 to k-1. With this notation, I could precompute the smaller summation into an array and then the remaining computations reduce by aout 20%, resulting in a net improvement in performance.

## Faster polynomial evaluation for exp

This improvement was specific to the `exp`

function, but the result improves `pow`

too, since `pow`

uses `exp`

. The old implementation computed the Taylor series result as follows:

e^{x}= 1 + x (1 + x / 2 (1 + x / 3 (1 + x/4 ...)))

This involved one multiple precision multiplication and one multiple precision division in each iteration. This can be improved by computing the result as the following instead:

e^{x}= 1 + (x * (n!/1! + x * (n!/2! + x * (n!/3! + x ...)))) / n!

Here, the factorials can be computed on the fly with a single primitive multiplication, which leaves just one multiple precision multiplication.

## Configurable mantissa types

We finally come to the mantissa types problem that I had alluded to in the beginning of the post. The mantissa digits were doubles even though they were only required to store and operate on integers. Even the intermediate operations would at best require int64_t storage to accommodate intermediate values of digits during multiplication. I started by replacing those with long, but found that performance on powerpc-based systems dropped. This reason for this is that a Power processor has 4 floating point units for a single fixed point unit, making floating point operations heavily parallelizable.

So the solution in the end was to have configurable types for mantissa digits, with powerpc sticking to the double type and everything else using int64_t. This again gave a fairly big boost to performance on x86 (~30%) since fixed point computation is significantly faster on x86.

## The road ahead

This is definitely not the end of the road for improvements in the multiple precision performance. However, I believe I have picked a lot of the low hanging fruit. There are other more complicated improvements, like the limitation of worst case precision for `exp`

and `log`

functions, based on the results of the paper Worst Cases for Correct Rounding of the Elementary Functions in Double Precision. One needs to prove that those results apply to the glibc multiple precision bits.

**Join Red Hat Developers, a developer program for you to learn, share, and code faster – and get access to Red Hat software for your development. The developer program and software are both free!**

How would you characterize the likelihood of extracting further platform-agnostic performance out of the frequently used non-floating-point portions of glibc? String and array functions, sorting, searching, pattern matching, parsing & formatting, etc.?

LikeLike

A lot of the string functions (especially the copy/move functions) are implemented in assembly and selected at runtime according to the target processor, so they’re very platform-specific. I haven’t done a deep enough study of the regex or the sort, search or formatting functions to make a useful comment about how they can be improved. If anybody is interested in taking up such a study I am sure we (i.e. me and the glibc upstream community) would love to mentor and help in whatever reasonable way we can.

LikeLike

Note that the link in the last paragraph concerns the decimal64 exponential function (paper “Worst Cases for the Exponential Function in the IEEE 754r decimal64 Format” published in 2008, not the paper “Worst Cases for Correct Rounding of the Elementary Functions in Double Precision” published in 2001).

FYI, the latest results concerning binary64 (a.k.a. double precision) are described in two presentations I’ve done for the “ANR TaMaDi project”: https://www.vinc17.net/research/slides/tamadi2010-10.pdf and https://www.vinc17.net/research/slides/tamadi2013-10.pdf also available from the TaMaDi wiki: http://tamadiwiki.ens-lyon.fr/ (but which is partly in French).

LikeLiked by 2 people

Thanks, I have now fixed the link to point to the paper I actually wanted to refer to.

LikeLike

Brilliant work! In the future could you integrate Fredrik Johansson’s implementation? http://arxiv.org/abs/1410.7176

LikeLike

Would it be possible to further increase performance by using SIMD operations? e.g. SSE on x86 architecture, NEON on ARM etc.

LikeLike

The multiplication double loop ought to be a candidate for vectorization but each iteration feeds into the previous element serially, making it difficult to parallelize. Maybe if that loop is written in a vectorization-friendly manner, it should eek out a bit more performance.

LikeLike

Re. speeding up trig functions:

We have discussed “exact range reduction” for trig functions multiple times in news:comp.arch, this is the key to handling large inputs with full accuracy. (Personally I believe it is stupid to ask for something like sin(1e60) and expect the result to have any significant mantissa bits left…)

Anyway, the easiest method is to start with an ~1100 bits accurate value for 0.5/pi, stored either as an array of uint32_t or as 24-bit fp values like you wrote about here:

From the input exponent you can determine if any initial range reduction is needed at all (i.e. abs(x) > 2*pi), and you can also figure out which chunks of the reciprocal can influence the final (reduced to the [-1 .. 1] range) product.

If the implied negative exponent of a given chunk plus the input exponent is larger than the chunk size (32/24 plus the input precision of 24 or 53 bits), then the product of these two numbers will be an integral value, corresponding to the same number of full circles which is what we are trying to get rid of.

Having determined the first valid chunk we then need to include enough of those following that the rest will only contribute negligible additional data: Shooting for twice the input precision is fine here!

At the end of this process we have converted the input range from an arbitrary number of radians to fractional circles, at which point we can (for many of the functions) use the top mantissa bits for the next stage of range reduction ending up with a very small range where a simple cheby or even truncated taylor series is sufficient to reach the desired final precision.

LikeLike

I don’t know much about the glibc code structure but perhaps someone should separate each math function into a different file to make it easy for math wizards to help or for hardware designers to add optimized code for different architectures.

LikeLike

That is exactly how we implement all functions in glibc, not just in libm. Look for files named e_exp.c, e_pow.c, etc. in the glibc sources.

LikeLike

Great discovery. Implementation of the new library will surely speed up many algorithms.

LikeLike

“with power sticking” -> “with powerpc sticking”

LikeLike

Thanks, fixed.

LikeLike

So we were wondering whether methods other than taylor had been considered for the worst case results. Taylor typically does not converege as fast as other alternatives.

LikeLike

No, my work was restricted to tweaking current methods. I would not have had enough time to explore and implement other methods.

LikeLike

Great work! Would you consider backporting these improvements into the glibc versions used in RHEL 6 and 7?

I and many other people have applications that could really benefit from increased transcendental function performance, e.g. geospacial calculations.

Otherwise, would it be possible for you to release a code extract with the functions that you have optimised? That way, it could be turned into a shared library that overrides the existing versions of these glibc transcendental functions with your optimised versions. That would avoid having to touch glibc.

I found this guide here and it doesn’t seem too hard:

http://hackerboss.com/overriding-system-functions-for-fun-and-profit/

LikeLike

We could consider backporting these improvements into RHEL-7 if there is interest; the way to indicate that is to just file a feature request or even better, raise a ticket with customer support. But before you do, you might want to see if the slow paths actually affect your application since they’re a very small fraction of the input domain.

You can see all of the relevant changes I’ve made using the following command on the upstream glibc repository:

You won’t be able to override those functions as is though, since they’re glibc internal. You can use the interposition trick you mentioned only for exported functions. If you’re looking to override the exported math functions (e.g. pow and not just the slow path of pow) then you’ll have to adapt and build all of the files involved in those functions, i.e. the wrappers that do error handling, the fast path implementation and then the slow path. That will be a lot of work, so you really want to know if that effort is worthwhile for your application.

LikeLike

In which version of glibc did or will the improvements enter?

LikeLike

These improvements went upstream in glibc 2.18. We haven’t yet backported them to RHEL.

LikeLike

Can you backport this to RHEL6?

LikeLike