
Multiple precisions of floating-point arithmetic

This page is part of the web mail archives of SRFI 77 from before July 7th, 2015. The new archives for SRFI 77 contain all messages, not just those from before July 7th, 2015.

Some floating-point applications need greater-than-64-bit-precision arithmetic; two are mentioned below.

Perhaps this SRFI should tackle the problem of providing floating-point arithmetic at various precisions. If we think this might be needed, then the specially-named-operator approach to floating-point arithmetic suggested in this SRFI (which I like, by the way) does not seem to scale well.

Common Lisp has an approach that is perhaps cumbersome to use properly and may be error-prone, but it does allow implementations to provide, and programs to use, differing precisions of floating-point arithmetic where they are useful.

Or, if one wants to keep the special-name approach, perhaps one could use the naming convention used in C: "name" for the default double-precision operation, "name"f for the single-precision (32-bit) operation, and "name"l for the long double operation (whether 80-bit extended precision, 128-bit quad precision, or 128-bit pair-of-64-bit-doubles precision).
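To make the "pair-of-64-bit-doubles" format concrete, here is a minimal Python sketch of how such a value can be represented and added. It assumes Python floats are IEEE 64-bit doubles (true on common platforms). The names `two_sum` and `dd_add` are hypothetical, chosen for this illustration, and the addition is a simple, not fully rigorous variant rather than what any particular long double implementation does.

```python
def two_sum(a, b):
    """Return (s, err) with s = fl(a + b) and a + b == s + err exactly
    (the classic branch-free error-free transformation)."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err


def dd_add(x, y):
    """Add two pair-of-doubles values, each given as a (hi, lo) tuple.

    This is a simple variant for illustration: it adds the high parts
    exactly, folds in the low parts, then renormalizes.
    """
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return two_sum(s, e)


# A single double cannot hold 1 + 2**-80, but a pair of doubles can.
hi, lo = dd_add((1.0, 0.0), (2.0 ** -80, 0.0))
```

The high word alone is just `1.0`; the low word carries the part that a single 64-bit double would have discarded.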


Examples of effective use of 128-bit floating-point arithmetic:

The following problem was pointed out by Philip W Sharp at the University of Auckland in a talk on the long-time simulation of the solar system.

As computers get faster, round-off error accumulates more quickly, and, indeed, scientists are reaching the end of usefulness of 64-bit IEEE floating-point arithmetic for long-time simulations of the behavior of the solar system. There's a paper here that discusses this issue:


Basically, if you want to simulate the solar system for longer times you'll need an underlying arithmetic with more accuracy.
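As a toy illustration of how round-off accumulates over many steps (this is not a solar-system integrator, only repeated addition of a fixed step), the following Python sketch compares a naive running sum in 64-bit doubles against the exact rational result. The function names and the default step and count are assumptions made for this example.

```python
import math
from fractions import Fraction


def naive_sum(step, n):
    """Add `step` to a running total n times in ordinary double arithmetic."""
    total = 0.0
    for _ in range(n):
        total += step
    return total


def drift_demo(step=0.1, n=1_000_000):
    """Return (naive_error, compensated_error) relative to the exact sum.

    The exact sum is computed with rationals from the double value of
    `step`, so the errors measure accumulated rounding only.
    """
    exact = Fraction(step) * n
    naive = naive_sum(step, n)
    compensated = math.fsum([step] * n)  # correctly rounded sum
    naive_err = abs(Fraction(naive) - exact)
    comp_err = abs(Fraction(compensated) - exact)
    return float(naive_err), float(comp_err)


naive_err, comp_err = drift_demo()
```

The naive error grows with the number of steps, while the correctly rounded sum stays within one rounding of the exact value. More steps per simulated year, or more simulated years, means more accumulated error in the naive case.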

Beyond using extended-precision arithmetic for accurate evaluation of the elementary functions, this was the first "real" application that I had heard of that needed more than 64-bit arithmetic.

Then Colin Percival published his paper "Rapid multiplication modulo the sum and difference of highly composite numbers",

www.ams.org/mcom/2003-72-241/S0025-5718-02-01419-9/S0025-5718-02-01419-9.pdf

which gives new bounds for the error in FFTs implemented in floating-point arithmetic. This allows you to use FFTs to implement bignum arithmetic with inputs of size 256 * (1024)^2 bits in 64-bit IEEE arithmetic with proven accuracy. (Most codes for FFT bignum arithmetic use number-theoretic FFTs on finite fields.) This is not as big as some applications would like. With 128-bit arithmetic, however, one could very easily implement fast, provably accurate bignum multiplication for sizes as big as one might ever need (and I don't think I'll live long enough to see that statement made false). By 128-bit arithmetic I mean either so-called quad precision, with a 15-bit exponent and 113-bit mantissa, or IBM-type long double implemented as a pair of doubles, which has the same dynamic range as 64-bit IEEE arithmetic but about 106 bits of precision.
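For readers who have not seen bignum multiplication built on a floating-point FFT, here is a small Python sketch of the idea. It is a textbook recursive FFT, not the method of the paper, and all names (`fft`, `to_digits`, `fft_multiply`) and the 8-bit digit size are assumptions made for this example. It only measures the rounding error it happens to observe; a proven bound of the kind the paper gives is what guarantees the error stays below 1/2 for every input of a given size.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT of a list whose length is a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        t = w * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def to_digits(x, bits):
    """Split a non-negative integer into little-endian digits of `bits` bits."""
    mask = (1 << bits) - 1
    digits = []
    while x:
        digits.append(x & mask)
        x >>= bits
    return digits or [0]


def fft_multiply(x, y, bits=8):
    """Multiply non-negative integers via a double-precision FFT.

    Returns (product, worst), where `worst` is the largest distance of any
    convolution coefficient from the nearest integer before rounding.
    The result is only trustworthy while `worst` stays well below 0.5.
    """
    a, b = to_digits(x, bits), to_digits(y, bits)
    n = 1
    while n < len(a) + len(b):
        n *= 2
    fa = fft([complex(d) for d in a] + [0j] * (n - len(a)))
    fb = fft([complex(d) for d in b] + [0j] * (n - len(b)))
    conv = fft([p * q for p, q in zip(fa, fb)], invert=True)
    coeffs, worst = [], 0.0
    for c in conv:
        v = c.real / n  # the inverse transform needs the 1/n scaling
        r = round(v)
        worst = max(worst, abs(v - r))
        coeffs.append(r)
    # Reassemble the integer; shifting and adding performs the carries.
    product = sum(c << (bits * i) for i, c in enumerate(coeffs))
    return product, worst


product, worst = fft_multiply(3 ** 200, 7 ** 150)
```

As the inputs grow, the convolution coefficients get larger and the observed `worst` creeps toward 0.5, which is why the input size that 64-bit arithmetic can handle with proven accuracy is limited and why a wider floating-point format raises that limit.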

I think that, given the effort and expense put into designing fast floating-point arithmetic units, bignum arithmetic built on floating-point FFTs will, in the end, be faster than the number-theoretic FFTs now popular among the "really big bignum" folks.