
Floating-point formats and standards

This page is part of the web mail archives of SRFI 58 from before July 7th, 2015. The new archives for SRFI 58 contain all messages, not just those from before July 7th, 2015.

Bradd wrote:
>> The current names for flonum arrays are "real-64" and "real-32,"
>> corresponding to "IEEE 64.bit floating point real" and "IEEE 32.bit
>> floating point real." There are a few problems with this.

Aubrey Jaffer wrote:
> Those widths were quoted directly from R5RS 6.2.3 Implementation
> restrictions:
>   This report recommends, but does not require, that the IEEE 32-bit
>   and 64-bit floating point standards be followed by implementations
>   that use flonum representations, and that implementations using
>   other representations should match or exceed the precision
>   achievable using these floating point standards [IEEE].
> Since R5RS used widths rather than IEEE terms here, at least we are
> clear about the size of these floats.

Note that [IEEE] is "IEEE Standard 754-1985. IEEE Standard for Binary
Floating-Point Arithmetic. IEEE, New York, 1985." At the very least, the
SRFI should reference the same document in its bibliography to avoid any
confusion. Please note that the IEEE 754R working group is considering
revisions to the standard; see <http://754r.ucbtest.org/>. You can also
find "the technical content of ANSI/IEEE Std 754-1985" at the front page
of that site, which may help you with technical decisions.

>> Second, the format names are properly "single" and "double," not
>> "32.bit" and "64.bit."

> The phrases "single precision" and "double precision" do not appear in
> R5RS.  The only appearance of "SHORT, SINGLE, DOUBLE, and LONG"
> referring to numbers is in 6.2.4 Syntax of numerical constants.

IEEE 754 calls them "single format" and "double format." IEEE 854 calls
them "single precision" and "double precision." In the only place it
describes flonum formats, R5RS calls them "single" and "double." All
three use the same basic name. I'm not sure what your objection is here;
all three standards agree on the names.

However, that may change. I don't know how close the 754R working group
is to publishing a new standard, but they've changed the names of all
the formats, and it looks like they may drop all of the extended formats
except Intel's 80-bit double extended format. More on this below.

>> Third, both IEEE 754 and R5RS Scheme specify four floating-point
>> formats, but SRFI 58 currently supports only two. It should probably
>> support the other two types.

> Difficult at the moment, since I don't know what their sizes are, or
> their ordering.

See the website above for details; here's an overview.

IEEE 754 defines two "basic" formats (single and double), with precise
layout and precision requirements for binary compatibility between
implementations. While it only requires support for the single format,
all major workstation- and server-class FPUs support both of them
(mostly) in hardware.
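
As a rough illustration of those fixed layouts, the Python sketch below
splits a value into its sign, exponent, and fraction fields using the
standard `struct` module. The `LAYOUTS` table and the `fields` helper are
names made up for this example; the field widths (8/23 for single, 11/52
for double) are the ones used in the format table in this message.

```python
import struct

# format name: (float pack code, same-size unsigned int code,
#               exponent bits, stored fraction bits)
LAYOUTS = {
    "single": (">f", ">I", 8, 23),
    "double": (">d", ">Q", 11, 52),
}

def fields(x, fmt):
    """Return (sign, biased exponent, fraction) of x stored in a basic format."""
    float_code, int_code, exp_bits, frac_bits = LAYOUTS[fmt]
    # Reinterpret the packed float bytes as one unsigned integer.
    (bits,) = struct.unpack(int_code, struct.pack(float_code, x))
    sign = bits >> (exp_bits + frac_bits)
    exponent = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    fraction = bits & ((1 << frac_bits) - 1)
    return sign, exponent, fraction

print(fields(1.0, "single"))   # (0, 127, 0)
print(fields(-2.0, "double"))  # (1, 1024, 0)
```

Because the layout is fixed by the standard, the same bit patterns come
out on any conforming implementation.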

The current standard also recommends support for one implementation-
defined "extended" format for temporary results (to minimize rounding
errors in intermediate calculations). As far as I know, only x86-type
FPUs have ever actually done this. All of the x86 FPU registers are 80
bits wide; the FPU only rounds to single or double precision when you
store results to memory in one of the basic formats.
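
The effect of rounding on store can be sketched in Python. Python has no
80-bit type, so this example assumes binary64 as the "wide" working
format and binary32 as the storage format; `store_as_single` is a
hypothetical helper, not part of any library.

```python
import struct

def store_as_single(x):
    """Round a binary64 value to binary32, as a store to a single slot would."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

third = 1.0 / 3.0                    # computed in the wider format
stored = store_as_single(third)      # rounded only when "stored"

print(stored == third)               # False: the store lost low-order bits
print(store_as_single(0.5) == 0.5)   # True: 0.5 is exact in both formats
```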

Everyone except Intel (and x86-cloners) has taken a different approach.
They all offer a 128-bit flonum, but they don't use it for intermediate
results as IEEE 754 recommends, possibly because many language standards
(inexplicably) discourage it by overspecifying the results of
calculations. There are two major 128-bit representations: the "quad"
and the "double-double." The quad format works just like the IEEE basic
formats, only bigger, and is almost always implemented in software. Many
vendors implement double-double instead, using special FPU ops to
improve speed at the cost of precision. (A double-double carries about
106 significand bits, somewhat fewer than a quad, and only the exponent
range of a double.)
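
To make the double-double idea concrete: a value is kept as an
unevaluated sum of two doubles, a leading part plus a small correction.
The sketch below uses the well-known TwoSum error-free addition to
produce such a pair. It only illustrates the representation; it is not
any vendor's actual implementation, and `two_sum` is a name chosen for
this example.

```python
from fractions import Fraction

def two_sum(a, b):
    """Return (s, e) where s is the rounded sum a + b and s + e == a + b exactly."""
    s = a + b
    b_virtual = s - a            # the part of b that made it into s
    a_virtual = s - b_virtual    # the part of a that made it into s
    return s, (a - a_virtual) + (b - b_virtual)

tiny = 2.0 ** -60
hi, lo = two_sum(1.0, tiny)

print(1.0 + tiny == 1.0)   # True: a single double drops the small term
# The (hi, lo) pair keeps it; check exactly with rational arithmetic.
print(Fraction(hi) + Fraction(lo) == Fraction(1) + Fraction(1, 2 ** 60))  # True
```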

The 754R folks have recognized this /de facto/ standard, and it looks
like they'll adopt these formats for the revised /de jure/ standard,
with one exception: They want vendors to drop the fast-but-weak double-
double in favor of the quad format. Meanwhile, they're adding four new
formats and changing all the names. It looks like the full list will be:

    New name    Sig   Exp   Old name   Currently implemented by

    binary16     10     5
    binary32     23     8   single     all systems (hardware)
    binary64     52    11   double     all systems (hardware)
    binary80     64    15   extended   all x86-based systems (hardware)
    binary128   112    15   quad       most RISC systems (software)

    decimal32     7    7½
    decimal64    16    9½
    decimal128   34   13½

    Sig = width of significand in bits (binary) or digits (decimal)
    Exp = width of exponent in bits (both bases)
    New = name proposed by 754R working group
    Old = name used in most FPU manuals
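
A quick consistency check on the binary rows, written as a Python
sketch: one sign bit plus the exponent and significand fields should
account for the full width of each format. This assumes Sig counts
stored bits only, so binary16 has 10 (11 if the implicit leading bit is
counted), while binary80 stores its leading bit explicitly. The
dictionary and helper names are made up for this example.

```python
# name: (total bits, exponent bits, stored significand bits)
BINARY_FORMATS = {
    "binary16": (16, 5, 10),
    "binary32": (32, 8, 23),
    "binary64": (64, 11, 52),
    "binary80": (80, 15, 64),    # leading significand bit stored explicitly
    "binary128": (128, 15, 112),
}

def exponent_bias(exp_bits):
    """Bias of an IEEE-style exponent field that is exp_bits wide."""
    return (1 << (exp_bits - 1)) - 1

for name, (total, exp_bits, sig_bits) in BINARY_FORMATS.items():
    # sign bit + exponent field + significand field fill the format
    assert 1 + exp_bits + sig_bits == total
    print(name, "exponent bias:", exponent_bias(exp_bits))
```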

If you want names that most programmers are already familiar with, use
the "old" names from the table. If you want bit widths, use the "new"
names. Providing both as synonyms might be a good idea. Here are my
recommendations for "Schemey" versions of all the names:

    binary16    flonum-b16
    binary32    flonum-b32  and single-flonum
    binary64    flonum-b64  and double-flonum
    binary80    flonum-b80  and extended-flonum (or long-flonum)
    binary128   flonum-b128 and quad-flonum

    decimal32   flonum-d32
    decimal64   flonum-d64
    decimal128  flonum-d128

I'd only make support mandatory for b32 and b64. It looks like 754R is
encouraging vendors to migrate from extended and double-double to b128
(quad), but they expect some holdouts.

>> Fourth, if the SRFI requires IEEE 754 representations, it should
>> also mandate a particular correspondence between IEEE 754 formats
>> and Scheme precisions (e.g., f=single, s=single extended, etc).
>> Otherwise, users won't be able to match literal reals to array
>> types reliably.

> Okay, what's the correspondence?

F/single is the single (binary32) format.
D/double is the double (binary64) format.

L/long is the system's "long double" type, one of:
    extended (binary80)
    quad (binary128)
    double-double (implementation defined)

S/short is currently useless, and best used as an alias for "single." In
the future, it might be useful if your system implements binary16.
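
The correspondence above can be written down as a small lookup, sketched
here in Python. The regular expression covers only simple decimal
literals with an explicit s/f/d/l exponent marker, not the full R5RS
number syntax, and `MARKER_FORMAT` and `classify` are hypothetical names
for this example.

```python
import re

# Exponent marker -> format, following the correspondence proposed above.
MARKER_FORMAT = {
    "s": "single",   # short: treated as an alias for single
    "f": "single",
    "d": "double",
    "l": "long",     # extended, quad, or double-double, depending on the system
}

# Simplified literal: optional sign, digits with optional fraction,
# one precision marker, then a signed decimal exponent.
LITERAL = re.compile(r"^([+-]?(?:\d+\.?\d*|\.\d+))([sfdl])([+-]?\d+)$",
                     re.IGNORECASE)

def classify(literal):
    """Return (value, format name) for a literal such as '1.5d3'."""
    m = LITERAL.match(literal)
    if m is None:
        raise ValueError("no precision marker in %r" % literal)
    mantissa, marker, exponent = m.groups()
    return float(mantissa + "e" + exponent), MARKER_FORMAT[marker.lower()]

print(classify("1.5d3"))   # (1500.0, 'double')
print(classify("2f0"))     # (2.0, 'single')
```

With a fixed mapping like this, a reader can tell from the literal alone
which array type it is meant to match.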

I hope this helps. I certainly learned a lot about floating-point
formats as I was researching details for this comment.

>> Finally, the array element prefix should match IEEE 754 (single,
>> double), Scheme (f/single, d/double), or SRFI 47 (ar32, ar64). The
>> current "real-32" and "real-64" prefixes don't quite match any of
>> those.

> (define A:real-64    ar64)
> (define A:real-32    ar32)
> SRFI-47 should be modified (if that is possible) or replaced after we
> are finished with SRFI-58.

Yeah, I missed that earlier; sorry about that.
Bradd W. Szonye