
This page is part of the web mail archives of SRFI 58 from before July 7th, 2015. The new archives for SRFI 58 are here. Eventually, the entire history will be moved there, including any new messages.

- *To*: srfi-58@xxxxxxxxxxxxxxxxx
- *Subject*: Floating-point formats and standards
- *From*: "Bradd W. Szonye" <bradd+srfi@xxxxxxxxxx>
- *Date*: Wed, 5 Jan 2005 03:48:09 -0800
- *Delivered-to*: srfi-58@xxxxxxxxxxxxxxxxx
- *In-reply-to*: <20050105055413.5FA901B7717@xxxxxxxxxxxxxxxx>
- *Mail-followup-to*: srfi-58@xxxxxxxxxxxxxxxxx
- *References*: <Pine.LNX.4.44.0412262207080.10074-100000@xxxxxxxxxxxxxxxxxxxxxxxxxxx> <20041230222337.3D7BD1B7711@xxxxxxxxxxxxxxxx> <Pine.LNX.4.58.0412301550550.3862@xxxxxxxxxxxxxx> <20050105012438.GD6573@xxxxxxxxxxxxxxx> <20050105055413.5FA901B7717@xxxxxxxxxxxxxxxx>
- *User-agent*: Mutt/1.4.1i

Bradd wrote:
>> The current names for flonum arrays are "real-64" and "real-32,"
>> corresponding to "IEEE 64.bit floating point real" and "IEEE 32.bit
>> floating point real." There are a few problems with this.

Aubrey Jaffer wrote:
> Those widths were quoted directly from R5RS 6.2.3 Implementation
> restrictions:
>
> This report recommends, but does not require, that the IEEE 32-bit
> and 64-bit floating point standards be followed by implementations
> that use flonum representations, and that implementations using
> other representations should match or exceed the precision
> achievable using these floating point standards [IEEE].
>
> Since R5RS used widths rather than IEEE terms here, at least we are
> clear about the size of these floats.

Note that [IEEE] is "IEEE Standard 754-1985. IEEE Standard for Binary
Floating-Point Arithmetic. IEEE, New York, 1985." At the very least, the
SRFI should reference the same document in its bibliography to avoid any
confusion.

Please note that the IEEE 754R working group is considering revisions to
the standard; see <http://754r.ucbtest.org/>. You also can find "the
technical content of ANSI/IEEE Std 754-1985" at the front page of that
site, which may help you with technical decisions.

>> Second, the format names are properly "single" and "double," not
>> "32.bit" and "64.bit."

> The phrases "single precision" and "double precision" do not appear in
> R5RS. The only appearance of "SHORT, SINGLE, DOUBLE, and LONG"
> referring to numbers is in 6.2.4 Syntax of numerical constants.

IEEE 754 calls them "single format" and "double format." IEEE 854 calls
them "single precision" and "double precision." In the only place it
describes flonum formats, R5RS calls them "single" and "double." All
three use the same basic name. I'm not sure what your objection is here;
all three standards agree on the names.

However, that may change.
I don't know how close the 754R working group is to publishing a new
standard, but they've changed the names of all the formats, and it looks
like they may drop all of the extended formats except Intel's 80-bit
double extended format. More on this below.

>> Third, both IEEE 754 and R5RS Scheme specify four floating-point
>> formats, but SRFI 58 currently supports only two. It should probably
>> support the other two types.

> Difficult at the moment, since I don't know what their sizes are, or
> their ordering.

See the website above for details; here's an overview.

IEEE 754 defines two "basic" formats (single and double), with precise
layout and precision requirements for binary compatibility between
implementations. While it only requires support for the single format,
all major workstation- and server-class FPUs support both of them
(mostly) in hardware.

The current standard also recommends support for one
implementation-defined "extended" format for temporary results (to
minimize rounding errors in intermediate calculations). As far as I
know, only x86-type FPUs have ever actually done this. All of the x86
FPU registers are 80 bits wide; it only rounds to single or double
precision when you store results to memory in one of the basic formats.

Everyone except Intel (and x86-cloners) has taken a different approach.
They all offer a 128-bit flonum, but they don't use it for intermediate
results as IEEE 754 recommends, possibly because many language standards
(inexplicably) discourage it by overspecifying the results of
calculations.

There are two major 128-bit representations: the "quad" and the
"double-double." The quad format works just like the IEEE basic formats,
only bigger, and almost always implemented in software. Many vendors
implement double-double instead, using special FPU ops to improve speed
at the cost of precision. (Double-doubles are only a little more precise
than doubles, despite all the extra bits.)
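To make the two "basic" formats described above a little more concrete, here is a minimal sketch in Python. It assumes the standard `struct` module and a platform where Python floats are stored as IEEE doubles (true of mainstream builds, but still an assumption). The helper names `store_as` and `binary32_fields` are made up for this illustration and come from neither IEEE 754 nor any SRFI. It shows that storing a value in the single format rounds it, and that the single layout splits into a 1-bit sign, an 8-bit exponent, and a 23-bit fraction.

```python
import struct


def store_as(value, fmt):
    """Store value in the given struct format and read it back as a float."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def binary32_fields(value):
    """Return (sign, biased exponent, fraction) of value's single-format encoding."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF       # 23 stored significand bits
    return sign, exponent, fraction


third = 1.0 / 3.0                    # computed as a double (assumed binary64)
as_single = store_as(third, ">f")    # rounded when stored in the single format
as_double = store_as(third, ">d")    # already a double, so nothing is lost

print("single keeps the value exactly:", as_single == third)
print("double keeps the value exactly:", as_double == third)
print("fields of 1.0 in single format:", binary32_fields(1.0))
```

The same round trip with the `"e"` format code would exercise a 16-bit half-precision encoding, which is the closest `struct` offers to the binary16 format proposed by 754R.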
The 754R folks have recognized this /de facto/ standard, and it looks
like they'll adopt these formats for the revised /de jure/ standard,
with one exception: They want vendors to drop the fast-but-weak
double-double in favor of the quad format. Meanwhile, they're adding
four new formats and changing all the names. It looks like the full list
will be:

    New name    Sig  Exp  Old name  Currently implemented by
    binary16     11    5
    binary32     23    8  single    all systems (hardware)
    binary64     52   11  double    all systems (hardware)
    binary80     64   15  extended  all x86-based systems (hardware)
    binary128   112   15  quad      most RISC systems (software)
    decimal32     7   7½
    decimal64    16   9½
    decimal128   34  13½

Key:

    Sig = width of significand in bits (binary) or digits (decimal)
    Exp = width of exponent in bits (both bases)
    New = name proposed by 754R working group
    Old = name used in most FPU manuals

If you want names that most programmers are already familiar with, use
the "old" names from the table. If you want bit widths, use the "new"
names. Providing both as synonyms might be a good idea. Here are my
recommendations for "Schemey" versions of all the names:

    binary16    flonum-b16
    binary32    flonum-b32   and single-flonum
    binary64    flonum-b64   and double-flonum
    binary80    flonum-b80   and extended-flonum (or long-flonum)
    binary128   flonum-b128  and quad-flonum
    decimal32   flonum-d32
    decimal64   flonum-d64
    decimal128  flonum-d128

I'd only make support mandatory for b32 and b64. It looks like 754R is
encouraging vendors to migrate from extended and double-double to b128
(quad), but they expect some holdouts.

>> Fourth, if the SRFI requires IEEE 754 representations, it should
>> also mandate a particular correspondence between IEEE 754 formats
>> and Scheme precisions (e.g., f=single, s=single extended, etc).
>> Otherwise, users won't be able to match literal reals to array
>> types reliably.

> Okay, what's the correspondence?

F/single is the single (binary32) format.
D/double is the double (binary64) format.
L/long is the system's "long double" type, one of:

    extended (binary80)
    quad (binary128)
    double-double (implementation defined)

S/short is currently useless, and best used as an alias for "single."
In the future, it might be useful if your system implements binary16.

I hope this helps. I certainly learned a lot about floating-point
formats as I was researching details for this comment.

>> Finally, the array element prefix should match IEEE 754 (single,
>> double), Scheme (f/single, d/double), or SRFI 47 (ar32, ar64). The
>> current "real-32" and "real-64" prefixes don't quite match any of
>> those.

> (define A:real-64 ar64)
> (define A:real-32 ar32)
>
> SRFI-47 should be modified (if that is possible) or replaced after we
> are finished with SRFI-58.

Yeah, I missed that earlier; sorry about that.

--
Bradd W. Szonye
http://www.szonye.com/bradd

