SRFI 169

Title

Underscores in numbers

Author

Lassi Kortela

Status

This SRFI is currently in final status. Here is an explanation of each status that a SRFI can hold. To provide input on this SRFI, please send email to srfi-169@nospamsrfi.schemers.org. To subscribe to the list, follow these instructions. You can access previous messages via the mailing list archive.

Received: 2019-04-16
Draft #1 published: 2019-04-18
Draft #2 published: 2019-07-16
Draft #3 published: 2019-07-20
Finalized: 2019-07-26

Abstract

Many people find that large numbers are easier to read when the digits are broken into small groups. For example, the number 1582439 might be easier to read if written as 1 582 439. This applies to source code as it does to other writing. We propose an extension of Scheme syntax to allow the underscore as a digit separator in numerical constants.

Rationale
Specification
Examples
Implementation
Acknowledgements

Rationale

How many digits per group

Western cultures tend to divide digits into groups of three. This convention is not universal. For example, in India people write numbers like 3 14 15 926 (read three crore fourteen lakh fifteen thousand nine hundred and twenty-six in Indian English).

For simplicity and universality, we propose that digit groups of all sizes may be mixed freely when writing a number. It is permissible to have just one digit in a group, and groups in a number don’t need to be ordered by increasing or decreasing digit count.
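To make the free mixing of group sizes concrete, the following sketch splits a digit string on underscores and reports the size of each group. It is written in Python purely for illustration. The helper names `group_sizes` and `value` are hypothetical and not part of this SRFI, and only unsigned decimal integers are handled.

```python
def group_sizes(token: str) -> list[int]:
    """Return the size of each digit group in an underscore-separated number.

    Hypothetical helper for illustration; unsigned decimal integers only.
    """
    groups = token.split("_")
    # An empty group means a leading, trailing or doubled underscore.
    if not all(g.isascii() and g.isdigit() for g in groups):
        raise ValueError(f"not a digit-grouped integer: {token!r}")
    return [len(g) for g in groups]


def value(token: str) -> int:
    """Numeric value of the token; the grouping never affects it."""
    group_sizes(token)  # validate first
    return int(token.replace("_", ""))


# Western and Indian groupings are both accepted, and sizes may be mixed.
print(group_sizes("1_582_439"))    # [1, 3, 3]
print(group_sizes("3_14_15_926"))  # [1, 2, 2, 3]
print(value("3_14_15_926"))        # 31415926
```

Because the reader only strips the separators, no group size has to be configured or agreed upon in advance.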

What separator character to use

Human cultures and programming languages differ in what separator to use between groups.

The examples in this document so far have used a space. This is familiar to humans but not a good fit for most programming languages since whitespace has a prominent role as token separator. Scheme is no exception here.
The next natural alternative is to use a comma or a period. This is likely to cause confusion in an international community since countries that use a comma as the decimal separator are as numerous as those that use a period. More trouble comes from Scheme using the comma to unquote expressions inside a quasiquoted list: e.g. `(1,2) evaluates to (1 2). Allowing commas in numbers would change unquoting behavior in a confusing way.
C++ uses an apostrophe which is somewhat exotic and may call to mind units of measure, e.g. feet and inches. Scheme also uses the apostrophe for quotation: e.g. '(1'2) evaluates to (1 (quote 2)). Allowing apostrophes in numbers would change the meaning of this syntax.
The most popular digit group separator among programming languages is the underscore. It is in the standard syntax of Ada, C#, Clojure, Eiffel, Frink, Java, Julia, Kotlin, OCaml, Perl, Python, Ruby, Rust and Swift. It is also being added to JavaScript and is a common syntax extension in implementations of Standard ML. The Common Lisp standard permits it under the umbrella of potential numbers but we are not aware of implementations that use the opportunity. Of Scheme implementations, Gauche can read numbers with underscores when they have a radix or exactness prefix.

In light of the above, we consider the underscore to be the clear winner. It is the most widely compatible and least ambiguous choice, in both human and machine terms.

Potential ambiguity between numbers and identifiers

Languages in the Lisp family traditionally allow a larger set of characters in identifiers than do most other languages. For example, the tokens 1+ and 3*/! parse as symbols in Common Lisp. Scheme is slightly more restrictive: none of R⁴RS, R⁵RS, R⁶RS and R⁷RS recognize identifiers that begin with a decimal digit. Implementations can be more relaxed with identifiers. For example, MIT Scheme comes with 1+ and -1+ procedures to increment and decrement numbers. Several implementations presently parse tokens consisting entirely of digits and underscores as identifiers. Some implementations, such as Chicken, assume that anything they cannot recognize as a number is an identifier.

Countless languages outside the Lisp family have a convention of using underscores as word separators in multi-word identifiers. Following that convention, Scheme’s open-input-file would be spelled open_input_file instead. In these languages it’s common to use a leading underscore to mark private (as opposed to public or exported) identifiers. This leads to potential ambiguity with identifiers such as _123 that start with an underscore and contain only underscores and digits. Such tokens often parse as identifiers. If we made them parse as numbers in Scheme it could confuse programmers and spell trouble for code generators that translate Scheme identifiers to other languages.

Scheme supports a rich numeric tower of integers, ratios, real and complex numbers. These come in exact and inexact varieties. For real numbers, we have decimal-point and exponent notation. The Kawa implementation of Scheme adds quaternions and units of measure to the mix. Common Lisp’s potential numbers offer a glimpse of how far numerical syntax can go. These intricate extensions, some of which we cannot even anticipate yet, make it even trickier for us to specify a digit separation scheme devoid of ambiguity.

We attempt to solve these problems with a conservative rule that allows underscores only between digits. After considering everything in the above paragraphs, we did not manage to come up with any concrete examples of present or future tasks that would be impeded by this restricted version of the syntax extension.

As an extra measure we also forbid trailing underscores, and forbid more than one consecutive underscore. We could not think of any particular situations where these would cause problems but decided to avoid them anyway. There are enough similar gotchas that caution seems the wise choice.
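A minimal sketch of the conservative rule for plain decimal digit strings, assuming a regular expression is an acceptable way to state it: one or more digits, followed by any number of repetitions of a single underscore and one or more digits. The names `DIGITS_WITH_UNDERSCORES` and `allowed` are made up for this example.

```python
import re

# digit+ ( "_" digit+ )*  : an underscore may only sit between two digits.
DIGITS_WITH_UNDERSCORES = re.compile(r"[0-9]+(?:_[0-9]+)*")


def allowed(token: str) -> bool:
    """True if the digit string uses underscores only between digits."""
    return DIGITS_WITH_UNDERSCORES.fullmatch(token) is not None


for token in ["1_582_439", "_123", "123_", "1__000"]:
    print(token, allowed(token))
# 1_582_439 True
# _123 False
# 123_ False
# 1__000 False
```

The pattern rejects leading, trailing and repeated underscores without needing a separate check for each case.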

Printing numbers with underscores

This SRFI does not specify anything about inserting underscores into numbers at print time. Printing with underscores would be as useful as reading is, especially when using a Scheme read-eval-print loop as a calculator. However, there is no consensus on how to best extend the Scheme printer. Major work is underway but it will not stabilize in time for the publication of this SRFI.

Apart from printer extension concerns, the cultural conventions of where to place digit separators are also varied and complex. When reading numbers we can leave the decision to writers and simply accept a wide range of possibilities. When printing we would have to make those decisions, or else map out what printer options are needed and design good defaults for them.

For these reasons, decisions about printing are deferred to implementations and to future SRFIs.
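This SRFI specifies nothing about printing. Purely to illustrate the kind of option a printer would have to expose, here is a hedged Python sketch in which the size of the rightmost group and the size of the remaining groups are parameters. The function names and both parameters are assumptions made for this example, not a proposal.

```python
def group_digits(digits: str, first: int = 3, rest: int = 3) -> str:
    """Insert underscores into a digit string, grouping from the right.

    `first` is the size of the rightmost group, `rest` the size of all others.
    """
    if first < 1 or rest < 1:
        raise ValueError("group sizes must be positive")
    groups = []
    end = len(digits)
    size = first
    while end > 0:
        start = max(0, end - size)
        groups.append(digits[start:end])
        end = start
        size = rest
    return "_".join(reversed(groups))


def format_with_underscores(n: int, first: int = 3, rest: int = 3) -> str:
    """Format an exact integer with digit separators (illustrative only)."""
    sign = "-" if n < 0 else ""
    return sign + group_digits(str(abs(n)), first, rest)


print(format_with_underscores(1582439))         # 1_582_439
print(format_with_underscores(31415926, 3, 2))  # 3_14_15_926
```

Even this small sketch needs two options to cover the Western and Indian conventions mentioned earlier, which hints at why the design is left to future work.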

Specification

The underscore rule

We stipulate that conforming implementations of this SRFI must allow one underscore between any two digits, in any part of a number.

For the purpose of this rule, the term digit covers all digits in any radix between 2 and 36 inclusive, not only decimal digits. That means that the letters a-z and A-Z are considered digits (but only in places where the implementation parses that character as a digit).
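The dependence of "digit" on the radix can be sketched as follows. This Python fragment is an illustration under stated assumptions: the helper names `is_digit` and `underscores_between_digits` are hypothetical, the radix is passed in explicitly, and only a bare run of digits is examined (no sign, prefix or exponent).

```python
import string

ALPHABET = string.digits + string.ascii_lowercase  # "0-9" then "a-z", 36 characters


def is_digit(ch: str, radix: int) -> bool:
    """True if ch is a single digit in the given radix (2 to 36), ignoring case."""
    if not 2 <= radix <= 36:
        raise ValueError("radix must be between 2 and 36 inclusive")
    return len(ch) == 1 and ch.isascii() and ch.lower() in ALPHABET[:radix]


def underscores_between_digits(token: str, radix: int) -> bool:
    """Check that every underscore has a digit of this radix on both sides."""
    if not token:
        return False
    for i, ch in enumerate(token):
        if ch == "_":
            before = token[i - 1] if i > 0 else ""
            after = token[i + 1] if i + 1 < len(token) else ""
            if not (is_digit(before, radix) and is_digit(after, radix)):
                return False
        elif not is_digit(ch, radix):
            return False
    return True


print(underscores_between_digits("AB_CD_EF", 16))  # True
print(underscores_between_digits("AB_CD_EF", 10))  # False
print(underscores_between_digits("10_10_10", 2))   # True
```

The same token is accepted or rejected depending on the radix, which matches the parenthetical remark above: a letter counts as a digit only where it would be parsed as one.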

The rule is necessarily ambiguous and incomplete

We lament that it is impossible to give a precise formal definition of the underscore rule because a typical Scheme implementation does not have a complete formal grammar for its syntax. Even if it did, that grammar could change in new versions.

The next section gives what we believe to be a correct and complete extension to the formal grammar of standard Scheme. But since few implementations support the whole standard syntax and nothing but the standard syntax, implementors of this SRFI may encounter situations where their subjective judgment is called for. The subsequent section attempts to help by listing many examples of how the rule is intended to apply in particular situations. Unfortunately that list cannot be exhaustive either.

In situations where the letter of this specification does not say anything conclusive, we ask that implementors try to follow its spirit. When in doubt as to whether or not underscores should be supported in a particular part of number syntax, we suggest that implementors not allow them. They can always be allowed later once there is more clarity or consensus.

The rule as applied to RⁿRS

The standard syntax of Scheme is defined in:

R⁶RS section 4.2. Lexical syntax
R⁷RS section 7.1. Formal syntax

The underscore rule can be implemented as an extension to either standard by adding the following grammar rules to the lexical syntax:

⟨digits R⟩       = ⟨digit R⟩+ ⟨more digits R⟩
⟨more digits R⟩  = ⟨empty⟩ | ⟨one underscore⟩ ⟨digits R⟩
⟨maybe digits R⟩ = ⟨empty⟩ | ⟨digits R⟩
⟨one underscore⟩ = _

and then making the following substitutions in existing rules (for all R):

replace all occurrences of ⟨digit R⟩+ with ⟨digits R⟩
replace all occurrences of ⟨digit R⟩* with ⟨maybe digits R⟩

Note that both standards also define the character classes ⟨digit⟩ and ⟨hex digit⟩. Neither of those should be amended. The ⟨digit⟩ class is used for identifiers and to help define other character classes. The ⟨hex digit⟩ class is used to define the backslash escape syntax for inserting characters into strings by their hexadecimal value. This SRFI does not modify the syntax of string escapes, and does not aim to modify the syntax of identifiers.
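The grammar extension above can be transcribed almost literally into a small recursive-descent recognizer. The sketch below is in Python for illustration only. It assumes the token and radix are already known, and all function names are hypothetical. Each function takes a position and returns the position after the matched text.

```python
from typing import Optional

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def is_digit(ch: str, radix: int) -> bool:
    """⟨digit R⟩: one ASCII digit of the given radix, ignoring case."""
    return ch.isascii() and ch.lower() in ALPHABET[:radix]


def parse_digits(text: str, pos: int, radix: int) -> Optional[int]:
    """⟨digits R⟩ = ⟨digit R⟩+ ⟨more digits R⟩"""
    start = pos
    while pos < len(text) and is_digit(text[pos], radix):
        pos += 1
    if pos == start:
        return None  # ⟨digit R⟩+ needs at least one digit
    return parse_more_digits(text, pos, radix)


def parse_more_digits(text: str, pos: int, radix: int) -> int:
    """⟨more digits R⟩ = ⟨empty⟩ | ⟨one underscore⟩ ⟨digits R⟩"""
    if pos < len(text) and text[pos] == "_":
        end = parse_digits(text, pos + 1, radix)
        if end is not None:
            return end
    return pos  # the ⟨empty⟩ alternative consumes nothing


def parse_maybe_digits(text: str, pos: int, radix: int) -> int:
    """⟨maybe digits R⟩ = ⟨empty⟩ | ⟨digits R⟩"""
    end = parse_digits(text, pos, radix)
    return pos if end is None else end


def is_digits(token: str, radix: int = 10) -> bool:
    """True if the whole token is a single ⟨digits R⟩."""
    return parse_digits(token, 0, radix) == len(token)


print(is_digits("0_1_2_3"))       # True
print(is_digits("01__23"))        # False
print(is_digits("AB_CD_EF", 16))  # True
```

A trailing underscore is simply left unconsumed by `parse_more_digits`, so the enclosing rule (here `is_digits`) is what rejects the token.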

Implications of the rule

The rule includes at least the following things:

Underscores between digits in numbers of any radix (binary, octal, decimal, hexadecimal and any others supported by the implementation).
Underscores between digits 0-9 a-z A-Z when a number is written in a radix higher than 10 (using the standard hexadecimal read syntax, or any implementation-defined read syntax).
Underscores in the numerator and/or denominator of a ratio.
Underscores in the integer, fractional and/or exponent part of a real number.
Underscores in the real and/or imaginary part of a complex number.
Underscores in any dimension of a hypercomplex number (for implementations with syntax for such numbers).
Underscores in both exact and inexact numbers.
Underscores in the quantity part of a number with a unit of measure (for implementations with syntax for units of measure).
Underscores between leading zeros (but not before the first zero).

The rule excludes at least the following things:

Leading underscores before digits.
Trailing underscores after digits.
Two or more consecutive underscores.
Underscores between sign and magnitude.
Underscores next to a letter in a prefix. This includes the #b #o #d #x radix prefixes, the #e #i exactness prefixes and the #nr arbitrary radix prefix of Chez Scheme.
Underscores next to # unknown digit markers in inexact numbers.
Underscores next to the d D e E f F l L s S exponent markers.
Underscores next to the @ + - i j k markers in complex and hypercomplex numbers.
Underscores next to the R⁶RS | mantissa width suffix.
Underscores next to and within the inf and nan markers.
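As a rough check of several items in the two lists above, the following Python sketch applies the underscore rule to decimal real numbers with an optional sign, fractional part and exponent. It rests on simplifying assumptions: only the `e`/`E` exponent marker is recognized, and prefixes, ratios, complex numbers, `inf` and `nan` are out of scope. The pattern names are invented for this example.

```python
import re

D = r"[0-9]+(?:_[0-9]+)*"                  # digits with underscores only between them
DECIMAL = rf"(?:{D}\.(?:{D})?|\.{D}|{D})"  # 1_0.5 | 1_0. | .5 | 1_0
EXPONENT = rf"(?:[eE][+-]?{D})?"           # optional exponent such as e1_2
REAL = re.compile(rf"[+-]?{DECIMAL}{EXPONENT}")


def conforming_real(token: str) -> bool:
    """True if the token is a decimal real that follows the underscore rule."""
    return REAL.fullmatch(token) is not None


print(conforming_real("0_1_23.4_5_6"))  # True
print(conforming_real("1_2e1_2"))       # True
print(conforming_real("+_0123"))        # False: underscore between sign and magnitude
print(conforming_real("123.5_e6"))      # False: underscore next to the exponent marker
print(conforming_real("0123._456"))     # False: underscore next to the decimal point
```

Because the underscore is only ever admitted inside `D`, every exclusion that involves a sign, a decimal point or an exponent marker follows without a dedicated check.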

Being lenient about the rule

Conforming implementations may be more lenient in what they allow (perhaps to maintain compatibility with existing code). In this document, numbers written according to the underscore rule are called conforming. Other numbers (which may or may not be valid depending on the implementation) are called non-conforming.

Examples

Integers

0123             ; conforming
0_1_2_3          ; conforming
0_123            ; conforming
01_23            ; conforming
012_3            ; conforming
+0123            ; conforming
+0_123           ; conforming
-0123            ; conforming
-0_123           ; conforming

_0123            ; non-conforming
0123_            ; non-conforming
0123__           ; non-conforming
01__23           ; non-conforming
0_1__2___3       ; non-conforming
+_0123           ; non-conforming
+0123_           ; non-conforming
-_0123           ; non-conforming
-0123_           ; non-conforming

Rational numbers

1_2_3/4_5_6_7    ; conforming
12_34/5_678      ; conforming

1_2_3/_4_5_6_7   ; non-conforming
_12_34/5_678     ; non-conforming

Real numbers

0_1_23.4_5_6     ; conforming
1_2_3.5e6        ; conforming
1_2e1_2          ; conforming

_0123.456        ; non-conforming
0123_.456        ; non-conforming
0123._456        ; non-conforming
0123.456_        ; non-conforming
123_.5e6         ; non-conforming
123._5e6         ; non-conforming
123.5_e6         ; non-conforming
123.5e_6         ; non-conforming
123.5e6_         ; non-conforming
12_e12           ; non-conforming
12e_12           ; non-conforming
12e12_           ; non-conforming

Complex numbers

-12_3.0_00_00-12_34.56_78i    ; conforming
-12_3.0_00_00@-12_34.56_78    ; conforming

-12_3.0_00_00-12_34.56_78_i   ; non-conforming
-12_3.0_00_00-12_34.56_78i_   ; non-conforming
-12_3.0_00_00_@-12_34.56_78   ; non-conforming
-12_3.0_00_00@_-12_34.56_78   ; non-conforming

Hypercomplex numbers

Kawa supports quaternions using the following syntax:

1+2i-3j+4k

By applying the rule a syntax like that can be extended as follows:

1_0+2_0i-3_0j+4_0k   ; conforming

1_0_+2_0i-3_0j+4_0k  ; non-conforming
1_0+2_0_i-3_0j+4_0k  ; non-conforming
1_0+2_0i-3_0j_+4_0k  ; non-conforming
1_0+2_0i-3_0j+4_0k_  ; non-conforming

Units of measure

Kawa supports units of measure using the following syntax:

123456cm^2

By applying the rule a syntax like that can be extended as follows:

123_456cm^2          ; conforming

123_456_cm^2         ; non-conforming
123_456.78_cm^2      ; non-conforming

Numbers with radix or exactness prefixes

#b10_10_10           ; conforming
#o23_45_67           ; conforming
#d45_67_89           ; conforming
#xAB_CD_EF           ; conforming
#x789_9B_C9_EF       ; conforming
#x-2_0               ; conforming
#o+2_345_6           ; conforming

#x-_2                ; non-conforming
_#x-_2               ; non-conforming
#d_45_67_89          ; non-conforming
#e_45/67_89          ; non-conforming
#i#o_1234            ; non-conforming
#i_#o_1234           ; non-conforming
#e#x1234_            ; non-conforming
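The prefixed examples above can be checked mechanically with a sketch like the one below. It is a Python illustration under stated assumptions: only lowercase `#b #o #d #x #e #i` prefixes are recognized, only integers are handled (no ratios, reals or complex numbers), and the function name is hypothetical.

```python
import re

RADIX_PREFIXES = {"#b": 2, "#o": 8, "#d": 10, "#x": 16}
EXACTNESS_PREFIXES = ("#e", "#i")
ALPHABET = "0123456789abcdef"


def conforming_prefixed_integer(token: str) -> bool:
    """True if an optionally prefixed integer follows the underscore rule."""
    rest = token
    radix = None
    seen_exactness = False
    # At most one radix prefix and one exactness prefix, in either order.
    for _ in range(2):
        head = rest[:2]
        if head in RADIX_PREFIXES and radix is None:
            radix = RADIX_PREFIXES[head]
            rest = rest[2:]
        elif head in EXACTNESS_PREFIXES and not seen_exactness:
            seen_exactness = True
            rest = rest[2:]
    digit = "[" + ALPHABET[: radix or 10] + "]"
    # Optional sign, then digit groups separated by single underscores.
    pattern = rf"[+-]?{digit}+(?:_{digit}+)*"
    return re.fullmatch(pattern, rest, re.IGNORECASE | re.ASCII) is not None


print(conforming_prefixed_integer("#xAB_CD_EF"))   # True
print(conforming_prefixed_integer("#o+2_345_6"))   # True
print(conforming_prefixed_integer("#d_45_67_89"))  # False
print(conforming_prefixed_integer("#i_#o_1234"))   # False
```

Prefixes are consumed before any digit matching starts, so an underscore can never be adjacent to a prefix letter in an accepted token.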

Implementation

The sample implementation is a portable library that depends only on standard features in the R⁷RS small language. It is available at:

github.com/scheme-requests-for-implementation/srfi-169

The library exports one procedure (read-number) which takes no arguments. The procedure reads one Scheme number from current-input-port with support for optional underscores. It signals an error if underscores are used in a non-conforming way according to the rule stipulated in this SRFI, or if the number syntax (sans underscores) does not conform to the R⁷RS specification. The reader supports most of the R⁷RS numeric tower with the notable exception of complex numbers. The values of inexact numbers may diverge from the values produced by the native reader of a Scheme implementation if it uses different formulas for numeric conversion.
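To give a feel for what such a reader does, here is a small Python sketch of the same idea. It is not a port of the sample implementation. It assumes an explicit port argument instead of the current input port, handles only decimal integers and ratios, and uses a simplified delimiter set. All names are hypothetical.

```python
import io
import re
from fractions import Fraction

DIGITS = r"[0-9]+(?:_[0-9]+)*"
INTEGER = re.compile(rf"[+-]?{DIGITS}")
RATIO = re.compile(rf"[+-]?{DIGITS}/{DIGITS}")
DELIMITERS = set(" \t\r\n()\";")  # simplified set of token delimiters


def read_token(port) -> str:
    """Read characters up to, but not including, the next delimiter."""
    chars = []
    while True:
        here = port.tell()
        ch = port.read(1)
        if ch == "":
            break  # end of input
        if ch in DELIMITERS:
            port.seek(here)  # leave the delimiter for the next reader
            break
        chars.append(ch)
    return "".join(chars)


def read_number(port):
    """Read one integer or ratio, allowing conforming underscores."""
    token = read_token(port)
    plain = token.replace("_", "")
    if INTEGER.fullmatch(token):
        return int(plain)
    if RATIO.fullmatch(token):
        numerator, denominator = plain.split("/")
        return Fraction(int(numerator), int(denominator))
    raise ValueError(f"non-conforming or unsupported number: {token!r}")


port = io.StringIO("1_000_000 next")
print(read_number(port))  # 1000000
print(repr(port.read()))  # ' next'
```

Validation happens on the token with its underscores still in place. They are stripped only for the numeric conversion, so a non-conforming token such as `0123__` signals an error instead of being silently accepted.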

The code was tested against the examples in this SRFI. Correct results were obtained with Chibi-Scheme, Gauche and Kawa. The test harness is included with the implementation.

Acknowledgements

This SRFI is the result of an impromptu design session on the srfi-discuss mailing list over the weekend between April 12th and April 15th, 2019.

Lassi Kortela suggested the idea, worked out the examples and rationale, wrote this document and produced the sample implementation. However, the design is entirely a group effort.

John Cowan provided invaluable expertise on human and computer languages. He cautioned against requiring a fixed number of digits per group and provided the Indian English example. John found the extensive list of programming languages already using underscores. John and Lassi cautioned against the ambiguity of using commas as delimiters.

Per Bothner introduced Kawa's extended number syntax and noted Common Lisp's potential numbers as prior art. Per and John made sure the underscore syntax works when units of measure are supported, considering prior art from Kawa's syntax and the JavaScript community. Per explained Kawa's syntax for quaternions which led to hypercomplex numbers being supported.

Shiro Kawai explained the approach of Gauche which can already skip underscores when reading #-prefixed numbers. He suggested extending Scheme's formal grammar.

Peter Bex cautioned against over-extending Scheme's already intricate number syntax and potentially breaking backward-compatibility for some programs. John and Lassi advocated forbidding leading, trailing and repeated underscores as a reasonable precaution. Arthur Gleckler suggested a dedicated #_ prefix and a user interface feature for text editors as two failsafe alternatives.

Shiro, Peter and John reminded us that identifiers starting with digits are forbidden in Scheme standards since R⁴RS. But Jim Rees, Arthur and John brought up peculiar identifiers which let implementations break that rule.

Hugo Hörnquist had the idea of using Scheme's display procedure to print numbers with underscores, reserving write for portable syntax. John advised that we postpone any decisions about printing, pointing to Alex Shinn's SRFI 159 and SRFI 166 as potential solutions with a view to the upcoming large edition of the R⁷RS standard.

Copyright

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice (including the next paragraph) shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Editor: Arthur A. Gleckler