Underscores in numbers
Lassi Kortela
This SRFI is currently in final status. Here is an explanation of each status that a SRFI can hold. To provide input on this SRFI, please send email to srfi-169@nospamsrfi.schemers.org. To subscribe to the list, follow these instructions. You can access previous messages via the mailing list archive.
Many people find that large numbers are easier to read when the digits are broken into small groups. For example, the number 1582439 might be easier to read if written as 1 582 439. This applies to source code as it does to other writing. We propose an extension of Scheme syntax to allow the underscore as a digit separator in numerical constants.
Western cultures tend to divide digits into groups of three. This convention is not universal. For example, in India people write numbers like 3 14 15 926 (read three crore fourteen lakh fifteen thousand nine hundred and twenty-six in Indian English).
For simplicity and universality, we propose that digit groups of all sizes may be mixed freely when writing a number. It is permissible to have just one digit in a group, and groups in a number don’t need to be ordered by increasing or decreasing digit count.
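As a rough illustration in Python rather than Scheme: Python's built-in `int` accepts single underscores between digits under a comparable rule, so it can show that differently grouped spellings denote the same value. The variable names are only illustrative.

```python
# Three spellings of the same integer, grouped in different ways.
western = int("31_415_926")   # groups of three
indian = int("3_14_15_926")   # last three digits, then groups of two
uneven = int("3_1415_9_26")   # arbitrary, unordered group sizes

# The grouping is purely visual; the value does not change.
assert western == indian == uneven == 31415926
```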
Human cultures and programming languages differ in what separator to use between groups.
The examples in this document so far have used a space. This is familiar to humans but not a good fit for most programming languages since whitespace has a prominent role as token separator. Scheme is no exception here.
The next natural alternative is to use a comma or a period. This is likely to cause confusion in an international community since countries that use a comma as the decimal separator are as numerous as those that use a period. More trouble comes from Scheme using the comma to splice things into a quasiquoted list: e.g. `(1,2) evaluates to (1 2). Allowing commas in numbers would change splicing behavior in a confusing way.
C++ uses an apostrophe which is somewhat exotic and may call to mind units of measure, e.g. feet and inches. Scheme also uses the apostrophe for quotation: e.g. '(1'2) evaluates to (1 (quote 2)). Allowing apostrophes in numbers would change the meaning of this syntax.
The most popular digit group separator among programming languages is the underscore. It is in the standard syntax of Ada, C#, Clojure, Eiffel, Frink, Java, Julia, Kotlin, OCaml, Perl, Python, Ruby, Rust and Swift. It is also being added to JavaScript and is a common syntax extension in implementations of Standard ML. The Common Lisp standard permits it under the umbrella of potential numbers but we are not aware of implementations that use the opportunity. Of Scheme implementations, Gauche can read numbers with underscores when they have a radix or exactness prefix.
In light of the above, we consider the underscore to be the clear winner. It is the most widely compatible and least ambiguous choice, in both human and machine terms.
Languages in the Lisp family traditionally allow a larger set of characters in identifiers than do most other languages. For example, the tokens 1+ and 3*/! parse as symbols in Common Lisp. Scheme is slightly more restrictive: none of R4RS, R5RS, R6RS and R7RS recognize identifiers that begin with a decimal digit. Implementations can be more relaxed with identifiers. For example, MIT Scheme comes with 1+ and -1+ procedures to increment and decrement numbers.
Several implementations presently parse tokens consisting
entirely of digits and underscores as identifiers. Some
implementations, such as Chicken, assume that anything they
cannot recognize as a number is an identifier.
Countless languages outside the Lisp family have a convention of using underscores as word separators in multi-word identifiers. Following that convention, Scheme’s open-input-file would be spelled open_input_file instead. In these languages it’s common to use a leading underscore to mark private (as opposed to public or exported) identifiers. This leads to potential ambiguity with identifiers such as _123 that start with an underscore and contain only underscores and digits. Such tokens often parse as identifiers. If we made them parse as numbers in Scheme it could confuse programmers and spell trouble for code generators that translate Scheme identifiers to other languages.
Scheme supports a rich numeric tower of integers, ratios, real and complex numbers. These come in exact and inexact varieties. For real numbers, we have decimal-point and exponent notation. The Kawa implementation of Scheme adds quaternions and units of measure to the mix. Common Lisp’s potential numbers offer a glimpse of how far numerical syntax can go. These intricate extensions, some of which we cannot even anticipate yet, make it even trickier for us to specify a digit separation scheme devoid of ambiguity.
We attempt to solve these problems with a conservative rule that allows underscores only between digits. After considering everything in the above paragraphs, we did not manage to come up with any concrete examples of present or future tasks that would be impeded by this restricted version of the syntax extension.
As an extra measure we also forbid trailing underscores, and forbid more than one consecutive underscore. We could not think of any particular situations where these would cause problems but decided to avoid them anyway. There are enough similar gotchas that caution seems the wise choice.
This SRFI does not specify anything about inserting underscores into numbers at print time. Printing with underscores would be as useful as reading is, especially when using a Scheme read-eval-print loop as a calculator. However, there is no consensus on how to best extend the Scheme printer. Major work is underway but it will not stabilize in time for the publication of this SRFI.
Apart from printer extension concerns, the cultural conventions of where to place digit separators are also varied and complex. When reading numbers we can leave the decision to writers and simply accept a wide range of possibilities. When printing we would have to make those decisions, or else map out what printer options are needed and design good defaults for them.
For these reasons, decisions about printing are deferred to implementations and to future SRFIs.
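To make the printing problem concrete, here is a minimal Python sketch of the kind of option a printer would need. The function `group_digits` and its `sizes` parameter are hypothetical and are not part of this SRFI, which leaves printing unspecified.

```python
def group_digits(digits, sizes=(3,)):
    """Insert underscores into a digit string, counting groups from the right.

    `sizes` lists the group sizes starting at the rightmost group; the last
    size repeats for all remaining groups. This is a hypothetical option.
    """
    if not sizes or any(size < 1 for size in sizes):
        raise ValueError("group sizes must be positive")
    groups = []
    end = len(digits)
    index = 0
    while end > 0:
        size = sizes[min(index, len(sizes) - 1)]
        start = max(0, end - size)
        groups.append(digits[start:end])
        end = start
        index += 1
    return "_".join(reversed(groups))

# Groups of three, as is common in Western cultures.
print(group_digits("31415926"))
# Last three digits, then groups of two, as in the Indian convention.
print(group_digits("31415926", sizes=(3, 2)))
```

Even this small sketch has to pick a default, which is the sort of decision the SRFI defers.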
We stipulate that conforming implementations of this SRFI must allow one underscore between any two digits, in any part of a number.
For the purpose of this rule, the term digit covers all digits in any radix between 2 and 36 inclusive, not only decimal digits. That means that the letters a-z and A-Z are considered digits (but only in places where the implementation parses that character as a digit).
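The rule for a single run of digits can be sketched in Python as follows. This is only an illustration under simplifying assumptions: it looks at one digit run in isolation (no sign, prefix, decimal point or exponent), and the helper names `digit_chars` and `is_conforming_digit_run` are hypothetical.

```python
import string

def digit_chars(radix):
    """Return the characters that count as digits in a radix from 2 to 36."""
    if not 2 <= radix <= 36:
        raise ValueError("radix must be between 2 and 36")
    lower = (string.digits + string.ascii_lowercase)[:radix]
    return set(lower) | set(lower.upper())

def is_conforming_digit_run(text, radix=10):
    """True if text consists of digits with single underscores only between digits."""
    digits = digit_chars(radix)
    if not text:
        return False
    for i, ch in enumerate(text):
        if ch in digits:
            continue
        if ch != "_":
            return False
        # An underscore needs a digit immediately before and after it.
        if i == 0 or i == len(text) - 1:
            return False
        if text[i - 1] not in digits or text[i + 1] not in digits:
            return False
    return True
```

Requiring a digit on both sides of every underscore rules out leading, trailing and repeated underscores in one check.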
We lament that it is impossible to give a precise formal definition of the underscore rule because a typical Scheme implementation does not have a complete formal grammar for its syntax. Even if it did, that grammar could change in new versions.
The next section gives what we believe to be a correct and complete extension to the formal grammar of standard Scheme. But since few implementations support the whole standard syntax and nothing but the standard syntax, implementors of this SRFI may encounter situations where their subjective judgment is called for. The subsequent section attempts to help by listing many examples of how the rule is intended to apply in particular situations. Unfortunately that list cannot be exhaustive either.
In situations where the letter of this specification does not say anything conclusive, we ask that implementors try to follow its spirit. When in doubt as to whether or not underscores should be supported in a particular part of number syntax, we suggest that implementors not allow them. They can always be allowed later once there is more clarity or consensus.
The standard syntax of Scheme is defined in the lexical syntax sections of the R6RS and R7RS reports.
The underscore rule can be implemented as an extension to either standard by adding the following grammar rules to the lexical syntax:
⟨digits R⟩ = ⟨digit R⟩+ ⟨more digits R⟩
⟨more digits R⟩ = ⟨empty⟩ | ⟨one underscore⟩ ⟨digits R⟩
⟨maybe digits R⟩ = ⟨empty⟩ | ⟨digits R⟩
⟨one underscore⟩ = _
and then making the following substitutions in existing rules (for all R):
replace ⟨digit R⟩+ with ⟨digits R⟩
replace ⟨digit R⟩* with ⟨maybe digits R⟩
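As a sketch of how these rules compose, the grammar can be transcribed into Python regular expressions for radix 10 and substituted into a simplified decimal pattern. The pattern below is an assumption made for illustration: it covers only an optional sign, a decimal point and an `e` exponent, not the full standard number syntax, and all names are hypothetical.

```python
import re

DIGIT_10 = "[0-9]"
# <digits R> = <digit R>+ <more digits R>, i.e. digit runs joined by single underscores.
DIGITS_10 = rf"{DIGIT_10}+(?:_{DIGIT_10}+)*"
# <maybe digits R> = <empty> | <digits R>
MAYBE_DIGITS_10 = rf"(?:{DIGITS_10})?"

# Simplified exponent suffix: marker, optional sign, then <digits 10>.
EXPONENT = rf"(?:[eE][+-]?{DIGITS_10})"

# Simplified decimal: "digits . maybe-digits", ". digits" or "digits",
# each formerly written with <digit 10>+ or <digit 10>*.
DECIMAL = re.compile(
    rf"[+-]?(?:{DIGITS_10}\.{MAYBE_DIGITS_10}|\.{DIGITS_10}|{DIGITS_10}){EXPONENT}?"
)

def is_conforming_decimal(token):
    """True if the whole token matches the simplified decimal pattern."""
    return DECIMAL.fullmatch(token) is not None
```

Because the underscore appears only inside `DIGITS_10`, it can never end up next to the sign, the decimal point or the exponent marker.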
Note that both standards also define the character classes ⟨digit⟩ and ⟨hex digit⟩. Neither of those should be amended. The ⟨digit⟩ class is used for identifiers and to help define other character classes. The ⟨hex digit⟩ class is used to define the backslash escape syntax for inserting characters into strings by their hexadecimal value. This SRFI does not modify the syntax of string escapes, and does not aim to modify the syntax of identifiers.
The rule includes at least the following things:
Underscores between digits in numbers of any radix (binary, octal, decimal, hexadecimal and any others supported by the implementation).
Underscores between digits 0-9 a-z A-Z when a number is written in a radix higher than 10 (using the standard hexadecimal read syntax, or any implementation-defined read syntax).
Underscores in the numerator and/or denominator of a ratio.
Underscores in the integer, fractional and/or exponent part of a real number.
Underscores in the real and/or imaginary part of a complex number.
Underscores in any dimension of a hypercomplex number (for implementations with syntax for such numbers).
Underscores in both exact and inexact numbers.
Underscores in the quantity part of a number with a unit of measure (for implementations with syntax for units of measure).
Underscores between leading zeros (but not before the first zero).
The rule excludes at least the following things:
Leading underscores before digits.
Trailing underscores after digits.
Two or more consecutive underscores.
Underscores between sign and magnitude.
Underscores next to a letter in a prefix. This includes the #b #o #d #x radix prefixes, the #e #i exactness prefixes and the #nr arbitrary radix prefix of Chez Scheme.
Underscores next to # unknown digit markers in inexact numbers.
Underscores next to the d D e E f F l L s S exponent markers.
Underscores next to the @ + - i j k markers in complex and hypercomplex numbers.
Underscores next to the R6RS | mantissa width suffix.
Underscores next to and within the inf and nan markers.
Conforming implementations may be more lenient in what they allow (perhaps to maintain compatibility with existing code). In this document, numbers written according to the underscore rule are called conforming. Other numbers (which may or may not be valid depending on the implementation) are called non-conforming.
0123 ; conforming
0_1_2_3 ; conforming
0_123 ; conforming
01_23 ; conforming
012_3 ; conforming
+0123 ; conforming
+0_123 ; conforming
-0123 ; conforming
-0_123 ; conforming
_0123 ; non-conforming
0123_ ; non-conforming
0123__ ; non-conforming
01__23 ; non-conforming
0_1__2___3 ; non-conforming
+_0123 ; non-conforming
+0123_ ; non-conforming
-_0123 ; non-conforming
-0123_ ; non-conforming
1_2_3/4_5_6_7 ; conforming
12_34/5_678 ; conforming
1_2_3/_4_5_6_7 ; non-conforming
_12_34/5_678 ; non-conforming
0_1_23.4_5_6 ; conforming
1_2_3.5e6 ; conforming
1_2e1_2 ; conforming
_0123.456 ; non-conforming
0123_.456 ; non-conforming
0123._456 ; non-conforming
0123.456_ ; non-conforming
123_.5e6 ; non-conforming
123._5e6 ; non-conforming
123.5_e6 ; non-conforming
123.5e_6 ; non-conforming
123.5e6_ ; non-conforming
12_e12 ; non-conforming
12e_12 ; non-conforming
12e12_ ; non-conforming
-12_3.0_00_00-12_34.56_78i ; conforming
-12_3.0_00_00@-12_34.56_78 ; conforming
-12_3.0_00_00-12_34.56_78_i ; non-conforming
-12_3.0_00_00-12_34.56_78i_ ; non-conforming
-12_3.0_00_00_@-12_34.56_78 ; non-conforming
-12_3.0_00_00@_-12_34.56_78 ; non-conforming
Kawa supports quaternions using the following syntax:
1+2i-3j+4k
By applying the rule a syntax like that can be extended as follows:
1_0+2_0i-3_0j+4_0k ; conforming
1_0_+2_0i-3_0j+4_0k ; non-conforming
1_0+2_0_i-3_0j+4_0k ; non-conforming
1_0+2_0i-3_0j_+4_0k ; non-conforming
1_0+2_0i-3_0j+4_0k_ ; non-conforming
Kawa supports units of measure using the following syntax:
123456cm^2
By applying the rule a syntax like that can be extended as follows:
123_456cm^2 ; conforming
123_456_cm^2 ; non-conforming
123_456.78_cm^2 ; non-conforming
#b10_10_10 ; conforming
#o23_45_67 ; conforming
#d45_67_89 ; conforming
#xAB_CD_EF ; conforming
#x789_9B_C9_EF ; conforming
#x-2_0 ; conforming
#o+2_345_6 ; conforming
#x-_2 ; non-conforming
_#x-_2 ; non-conforming
#d_45_67_89 ; non-conforming
#e_45/67_89 ; non-conforming
#i#o_1234 ; non-conforming
#i_#o_1234 ; non-conforming
#e#x1234_ ; non-conforming
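The prefix examples above can be checked with a short Python sketch. It is deliberately limited: it handles only integers, only the four standard radix prefixes and the two exactness prefixes, and the name `is_conforming_prefixed_integer` is hypothetical.

```python
import re

# Digit character class for each standard radix prefix letter.
RADIX_DIGITS = {"b": "[01]", "o": "[0-7]", "d": "[0-9]", "x": "[0-9a-fA-F]"}
PREFIX = re.compile(r"#([bodxei])", re.IGNORECASE)

def is_conforming_prefixed_integer(token):
    """Check integer tokens such as #x-2_0 or #e#x12_34."""
    radix_letter = "d"  # decimal when no radix prefix is given
    seen_radix = seen_exactness = False
    pos = 0
    while True:
        match = PREFIX.match(token, pos)
        if match is None:
            break
        letter = match.group(1).lower()
        if letter in "ei":
            if seen_exactness:
                return False
            seen_exactness = True
        else:
            if seen_radix:
                return False
            seen_radix = True
            radix_letter = letter
        pos = match.end()
    digit = RADIX_DIGITS[radix_letter]
    # After the prefixes: optional sign, then digit runs joined by single underscores.
    body = re.compile(rf"[+-]?{digit}+(?:_{digit}+)*")
    return body.fullmatch(token, pos) is not None
```

Since the body pattern must start with a sign or a digit, an underscore directly after a prefix letter or a sign is rejected.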
The sample implementation is a portable library that depends only on standard features in the R7RS small language. It is available at:
github.com/scheme-requests-for-implementation/srfi-169
The library exports one procedure (read-number) which takes no arguments. The procedure reads one Scheme number from current-input-port with support for optional underscores. It signals an error if underscores are used in a non-conforming way according to the rule stipulated in this SRFI, or if the number syntax (sans underscores) does not conform to the R7RS specification. The reader supports most of the R7RS numeric tower with the notable exception of complex numbers. The values of inexact numbers may diverge from the values produced by the native reader of a Scheme implementation if it uses different formulas for numeric conversion.
The code was tested against the examples in this SRFI. Correct results were obtained with Chibi-Scheme, Gauche and Kawa. The test harness is included with the implementation.
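The general shape of such a reader (validate the underscores first, then drop them and convert) can be sketched in Python. This is not the sample implementation: it handles only decimal integers and ratios, takes a string instead of reading from a port, and the name `read_exact_rational` is hypothetical.

```python
import re
from fractions import Fraction

# Digit runs joined by single underscores, per the rule in this SRFI.
_DIGITS = r"[0-9]+(?:_[0-9]+)*"
# Optional sign and numerator, optionally followed by "/" and a denominator.
_RATIONAL = re.compile(rf"([+-]?{_DIGITS})(?:/({_DIGITS}))?")

def read_exact_rational(token):
    """Parse a decimal integer or ratio, rejecting non-conforming underscores."""
    match = _RATIONAL.fullmatch(token)
    if match is None:
        raise ValueError(f"non-conforming number: {token!r}")
    numerator = int(match.group(1).replace("_", ""))
    denominator = int(match.group(2).replace("_", "")) if match.group(2) else 1
    # Fraction raises ZeroDivisionError for a zero denominator.
    return Fraction(numerator, denominator)
```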
This SRFI is the result of an impromptu design session on the srfi-discuss mailing list over the weekend between April 12th and April 15th, 2019.
Lassi Kortela suggested the idea, worked out the examples and rationale, wrote this document and produced the sample implementation. However, the design is entirely a group effort.
John Cowan provided invaluable expertise on human and computer languages. He cautioned against requiring a fixed number of digits per group and provided the Indian English example. John found the extensive list of programming languages already using underscores. John and Lassi cautioned against the ambiguity of using commas as delimiters.
Per Bothner introduced Kawa's extended number syntax and noted Common Lisp's potential numbers as prior art. Per and John made sure the underscore syntax works when units of measure are supported, considering prior art from Kawa's syntax and the JavaScript community. Per explained Kawa's syntax for quaternions which led to hypercomplex numbers being supported.
Shiro Kawai explained the approach of Gauche which can already skip underscores when reading #-prefixed numbers. He suggested extending Scheme's formal grammar.
Peter Bex cautioned against over-extending Scheme's already intricate number syntax and potentially breaking backward-compatibility for some programs. John and Lassi advocated forbidding leading, trailing and repeated underscores as a reasonable precaution. Arthur Gleckler suggested a dedicated #_ prefix and a user interface feature for text editors as two failsafe alternatives.
Shiro, Peter and John reminded us that identifiers starting with digits are forbidden in Scheme standards since R4RS. But Jim Rees, Arthur and John brought up peculiar identifiers which let implementations break that rule.
Hugo Hörnquist had the idea of using Scheme's display procedure to print numbers with underscores, reserving write for portable syntax. John advised that we postpone any decisions about printing, pointing to Alex Shinn's SRFI 159 and SRFI 166 as potential solutions with a view to the upcoming large edition of the R7RS standard.
Copyright © Lassi Kortela (2019)
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice (including the next paragraph) shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.