This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
and have not read every message in the archives.)

I am very, very happy to see this much attention lavished on Unicode data representation. However, I strongly regret the extended string lexical syntax in this proposal, and so would like to propose an alternative solution.

RECITATIONS

1. CL compatibility

R5RS lexical strings are very close to CL lexical strings, though the wording of the Scheme spec leaves an out for extended string syntaxes. I would love to preserve consistency with CL here.

2. Simplicity

In a CL string's external syntax, a backslash ("single escape character") does not change the meaning of the next character; rather, it removes any meta-meaning, forcing the next character as (READ) to be taken literally. This makes string literal parsing and construction very simple: in any place where you're assembling a literal string and for some reason don't know for certain what the next character you're appending is, you can prepend a backslash and thus ensure safety. Likewise, a parser doesn't need to take into account differences between various implementations' backslash escapes.

3. Archaic C escape sequences

Do we need literal string codes for alarm, tab, linefeed, vertical tab? How many Scheme programs nowadays drive lineprinters with data embedded in static strings?

4. Non-extensibility of extended lexical syntax

Escape sequences cannot be altered by the program, since they take effect at read time: by the time that the program receives the program text, the string has already been lexed, with escapes interpreted and meta information discarded. Having a bevy of magic sequences certainly suggests that a client could change or at least amend those sequences ("Hey! backslash-q is free - let's use it to flag internationalization data" "Hey! this implementation doesn't support backslash-a - let's add that"), perhaps adding compile-time or run-time information (e.g. field width specifiers in C printf).
CL solves this by moving everything into the library.

5. Redundancy with character lexical syntax

Finally, we have two separate ways to denote non-program-plaintext characters: one in character lexical syntax, and one in string syntax. Reducing this would be good for both implementation and user.

PROPOSAL

SRFI 75 could adopt a two-layered literal string syntax.

The basic string literal syntax would be that of CL, or R5RS without the "behavior is unspecified" escape clause. A string datum is a sequence of characters read from the program text, delimited by double quotes. A backslash forces the next character in the sequence to be taken literally. Since the only magic characters in this syntax are double-quote and backslash, backslash only changes their behavior.

The second layer would be the extended string syntax, introduced by sharp-doublequote-leftparen and closed by rightparen. Elements within the extended string delimiters are read like any token: they may include simple strings, or characters (including the extended syntax of SRFI 75), which become a single element in the string. The elements in an extended string are concatenated in sequence, yielding a single string literal. Including an element other than a simple string or character yields implementation-specific results.

Examples:

#"("this") -> "this"
#"("this" "that") -> "thisthat"
#"("this" #\a) -> "thisa"
#"("this" #\space "that") -> "this that"
#"("G" #\u0246 "del") -> "Goedel"

#"("This is " "the symphony " "that Schubert wrote, but never finished")
-> "This is the symphony that Schubert wrote, but never finished"

#"("One fish." #\newline "Two fish." #\newline "Red fish." #\newline "Blue fish.")
-> "One fish.
Two fish.
Red fish.
Blue fish."

Two obvious extensions to extended strings are symbols, which are converted to strings per standard, and numbers, which represent a single Unicode codepoint and are subject to the same limitations as characters.

#"("this" symbol) -> "thissymbol"
#"("G" 246 "del") -> "Goedel"
#"("G" #xF6 "del") -> "Goedel"

COUNTERARGUMENTS

Per my recitations above

Q1. Why carry on about CL compatibility, then? There's no such extended string syntax in CL.

A1. Yes, but the part that looks like CL now behaves very much like CL. Dual users can carry their expectations about backslash and doublequote back and forth between the two.

Q2. Is that any simpler than backslashes?

A2. It is for the purpose of emitting string literals in the double-quote syntax, which presumably will subsequently be consumed by another program. Parsing-wise, any system that can read a vector can now read an extended string.

Q3. My popular and well-liked Scheme system already supports C-style string escapes.

A3. [looks at feet, shuffles uncomfortably]

Q4. This isn't extensible at runtime, either.

A4. Right. But since it doesn't embed characters inside literally typed strings, it doesn't /look/ extensible. Any extensibility will have to take place through a real character in the string, such as the tilde used by FORMAT.

And others

Qa. Man, #"("G" #\u0246 "del") is ugly. If my dog was that ugly....

Aa. Uglier than "G\u0246del"? Neither is pretty. One's easier to parse and IMO read, one's easier to type. Extended gets better if you allow integers to represent codepoints a la #"("G" 246 "del").

Qb. Doesn't the WRITE procedure become more complicated in the presence of extended strings? Now the runtime must preprocess any written string to see if it requires the extended syntax before writing it.

Ab. That is correct. Backslash syntax has the advantage of pushing this decision down to the per-character level.

Anyway.

For your consideration,
Ben
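
The two-layer reading described under PROPOSAL can be sketched roughly as follows. This is only an illustration in Python rather than Scheme, not a reference implementation: the names read_simple_string, read_extended_string and NAMED_CHARS are hypothetical, and the sketch assumes well-formed input and handles only simple strings, #\ characters (a single character, or the names space and newline), integer codepoints (decimal or #x hex), and symbols.

```python
# Assumption: only these two character names are recognised in this sketch.
NAMED_CHARS = {"space": " ", "newline": "\n"}


def read_simple_string(text, i):
    """Read a CL-style string literal whose opening quote is at text[i].

    A backslash makes the next character literal; nothing else is special.
    Returns (contents, index just past the closing quote).
    """
    assert text[i] == '"'
    i += 1
    out = []
    while text[i] != '"':
        if text[i] == "\\":
            i += 1  # drop the backslash, take the next character as-is
        out.append(text[i])
        i += 1
    return "".join(out), i + 1


def read_extended_string(text):
    """Read #"( element ... ) and concatenate the elements into one string."""
    assert text.startswith('#"(')
    i = 3
    parts = []
    while True:
        while text[i].isspace():
            i += 1
        if text[i] == ")":
            return "".join(parts)
        if text[i] == '"':
            # Layer one: a simple string element.
            s, i = read_simple_string(text, i)
            parts.append(s)
        elif text.startswith("#\\", i):
            # A character element; the char right after #\ always belongs to it.
            j = i + 3
            while text[j].isalnum():
                j += 1
            name = text[i + 2:j]
            parts.append(name if len(name) == 1 else NAMED_CHARS[name])
            i = j
        else:
            # Any other token runs up to whitespace or the closing paren.
            j = i
            while not text[j].isspace() and text[j] != ")":
                j += 1
            token = text[i:j]
            if token.startswith("#x"):
                parts.append(chr(int(token[2:], 16)))  # hex codepoint
            elif token.isdigit():
                parts.append(chr(int(token)))  # decimal codepoint
            else:
                parts.append(token)  # symbol: use its name
            i = j


print(read_extended_string('#"("this" #\\space "that")'))  # this that
print(read_extended_string('#"("this" symbol)'))           # thissymbol
```

Note that the simple-string reader needs no table of escapes at all, which is the point of recitation 2; everything beyond double-quote and backslash is handled at the element level.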