revised w/nocase text, considering titlecase and cased

This page is part of the web mail archives of SRFI 115 from before July 7th, 2015. The new archives for SRFI 115 contain all messages, not just those from before July 7th, 2015.

Below is the intended final text for w/nocase.

Note that both (w/nocase upper) and (w/nocase lower) are effectively ways to access the Unicode "cased" property (L&), so we may want to add this explicitly for completeness. Likewise, titlecase (Lt) can then be accessed as (- cased upper lower), so we may as well include this too. Both can be added as such delayed expansions without requiring implementations to store additional character tables.

Alex

(w/nocase sre ...)

Enclosed sres are case-insensitive. In a Unicode context, character and string literals match using the default simple Unicode case-insensitive matching. Implementations may, but are not required to, handle variable-length case conversions, such as #\x00DF "ß" matching the two characters "SS".

Character sets match if any character in the set matches the input case-insensitively. Conceptually, each cset-sre is expanded to contain all case variants of all of its characters. In a compound cset-sre the expansion is applied at the terminals: characters, strings, embedded SRFI 14 char-sets, and named character sets. For simple unions this is equivalent to computing the full union first and then expanding case variants, but the semantics can differ when differences and intersections are applied. For example, (w/nocase (~ ("Aab"))) is equivalent to (~ ("AaBb")), of which "B" is clearly not a member. However, if you were to compute (~ ("Aab")) first, you would have a char-set containing "B", and after expanding case variants both "B" and "b" would be members.
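The difference between the two orders of evaluation can be sketched with plain SRFI 14 char-sets. This is only an illustration, not how an implementation has to work: expand-case and ascii-complement are hypothetical helpers, the complement is taken relative to ASCII to keep the sets small, and the import list assumes an R7RS system that provides (srfi 14).

```scheme
(import (scheme base) (scheme char) (scheme write) (srfi 14))

;; Hypothetical helper (not part of SRFI 14 or SRFI 115): add the simple
;; upper- and lower-case variant of every character in cs.
(define (expand-case cs)
  (char-set-fold
   (lambda (c acc)
     (char-set-adjoin acc (char-upcase c) (char-downcase c)))
   cs
   cs))

;; Assumption for this sketch: complement relative to ASCII only.
(define (ascii-complement cs)
  (char-set-difference char-set:ascii cs))

(define terminals (string->char-set "Aab"))

;; Expansion applied at the terminal, then complemented,
;; which mirrors (~ ("AaBb")).
(define terminal-first (ascii-complement (expand-case terminals)))

;; Complement computed first, then expanded. This is NOT what the
;; text above specifies for w/nocase.
(define set-first (expand-case (ascii-complement terminals)))

;; Report membership of A, a, B and b in the given char-set.
(define (membership cs)
  (map (lambda (c) (char-set-contains? cs c)) '(#\A #\a #\B #\b)))

(write (membership terminal-first)) (newline) ; (#f #f #f #f)
(write (membership set-first)) (newline)      ; (#f #f #t #t)
```

With expansion at the terminals none of the four characters is in the complement, whereas expanding after the complement pulls "b" back in alongside "B".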

In an ASCII context only the 52 ASCII letters (/ "a-zA-Z") match case-insensitively to each other.

In a Unicode context the only named cset-sres affected by w/nocase are upper and lower. Note that the case-insensitive versions of these are not equivalent to letter, as there are characters with the letter property but no case.
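As a rough illustration of the note above, the following sketch probes a caseless letter. It assumes an implementation with Unicode support that exposes SRFI 115 as (srfi 115), and it uses the named set alpha as a stand-in for the letter property. The results in the comments are what the semantics described here would suggest, not output captured from a particular implementation.

```scheme
(import (scheme base) (scheme write) (srfi 115))

;; U+05D0 HEBREW LETTER ALEF: a letter (general category Lo) with no case.
(define alef (string #\x05D0))

;; A cased letter is matched by the case-insensitive upper set.
(write (regexp-matches? '(w/nocase upper) "a")) ; expected: a match object
(newline)

;; A caseless letter is alphabetic ...
(write (regexp-matches? 'alpha alef))           ; expected: a match object
(newline)

;; ... but has no case variants, so (w/nocase upper) should not match it.
(write (regexp-matches? '(w/nocase upper) alef)) ; expected: #f
(newline)
```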

   (regexp-search "needle" "haynEEdlehay") => #f
   (regexp-search '(w/nocase "needle") "haynEEdlehay") => #<regexp-match>

   (regexp-search '(~ ("Aab")) "B") => #<regexp-match>
   (regexp-search '(~ ("Aab")) "b") => #f
   (regexp-search '(w/nocase (~ ("Aab"))) "B") => #f
   (regexp-search '(w/nocase (~ ("Aab"))) "b") => #f
   (regexp-search '(~ (w/nocase ("Aab"))) "B") => #f
   (regexp-search '(~ (w/nocase ("Aab"))) "b") => #f

Alex