275: URIs and IRIs

by Duncan Guthrie

Status

This SRFI is currently in draft status. Here is an explanation of each status that a SRFI can hold. To provide input on this SRFI, please send email to srfi-275@nospamsrfi.schemers.org. To subscribe to the list, follow these instructions. You can access previous messages via the mailing list archive.

Abstract

This SRFI proposes a programming interface for working with RFC 3986 uniform resource identifiers (URIs), as well as RFC 3987's generalisation to internationalised resource identifiers (IRIs). This document defines record types, normalisation procedures, and conversion between URIs and IRIs. Additionally, we contribute a test suite to specify the behaviour of normalisation with respect to relative references, which has in the past been a source of divergence between implementations.

Issues

none so far

Rationale

RFC 3986 [1] describes an abstract syntax for uniform resource identifiers (URIs) as well as their relative references, which allow documents to be authored without knowing the final publishing location. RFC 3987 [2] defines IRIs, which generalise URIs to Unicode by defining an interpretation of escapes as sequences of UTF-8 octets.

URIs and IRIs are widely used to denote resources across the world wide web, and are the basis of a number of web standards. It is critical to present a programming interface for manipulation of different components in isolation (e.g. paths, hostnames), as working with a URI's coarse string representation directly is error-prone. RFC 3986 defines an abstract syntax and hence conveniently forms the basis of such a programming interface. Further, RFC 3987's generalisation to internationalised identifiers is a natural extension, allowing us to faithfully denote resources in a number of languages, using the universal character set. Indeed, IRIs form the basis of modern standards like the Resource Description Framework (RDF), a widespread formal model for metadata interchange and knowledge representation. We hence require support for both URIs and IRIs.

Specification

Relative references and polymorphic procedures

RFC 3986 and 3987 distinguish URIs and IRIs, respectively, from their relative references, which must be resolved against a base URI or IRI in order to be used. The main application of relative references is to allow one to refer to resources and author documents without knowing the final publishing location. For example, a graph database may produce RDF documents specified in RDF/XML, where resources are denoted with IRI-relative references, assuming that these documents would be interchanged with another system which mints full IRIs with respect to its hosting location.

Somewhat confusingly, RFC 3986 defines "URI-reference" as the most common usage of URI: either an URI or a relative reference. We follow this common usage by developing polymorphic getters and setters which work on URIs and relative references, with the predicate uri? holding true for both URIs and relative references. This procedure would hence correspond to testing for an RFC 3986 "URI-reference". (RFC 3987 follows an identical convention for IRIs and their relative references, so the same approach is followed for IRIs.)

The most divergent behaviour between widespread implementations of URIs and IRIs has been with respect to normalisation of relative references. We explicitly avoid defining path segment normalisation for relative references (regardless of whether the path is absolute or relative) because the behaviour is largely undefined by RFC 3986 and 3987 or any current RFC. See the section on normalisation for more details and the test suite.

A record type presentation

URIs and IRIs are structured data. Of course, record types are convenient as they generate dedicated getters and setters. More importantly, however, we think that record structures are necessary in this case for normalisation procedures to be structure-preserving. Specifically, normalisation procedures may alter the path, such that its segments may be mistaken for other parts of the URI or IRI, such as the <://> portion separating scheme from authority, or for the hostname. Updates to authority components or to the path need to similarly ensure that they produce valid URIs and IRIs when serialised.

We argue that a string representation makes it too easy to inadvertently modify the structure, because normalisation of individual components may yield a string which, when parsed again, is interpreted differently with respect to the structure. A good example of this is the restrictions on paths given an authority, because a non-empty authority is denoted in an URI using two slashes, which are characters which may also appear in paths.

A string representation of an URI or relative reference, or an IRI or relative reference (procedures uri->string and iri->string) is produced by concatenating string representations of the individual fields (scheme, hostname &c.), with the expected separators between components. For efficiency, no assumption should be made that the contents of individual fields can be checked at this point, which typically would involve additional, redundant parsing.

Pure and impure interfaces

This document defines an interface for both pure-functional and in-place updates to URIs and IRIs, with the latter interface defined in optional sub-libraries. The reasoning behind this is that although in-place setters may be more efficient, existing implementations of URIs and IRIs have generally not provided in-place setters, if they provide setters at all. Further, this document is minimally prescriptive of the internal representation, as implementors may select different optimised data structures to represent immutable IRIs to take advantage of information sharing.

We require implementations to provide a pure-functional interface to URIs and IRIs, but not an in-place interface. The reasoning for this is that if implementors choose data structures optimised for purely functional programming, it is more cumbersome to create an impure interface, whereas it is not as cumbersome for implementors to create an (inefficient) pure-functional interface by copying the URI or IRI before updating in-place.

Indeed, while the pure-functional procedures are convenient to programmers, in the sample implementation, they are less efficient because they copy the URI or IRI before performing operations in-place. The sample implementation uses an internal (SRFI 160 [8]) bytevector representation to efficiently store both UTF-8 octets and percent-encoded escapes.

If an implementation provides disjoint mutable and immutable URI and IRI variants, then it is an error to call the in-place setters on an immutable variant.

Error behaviour of setters and updaters

Our design is to provide getters and setters which abstract away the internal representation, working on string representations of URI components. More generally, we suspect that existing implementations largely omit setters and updaters because preserving internal consistency is fairly cumbersome on the implementor and programmer, with validity of authority components being defined with respect to the path, and vice versa. The specific challenge is to ensure that setting a given component would not violate the URI grammar, as this may elicit a flat string representation which would have a different structure when parsed again.

First, the scheme, query and fragment components do not depend on validity of other components. Setters and updaters invoke the respective parsers on the string representation to be set, and raise an error if the parse failed.

Second, for the path component, validity depends on whether a) any authority component is set; and b) whether we are setting it for an URI, or for a relative reference.

  1. If any authority component is set, then the target path must either be empty or begin with a slash (see RFC 3986's path-abempty production). An error is raised otherwise, or if there is a parse failure.
  2. If no authority component is set, then the check depends on whether we are dealing with a URI or a relative reference.
  3. For a URI without an authority, if the path has a leading slash (path-absolute production), then the first segment must not be empty (a double slash would be parsed as the separator between scheme and authority).
  4. For a relative reference without an authority, the same restriction applies, and in addition a non-absolute path must not have an initial segment with a colon in it (as the colon would be parsed as the delimiter that ends the scheme). An error is raised if a path is to be set which would violate these conditions.

The validity of an authority component likewise depends on the shape of the path. See RFC 3986 ABNF for details.
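The shape checks on paths described above can be sketched in portable R7RS Scheme. This is only an illustration working on plain strings: path-settable? and its helpers are hypothetical names that are not part of this SRFI, and the sketch checks only the structural conditions, not whether each character is permitted by the grammar.

```scheme
(import (scheme base))

;; Does STR begin with PREFIX?
(define (string-prefix-of? prefix str)
  (let ((n (string-length prefix)))
    (and (<= n (string-length str))
         (string=? prefix (substring str 0 n)))))

;; Does the first path segment (up to the first slash) contain a colon?
(define (first-segment-has-colon? path)
  (let loop ((i 0))
    (cond ((= i (string-length path)) #f)
          ((char=? (string-ref path i) #\/) #f)
          ((char=? (string-ref path i) #\:) #t)
          (else (loop (+ i 1))))))

;; Hypothetical helper: may PATH be set, given the other components?
(define (path-settable? path has-authority? relative?)
  (cond (has-authority?
         ;; path-abempty: empty, or starting with a slash
         (or (string=? path "") (string-prefix-of? "/" path)))
        ;; without an authority, a leading "//" would be read as one
        ((string-prefix-of? "//" path) #f)
        ;; relative reference: "a:b" would be read as scheme "a"
        (relative? (not (first-segment-has-colon? path)))
        (else #t)))

(path-settable? "a/b" #t #f)   ; → #f
(path-settable? "a:b/c" #f #t) ; → #f
(path-settable? "./a:b" #f #t) ; → #t
```

A real setter would raise an error instead of returning #f, and would run the component parser as well.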

Normalisation

We support three scheme-independent normalisation procedures:

  1. Case normalisation, in which U.S. ASCII may be case-insensitive, and in which escaped (percent-encoded) octets have a canonical upper-case form.
  2. Escape normalisation, in which percent-encoded octets may be decoded, or characters may be encoded as escapes.
  3. Path segment normalisation, in which dotted segments (. and ..) are eliminated.

Case normalisation

The scheme and host components of both URIs and IRIs are considered case-insensitive, with other components considered case-sensitive. For URIs, the repertoire of characters is within U.S. ASCII, whereas for IRIs, Unicode characters may appear. Nonetheless, for both URIs and IRIs, only U.S. ASCII is case-insensitive, with characters like "É" never being normalised to a lower-case form like "é".

Escapes (percent-encodings) take the form % 0-F 0-F (hexadecimal digits), and these hexadecimal digits have a canonical upper-case form. For example, %cf would be normalised to %CF. In practice, the sample implementation always parses these into a canonical form as it represents these internally as octets, not in the original string form. If implementations do preserve the original form, they must always support normalisation into the canonical upper-case form.
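As an illustration of the canonical upper-case form, the following R7RS sketch upper-cases the hexadecimal digits of escapes in a plain string. The name upcase-escapes is hypothetical and not part of this SRFI; an implementation that stores escapes as octets, like the sample implementation, would not need such a pass.

```scheme
(import (scheme base) (scheme char))

;; Is C one of 0-9, a-f or A-F?
(define (hex-digit? c)
  (or (char<=? #\0 c #\9)
      (char<=? #\a c #\f)
      (char<=? #\A c #\F)))

;; Upper-case the two hex digits of every well-formed %XX escape in STR,
;; leaving all other characters untouched.
(define (upcase-escapes str)
  (let ((len (string-length str)))
    (let loop ((i 0) (acc '()))
      (cond ((= i len)
             (list->string (reverse acc)))
            ((and (char=? (string-ref str i) #\%)
                  (< (+ i 2) len)
                  (hex-digit? (string-ref str (+ i 1)))
                  (hex-digit? (string-ref str (+ i 2))))
             (loop (+ i 3)
                   (cons (char-upcase (string-ref str (+ i 2)))
                         (cons (char-upcase (string-ref str (+ i 1)))
                               (cons #\% acc)))))
            (else
             (loop (+ i 1) (cons (string-ref str i) acc)))))))

(upcase-escapes "%cf")          ; → "%CF"
(upcase-escapes "/%Fa/%FB/%fC") ; → "/%FA/%FB/%FC"
```

Characters outside escapes, such as the mixed-case letters of a path, are deliberately left alone.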

Escape (percent-encoding) normalisation

The interpretation of escapes differs for URIs and IRIs.

For URIs, octets may individually be decoded to ASCII characters. Essentially, this occurs if a character is not in the URI reserved range, and if it is permissible within a given URI component (e.g. path). Characters in the reserved range, if encountered in the clear, must never be percent-encoded. Conversely, characters not in the reserved range, and which are not permissible within a given URI component, are encoded as a series of escapes corresponding to a series of UTF-8 octets. This bears particular mention because, while other encodings are valid URIs, the RFC 3986 specification specifically requires this encoding, which enables compatibility with the closely related RFC 3987 specification for IRIs.
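The decoding half of this rule can be sketched as follows. As a simplifying assumption, the sketch decodes an escape only when its octet is an RFC 3986 unreserved character (letters, digits, hyphen, period, underscore and tilde), and ignores the per-component rules; decode-unreserved and its helpers are hypothetical names, not part of this SRFI.

```scheme
(import (scheme base))

;; Value of a hexadecimal digit character, or #f if C is not one.
(define (hex-value c)
  (cond ((char<=? #\0 c #\9) (- (char->integer c) (char->integer #\0)))
        ((char<=? #\a c #\f) (+ 10 (- (char->integer c) (char->integer #\a))))
        ((char<=? #\A c #\F) (+ 10 (- (char->integer c) (char->integer #\A))))
        (else #f)))

;; RFC 3986 "unreserved": ALPHA / DIGIT / "-" / "." / "_" / "~"
(define (unreserved? c)
  (or (char<=? #\a c #\z) (char<=? #\A c #\Z) (char<=? #\0 c #\9)
      (and (memv c '(#\- #\. #\_ #\~)) #t)))

;; Decode %XX only when the octet is an unreserved ASCII character;
;; every other escape is copied through unchanged.
(define (decode-unreserved str)
  (let ((len (string-length str)))
    (let loop ((i 0) (acc '()))
      (if (= i len)
          (list->string (reverse acc))
          (let* ((c (string-ref str i))
                 (hi (and (char=? c #\%) (< (+ i 2) len)
                          (hex-value (string-ref str (+ i 1)))))
                 (lo (and hi (hex-value (string-ref str (+ i 2)))))
                 (ch (and lo (integer->char (+ (* 16 hi) lo)))))
            (if (and ch (unreserved? ch))
                (loop (+ i 3) (cons ch acc))
                (loop (+ i 1) (cons c acc))))))))

(decode-unreserved "my%2Ename") ; → "my.name"
(decode-unreserved "my%40name") ; → "my%40name"
```

The second call leaves %40 alone because the at-sign is a reserved character.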

IRIs not only generalise the range of characters permissible within the IRI to certain Unicode ranges, but also interpret percent-encodings as UTF-8. Because all URIs are valid IRIs, the normalisation of IRIs with respect to percent-encodings is essentially the same as conversion from an URI to an IRI. During conversion, the entire IRI is interpreted as a UTF-8 code sequence, with any percent-encoding not part of a valid UTF-8 sequence being reencoded.

Path segment normalisation

Path segment normalisation is structurally identical for both URIs and IRIs. It interprets an entire path with respect to two control sequences: . (current directory) and .. (parent directory), similar to UNIX paths. Unlike the other two normalisation procedures, path segment normalisation is undefined for relative references, because the path portion is not meaningful except during relative reference resolution.

It should be noted that path segment normalisation is defined for non-relative IRIs with relative paths, such as a URN like <foo:a/b/../.././../../e>. This is a major source of divergence between RFC 3986 implementations. The divergence appears to arise from reusing the remove-dot-segments procedure as defined in RFC 3986 verbatim. In relative reference resolution, this procedure is never called on relative paths, only on absolute paths.

For example, for the (non-relative) IRI <foo:a/b/../.././../../e>, implementations using the RFC 3986 procedure verbatim get <foo:/e>, whereas other implementations get <foo:e>. In the former camp are implementations like the Erlang/OTP system's built-in uri_string [4], and Guile-RDF [6], whereas in the latter camp are implementations like Haskell's network-uri [3]. We are in the latter camp.

A fixed remove-dot-segments might be implemented as follows. This procedure works on an internal representation of a path, in which each element is either a forward-slash character or a segment (a list of characters or percent-encodings). The example uses the pattern matcher of SRFI 262 [9] and the break procedure of SRFI 1 [7]:

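;; Drop the most recently buffered segment, if there is one.
(define (drop~1 cd)
  (match cd
    ['() '()]
    [(cons head tail) tail]))

;; Split PATH after its first slash: returns the leading segment (with its
;; trailing slash, if present) and the rest of the path.
;; Characters are compared with eqv?, as eq? is unspecified on characters.
(define (next-segment path)
  (match-values (break (lambda (seg) (eqv? seg #\/)) path)
    [(r (cons #\/ ps1))
     (values (append r (list #\/)) ps1)]
    [(r _)
     (values r '())]))

;; Set a leading slash aside so that a ".." segment can never consume it.
(define (remove-dot-segments path)
  (match path
    ['() '()]
    [(cons #\/ seg*)
     (cons #\/ (remove-dot-segments/relative seg*))]
    [_
     (remove-dot-segments/relative path)]))

(define (remove-dot-segments/relative path)
  (let elim-dots ([path path]
                  [buffer '()])
    (match path
      ['()
       ;; The buffer holds the kept segments in reverse order, each as a list
       ;; of path elements; splice them one level deep (SRFI 1 concatenate).
       (concatenate (reverse buffer))]
      ;; A "." segment is skipped.
      [(cons* (list #\.) #\/ next-path)
       (elim-dots next-path buffer)]
      [(list (list #\.))
       (elim-dots '() buffer)]
      ;; A ".." segment removes the previously kept segment.
      [(cons* (list #\. #\.) #\/ next-path)
       (elim-dots next-path (drop~1 buffer))]
      [(list (list #\. #\.))
       (elim-dots '() (drop~1 buffer))]
      ;; Any other segment is kept.
      [_
       (let-values ([(shift-this next-path)
                     (next-segment path)])
         (elim-dots next-path (cons shift-this buffer)))])))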
  

Additionally, the above suggested implementation of remove-dot-segments is considerably clearer than the stack-based description in RFC 3986, with fewer pattern-matching clauses required.
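For readers without a SRFI 262 implementation to hand, the same idea can be sketched on plain path strings in portable R7RS Scheme. This is an illustration only: the names path->units, remove-dots/relative and remove-dots are hypothetical, and a string stands in for the structured internal representation of a path.

```scheme
(import (scheme base))

;; Split PATH into units: each segment keeps its trailing slash, if any.
;; "a/b/../e" => ("a/" "b/" "../" "e")
(define (path->units path)
  (let loop ((chars (string->list path)) (unit '()) (units '()))
    (cond ((null? chars)
           (reverse (if (null? unit)
                        units
                        (cons (list->string (reverse unit)) units))))
          ((char=? (car chars) #\/)
           (loop (cdr chars)
                 '()
                 (cons (list->string (reverse (cons #\/ unit))) units)))
          (else
           (loop (cdr chars) (cons (car chars) unit) units)))))

;; Eliminate dotted units from a path that has no leading slash.
(define (remove-dots/relative path)
  (let loop ((units (path->units path)) (kept '()))
    (cond ((null? units)
           (apply string-append (reverse kept)))
          ((member (car units) '("./" "."))
           (loop (cdr units) kept))
          ((member (car units) '("../" ".."))
           ;; A ".." with nothing left to remove is silently dropped.
           (loop (cdr units) (if (null? kept) kept (cdr kept))))
          (else
           (loop (cdr units) (cons (car units) kept))))))

;; Keep a leading slash aside so that it can never be consumed by "..".
(define (remove-dots path)
  (if (and (> (string-length path) 0)
           (char=? (string-ref path 0) #\/))
      (string-append "/" (remove-dots/relative
                          (substring path 1 (string-length path))))
      (remove-dots/relative path)))

(remove-dots "a/b/../.././../../e")        ; → "e"
(remove-dots "/some/./where/././place/./") ; → "/some/where/place/"
```

The first call corresponds to the <foo:a/b/../.././../../e> example: the relative path stays relative, giving <foo:e> rather than <foo:/e>.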

Notation and convention

Although we support both URIs and IRIs, for brevity we primarily describe behaviour for IRIs, and omit descriptions of the equivalent procedures for URIs where they behave the same. This works because URIs and IRIs are structurally identical, with the divergence between RFC 3986 and 3987 arising from the generalisation of the character set, and the treatment of normalisation.

Type signatures are specified as arrows from input arguments to a single output. Multiple values are denoted (typ ...) and () denotes unit or void (equivalent to R6RS (cond [#f #f])).

In type specifications, when we refer to iri, we refer to both IRIs and IRI-relative references (i.e. the iri type and the relative-iri type). We specify which specific type of IRI with either non-relative-iri or relative-iri. This is a little loose, but best reflects the use of polymorphism in this library interface.

Finally, throughout this document, we enclose IRIs and URIs in angle brackets, e.g. <http://example.org>.

IRI programming interface

Library: (srfi 275 iri)

Syntax: iri
Procedure: (non-relative-iri? ident): any → boolean
Procedure: (iri? ident): any → boolean
Procedure: (iri-scheme ident): non-relative-iri → string?
Procedure: (iri-user ident): iri → string?
Procedure: (iri-host ident): iri → string?
Procedure: (iri-port ident): iri → fixnum?
Procedure: (iri-path ident): iri → string?
Procedure: (iri-query ident): iri → string?
Procedure: (iri-fragment ident): iri → string?

IRI record type. The record's fields are derived from the RFC 3986 URI grammar. The procedures for components other than scheme are polymorphic on both IRIs and relative references (for which see below).

Examples:

(define example-IRI (string->iri "http://example.org:80/ex#IRI"))

example-IRI → <http://example.org:80/ex#IRI>

(iri? example-IRI) → #t
(non-relative-iri? example-IRI) → #t

(iri-scheme example-IRI) → "http"
(iri-user example-IRI) → #f
(iri-host example-IRI) → "example.org"
(iri-port example-IRI) → 80
(iri-path example-IRI) → "/ex"
(iri-query example-IRI) → #f
(iri-fragment example-IRI) → "IRI"
  
Procedure: (update-iri-scheme ident str): non-relative-iri → string? → non-relative-iri
Procedure: (update-iri-user ident str): iri → string? → iri
Procedure: (update-iri-host ident str): iri → string? → iri
Procedure: (update-iri-port ident port): iri → fixnum? → iri
Procedure: (update-iri-path ident str): iri → string? → iri
Procedure: (update-iri-query ident str): iri → string? → iri
Procedure: (update-iri-fragment ident str): iri → string? → iri

Pure-functional setters for IRIs and relative references. Apart from update-iri-port, these procedures take a string, parsing it to the relevant field. It is an error to pass a string which does not conform to the RFC 3987 grammar for that component. The update-iri-scheme procedure is undefined for relative references, and it is an error to call it on a relative reference.

Syntax: relative-iri
Procedure: (relative-iri? ident): any → boolean
Procedure: (iri? ident): any → boolean

Relative IRI record type. Relative references do not have a scheme, and it is an error to call iri-scheme on one. The remaining procedures are polymorphic on both relative references and IRIs.

Examples:

(define example-IRI (string->iri "/ex#IRI"))

example-IRI → </ex#IRI>

(iri? example-IRI) → #t
(non-relative-iri? example-IRI) → #f
(relative-iri? example-IRI) → #t

(iri-scheme example-IRI) → <raises an ERROR>
(iri-user example-IRI) → #f
(iri-host example-IRI) → #f
(iri-port example-IRI) → #f
(iri-path example-IRI) → "/ex"
(iri-query example-IRI) → #f
(iri-fragment example-IRI) → "IRI"
  
Procedure: (iri-authority ident): iri | relative-iri → #f | (string? string? fixnum?)
Procedure: (update-iri-authority ident user host port): iri → #f | string? → string? → fixnum? → iri

Authority is derived from the user, host and port. The iri-authority procedure simply retrieves these as multiple values. The update-iri-authority procedure mints a new IRI with these three values set.

Procedure: (iri-equal? A B): iri → iri → boolean

Holds true if the two arguments are IRIs and their fields are equal, or if the two arguments are relative references and their fields are equal. An IRI is never equal to a relative reference, and vice versa. It is an error to call this procedure when either argument is not an IRI or relative reference.

Library: (srfi 275 iri in-place)

Procedure: (set-iri-scheme! ident str): non-relative-iri → string? → ()
Procedure: (set-iri-user! ident str): iri → string? → ()
Procedure: (set-iri-host! ident str): iri → string? → ()
Procedure: (set-iri-port! ident port): iri → fixnum? → ()
Procedure: (set-iri-path! ident str): iri → string? → ()
Procedure: (set-iri-query! ident str): iri → string? → ()
Procedure: (set-iri-fragment! ident str): iri → string? → ()

In-place setters for IRIs and relative references. Apart from set-iri-port!, these procedures take a string, parsing it to the relevant field. It is an error to pass a string which does not conform to the RFC 3987 grammar for that component. The set-iri-scheme! procedure is undefined for relative references, and it is an error to call it on a relative reference.

Procedure: (set-iri-authority! ident user host port): iri → #f | string? → string? → fixnum? → ()

Set the user, host and port authority components in-place.

URI programming interface

Library: (srfi 275 uri)
Library: (srfi 275 uri in-place)

The programming interface for URIs is identical to that for IRIs, with procedure names and error messages referring to uri instead of iri. The structure of a URI or URI-relative reference is identical to that of an IRI or IRI-relative reference. The interfaces differ only in the character set permissible within a URI: string->uri signals an error when given a character outside that set, as do setters like set-uri-host!.

Relative reference resolution

Library: (srfi 275 normalise)

Procedure: (resolve-iri-reference base ref): non-relative-iri → iri → non-relative-iri

Relative reference resolution against a base IRI. An error is signalled if the base IRI is a relative reference. Relative reference resolution is, however, defined for both IRIs and relative references, although it is unusual to resolve a non-relative reference. This procedure is pure-functional as it (usually) involves transforming a relative reference into an IRI.

Simple examples derived from the RDF Turtle test cases (see full test cases at end of document):

(define string-cases (list "g:h" "g" "./g" "g/" "/g" "//g"))

(define base01 (string->iri "http://a/bb/ccc/d;p?q"))
(define base02 (string->iri "http://a/bb/ccc/d/"))
(define base07 (string->iri "file:///a/bb/ccc/d;p?q"))

(define (resolve-with base-iri) ;; Higher-order.  Returns a function.
  (lambda (ref)
    (resolve-iri-reference base-iri (string->iri ref))))

(map (resolve-with base01) string-cases)
→ (list <g:h>
        <http://a/bb/ccc/g> <http://a/bb/ccc/g> <http://a/bb/ccc/g/>
        <http://a/g> <http://g>)

(map (resolve-with base02) string-cases)
→ (list <g:h>
        <http://a/bb/ccc/d/g> <http://a/bb/ccc/d/g> <http://a/bb/ccc/d/g/>
        <http://a/g> <http://g>)

(map (resolve-with base07) string-cases)
→ (list <g:h>
        <file:///a/bb/ccc/g> <file:///a/bb/ccc/g> <file:///a/bb/ccc/g/>
        <file:///g> <file://g>)
  
Procedure: (resolve-uri-reference base ref): non-relative-uri → uri → non-relative-uri

Relative reference resolution against a base URI. Behaviour is structurally identical to that of relative reference resolution for IRIs, as is the error behaviour with respect to base URI.

Normalisation

Library: (srfi 275 normalise)

Procedure: (normalise-iri-case ident): iri → iri
Procedure: (normalise-uri-case ident): uri → uri

Normalise case-insensitive components (scheme and host), and convert escaped characters (percent-encodings) to canonical upper-case. These procedures are pure-functional, returning the new identifier. These procedures are structurally identical for both IRIs and URIs, the only difference being that they signal an error if the argument is not an IRI or URI respectively. These procedures are also well-defined for IRI and URI-relative references respectively.

Procedure: (normalise-iri-escape ident): iri → iri

Normalise an IRI or relative reference's escapes (percent-encodings), interpreting escapes as characters where they form part of a valid UTF-8 octet sequence. Conversely, characters which are not permissible within an IRI component will be encoded as a series of escapes corresponding to UTF-8 octets. This procedure is idempotent: a fully normalised IRI or relative reference will be normalised to itself, and iri-equal? will hold true between the two. This procedure is pure-functional and structure-preserving.

Procedure: (normalise-uri-escape ident): uri → uri

Normalise an URI or relative reference's escapes (percent-encodings), potentially interpreting escapes as octets in U.S. ASCII. Conversely, characters which are not permissible within an URI component will be encoded as a series of escapes corresponding to UTF-8 octets. This procedure is pure-functional and structure-preserving.

Procedure: (normalise-iri-path-segments ident): iri → iri
Procedure: (normalise-uri-path-segments ident): uri → uri

Normalise path segments by eliminating the control segments . and .. (single-dot and double-dot). Because path segments of IRI or URI-relative references are not normalised, these procedures simply have no effect on relative references, and no error is signalled, for parity with the other normalisation procedures. While it is relatively uncommon for relative paths to appear in non-relative IRIs or URIs, they are permissible e.g. within URN components. These procedures are pure-functional and structure-preserving.

Procedure: (normalise-iri ident): iri → iri
Procedure: (normalise-uri ident): uri → uri

These procedures apply the three normalisation procedures described previously in a fixed order: escapes first, then case, then path segments. These procedures are pure-functional and structure-preserving.

Library: (srfi 275 normalise in-place)

Procedure: (normalise-iri-case! ident): iri → ()
Procedure: (normalise-uri-case! ident): uri → ()

In-place variants of normalise-iri-case and normalise-uri-case.

Procedure: (normalise-iri-escape! ident): iri → ()
Procedure: (normalise-uri-escape! ident): uri → ()

In-place variants of normalise-iri-escape and normalise-uri-escape.

Procedure: (normalise-iri-path-segments! ident): iri → ()
Procedure: (normalise-uri-path-segments! ident): uri → ()

In-place variants of normalise-iri-path-segments and normalise-uri-path-segments.

Procedure: (normalise-iri! ident): iri → ()
Procedure: (normalise-uri! ident): uri → ()

In-place variants of normalise-iri and normalise-uri.

Equivalence and conversion

Library: (srfi 275 iri)

Procedure: (string->iri str): string → iri
Procedure: (iri->string ident): iri → string

Conversion from string to IRI or relative reference, and vice versa. string->iri is the fundamental constructor of IRIs and relative references.

Examples:

(define example-A (string->iri "http://example.org/some/where/place"))
(define example-B (string->iri "urn:/some/where/place"))

(iri->string example-A) → "http://example.org/some/where/place"
(iri->string example-B) → "urn:/some/where/place"
  

Library: (srfi 275 uri)

Procedure: (string->uri str): string → uri
Procedure: (uri->string ident): uri → string

Conversion from string to URI or relative reference, and vice versa. string->uri is the fundamental constructor of URIs and relative references. See the above examples for IRIs.

Library: (srfi 275 normalise)

Procedure: (iri-eqv? A B): iri → iri → boolean
Procedure: (uri-eqv? A B): uri → uri → boolean

Two IRIs are equivalent if iri-equal? holds, or, post-normalisation with normalise-iri, iri-equal? holds. Similarly, two URIs are equivalent if uri-equal? holds, or, post-normalisation with normalise-uri, uri-equal? holds. It is an error to call iri-eqv? when either argument is not an IRI or relative reference. Similarly, it is an error to call uri-eqv? when either argument is not an URI or relative reference.

Procedure: (iri->uri ident): iri → uri

Convert an IRI to an URI. This proceeds by encoding any character within certain ranges (see RFC 3987 ABNF ucschar and iprivate) to a series of escapes corresponding to those octets in UTF-8. This URI is also a valid IRI (albeit not normalised) as all URIs are valid IRIs. This procedure is structure-preserving: (non-relative) IRIs are never transformed into relative references, or vice-versa. It is an error to call this procedure where the argument is not an IRI or relative reference.
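The escaping step can be sketched on a plain string in R7RS Scheme. As a simplifying assumption, the sketch escapes every character above U.S. ASCII rather than testing the exact ucschar and iprivate ranges, and it ignores component structure; escape-non-ascii is a hypothetical name, not part of this SRFI.

```scheme
(import (scheme base))

;; Upper-case hexadecimal digit for N in 0..15.
(define (hex-digit n)
  (string-ref "0123456789ABCDEF" n))

;; Percent-encode every non-ASCII character of STR as its UTF-8 octets.
(define (escape-non-ascii str)
  (let loop ((chars (string->list str)) (acc '()))
    (cond ((null? chars)
           (apply string-append (reverse acc)))
          ((< (char->integer (car chars)) 128)
           (loop (cdr chars) (cons (string (car chars)) acc)))
          (else
           (let* ((octets (string->utf8 (string (car chars))))
                  (escapes
                   (let octet-loop ((i 0) (out '()))
                     (if (= i (bytevector-length octets))
                         (apply string-append (reverse out))
                         (let ((b (bytevector-u8-ref octets i)))
                           (octet-loop
                            (+ i 1)
                            (cons (string #\% (hex-digit (quotient b 16))
                                          (hex-digit (remainder b 16)))
                                  out)))))))
             (loop (cdr chars) (cons escapes acc)))))))

;; U+00EA is e with circumflex, as in the host test case of the test suite.
(escape-non-ascii (string #\c #\r (integer->char #xEA) #\p #\e #\s))
;; → "cr%C3%AApes"
```

The example character is built with integer->char so that the result does not depend on the encoding of the source file.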

Procedure: (uri->iri ident): uri → iri

Convert an URI to an IRI. This procedure can be viewed as upgrading the URI structure to that of an IRI, then normalising it as an IRI. It is an error to call this procedure where the argument is not an URI or relative reference.

Test suite

In this section, we describe various test cases and the specific behaviour they evaluate. Because RFC 3986 and 3987 only provide examples for relative reference resolution, it is important to specify the exact behaviour of a correct implementation, especially with respect to normalisation. It is expected that these test cases could be the basis of a more comprehensive property-based test suite, e.g. test the entire range of reserved and unreserved characters in a particular URI component.

URI case normalisation test cases (normalise-uri-case)

All lower-case preserved
<http://example.org/ex#test> → <http://example.org/ex#test>
Mixed-case scheme to lower-case
<HttP://example.org/ex#test> → <http://example.org/ex#test>
Mixed-case user preserved
<http://MySelf@example.org/Examp#test> → <http://MySelf@example.org/Examp#test>
Mixed-case host to lower-case
<http://Example.ORG/ex#test> → <http://example.org/ex#test>
Mixed case path preserved
<http://example.org/Examp#test> → <http://example.org/Examp#test>
Mixed case query preserved
<http://example.org/examp?Qua#test> → <http://example.org/examp?Qua#test>
Mixed case fragment preserved
<http://example.org/examp#TeSt> → <http://example.org/examp#TeSt>
User percent-encodings to upper-case
<http://%aA@%AA%AB%AC%AD%AE/some/where/place> → <http://%AA@%AA%AB%AC%AD%AE/some/where/place>
Host percent-encodings to upper-case
<http://%aa%Ab%AC%aD%AE/some/where/place> → <http://%AA%AB%AC%AD%AE/some/where/place>
Path percent-encodings to upper-case
<http://myname@example.org/%Fa/%FB/%fC> → <http://myname@example.org/%FA/%FB/%FC>
Query percent-encodings to upper-case
<http://myname@example.org/%FA/%FB/%FC?%ff> → <http://myname@example.org/%FA/%FB/%FC?%FF>
Fragment percent-encodings to upper-case
<http://myname@example.org/%FA/%FB/%FC#%ff> → <http://myname@example.org/%FA/%FB/%FC#%FF>

IRI case normalisation test cases (normalise-iri-case)

Host non-U.S. ASCII is case-sensitive
<http://CRÊPES.example.org> → <http://crÊpes.example.org>
Note: there is no equivalent test for the scheme, as that only contains ASCII even in IRIs.

URI escape (percent-encoding) normalisation test cases (normalise-uri-escape)

User reserved character not escaped
<http://my!name@example.org/ex#test> → <http://my!name@example.org/ex#test>
Host reserved character not escaped
<http://myname@!example.org/ex#test> → <http://myname@!example.org/ex#test>
Path reserved character not escaped
<http://myname@example.org/ex!#test> → <http://myname@example.org/ex!#test>
Query reserved character not escaped
<http://myname@example.org/ex?!a#test> → <http://myname@example.org/ex?!a#test>
Fragment reserved character not escaped
<http://myname@example.org/ex?a#!test> → <http://myname@example.org/ex?a#!test>
User reserved escape not decoded
<http://my%40name@example.org/ex#test> → <http://my%40name@example.org/ex#test>
Host reserved escape not decoded
<http://myname@ex%40ample.org/ex#test> → <http://myname@ex%40ample.org/ex#test>
Path reserved escape not decoded
<http://myname@example.org/e%40x#test> → <http://myname@example.org/e%40x#test>
Query reserved escape not decoded
<http://myname@example.org/ex?a%40#test> → <http://myname@example.org/ex?a%40#test>
Fragment reserved escape not decoded
<http://myname@example.org/ex?a#t%40est> → <http://myname@example.org/ex?a#t%40est>
User permissible escape decoded
<http://my%2Ename@example.org/ex?a#test> → <http://my.name@example.org/ex?a#test>
Host permissible escape decoded
<http://myname@example%2Eorg/ex?a#test> → <http://myname@example.org/ex?a#test>
Path permissible escape decoded
<http://myname@example.org/misc%2Etxt#test> → <http://myname@example.org/misc.txt#test>
Query permissible escape decoded
<http://myname@example.org/misc.txt?%2E%2E%2E> → <http://myname@example.org/misc.txt?...>
Fragment permissible escape decoded
<http://myname@example.org/misc.txt#line%31%30> → <http://myname@example.org/misc.txt#line10>
User illegal characters are escaped
<http://dosh£@crepes.example.org> → <http://dosh%C2%A3@crepes.example.org>
Host illegal characters are escaped
<http://crêpes.example.org> → <http://cr%C3%AApes.example.org>
Path illegal characters are escaped
<http://crepes.example.org/in/Rhône> → <http://crepes.example.org/in/Rh%C3%B4ne>
Query illegal characters are escaped
<http://crepes.example.org/in/Rennes?Dim.‥Sam.> → <http://crepes.example.org/in/Rennes?Dim.%E2%80%A5Sam.>
Fragment illegal characters are escaped
<http://crepes.example.org/in/Rennes#L'Étage> → <http://crepes.example.org/in/Rennes#L'%C3%89tage>

IRI escape normalisation test cases (normalise-iri-escape)

User permissible escape decoded
<http://dosh%C2%A3@crepes.example.org><http://dosh£@crepes.example.org>
Host permissible escape decoded
<http://cr%C3%AApes.example.org><http://crêpes.example.org>
Path illegal characters are escaped
<http://crepes.example.org/in/Rh%C3%B4ne><http://crepes.example.org/in/Rhône>
Query illegal characters are escaped
<http://crepes.example.org/in/Rennes?Dim.%E2%80%A5Sam.><http://crepes.example.org/in/Rennes?Dim.‥Sam.>
Fragment permissible escape decoded
<http://crepes.example.org/in/Rennes#L'%C3%89tage><http://crepes.example.org/in/Rennes#L'Étage>
Wholly escaped wholly decoded
<https://en.wiktionary.org/wiki/%E1%BF%AC%CF%8C%CE%B4%CE%BF%CF%82><https://en.wiktionary.org/wiki/Ῥόδος>
Partially normalised wholly decoded
<https://example.org/music/%C3%89irigh'sCuirOrtDoChuid%C3%89adaigh><https://example.org/music/Éirigh'sCuirOrtDoChuidÉadaigh>
Wholly normalised not reencoded
<https://en.wiktionary.org/wiki/Ῥόδος><https://en.wiktionary.org/wiki/Ῥόδος>
Repeated normalisation (idempotence)
<https://en.wiktionary.org/wiki/Ῥόδος><https://en.wiktionary.org/wiki/Ῥόδος> In this test case, normalise-iri-escape should be called multiple times, i.e. (compose normalise-iri-escape normalise-iri-escape)
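
The IRI escape cases above, including the idempotence case, can be illustrated with a short sketch. It is written in Python rather than Scheme purely for illustration, and normalise_iri_escape_sketch is a hypothetical name, not the procedure this SRFI defines. The sketch assumes that an escape run is decoded wherever it forms a well-formed UTF-8 sequence or an unreserved ASCII character, that every other escape is kept and re-emitted in upper-case hex, and it ignores RFC 3987's restrictions on which code points may appear literally.

```python
import re

# Unreserved characters per RFC 3986, section 2.3.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)

_ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _utf8_length(lead):
    """Length of the multi-octet UTF-8 sequence a lead octet starts, or 0."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def _decode_run(match):
    octets = bytes.fromhex(match.group(0).replace("%", ""))
    out, i = [], 0
    while i < len(octets):
        n = _utf8_length(octets[i])
        try:
            if n == 0 or i + n > len(octets):
                raise ValueError("not a complete multi-octet sequence")
            # Strict decoding rejects overlong forms and surrogates.
            out.append(octets[i:i + n].decode("utf-8"))
            i += n
        except ValueError:
            # Single octet: decode only if it is an unreserved ASCII character.
            ch = chr(octets[i])
            out.append(ch if ch in UNRESERVED else f"%{octets[i]:02X}")
            i += 1
    return "".join(out)


def normalise_iri_escape_sketch(iri):
    """Hypothetical helper; not the SRFI's normalise-iri-escape."""
    return _ESCAPE_RUN.sub(_decode_run, iri)
```

Applying the function to its own output leaves the result unchanged, which is the property the repeated normalisation case checks.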

Path normalisation test cases (normalise-uri-path-segments and normalise-iri-path-segments)

Absolute path without dotted segments unchanged
<http://example.org/some/where/place><http://example.org/some/where/place>
No authority absolute path without dotted segments unchanged
<urn:/some/where/place><urn:/some/where/place>
Relative path without dotted segments unchanged
<urn:some/where/place><urn:some/where/place>
Absolute path eliminates single dotted segments
<urn:/some/./where/././place/./><urn:/some/where/place/>
Relative path eliminates single dotted segments
<urn:some/./where/././place/./><urn:some/where/place/>
Absolute path empty segments treated like non-empty
<urn:/some//where//place//><urn:/some//where//place//>
Relative path empty segments treated like non-empty
<urn:some//where//place//><urn:some//where//place//>
Single leading slash not normalised
</></>
Multiple leading slashes not normalised
<//><//>
Absolute path reference not normalised (double-dot)
</a/b/../../c></a/b/../../c>
Absolute path reference not normalised (single-dot)
</a/b/././c></a/b/././c>
Absolute path reference not normalised (mixed dotted)
</a/b/../c/././d></a/b/../c/././d>
Relative path reference not normalised (double-dot)
<a/b/../../c><a/b/../../c>
Relative path reference not normalised (single-dot)
<a/b/././c><a/b/././c>
Relative path reference not normalised (mixed dotted)
<a/b/../c/././d><a/b/../c/././d>
1-segment path reference with leading dot not normalised
<./def><./def>
1-segment path reference with leading dot not normalised (colon)
<./abc:def><./abc:def>
Additional relative path case not normalised
<../../abc/./def><../../abc/./def>
Relative path must not be normalised to an absolute path
<foo:a/b/../.././../../e><foo:e> From Haskell network-uri [3]
Empty segments eliminated like non-empty (1)
<http://example.com////../..><http://example.com//> From Webkit [5]
Empty segments eliminated like non-empty (2)
<http://example.com/foo/bar//../..><http://example.com/foo/> From Webkit [5]
Empty segments eliminated like non-empty (3)
<http://example.com/foo/bar//..><http://example.com/foo/bar/> From Webkit [5]
General test case 1
<http://example/a/b/../../c><http://example/c> From Haskell network-uri [3]
General test case 2
<http://example/a/b/c/../../><http://example/a/> From Haskell network-uri [3]
General test case 3
<http://example/a/b/c/./><http://example/a/b/c/> From Haskell network-uri [3]
General test case 4
<http://example/a/b/c/.././><http://example/a/b/> From Haskell network-uri [3]
General test case 5
<http://example/a/b/c/d/../../../../e><http://example/e> From Haskell network-uri [3]
General test case 6
<http://example/a/b/c/d/../.././../../e><http://example/e> From Haskell network-uri [3]
General test case 7
<http://example/a/b/../.././../../e><http://example/e> From Haskell network-uri [3]
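
The path cases above can be reproduced with a segment-based pass over the path component. The sketch below is in Python rather than Scheme, purely for illustration, and remove_dot_segments_sketch is a hypothetical name. It takes only the path of a reference that has a scheme; the caller is assumed to leave references without a scheme untouched, as the "not normalised" cases require. It has been checked only against the paths in the table above, so other inputs (for example a relative path left with a leading empty segment) are outside what it claims to handle.

```python
def remove_dot_segments_sketch(path):
    """Hypothetical helper: drop "." and ".." segments from a path string.

    A relative path stays relative: ".." segments that would climb above
    the first segment are discarded rather than producing a leading "/".
    """
    absolute = path.startswith("/")
    segments = path.split("/")
    if absolute:
        segments = segments[1:]  # drop the empty string before the leading "/"
    kept = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == ".":
            pass
        elif segment == "..":
            if kept:
                kept.pop()  # empty segments are popped like any other segment
        else:
            kept.append(segment)
            continue
        if is_last:
            kept.append("")  # a trailing dot segment leaves a trailing slash
    return ("/" if absolute else "") + "/".join(kept)
```

Treating the path as a list of segments, rather than rewriting string prefixes as in RFC 3986 section 5.2.4, is a design choice of this sketch: it makes the "relative path must not be normalised to an absolute path" case fall out directly, since the leading slash is decided once from the input.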

URI to IRI conversion (uri->iri)

Wholly escaped path normalised
<https://en.wiktionary.org/wiki/%E1%BF%AC%CF%8C%CE%B4%CE%BF%CF%82><https://en.wiktionary.org/wiki/Ῥόδος>
Partially escaped path normalised
<https://example.org/ceol/%C3%89irigh'sCuirOrtDoChuid%C3%89adaigh><https://example.org/ceol/Éirigh'sCuirOrtDoChuidÉadaigh>

IRI to URI conversion (iri->uri)

Wholly escapable path escaped
<https://en.wiktionary.org/wiki/Ῥόδος><https://en.wiktionary.org/wiki/%E1%BF%AC%CF%8C%CE%B4%CE%BF%CF%82>
Partially escapable path escaped
<https://example.org/ceol/Éirigh'sCuirOrtDoChuidÉadaigh><https://example.org/ceol/%C3%89irigh'sCuirOrtDoChuid%C3%89adaigh>

Parsing expected segments

scheme user host port path query fragment
Empty URI: <>
N/A #f #f #f #f #f #f
Empty authority: <//>
N/A #f "" #f #f #f #f
Empty user: <//@>
N/A "" #f #f #f #f #f
Empty port: <//:>
N/A #f "" #f #f #f #f
Empty query: <?>
N/A #f #f #f #f "" #f
Empty fragment: <#>
N/A #f #f #f #f #f ""
Path which looks like a hostname: <example.org>
N/A #f #f #f "example.org" #f #f
URN-like: <urn:something>
"urn" #f #f #f "something" #f #f
URN-like, path looks like hostname: <urn:example.org>
"urn" #f #f #f "example.org" #f #f
Path which looks like a URN: <./urn:something>
N/A #f #f #f "./urn:something" #f #f
User with colon segment: <http://a:b@c:29>
"http" "a:b" "c" 29 #f #f #f
User-like component appears as path: <http::@c:29>
"http" #f #f #f ":@c:29" #f #f
Host-like component appears as user: <http://example.org:b@d/>
"http" "example.org:b" "d" #f "/" #f #f
Padded port as numeric value: <http://example.org:000080>
"http" #f "example.org" 80 #f #f #f
Query component with question mark: <http://example.org/abcd?efgh?ijkl>
"http" #f "example.org" #f "/abcd" "efgh?ijkl" #f
Fragment component with question mark: <http://example.org/abcd#efgh?ijkl>
"http" #f "example.org" #f "/abcd" #f "efgh?ijkl"
Path where first segment looks like host: <http:///some/where/place>
"http" #f "" #f "/some/where/place" #f #f
Scheme with nil host: <foo:>
"foo" #f #f #f #f #f #f
Scheme with path, empty host: <foo:////g>
"foo" #f "" #f "//g" #f #f
Scheme with path, nil host: <foo:.///g>
"foo" #f #f #f ".///g" #f #f
Scheme with non-empty host: <foo://g>
"foo" #f "g" #f #f #f #f
All components filled out: <http://user@example.org:80/some/where/place?qua#ought>
"http" "user" "example.org" 80 "/some/where/place" "qua" "ought"
All components except user filled out: <http://example.org:80/some/where/place?qua#ought>
"http" #f "example.org" 80 "/some/where/place" "qua" "ought"
All components except host filled out: <http://user@:80/some/where/place?qua#ought>
"http" "user" #f 80 "/some/where/place" "qua" "ought"
All components except port filled out: <http://user@example.org/some/where/place?qua#ought>
"http" "user" "example.org" #f "/some/where/place" "qua" "ought"
All components except path filled out: <http://user@example.org:80?qua#ought>
"http" "user" "example.org" 80 #f "qua" "ought"
All components except query filled out: <http://user@example.org:80/some/where/place#ought>
"http" "user" "example.org" 80 "/some/where/place" #f "ought"
All components except fragment filled out: <http://user@example.org:80/some/where/place?qua>
"http" "user" "example.org" 80 "/some/where/place" "qua" #f
Empty host, nil user/port: <http:///some/where/place?qua#ought>
"http" #f "" #f "/some/where/place" "qua" "ought"
Empty user, nil host/port: <http://@/some/where/place?qua#ought>
"http" "" #f #f "/some/where/place" "qua" "ought"
Empty port implies empty host: <http://:/some/where/place?qua#ought>
"http" #f "" #f "/some/where/place" "qua" "ought"
Relative reference, path, empty host: <////g>
N/A #f "" #f "//g" #f #f
Relative reference, path, nil host: <.///g>
N/A #f #f #f ".///g" #f #f
Relative reference, non-empty host: <//g>
N/A #f "g" #f #f #f #f
Path which looks like a query: <./p=q:r>
N/A #f #f #f "./p=q:r" #f #f
Relative reference, all components filled out: <//user@example.org:80/some/where/place?qua#ought>
N/A "user" "example.org" 80 "/some/where/place" "qua" "ought"
Relative reference, all components except user filled out: <//example.org:80/some/where/place?qua#ought>
N/A #f "example.org" 80 "/some/where/place" "qua" "ought"
Relative reference, all components except host filled out: <//user@:80/some/where/place?qua#ought>
N/A "user" #f 80 "/some/where/place" "qua" "ought"
Relative reference, all components except port filled out: <//user@example.org/some/where/place?qua#ought>
N/A "user" "example.org" #f "/some/where/place" "qua" "ought"
Relative reference, all components except path filled out: <//user@example.org:80?qua#ought>
N/A "user" "example.org" 80 #f "qua" "ought"
Relative reference, all components except query filled out: <//user@example.org:80/some/where/place#ought>
N/A "user" "example.org" 80 "/some/where/place" #f "ought"
Relative reference, all components except fragment filled out: <//user@example.org:80/some/where/place?qua>
N/A "user" "example.org" 80 "/some/where/place" "qua" #f
Relative reference empty host, nil user/port: <///some/where/place?qua#ought>
N/A #f "" #f "/some/where/place" "qua" "ought"
Relative reference empty user, nil host/port: <//@/some/where/place?qua#ought>
N/A "" #f #f "/some/where/place" "qua" "ought"
Relative reference empty port implies empty host: <//:/some/where/place?qua#ought>
N/A #f "" #f "/some/where/place" "qua" "ought"
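
The expected components above can be approximated by the generic splitting expression from RFC 3986, Appendix B, followed by a split of the authority into user, host and port. The sketch below is in Python rather than Scheme, purely for illustration; parse_reference_sketch is a hypothetical name and the sample implementation does not work this way. None stands in for both #f and N/A. The rule that an empty host is reported as "" only when neither user nor port is present is inferred from the table rows, and inputs the table does not list (such as <//:80>) are answered by assumption. No validation of the components is attempted.

```python
import re

# Generic component-splitting expression from RFC 3986, Appendix B.
_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?", re.DOTALL
)
# Host followed by an optional ":port"; the lazy host keeps IPv6 colons intact.
_HOSTPORT = re.compile(r"(.*?)(?::([0-9]*))?", re.DOTALL)


def parse_reference_sketch(text):
    """Return (scheme, user, host, port, path, query, fragment).

    Hypothetical helper; None stands in for the table's #f and N/A.
    """
    match = _REFERENCE.match(text)  # always matches: every group is optional
    scheme, authority, path, query, fragment = match.group(2, 4, 5, 7, 9)
    user = host = port = None
    if authority is not None:
        hostport = authority
        if "@" in authority:
            user, _, hostport = authority.rpartition("@")
        host_text, port_text = _HOSTPORT.fullmatch(hostport).group(1, 2)
        # An empty or absent port is reported as None; "000080" becomes 80.
        port = int(port_text) if port_text else None
        # Assumption inferred from the table: an empty host is "" only when
        # nothing else (user or port) records that an authority was present.
        if host_text or (user is None and port is None):
            host = host_text
    return (scheme, user, host, port, path or None, query, fragment)
```

Returning a plain tuple keeps the sketch comparable with the rows of the table, column for column; an implementation of this SRFI would instead populate its record types.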

Implementation

The sample implementation is written in R6RS Scheme. It is portable apart from two dependencies: it uses (chibi parse) for parsing, and it imports SRFIs from the Chez-SRFI grab-bag.

References

Acknowledgements

© 2026 Duncan Guthrie.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice (including the next paragraph) shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Editor: Arthur A. Gleckler