This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.
    > From: bear <bear@xxxxxxxxx>

    > I can treat SCHEME_EXTRACT_STRING as a request to make a copy of
    > a string in a format acceptable to C code in some buffer,
    > register it as "floating garbage" with the GC, return a pointer
    > to that buffer, and lock the garbage collector until the scheme
    > runtime is reentered.  If the C code just wants to _read_ what's
    > there, that's necessary and sufficient.  [....]

    > What's missing is an explicit declaration that it is unspecified
    > whether or not values written into the buffer pointed at by the
    > result of SCHEME_EXTRACT_STRING mutate the scheme string that
    > was originally referred to,

Interesting conclusion.  I conclude that EXTRACT must allocate string
data which the C code must explicitly free.  I arrived at this through
a fairly systematic exploration of the design space (described below).

To motivate my design space exploration, let's observe that the draft
currently says:

    char * SCHEME_EXTRACT_STRING(scheme_value)
    scheme_value SCHEME_ENTER_STRING(char *) (may GC)

    SCHEME_EXTRACT_STRING returns a pointer to the actual storage
    used by the Scheme string. If this is the case[sic], the pointer
    is valid only until the next garbage collection. Note that this
    string may not be null-terminated; SCHEME_STRING_LENGTH returns
    the number of characters in the string.

Does SCHEME_STRING_SET! modify the data that C is seeing?  It would
seem so from "returns a pointer to the actual storage[...]" but in
Pika, STRING-SET! can relocate a string (as when promoting an 8-bit to
a 16-bit string representation).  UTF-8 implementations will sometimes
relocate strings when replacing a character with a character of a
different length encoding.

We have one design question with four possible answers:

  1) Is the extracted data shared with Scheme?

     1a) Yes, for reading and writing
     1b) Yes, for reading
     1c) Unspecified
     1d) No

and there is a dependent design question with six likely answers:

  2) Must C code "explicitly free" extracted string data?
     2a) Yes, using "free()"
     2b) Yes, using whatever function goes with the allocation
         function that C passes as a parameter to EXTRACT
     2c) Yes, using "SCHEME_STRING_DATA_FREE()" which is up to the
         FFI implementor to provide.
     2d) No, its lifetime is that of the Scheme string
     2e) No, its lifetime is up until the next GC point
     2f) No, its lifetime is up until the next execution of any FFI
         function which might mutate the string, including by GC.

So we start with 24 possible designs (4 * 6).

If you want to follow the arguments I give below carefully, I suggest
that you get out some graph paper and make a table:

          2a   2b   2c   2d   2e   2f

    1a       |    |    |    |    |
          ---|----|----|----|----|----
    1b       |    |    |    |    |
          ---|----|----|----|----|----
    1c       |    |    |    |    |
          ---|----|----|----|----|----
    1d       |    |    |    |    |

putting X's in boxes when you agree my arguments eliminate some
possible design and ?'s in the ones where you think my arguments are
bogus.  I'm going to try to argue you down to 23 boxes with X's and
one left blank -- a "first-principles" string FFI.  (I expect that
most people who actually do this will come up with some question marks
in their graph --- but those will be handy for organizing any
subsequent discussion.)

Not all 24 possibilities are coherent.  We can right away eliminate:

    1a + 2a
    1a + 2b     All would involve C code using a
    1b + 2a     non-FFI function to "free" string
    1b + 2b     data that does or might belong to
    1c + 2a     a Scheme object.
    1c + 2b

leaving 18.

We can cut out all 4 possibilities that involve (2d) because I think
everyone agrees that that unduly restricts GC and string
representations.  For example, GC would be forbidden from relocating
string data if that string data might ever have been EXTRACTed by C.
That leaves 14.

If strings are _not_ shared (1d), then surely string-data lifetime
must be explicitly managed (not 2e, 2f), leaving 12 designs.

In some quite plausible string representations (UTF-8, UTF-16, Pika's)
mutation to a string can change its length.
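
To make that last point concrete, here is a minimal sketch in C.  It is
not taken from the draft or from any Scheme implementation:
utf8_replace() is a hypothetical helper, and the choice to always build
a fresh buffer is an assumption made to keep the example short.  It
replaces one encoded character in a NUL-terminated UTF-8 buffer with
another whose encoding has a different byte length, so the data ends up
at a new address with a new size.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
 * Replaces the old_len bytes starting at byte offset `at` in the
 * heap-allocated, NUL-terminated UTF-8 string `s` with the bytes of
 * `repl`.  The result lives in a freshly allocated buffer; the old
 * buffer is freed, so any pointer into it becomes invalid. */
static char *utf8_replace(char *s, size_t at, size_t old_len,
                          const char *repl)
{
    size_t len = strlen(s);
    size_t rlen = strlen(repl);
    char *p;

    assert(at + old_len <= len);

    p = malloc(len - old_len + rlen + 1);
    if (p == NULL)
        return NULL;

    memcpy(p, s, at);                       /* bytes before the character */
    memcpy(p + at, repl, rlen);             /* the new encoding */
    memcpy(p + at + rlen, s + at + old_len, /* the tail, including NUL */
           len - at - old_len + 1);

    free(s);
    return p;
}
```

Replacing the one-byte "a" in "cat" with the two-byte sequence
"\xC3\xA9" turns a 3-byte string into a 4-byte one, and a C pointer
obtained before the replacement would now point at freed storage.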
Changing a string's length means its location in memory can change.
There may be hairy work-arounds but I think that these are reasons
enough to eliminate

    (1a+2c) because before the explicit free function is called
            the string may be mutated by Scheme.
    (1b+2c) for the same reason
    (1a+2e) because a string mutation between GC-points can change
            the string's length, hence location
    (1b+2e) for the same reason

leaving 8.

(1b+2f) has absolutely no advantage over (1c+2f).  From the point of
view of the FFI-user, they are operationally equivalent.  From the
point of view of the FFI-implementor, (1c+2f) leaves more
implementation options.  That leaves 7.

Similarly, (1d+2c) has absolutely no advantage over (1d+2a) or
(1d+2b).  If strings are _not_ shared and C must free them, then use
either free() or let the C code specify how they are allocated in the
first place.  There's no reason why the FFI should define how
non-shared string data is freed.  That leaves 6.  These are:

    1a+2f) r/w sharing, Scheme-mutation-bound lifetime
    1c+2c) unspecified sharing, FFI free function
    1c+2e) unspecified sharing, GC-point lifetime
    1c+2f) unspecified sharing, Scheme-mutation-bound lifetime
    1d+2a) no sharing, use free()
    1d+2b) no sharing, C controls allocation and freeing

I would argue that (1d+2b) is preferable to (1d+2a) because nothing
else in the FFI already depends on malloc()/free().  (One could make
the opposite decision, that 1d+2a is preferable to 1d+2b and the rest
of this would still apply.)  So that leaves 5:

    1a+2f) r/w sharing, Scheme-mutation-bound lifetime
    1c+2c) unspecified sharing, FFI free function
    1c+2e) unspecified sharing, GC-point lifetime
    1c+2f) unspecified sharing, Scheme-mutation-bound lifetime
    1d+2b) no sharing, C controls allocation and freeing

There is nearly no advantage to (1c+2e) compared to (1c+2f).
In (1c+2e), if a C program crosses a Scheme-mutation-point between
GC-points, while it can assume that the pointer to the string data
remains valid, it can make no assumptions about the contents of that
string data.  This isn't an absolute refutation of (1c+2e) but I would
argue that it is near enough as makes no difference.

An analogous argument applies to (1c+2c) compared to (1c+2f).  If C
passes a Scheme-mutation point before calling the FFI free function,
the data pointer may remain valid but its contents are unspecified.
Leaving:

    1a+2f) r/w sharing, Scheme-mutation-bound lifetime
    1c+2f) unspecified sharing, Scheme-mutation-bound lifetime
    1d+2b) no sharing, C controls allocation and freeing

(1a+2f), Scheme-mutation-bound r/w sharing, must be rejected as well.
This is because it constrains implementations to represent strings
internally in exactly the same format seen by C -- because between
Scheme mutation points, C may modify the string and that should be
apparent to Scheme code that _reads_ the string.  That leaves:

    1c+2f) unspecified sharing, Scheme-mutation-bound lifetime
    1d+2b) no sharing, C controls allocation and freeing

(1c+2f) requires us to make a distinction between functions in the FFI
similar to, but not necessarily identical to, the "may GC" distinction.
It requires us to distinguish "may mutate a string" functions.

Is there _any_ function in the FFI that we would not want to put in
the "may mutate a string" category?  I'm not so sure that there is.
In an Oaklisp-style implementation, for example, every FFI function
can result in the execution of arbitrary Scheme code.
Therefore, I think we can rephrase our remaining choices as:

    1c+2f) unspecified sharing, data lifetime bound by next FFI call
    1d+2b) no sharing, C controls allocation and freeing

yet that would make even this simple FFI code _incorrect_:

    /* Incorrect code: */
    s1 = SCHEME_EXTRACT_STRING (scheme_s1);
    s2 = SCHEME_EXTRACT_STRING (scheme_s2);

That every FFI function should be "may mutate a string" is, I think,
controversial but not dismissable.  So let's call this 33% of a reason
to reject (1c+2f).

The benefit of (1c+2f) compared to (1d+2b) is that it doesn't
_require_ allocation and copying of string data -- a potential
performance benefit that will be available to _some_ implementations.
But it is certain that _many_ (not all) uses of EXTRACT will be in a
context in which the potential performance benefit does not apply
because C will need the string data lifetime to cross string mutation
boundaries.  In those many cases, the C code will have to allocate
space and copy the string data anyway, eliminating the performance
advantage.  So let's call this another 33% of a reason to reject
(1c+2f).

Regardless of what _this_ SRFI does -- I think it certain that
sometime in the future we will want a portable FFI which permits (in
some form) r/w sharing of string data under constrained conditions.
The "internal interfaces for Pika" that I posted earlier are a good
example of what I think this should also look like in a portable FFI.
The future appearance of those functions is not guaranteed (but not
unlikely) -- and such appearance will eliminate nearly all remaining
benefits of (1c+2f).  Can we call this 34% of a reason?

So I think the choice is clear:

    1d+2b) no sharing, C controls allocation and freeing

That that answer is _also_ compatible with a GC-anytime and
async/concurrent-Scheme-code-permitted FFI is just a happy
non-coincidence.

-t
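
For readers who want something concrete, the (1d+2b) choice could be
sketched roughly as follows.  Everything named here is hypothetical and
is not part of the SRFI draft or of Pika: toy_scheme_string stands in
for a Scheme string whose internal format (16-bit units) differs from
what C sees, toy_alloc_fn is the allocation function that C passes in,
and narrowing non-ASCII units to '?' is a simplifying assumption rather
than a proposed encoding rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a Scheme string stored as 16-bit units. */
typedef struct {
    size_t length;
    const uint16_t *units;
} toy_scheme_string;

/* Allocation function supplied by the C caller (the "2b" part). */
typedef void *(*toy_alloc_fn)(size_t nbytes);

/* Copy the string's contents into storage obtained from the caller's
 * allocator and NUL-terminate it.  The copy is never shared with the
 * Scheme side (the "1d" part); the caller releases it with whatever
 * function matches `alloc`.  Returns NULL if allocation fails. */
static char *toy_extract_string(const toy_scheme_string *s,
                                toy_alloc_fn alloc)
{
    char *buf;
    size_t i;

    assert(s != NULL && alloc != NULL);

    buf = alloc(s->length + 1);
    if (buf == NULL)
        return NULL;

    for (i = 0; i < s->length; i++) {
        uint16_t u = s->units[i];
        /* Simplification: anything outside ASCII becomes '?'. */
        buf[i] = (u < 128) ? (char)u : '?';
    }
    buf[s->length] = '\0';
    return buf;
}
```

A caller that passes malloc as the allocator would later call free() on
the result.  Because the data is a private copy, extracting two strings
back to back, or mutating the Scheme string afterwards, leaves pointers
already held by C untouched.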