[Pharo-dev] Better management of encoding of environment variables

Guillermo Polito guillermopolito at gmail.com
Fri Jan 18 08:23:38 EST 2019


On Fri, Jan 18, 2019 at 1:48 PM Ben Coman via Pharo-dev <
pharo-dev at lists.pharo.org> wrote:

> On Wed, 16 Jan 2019 at 18:37, Sven Van Caekenberghe <sven at stfx.eu> wrote:
>
>> Still, one of the conclusions of previous discussions about the encoding
>> of environment variables was/is that there is no single correct solution.
>> OS's are not consistent in how the encoding is done in all (historical)
>> contexts (like sometimes,
>
>
>
>> 1 env var defines the encoding to use for others,
>
>
> ouch.  That one point nearly made me retract my comment in the next paragraph,
> but is there much more complexity?
> Or is it just a case of  utf8<==>appSpecificEncoding  rather than
> ascii<==>appSpecificEncoding ?
>

It's not muuuuch more complex. The problem is that usually the bugs that
arise from wrongly managing such conversions can be super obscure.
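
To illustrate why such bugs are obscure, here is a minimal sketch in Python
(chosen only for illustration; the helper name decode_env_value and the sample
value are hypothetical, not part of any Pharo or VM API). Decoding UTF-8 bytes
with the wrong encoding raises no error, and pure-ASCII values look fine, so
the problem only surfaces once a non-ASCII character shows up.

```python
def decode_env_value(raw: bytes, encoding: str) -> str:
    """Decode the raw bytes of an environment variable with a given encoding."""
    return raw.decode(encoding)


# Hypothetical raw values, as the OS might hand them over in UTF-8.
ascii_raw = "PATH".encode("utf-8")
accented_raw = "café".encode("utf-8")

# Pure ASCII survives a wrong guess, so the mistake stays hidden.
ascii_ok = decode_env_value(ascii_raw, "utf-8")
ascii_guess = decode_env_value(ascii_raw, "latin-1")

# Non-ASCII is silently corrupted: no exception, just different text.
accented_ok = decode_env_value(accented_raw, "utf-8")
accented_guess = decode_env_value(accented_raw, "latin-1")

print(ascii_ok == ascii_guess)        # the two decodings agree
print(accented_ok == accented_guess)  # the two decodings disagree
```

The second comparison is the one that bites: the wrongly decoded string is one
character longer and compares unequal to the intended one, yet nothing fails at
the point where the conversion happens.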


> Sorry if I'm rehashing past discussion (do you have a link?), but
> considering...
> * 92% of web pages are UTF8 encoded[1] such that pragmatically UTF8 *is*
> the standard for text
> * Strings are so pervasive in a system
> ...would there be an overall benefit to adopt UTF8 as the encoding for
> Strings
> consistently provided across the cross-platform vm interface?
> (i.e. fixing platforms that don't comply with the standard due to their
> historical baggage)
>
> And I found it interesting Microsoft are making some moves towards UTF8
> [2]...
> "With insider build 17035 and the April 2018 update (nominal build 17134)
> for Windows 10, a "Beta: Use Unicode UTF-8 for worldwide language support"
> checkbox appeared for setting the locale code page to UTF-8.[a] This allows
> for calling "narrow" functions, including fopen and SetWindowTextA, with
> UTF-8 strings. "
>
> The approach vm-side could be similar to Section 10 How to do text on
> Windows [3]
> with the philosophy of "performing the [conversions] as close to API calls
> as possible,
> and never holding the [converted] data."
>
> [1]
> https://w3techs.com/technologies/history_overview/character_encoding/ms/y
> [2] https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows
> [3] http://utf8everywhere.org/
>
>
>> different applications do different things, and other such nice stuff),
>> and certainly not across platforms.
>>
>> So this is really complex.
>>
>> Do we want to hide this in some obscure VM C code that very few people
>> can see, read, let alone help with ?
>>
>> The image side is perfectly capable of dealing with platform differences
>> in a clean/clear way, and at least we can then use the full power of our
>> language and our tools.
>>
>
> Big question... Do we currently have primitives of the same name returning
> different encodings on different platforms?  I presume that would be
> awkward.
> If the image is to handle encoding differences, should separate primitives be
> used? e.g. utf8GetEnv & utf16getEnv
>
> Could I get some feedback on [4] saying... "**The Single Most Important
> Fact About Encodings**
> If you completely forget everything I just explained, please remember one
> extremely important fact.
> It does not make sense to have a string without knowing what encoding it
> uses."
>
> And so... does our String nowadays require an 'encoding' instance variable
> such that this is *always* associated?
> This might remove any need for separate utf8GetEnv & utf16getEnv (if that
> was even a reasonable idea).
>

I think that will just overcomplicate things. Right now, all Strings in
Pharo are Unicode strings. Characters are represented by their
corresponding Unicode code point.
If all characters in a string have code points < 256, the string is
stored as a ByteString. Otherwise it is a WideString.

I think assuming a single representation for strings, and then encoding when
interacting with external apps/APIs, is MUCH simpler.
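
The idea above can be sketched as follows. This is Python rather than Pharo,
purely for illustration, and every name in it (storage_class, to_external,
from_external) is hypothetical. It mimics the rule described above: strings
are sequences of code points internally, the storage class depends only on the
largest code point, and an encoding is chosen only at the boundary with the
outside world.

```python
def storage_class(s: str) -> str:
    """Mimic the rule above: all code points < 256 fit a byte-per-character string."""
    return "ByteString" if all(ord(c) < 256 for c in s) else "WideString"


def to_external(s: str, encoding: str = "utf-8") -> bytes:
    """Encode only at the boundary, right before handing data to an external API."""
    return s.encode(encoding)


def from_external(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode only at the boundary, right after receiving data from an external API."""
    return raw.decode(encoding)


narrow = "café"   # every code point is below 256
wide = "日本語"    # code points above 255

print(storage_class(narrow), storage_class(wide))

# The same internal string can be encoded differently per platform API,
# e.g. UTF-8 for POSIX-style calls and UTF-16 for Windows *W functions.
utf8_bytes = to_external(narrow, "utf-8")
utf16_bytes = to_external(narrow, "utf-16-le")
print(utf8_bytes, utf16_bytes)
```

Note that the storage class says nothing about an external encoding: "café"
fits the byte-per-character form internally, yet its UTF-8 form is five bytes
long, which is why the conversion still has to happen at the boundary.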


> cheers -ben
>
> [4]
> https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
>
>
>
>> > On 16 Jan 2019, at 10:59, Guillermo Polito <guillermopolito at gmail.com>
>> wrote:
>> >
>> > Hi Nicolas,
>> >
>> > On Wed, Jan 16, 2019 at 10:25 AM Nicolas Cellier <
>> nicolas.cellier.aka.nice at gmail.com> wrote:
>> > IMO, the Windows VM (and plugins) should do the UCS2 -> UTF8 conversion
>> because the purpose of a VM is to provide an OS-independent façade.
>> > I made progress recently in this area, but we should finish the
>> job/test/consolidate.
>> >
>> > I'm following your changes for windows from the shadows and I think
>> they are awesome :).
>> >
>> > If someone bypasses the VM and uses the Windows API directly through FFI,
>> then they take the responsibility, but uniformity doesn't hurt.
>> >
>> > So far we are using FFI for this: as you say, we first create
>> Win32WideStrings from UTF-8 strings and then we use FFI calls to the *W
>> functions.
>> > I don't think we can make it for Pharo7.0.0. The cycle to build, do
>> some acceptance tests, and then bless a new VM as stable is far too long
>> for our imminent release :).
>> >
>> > But this could be for a 7.1.0, and if you like I can surely give a hand
>> on this.
>> >
>> > Guille
>>
>>
>>

-- 



Guille Polito

Research Engineer

Centre de Recherche en Informatique, Signal et Automatique de Lille

CRIStAL - UMR 9189

French National Center for Scientific Research - http://www.cnrs.fr


Web: http://guillep.github.io

Phone: +33 06 52 70 66 13