[Pharo-project] squeakToUTF-8 and related?

Nicolas Cellier nicolas.cellier.aka.nice at gmail.com
Mon Mar 29 12:50:05 EDT 2010


2010/3/29 Henrik Johansen <henrik.s.johansen at veloxit.no>:
>
> On Mar 29, 2010, at 2:00:09 PM, Nicolas Cellier wrote:
>
>> 2010/3/29 Henrik Johansen <henrik.s.johansen at veloxit.no>:
>>>
>>> On Mar 29, 2010, at 11:52:43 AM, Nicolas Cellier wrote:
>>>
>>>> 2010/3/29 Henrik Johansen <henrik.s.johansen at veloxit.no>:
>>>>>
>>>>> On Mar 29, 2010, at 11:16:30 AM, Nicolas Cellier wrote:
>>>>>
>>>>>> I presume that by "latin1" you mean code page 1252
>>>>>> rather than ISO 8859-1, right?
>>>>>>
>>>>>> Nicolas
>>>>> Good question :)
>>>>> What IS the presumed internal encoding of ByteStrings in Squeak?
>>>>> That's the one I meant; I merely assumed it was latin1, seeing as how the text converter refers to it as such.
>>>>> Personally I thought it was ISO 8859-1, seeing as the ByteString-to-Unicode conversion simply maps chars > 127 onto the 0080 - 00FF range.
>>>>>
>>>>> Cheers,
>>>>> Henry
>>>>>
>>>>
>>>> From what I understood, CP1252 is Microsoft's "latin1" and uses codes 128 to 159.
>>>> ISO 8859-1 matches the first 256 codes of Unicode (Latin-1) and has codes 128
>>>> to 159 unused.
>>>> You know, when Microsoft "uses" a standard, it's always a better standard ;)
>>>>
>>>> I have nothing against CP1252, it's an optimization which avoids
>>>> wasting 32 cheap codes.
>>>> But I'm not sure about various compatibility issues in/with the
>>>> external world...
>>>>
>>>> Squeak clearly uses CP1252.
>>>> For Pharo, there might be a mix of the two since the Sophie-like
>>>> refactorings. Surely that is what John was referring to.
>>>>
>>>> Nicolas
>>>
>>> Ummm...
>>> All the UTF-8 converters in Squeak use Unicode value:, which maps charCodes 128->255 directly to Unicode values 128->255.
>>> Unicode values 128->255 ARE ISO 8859-1, so if Squeak uses CP1252 as internal format, all the converters in Squeak are wrong.
>>>
>>> Cheers,
>>> Henry
>>>
>>
>> ISO 8859-1 and CP1252 only differ for code points 16r80 to 16r9F.
>> Contrary to what I said, these code points are assigned to C1
>> control characters (anyone ever used these ?).
>> See http://en.wikipedia.org/wiki/ISO_8859-1 and
>> http://en.wikipedia.org/wiki/Windows-1252
>
> Not to my knowledge :)
> The strong argument for using latin1 as internal charset for ByteString vs 1252 is the 1-1 mapping to unicode values.
>
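
The 1-1 mapping argument can be checked outside the image. Below is a minimal sketch in Python, not Squeak code; it assumes Python's standard "latin-1" and "cp1252" codecs are a fair stand-in for the two candidate ByteString encodings. The names latin1_is_identity and cp1252_diffs are made up for the illustration.

```python
# Sketch only: Python's standard codecs stand in for Squeak's converters.

# ISO 8859-1: every byte decodes to the code point with the same number.
latin1_is_identity = all(
    ord(bytes([b]).decode("latin-1")) == b for b in range(256)
)

# CP1252: collect the bytes whose code point differs from the byte value.
cp1252_diffs = {}
for b in range(256):
    try:
        ch = bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        continue  # Python's cp1252 codec leaves a few bytes undefined
    if ord(ch) != b:
        cp1252_diffs[b] = ch

print(latin1_is_identity)
print(sorted(hex(b) for b in cp1252_diffs))
```

If the internal encoding were CP1252, a converter that passes char codes straight through as Unicode values would get every byte listed in cp1252_diffs wrong, and only those.
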
>>
>> Now, I'm not so sure anymore why I thought squeak was CP1252. Is it ?
> Seems ambiguous.
>
>> My guess was probably based on macToSqueak and squeakToMac implementation.
>
> Yes, that does indeed do MacRoman -> 1252 transformation. As does MacRomanTextConverter, in Pharo as well...
> Converters assuming different internal encodings, fonts which render a charset different from both of them... Fun eh?
>
>> But the rendering of the following snippet isn't CP1252-compliant:
>>
>> String withAll: ((16r80 to: 16r9F) collect: [:e | Character value: e])
>> or
>> (16r80 to: 16r9F) collect: [:e | Character value: e] as: String
>> '•™≠∞≥∑∫Ω√≈…—‘Ÿ⁄∂∆Œ‚„‰ˆ˜˘˙˚˝˛ˇıƒ'
>>

I intentionally included the above string in the mail just for the fun of it...
My gmail/firefox browser originally displayed boxed control characters.
Now, in the same browser, I read back some math symbols in your answer...
... centered dot, trade mark, not equal, infinity, greater or equal,
summation etc...
At least, you can see that "conforming to external world rules" might
be pretty difficult.
I would add silly too :)
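
To make the ambiguity concrete, here is a small Python sketch (again not Squeak code, and not an attempt to reproduce exactly what the mail clients did to the string above). It assumes Python's "latin-1", "cp1252" and "mac_roman" codecs, and simply decodes the same 32 byte values 16r80 to 16r9F three ways.

```python
import unicodedata

raw = bytes(range(0x80, 0xA0))  # the 32 byte values 16r80..16r9F

# Same bytes, three interpretations.
as_latin1 = raw.decode("latin-1")                    # C1 control characters
as_cp1252 = raw.decode("cp1252", errors="replace")   # punctuation; undefined bytes become U+FFFD
as_macroman = raw.decode("mac_roman")                # accented letters

# Under ISO 8859-1 every one of them is a control character (category Cc).
all_controls = all(unicodedata.category(c) == "Cc" for c in as_latin1)

print(all_controls)
print(as_cp1252)
print(as_macroman)
```

Whatever displays the string has to pick one of these interpretations (or yet another one), which is why the same mail can show boxes in one place and symbols in another.
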


>> In Squeak 4.1 the different fonts don't agree on rendering these characters...
>> DefaultFixedTextStyle is still using MacRoman and displays accented characters.
>> DefaultTextStyle hacks the first 4 entries with caret, underscore, left arrow
> Yup, Bitmap DejaVu is latin15 (some characters different from latin1, amongst them the € ), with 4 extra entries as you mentioned.
>> and up arrow (probably a Cuis hack)
>> Accu* just seem to have a hack for left arrow
> Yeah, they seem to cover... a blend of latin1, latin15 (has euro symbol), and something else (square-root :D ). Wee.
>
> Render with a Unicode font, and you get nothing but []'s, which would be the correct latin1-rendering of said string.
>
> Which is why I said an encoding property for the StrikeFonts was needed, so you can do the proper conversion of internal string charcodes to the charcode values the font expects. (Or rather, bitmap offsets)
> This of course means you'd have to come up with a consistent definition of what the internal ByteString encoding in Squeak is first, though.
>
>
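
A rough sketch of what such an encoding property could buy, written in Python rather than Smalltalk. Everything here is hypothetical: glyph_offset and offsets_for are invented names, the font encoding is just a Python codec name, and nothing like this exists on StrikeFont as far as this thread shows. The idea is only that internal char codes go through Unicode and come out as the offsets the font expects.

```python
def glyph_offset(ch, font_encoding):
    """Return the single-byte offset of ch in a font laid out per
    font_encoding, or None if that encoding has no slot for ch."""
    try:
        encoded = ch.encode(font_encoding)
    except UnicodeEncodeError:
        return None
    return encoded[0] if len(encoded) == 1 else None


def offsets_for(internal_bytes, internal_encoding, font_encoding):
    """Map internal string bytes to font offsets via Unicode."""
    text = internal_bytes.decode(internal_encoding)
    return [glyph_offset(ch, font_encoding) for ch in text]


# An e-acute stored as latin1 byte 16rE9 lands at 16r8E in a MacRoman font.
print(offsets_for(b"\xe9", "latin-1", "mac_roman"))
# The euro sign has a slot in ISO 8859-15 but none in ISO 8859-1.
print(glyph_offset("€", "iso8859-15"), glyph_offset("€", "latin-1"))
```

Note that the first argument to decode is exactly the thing still undefined: the sketch only works once the internal encoding has been pinned down.
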
>> Maybe with a bit more clean-up (Character euro is answering the
>> MacRoman code for example,
> The keyboardinput handling in Squeak does strange things, at least on a Mac...
> Alt - § (which gives a euro symbol on my keyboard layout) is read as a WideChar with the correct unicode value on Pharo, but as Char 164 in Squeak.
> Alt- 5 (∞) does a similar thing, reads as correct widechar on Pharo, but on Squeak turns into char 129.
>> and taking macRoman conversions from
>> Sophie/Pharo), we could declare Squeak is using unicode...
>> Great !
>>
>> Nicolas
>
>
> That would be my dream as well.
> Or really, I'd settle for any unambiguous definition of what the ByteString encoding is.
> "A little more clean-up" may or may not be an understatement though, it would involve going through all the converters, all keyboard-input processing code (seems to be more stable in Pharo on mac), and all places where strings enter/leave the system. :)
>

I won't answer the following mail; Michael took care of that in Pharo :)
Let's do it in Squeak too.

Nicolas

> Cheers,
> Henry
>
>
> _______________________________________________
> Pharo-project mailing list
> Pharo-project at lists.gforge.inria.fr
> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project



