[Pharo-dev] Character>>#leadingChar

Stéphane Ducasse stephane.ducasse at inria.fr
Mon Oct 21 12:30:46 EDT 2013


Hi guys

I would love to see some effort go into cleaning up this part (removing the leading char) and using Unicode throughout.
From what I understand, it would simplify a lot.
Who would like to think about a roadmap and share some effort?

Stef


On Oct 21, 2013, at 11:37 AM, Henrik Johansen <henrik.s.johansen at veloxit.no> wrote:

> As an added bonus, asInteger / asUnicode / codePoint / charCode / asciiValue would all share the same definition; ^value :)
> 
> Cheers,
> Henry
> 
> P.S. codePoint is currently bugged; it should be ^self asUnicode
> I'd hardly say that what it currently returns, a leadingChar-tagged value in potentially different character sets, meets the ANSI definition:
> "Return the encoding value of the receiver in the implementation defined execution character set."
> 
> 
> On Oct 21, 2013, at 11:18 , Henrik Johansen <henrik.s.johansen at veloxit.no> wrote:
> 
>> 
>> On Oct 18, 2013, at 6:34 , Sven Van Caekenberghe <sven at stfx.eu> wrote:
>> 
>>> Hi,
>>> 
>>> So once again we have an issue with Character>>#leadingChar, see
>>> 
>>> https://pharo.fogbugz.com/f/cases/6368
>>> 
>>> Do we really need this?
>>> Any Japanese, Chinese or Korean users willing to comment?
>>> 
>>> Thx,
>>> 
>>> Sven
>>> 
>> 
>> I'm not any of those, but my short answer would be no.
>> 
>> As for the long answer:
>> LeadingChar has too many responsibilities:
>> - Character set of string
>> - Font selection (see StrikeFontSet)
>> - Han unification disambiguation (through the above font selection)
>> 
>> The conflation of these, and the confusion over which of them leadingChar actually implies, easily leads to bugs, and has done so already (see Character >> asUnicode as opposed to JapaneseEnvironment >> fromJISX0208String: for example).
>> I would bet 100€ that StrikeFontSet no longer works as intended either, that is, being able to display glyphs beyond Latin-1 using StrikeFonts.
>> 
>> Now, here's why I feel those areas are not worth keeping, especially in their current, buggy state:
>> - Non-Unicode character sets
>> The main reasons for supporting this would be:
>> 1) Size reduction. All WideStrings use 32 bits per character anyway, so that's moot.
>> 2) No need to convert code points when using fonts stored with JISX0208 etc. code points. I've yet to see a free/TrueType font using anything but Unicode, and since we'd be the creators of any theoretical StrikeFontSet covering other languages, we'd be able to avoid it anyway.
>> 
>> If, in the future, it'd be desirable to support encodings other than Unicode for internal strings, I feel separate subclasses are a cleaner solution.
>> 
>> - Font selection / Han unification disambiguation
>> IMHO, obsoleted by the use of standard TrueType fonts. As long as one does not use StrikeFontSets to display a string, leadingChar currently has no benefit.
>> Yes, one could potentially select different FreeTypeFonts based on it when a run is encountered as well, but afaik the fonts themselves do not contain metadata about which variant of the glyphs they include, if they even support them. (Automatic fallback to another font when the current font doesn't cover a glyph would be a higher priority.)
>> Even in that case, it could be a property of the current locale instead. While that means you can't display both Korean and Japanese text correctly in the same image, it'd be an (imho) acceptable tradeoff.
>> 
>> Cheers,
>> Henry
>> 
> 
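
For readers unfamiliar with the tagging discussed above, here is a minimal sketch of how a leadingChar-tagged value differs from a plain Unicode code point. The bit layout (low 22 bits for the character code, the next 8 bits for leadingChar) is an assumption based on the Squeak-style implementation and is not stated in the thread. The function names and the tag value 5 are hypothetical and chosen only for illustration. Python is used purely as neutral notation.

```python
# Assumed layout (not stated in the thread): the low 22 bits hold the
# character code and the 8 bits above them hold the leadingChar tag.
CODE_BITS = 22
CODE_MASK = (1 << CODE_BITS) - 1      # 0x3FFFFF
LEADING_MASK = 0xFF << CODE_BITS      # 0x3FC00000


def pack(leading_char: int, char_code: int) -> int:
    """Combine a leadingChar tag and a character code into one tagged value."""
    if not 0 <= leading_char <= 0xFF:
        raise ValueError("leading_char must fit in 8 bits")
    if not 0 <= char_code <= CODE_MASK:
        raise ValueError("char_code must fit in 22 bits")
    return (leading_char << CODE_BITS) | char_code


def leading_char(value: int) -> int:
    """Extract the tag from a tagged value."""
    return (value & LEADING_MASK) >> CODE_BITS


def char_code(value: int) -> int:
    """Extract the character code, dropping the tag."""
    return value & CODE_MASK


# Hypothetical example: code 0x3042 carrying an arbitrary tag of 5.
tagged = pack(5, 0x3042)
untagged = pack(0, 0x3042)
```

With a non-zero tag, `tagged` is not equal to `0x3042`, so an accessor that returns the raw value cannot be read as a code point. With the tag removed, `untagged == char_code(untagged) == 0x3042`, which is the situation in which all the accessors Henrik lists could share the definition `^value`.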
