[Pharo-project] Fastest utf-8 encoder contest

Igor Stasenko siguctua at gmail.com
Thu Jun 14 11:01:31 EDT 2012

On 14 June 2012 14:05, Henrik Sperre Johansen
<henrik.s.johansen at veloxit.no> wrote:
> On 13.06.2012 14:59, Igor Stasenko wrote:
>> On 13 June 2012 10:31, Philippe Marschall
>> <philippe.marschall at netcetera.ch>  wrote:
>>> On 06/13/2012 04:44 AM, Igor Stasenko wrote:
>>>> Hi, hardcore hackers.
>>>> Please take a look at the code and tell me if it can be improved.
>>>> The AsmJit snippet below transforms a Unicode integer value
>>>> into a 1..4-byte utf-8 sequence;
>>>> then the outer piece of code (which is not yet written) will
>>>> accumulate the results of this snippet
>>>> to do memory-aligned (4-byte) writes.
>>>> That way, if 4 Unicode characters can be encoded into 4 utf-8 bytes
>>>> (which is mostly the case for the Latin-1 char range), there will be
>>>> 4 memory reads (to read four 32-bit Unicode values) but only a single
>>>> memory write (to write four 8-bit utf-8 encoded values).
>>>> The idea is to bring utf-8 encoding speed close to memory-copying
>>>> speed :)
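(The AsmJit snippet itself is not included in the mail. As a reference for what that 1..4-byte transform computes, here is a plain C sketch of the standard UTF-8 encoding rules; the function name `utf8_encode` is illustrative, not Igor's code:)

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as 1..4 UTF-8 bytes into `out`.
   Returns the number of bytes written, or 0 for an invalid code point. */
size_t utf8_encode(uint32_t cp, uint8_t *out)
{
    if (cp < 0x80) {                          /* 1 byte: 0xxxxxxx */
        out[0] = (uint8_t)cp;
        return 1;
    } else if (cp < 0x800) {                  /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {                /* 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                         /* surrogates are not valid UTF-8 */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp <= 0x10FFFF) {              /* 4 bytes */
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

(For the Latin-1 range below 0x80 this emits exactly one byte per character, which is what makes the four-reads-one-write packing described above possible.)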
>>> In Seaside we use another trick that Andreas Raab came up with. The
>>> assumption is that most strings are ASCII [1]. We use a CharSet /
>>> bitmap to quickly scan the string for the index of the first non-ASCII
>>> character. If we find none, we just answer the argument. No copying at
>>> all.
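(The fast path Philippe describes can be sketched in C; the name `first_non_ascii` is illustrative, not Seaside's actual code:)

```c
#include <stddef.h>

/* Return the index of the first non-ASCII (>= 0x80) character,
   or `len` if the whole string is ASCII. An all-ASCII string is
   already valid UTF-8, so the caller can answer it unchanged. */
size_t first_non_ascii(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 0x80)
            return i;
    return len;
}
```

(Seaside uses a CharSet/bitmap membership test rather than a plain comparison, but the shape of the scan is the same: one cheap pass, and in the common all-ASCII case no encoding work at all.)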
>> Well, in my case I will need copying because I need to null-terminate it,
>> to represent it as a null-terminated string.
>> This is what the cairo library expects as input for rendering text.
>> It also means that I can use a single buffer for conversions to
>> avoid generating garbage, i.e.
>> I take an input string, convert it to utf8 in a private buffer, then pass
>> that buffer as input to the external call;
>> on the next call the input can be any other string, but the output will be
>> the same private buffer.
>> I will need to allocate a new buffer only if the incoming string does not
>> fit into it.
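(A minimal C sketch of that reuse scheme, sizing for the worst case of 4 utf-8 bytes per code point plus the terminating NUL; all names here are hypothetical, not Igor's actual code:)

```c
#include <stdint.h>
#include <stdlib.h>

/* Private conversion buffer, reused across calls to avoid generating
   garbage; reallocated only when the incoming string does not fit. */
static uint8_t *conv_buf = NULL;
static size_t   conv_cap = 0;

/* Encode `n` 32-bit code points as a NUL-terminated UTF-8 string in the
   shared buffer and return it. The result is only valid until the next
   call, which is fine when it is immediately passed to an external
   function such as a cairo text call. */
const char *to_utf8_shared(const uint32_t *cps, size_t n)
{
    size_t need = n * 4 + 1;          /* worst case: 4 bytes each, plus NUL */
    if (need > conv_cap) {
        uint8_t *p = realloc(conv_buf, need);
        if (p == NULL)
            return NULL;
        conv_buf = p;
        conv_cap = need;
    }
    uint8_t *out = conv_buf;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = cps[i];
        if (cp < 0x80) {
            *out++ = (uint8_t)cp;
        } else if (cp < 0x800) {
            *out++ = (uint8_t)(0xC0 | (cp >> 6));
            *out++ = (uint8_t)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            *out++ = (uint8_t)(0xE0 | (cp >> 12));
            *out++ = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            *out++ = (uint8_t)(0x80 | (cp & 0x3F));
        } else {
            *out++ = (uint8_t)(0xF0 | (cp >> 18));
            *out++ = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
            *out++ = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            *out++ = (uint8_t)(0x80 | (cp & 0x3F));
        }
    }
    *out = '\0';                      /* cairo expects NUL-terminated input */
    return (const char *)conv_buf;
}
```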
> So, is this a one-off for Cairo, or are you planning on introducing a
> <var: #myStringParameter encoding: #utfXX>
> pragma or something to NBFFI?

Well, there is a utf-8 encoder in the examples, see NBUTF8StringExample,
which converts a WideString to a utf-8 encoded string on the stack,
and then that string is passed as an argument to the external function.
The drawback of this approach is that the result of the conversion is not
accessible from the language side;
it is visible only to the external function, and once the call is made, it is gone.

> Because, on Windows, system calls expect either the local system code page
> or utf16 (depending on which version of the API you use), while in other
> libraries the expected encoding may be different from that of the system
> libraries.

Yes, but those would require many different conversion algorithms. I am
not interested in making a general-purpose (our world) -> all-possible-encodings
converter right now.
I just want to make a fast utf-8 encoder for the things I am working on :)

> Heck, the same lib may expect different encodings, case in point:
> The expected encoding of the Oracle client lib (and by extension DBXTalk*)
> depends on either an environment variable, what you told it to expect using
> an API function, or one of many other fallbacks I can't remember at the
> moment.
> Cheers,
> Henry
> *<rant> OpenDBX makes an explicit choice to simply pass the data you hand
> it over raw to whatever backend you have hooked up. Very nice of a library
> that does a good job of abstracting the actual API of communicating with a
> database to leave it up to you to decide which encoding to use (which
> depends on the abstracted away backend...) if you want your data stored
> correctly... There is a reason those benchmarks are fast! :) </rant>

Best regards,
Igor Stasenko.
