[Pharo-project] Fastest utf-8 encoder contest

Igor Stasenko siguctua at gmail.com
Wed Jun 13 08:59:13 EDT 2012

On 13 June 2012 10:31, Philippe Marschall
<philippe.marschall at netcetera.ch> wrote:
> On 06/13/2012 04:44 AM, Igor Stasenko wrote:
>> Hi, hardcore hackers.
>> please take a look at the code and tell me if it can be improved.
>> The AsmJit snippet below transforms a unicode integer value
>> into a 1..4-byte utf-8 sequence.
>> Then the outer piece of code (which is not yet written) will
>> accumulate the results of this snippet
>> to do memory-aligned (4-byte) writes.
>> That way, if 4 unicode characters can be encoded into 4 utf-8 bytes
>> (which is mostly the case for the latin-1 char range), then there will be
>> 4 memory reads (to read four 32-bit unicode values) but only a single
>> memory write (to write four 8-bit utf-8 encoded values).
>> The idea is to make utf-8 encoding speed close to memory copying speed :)
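For readers following along without the AsmJit snippet: a minimal Python sketch of the same idea, not the code from this thread. The function names are made up for illustration, and a little-endian word layout is assumed. Note that only code points below 0x80 encode to a single byte, so the four-into-one-write case applies to those.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as a 1..4-byte UTF-8 sequence."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %r" % (cp,))
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def pack_four_single_byte(code_points):
    """Pack four single-byte encodings into one 32-bit word.

    Hypothetical helper: stands in for the "four reads, one aligned
    write" step; little-endian byte order is an assumption.
    """
    encoded = b"".join(utf8_encode_code_point(cp) for cp in code_points)
    if len(code_points) != 4 or len(encoded) != 4:
        raise ValueError("needs exactly four code points below 0x80")
    return int.from_bytes(encoded, "little")
```

The real code would keep the partial word in a register and flush it whenever four bytes have accumulated, whatever mix of sequence lengths produced them.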
> In Seaside we use another trick that Andreas Raab came up with. The
> assumption is that most of the strings are ASCII [1]. We use a CharSet /
> bitmap to quickly scan the string for the index of the first non-ASCII
> character. If we find none we just answer the argument. No copying at all.
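A rough Python sketch of that fast path, for illustration only. It is not the Seaside code: a plain byte loop replaces the CharSet / bitmap scan, a `bytes` object holding latin-1 characters stands in for a ByteString, and both function names are hypothetical.

```python
def first_non_ascii_index(data: bytes) -> int:
    """Return the index of the first byte >= 0x80, or -1 if all are ASCII."""
    for i, b in enumerate(data):
        if b >= 0x80:
            return i
    return -1


def latin1_to_utf8(data: bytes) -> bytes:
    """Answer the argument itself when it is pure ASCII, else a converted copy."""
    if first_non_ascii_index(data) < 0:
        return data  # same object, nothing copied
    return data.decode("latin-1").encode("utf-8")
```

For a pure ASCII argument the result is the very same object that was passed in, which is the "no copying at all" case.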

Well, in my case I will need copying, because I need to represent the
result as a null-terminated string.
This is what the cairo library expects as input for rendering text.
And this also means that I can use a single buffer for conversions to
avoid generating garbage, i.e.
I take the input string, convert it to utf-8 in a private buffer, then pass
that buffer as input to the external call.
On the next call the input can be any other string, but the output will be
the same private buffer.
I will need to allocate a new buffer if the incoming string does not
fit into it.
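The buffer scheme just described could look roughly like this in Python. It is a sketch under assumptions: the class name `Utf8Scratch`, the initial capacity, and the worst-case sizing rule (4 bytes per character plus the terminator) are all invented here, and lone surrogates are not rejected.

```python
class Utf8Scratch:
    """Private, reusable output buffer for NUL-terminated utf-8."""

    def __init__(self, capacity: int = 64) -> None:
        self.buf = bytearray(capacity)

    def encode(self, text: str) -> int:
        """Write text plus a trailing NUL into self.buf.

        Returns the encoded length in bytes, not counting the NUL.
        """
        needed = 4 * len(text) + 1  # worst case, assumed sizing rule
        if needed > len(self.buf):
            self.buf = bytearray(needed)  # grow only when it does not fit
        buf = self.buf
        n = 0
        for ch in text:
            cp = ord(ch)
            if cp < 0x80:
                buf[n] = cp
                n += 1
            elif cp < 0x800:
                buf[n] = 0xC0 | (cp >> 6)
                buf[n + 1] = 0x80 | (cp & 0x3F)
                n += 2
            elif cp < 0x10000:
                buf[n] = 0xE0 | (cp >> 12)
                buf[n + 1] = 0x80 | ((cp >> 6) & 0x3F)
                buf[n + 2] = 0x80 | (cp & 0x3F)
                n += 3
            else:
                buf[n] = 0xF0 | (cp >> 18)
                buf[n + 1] = 0x80 | ((cp >> 12) & 0x3F)
                buf[n + 2] = 0x80 | ((cp >> 6) & 0x3F)
                buf[n + 3] = 0x80 | (cp & 0x3F)
                n += 4
        buf[n] = 0  # terminator expected by the external call
        return n
```

Successive calls overwrite the same `buf`, so the previous result is only valid until the next `encode`.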

> In Seaside 3.1 we go one step further. Imagine you have a long
> ByteString and only a few non-ASCII characters. We do not want to have to
> copy the whole string just to utf-8 encode a few characters, so we
> combine the above approach with #next:putAll:startingAt: so that we only
> have to encode and copy the non-ASCII characters, everything else is not
> copied.
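A hedged Python sketch of the run-based approach, not the Seaside 3.1 implementation. A latin-1 `bytes` input stands in for a ByteString, `memoryview` slices stand in for #next:putAll:startingAt:, and the function name is hypothetical. ASCII runs are written to the stream straight from the source without building an intermediate substring, and only the non-ASCII characters are encoded.

```python
import io


def write_latin1_as_utf8(data: bytes, out) -> None:
    """Write latin-1 bytes to a binary stream as utf-8, run by run."""
    view = memoryview(data)
    start = 0  # start of the current ASCII run
    for i, b in enumerate(data):
        if b >= 0x80:
            if i > start:
                out.write(view[start:i])  # ASCII run, passed through as is
            # latin-1 code points 0x80..0xFF always need two utf-8 bytes
            out.write(bytes([0xC0 | (b >> 6), 0x80 | (b & 0x3F)]))
            start = i + 1
    if start < len(data):
        out.write(view[start:])  # trailing ASCII run


out = io.BytesIO()
write_latin1_as_utf8(b"na\xefve caf\xe9", out)
```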
> I have become quite paranoid about allocation in Pharo. If I can remove
> two or three #streamContents: calls, I can get about 100 to 200 req/s more.
> [1] This seems to be true for the rendering code in Seaside, since it
> renders many small snippets. Even if you have several non-ASCII strings
> on a page, each string is rendered individually, so all the rest will
> still be ASCII.
> Cheers
> Philippe

Best regards,
Igor Stasenko.
