[Pharo-project] Fastest utf-8 encoder contest

Philippe Marschall philippe.marschall at netcetera.ch
Wed Jun 13 04:31:37 EDT 2012


On 06/13/2012 04:44 AM, Igor Stasenko wrote:
> Hi, hardcore hackers.
> please take a look at the code and tell if it can be improved.
> 
> The AsmJit snippet below transforms a Unicode integer value
> into a 1..4-byte utf-8 sequence
> 
> then the outer piece of code (which is not yet written) will
> accumulate the results of this snippet
> to do memory-aligned (4-byte) writes.
> That way, if 4 unicode characters can be encoded into 4 utf-8 bytes
> (which is mostly the case for the latin-1 char range), then there will be
> 4 memory reads (to read four 32-bit unicode values) but only a single
> memory write (to write four 8-bit utf-8 encoded values).
> 
> The idea is to make utf-8 encoding speed close to memory copying speed :)

In Seaside we use another trick that Andreas Raab came up with. The
assumption is that most strings are ASCII [1]. We use a CharSet /
bitmap to quickly scan the string for the index of the first non-ASCII
character. If we find none, we just answer the argument. No copying at all.
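
A rough illustration of that fast path, written in Python rather than
Smalltalk so it is easy to try out. This is not the Seaside code: the
function names are made up, and it assumes the input is a byte string
whose bytes are Latin-1 code points (roughly what a ByteString holds).

```python
def first_non_ascii(data: bytes) -> int:
    """Return the index of the first byte above 0x7F, or -1 if there is none."""
    for i, b in enumerate(data):
        if b > 0x7F:
            return i
    return -1


def latin1_to_utf8(data: bytes) -> bytes:
    """Answer the argument itself when it is pure ASCII, else a UTF-8 copy."""
    if first_non_ascii(data) < 0:
        # Pure ASCII is already valid UTF-8: no allocation, no copying.
        return data
    # Slow path (assumption: input bytes are Latin-1 code points).
    return data.decode("latin-1").encode("utf-8")
```

For an all-ASCII input the very same object comes back, which is the
whole point of the trick.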

In Seaside 3.1 we go one step further. Imagine you have a long
ByteString with only a few non-ASCII characters. We do not want to
copy the whole string just to utf-8 encode a few characters, so we
combine the above approach with #next:putAll:startingAt:. That way we only
have to encode and copy the non-ASCII characters; everything else goes
from the original string to the stream without an intermediate copy.
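
Sketched in Python (again only an illustration, not the Seaside 3.1
code): ASCII runs are handed to the stream as slices of the original
buffer, and only the non-ASCII bytes are encoded. The name write_utf8
is hypothetical, the memoryview slice stands in for
#next:putAll:startingAt:, and the input is assumed to hold Latin-1
code points, so every non-ASCII character becomes exactly two bytes.

```python
import io


def write_utf8(out, data: bytes) -> None:
    """Write Latin-1 bytes to `out` as UTF-8, encoding only non-ASCII bytes."""
    view = memoryview(data)  # slicing a memoryview does not copy
    start = 0                # start of the current pending ASCII run
    for i, b in enumerate(data):
        if b > 0x7F:
            if i > start:
                # Flush the ASCII run straight from the original buffer.
                out.write(view[start:i])
            # Code points 0x80..0xFF take two bytes in UTF-8.
            out.write(bytes((0xC0 | (b >> 6), 0x80 | (b & 0x3F))))
            start = i + 1
    if start < len(data):
        # Flush the trailing ASCII run, if any.
        out.write(view[start:])


out = io.BytesIO()
write_utf8(out, "na\u00efve caf\u00e9".encode("latin-1"))
```

A pure-ASCII string results in a single write of the whole buffer.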

I have become quite paranoid about allocation in Pharo. If I can remove
two or three #streamContents: sends, I can get about 100 to 200 req/s more.

 [1] This seems to be true for the rendering code in Seaside, since it
renders many small snippets. Even if you have several non-ASCII strings
on a page, each string is rendered individually, so all the rest will
still be ASCII.

Cheers
Philippe




