[Pharo-project] Fastest utf-8 encoder contest

Philippe Marschall philippe.marschall at netcetera.ch
Wed Jun 13 09:56:09 EDT 2012


On 06/13/2012 02:59 PM, Igor Stasenko wrote:
> On 13 June 2012 10:31, Philippe Marschall
> <philippe.marschall at netcetera.ch> wrote:
>> On 06/13/2012 04:44 AM, Igor Stasenko wrote:
>>> Hi, hardcore hackers.
>>> Please take a look at the code and tell me if it can be improved.
>>>
>>> The AsmJit snippet below transforms a Unicode integer value
>>> into a 1..4-byte UTF-8 sequence.
>>>
>>> Then the outer piece of code (which is not yet written) will
>>> accumulate the results of this snippet
>>> to do memory-aligned (4-byte) writes.
>>> That way, if 4 Unicode characters can be encoded into 4 UTF-8 bytes
>>> (which is mostly the case for the Latin-1 character range), there will be
>>> 4 memory reads (to read four 32-bit Unicode values) but only a single
>>> memory write (to write four 8-bit UTF-8 encoded values).
>>>
>>> The idea is to make UTF-8 encoding speed close to memory copying speed :)
>>
>> In Seaside we use another trick that Andreas Raab came up with. The
>> assumption is that most strings are ASCII [1]. We use a CharSet /
>> bitmap to quickly scan the string for the index of the first non-ASCII
>> character. If we find none, we just answer the argument. No copying at all.
>>
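
For illustration, the ASCII fast path described above can be sketched in C. This is not the Seaside code: it swaps the CharSet / bitmap lookup for a plain byte loop, and the names first_non_ascii and is_ascii are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: answer the index of the first byte with the high
   bit set (i.e. the first non-ASCII byte), or len if there is none. */
static size_t first_non_ascii(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (s[i] & 0x80) {
            return i;
        }
    }
    return len;
}

/* Non-zero when the whole input is ASCII. In that case the bytes are
   already valid UTF-8 and the caller can hand back the input unchanged,
   without copying anything. */
static int is_ascii(const unsigned char *s, size_t len)
{
    return first_non_ascii(s, len) == len;
}
```

When is_ascii answers false, the index from first_non_ascii marks where a real encoder would have to take over; everything before it could be copied verbatim.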
> 
> Well, in my case I will need copying because I need to null-terminate it,
> to represent it as a null-terminated string.
> This is what the cairo library expects as input for rendering text.
> And this also means that I can use a single buffer for conversions to
> avoid generating garbage, i.e.
> I take the input string, convert it to UTF-8 in a private buffer, then pass
> that buffer as input to the external call;
> on the next call the input can be any other string, but the output will be the
> same private buffer.
> I will need to allocate a new buffer if the incoming string does not
> fit into it.

I see, different use case.
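
For reference, a minimal sketch in C of the reusable-buffer approach described in the quoted text, not the AsmJit code under discussion. The type Utf8Buffer and the function utf8_convert are hypothetical names, and the input is assumed to be an array of valid 32-bit code points (no surrogates, nothing above 0x10FFFF).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical private buffer that is reused across conversions. */
typedef struct {
    char  *data;   /* NUL-terminated UTF-8 output of the last conversion */
    size_t cap;    /* allocated size in bytes */
} Utf8Buffer;

/* Encode n code points from src into buf and NUL-terminate the result.
   The buffer grows only when the worst case (4 bytes per code point plus
   the terminator) does not fit. Returns buf->data, or NULL on failure. */
static const char *utf8_convert(Utf8Buffer *buf, const uint32_t *src, size_t n)
{
    if (n > (SIZE_MAX - 1) / 4) {
        return NULL;                       /* size computation would overflow */
    }
    size_t need = 4 * n + 1;
    if (need > buf->cap) {
        char *p = realloc(buf->data, need);
        if (p == NULL) {
            return NULL;                   /* old buffer is left intact */
        }
        buf->data = p;
        buf->cap = need;
    }

    unsigned char *out = (unsigned char *)buf->data;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = src[i];              /* assumed to be a valid code point */
        if (cp < 0x80) {
            *out++ = (unsigned char)cp;
        } else if (cp < 0x800) {
            *out++ = (unsigned char)(0xC0 | (cp >> 6));
            *out++ = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            *out++ = (unsigned char)(0xE0 | (cp >> 12));
            *out++ = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            *out++ = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            *out++ = (unsigned char)(0xF0 | (cp >> 18));
            *out++ = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            *out++ = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            *out++ = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    *out = '\0';                           /* terminator for the C API */
    return buf->data;
}
```

Starting from a zero-initialized Utf8Buffer, each call overwrites the previous result, so the returned pointer is only valid until the next conversion. It could be passed straight to a function taking a NUL-terminated UTF-8 string, such as cairo_show_text. Note that a code point of 0 in the input would end the string early from the callee's point of view.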

Cheers
Philippe




