[Pharo-project] Collection >> collectDistinct: ??

Levente Uzonyi leves at elte.hu
Wed Mar 10 21:33:15 EST 2010


On Wed, 10 Mar 2010, Randal L. Schwartz wrote:

>>>>>> "Randal" == Randal L Schwartz <merlyn at stonehenge.com> writes:
>
> Randal> I can envision collection types where collect: can create something
> Randal> that asSet can more easily turn into a set in bulk than adding each
> Randal> element individually like this.

Keys of dictionaries, but it's hackish.

>
> In particular, turning an array of 10,000 elements into a set
> can start with a particular hashing size, whereas starting with
> an empty set and adding 10,000 elements will likely involve rehasing
> a number of times.  If 10,000 isn't troubling enough, make that
> a million. :)
>

Lets see what we've got (in Squeak):

Smalltalk garbageCollect.
[ (1 to: 1000000) collect: #yourself as: Set ] timeToRun. "==> 2651."

Smalltalk garbageCollect.
[ (1 to: 1000000) asSet ] timeToRun. "==>  2522".

| s |
s := Set new.
Smalltalk garbageCollect.
[ 1 to: 1000000 do: [ :each | s add: each ] ] timeToRun. "==> 1192"

| s |
s := Set new: 1000000.
Smalltalk garbageCollect.
[ 1 to: 1000000 do: [ :each | s add: each ] ] timeToRun. "==> 472"

Note that adding numbers from 1 to n has the lowest cost for addition, 
since there will be no hash collisions at all.


So try it with randomized data:

a := (1 to: 1000000) asArray shuffled.

Smalltalk garbageCollect.
[ a collect: #yourself as: Set ] timeToRun. "==> 4143"

Smalltalk garbageCollect.
[ a asSet ] timeToRun. "==> 3995".

| s |
s := Set new.
Smalltalk garbageCollect.
[ s addAll: a ] timeToRun. "==> 157304 oops weak hashes".

| s |
s := Set new: 1000000.
Smalltalk garbageCollect.
[ s addAll: a ] timeToRun. "==> 636".


With better hashes:

a := Set new: 1000000.
[ a size < 1000000 ] whileTrue: [
 	a add: SmallInteger maxVal atRandom ].
a := a asArray.

Smalltalk garbageCollect.
[ a collect: #yourself as: Set ] timeToRun. "==>  3895"

Smalltalk garbageCollect.
[ a asSet ] timeToRun. "==>  3777".

| s |
s := Set new.
Smalltalk garbageCollect.
[ s addAll: a ] timeToRun. "==> 4729 much better".

| s |
s := Set new: 1000000.
Smalltalk garbageCollect.
[ s addAll: a ] timeToRun. "==> 774".


Conclusions:
- #collect:as: is (almost) as fast as #asSet
- better hashes can make a huge difference
- message sends cost a lot compared to bytecodes
- if you can guess the size of your collection, do it (applies for all 
growing collections and collection streaming)
- only optimize if you have to in your own code


Levente

> -- 
> Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
> <merlyn at stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
> Smalltalk/Perl/Unix consulting, Technical writing, Comedy, etc. etc.
> See http://methodsandmessages.vox.com/ for Smalltalk and Seaside discussion
>
> _______________________________________________
> Pharo-project mailing list
> Pharo-project at lists.gforge.inria.fr
> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
>




More information about the Pharo-dev mailing list