Fun with CleanBlocks or: Speeding up Dictionary>>#at:

MD
Marcus Denker
Fri, May 19, 2023 11:01 AM

Dictionary lookup is used a lot. Let's look at Dictionary>>#at:

at: key
"Answer the value associated with the key."
^ self at: key ifAbsent: [self errorKeyNotFound: key]

The method looks simple and at a first glance, there are no obvious problems.

But if we look at it again, we see that for every execution of #at:, the block [self errorKeyNotFound: key]
has to be created. The method stores a CompiledBlock in the literals, and a bytecode "create block" creates
the block:

(Dictionary>>#at:) symbolic 

"'41 <4C> self
42 <40> pushTemp: 0
43 <40> pushTemp: 0
44 <F9 00 01> fullClosure:a CompiledBlock: [self errorKeyNotFound: key] NumCopied: 1
47 <A1> send: at:ifAbsent:
48 <5C> returnTop'"

This happens at every execution of #at:, even though the block that we spend time to create will never
be executed at runtime (outside of real errors).

If you look at the rest of the code path of #at: in a Dictionary, it is carefully written to avoid block creation
by using optimized contructs (ifTrue: and friends).

Could clean blocks help? Clean blocks are blocks that only use data that the compiler knows at compile time,
thus we can create a CleanBlockClosure (which has the CompiledBlock) and store that as a Literal.

This then would mean that we could just use as fast pushLiteral: to push the block, no creation at runtime needed.

Clean blocks are not yet enabled by default, but we can use a compiler option pragma to enable them just for this
method:

	<compilerOptions: #(+ optionCleanBlockClosure)>

The problem is that the block references both "self" and the key to be able to show a nice error message.
Referencing either one make the block not clean.

Could we change the block to be clean? I guess we could not refer self and inline the method #errorKeyNotFound:
(knowing the instance is not that important for logging the error in production, for example). But we do want
to know the key that is not found, it really simplifies debugging especially if you have to look at log files.

If we closely look at the block: we know a bit more about it and how it is used. We know that if it gets evaluated,
that evaluation will always happen with the homeContext of the block on the stack, as it is will always be #at:ifAbsent:
that evaluates that block.

And in that case, there is a trick that we can do: we can use reflection to read the needed values via the stack.

You might not be aware, but the debugger needs some quite interesting features from the reflective layer of the system to provide
the user experience that you take for granted (and that feels trivial). Imagine a block like that:

tt

	| temp1 temp2 |
	temp1 := 1.
	temp2 := 2.

	self class methods do: [ :each | self halt. each with: temp1 ].
	^ temp2

The block does not reference temp2, so it actually is created as a block that does not know temp2 at all. temp2 is not accessibe from the block or it's context.
Yet, when debugging, you want to just be able to write temp2 in the block and eval it (without saving the method), as if you would access
the temp in this block and recompile, the compiler would compile a different block that would know temp2.

So the reflective API of Context (and the infrastructure of reading Variables reflectively), is build in a way that #readVariableNamed: will
use the stack to find the value of temps that are not available in the block context, but could be available if they would be referenced
statically.

And if you think about it, we have just a case like that here, in reverse: if we would use #readVariableNamed: on the context to read the argument,
the compiler would not see a read of "key", thus compile a block that does not have that temp available. And if we then read "self" via thisContext, too,
the compiler sees a block that it can compile as a clean block:

at: key
"Answer the value associated with the key."

<compilerOptions: #(+ optionCleanBlockClosure)>

^ self at: key ifAbsent: [

	"this block is never executed, yet we pay runtime cost to create it if it is a full block.
	We use instead the reflective API to read the argument and receiver via the stack, making
	the block clean.

	We enable cleanBlocks for this method by setting optionCleanBlockClosure as it is not yet
	enabled by default"
	
	KeyNotFound signalFor: (thisContext readVariableNamed: #key) in: thisContext sender receiver
	"This is equivalent to:
		self errorKeyNotFound: key"
	]
	
	

Let's look at the printed symbolic bytecode:

Dictionary>>#at:) symbolic  

"'41 <4C> self
42 <40> pushTemp: 0
43 <20> pushConstant: [
	
		"this block is never executed, yet we pay runtime cost to create it if it is a full block.
		We use instead the reflective API to read the argument and receiver via the stack, making
		the block clean.
	
		We enable cleanBlocks for this method by setting optionCleanBlockClosure as it is not yet
		enabled by default"
		
		KeyNotFound signalFor: (thisContext readVariableNamed: #key) in: thisContext sender receiver
		"This is equivalent to:
			self errorKeyNotFound: key"
		]
44 <A1> send: at:ifAbsent:
45 <5C> returnTop'"

you see that it now uses pushConstant: to push the pre-compiled CleanBlockClosure, which is much faster.

A very naive benchmark using a fairly large Dictionary (Smalltalk globals):

[Smalltalk globals at: #Object] bench 

8591996.801/7040335.266 "1.220395972119888"

Shows a speedup of ~20%, which is qute a lot!

Of course, this is a hack. For one, it just solves this for this one case. And we do not want to add code like
that everywhere. And real world impact of speed will be quite limited due to that (it just speeds up
Dictionary>#at:)

The alternative would be to inline #at:ifAbsent: This inlining, due to the existing subclasses, leads to quite some
needed refactorings, and has the same problem of being a solution for this problem in this one place.

The real solution in the long term is a JIT that removes blocks like that by inlining automatically for the code it executes,
while not changing the image level code at all. This would then solve this for all similar cases everywhere all at once.

But the nice aspect of the hack is: it is local (touches just this one method), gives us some speedup now and will be
trivially removable when we deploy a better solution in the future.

But the code is quite ugly and indeed relies on reflection... which maybe for Dictionary>>#at: is not that nice
(e.g. if you want to create a minimal image without reflection)

But it is save to add this to your own image, it works in Pharo11 and Pharo12 and I guess even in Pharo10
(not tested there).

Marcus
Dictionary lookup is used a lot. Let's look at Dictionary>>#at: at: key "Answer the value associated with the key." ^ self at: key ifAbsent: [self errorKeyNotFound: key] The method looks simple and at a first glance, there are no obvious problems. But if we look at it again, we see that for every execution of #at:, the block [self errorKeyNotFound: key] has to be created. The method stores a CompiledBlock in the literals, and a bytecode "create block" creates the block: ``` (Dictionary>>#at:) symbolic "'41 <4C> self 42 <40> pushTemp: 0 43 <40> pushTemp: 0 44 <F9 00 01> fullClosure:a CompiledBlock: [self errorKeyNotFound: key] NumCopied: 1 47 <A1> send: at:ifAbsent: 48 <5C> returnTop'" ``` This happens at *every* execution of #at:, even though the block that we spend time to create will never be executed at runtime (outside of real errors). If you look at the rest of the code path of #at: in a Dictionary, it is carefully written to avoid block creation by using optimized contructs (ifTrue: and friends). Could clean blocks help? Clean blocks are blocks that only use data that the compiler knows at compile time, thus we can create a CleanBlockClosure (which has the CompiledBlock) and store *that* as a Literal. This then would mean that we could just use as fast pushLiteral: to push the block, no creation at runtime needed. Clean blocks are not yet enabled by default, but we can use a compiler option pragma to enable them just for this method: ``` <compilerOptions: #(+ optionCleanBlockClosure)> ``` The problem is that the block references both "self" and the key to be able to show a nice error message. Referencing either one make the block not clean. Could we change the block to be clean? I guess we could not refer self and inline the method #errorKeyNotFound: (knowing the instance is not that important for logging the error in production, for example). But we do want to know the key that is not found, it really simplifies debugging especially if you have to look at log files. If we closely look at the block: we know a bit more about it and how it is used. We know that if it gets evaluated, that evaluation will *always* happen with the homeContext of the block on the stack, as it is will always be #at:ifAbsent: that evaluates that block. And in that case, there is a trick that we can do: we can use reflection to read the needed values via the stack. You might not be aware, but the debugger needs some quite interesting features from the reflective layer of the system to provide the user experience that you take for granted (and that feels trivial). Imagine a block like that: ``` tt | temp1 temp2 | temp1 := 1. temp2 := 2. self class methods do: [ :each | self halt. each with: temp1 ]. ^ temp2 ``` The block does not reference temp2, so it actually is created as a block that does not know temp2 at all. temp2 is not accessibe from the block or it's context. Yet, when debugging, you want to just be able to write temp2 in the block and eval it (without saving the method), as *if* you would access the temp in this block and recompile, the compiler would compile a different block that would know temp2. So the reflective API of Context (and the infrastructure of reading Variables reflectively), is build in a way that #readVariableNamed: will use the stack to find the value of temps that are not available in the block context, but could be available if they would be referenced statically. And if you think about it, we have just a case like that here, in reverse: if we would use #readVariableNamed: on the context to read the argument, the compiler would not see a read of "key", thus compile a block that does not have that temp available. And if we then read "self" via thisContext, too, the compiler sees a block that it can compile as a clean block: at: key "Answer the value associated with the key." <compilerOptions: #(+ optionCleanBlockClosure)> ^ self at: key ifAbsent: [ "this block is never executed, yet we pay runtime cost to create it if it is a full block. We use instead the reflective API to read the argument and receiver via the stack, making the block clean. We enable cleanBlocks for this method by setting optionCleanBlockClosure as it is not yet enabled by default" KeyNotFound signalFor: (thisContext readVariableNamed: #key) in: thisContext sender receiver "This is equivalent to: self errorKeyNotFound: key" ] Let's look at the printed symbolic bytecode: ``` Dictionary>>#at:) symbolic "'41 <4C> self 42 <40> pushTemp: 0 43 <20> pushConstant: [ "this block is never executed, yet we pay runtime cost to create it if it is a full block. We use instead the reflective API to read the argument and receiver via the stack, making the block clean. We enable cleanBlocks for this method by setting optionCleanBlockClosure as it is not yet enabled by default" KeyNotFound signalFor: (thisContext readVariableNamed: #key) in: thisContext sender receiver "This is equivalent to: self errorKeyNotFound: key" ] 44 <A1> send: at:ifAbsent: 45 <5C> returnTop'" ``` you see that it now uses pushConstant: to push the pre-compiled CleanBlockClosure, which is *much* faster. A very naive benchmark using a fairly large Dictionary (Smalltalk globals): ``` [Smalltalk globals at: #Object] bench 8591996.801/7040335.266 "1.220395972119888" ``` Shows a speedup of ~20%, which is qute a lot! Of course, this is a hack. For one, it just solves this for this one case. And we do not want to add code like that everywhere. And real world impact of speed will be quite limited due to that (it just speeds up Dictionary>#at:) The alternative would be to inline #at:ifAbsent: This inlining, due to the existing subclasses, leads to quite some needed refactorings, and has the same problem of being a solution for this problem in this one place. The real solution in the long term is a JIT that removes blocks like that by inlining automatically for the code it executes, while not changing the image level code at all. This would then solve this for all similar cases everywhere all at once. But the nice aspect of the hack is: it is local (touches just this one method), gives us some speedup now and will be trivially removable when we deploy a better solution in the future. But the code is quite ugly and indeed relies on reflection... which maybe for Dictionary>>#at: is not that nice (e.g. if you want to create a minimal image without reflection) But it is save to add this to your own image, it works in Pharo11 and Pharo12 and I guess even in Pharo10 (not tested there). Marcus
TM
Tim Mackinnon
Fri, May 19, 2023 11:22 AM

I love these little discussions/examples that you share... things you take for granted have subtle impacts that are worth reasoning about. Whenever I read them, it definitely improves my understanding and makes me a better programmer. Thanks Markus.

Tim

On Fri, 19 May 2023, at 12:01 PM, Marcus Denker wrote:

Dictionary lookup is used a lot. Let's look at Dictionary>>#at:

at: key
"Answer the value associated with the key."
^ self at: key ifAbsent: [self errorKeyNotFound: key]

The method looks simple and at a first glance, there are no obvious problems.

But if we look at it again, we see that for every execution of #at:,
the block [self errorKeyNotFound: key]
has to be created. The method stores a CompiledBlock in the literals,
and a bytecode "create block" creates
the block:

(Dictionary>>#at:) symbolic 

"'41 <4C> self
42 <40> pushTemp: 0
43 <40> pushTemp: 0
44 <F9 00 01> fullClosure:a CompiledBlock: [self errorKeyNotFound: key] 
NumCopied: 1
47 <A1> send: at:ifAbsent:
48 <5C> returnTop'"

This happens at every execution of #at:, even though the block that
we spend time to create will never
be executed at runtime (outside of real errors).

If you look at the rest of the code path of #at: in a Dictionary, it is
carefully written to avoid block creation
by using optimized contructs (ifTrue: and friends).

Could clean blocks help? Clean blocks are blocks that only use data
that the compiler knows at compile time,
thus we can create a CleanBlockClosure (which has the CompiledBlock)
and store that as a Literal.

This then would mean that we could just use as fast pushLiteral: to
push the block, no creation at runtime needed.

Clean blocks are not yet enabled by default, but we can use a compiler
option pragma to enable them just for this
method:

	<compilerOptions: #(+ optionCleanBlockClosure)>

The problem is that the block references both "self" and the key to be
able to show a nice error message.
Referencing either one make the block not clean.

Could we change the block to be clean? I guess we could not refer self
and inline the method #errorKeyNotFound:
(knowing the instance is not that important for logging the error in
production, for example). But we do want
to know the key that is not found, it really simplifies debugging
especially if you have to look at log files.

If we closely look at the block: we know a bit more about it and how it
is used. We know that if it gets evaluated,
that evaluation will always happen with the homeContext of the block
on the stack, as it is will always be #at:ifAbsent:
that evaluates that block.

And in that case, there is a trick that we can do: we can use
reflection to read the needed values via the stack.

You might not be aware, but the debugger needs some quite interesting
features from the reflective layer of the system to provide
the user experience that you take for granted (and that feels trivial).
Imagine a block like that:

tt

	| temp1 temp2 |
	temp1 := 1.
	temp2 := 2.

	self class methods do: [ :each | self halt. each with: temp1 ].
	^ temp2

The block does not reference temp2, so it actually is created as a
block that does not know temp2 at all. temp2 is not accessibe from the
block or it's context.
Yet, when debugging, you want to just be able to write temp2 in the
block and eval it (without saving the method), as if you would access
the temp in this block and recompile, the compiler would compile a
different block that would know temp2.

So the reflective API of Context (and the infrastructure of reading
Variables reflectively), is build in a way that #readVariableNamed:
will
use the stack to find the value of temps that are not available in the
block context, but could be available if they would be referenced
statically.

And if you think about it, we have just a case like that here, in
reverse: if we would use #readVariableNamed: on the context to read the
argument,
the compiler would not see a read of "key", thus compile a block that
does not have that temp available. And if we then read "self" via
thisContext, too,
the compiler sees a block that it can compile as a clean block:

at: key
"Answer the value associated with the key."

<compilerOptions: #(+ optionCleanBlockClosure)>

^ self at: key ifAbsent: [

	"this block is never executed, yet we pay runtime cost to create it 

if it is a full block.
We use instead the reflective API to read the argument and receiver
via the stack, making
the block clean.

	We enable cleanBlocks for this method by setting 

optionCleanBlockClosure as it is not yet
enabled by default"

	KeyNotFound signalFor: (thisContext readVariableNamed: #key) in: 

thisContext sender receiver
"This is equivalent to:
self errorKeyNotFound: key"
]

Let's look at the printed symbolic bytecode:

Dictionary>>#at:) symbolic  

"'41 <4C> self
42 <40> pushTemp: 0
43 <20> pushConstant: [
	
		"this block is never executed, yet we pay runtime cost to create it 
if it is a full block.
		We use instead the reflective API to read the argument and receiver 
via the stack, making
		the block clean.
	
		We enable cleanBlocks for this method by setting 
optionCleanBlockClosure as it is not yet
		enabled by default"
		
		KeyNotFound signalFor: (thisContext readVariableNamed: #key) in: 
thisContext sender receiver
		"This is equivalent to:
			self errorKeyNotFound: key"
		]
44 <A1> send: at:ifAbsent:
45 <5C> returnTop'"

you see that it now uses pushConstant: to push the pre-compiled
CleanBlockClosure, which is much faster.

A very naive benchmark using a fairly large Dictionary (Smalltalk globals):

[Smalltalk globals at: #Object] bench 

8591996.801/7040335.266 "1.220395972119888"

Shows a speedup of ~20%, which is qute a lot!

Of course, this is a hack. For one, it just solves this for this one
case. And we do not want to add code like
that everywhere. And real world impact of speed will be quite limited
due to that (it just speeds up
Dictionary>#at:)

The alternative would be to inline #at:ifAbsent: This inlining, due to
the existing subclasses, leads to quite some
needed refactorings, and has the same problem of being a solution for
this problem in this one place.

The real solution in the long term is a JIT that removes blocks like
that by inlining automatically for the code it executes,
while not changing the image level code at all. This would then solve
this for all similar cases everywhere all at once.

But the nice aspect of the hack is: it is local (touches just this one
method), gives us some speedup now and will be
trivially removable when we deploy a better solution in the future.

But the code is quite ugly and indeed relies on reflection... which
maybe for Dictionary>>#at: is not that nice
(e.g. if you want to create a minimal image without reflection)

But it is save to add this to your own image, it works in Pharo11 and
Pharo12 and I guess even in Pharo10
(not tested there).

Marcus
I love these little discussions/examples that you share... things you take for granted have subtle impacts that are worth reasoning about. Whenever I read them, it definitely improves my understanding and makes me a better programmer. Thanks Markus. Tim On Fri, 19 May 2023, at 12:01 PM, Marcus Denker wrote: > Dictionary lookup is used a lot. Let's look at Dictionary>>#at: > > > at: key > "Answer the value associated with the key." > ^ self at: key ifAbsent: [self errorKeyNotFound: key] > > > The method looks simple and at a first glance, there are no obvious problems. > > But if we look at it again, we see that for every execution of #at:, > the block [self errorKeyNotFound: key] > has to be created. The method stores a CompiledBlock in the literals, > and a bytecode "create block" creates > the block: > > ``` > (Dictionary>>#at:) symbolic > > "'41 <4C> self > 42 <40> pushTemp: 0 > 43 <40> pushTemp: 0 > 44 <F9 00 01> fullClosure:a CompiledBlock: [self errorKeyNotFound: key] > NumCopied: 1 > 47 <A1> send: at:ifAbsent: > 48 <5C> returnTop'" > ``` > > This happens at *every* execution of #at:, even though the block that > we spend time to create will never > be executed at runtime (outside of real errors). > > If you look at the rest of the code path of #at: in a Dictionary, it is > carefully written to avoid block creation > by using optimized contructs (ifTrue: and friends). > > Could clean blocks help? Clean blocks are blocks that only use data > that the compiler knows at compile time, > thus we can create a CleanBlockClosure (which has the CompiledBlock) > and store *that* as a Literal. > > This then would mean that we could just use as fast pushLiteral: to > push the block, no creation at runtime needed. > > Clean blocks are not yet enabled by default, but we can use a compiler > option pragma to enable them just for this > method: > > ``` > <compilerOptions: #(+ optionCleanBlockClosure)> > ``` > > The problem is that the block references both "self" and the key to be > able to show a nice error message. > Referencing either one make the block not clean. > > Could we change the block to be clean? I guess we could not refer self > and inline the method #errorKeyNotFound: > (knowing the instance is not that important for logging the error in > production, for example). But we do want > to know the key that is not found, it really simplifies debugging > especially if you have to look at log files. > > If we closely look at the block: we know a bit more about it and how it > is used. We know that if it gets evaluated, > that evaluation will *always* happen with the homeContext of the block > on the stack, as it is will always be #at:ifAbsent: > that evaluates that block. > > And in that case, there is a trick that we can do: we can use > reflection to read the needed values via the stack. > > You might not be aware, but the debugger needs some quite interesting > features from the reflective layer of the system to provide > the user experience that you take for granted (and that feels trivial). > Imagine a block like that: > > ``` > tt > > | temp1 temp2 | > temp1 := 1. > temp2 := 2. > > self class methods do: [ :each | self halt. each with: temp1 ]. > ^ temp2 > ``` > > > The block does not reference temp2, so it actually is created as a > block that does not know temp2 at all. temp2 is not accessibe from the > block or it's context. > Yet, when debugging, you want to just be able to write temp2 in the > block and eval it (without saving the method), as *if* you would access > the temp in this block and recompile, the compiler would compile a > different block that would know temp2. > > So the reflective API of Context (and the infrastructure of reading > Variables reflectively), is build in a way that #readVariableNamed: > will > use the stack to find the value of temps that are not available in the > block context, but could be available if they would be referenced > statically. > > And if you think about it, we have just a case like that here, in > reverse: if we would use #readVariableNamed: on the context to read the > argument, > the compiler would not see a read of "key", thus compile a block that > does not have that temp available. And if we then read "self" via > thisContext, too, > the compiler sees a block that it can compile as a clean block: > > > at: key > "Answer the value associated with the key." > > <compilerOptions: #(+ optionCleanBlockClosure)> > > ^ self at: key ifAbsent: [ > > "this block is never executed, yet we pay runtime cost to create it > if it is a full block. > We use instead the reflective API to read the argument and receiver > via the stack, making > the block clean. > > We enable cleanBlocks for this method by setting > optionCleanBlockClosure as it is not yet > enabled by default" > > KeyNotFound signalFor: (thisContext readVariableNamed: #key) in: > thisContext sender receiver > "This is equivalent to: > self errorKeyNotFound: key" > ] > > > Let's look at the printed symbolic bytecode: > > > ``` > Dictionary>>#at:) symbolic > > "'41 <4C> self > 42 <40> pushTemp: 0 > 43 <20> pushConstant: [ > > "this block is never executed, yet we pay runtime cost to create it > if it is a full block. > We use instead the reflective API to read the argument and receiver > via the stack, making > the block clean. > > We enable cleanBlocks for this method by setting > optionCleanBlockClosure as it is not yet > enabled by default" > > KeyNotFound signalFor: (thisContext readVariableNamed: #key) in: > thisContext sender receiver > "This is equivalent to: > self errorKeyNotFound: key" > ] > 44 <A1> send: at:ifAbsent: > 45 <5C> returnTop'" > ``` > > you see that it now uses pushConstant: to push the pre-compiled > CleanBlockClosure, which is *much* faster. > > A very naive benchmark using a fairly large Dictionary (Smalltalk globals): > > ``` > [Smalltalk globals at: #Object] bench > > 8591996.801/7040335.266 "1.220395972119888" > ``` > > Shows a speedup of ~20%, which is qute a lot! > > Of course, this is a hack. For one, it just solves this for this one > case. And we do not want to add code like > that everywhere. And real world impact of speed will be quite limited > due to that (it just speeds up > Dictionary>#at:) > > The alternative would be to inline #at:ifAbsent: This inlining, due to > the existing subclasses, leads to quite some > needed refactorings, and has the same problem of being a solution for > this problem in this one place. > > The real solution in the long term is a JIT that removes blocks like > that by inlining automatically for the code it executes, > while not changing the image level code at all. This would then solve > this for all similar cases everywhere all at once. > > But the nice aspect of the hack is: it is local (touches just this one > method), gives us some speedup now and will be > trivially removable when we deploy a better solution in the future. > > But the code is quite ugly and indeed relies on reflection... which > maybe for Dictionary>>#at: is not that nice > (e.g. if you want to create a minimal image without reflection) > > But it is save to add this to your own image, it works in Pharo11 and > Pharo12 and I guess even in Pharo10 > (not tested there). > > Marcus