Equality/Hash Modification and Image Rehashing

S
sean@clipperadams.com
Tue, Aug 30, 2022 5:00 PM

From Pharo Discord:

IIUC when I change the comparison behavior of a class (i.e. reimplement #= and #hash), sets containing those objects start acting oddly e.g. happily storing duplicates of equal objects and failing to find objects they contain. A solution seems to be Set allSubInstances do: #rehash.

Is my understanding correct? If so, I find it odd that I've been using and researching Smalltalk for well over a decade and I never remember reading/hearing about this. I wonder if it can be baked into the IDE somehow e.g. via a critic...

And stephan responded:

The mailing list posts about that I remember were at the time where the number of object bits used for the hash were increased.


Following up on his clue, I found this old post, which seems to confirm my suspicion. This seems like a serious trap for users to fall into (I’m having flashbacks to my C++ days lol). So, can/should we somehow handle this in the IDE? Something like a refactoring (that has a confirmation window) that gives you the choice to rehash image collections when redefining these methods?

In a 1GB image, `Set allSubInstances do: #rehash` is almost instantaneous and `Dictionary allSubInstances do: #rehash` takes several seconds.

From Pharo Discord: IIUC when I change the comparison behavior of a class (i.e. reimplement #= and #hash), sets containing those objects start acting oddly e.g. happily storing duplicates of equal objects and failing to find objects they contain. A solution seems to be Set allSubInstances do: #rehash. Is my understanding correct? If so, I find it odd that I've been using and researching Smalltalk for well over a decade and I never remember reading/hearing about this. I wonder if it can be baked into the IDE somehow e.g. via a critic... And stephan responded: > The mailing list posts about that I remember were at the time where the number of object bits used for the hash were increased. --- Following up on his clue, I found [this old post](https://lists.pharo.org/empathy/thread/LKCE5UXYPXPLVOGNZYW64BWHYJY5NMHQ?hash=LKCE5UXYPXPLVOGNZYW64BWHYJY5NMHQ#LKCE5UXYPXPLVOGNZYW64BWHYJY5NMHQ "this old post"), which seems to confirm my suspicion. This seems like a serious trap for users to fall into (I’m having flashbacks to my C++ days lol). So, can/should we somehow handle this in the IDE? Something like a refactoring (that has a confirmation window) that gives you the choice to rehash image collections when redefining these methods? In a 1GB image, \`Set allSubInstances do: #rehash\` is almost instantaneous and \`Dictionary allSubInstances do: #rehash\` takes several seconds.
DS
Daniel Slomovits
Tue, Aug 30, 2022 5:28 PM

Everything you say is true, but I don't consider it nearly as serious a
trap as you say, because I wouldn't generally expect a "user" (that is, a
developer not explicitly working on Pharo itself) to modify the #=/#hash
implementation of a system class (that is, one included in the base image
and instances of which may be expected to exist at all times). A user
working on their own application and modifying a #=/#hash implementation
there has a very straightforward way to resolve the problem—"reboot" their
own application, tearing down and rebuilding any structures it uses. For
the applications I work on this is very easy, because it needs to be in
order for tests to set up a clean environment each run.

For a developer who is explicitly working on Pharo, I think appropriate
comments in the #=/#hash implementations of classes critical to the
functioning of the image (most obviously the default implementation in
[Proto]Object) would be enough (and may already exist). I know in Dolphin
Smalltalk there are comments in the #identityHash methods explaining the
related point that a binary-filed collection may need to be rehashed on
load as its contents will not have the same identities and thus
#identityHash-es. Dolphin also changed the number of hash bits similar to
what stephan mentioned, and this definitely involved jumping through some
hoops to perform safely.

More generally, I think this is something anyone with a solid computer
science background and/or experience working on language implementations
would be expected to pick up by osmosis and/or reasoning from first
principles. I don't mean this as an insult to people who just want to write
cool apps, but, those are exactly the people who are unlikely to encounter
the problem, or if they do, will shrug, restart everything, and the problem
will go away. That said, there's certainly nothing wrong with some good
comments.

The thing is that the Set/Dictionary allSubInstances do: #rehash approach
is extremely heavy-handed. I suppose it should be largely safe, though to
be certain it would need to be done at very high priority to avoid the
possibility of concurrent modification (and even then there is a
theoretical window of vulnerability). But the cost/benefit analysis feels
pretty marginal IMO for making it a part of the IDE. One refinement, if it
were to be added—if there are no instances of the class which is being
modified, no rehashing can possibly be needed, and in fact in most cases I
would expect this to be true. One could additionally scan Sets for
instances of that class before rehashing them, which might not actually be
vastly more performant but would reduce the heavy-handedness of the
operation.

On Tue, Aug 30, 2022 at 1:01 PM sean@clipperadams.com wrote:

From Pharo Discord:

IIUC when I change the comparison behavior of a class (i.e. reimplement #=
and #hash), sets containing those objects start acting oddly e.g. happily
storing duplicates of equal objects and failing to find objects they
contain. A solution seems to be Set allSubInstances do: #rehash.

Is my understanding correct? If so, I find it odd that I've been using and
researching Smalltalk for well over a decade and I never remember
reading/hearing about this. I wonder if it can be baked into the IDE
somehow e.g. via a critic...

And stephan responded:

The mailing list posts about that I remember were at the time where the
number of object bits used for the hash were increased.


Following up on his clue, I found this old post
https://lists.pharo.org/empathy/thread/LKCE5UXYPXPLVOGNZYW64BWHYJY5NMHQ?hash=LKCE5UXYPXPLVOGNZYW64BWHYJY5NMHQ#LKCE5UXYPXPLVOGNZYW64BWHYJY5NMHQ,
which seems to confirm my suspicion. This seems like a serious trap for
users to fall into (I’m having flashbacks to my C++ days lol). So,
can/should we somehow handle this in the IDE? Something like a refactoring
(that has a confirmation window) that gives you the choice to rehash image
collections when redefining these methods?

In a 1GB image, Set allSubInstances do: #rehash is almost instantaneous
and Dictionary allSubInstances do: #rehash takes several seconds.

Everything you say is true, but I don't consider it nearly as serious a trap as you say, because I wouldn't generally expect a "user" (that is, a developer not explicitly working on Pharo itself) to modify the #=/#hash implementation of a system class (that is, one included in the base image and instances of which may be expected to exist at all times). A user working on their own application and modifying a #=/#hash implementation there has a very straightforward way to resolve the problem—"reboot" their own application, tearing down and rebuilding any structures it uses. For the applications I work on this is very easy, because it needs to be in order for tests to set up a clean environment each run. For a developer who *is* explicitly working on Pharo, I think appropriate comments in the #=/#hash implementations of classes critical to the functioning of the image (most obviously the default implementation in [Proto]Object) would be enough (and may already exist). I know in Dolphin Smalltalk there are comments in the #identityHash methods explaining the related point that a binary-filed collection may need to be rehashed on load as its contents will not have the same identities and thus #identityHash-es. Dolphin also changed the number of hash bits similar to what stephan mentioned, and this definitely involved jumping through some hoops to perform safely. More generally, I think this is something anyone with a solid computer science background and/or experience working on language implementations would be expected to pick up by osmosis and/or reasoning from first principles. I don't mean this as an insult to people who just want to write cool apps, but, those are exactly the people who are unlikely to encounter the problem, or if they do, will shrug, restart everything, and the problem will go away. That said, there's certainly nothing wrong with some good comments. The thing is that the `Set/Dictionary allSubInstances do: #rehash` approach is extremely heavy-handed. I suppose it should be largely safe, though to be certain it would need to be done at very high priority to avoid the possibility of concurrent modification (and even then there is a theoretical window of vulnerability). But the cost/benefit analysis feels pretty marginal IMO for making it a part of the IDE. One refinement, if it were to be added—if there are no instances of the class which is being modified, no rehashing can possibly be needed, and in fact in most cases I would expect this to be true. One could additionally scan Sets for instances of that class before rehashing them, which might not actually be vastly more performant but would reduce the heavy-handedness of the operation. On Tue, Aug 30, 2022 at 1:01 PM <sean@clipperadams.com> wrote: > From Pharo Discord: > > IIUC when I change the comparison behavior of a class (i.e. reimplement #= > and #hash), sets containing those objects start acting oddly e.g. happily > storing duplicates of equal objects and failing to find objects they > contain. A solution seems to be Set allSubInstances do: #rehash. > > Is my understanding correct? If so, I find it odd that I've been using and > researching Smalltalk for well over a decade and I never remember > reading/hearing about this. I wonder if it can be baked into the IDE > somehow e.g. via a critic... > > And stephan responded: > > The mailing list posts about that I remember were at the time where the > number of object bits used for the hash were increased. > > ------------------------------ > > Following up on his clue, I found this old post > <https://lists.pharo.org/empathy/thread/LKCE5UXYPXPLVOGNZYW64BWHYJY5NMHQ?hash=LKCE5UXYPXPLVOGNZYW64BWHYJY5NMHQ#LKCE5UXYPXPLVOGNZYW64BWHYJY5NMHQ>, > which seems to confirm my suspicion. This seems like a serious trap for > users to fall into (I’m having flashbacks to my C++ days lol). So, > can/should we somehow handle this in the IDE? Something like a refactoring > (that has a confirmation window) that gives you the choice to rehash image > collections when redefining these methods? > > In a 1GB image, `Set allSubInstances do: #rehash` is almost instantaneous > and `Dictionary allSubInstances do: #rehash` takes several seconds. >
SV
Sven Van Caekenberghe
Tue, Aug 30, 2022 7:16 PM

A use case in the system is #stonPostReferenceResolution for Dictionary and Set. You could have a look at that (it is an edge case, but still).

On 30 Aug 2022, at 19:00, sean@clipperadams.com wrote:

From Pharo Discord:

IIUC when I change the comparison behavior of a class (i.e. reimplement #= and #hash), sets containing those objects start acting oddly e.g. happily storing duplicates of equal objects and failing to find objects they contain. A solution seems to be Set allSubInstances do: #rehash.

Is my understanding correct? If so, I find it odd that I've been using and researching Smalltalk for well over a decade and I never remember reading/hearing about this. I wonder if it can be baked into the IDE somehow e.g. via a critic...

And stephan responded:

The mailing list posts about that I remember were at the time where the number of object bits used for the hash were increased.

Following up on his clue, I found this old post, which seems to confirm my suspicion. This seems like a serious trap for users to fall into (I’m having flashbacks to my C++ days lol). So, can/should we somehow handle this in the IDE? Something like a refactoring (that has a confirmation window) that gives you the choice to rehash image collections when redefining these methods?

In a 1GB image, Set allSubInstances do: #rehash is almost instantaneous and Dictionary allSubInstances do: #rehash takes several seconds.

A use case in the system is #stonPostReferenceResolution for Dictionary and Set. You could have a look at that (it is an edge case, but still). > On 30 Aug 2022, at 19:00, sean@clipperadams.com wrote: > > From Pharo Discord: > > IIUC when I change the comparison behavior of a class (i.e. reimplement #= and #hash), sets containing those objects start acting oddly e.g. happily storing duplicates of equal objects and failing to find objects they contain. A solution seems to be Set allSubInstances do: #rehash. > > Is my understanding correct? If so, I find it odd that I've been using and researching Smalltalk for well over a decade and I never remember reading/hearing about this. I wonder if it can be baked into the IDE somehow e.g. via a critic... > > And stephan responded: > > The mailing list posts about that I remember were at the time where the number of object bits used for the hash were increased. > > Following up on his clue, I found this old post, which seems to confirm my suspicion. This seems like a serious trap for users to fall into (I’m having flashbacks to my C++ days lol). So, can/should we somehow handle this in the IDE? Something like a refactoring (that has a confirmation window) that gives you the choice to rehash image collections when redefining these methods? > > In a 1GB image, `Set allSubInstances do: #rehash` is almost instantaneous and `Dictionary allSubInstances do: #rehash` takes several seconds. >
S
sean@clipperadams.com
Wed, Sep 7, 2022 2:24 AM

Daniel Slomovits wrote:

A user working on their own application and modifying a #=/#hash implementation there has a very straightforward way to resolve the problem—"reboot" their own application, tearing down and rebuilding any structures it uses.

For a developer in the common usage, as distinct from a user, I suspect you may be right that it’s not too much of an issue. However, Smalltalk tends to erase that distinction and there seems to be a solid community who are both user and developer and “do brain surgery on themselves” regularly. I am such a user and got surprised and a little damaged by this gotcha.

The thing is that the Set/Dictionary allSubInstances do: #rehash approach
is extremely heavy-handed.

I agree. That would be one way to mitigate. I’m interested in others e.g. a code critic or warning in the browser like some refactoring have.

Daniel Slomovits wrote: > A user working on their own application and modifying a #=/#hash implementation there has a very straightforward way to resolve the problem—"reboot" their own application, tearing down and rebuilding any structures it uses. For a developer in the common usage, as distinct from a user, I suspect you may be right that it’s not too much of an issue. However, Smalltalk tends to erase that distinction and there seems to be a solid community who are both user and developer and “do brain surgery on themselves” regularly. I am such a user and got surprised and a little damaged by this gotcha. > The thing is that the `Set/Dictionary allSubInstances do: #rehash` approach > is extremely heavy-handed. I agree. That would be one way to mitigate. I’m interested in others e.g. a code critic or warning in the browser like some refactoring have.