pharo-users@lists.pharo.org

Any question about pharo is welcome

View all threads

A question about #beginsWith: and #endsWith:

DT
David T. Lewis
Fri, Apr 29, 2022 1:35 AM

FYI it was fixed in Squeak a couple of days ago:

http://lists.squeakfoundation.org/pipermail/squeak-dev/2022-April/220089.html

Richard O'Keefe's point of view was overwhelmingly supported, notably including this post:

http://lists.squeakfoundation.org/pipermail/squeak-dev/2022-April/220038.html

Dave

On Sat, Apr 23, 2022 at 12:37:48PM +1200, Richard O'Keefe wrote:

Dan Ingalls is of course a big NAME in the
world of Smalltalk, but the stated reason
for changing the behaviour of #beginsWith:
and #endsWith: makes no sense.

We have many ways to define a partial order
on strings.
x <= y iff y beginsWith: x
x <= y iff y endsWith: x
x <= y iff y includesSubCollection: x
x <= y iff y includesSubSequence: x
These things are supposed to obey laws:
if a beginsWith: b , c
then a beginsWith: b
if a endsWith: b , c
then a endsWith: c
if a includesSubCollection: b , c
then a includesSubCollection: b
and a includesSubCollection: c
if a includesSubSequence: b , c
then a includesSubSequence: b
and a includesSubSequence: c.

We also expect the usual rules of equality
to hold.  So
(1) a beginsWith: a
(2) a = '' , a
(3) THEREFORE a beginsWith: ''

(1) a endsWith: a
(2) a = a , ''
(3) THEREFORE a endsWith: ''

(1) a includesSubCollection: a
(2) a = '' , a
(3) THEREFORE a includesSubCollect: ''

Reasoning about strings (as values) gets
enormously more complicated if the operations
do not follow simple sensible rules, and
having '' be the least string under these
orderings and having '' be a prefix and a
suffix of any string is essential if the
rules are going to be simple and coherent.

'' is to strings (and more generally
empty sequences are to sequences) pretty
much what 0 is to integers.  Denying that
'abc' beginsWith: '' is structurally
just like denying that 0 <= 123.

Now as it happens I can see a use for
versions of #beginsWith: and #endsWith:
that diverge from the ones we have, but
this is not where they need to diverge.
a beginsWithGraphemesOf: b
iff a asGraphemes = b asGraphemes , c asGraphemes
for some c, where s asGraphemes returns a
sequence of strings each of which is a maximally
long grapheme cluster, such that concatenating
s asGraphemes recovers s.  That is,
#beginsWithGraphemesOf: and
#endsWithGraphemesOf: would respect the
Unicode Text Segmentation boundaries.
But s beginsWithGraphemesOf: ''
would still need to be true.

The thing is, in practice you often DON'T
KNOW whether a potential affix is empty or
not.  Here are some of my test cases.

 testTestData
   "Ensure that the sample string has no duplicates."
   [(Set withAll: string) size = stringSize] assert.

 testBeginsWith
   "Test that every prefix of the sample IS a prefix of it."
   0 to: stringSize do: [:n |
     [string beginsWith: (string copyFrom: 1 to: n)] assert].

 testEndsWith
   "Test that every suffix of the sample IS a suffix of it."
   0 to: stringSize do: [:n |
     [string endsWith: (string copyFrom: stringSize - n + 1 to:

stringSize)] assert].

 testIndexOfSubCollectionAtBeginning
   "Test that every prefix of 'abcd' is found at the beginning."
   0 to: stringSize do: [:n | |s i t|
     s := string copyFrom: 1 to: n.
     i := string indexOfSubCollection: s startingAt: 1.
     [1 = i] assert.
     t := string copyFrom: i to: i - 1 + n.
     [t = s] assert].

 testIndexOfSubCollectionAtEnd
   "Test that every proper suffix of the sample is found at the end."
   1 to: stringSize do: [:n | |s i t|
     s := string copyFrom: stringSize - n + 1 to: stringSize.
     i := string indexOfSubCollection: s startingAt: 1.
     [stringSize + 1 - n = i] assert.
     t := string copyFrom: i to: i - 1 + n.
     [t = s] assert].

 testLastIndexOfSubCollectionAtBeginning
   "Test that every proper prefix of the sample is found at the

beginning."
1 to: stringSize do: [:n | |s i t|
s := string copyFrom: 1 to: n.
i := string lastIndexOfSubCollection: s startingAt: stringSize.
[1 = i] assert.
t := string copyFrom: i to: i - 1 + n.
[t = s] assert].

 testLastIndexOfSubCollectionAtEnd
   "Test that every suffix of the sample is found at the end."
   0 to: stringSize do: [:n | |s i t|
     s := string copyFrom: stringSize - n + 1 to: stringSize.
     i := string lastIndexOfSubCollection: s startingAt: stringSize.
     [stringSize + 1 - n = i] assert.
     t := string copyFrom: i to: i - 1 + n.
     [t = s] assert].

 testOccurrencesOfEmptyCollection
   "Test that the empty string occurs at the beginning,
    at the end, and in between every pair of adjacent characters."
   [(string occurrencesOfSubCollection: '') = (stringSize + 1)] assert.

 testOccurrencesOfUniqueParts
   "Test that unique parts occur as many times as they should."
   |repeated|
   repeated := string , string , string.
   1 to: stringSize do: [:start |
     start to: stringSize do: [:finish | |s n|
       s := string copyFrom: start to: finish.
       n := string occurrencesOfSubCollection: s.
       [n = 1] assert.
       n := repeated occurrencesOfSubCollection: s.
       [n = 3] assert]].

On Fri, 22 Apr 2022 at 13:57, David T. Lewis lewis@mail.msen.com wrote:

Hi Richard,

(CC squeak-dev list, replies to the relevant list please)

On Thu, Apr 21, 2022 at 12:07:32AM +1200, Richard O'Keefe wrote:

I've just tracked down a nasty little problem
porting some code to Pharo.  As a result, I
have added to the comments in my own versions
of these methods.    beginsWith: aSequence
"Answer true if aSequence is a prefix of the receiver.
This makes sense for all sequences.
There is a compatibility issue concerning 'abc' beginsWith: ''
+ VisualWorks, Dolphin, astc, GNU ST (where the method is
called #startsWith:) and VisualAge (where the method
is called #wbBeginsWith:)
agree than EVERY sequence begins with an empty prefix.
- Squeak and Pharo
agree that NO sequence begins with an empty sequence.
# ST/X chooses compatibility with Squeak, heaving a big unhappy
sigh, and adds #startsWith: to have something sensible to use.
Now ST/X thinks it is compatible with VW, though it isn't, so
I wonder if this was a bug that VW fixed and Squeak didn't?
astc goes with the majority here.  This is also compatible with
Haskell, ML, and with StartsWith in C# and startsWith in Java."
^self beginsWith: aSequence ignoringCase: false

endsWith: aSequence
  "Answer true if aSequence is a suffix of the receiver.
   This makes sense for all sequences.
   There is a compatibility issue concerning 'abc' endsWith: ''.
   + VisualWorks, Dolphin, astc, GNU ST, and VisualAge (where
     the method is called #wbEndsWith:)
     agree that EVERY sequence ends with an empty suffix.
   - Squeak and Pharo
     agree that NO sequence ends with an empty suffix.
   # ST/X chooses compatibility with the majority, apparently
     unaware that this makes #beginsWith: and #endsWith:

inconsistent.

   astc goes with the majority here.  This is also compatible with
   Haskell, ML, C#, and Java."
  ^self endsWith: aSequence ignoringCase: false

Does anyone have any idea

  • why Squeak and Pharo are the odd ones out?
  • why anyone thought making #beginsWith: and #endsWith:, um, "quirky"
    was a good idea (it's pretty standard in books on the theory of
    strings to define "x is a prefix of y iff there is a z such that
    y = x concatenated with z")

The Squeak behavior was introduced in November 1998 by Dan Ingalls
in conjunction with some VM improvements that he was doing, and was
included in the Squeak2.2 release at that time.

Prior to that update, Squeak did this:
'abc' beginsWith: '' ==> true
'abc' endsWith: '' ==> true

For Squeak2.2 and later it is:
'abc' beginsWith: '' ==> false
'abc' endsWith: '' ==> false

Pharo presumably inherits this from Squeak.

I am attaching Dan's original change set from the early Squeak
update stream so you can see the context.

The change set comment says that "endsWith:, beginsWith:, and match
have been rewritten to take advantage of this considerably faster method"
(referring to a new primitive that Dan had added to the VM in this
change set).

I was about to try to file a bug report for the first time,
then realised that maybe other people don't think this IS a bug.

I don't think it is a bug. I can't think of a case where it makes
sense to say that a string of characters "begins with" or "ends with"
a string that contains nothing.

In any case, that's the history of the change in Squeak and Pharo as
best as I can reconstruct it.

HTH,
Dave

FYI it was fixed in Squeak a couple of days ago: http://lists.squeakfoundation.org/pipermail/squeak-dev/2022-April/220089.html Richard O'Keefe's point of view was overwhelmingly supported, notably including this post: http://lists.squeakfoundation.org/pipermail/squeak-dev/2022-April/220038.html Dave On Sat, Apr 23, 2022 at 12:37:48PM +1200, Richard O'Keefe wrote: > Dan Ingalls is of course a big NAME in the > world of Smalltalk, but the stated reason > for changing the behaviour of #beginsWith: > and #endsWith: makes no sense. > > > We have many ways to define a partial order > on strings. > x <= y iff y beginsWith: x > x <= y iff y endsWith: x > x <= y iff y includesSubCollection: x > x <= y iff y includesSubSequence: x > These things are supposed to obey laws: > if a beginsWith: b , c > then a beginsWith: b > if a endsWith: b , c > then a endsWith: c > if a includesSubCollection: b , c > then a includesSubCollection: b > and a includesSubCollection: c > if a includesSubSequence: b , c > then a includesSubSequence: b > and a includesSubSequence: c. > > We also expect the usual rules of equality > to hold. So > (1) a beginsWith: a > (2) a = '' , a > (3) THEREFORE a beginsWith: '' > > (1) a endsWith: a > (2) a = a , '' > (3) THEREFORE a endsWith: '' > > (1) a includesSubCollection: a > (2) a = '' , a > (3) THEREFORE a includesSubCollect: '' > > Reasoning about strings (as values) gets > enormously more complicated if the operations > do not follow simple sensible rules, and > having '' be the least string under these > orderings and having '' be a prefix and a > suffix of any string is essential if the > rules are going to be simple and coherent. > > '' is to strings (and more generally > empty sequences are to sequences) pretty > much what 0 is to integers. Denying that > 'abc' beginsWith: '' is *structurally* > just like denying that 0 <= 123. > > Now as it happens I *can* see a use for > versions of #beginsWith: and #endsWith: > that diverge from the ones we have, but > *this* is not where they need to diverge. > a beginsWithGraphemesOf: b > iff a asGraphemes = b asGraphemes , c asGraphemes > for some c, where s asGraphemes returns a > sequence of strings each of which is a maximally > long grapheme cluster, such that concatenating > s asGraphemes recovers s. That is, > #beginsWithGraphemesOf: and > #endsWithGraphemesOf: would respect the > Unicode Text Segmentation boundaries. > But s beginsWithGraphemesOf: '' > would still need to be true. > > The thing is, in practice you often DON'T > KNOW whether a potential affix is empty or > not. Here are some of my test cases. > > testTestData > "Ensure that the sample string has no duplicates." > [(Set withAll: string) size = stringSize] assert. > > testBeginsWith > "Test that every prefix of the sample IS a prefix of it." > 0 to: stringSize do: [:n | > [string beginsWith: (string copyFrom: 1 to: n)] assert]. > > testEndsWith > "Test that every suffix of the sample IS a suffix of it." > 0 to: stringSize do: [:n | > [string endsWith: (string copyFrom: stringSize - n + 1 to: > stringSize)] assert]. > > testIndexOfSubCollectionAtBeginning > "Test that every prefix of 'abcd' is found at the beginning." > 0 to: stringSize do: [:n | |s i t| > s := string copyFrom: 1 to: n. > i := string indexOfSubCollection: s startingAt: 1. > [1 = i] assert. > t := string copyFrom: i to: i - 1 + n. > [t = s] assert]. > > testIndexOfSubCollectionAtEnd > "Test that every proper suffix of the sample is found at the end." > 1 to: stringSize do: [:n | |s i t| > s := string copyFrom: stringSize - n + 1 to: stringSize. > i := string indexOfSubCollection: s startingAt: 1. > [stringSize + 1 - n = i] assert. > t := string copyFrom: i to: i - 1 + n. > [t = s] assert]. > > testLastIndexOfSubCollectionAtBeginning > "Test that every proper prefix of the sample is found at the > beginning." > 1 to: stringSize do: [:n | |s i t| > s := string copyFrom: 1 to: n. > i := string lastIndexOfSubCollection: s startingAt: stringSize. > [1 = i] assert. > t := string copyFrom: i to: i - 1 + n. > [t = s] assert]. > > testLastIndexOfSubCollectionAtEnd > "Test that every suffix of the sample is found at the end." > 0 to: stringSize do: [:n | |s i t| > s := string copyFrom: stringSize - n + 1 to: stringSize. > i := string lastIndexOfSubCollection: s startingAt: stringSize. > [stringSize + 1 - n = i] assert. > t := string copyFrom: i to: i - 1 + n. > [t = s] assert]. > > testOccurrencesOfEmptyCollection > "Test that the empty string occurs at the beginning, > at the end, and in between every pair of adjacent characters." > [(string occurrencesOfSubCollection: '') = (stringSize + 1)] assert. > > testOccurrencesOfUniqueParts > "Test that unique parts occur as many times as they should." > |repeated| > repeated := string , string , string. > 1 to: stringSize do: [:start | > start to: stringSize do: [:finish | |s n| > s := string copyFrom: start to: finish. > n := string occurrencesOfSubCollection: s. > [n = 1] assert. > n := repeated occurrencesOfSubCollection: s. > [n = 3] assert]]. > > On Fri, 22 Apr 2022 at 13:57, David T. Lewis <lewis@mail.msen.com> wrote: > > > Hi Richard, > > > > (CC squeak-dev list, replies to the relevant list please) > > > > On Thu, Apr 21, 2022 at 12:07:32AM +1200, Richard O'Keefe wrote: > > > I've just tracked down a nasty little problem > > > porting some code to Pharo. As a result, I > > > have added to the comments in my own versions > > > of these methods. beginsWith: aSequence > > > "Answer true if aSequence is a prefix of the receiver. > > > This makes sense for all sequences. > > > There is a compatibility issue concerning 'abc' beginsWith: '' > > > + VisualWorks, Dolphin, astc, GNU ST (where the method is > > > called #startsWith:) and VisualAge (where the method > > > is called #wbBeginsWith:) > > > agree than EVERY sequence begins with an empty prefix. > > > - Squeak and Pharo > > > agree that NO sequence begins with an empty sequence. > > > # ST/X chooses compatibility with Squeak, heaving a big unhappy > > > sigh, and adds #startsWith: to have something sensible to use. > > > Now ST/X *thinks* it is compatible with VW, though it isn't, so > > > I wonder if this was a bug that VW fixed and Squeak didn't? > > > astc goes with the majority here. This is also compatible with > > > Haskell, ML, and with StartsWith in C# and startsWith in Java." > > > ^self beginsWith: aSequence ignoringCase: false > > > > > > endsWith: aSequence > > > "Answer true if aSequence is a suffix of the receiver. > > > This makes sense for all sequences. > > > There is a compatibility issue concerning 'abc' endsWith: ''. > > > + VisualWorks, Dolphin, astc, GNU ST, and VisualAge (where > > > the method is called #wbEndsWith:) > > > agree that EVERY sequence ends with an empty suffix. > > > - Squeak and Pharo > > > agree that NO sequence ends with an empty suffix. > > > # ST/X chooses compatibility with the majority, apparently > > > unaware that this makes #beginsWith: and #endsWith: > > inconsistent. > > > astc goes with the majority here. This is also compatible with > > > Haskell, ML, C#, and Java." > > > ^self endsWith: aSequence ignoringCase: false > > > > > > Does anyone have any idea > > > - why Squeak and Pharo are the odd ones out? > > > - why anyone thought making #beginsWith: and #endsWith:, um, "quirky" > > > was a good idea (it's pretty standard in books on the theory of > > > strings to define "x is a prefix of y iff there is a z such that > > > y = x concatenated with z") > > > > The Squeak behavior was introduced in November 1998 by Dan Ingalls > > in conjunction with some VM improvements that he was doing, and was > > included in the Squeak2.2 release at that time. > > > > Prior to that update, Squeak did this: > > 'abc' beginsWith: '' ==> true > > 'abc' endsWith: '' ==> true > > > > For Squeak2.2 and later it is: > > 'abc' beginsWith: '' ==> false > > 'abc' endsWith: '' ==> false > > > > Pharo presumably inherits this from Squeak. > > > > I am attaching Dan's original change set from the early Squeak > > update stream so you can see the context. > > > > The change set comment says that "endsWith:, beginsWith:, and match > > have been rewritten to take advantage of this considerably faster method" > > (referring to a new primitive that Dan had added to the VM in this > > change set). > > > > > > > > I was about to try to file a bug report for the first time, > > > then realised that maybe other people don't think this IS a bug. > > > > I don't think it is a bug. I can't think of a case where it makes > > sense to say that a string of characters "begins with" or "ends with" > > a string that contains nothing. > > > > In any case, that's the history of the change in Squeak and Pharo as > > best as I can reconstruct it. > > > > HTH, > > Dave > > > >
SM
Steffen Märcker
Fri, Apr 29, 2022 7:14 AM

Hi Kasper,

I've thought about that approach too. But then asked myself whether it is
more likely that there is code that relies on this bug than code where this
went unnoticed and is therefore broken. What do the other think about that
matter and the fix in Squeak?

Best, Steffen

Kasper Osterbye schrieb am Donnerstag, 28. April 2022 19:03:07 (+02:00):

Kasper Osterbye schrieb am Dienstag, 26. April 2022 14:50:51 (+02:00):

I have now raised it as an issue on the issue tracker

Issue #11165 in https://github.com/pharo-project/pharo/issues/11165

If I may suggest a solution it will be to:
add two new methods - prefixedBy:  and suffixedBy: to handle the empty
prefix/suffix correctly
add comments to beginsWith:  and endsWith:  referring to the two new
methods

The problem is that there is client code which depend on the (wrong)
implementation of the beginsWith and endsWith methods.

Hi Kasper, I've thought about that approach too. But then asked myself whether it is more likely that there is code that relies on this bug than code where this went unnoticed and is therefore broken. What do the other think about that matter and the fix in Squeak? Best, Steffen Kasper Osterbye schrieb am Donnerstag, 28. April 2022 19:03:07 (+02:00): Kasper Osterbye schrieb am Dienstag, 26. April 2022 14:50:51 (+02:00): I have now raised it as an issue on the issue tracker Issue #11165 in https://github.com/pharo-project/pharo/issues/11165 If I may suggest a solution it will be to: add two new methods - prefixedBy: and suffixedBy: to handle the empty prefix/suffix correctly add comments to beginsWith: and endsWith: referring to the two new methods The problem is that there is client code which depend on the (wrong) implementation of the beginsWith and endsWith methods.
RO
Richard O'Keefe
Fri, Apr 29, 2022 11:28 AM

I like proverbs.  One of mine is "zero is also a number."

Way back in the 1980s the University in my home town was
given a new IBM mainframe by IBM.  The Computer Centre
wrote a archival program for it that scanned each user's
minidisk (= fixed partition) for files that had not been
used for a while and logged them to a master file, then
ran a job that moved the files to a tape.

One day everything went horribly wrong.

It turned out that the reason was very simple, but so
alien that nobody had noticed when they read the
documentation.  It was the kind of thing that was so
unbelievable you couldn't even see it when it was
hinted at in front of you.

Consider the following process:
(1) create-or-replace a file.
(2) write N records to it.
(3) close the file.
Simple.  Obvious.  And wrong.  Because step (1) didn't
actually happen until step (3), and if N = 0, it didn't
happen then either.  In that IBM mainframe operating system,
you literally could not create an empty file.

So when one day there weren't any files that needed
moving to tape (because they had all been done the previous
day), the process that was expected to write an empty list
of files simply left the previous day's list unchanged.
Result: thousands of multipage error reports when the
archiver tried to move files that weren't there any more.

All because someone at IBM had not been able to conceive
of a file with no records!

I hope the analogy is clear.  But it gets better.
If you wrote a file in C on that system, it was in
effect line-buffered.  Your text got written to the
file when (and only when) you wrote a newline
character.  So if you imitated common MS-DOS practice
and began lines with a newline, the last line of your
file would never be written out.  The equivalent of
s := FileStream write: 'foo text a'.
s nextPutAll: 'Hello world!'.
s close.
would write no records, and thus not create or replace
the file.  (Because a file with N = 0 newline characters
is another of those N = 0 things that cannot exist!)

I had to redesign the I/O library for a certain programming
language that ran on a lot of Unix systems, on VAX/VMS, and
we thought was running well through the C library on IBM
mainframes.  All sorts of finicky little details to deal
with N=0.  So at the end of a file, if nothing had been
written, we wrote an empty record, unless you turned that
option off.  At the end of each record, if nothing had been
written, we wrote a space.  Yes, that N = 0 case was
another problem.

It has been over 30 years since I had anything to do with
IBM mainframes, but some experiences leave scars.

I note that in Pharo 9.0, there is a subtle difference
between SequenceableCollection>>copyReplaceAll:with:
and String>>copyReplaceAll:with:
and the latter is not consistent with itself.
Let x and y be Strings.
x copyReplaceAll: '' with: y
is always x, except when x is empty, in which case it is y.
Let x and y be Arrays.
x copyReplaceAll: #() with: y
is always x.  This is not the subtle difference I mentioned
above.  That difference is that the string version takes
linear time (assuming fast search) while the general version
takes quadratic time (also assuming fast search).

The ANSI specification for #copyReplaceAll:with: is quite
clear:
Answer a new collection with the elements of the receiver
in their original order, except where a subsequence in the
receiver matches targetElements.  A subsequence in the
receiver is said to match the elements of targetElements if:

  1. They have the same number of elements.
  2. For all indices of the subsequence, the element in the
    subsequence at a given index is equivalent to the element
    in targetElements at the same index.

By this definition, an empty sequence has matches
in any sequence.

I'm not advancing this as evidence that I'm right about
#beginsWith: and #endsWith:, because the following text in
the standard makes it clear that for some reason, empty
sequences were literally UNTHINKABLE for the framers of
the ANSI Smalltalk standard, at least in the context of
search.  I'm mentioning it as evidence that for many
people, zero is the hardest number to think of.

I've finally realised why some people stubbornly cling
to the idea that the natural numbers begin with 1, not
zero (as if the Peano axioms are not a thing).
If you are thinking of the natural numbers in terms of
multiplication (and primes, and so on), well,
the positive integers are a cancellative monoid with
identity 1, but not if you include 0.
If you are thinking of the natural numbers in terms of
addition, then the non-negative integers are a
cancellative monoid with identity 0.
So there are two different monoids, for one of which
including 0 would be many kinds of trouble, and for
the other of which excluding 0 would be absurd and awkward.

Strings are cancellative monoids under concatenation with
'' as the identity.  (Don't ask me about Unicode.)  So why
is '' so hard to think about?  I don't think it is just
Strings.  In functional programming, it was surprisingly
hard to teach students to remember the empty list.  (For
myself, I got into the habit of writing the base case of
a recursive function first to avoid the mistake of
leaving out the base case.)  I suspect that there are
other places with surprising behaviour for 'trivial' cases.

On Fri, 29 Apr 2022 at 13:35, David T. Lewis lewis@mail.msen.com wrote:

FYI it was fixed in Squeak a couple of days ago:

http://lists.squeakfoundation.org/pipermail/squeak-dev/2022-April/220089.html

Richard O'Keefe's point of view was overwhelmingly supported, notably
including this post:

http://lists.squeakfoundation.org/pipermail/squeak-dev/2022-April/220038.html

Dave

On Sat, Apr 23, 2022 at 12:37:48PM +1200, Richard O'Keefe wrote:

Dan Ingalls is of course a big NAME in the
world of Smalltalk, but the stated reason
for changing the behaviour of #beginsWith:
and #endsWith: makes no sense.

We have many ways to define a partial order
on strings.
x <= y iff y beginsWith: x
x <= y iff y endsWith: x
x <= y iff y includesSubCollection: x
x <= y iff y includesSubSequence: x
These things are supposed to obey laws:
if a beginsWith: b , c
then a beginsWith: b
if a endsWith: b , c
then a endsWith: c
if a includesSubCollection: b , c
then a includesSubCollection: b
and a includesSubCollection: c
if a includesSubSequence: b , c
then a includesSubSequence: b
and a includesSubSequence: c.

We also expect the usual rules of equality
to hold.  So
(1) a beginsWith: a
(2) a = '' , a
(3) THEREFORE a beginsWith: ''

(1) a endsWith: a
(2) a = a , ''
(3) THEREFORE a endsWith: ''

(1) a includesSubCollection: a
(2) a = '' , a
(3) THEREFORE a includesSubCollect: ''

Reasoning about strings (as values) gets
enormously more complicated if the operations
do not follow simple sensible rules, and
having '' be the least string under these
orderings and having '' be a prefix and a
suffix of any string is essential if the
rules are going to be simple and coherent.

'' is to strings (and more generally
empty sequences are to sequences) pretty
much what 0 is to integers.  Denying that
'abc' beginsWith: '' is structurally
just like denying that 0 <= 123.

Now as it happens I can see a use for
versions of #beginsWith: and #endsWith:
that diverge from the ones we have, but
this is not where they need to diverge.
a beginsWithGraphemesOf: b
iff a asGraphemes = b asGraphemes , c asGraphemes
for some c, where s asGraphemes returns a
sequence of strings each of which is a maximally
long grapheme cluster, such that concatenating
s asGraphemes recovers s.  That is,
#beginsWithGraphemesOf: and
#endsWithGraphemesOf: would respect the
Unicode Text Segmentation boundaries.
But s beginsWithGraphemesOf: ''
would still need to be true.

The thing is, in practice you often DON'T
KNOW whether a potential affix is empty or
not.  Here are some of my test cases.

 testTestData
   "Ensure that the sample string has no duplicates."
   [(Set withAll: string) size = stringSize] assert.

 testBeginsWith
   "Test that every prefix of the sample IS a prefix of it."
   0 to: stringSize do: [:n |
     [string beginsWith: (string copyFrom: 1 to: n)] assert].

 testEndsWith
   "Test that every suffix of the sample IS a suffix of it."
   0 to: stringSize do: [:n |
     [string endsWith: (string copyFrom: stringSize - n + 1 to:

stringSize)] assert].

 testIndexOfSubCollectionAtBeginning
   "Test that every prefix of 'abcd' is found at the beginning."
   0 to: stringSize do: [:n | |s i t|
     s := string copyFrom: 1 to: n.
     i := string indexOfSubCollection: s startingAt: 1.
     [1 = i] assert.
     t := string copyFrom: i to: i - 1 + n.
     [t = s] assert].

 testIndexOfSubCollectionAtEnd
   "Test that every proper suffix of the sample is found at the end."
   1 to: stringSize do: [:n | |s i t|
     s := string copyFrom: stringSize - n + 1 to: stringSize.
     i := string indexOfSubCollection: s startingAt: 1.
     [stringSize + 1 - n = i] assert.
     t := string copyFrom: i to: i - 1 + n.
     [t = s] assert].

 testLastIndexOfSubCollectionAtBeginning
   "Test that every proper prefix of the sample is found at the

beginning."
1 to: stringSize do: [:n | |s i t|
s := string copyFrom: 1 to: n.
i := string lastIndexOfSubCollection: s startingAt: stringSize.
[1 = i] assert.
t := string copyFrom: i to: i - 1 + n.
[t = s] assert].

 testLastIndexOfSubCollectionAtEnd
   "Test that every suffix of the sample is found at the end."
   0 to: stringSize do: [:n | |s i t|
     s := string copyFrom: stringSize - n + 1 to: stringSize.
     i := string lastIndexOfSubCollection: s startingAt: stringSize.
     [stringSize + 1 - n = i] assert.
     t := string copyFrom: i to: i - 1 + n.
     [t = s] assert].

 testOccurrencesOfEmptyCollection
   "Test that the empty string occurs at the beginning,
    at the end, and in between every pair of adjacent characters."
   [(string occurrencesOfSubCollection: '') = (stringSize + 1)]

assert.

 testOccurrencesOfUniqueParts
   "Test that unique parts occur as many times as they should."
   |repeated|
   repeated := string , string , string.
   1 to: stringSize do: [:start |
     start to: stringSize do: [:finish | |s n|
       s := string copyFrom: start to: finish.
       n := string occurrencesOfSubCollection: s.
       [n = 1] assert.
       n := repeated occurrencesOfSubCollection: s.
       [n = 3] assert]].

On Fri, 22 Apr 2022 at 13:57, David T. Lewis lewis@mail.msen.com

wrote:

Hi Richard,

(CC squeak-dev list, replies to the relevant list please)

On Thu, Apr 21, 2022 at 12:07:32AM +1200, Richard O'Keefe wrote:

I've just tracked down a nasty little problem
porting some code to Pharo.  As a result, I
have added to the comments in my own versions
of these methods.    beginsWith: aSequence
"Answer true if aSequence is a prefix of the receiver.
This makes sense for all sequences.
There is a compatibility issue concerning 'abc' beginsWith: ''
+ VisualWorks, Dolphin, astc, GNU ST (where the method is
called #startsWith:) and VisualAge (where the method
is called #wbBeginsWith:)
agree than EVERY sequence begins with an empty prefix.
- Squeak and Pharo
agree that NO sequence begins with an empty sequence.
# ST/X chooses compatibility with Squeak, heaving a big

unhappy

      sigh, and adds #startsWith: to have something sensible to

use.

    Now ST/X *thinks* it is compatible with VW, though it isn't,

so

    I wonder if this was a bug that VW fixed and Squeak didn't?
    astc goes with the majority here.  This is also compatible

with

    Haskell, ML, and with StartsWith in C# and startsWith in

Java."

   ^self beginsWith: aSequence ignoringCase: false

 endsWith: aSequence
   "Answer true if aSequence is a suffix of the receiver.
    This makes sense for all sequences.
    There is a compatibility issue concerning 'abc' endsWith: ''.
    + VisualWorks, Dolphin, astc, GNU ST, and VisualAge (where
      the method is called #wbEndsWith:)
      agree that EVERY sequence ends with an empty suffix.
    - Squeak and Pharo
      agree that NO sequence ends with an empty suffix.
    # ST/X chooses compatibility with the majority, apparently
      unaware that this makes #beginsWith: and #endsWith:

inconsistent.

    astc goes with the majority here.  This is also compatible

with

    Haskell, ML, C#, and Java."
   ^self endsWith: aSequence ignoringCase: false

Does anyone have any idea

  • why Squeak and Pharo are the odd ones out?
  • why anyone thought making #beginsWith: and #endsWith:, um,

"quirky"

was a good idea (it's pretty standard in books on the theory of
strings to define "x is a prefix of y iff there is a z such that
y = x concatenated with z")

The Squeak behavior was introduced in November 1998 by Dan Ingalls
in conjunction with some VM improvements that he was doing, and was
included in the Squeak2.2 release at that time.

Prior to that update, Squeak did this:
'abc' beginsWith: '' ==> true
'abc' endsWith: '' ==> true

For Squeak2.2 and later it is:
'abc' beginsWith: '' ==> false
'abc' endsWith: '' ==> false

Pharo presumably inherits this from Squeak.

I am attaching Dan's original change set from the early Squeak
update stream so you can see the context.

The change set comment says that "endsWith:, beginsWith:, and match
have been rewritten to take advantage of this considerably faster

method"

(referring to a new primitive that Dan had added to the VM in this
change set).

I was about to try to file a bug report for the first time,
then realised that maybe other people don't think this IS a bug.

I don't think it is a bug. I can't think of a case where it makes
sense to say that a string of characters "begins with" or "ends with"
a string that contains nothing.

In any case, that's the history of the change in Squeak and Pharo as
best as I can reconstruct it.

HTH,
Dave

I like proverbs. One of mine is "zero is also a number." Way back in the 1980s the University in my home town was given a new IBM mainframe by IBM. The Computer Centre wrote a archival program for it that scanned each user's minidisk (= fixed partition) for files that had not been used for a while and logged them to a master file, then ran a job that moved the files to a tape. One day everything went horribly wrong. It turned out that the reason was very simple, but so alien that nobody had noticed when they read the documentation. It was the kind of thing that was so unbelievable you couldn't even *see* it when it was hinted at in front of you. Consider the following process: (1) create-or-replace a file. (2) write N records to it. (3) close the file. Simple. Obvious. And wrong. Because step (1) didn't actually *happen* until step (3), and if N = 0, it didn't happen then either. In that IBM mainframe operating system, you literally could not create an empty file. So when one day there weren't any files that needed moving to tape (because they had all been done the previous day), the process that was expected to write an empty list of files simply left the previous day's list unchanged. Result: thousands of multipage error reports when the archiver tried to move files that weren't there any more. All because someone at IBM had not been able to conceive of a file with no records! I hope the analogy is clear. But it gets better. If you wrote a file in C on that system, it was in effect line-buffered. Your text got written to the file when (and only when) you wrote a newline character. So if you imitated common MS-DOS practice and *began* lines with a newline, the last line of your file would never be written out. The equivalent of s := FileStream write: 'foo text a'. s nextPutAll: 'Hello world!'. s close. would write no records, and thus not create or replace the file. (Because a file with N = 0 newline characters is another of those N = 0 things that cannot exist!) I had to redesign the I/O library for a certain programming language that ran on a lot of Unix systems, on VAX/VMS, and we *thought* was running well through the C library on IBM mainframes. All sorts of finicky little details to deal with N=0. So at the end of a file, if nothing had been written, we wrote an empty record, unless you turned that option off. At the end of each record, if nothing had been written, we wrote a space. Yes, that N = 0 case was another problem. It has been over 30 years since I had anything to do with IBM mainframes, but some experiences leave scars. I note that in Pharo 9.0, there is a subtle difference between SequenceableCollection>>copyReplaceAll:with: and String>>copyReplaceAll:with: and the latter is not consistent with itself. Let x and y be Strings. x copyReplaceAll: '' with: y is always x, except when x is empty, in which case it is y. Let x and y be Arrays. x copyReplaceAll: #() with: y is always x. This is not the subtle difference I mentioned above. *That* difference is that the string version takes linear time (assuming fast search) while the general version takes quadratic time (also assuming fast search). The ANSI specification for #copyReplaceAll:with: is quite clear: Answer a new collection with the elements of the receiver in their original order, except where a subsequence in the receiver matches targetElements. A subsequence in the receiver is said to match the elements of targetElements if: 1. They have the same number of elements. 2. For all indices of the subsequence, the element in the subsequence at a given index is equivalent to the element in targetElements at the same index. By this definition, an empty sequence has matches in any sequence. I'm not advancing this as evidence that I'm right about #beginsWith: and #endsWith:, because the following text in the standard makes it clear that for some reason, empty sequences were literally UNTHINKABLE for the framers of the ANSI Smalltalk standard, at least in the context of search. I'm mentioning it as evidence that for many people, zero is the hardest number to think of. I've finally realised why some people stubbornly cling to the idea that the natural numbers begin with 1, not zero (as if the Peano axioms are not a thing). If you are thinking of the natural numbers in terms of *multiplication* (and primes, and so on), well, the positive integers are a cancellative monoid with identity 1, but not if you include 0. If you are thinking of the natural numbers in terms of *addition*, then the non-negative integers are a cancellative monoid with identity 0. So there are two *different* monoids, for one of which including 0 would be many kinds of trouble, and for the other of which excluding 0 would be absurd and awkward. Strings are cancellative monoids under concatenation with '' as the identity. (Don't ask me about Unicode.) So why is '' so hard to think about? I don't think it is just Strings. In functional programming, it was surprisingly hard to teach students to remember the empty list. (For myself, I got into the habit of writing the base case of a recursive function *first* to avoid the mistake of leaving out the base case.) I suspect that there are other places with surprising behaviour for 'trivial' cases. On Fri, 29 Apr 2022 at 13:35, David T. Lewis <lewis@mail.msen.com> wrote: > FYI it was fixed in Squeak a couple of days ago: > > > http://lists.squeakfoundation.org/pipermail/squeak-dev/2022-April/220089.html > > Richard O'Keefe's point of view was overwhelmingly supported, notably > including this post: > > > http://lists.squeakfoundation.org/pipermail/squeak-dev/2022-April/220038.html > > Dave > > On Sat, Apr 23, 2022 at 12:37:48PM +1200, Richard O'Keefe wrote: > > Dan Ingalls is of course a big NAME in the > > world of Smalltalk, but the stated reason > > for changing the behaviour of #beginsWith: > > and #endsWith: makes no sense. > > > > > > We have many ways to define a partial order > > on strings. > > x <= y iff y beginsWith: x > > x <= y iff y endsWith: x > > x <= y iff y includesSubCollection: x > > x <= y iff y includesSubSequence: x > > These things are supposed to obey laws: > > if a beginsWith: b , c > > then a beginsWith: b > > if a endsWith: b , c > > then a endsWith: c > > if a includesSubCollection: b , c > > then a includesSubCollection: b > > and a includesSubCollection: c > > if a includesSubSequence: b , c > > then a includesSubSequence: b > > and a includesSubSequence: c. > > > > We also expect the usual rules of equality > > to hold. So > > (1) a beginsWith: a > > (2) a = '' , a > > (3) THEREFORE a beginsWith: '' > > > > (1) a endsWith: a > > (2) a = a , '' > > (3) THEREFORE a endsWith: '' > > > > (1) a includesSubCollection: a > > (2) a = '' , a > > (3) THEREFORE a includesSubCollect: '' > > > > Reasoning about strings (as values) gets > > enormously more complicated if the operations > > do not follow simple sensible rules, and > > having '' be the least string under these > > orderings and having '' be a prefix and a > > suffix of any string is essential if the > > rules are going to be simple and coherent. > > > > '' is to strings (and more generally > > empty sequences are to sequences) pretty > > much what 0 is to integers. Denying that > > 'abc' beginsWith: '' is *structurally* > > just like denying that 0 <= 123. > > > > Now as it happens I *can* see a use for > > versions of #beginsWith: and #endsWith: > > that diverge from the ones we have, but > > *this* is not where they need to diverge. > > a beginsWithGraphemesOf: b > > iff a asGraphemes = b asGraphemes , c asGraphemes > > for some c, where s asGraphemes returns a > > sequence of strings each of which is a maximally > > long grapheme cluster, such that concatenating > > s asGraphemes recovers s. That is, > > #beginsWithGraphemesOf: and > > #endsWithGraphemesOf: would respect the > > Unicode Text Segmentation boundaries. > > But s beginsWithGraphemesOf: '' > > would still need to be true. > > > > The thing is, in practice you often DON'T > > KNOW whether a potential affix is empty or > > not. Here are some of my test cases. > > > > testTestData > > "Ensure that the sample string has no duplicates." > > [(Set withAll: string) size = stringSize] assert. > > > > testBeginsWith > > "Test that every prefix of the sample IS a prefix of it." > > 0 to: stringSize do: [:n | > > [string beginsWith: (string copyFrom: 1 to: n)] assert]. > > > > testEndsWith > > "Test that every suffix of the sample IS a suffix of it." > > 0 to: stringSize do: [:n | > > [string endsWith: (string copyFrom: stringSize - n + 1 to: > > stringSize)] assert]. > > > > testIndexOfSubCollectionAtBeginning > > "Test that every prefix of 'abcd' is found at the beginning." > > 0 to: stringSize do: [:n | |s i t| > > s := string copyFrom: 1 to: n. > > i := string indexOfSubCollection: s startingAt: 1. > > [1 = i] assert. > > t := string copyFrom: i to: i - 1 + n. > > [t = s] assert]. > > > > testIndexOfSubCollectionAtEnd > > "Test that every proper suffix of the sample is found at the end." > > 1 to: stringSize do: [:n | |s i t| > > s := string copyFrom: stringSize - n + 1 to: stringSize. > > i := string indexOfSubCollection: s startingAt: 1. > > [stringSize + 1 - n = i] assert. > > t := string copyFrom: i to: i - 1 + n. > > [t = s] assert]. > > > > testLastIndexOfSubCollectionAtBeginning > > "Test that every proper prefix of the sample is found at the > > beginning." > > 1 to: stringSize do: [:n | |s i t| > > s := string copyFrom: 1 to: n. > > i := string lastIndexOfSubCollection: s startingAt: stringSize. > > [1 = i] assert. > > t := string copyFrom: i to: i - 1 + n. > > [t = s] assert]. > > > > testLastIndexOfSubCollectionAtEnd > > "Test that every suffix of the sample is found at the end." > > 0 to: stringSize do: [:n | |s i t| > > s := string copyFrom: stringSize - n + 1 to: stringSize. > > i := string lastIndexOfSubCollection: s startingAt: stringSize. > > [stringSize + 1 - n = i] assert. > > t := string copyFrom: i to: i - 1 + n. > > [t = s] assert]. > > > > testOccurrencesOfEmptyCollection > > "Test that the empty string occurs at the beginning, > > at the end, and in between every pair of adjacent characters." > > [(string occurrencesOfSubCollection: '') = (stringSize + 1)] > assert. > > > > testOccurrencesOfUniqueParts > > "Test that unique parts occur as many times as they should." > > |repeated| > > repeated := string , string , string. > > 1 to: stringSize do: [:start | > > start to: stringSize do: [:finish | |s n| > > s := string copyFrom: start to: finish. > > n := string occurrencesOfSubCollection: s. > > [n = 1] assert. > > n := repeated occurrencesOfSubCollection: s. > > [n = 3] assert]]. > > > > On Fri, 22 Apr 2022 at 13:57, David T. Lewis <lewis@mail.msen.com> > wrote: > > > > > Hi Richard, > > > > > > (CC squeak-dev list, replies to the relevant list please) > > > > > > On Thu, Apr 21, 2022 at 12:07:32AM +1200, Richard O'Keefe wrote: > > > > I've just tracked down a nasty little problem > > > > porting some code to Pharo. As a result, I > > > > have added to the comments in my own versions > > > > of these methods. beginsWith: aSequence > > > > "Answer true if aSequence is a prefix of the receiver. > > > > This makes sense for all sequences. > > > > There is a compatibility issue concerning 'abc' beginsWith: '' > > > > + VisualWorks, Dolphin, astc, GNU ST (where the method is > > > > called #startsWith:) and VisualAge (where the method > > > > is called #wbBeginsWith:) > > > > agree than EVERY sequence begins with an empty prefix. > > > > - Squeak and Pharo > > > > agree that NO sequence begins with an empty sequence. > > > > # ST/X chooses compatibility with Squeak, heaving a big > unhappy > > > > sigh, and adds #startsWith: to have something sensible to > use. > > > > Now ST/X *thinks* it is compatible with VW, though it isn't, > so > > > > I wonder if this was a bug that VW fixed and Squeak didn't? > > > > astc goes with the majority here. This is also compatible > with > > > > Haskell, ML, and with StartsWith in C# and startsWith in > Java." > > > > ^self beginsWith: aSequence ignoringCase: false > > > > > > > > endsWith: aSequence > > > > "Answer true if aSequence is a suffix of the receiver. > > > > This makes sense for all sequences. > > > > There is a compatibility issue concerning 'abc' endsWith: ''. > > > > + VisualWorks, Dolphin, astc, GNU ST, and VisualAge (where > > > > the method is called #wbEndsWith:) > > > > agree that EVERY sequence ends with an empty suffix. > > > > - Squeak and Pharo > > > > agree that NO sequence ends with an empty suffix. > > > > # ST/X chooses compatibility with the majority, apparently > > > > unaware that this makes #beginsWith: and #endsWith: > > > inconsistent. > > > > astc goes with the majority here. This is also compatible > with > > > > Haskell, ML, C#, and Java." > > > > ^self endsWith: aSequence ignoringCase: false > > > > > > > > Does anyone have any idea > > > > - why Squeak and Pharo are the odd ones out? > > > > - why anyone thought making #beginsWith: and #endsWith:, um, > "quirky" > > > > was a good idea (it's pretty standard in books on the theory of > > > > strings to define "x is a prefix of y iff there is a z such that > > > > y = x concatenated with z") > > > > > > The Squeak behavior was introduced in November 1998 by Dan Ingalls > > > in conjunction with some VM improvements that he was doing, and was > > > included in the Squeak2.2 release at that time. > > > > > > Prior to that update, Squeak did this: > > > 'abc' beginsWith: '' ==> true > > > 'abc' endsWith: '' ==> true > > > > > > For Squeak2.2 and later it is: > > > 'abc' beginsWith: '' ==> false > > > 'abc' endsWith: '' ==> false > > > > > > Pharo presumably inherits this from Squeak. > > > > > > I am attaching Dan's original change set from the early Squeak > > > update stream so you can see the context. > > > > > > The change set comment says that "endsWith:, beginsWith:, and match > > > have been rewritten to take advantage of this considerably faster > method" > > > (referring to a new primitive that Dan had added to the VM in this > > > change set). > > > > > > > > > > > I was about to try to file a bug report for the first time, > > > > then realised that maybe other people don't think this IS a bug. > > > > > > I don't think it is a bug. I can't think of a case where it makes > > > sense to say that a string of characters "begins with" or "ends with" > > > a string that contains nothing. > > > > > > In any case, that's the history of the change in Squeak and Pharo as > > > best as I can reconstruct it. > > > > > > HTH, > > > Dave > > > > > > >
RS
Richard Sargent
Fri, Apr 29, 2022 4:41 PM

Making the code correct is "untwisting the rope".
Keeping the code wrong and adding a bizarre workaround is "twisting the
rope tighter".

The former is the right direction.

On Fri, Apr 29, 2022 at 12:15 AM Steffen Märcker merkste@web.de wrote:

Hi Kasper,

I've thought about that approach too. But then asked myself whether it is
more likely that there is code that relies on this bug than code where this
went unnoticed and is therefore broken. What do the other think about that
matter and the fix in Squeak?

Best, Steffen

Kasper Osterbye schrieb am Donnerstag, 28. April 2022 19:03:07 (+02:00):

Kasper Osterbye schrieb am Dienstag, 26. April 2022 14:50:51 (+02:00):

I have now raised it as an issue on the issue tracker

Issue #11165 https://github.com/pharo-project/pharo/issues/11165 in
https://github.com/pharo-project/pharo/issues/11165

If I may suggest a solution it will be to:

- add two new methods - *prefixedBy: * and *suffixedBy:* to handle the
empty prefix/suffix correctly
- add comments to *beginsWith: * and *endsWith: * referring to the two
new methods

The problem is that there is client code which depend on the (wrong)
implementation of the beginsWith and endsWith methods.

Making the code correct is "untwisting the rope". Keeping the code wrong and adding a bizarre workaround is "twisting the rope tighter". The former is the right direction. On Fri, Apr 29, 2022 at 12:15 AM Steffen Märcker <merkste@web.de> wrote: > Hi Kasper, > > I've thought about that approach too. But then asked myself whether it is > more likely that there is code that relies on this bug than code where this > went unnoticed and is therefore broken. What do the other think about that > matter and the fix in Squeak? > > Best, Steffen > > > > > Kasper Osterbye schrieb am Donnerstag, 28. April 2022 19:03:07 (+02:00): > > Kasper Osterbye schrieb am Dienstag, 26. April 2022 14:50:51 (+02:00): > > I have now raised it as an issue on the issue tracker > > Issue #11165 <https://github.com/pharo-project/pharo/issues/11165> in > https://github.com/pharo-project/pharo/issues/11165 > > > If I may suggest a solution it will be to: > > - add two new methods - *prefixedBy: * and *suffixedBy:* to handle the > empty prefix/suffix correctly > - add comments to *beginsWith: * and *endsWith: * referring to the two > new methods > > > The problem is that there is client code which depend on the (wrong) > implementation of the beginsWith and endsWith methods. > > >
KO
Kasper Osterbye
Fri, Apr 29, 2022 4:44 PM

The two methods are changed to do the right thing.
We are still waiting for the CI server to check if that broke anything.

During the change, 4 places was identified which was simplified by doing it right (aka empty prefix always true).

Best,

Kasper

The two methods are changed to do the right thing. We are still waiting for the CI server to check if that broke anything. During the change, 4 places was identified which was simplified by doing it right (aka empty prefix always true). Best, Kasper