GP
Guillermo Polito
Thu, Nov 9, 2023 9:13 AM
Hi all,
We started (with many interruptions over the last months) working a bit with Stephane on understanding what is the (positive and negative) impact of stack-overflow support in Pharo.
The key idea is that if a process consumes too much stack (potentially because of an infinite recursion) then the process should stop with an exception.
Why we want better stack consumption control
This idea comes up to solve issues that are pretty common and hit especially newbies.
For example, imagine you accidentally write an accessor such as
A >> foo
^ self foo
Students do this all the time, and I’ve also seen it in experienced people who go too fast :).
More importantly, such recursions could happen also with not-so-obvious indirect recursions (a sends b, b sends c, c sends a), and these could hit anybody.
This is aggravated because the current execution model allows us to have infinite stacks —meaning: limited by available memory only.
This is indeed a nice feature for many use cases but it has its own drawbacks when one of these kind of recursions are hit:
- code just loops forever taking space in the stack
- when there is no more stack space, context objects are created and moved to the heap
- but those contexts are strongly held, so they are never GCed and take up extra space
- even worse! they are there adding more work to the GC every time and making the GC run more often looking for space that is not there
Why Ctrl-dot does not always work
Of course, super users know there is this “Ctrl dot” hidden feature that should help you recover from this.
First, let's take out of the equation that this is only known by super users.
Now, in this situation, when Ctrl-dot is hit it will trigger a handler that suspends the problematic process and opens a debugger on it.
But it could happen that,
- the stack is so big that the debugger is very sluggish (best-case scenario)
- the VM is just flooded doing GCs so maybe the Ctrl dot event does not even arrive at Pharo or the trigger
- if the recursion is hit when printing an object (which is more common than you could imagine), opening the debugger could trigger a new recursion and never give back the control to the user
What are we working on
The main idea here is: Can we have a simple and efficient way to prevent such kinds of situations?
After many discussions around detecting recursion, we kinda arrived at the simple solution of just detecting a stack overflow.
The solution is easy to understand (because it’s like other languages work) and easy to implement because there is already support for that.
But this leaves open two questions:
- what happens when people want to use the “infinite stack” feature?
- when should a process stack overflow? What is a sensitive default value?
Our draft implementation here https://github.com/pharo-project/pharo-vm/pull/710 https://github.com/pharo-project/pharo-vm/pull/710 does the following to cope with this:
- we can now parametrize the size of the stack (of each stack page to be more accurate) when the VM starts up
- the stack overflow check can be disabled per process
We also are running experiments to see what could be a sensitive stack size for our normal usages. Here, for example, we ran almost all test cases in Pharo separately (one suite per line below), and we observed how many tests broke (x-axis) with different stack sizes (y-axis).
Here we see that most test suites require at least 20-24k to run properly, some go up to 36k of stack before converging (i.e., the number of broken tests does not change).
You’ll notice in the graph that There are some scenarios that break all the time. This is because exception handling itself is recursive and may produce more stack overflows depending on the size of the stack between the exception and the exception handler.
So some more work is still required, mostly changing Pharo libraries to properly support this. For example:
- should tests run in a fresh process with a fresh stack?
- should the exception mechanism use less recursion?
- resumable exceptions add stack pressure because they do not “unstack” until the exception is finally handled, meaning that the stack used by exception handling just adds up to the stack of the original code, can we do better here?
Probably there are more interesting questions here, that’s the “why" behind this email.
I’m interested in opinions and scenarios you may come up with that should be taken into account.
Cheers,
Guille
Hi all,
We started (with many interruptions over the last months) working a bit with Stephane on understanding what is the (positive and negative) impact of stack-overflow support in Pharo.
The key idea is that if a process consumes too much stack (potentially because of an infinite recursion) then the process should stop with an exception.
## Why we want better stack consumption control
This idea comes up to solve issues that are pretty common and hit especially newbies.
For example, imagine you accidentally write an accessor such as
```
A >> foo
^ self foo
```
Students do this all the time, and I’ve also seen it in experienced people who go too fast :).
More importantly, such recursions could happen also with not-so-obvious indirect recursions (a sends b, b sends c, c sends a), and these could hit anybody.
This is aggravated because the current execution model allows us to have infinite stacks —meaning: limited by available memory only.
This is indeed a nice feature for many use cases but it has its own drawbacks when one of these kind of recursions are hit:
- code just loops forever taking space in the stack
- when there is no more stack space, context objects are created and moved to the heap
- but those contexts are strongly held, so they are never GCed and take up extra space
- even worse! they are there adding more work to the GC every time and making the GC run more often looking for space that is not there
## Why Ctrl-dot does not always work
Of course, super users know there is this “Ctrl dot” hidden feature that should help you recover from this.
First, let's take out of the equation that this is only known by super users.
Now, in this situation, when Ctrl-dot is hit it will trigger a handler that suspends the problematic process and opens a debugger on it.
But it could happen that,
- the stack is so big that the debugger is very sluggish (best-case scenario)
- the VM is just flooded doing GCs so maybe the Ctrl dot event does not even arrive at Pharo or the trigger
- if the recursion is hit when printing an object (which is more common than you could imagine), opening the debugger could trigger a new recursion and never give back the control to the user
## What are we working on
The main idea here is: Can we have a simple and efficient way to prevent such kinds of situations?
After many discussions around detecting recursion, we kinda arrived at the simple solution of just detecting a stack overflow.
The solution is easy to understand (because it’s like other languages work) and easy to implement because there is already support for that.
But this leaves open two questions:
- what happens when people want to use the “infinite stack” feature?
- when should a process stack overflow? What is a sensitive default value?
Our draft implementation here https://github.com/pharo-project/pharo-vm/pull/710 <https://github.com/pharo-project/pharo-vm/pull/710> does the following to cope with this:
- we can now parametrize the size of the stack (of each stack page to be more accurate) when the VM starts up
- the stack overflow check can be disabled per process
We also are running experiments to see what could be a sensitive stack size for our normal usages. Here, for example, we ran almost all test cases in Pharo separately (one suite per line below), and we observed how many tests broke (x-axis) with different stack sizes (y-axis).
Here we see that most test suites require at least 20-24k to run properly, some go up to 36k of stack before converging (i.e., the number of broken tests does not change).
You’ll notice in the graph that There are some scenarios that break all the time. This is because exception handling itself is recursive and may produce more stack overflows depending on the size of the stack between the exception and the exception handler.
So some more work is still required, mostly changing Pharo libraries to properly support this. For example:
- should tests run in a fresh process with a fresh stack?
- should the exception mechanism use less recursion?
- resumable exceptions add stack pressure because they do not “unstack” until the exception is finally handled, meaning that the stack used by exception handling just adds up to the stack of the original code, can we do better here?
Probably there are more interesting questions here, that’s the “why" behind this email.
I’m interested in opinions and scenarios you may come up with that should be taken into account.
Cheers,
Guille
DM
David Mason
Thu, Nov 9, 2023 12:35 PM
Tail call elimination would reduce stack usage significantly. See
Tail Call Elimination in OpenSmalltalk, Ralston & Mason, IWST2019
This really should be part of any stack solution. Particularly people
coming from a functional language background are likely to use it. Also
double dispatch would benefit significantly.
../Dave
On Thu, 9 Nov 2023 at 04:38, Guillermo Polito guillermopolito@gmail.com
wrote:
Hi all,
We started (with many interruptions over the last months) working a bit
with Stephane on understanding what is the (positive and negative) impact
of stack-overflow support in Pharo.
The key idea is that if a process consumes too much stack (potentially
because of an infinite recursion) then the process should stop with an
exception.
Why we want better stack consumption control
This idea comes up to solve issues that are pretty common and hit
especially newbies.
For example, imagine you accidentally write an accessor such as
A >> foo
^ self foo
Students do this all the time, and I’ve also seen it in experienced people
who go too fast :).
More importantly, such recursions could happen also with not-so-obvious
indirect recursions (a sends b, b sends c, c sends a), and these could hit
anybody.
This is aggravated because the current execution model allows us to have
infinite stacks —meaning: limited by available memory only.
This is indeed a nice feature for many use cases but it has its own
drawbacks when one of these kind of recursions are hit:
- code just loops forever taking space in the stack
- when there is no more stack space, context objects are created and
moved to the heap
- but those contexts are strongly held, so they are never GCed and take
up extra space
- even worse! they are there adding more work to the GC every time and
making the GC run more often looking for space that is not there
Why Ctrl-dot does not always work
Of course, super users know there is this “Ctrl dot” hidden feature that
should help you recover from this.
First, let's take out of the equation that this is only known by super
users.
Now, in this situation, when Ctrl-dot is hit it will trigger a handler
that suspends the problematic process and opens a debugger on it.
But it could happen that,
- the stack is so big that the debugger is very sluggish (best-case
scenario)
- the VM is just flooded doing GCs so maybe the Ctrl dot event does not
even arrive at Pharo or the trigger
- if the recursion is hit when printing an object (which is more common
than you could imagine), opening the debugger could trigger a new recursion
and never give back the control to the user
What are we working on
The main idea here is: Can we have a simple and efficient way to prevent
such kinds of situations?
After many discussions around detecting recursion, we kinda arrived at the
simple solution of just detecting a stack overflow.
The solution is easy to understand (because it’s like other languages
work) and easy to implement because there is already support for that.
But this leaves open two questions:
- what happens when people want to use the “infinite stack” feature?
- when should a process stack overflow? What is a sensitive default value?
Our draft implementation here
https://github.com/pharo-project/pharo-vm/pull/710 does the following to
cope with this:
- we can now parametrize the size of the stack (of each stack page to be
more accurate) when the VM starts up
- the stack overflow check can be disabled per process
We also are running experiments to see what could be a sensitive stack
size for our normal usages. Here, for example, we ran almost all test cases
in Pharo separately (one suite per line below), and we observed how many
tests broke (x-axis) with different stack sizes (y-axis).
Here we see that most test suites require at least 20-24k to run properly,
some go up to 36k of stack before converging (i.e., the number of broken
tests does not change).
You’ll notice in the graph that There are some scenarios that break all
the time. This is because exception handling itself is recursive and may
produce more stack overflows depending on the size of the stack between the
exception and the exception handler.
So some more work is still required, mostly changing Pharo libraries to
properly support this. For example:
- should tests run in a fresh process with a fresh stack?
- should the exception mechanism use less recursion?
- resumable exceptions add stack pressure because they do not “unstack”
until the exception is finally handled, meaning that the stack used by
exception handling just adds up to the stack of the original code, can we
do better here?
Probably there are more interesting questions here, that’s the “why"
behind this email.
I’m interested in opinions and scenarios you may come up with that should
be taken into account.
Cheers,
Guille
Tail call elimination would reduce stack usage significantly. See
Tail Call Elimination in OpenSmalltalk, Ralston & Mason, IWST2019
This really should be part of any stack solution. Particularly people
coming from a functional language background are likely to use it. Also
double dispatch would benefit significantly.
../Dave
On Thu, 9 Nov 2023 at 04:38, Guillermo Polito <guillermopolito@gmail.com>
wrote:
> Hi all,
>
> We started (with many interruptions over the last months) working a bit
> with Stephane on understanding what is the (positive and negative) impact
> of stack-overflow support in Pharo.
> The key idea is that if a process consumes too much stack (potentially
> because of an infinite recursion) then the process should stop with an
> exception.
>
> ## Why we want better stack consumption control
>
> This idea comes up to solve issues that are pretty common and hit
> especially newbies.
> For example, imagine you accidentally write an accessor such as
>
> ```
> A >> foo
> ^ self foo
> ```
>
> Students do this all the time, and I’ve also seen it in experienced people
> who go too fast :).
> More importantly, such recursions could happen also with not-so-obvious
> indirect recursions (a sends b, b sends c, c sends a), and these could hit
> anybody.
>
> This is aggravated because the current execution model allows us to have
> infinite stacks —meaning: limited by available memory only.
> This is indeed a nice feature for many use cases but it has its own
> drawbacks when one of these kind of recursions are hit:
> - code just loops forever taking space in the stack
> - when there is no more stack space, context objects are created and
> moved to the heap
> - but those contexts are strongly held, so they are never GCed and take
> up extra space
> - even worse! they are there adding more work to the GC every time and
> making the GC run more often looking for space that is not there
>
> ## Why Ctrl-dot does not always work
>
> Of course, super users know there is this “Ctrl dot” hidden feature that
> should help you recover from this.
> First, let's take out of the equation that this is only known by super
> users.
> Now, in this situation, when Ctrl-dot is hit it will trigger a handler
> that suspends the problematic process and opens a debugger on it.
> But it could happen that,
> - the stack is so big that the debugger is very sluggish (best-case
> scenario)
> - the VM is just flooded doing GCs so maybe the Ctrl dot event does not
> even arrive at Pharo or the trigger
> - if the recursion is hit when printing an object (which is more common
> than you could imagine), opening the debugger could trigger a new recursion
> and never give back the control to the user
>
> ## What are we working on
>
> The main idea here is: Can we have a simple and efficient way to prevent
> such kinds of situations?
>
> After many discussions around detecting recursion, we kinda arrived at the
> simple solution of just detecting a stack overflow.
> The solution is easy to understand (because it’s like other languages
> work) and easy to implement because there is already support for that.
> But this leaves open two questions:
> - what happens when people want to use the “infinite stack” feature?
> - when should a process stack overflow? What is a sensitive default value?
>
> Our draft implementation here
> https://github.com/pharo-project/pharo-vm/pull/710 does the following to
> cope with this:
> - we can now parametrize the size of the stack (of each stack page to be
> more accurate) when the VM starts up
> - the stack overflow check can be disabled per process
>
> We also are running experiments to see what could be a sensitive stack
> size for our normal usages. Here, for example, we ran almost all test cases
> in Pharo separately (one suite per line below), and we observed how many
> tests broke (x-axis) with different stack sizes (y-axis).
> Here we see that most test suites require at least 20-24k to run properly,
> some go up to 36k of stack before converging (i.e., the number of broken
> tests does not change).
>
>
> You’ll notice in the graph that There are some scenarios that break all
> the time. This is because exception handling itself is recursive and may
> produce more stack overflows depending on the size of the stack between the
> exception and the exception handler.
> So some more work is still required, mostly changing Pharo libraries to
> properly support this. For example:
> - should tests run in a fresh process with a fresh stack?
> - should the exception mechanism use less recursion?
> - resumable exceptions add stack pressure because they do not “unstack”
> until the exception is finally handled, meaning that the stack used by
> exception handling just adds up to the stack of the original code, can we
> do better here?
>
> Probably there are more interesting questions here, that’s the “why"
> behind this email.
> I’m interested in opinions and scenarios you may come up with that should
> be taken into account.
>
> Cheers,
> Guille
>
SV
Sven Van Caekenberghe
Thu, Nov 9, 2023 1:27 PM
This is super important.
I always thought that this was an implementation problem in the sense that the overflow check was too costly/slow.
Anyway, in most Common Lisp implementations the stack is also limited. When you hit the limit, a resumable exception is raised, allowing you to make the stack larger and continue. And of course you can set a large initial stack size per process.
On 9 Nov 2023, at 10:13, Guillermo Polito guillermopolito@gmail.com wrote:
Hi all,
We started (with many interruptions over the last months) working a bit with Stephane on understanding what is the (positive and negative) impact of stack-overflow support in Pharo.
The key idea is that if a process consumes too much stack (potentially because of an infinite recursion) then the process should stop with an exception.
Why we want better stack consumption control
This idea comes up to solve issues that are pretty common and hit especially newbies.
For example, imagine you accidentally write an accessor such as
A >> foo
^ self foo
Students do this all the time, and I’ve also seen it in experienced people who go too fast :).
More importantly, such recursions could happen also with not-so-obvious indirect recursions (a sends b, b sends c, c sends a), and these could hit anybody.
This is aggravated because the current execution model allows us to have infinite stacks —meaning: limited by available memory only.
This is indeed a nice feature for many use cases but it has its own drawbacks when one of these kind of recursions are hit:
- code just loops forever taking space in the stack
- when there is no more stack space, context objects are created and moved to the heap
- but those contexts are strongly held, so they are never GCed and take up extra space
- even worse! they are there adding more work to the GC every time and making the GC run more often looking for space that is not there
Why Ctrl-dot does not always work
Of course, super users know there is this “Ctrl dot” hidden feature that should help you recover from this.
First, let's take out of the equation that this is only known by super users.
Now, in this situation, when Ctrl-dot is hit it will trigger a handler that suspends the problematic process and opens a debugger on it.
But it could happen that,
- the stack is so big that the debugger is very sluggish (best-case scenario)
- the VM is just flooded doing GCs so maybe the Ctrl dot event does not even arrive at Pharo or the trigger
- if the recursion is hit when printing an object (which is more common than you could imagine), opening the debugger could trigger a new recursion and never give back the control to the user
What are we working on
The main idea here is: Can we have a simple and efficient way to prevent such kinds of situations?
After many discussions around detecting recursion, we kinda arrived at the simple solution of just detecting a stack overflow.
The solution is easy to understand (because it’s like other languages work) and easy to implement because there is already support for that.
But this leaves open two questions:
- what happens when people want to use the “infinite stack” feature?
- when should a process stack overflow? What is a sensitive default value?
Our draft implementation here https://github.com/pharo-project/pharo-vm/pull/710 does the following to cope with this:
- we can now parametrize the size of the stack (of each stack page to be more accurate) when the VM starts up
- the stack overflow check can be disabled per process
We also are running experiments to see what could be a sensitive stack size for our normal usages. Here, for example, we ran almost all test cases in Pharo separately (one suite per line below), and we observed how many tests broke (x-axis) with different stack sizes (y-axis).
Here we see that most test suites require at least 20-24k to run properly, some go up to 36k of stack before converging (i.e., the number of broken tests does not change).
<ImagenPegada-10.tiff>
You’ll notice in the graph that There are some scenarios that break all the time. This is because exception handling itself is recursive and may produce more stack overflows depending on the size of the stack between the exception and the exception handler.
So some more work is still required, mostly changing Pharo libraries to properly support this. For example:
- should tests run in a fresh process with a fresh stack?
- should the exception mechanism use less recursion?
- resumable exceptions add stack pressure because they do not “unstack” until the exception is finally handled, meaning that the stack used by exception handling just adds up to the stack of the original code, can we do better here?
Probably there are more interesting questions here, that’s the “why" behind this email.
I’m interested in opinions and scenarios you may come up with that should be taken into account.
Cheers,
Guille
This is super important.
I always thought that this was an implementation problem in the sense that the overflow check was too costly/slow.
Anyway, in most Common Lisp implementations the stack is also limited. When you hit the limit, a resumable exception is raised, allowing you to make the stack larger and continue. And of course you can set a large initial stack size per process.
> On 9 Nov 2023, at 10:13, Guillermo Polito <guillermopolito@gmail.com> wrote:
>
> Hi all,
>
> We started (with many interruptions over the last months) working a bit with Stephane on understanding what is the (positive and negative) impact of stack-overflow support in Pharo.
> The key idea is that if a process consumes too much stack (potentially because of an infinite recursion) then the process should stop with an exception.
>
> ## Why we want better stack consumption control
>
> This idea comes up to solve issues that are pretty common and hit especially newbies.
> For example, imagine you accidentally write an accessor such as
>
> ```
> A >> foo
> ^ self foo
> ```
>
> Students do this all the time, and I’ve also seen it in experienced people who go too fast :).
> More importantly, such recursions could happen also with not-so-obvious indirect recursions (a sends b, b sends c, c sends a), and these could hit anybody.
>
> This is aggravated because the current execution model allows us to have infinite stacks —meaning: limited by available memory only.
> This is indeed a nice feature for many use cases but it has its own drawbacks when one of these kind of recursions are hit:
> - code just loops forever taking space in the stack
> - when there is no more stack space, context objects are created and moved to the heap
> - but those contexts are strongly held, so they are never GCed and take up extra space
> - even worse! they are there adding more work to the GC every time and making the GC run more often looking for space that is not there
>
> ## Why Ctrl-dot does not always work
>
> Of course, super users know there is this “Ctrl dot” hidden feature that should help you recover from this.
> First, let's take out of the equation that this is only known by super users.
> Now, in this situation, when Ctrl-dot is hit it will trigger a handler that suspends the problematic process and opens a debugger on it.
> But it could happen that,
> - the stack is so big that the debugger is very sluggish (best-case scenario)
> - the VM is just flooded doing GCs so maybe the Ctrl dot event does not even arrive at Pharo or the trigger
> - if the recursion is hit when printing an object (which is more common than you could imagine), opening the debugger could trigger a new recursion and never give back the control to the user
>
> ## What are we working on
>
> The main idea here is: Can we have a simple and efficient way to prevent such kinds of situations?
>
> After many discussions around detecting recursion, we kinda arrived at the simple solution of just detecting a stack overflow.
> The solution is easy to understand (because it’s like other languages work) and easy to implement because there is already support for that.
> But this leaves open two questions:
> - what happens when people want to use the “infinite stack” feature?
> - when should a process stack overflow? What is a sensitive default value?
>
> Our draft implementation here https://github.com/pharo-project/pharo-vm/pull/710 does the following to cope with this:
> - we can now parametrize the size of the stack (of each stack page to be more accurate) when the VM starts up
> - the stack overflow check can be disabled per process
>
> We also are running experiments to see what could be a sensitive stack size for our normal usages. Here, for example, we ran almost all test cases in Pharo separately (one suite per line below), and we observed how many tests broke (x-axis) with different stack sizes (y-axis).
> Here we see that most test suites require at least 20-24k to run properly, some go up to 36k of stack before converging (i.e., the number of broken tests does not change).
>
> <ImagenPegada-10.tiff>
> You’ll notice in the graph that There are some scenarios that break all the time. This is because exception handling itself is recursive and may produce more stack overflows depending on the size of the stack between the exception and the exception handler.
> So some more work is still required, mostly changing Pharo libraries to properly support this. For example:
> - should tests run in a fresh process with a fresh stack?
> - should the exception mechanism use less recursion?
> - resumable exceptions add stack pressure because they do not “unstack” until the exception is finally handled, meaning that the stack used by exception handling just adds up to the stack of the original code, can we do better here?
>
> Probably there are more interesting questions here, that’s the “why" behind this email.
> I’m interested in opinions and scenarios you may come up with that should be taken into account.
>
> Cheers,
> Guille
GP
Guille Polito
Thu, Nov 9, 2023 2:18 PM
El 9 nov. 2023, a las 13:35, David Mason dmason@torontomu.ca escribió:
Tail call elimination would reduce stack usage significantly. See
Tail Call Elimination in OpenSmalltalk, Ralston & Mason, IWST2019
This really should be part of any stack solution. Particularly people coming from a functional language background are likely to use it. Also double dispatch would benefit significantly.
Yes, I agree that tail-call elimination will largely improve performance.
But it will also harm debugging unless there is a way to reconstruct the stack when there is a problem, which I suspect could be very expensive.
Anyways, this solves a performance problem, not the recursion problem that motivates my mail, right?
../Dave
On Thu, 9 Nov 2023 at 04:38, Guillermo Polito <guillermopolito@gmail.com mailto:guillermopolito@gmail.com> wrote:
Hi all,
We started (with many interruptions over the last months) working a bit with Stephane on understanding what is the (positive and negative) impact of stack-overflow support in Pharo.
The key idea is that if a process consumes too much stack (potentially because of an infinite recursion) then the process should stop with an exception.
Why we want better stack consumption control
This idea comes up to solve issues that are pretty common and hit especially newbies.
For example, imagine you accidentally write an accessor such as
A >> foo
^ self foo
Students do this all the time, and I’ve also seen it in experienced people who go too fast :).
More importantly, such recursions could happen also with not-so-obvious indirect recursions (a sends b, b sends c, c sends a), and these could hit anybody.
This is aggravated because the current execution model allows us to have infinite stacks —meaning: limited by available memory only.
This is indeed a nice feature for many use cases but it has its own drawbacks when one of these kind of recursions are hit:
- code just loops forever taking space in the stack
- when there is no more stack space, context objects are created and moved to the heap
- but those contexts are strongly held, so they are never GCed and take up extra space
- even worse! they are there adding more work to the GC every time and making the GC run more often looking for space that is not there
Why Ctrl-dot does not always work
Of course, super users know there is this “Ctrl dot” hidden feature that should help you recover from this.
First, let's take out of the equation that this is only known by super users.
Now, in this situation, when Ctrl-dot is hit it will trigger a handler that suspends the problematic process and opens a debugger on it.
But it could happen that,
- the stack is so big that the debugger is very sluggish (best-case scenario)
- the VM is just flooded doing GCs so maybe the Ctrl dot event does not even arrive at Pharo or the trigger
- if the recursion is hit when printing an object (which is more common than you could imagine), opening the debugger could trigger a new recursion and never give back the control to the user
What are we working on
The main idea here is: Can we have a simple and efficient way to prevent such kinds of situations?
After many discussions around detecting recursion, we kinda arrived at the simple solution of just detecting a stack overflow.
The solution is easy to understand (because it’s like other languages work) and easy to implement because there is already support for that.
But this leaves open two questions:
- what happens when people want to use the “infinite stack” feature?
- when should a process stack overflow? What is a sensitive default value?
Our draft implementation here https://github.com/pharo-project/pharo-vm/pull/710 https://github.com/pharo-project/pharo-vm/pull/710 does the following to cope with this:
- we can now parametrize the size of the stack (of each stack page to be more accurate) when the VM starts up
- the stack overflow check can be disabled per process
We also are running experiments to see what could be a sensitive stack size for our normal usages. Here, for example, we ran almost all test cases in Pharo separately (one suite per line below), and we observed how many tests broke (x-axis) with different stack sizes (y-axis).
Here we see that most test suites require at least 20-24k to run properly, some go up to 36k of stack before converging (i.e., the number of broken tests does not change).
<ImagenPegada-10.tiff>
You’ll notice in the graph that There are some scenarios that break all the time. This is because exception handling itself is recursive and may produce more stack overflows depending on the size of the stack between the exception and the exception handler.
So some more work is still required, mostly changing Pharo libraries to properly support this. For example:
- should tests run in a fresh process with a fresh stack?
- should the exception mechanism use less recursion?
- resumable exceptions add stack pressure because they do not “unstack” until the exception is finally handled, meaning that the stack used by exception handling just adds up to the stack of the original code, can we do better here?
Probably there are more interesting questions here, that’s the “why" behind this email.
I’m interested in opinions and scenarios you may come up with that should be taken into account.
Cheers,
Guille
Pharo-vm mailing list -- pharo-vm@lists.pharo.org
To unsubscribe send an email to pharo-vm-leave@lists.pharo.org
> El 9 nov. 2023, a las 13:35, David Mason <dmason@torontomu.ca> escribió:
>
> Tail call elimination would reduce stack usage significantly. See
> Tail Call Elimination in OpenSmalltalk, Ralston & Mason, IWST2019
>
> This really should be part of any stack solution. Particularly people coming from a functional language background are likely to use it. Also double dispatch would benefit significantly.
Yes, I agree that tail-call elimination will largely improve performance.
But it will also harm debugging unless there is a way to reconstruct the stack when there is a problem, which I suspect could be very expensive.
Anyways, this solves a performance problem, not the recursion problem that motivates my mail, right?
>
> ../Dave
>
> On Thu, 9 Nov 2023 at 04:38, Guillermo Polito <guillermopolito@gmail.com <mailto:guillermopolito@gmail.com>> wrote:
> Hi all,
>
> We started (with many interruptions over the last months) working a bit with Stephane on understanding what is the (positive and negative) impact of stack-overflow support in Pharo.
> The key idea is that if a process consumes too much stack (potentially because of an infinite recursion) then the process should stop with an exception.
>
> ## Why we want better stack consumption control
>
> This idea comes up to solve issues that are pretty common and hit especially newbies.
> For example, imagine you accidentally write an accessor such as
>
> ```
> A >> foo
> ^ self foo
> ```
>
> Students do this all the time, and I’ve also seen it in experienced people who go too fast :).
> More importantly, such recursions could happen also with not-so-obvious indirect recursions (a sends b, b sends c, c sends a), and these could hit anybody.
>
> This is aggravated because the current execution model allows us to have infinite stacks —meaning: limited by available memory only.
> This is indeed a nice feature for many use cases but it has its own drawbacks when one of these kind of recursions are hit:
> - code just loops forever taking space in the stack
> - when there is no more stack space, context objects are created and moved to the heap
> - but those contexts are strongly held, so they are never GCed and take up extra space
> - even worse! they are there adding more work to the GC every time and making the GC run more often looking for space that is not there
>
> ## Why Ctrl-dot does not always work
>
> Of course, super users know there is this “Ctrl dot” hidden feature that should help you recover from this.
> First, let's take out of the equation that this is only known by super users.
> Now, in this situation, when Ctrl-dot is hit it will trigger a handler that suspends the problematic process and opens a debugger on it.
> But it could happen that,
> - the stack is so big that the debugger is very sluggish (best-case scenario)
> - the VM is just flooded doing GCs so maybe the Ctrl dot event does not even arrive at Pharo or the trigger
> - if the recursion is hit when printing an object (which is more common than you could imagine), opening the debugger could trigger a new recursion and never give back the control to the user
>
> ## What are we working on
>
> The main idea here is: Can we have a simple and efficient way to prevent such kinds of situations?
>
> After many discussions around detecting recursion, we kinda arrived at the simple solution of just detecting a stack overflow.
> The solution is easy to understand (because it’s like other languages work) and easy to implement because there is already support for that.
> But this leaves open two questions:
> - what happens when people want to use the “infinite stack” feature?
> - when should a process stack overflow? What is a sensitive default value?
>
> Our draft implementation here https://github.com/pharo-project/pharo-vm/pull/710 <https://github.com/pharo-project/pharo-vm/pull/710> does the following to cope with this:
> - we can now parametrize the size of the stack (of each stack page to be more accurate) when the VM starts up
> - the stack overflow check can be disabled per process
>
> We also are running experiments to see what could be a sensitive stack size for our normal usages. Here, for example, we ran almost all test cases in Pharo separately (one suite per line below), and we observed how many tests broke (x-axis) with different stack sizes (y-axis).
> Here we see that most test suites require at least 20-24k to run properly, some go up to 36k of stack before converging (i.e., the number of broken tests does not change).
>
> <ImagenPegada-10.tiff>
>
> You’ll notice in the graph that There are some scenarios that break all the time. This is because exception handling itself is recursive and may produce more stack overflows depending on the size of the stack between the exception and the exception handler.
> So some more work is still required, mostly changing Pharo libraries to properly support this. For example:
> - should tests run in a fresh process with a fresh stack?
> - should the exception mechanism use less recursion?
> - resumable exceptions add stack pressure because they do not “unstack” until the exception is finally handled, meaning that the stack used by exception handling just adds up to the stack of the original code, can we do better here?
>
> Probably there are more interesting questions here, that’s the “why" behind this email.
> I’m interested in opinions and scenarios you may come up with that should be taken into account.
>
> Cheers,
> Guille
> _______________________________________________
> Pharo-vm mailing list -- pharo-vm@lists.pharo.org
> To unsubscribe send an email to pharo-vm-leave@lists.pharo.org
DM
David Mason
Thu, Nov 9, 2023 3:23 PM
Tail-call-elimination is certainly not a total solution to the stack
overflow problem. However it would catch the foo ^ self foo
example. And
it would reduce he demand for stack.
I like what Sven mentioned about a resumable exception. You could
incrementally expand your stack in your application.
As for debugging, it's not clear how much of a problem the lack of a return
context actually is. We won't know until TCE is implemented and we get some
experience with it. For the record Zag Smalltalk is doing TCE.
../Dave
On Thu, 9 Nov 2023 at 09:18, Guille Polito guillermo.polito@inria.fr
wrote:
El 9 nov. 2023, a las 13:35, David Mason dmason@torontomu.ca escribió:
Tail call elimination would reduce stack usage significantly. See
Tail Call Elimination in OpenSmalltalk, Ralston & Mason, IWST2019
This really should be part of any stack solution. Particularly people
coming from a functional language background are likely to use it. Also
double dispatch would benefit significantly.
Yes, I agree that tail-call elimination will largely improve performance.
But it will also harm debugging unless there is a way to reconstruct the
stack when there is a problem, which I suspect could be very expensive.
Anyways, this solves a performance problem, not the recursion problem that
motivates my mail, right?
../Dave
On Thu, 9 Nov 2023 at 04:38, Guillermo Polito guillermopolito@gmail.com
wrote:
Hi all,
We started (with many interruptions over the last months) working a bit
with Stephane on understanding what is the (positive and negative) impact
of stack-overflow support in Pharo.
The key idea is that if a process consumes too much stack (potentially
because of an infinite recursion) then the process should stop with an
exception.
Why we want better stack consumption control
This idea comes up to solve issues that are pretty common and hit
especially newbies.
For example, imagine you accidentally write an accessor such as
A >> foo
^ self foo
Students do this all the time, and I’ve also seen it in experienced
people who go too fast :).
More importantly, such recursions could happen also with not-so-obvious
indirect recursions (a sends b, b sends c, c sends a), and these could hit
anybody.
This is aggravated because the current execution model allows us to have
infinite stacks —meaning: limited by available memory only.
This is indeed a nice feature for many use cases but it has its own
drawbacks when one of these kind of recursions are hit:
- code just loops forever taking space in the stack
- when there is no more stack space, context objects are created and
moved to the heap
- but those contexts are strongly held, so they are never GCed and take
up extra space
- even worse! they are there adding more work to the GC every time and
making the GC run more often looking for space that is not there
Why Ctrl-dot does not always work
Of course, super users know there is this “Ctrl dot” hidden feature that
should help you recover from this.
First, let's take out of the equation that this is only known by super
users.
Now, in this situation, when Ctrl-dot is hit it will trigger a handler
that suspends the problematic process and opens a debugger on it.
But it could happen that,
- the stack is so big that the debugger is very sluggish (best-case
scenario)
- the VM is just flooded doing GCs so maybe the Ctrl dot event does not
even arrive at Pharo or the trigger
- if the recursion is hit when printing an object (which is more common
than you could imagine), opening the debugger could trigger a new recursion
and never give back the control to the user
What are we working on
The main idea here is: Can we have a simple and efficient way to prevent
such kinds of situations?
After many discussions around detecting recursion, we kinda arrived at
the simple solution of just detecting a stack overflow.
The solution is easy to understand (because it’s like other languages
work) and easy to implement because there is already support for that.
But this leaves open two questions:
- what happens when people want to use the “infinite stack” feature?
- when should a process stack overflow? What is a sensitive default
value?
Our draft implementation here
https://github.com/pharo-project/pharo-vm/pull/710 does the following to
cope with this:
- we can now parametrize the size of the stack (of each stack page to be
more accurate) when the VM starts up
- the stack overflow check can be disabled per process
We also are running experiments to see what could be a sensitive stack
size for our normal usages. Here, for example, we ran almost all test cases
in Pharo separately (one suite per line below), and we observed how many
tests broke (x-axis) with different stack sizes (y-axis).
Here we see that most test suites require at least 20-24k to run
properly, some go up to 36k of stack before converging (i.e., the number of
broken tests does not change).
<ImagenPegada-10.tiff>
You’ll notice in the graph that There are some scenarios that break all
the time. This is because exception handling itself is recursive and may
produce more stack overflows depending on the size of the stack between the
exception and the exception handler.
So some more work is still required, mostly changing Pharo libraries to
properly support this. For example:
- should tests run in a fresh process with a fresh stack?
- should the exception mechanism use less recursion?
- resumable exceptions add stack pressure because they do not “unstack”
until the exception is finally handled, meaning that the stack used by
exception handling just adds up to the stack of the original code, can we
do better here?
Probably there are more interesting questions here, that’s the “why"
behind this email.
I’m interested in opinions and scenarios you may come up with that should
be taken into account.
Cheers,
Guille
Tail-call-elimination is certainly not a total solution to the stack
overflow problem. However it would catch the `foo ^ self foo` example. And
it would reduce he demand for stack.
I like what Sven mentioned about a resumable exception. You could
incrementally expand your stack in your application.
As for debugging, it's not clear how much of a problem the lack of a return
context actually is. We won't know until TCE is implemented and we get some
experience with it. For the record Zag Smalltalk is doing TCE.
../Dave
On Thu, 9 Nov 2023 at 09:18, Guille Polito <guillermo.polito@inria.fr>
wrote:
>
>
> El 9 nov. 2023, a las 13:35, David Mason <dmason@torontomu.ca> escribió:
>
> Tail call elimination would reduce stack usage significantly. See
> Tail Call Elimination in OpenSmalltalk, Ralston & Mason, IWST2019
>
> This really should be part of any stack solution. Particularly people
> coming from a functional language background are likely to use it. Also
> double dispatch would benefit significantly.
>
>
> Yes, I agree that tail-call elimination will largely improve performance.
> But it will also harm debugging unless there is a way to reconstruct the
> stack when there is a problem, which I suspect could be very expensive.
>
> Anyways, this solves a performance problem, not the recursion problem that
> motivates my mail, right?
>
>
> ../Dave
>
> On Thu, 9 Nov 2023 at 04:38, Guillermo Polito <guillermopolito@gmail.com>
> wrote:
>
>> Hi all,
>>
>> We started (with many interruptions over the last months) working a bit
>> with Stephane on understanding what is the (positive and negative) impact
>> of stack-overflow support in Pharo.
>> The key idea is that if a process consumes too much stack (potentially
>> because of an infinite recursion) then the process should stop with an
>> exception.
>>
>> ## Why we want better stack consumption control
>>
>> This idea comes up to solve issues that are pretty common and hit
>> especially newbies.
>> For example, imagine you accidentally write an accessor such as
>>
>> ```
>> A >> foo
>> ^ self foo
>> ```
>>
>> Students do this all the time, and I’ve also seen it in experienced
>> people who go too fast :).
>> More importantly, such recursions could happen also with not-so-obvious
>> indirect recursions (a sends b, b sends c, c sends a), and these could hit
>> anybody.
>>
>> This is aggravated because the current execution model allows us to have
>> infinite stacks —meaning: limited by available memory only.
>> This is indeed a nice feature for many use cases but it has its own
>> drawbacks when one of these kind of recursions are hit:
>> - code just loops forever taking space in the stack
>> - when there is no more stack space, context objects are created and
>> moved to the heap
>> - but those contexts are strongly held, so they are never GCed and take
>> up extra space
>> - even worse! they are there adding more work to the GC every time and
>> making the GC run more often looking for space that is not there
>>
>> ## Why Ctrl-dot does not always work
>>
>> Of course, super users know there is this “Ctrl dot” hidden feature that
>> should help you recover from this.
>> First, let's take out of the equation that this is only known by super
>> users.
>> Now, in this situation, when Ctrl-dot is hit it will trigger a handler
>> that suspends the problematic process and opens a debugger on it.
>> But it could happen that,
>> - the stack is so big that the debugger is very sluggish (best-case
>> scenario)
>> - the VM is just flooded doing GCs so maybe the Ctrl dot event does not
>> even arrive at Pharo or the trigger
>> - if the recursion is hit when printing an object (which is more common
>> than you could imagine), opening the debugger could trigger a new recursion
>> and never give back the control to the user
>>
>> ## What are we working on
>>
>> The main idea here is: Can we have a simple and efficient way to prevent
>> such kinds of situations?
>>
>> After many discussions around detecting recursion, we kinda arrived at
>> the simple solution of just detecting a stack overflow.
>> The solution is easy to understand (because it’s like other languages
>> work) and easy to implement because there is already support for that.
>> But this leaves open two questions:
>> - what happens when people want to use the “infinite stack” feature?
>> - when should a process stack overflow? What is a sensitive default
>> value?
>>
>> Our draft implementation here
>> https://github.com/pharo-project/pharo-vm/pull/710 does the following to
>> cope with this:
>> - we can now parametrize the size of the stack (of each stack page to be
>> more accurate) when the VM starts up
>> - the stack overflow check can be disabled per process
>>
>> We also are running experiments to see what could be a sensitive stack
>> size for our normal usages. Here, for example, we ran almost all test cases
>> in Pharo separately (one suite per line below), and we observed how many
>> tests broke (x-axis) with different stack sizes (y-axis).
>> Here we see that most test suites require at least 20-24k to run
>> properly, some go up to 36k of stack before converging (i.e., the number of
>> broken tests does not change).
>>
>> <ImagenPegada-10.tiff>
>>
>> You’ll notice in the graph that There are some scenarios that break all
>> the time. This is because exception handling itself is recursive and may
>> produce more stack overflows depending on the size of the stack between the
>> exception and the exception handler.
>> So some more work is still required, mostly changing Pharo libraries to
>> properly support this. For example:
>> - should tests run in a fresh process with a fresh stack?
>> - should the exception mechanism use less recursion?
>> - resumable exceptions add stack pressure because they do not “unstack”
>> until the exception is finally handled, meaning that the stack used by
>> exception handling just adds up to the stack of the original code, can we
>> do better here?
>>
>> Probably there are more interesting questions here, that’s the “why"
>> behind this email.
>> I’m interested in opinions and scenarios you may come up with that should
>> be taken into account.
>>
>> Cheers,
>> Guille
>>
> _______________________________________________
> Pharo-vm mailing list -- pharo-vm@lists.pharo.org
> To unsubscribe send an email to pharo-vm-leave@lists.pharo.org
>
>
>
DS
Daniel Slomovits
Sat, Nov 11, 2023 4:10 AM
I think this is a great idea. I've mostly used Dolphin Smalltalk, which is
actually a strict stack machine under the hood (it has a context-like
introspection API but the stack is explicitly the canonical form), so it's
more-or-less forced to implement a limit of some kind. When I started using
Pharo I triggered a couple stack overflows by mistake, and was frustrated
by the fact that at first what happened was...nothing, everything seemed
fine, my code just didn't work. And then half a minute later Pharo gets
extremely slow and I notice it's using 2GB of memory and by then it's too
late and I have to kill the image. Getting a more-or-less immediate error
would be much more user-friendly IMO.
A couple things to learn from Dolphin's implementation, I think:
- When a stack overflow is detected, the resulting interrupt raises
that process' stack limit by a significant amount (though by less than the
original limit—IOW it doesn't double, but it's not just a couple more
frames either) before signaling the exception, precisely so that exception
handling can occur without triggering another stack-overflow event. A
further refinement could be that if a second stack overflow is detected,
we directly invoke more basic recovery—this could mean an emergency
evaluator, terminating the offending process and opening a post-mortem with
a textual stack dump (ugh! but at least it's predictable), etc.
- I don't think we should worry too much about refining what exactly
the limit is. 10x as much stack as 99% of code will ever use, is still a
tiny amount compared to consuming all available memory with Contexts. At
least, if I'm understanding the graph/data correctly. That's 36kB of stack
space, right? Not 36k frames/contexts deep? With each context being six
slots plus args/temps, 36kB is 500-750 frames on a 64-bit VM (in stack
representation—contexts add object-header overhead but we don't reify them
unless we have to). For reference, Dolphin's limit is 64kB, but that's a
32-bit VM, so the equivalent for 64-bit would be 128kB...but because Pharo
can spill contexts to the stack, the limit could easily be 1MB, or a fixed
number of frames designed to approximate that—still a tiny amount of memory
overall, and still will be hit near-instantly by true infinite recursion,
but lots of breathing room for most use cases.
Actually, this did get me to thinking...the stack depth of a Pharo process
is not necessarily easy/cheap to compute in the general case, without
caching a lot of information on intermediate contexts. In most cases the
context chain acts as a proper stack, but not always—methods like
Process>>on:do:, and some even more esoteric ones I forget off the top of
my head, make modifications far away from the top context and may splice
context chains together in odd ways. Perhaps a more flexible limit would be
better—one that is triggered by allocating more than a certain number of
contexts total, and examines running Processes in detail to find the
culprit at that point.
On Thu, Nov 9, 2023 at 4:38 AM Guillermo Polito guillermopolito@gmail.com
wrote:
Hi all,
We started (with many interruptions over the last months) working a bit
with Stephane on understanding what is the (positive and negative) impact
of stack-overflow support in Pharo.
The key idea is that if a process consumes too much stack (potentially
because of an infinite recursion) then the process should stop with an
exception.
Why we want better stack consumption control
This idea comes up to solve issues that are pretty common and hit
especially newbies.
For example, imagine you accidentally write an accessor such as
A >> foo
^ self foo
Students do this all the time, and I’ve also seen it in experienced people
who go too fast :).
More importantly, such recursions could happen also with not-so-obvious
indirect recursions (a sends b, b sends c, c sends a), and these could hit
anybody.
This is aggravated because the current execution model allows us to have
infinite stacks —meaning: limited by available memory only.
This is indeed a nice feature for many use cases but it has its own
drawbacks when one of these kind of recursions are hit:
- code just loops forever taking space in the stack
- when there is no more stack space, context objects are created and
moved to the heap
- but those contexts are strongly held, so they are never GCed and take
up extra space
- even worse! they are there adding more work to the GC every time and
making the GC run more often looking for space that is not there
Why Ctrl-dot does not always work
Of course, super users know there is this “Ctrl dot” hidden feature that
should help you recover from this.
First, let's take out of the equation that this is only known by super
users.
Now, in this situation, when Ctrl-dot is hit it will trigger a handler
that suspends the problematic process and opens a debugger on it.
But it could happen that,
- the stack is so big that the debugger is very sluggish (best-case
scenario)
- the VM is just flooded doing GCs so maybe the Ctrl dot event does not
even arrive at Pharo or the trigger
- if the recursion is hit when printing an object (which is more common
than you could imagine), opening the debugger could trigger a new recursion
and never give back the control to the user
What are we working on
The main idea here is: Can we have a simple and efficient way to prevent
such kinds of situations?
After many discussions around detecting recursion, we kinda arrived at the
simple solution of just detecting a stack overflow.
The solution is easy to understand (because it’s like other languages
work) and easy to implement because there is already support for that.
But this leaves open two questions:
- what happens when people want to use the “infinite stack” feature?
- when should a process stack overflow? What is a sensitive default value?
Our draft implementation here
https://github.com/pharo-project/pharo-vm/pull/710 does the following to
cope with this:
- we can now parametrize the size of the stack (of each stack page to be
more accurate) when the VM starts up
- the stack overflow check can be disabled per process
We also are running experiments to see what could be a sensitive stack
size for our normal usages. Here, for example, we ran almost all test cases
in Pharo separately (one suite per line below), and we observed how many
tests broke (x-axis) with different stack sizes (y-axis).
Here we see that most test suites require at least 20-24k to run properly,
some go up to 36k of stack before converging (i.e., the number of broken
tests does not change).
You’ll notice in the graph that There are some scenarios that break all
the time. This is because exception handling itself is recursive and may
produce more stack overflows depending on the size of the stack between the
exception and the exception handler.
So some more work is still required, mostly changing Pharo libraries to
properly support this. For example:
- should tests run in a fresh process with a fresh stack?
- should the exception mechanism use less recursion?
- resumable exceptions add stack pressure because they do not “unstack”
until the exception is finally handled, meaning that the stack used by
exception handling just adds up to the stack of the original code, can we
do better here?
Probably there are more interesting questions here, that’s the “why"
behind this email.
I’m interested in opinions and scenarios you may come up with that should
be taken into account.
Cheers,
Guille
I think this is a great idea. I've mostly used Dolphin Smalltalk, which is
actually a strict stack machine under the hood (it has a context-like
introspection API but the stack is explicitly the canonical form), so it's
more-or-less forced to implement a limit of some kind. When I started using
Pharo I triggered a couple stack overflows by mistake, and was frustrated
by the fact that at first what happened was...nothing, everything seemed
fine, my code just didn't work. And then half a minute later Pharo gets
extremely slow and I notice it's using 2GB of memory and by then it's too
late and I have to kill the image. Getting a more-or-less immediate error
would be much more user-friendly IMO.
A couple things to learn from Dolphin's implementation, I think:
1. When a stack overflow is detected, the resulting interrupt raises
that process' stack limit by a significant amount (though by less than the
original limit—IOW it doesn't double, but it's not just a couple more
frames either) before signaling the exception, precisely so that exception
handling can occur without triggering another stack-overflow event. A
further refinement could be that if a second stack overflow *is* detected,
we directly invoke more basic recovery—this could mean an emergency
evaluator, terminating the offending process and opening a post-mortem with
a textual stack dump (ugh! but at least it's predictable), etc.
2. I don't think we should worry too much about refining what exactly
the limit is. 10x as much stack as 99% of code will ever use, is still a
tiny amount compared to consuming all available memory with Contexts. At
least, if I'm understanding the graph/data correctly. That's 36kB of stack
space, right? Not 36k frames/contexts deep? With each context being six
slots plus args/temps, 36kB is 500-750 frames on a 64-bit VM (in stack
representation—contexts add object-header overhead but we don't reify them
unless we have to). For reference, Dolphin's limit is 64kB, but that's a
32-bit VM, so the equivalent for 64-bit would be 128kB...but because Pharo
can spill contexts to the stack, the limit could easily be 1MB, or a fixed
number of frames designed to approximate that—still a tiny amount of memory
overall, and still will be hit near-instantly by true infinite recursion,
but lots of breathing room for most use cases.
Actually, this did get me to thinking...the stack depth of a Pharo process
is not necessarily easy/cheap to compute in the general case, without
caching a lot of information on intermediate contexts. In most cases the
context chain acts as a proper stack, but *not always*—methods like
Process>>on:do:, and some even more esoteric ones I forget off the top of
my head, make modifications far away from the top context and may splice
context chains together in odd ways. Perhaps a more flexible limit would be
better—one that is triggered by allocating more than a certain number of
contexts *total*, and examines running Processes in detail to find the
culprit at that point.
On Thu, Nov 9, 2023 at 4:38 AM Guillermo Polito <guillermopolito@gmail.com>
wrote:
> Hi all,
>
> We started (with many interruptions over the last months) working a bit
> with Stephane on understanding what is the (positive and negative) impact
> of stack-overflow support in Pharo.
> The key idea is that if a process consumes too much stack (potentially
> because of an infinite recursion) then the process should stop with an
> exception.
>
> ## Why we want better stack consumption control
>
> This idea comes up to solve issues that are pretty common and hit
> especially newbies.
> For example, imagine you accidentally write an accessor such as
>
> ```
> A >> foo
> ^ self foo
> ```
>
> Students do this all the time, and I’ve also seen it in experienced people
> who go too fast :).
> More importantly, such recursions could happen also with not-so-obvious
> indirect recursions (a sends b, b sends c, c sends a), and these could hit
> anybody.
>
> This is aggravated because the current execution model allows us to have
> infinite stacks —meaning: limited by available memory only.
> This is indeed a nice feature for many use cases but it has its own
> drawbacks when one of these kind of recursions are hit:
> - code just loops forever taking space in the stack
> - when there is no more stack space, context objects are created and
> moved to the heap
> - but those contexts are strongly held, so they are never GCed and take
> up extra space
> - even worse! they are there adding more work to the GC every time and
> making the GC run more often looking for space that is not there
>
> ## Why Ctrl-dot does not always work
>
> Of course, super users know there is this “Ctrl dot” hidden feature that
> should help you recover from this.
> First, let's take out of the equation that this is only known by super
> users.
> Now, in this situation, when Ctrl-dot is hit it will trigger a handler
> that suspends the problematic process and opens a debugger on it.
> But it could happen that,
> - the stack is so big that the debugger is very sluggish (best-case
> scenario)
> - the VM is just flooded doing GCs so maybe the Ctrl dot event does not
> even arrive at Pharo or the trigger
> - if the recursion is hit when printing an object (which is more common
> than you could imagine), opening the debugger could trigger a new recursion
> and never give back the control to the user
>
> ## What are we working on
>
> The main idea here is: Can we have a simple and efficient way to prevent
> such kinds of situations?
>
> After many discussions around detecting recursion, we kinda arrived at the
> simple solution of just detecting a stack overflow.
> The solution is easy to understand (because it’s like other languages
> work) and easy to implement because there is already support for that.
> But this leaves open two questions:
> - what happens when people want to use the “infinite stack” feature?
> - when should a process stack overflow? What is a sensitive default value?
>
> Our draft implementation here
> https://github.com/pharo-project/pharo-vm/pull/710 does the following to
> cope with this:
> - we can now parametrize the size of the stack (of each stack page to be
> more accurate) when the VM starts up
> - the stack overflow check can be disabled per process
>
> We also are running experiments to see what could be a sensitive stack
> size for our normal usages. Here, for example, we ran almost all test cases
> in Pharo separately (one suite per line below), and we observed how many
> tests broke (x-axis) with different stack sizes (y-axis).
> Here we see that most test suites require at least 20-24k to run properly,
> some go up to 36k of stack before converging (i.e., the number of broken
> tests does not change).
>
>
> You’ll notice in the graph that There are some scenarios that break all
> the time. This is because exception handling itself is recursive and may
> produce more stack overflows depending on the size of the stack between the
> exception and the exception handler.
> So some more work is still required, mostly changing Pharo libraries to
> properly support this. For example:
> - should tests run in a fresh process with a fresh stack?
> - should the exception mechanism use less recursion?
> - resumable exceptions add stack pressure because they do not “unstack”
> until the exception is finally handled, meaning that the stack used by
> exception handling just adds up to the stack of the original code, can we
> do better here?
>
> Probably there are more interesting questions here, that’s the “why"
> behind this email.
> I’m interested in opinions and scenarios you may come up with that should
> be taken into account.
>
> Cheers,
> Guille
>
SD
stephane ducasse
Sun, Nov 12, 2023 9:52 AM
Hi daniel
Thanks for the feedback.
May be you wrote it but I could not really understand.
How dolphin handled
That is let to run a couple of seconds?
Did they kill the process?
In Pharo we do have an interrupt but
But it could happen that,
- the stack is so big that the debugger is very sluggish (best-case scenario)
- the VM is just flooded doing GCs so maybe the Ctrl dot event does not even arrive at Pharo or the trigger
- if the recursion is hit when printing an object (which is more common than you could imagine), opening the debugger could trigger a new recursion and never give back the control to the user
On 11 Nov 2023, at 05:10, Daniel Slomovits daniels220@gmail.com wrote:
I think this is a great idea. I've mostly used Dolphin Smalltalk, which is actually a strict stack machine under the hood (it has a context-like introspection API but the stack is explicitly the canonical form), so it's more-or-less forced to implement a limit of some kind. When I started using Pharo I triggered a couple stack overflows by mistake, and was frustrated by the fact that at first what happened was...nothing, everything seemed fine, my code just didn't work. And then half a minute later Pharo gets extremely slow and I notice it's using 2GB of memory and by then it's too late and I have to kill the image. Getting a more-or-less immediate error would be much more user-friendly IMO.
A couple things to learn from Dolphin's implementation, I think:
When a stack overflow is detected, the resulting interrupt raises that process' stack limit by a significant amount (though by less than the original limit—IOW it doesn't double, but it's not just a couple more frames either) before signaling the exception, precisely so that exception handling can occur without triggering another stack-overflow event. A further refinement could be that if a second stack overflow is detected, we directly invoke more basic recovery—this could mean an emergency evaluator, terminating the offending process and opening a post-mortem with a textual stack dump (ugh! but at least it's predictable), etc.
I don't think we should worry too much about refining what exactly the limit is. 10x as much stack as 99% of code will ever use, is still a tiny amount compared to consuming all available memory with Contexts. At least, if I'm understanding the graph/data correctly. That's 36kB of stack space, right? Not 36k frames/contexts deep? With each context being six slots plus args/temps, 36kB is 500-750 frames on a 64-bit VM (in stack representation—contexts add object-header overhead but we don't reify them unless we have to). For reference, Dolphin's limit is 64kB, but that's a 32-bit VM, so the equivalent for 64-bit would be 128kB...but because Pharo can spill contexts to the stack, the limit could easily be 1MB, or a fixed number of frames designed to approximate that—still a tiny amount of memory overall, and still will be hit near-instantly by true infinite recursion, but lots of breathing room for most use cases.
Actually, this did get me to thinking...the stack depth of a Pharo process is not necessarily easy/cheap to compute in the general case, without caching a lot of information on intermediate contexts. In most cases the context chain acts as a proper stack, but not always—methods like Process>>on:do:, and some even more esoteric ones I forget off the top of my head, make modifications far away from the top context and may splice context chains together in odd ways. Perhaps a more flexible limit would be better—one that is triggered by allocating more than a certain number of contexts total, and examines running Processes in detail to find the culprit at that point.
On Thu, Nov 9, 2023 at 4:38 AM Guillermo Polito <guillermopolito@gmail.com mailto:guillermopolito@gmail.com> wrote:
Hi all,
We started (with many interruptions over the last months) working a bit with Stephane on understanding what is the (positive and negative) impact of stack-overflow support in Pharo.
The key idea is that if a process consumes too much stack (potentially because of an infinite recursion) then the process should stop with an exception.
Why we want better stack consumption control
This idea comes up to solve issues that are pretty common and hit especially newbies.
For example, imagine you accidentally write an accessor such as
A >> foo
^ self foo
Students do this all the time, and I’ve also seen it in experienced people who go too fast :).
More importantly, such recursions could happen also with not-so-obvious indirect recursions (a sends b, b sends c, c sends a), and these could hit anybody.
This is aggravated because the current execution model allows us to have infinite stacks —meaning: limited by available memory only.
This is indeed a nice feature for many use cases but it has its own drawbacks when one of these kind of recursions are hit:
- code just loops forever taking space in the stack
- when there is no more stack space, context objects are created and moved to the heap
- but those contexts are strongly held, so they are never GCed and take up extra space
- even worse! they are there adding more work to the GC every time and making the GC run more often looking for space that is not there
Why Ctrl-dot does not always work
Of course, super users know there is this “Ctrl dot” hidden feature that should help you recover from this.
First, let's take out of the equation that this is only known by super users.
Now, in this situation, when Ctrl-dot is hit it will trigger a handler that suspends the problematic process and opens a debugger on it.
But it could happen that,
- the stack is so big that the debugger is very sluggish (best-case scenario)
- the VM is just flooded doing GCs so maybe the Ctrl dot event does not even arrive at Pharo or the trigger
- if the recursion is hit when printing an object (which is more common than you could imagine), opening the debugger could trigger a new recursion and never give back the control to the user
What are we working on
The main idea here is: Can we have a simple and efficient way to prevent such kinds of situations?
After many discussions around detecting recursion, we kinda arrived at the simple solution of just detecting a stack overflow.
The solution is easy to understand (because it’s like other languages work) and easy to implement because there is already support for that.
But this leaves open two questions:
- what happens when people want to use the “infinite stack” feature?
- when should a process stack overflow? What is a sensitive default value?
Our draft implementation here https://github.com/pharo-project/pharo-vm/pull/710 does the following to cope with this:
- we can now parametrize the size of the stack (of each stack page to be more accurate) when the VM starts up
- the stack overflow check can be disabled per process
We also are running experiments to see what could be a sensitive stack size for our normal usages. Here, for example, we ran almost all test cases in Pharo separately (one suite per line below), and we observed how many tests broke (x-axis) with different stack sizes (y-axis).
Here we see that most test suites require at least 20-24k to run properly, some go up to 36k of stack before converging (i.e., the number of broken tests does not change).
<ImagenPegada-10.tiff>
You’ll notice in the graph that There are some scenarios that break all the time. This is because exception handling itself is recursive and may produce more stack overflows depending on the size of the stack between the exception and the exception handler.
So some more work is still required, mostly changing Pharo libraries to properly support this. For example:
- should tests run in a fresh process with a fresh stack?
- should the exception mechanism use less recursion?
- resumable exceptions add stack pressure because they do not “unstack” until the exception is finally handled, meaning that the stack used by exception handling just adds up to the stack of the original code, can we do better here?
Probably there are more interesting questions here, that’s the “why" behind this email.
I’m interested in opinions and scenarios you may come up with that should be taken into account.
Cheers,
Guille
Stéphane Ducasse
http://stephane.ducasse.free.fr / http://www.pharo.org
03 59 35 87 52
Assistant: Aurore Dalle
FAX 03 59 57 78 50
TEL 03 59 35 86 16
S. Ducasse - Inria
40, avenue Halley,
Parc Scientifique de la Haute Borne, Bât.A, Park Plaza
Villeneuve d'Ascq 59650
France
Hi daniel
Thanks for the feedback.
May be you wrote it but I could not really understand.
How dolphin handled
>> ```
>> A >> foo
>> ^ self foo
>> ```
That is let to run a couple of seconds?
Did they kill the process?
In Pharo we do have an interrupt but
>> But it could happen that,
>> - the stack is so big that the debugger is very sluggish (best-case scenario)
>> - the VM is just flooded doing GCs so maybe the Ctrl dot event does not even arrive at Pharo or the trigger
>> - if the recursion is hit when printing an object (which is more common than you could imagine), opening the debugger could trigger a new recursion and never give back the control to the user
S
> On 11 Nov 2023, at 05:10, Daniel Slomovits <daniels220@gmail.com> wrote:
>
> I think this is a great idea. I've mostly used Dolphin Smalltalk, which is actually a strict stack machine under the hood (it has a context-like introspection API but the stack is explicitly the canonical form), so it's more-or-less forced to implement a limit of some kind. When I started using Pharo I triggered a couple stack overflows by mistake, and was frustrated by the fact that at first what happened was...nothing, everything seemed fine, my code just didn't work. And then half a minute later Pharo gets extremely slow and I notice it's using 2GB of memory and by then it's too late and I have to kill the image. Getting a more-or-less immediate error would be much more user-friendly IMO.
>
> A couple things to learn from Dolphin's implementation, I think:
> When a stack overflow is detected, the resulting interrupt raises that process' stack limit by a significant amount (though by less than the original limit—IOW it doesn't double, but it's not just a couple more frames either) before signaling the exception, precisely so that exception handling can occur without triggering another stack-overflow event. A further refinement could be that if a second stack overflow is detected, we directly invoke more basic recovery—this could mean an emergency evaluator, terminating the offending process and opening a post-mortem with a textual stack dump (ugh! but at least it's predictable), etc.
> I don't think we should worry too much about refining what exactly the limit is. 10x as much stack as 99% of code will ever use, is still a tiny amount compared to consuming all available memory with Contexts. At least, if I'm understanding the graph/data correctly. That's 36kB of stack space, right? Not 36k frames/contexts deep? With each context being six slots plus args/temps, 36kB is 500-750 frames on a 64-bit VM (in stack representation—contexts add object-header overhead but we don't reify them unless we have to). For reference, Dolphin's limit is 64kB, but that's a 32-bit VM, so the equivalent for 64-bit would be 128kB...but because Pharo can spill contexts to the stack, the limit could easily be 1MB, or a fixed number of frames designed to approximate that—still a tiny amount of memory overall, and still will be hit near-instantly by true infinite recursion, but lots of breathing room for most use cases.
> Actually, this did get me to thinking...the stack depth of a Pharo process is not necessarily easy/cheap to compute in the general case, without caching a lot of information on intermediate contexts. In most cases the context chain acts as a proper stack, but not always—methods like Process>>on:do:, and some even more esoteric ones I forget off the top of my head, make modifications far away from the top context and may splice context chains together in odd ways. Perhaps a more flexible limit would be better—one that is triggered by allocating more than a certain number of contexts total, and examines running Processes in detail to find the culprit at that point.
>
> On Thu, Nov 9, 2023 at 4:38 AM Guillermo Polito <guillermopolito@gmail.com <mailto:guillermopolito@gmail.com>> wrote:
>> Hi all,
>>
>> We started (with many interruptions over the last months) working a bit with Stephane on understanding what is the (positive and negative) impact of stack-overflow support in Pharo.
>> The key idea is that if a process consumes too much stack (potentially because of an infinite recursion) then the process should stop with an exception.
>>
>> ## Why we want better stack consumption control
>>
>> This idea comes up to solve issues that are pretty common and hit especially newbies.
>> For example, imagine you accidentally write an accessor such as
>>
>> ```
>> A >> foo
>> ^ self foo
>> ```
>>
>> Students do this all the time, and I’ve also seen it in experienced people who go too fast :).
>> More importantly, such recursions could happen also with not-so-obvious indirect recursions (a sends b, b sends c, c sends a), and these could hit anybody.
>>
>> This is aggravated because the current execution model allows us to have infinite stacks —meaning: limited by available memory only.
>> This is indeed a nice feature for many use cases but it has its own drawbacks when one of these kind of recursions are hit:
>> - code just loops forever taking space in the stack
>> - when there is no more stack space, context objects are created and moved to the heap
>> - but those contexts are strongly held, so they are never GCed and take up extra space
>> - even worse! they are there adding more work to the GC every time and making the GC run more often looking for space that is not there
>>
>> ## Why Ctrl-dot does not always work
>>
>> Of course, super users know there is this “Ctrl dot” hidden feature that should help you recover from this.
>> First, let's take out of the equation that this is only known by super users.
>> Now, in this situation, when Ctrl-dot is hit it will trigger a handler that suspends the problematic process and opens a debugger on it.
>> But it could happen that,
>> - the stack is so big that the debugger is very sluggish (best-case scenario)
>> - the VM is just flooded doing GCs so maybe the Ctrl dot event does not even arrive at Pharo or the trigger
>> - if the recursion is hit when printing an object (which is more common than you could imagine), opening the debugger could trigger a new recursion and never give back the control to the user
>>
>> ## What are we working on
>>
>> The main idea here is: Can we have a simple and efficient way to prevent such kinds of situations?
>>
>> After many discussions around detecting recursion, we kinda arrived at the simple solution of just detecting a stack overflow.
>> The solution is easy to understand (because it’s like other languages work) and easy to implement because there is already support for that.
>> But this leaves open two questions:
>> - what happens when people want to use the “infinite stack” feature?
>> - when should a process stack overflow? What is a sensitive default value?
>>
>> Our draft implementation here https://github.com/pharo-project/pharo-vm/pull/710 does the following to cope with this:
>> - we can now parametrize the size of the stack (of each stack page to be more accurate) when the VM starts up
>> - the stack overflow check can be disabled per process
>>
>> We also are running experiments to see what could be a sensitive stack size for our normal usages. Here, for example, we ran almost all test cases in Pharo separately (one suite per line below), and we observed how many tests broke (x-axis) with different stack sizes (y-axis).
>> Here we see that most test suites require at least 20-24k to run properly, some go up to 36k of stack before converging (i.e., the number of broken tests does not change).
>>
>> <ImagenPegada-10.tiff>
>>
>> You’ll notice in the graph that There are some scenarios that break all the time. This is because exception handling itself is recursive and may produce more stack overflows depending on the size of the stack between the exception and the exception handler.
>> So some more work is still required, mostly changing Pharo libraries to properly support this. For example:
>> - should tests run in a fresh process with a fresh stack?
>> - should the exception mechanism use less recursion?
>> - resumable exceptions add stack pressure because they do not “unstack” until the exception is finally handled, meaning that the stack used by exception handling just adds up to the stack of the original code, can we do better here?
>>
>> Probably there are more interesting questions here, that’s the “why" behind this email.
>> I’m interested in opinions and scenarios you may come up with that should be taken into account.
>>
>> Cheers,
>> Guille
> _______________________________________________
> Pharo-vm mailing list -- pharo-vm@lists.pharo.org
> To unsubscribe send an email to pharo-vm-leave@lists.pharo.org
--------------------------------------------
Stéphane Ducasse
http://stephane.ducasse.free.fr / http://www.pharo.org
03 59 35 87 52
Assistant: Aurore Dalle
FAX 03 59 57 78 50
TEL 03 59 35 86 16
S. Ducasse - Inria
40, avenue Halley,
Parc Scientifique de la Haute Borne, Bât.A, Park Plaza
Villeneuve d'Ascq 59650
France
DS
Daniel Slomovits
Tue, Nov 14, 2023 6:19 AM
Hello Stephane,
A simple example like that—really any case caused by common types of
mistake, even if the method involved is more complex—wouldn't get to run
for a couple seconds. It would error out after a small fraction of a
second, since the interrupt is issued immediately upon exceeding the limit
(which by default is—oops, 64k* slots*, 256kB, at most ~10k stack frames,
but realistically less due to args and temps), and it just doesn't take
that long to execute most recursive methods 10k levels deep.
At this point it would raise the limit a little for that process and signal
a StackOverflow exception. In production this likely would mean killing the
process but that's up to application code. In development you get a
walkback like any other exception. Because the stack is kept to a
relatively reasonable size, I've never had debugger performance be an
issue—it's noticeably more sluggish, but we're talking about taking
200-300ms to respond instead of imperceptible. Similarly for GC
pressure—even if you have dozens of processes all maxed out on stack, this
would still be less than the memory used to start a base image.
Certainly there are scenarios that can cause problems—an error while
printing an object will stop you from opening a debugger, yes. However a
StackOverflow in particular isn't any worse than any other exception—in all
cases you'll get a walkback, hit "Debug", and get another walkback instead
of a debugger. If you hit "Terminate" at that point, it might leave the
original process in a zombie state and/or a hidden debugger window, but you
can kill it from the Process Monitor/close it from the Window menu and be
fine. And even this is because Dolphin doesn't safeguard printing in the
development tools the way Pharo does—if Pharo adopted stack overflow
handling like Dolphin's, a stack overflow likely wouldn't even stop you
from opening a debugger, it would just make it a little slow, then
something would show up with "error printing: a StackOverflow" or the like.
Hope that helps.
Daniel
On Sun, Nov 12, 2023 at 4:53 AM stephane ducasse stephane.ducasse@inria.fr
wrote:
Hi daniel
Thanks for the feedback.
May be you wrote it but I could not really understand.
How dolphin handled
A >> foo
^ self foo
That is let to run a couple of seconds?
Did they kill the process?
In Pharo we do have an interrupt but
But it could happen that,
- the stack is so big that the debugger is very sluggish (best-case
scenario)
- the VM is just flooded doing GCs so maybe the Ctrl dot event does not
even arrive at Pharo or the trigger
- if the recursion is hit when printing an object (which is more common
than you could imagine), opening the debugger could trigger a new recursion
and never give back the control to the user
S
On 11 Nov 2023, at 05:10, Daniel Slomovits daniels220@gmail.com wrote:
I think this is a great idea. I've mostly used Dolphin Smalltalk, which is
actually a strict stack machine under the hood (it has a context-like
introspection API but the stack is explicitly the canonical form), so it's
more-or-less forced to implement a limit of some kind. When I started using
Pharo I triggered a couple stack overflows by mistake, and was frustrated
by the fact that at first what happened was...nothing, everything seemed
fine, my code just didn't work. And then half a minute later Pharo gets
extremely slow and I notice it's using 2GB of memory and by then it's too
late and I have to kill the image. Getting a more-or-less immediate error
would be much more user-friendly IMO.
A couple things to learn from Dolphin's implementation, I think:
1. When a stack overflow is detected, the resulting interrupt raises
that process' stack limit by a significant amount (though by less than the
original limit—IOW it doesn't double, but it's not just a couple more
frames either) before signaling the exception, precisely so that exception
handling can occur without triggering another stack-overflow event. A
further refinement could be that if a second stack overflow *is* detected,
we directly invoke more basic recovery—this could mean an emergency
evaluator, terminating the offending process and opening a post-mortem with
a textual stack dump (ugh! but at least it's predictable), etc.
2. I don't think we should worry too much about refining what exactly
the limit is. 10x as much stack as 99% of code will ever use, is still a
tiny amount compared to consuming all available memory with Contexts. At
least, if I'm understanding the graph/data correctly. That's 36kB of stack
space, right? Not 36k frames/contexts deep? With each context being six
slots plus args/temps, 36kB is 500-750 frames on a 64-bit VM (in stack
representation—contexts add object-header overhead but we don't reify them
unless we have to). For reference, Dolphin's limit is 64kB, but that's a
32-bit VM, so the equivalent for 64-bit would be 128kB...but because Pharo
can spill contexts to the stack, the limit could easily be 1MB, or a fixed
number of frames designed to approximate that—still a tiny amount of memory
overall, and still will be hit near-instantly by true infinite recursion,
but lots of breathing room for most use cases.
Actually, this did get me to thinking...the stack depth of a Pharo process
is not necessarily easy/cheap to compute in the general case, without
caching a lot of information on intermediate contexts. In most cases the
context chain acts as a proper stack, but not always—methods like
Process>>on:do:, and some even more esoteric ones I forget off the top of
my head, make modifications far away from the top context and may splice
context chains together in odd ways. Perhaps a more flexible limit would be
better—one that is triggered by allocating more than a certain number of
contexts total, and examines running Processes in detail to find the
culprit at that point.
On Thu, Nov 9, 2023 at 4:38 AM Guillermo Polito guillermopolito@gmail.com
wrote:
Hi all,
We started (with many interruptions over the last months) working a bit
with Stephane on understanding what is the (positive and negative) impact
of stack-overflow support in Pharo.
The key idea is that if a process consumes too much stack (potentially
because of an infinite recursion) then the process should stop with an
exception.
Why we want better stack consumption control
This idea comes up to solve issues that are pretty common and hit
especially newbies.
For example, imagine you accidentally write an accessor such as
A >> foo
^ self foo
Students do this all the time, and I’ve also seen it in experienced
people who go too fast :).
More importantly, such recursions could happen also with not-so-obvious
indirect recursions (a sends b, b sends c, c sends a), and these could hit
anybody.
This is aggravated because the current execution model allows us to have
infinite stacks —meaning: limited by available memory only.
This is indeed a nice feature for many use cases but it has its own
drawbacks when one of these kind of recursions are hit:
- code just loops forever taking space in the stack
- when there is no more stack space, context objects are created and
moved to the heap
- but those contexts are strongly held, so they are never GCed and take
up extra space
- even worse! they are there adding more work to the GC every time and
making the GC run more often looking for space that is not there
Why Ctrl-dot does not always work
Of course, super users know there is this “Ctrl dot” hidden feature that
should help you recover from this.
First, let's take out of the equation that this is only known by super
users.
Now, in this situation, when Ctrl-dot is hit it will trigger a handler
that suspends the problematic process and opens a debugger on it.
But it could happen that,
- the stack is so big that the debugger is very sluggish (best-case
scenario)
- the VM is just flooded doing GCs so maybe the Ctrl dot event does not
even arrive at Pharo or the trigger
- if the recursion is hit when printing an object (which is more common
than you could imagine), opening the debugger could trigger a new recursion
and never give back the control to the user
What are we working on
The main idea here is: Can we have a simple and efficient way to prevent
such kinds of situations?
After many discussions around detecting recursion, we kinda arrived at
the simple solution of just detecting a stack overflow.
The solution is easy to understand (because it’s like other languages
work) and easy to implement because there is already support for that.
But this leaves open two questions:
- what happens when people want to use the “infinite stack” feature?
- when should a process stack overflow? What is a sensitive default
value?
Our draft implementation here
https://github.com/pharo-project/pharo-vm/pull/710 does the following to
cope with this:
- we can now parametrize the size of the stack (of each stack page to be
more accurate) when the VM starts up
- the stack overflow check can be disabled per process
We also are running experiments to see what could be a sensitive stack
size for our normal usages. Here, for example, we ran almost all test cases
in Pharo separately (one suite per line below), and we observed how many
tests broke (x-axis) with different stack sizes (y-axis).
Here we see that most test suites require at least 20-24k to run
properly, some go up to 36k of stack before converging (i.e., the number of
broken tests does not change).
<ImagenPegada-10.tiff>
You’ll notice in the graph that There are some scenarios that break all
the time. This is because exception handling itself is recursive and may
produce more stack overflows depending on the size of the stack between the
exception and the exception handler.
So some more work is still required, mostly changing Pharo libraries to
properly support this. For example:
- should tests run in a fresh process with a fresh stack?
- should the exception mechanism use less recursion?
- resumable exceptions add stack pressure because they do not “unstack”
until the exception is finally handled, meaning that the stack used by
exception handling just adds up to the stack of the original code, can we
do better here?
Probably there are more interesting questions here, that’s the “why"
behind this email.
I’m interested in opinions and scenarios you may come up with that should
be taken into account.
Cheers,
Guille
Hello Stephane,
A simple example like that—really any case caused by common types of
mistake, even if the method involved is more complex—wouldn't get to run
for a couple seconds. It would error out after a small fraction of a
second, since the interrupt is issued immediately upon exceeding the limit
(which by default is—oops, 64k* slots*, 256kB, at most ~10k stack frames,
but realistically less due to args and temps), and it just doesn't take
that long to execute most recursive methods 10k levels deep.
At this point it would raise the limit a little for that process and signal
a StackOverflow exception. In production this likely would mean killing the
process but that's up to application code. In development you get a
walkback like any other exception. Because the stack is kept to a
*relatively* reasonable size, I've never had debugger performance be an
issue—it's noticeably more sluggish, but we're talking about taking
200-300ms to respond instead of imperceptible. Similarly for GC
pressure—even if you have dozens of processes all maxed out on stack, this
would still be less than the memory used to start a base image.
Certainly there are scenarios that can cause problems—an error while
printing an object will stop you from opening a debugger, yes. However a
StackOverflow in particular isn't any worse than any other exception—in all
cases you'll get a walkback, hit "Debug", and get another walkback instead
of a debugger. If you hit "Terminate" at that point, it might leave the
original process in a zombie state and/or a hidden debugger window, but you
can kill it from the Process Monitor/close it from the Window menu and be
fine. And even this is because Dolphin doesn't safeguard printing in the
development tools the way Pharo does—if Pharo adopted stack overflow
handling like Dolphin's, a stack overflow likely wouldn't even stop you
from opening a debugger, it would just make it a *little* slow, then
something would show up with "error printing: a StackOverflow" or the like.
Hope that helps.
Daniel
On Sun, Nov 12, 2023 at 4:53 AM stephane ducasse <stephane.ducasse@inria.fr>
wrote:
> Hi daniel
>
> Thanks for the feedback.
> May be you wrote it but I could not really understand.
>
> How dolphin handled
>
> ```
> A >> foo
> ^ self foo
> ```
>
>
> That is let to run a couple of seconds?
> Did they kill the process?
>
> In Pharo we do have an interrupt but
>
> But it could happen that,
> - the stack is so big that the debugger is very sluggish (best-case
> scenario)
> - the VM is just flooded doing GCs so maybe the Ctrl dot event does not
> even arrive at Pharo or the trigger
> - if the recursion is hit when printing an object (which is more common
> than you could imagine), opening the debugger could trigger a new recursion
> and never give back the control to the user
>
>
> S
>
> On 11 Nov 2023, at 05:10, Daniel Slomovits <daniels220@gmail.com> wrote:
>
> I think this is a great idea. I've mostly used Dolphin Smalltalk, which is
> actually a strict stack machine under the hood (it has a context-like
> introspection API but the stack is explicitly the canonical form), so it's
> more-or-less forced to implement a limit of some kind. When I started using
> Pharo I triggered a couple stack overflows by mistake, and was frustrated
> by the fact that at first what happened was...nothing, everything seemed
> fine, my code just didn't work. And then half a minute later Pharo gets
> extremely slow and I notice it's using 2GB of memory and by then it's too
> late and I have to kill the image. Getting a more-or-less immediate error
> would be much more user-friendly IMO.
>
> A couple things to learn from Dolphin's implementation, I think:
>
> 1. When a stack overflow is detected, the resulting interrupt raises
> that process' stack limit by a significant amount (though by less than the
> original limit—IOW it doesn't double, but it's not just a couple more
> frames either) before signaling the exception, precisely so that exception
> handling can occur without triggering another stack-overflow event. A
> further refinement could be that if a second stack overflow *is* detected,
> we directly invoke more basic recovery—this could mean an emergency
> evaluator, terminating the offending process and opening a post-mortem with
> a textual stack dump (ugh! but at least it's predictable), etc.
> 2. I don't think we should worry too much about refining what exactly
> the limit is. 10x as much stack as 99% of code will ever use, is still a
> tiny amount compared to consuming all available memory with Contexts. At
> least, if I'm understanding the graph/data correctly. That's 36kB of stack
> space, right? Not 36k frames/contexts deep? With each context being six
> slots plus args/temps, 36kB is 500-750 frames on a 64-bit VM (in stack
> representation—contexts add object-header overhead but we don't reify them
> unless we have to). For reference, Dolphin's limit is 64kB, but that's a
> 32-bit VM, so the equivalent for 64-bit would be 128kB...but because Pharo
> can spill contexts to the stack, the limit could easily be 1MB, or a fixed
> number of frames designed to approximate that—still a tiny amount of memory
> overall, and still will be hit near-instantly by true infinite recursion,
> but lots of breathing room for most use cases.
>
> Actually, this did get me to thinking...the stack depth of a Pharo process
> is not necessarily easy/cheap to compute in the general case, without
> caching a lot of information on intermediate contexts. In most cases the
> context chain acts as a proper stack, but *not always*—methods like
> Process>>on:do:, and some even more esoteric ones I forget off the top of
> my head, make modifications far away from the top context and may splice
> context chains together in odd ways. Perhaps a more flexible limit would be
> better—one that is triggered by allocating more than a certain number of
> contexts *total*, and examines running Processes in detail to find the
> culprit at that point.
>
> On Thu, Nov 9, 2023 at 4:38 AM Guillermo Polito <guillermopolito@gmail.com>
> wrote:
>
>> Hi all,
>>
>> We started (with many interruptions over the last months) working a bit
>> with Stephane on understanding what is the (positive and negative) impact
>> of stack-overflow support in Pharo.
>> The key idea is that if a process consumes too much stack (potentially
>> because of an infinite recursion) then the process should stop with an
>> exception.
>>
>> ## Why we want better stack consumption control
>>
>> This idea comes up to solve issues that are pretty common and hit
>> especially newbies.
>> For example, imagine you accidentally write an accessor such as
>>
>> ```
>> A >> foo
>> ^ self foo
>> ```
>>
>> Students do this all the time, and I’ve also seen it in experienced
>> people who go too fast :).
>> More importantly, such recursions could happen also with not-so-obvious
>> indirect recursions (a sends b, b sends c, c sends a), and these could hit
>> anybody.
>>
>> This is aggravated because the current execution model allows us to have
>> infinite stacks —meaning: limited by available memory only.
>> This is indeed a nice feature for many use cases but it has its own
>> drawbacks when one of these kind of recursions are hit:
>> - code just loops forever taking space in the stack
>> - when there is no more stack space, context objects are created and
>> moved to the heap
>> - but those contexts are strongly held, so they are never GCed and take
>> up extra space
>> - even worse! they are there adding more work to the GC every time and
>> making the GC run more often looking for space that is not there
>>
>> ## Why Ctrl-dot does not always work
>>
>> Of course, super users know there is this “Ctrl dot” hidden feature that
>> should help you recover from this.
>> First, let's take out of the equation that this is only known by super
>> users.
>> Now, in this situation, when Ctrl-dot is hit it will trigger a handler
>> that suspends the problematic process and opens a debugger on it.
>> But it could happen that,
>> - the stack is so big that the debugger is very sluggish (best-case
>> scenario)
>> - the VM is just flooded doing GCs so maybe the Ctrl dot event does not
>> even arrive at Pharo or the trigger
>> - if the recursion is hit when printing an object (which is more common
>> than you could imagine), opening the debugger could trigger a new recursion
>> and never give back the control to the user
>>
>> ## What are we working on
>>
>> The main idea here is: Can we have a simple and efficient way to prevent
>> such kinds of situations?
>>
>> After many discussions around detecting recursion, we kinda arrived at
>> the simple solution of just detecting a stack overflow.
>> The solution is easy to understand (because it’s like other languages
>> work) and easy to implement because there is already support for that.
>> But this leaves open two questions:
>> - what happens when people want to use the “infinite stack” feature?
>> - when should a process stack overflow? What is a sensitive default
>> value?
>>
>> Our draft implementation here
>> https://github.com/pharo-project/pharo-vm/pull/710 does the following to
>> cope with this:
>> - we can now parametrize the size of the stack (of each stack page to be
>> more accurate) when the VM starts up
>> - the stack overflow check can be disabled per process
>>
>> We also are running experiments to see what could be a sensitive stack
>> size for our normal usages. Here, for example, we ran almost all test cases
>> in Pharo separately (one suite per line below), and we observed how many
>> tests broke (x-axis) with different stack sizes (y-axis).
>> Here we see that most test suites require at least 20-24k to run
>> properly, some go up to 36k of stack before converging (i.e., the number of
>> broken tests does not change).
>>
>> <ImagenPegada-10.tiff>
>>
>> You’ll notice in the graph that There are some scenarios that break all
>> the time. This is because exception handling itself is recursive and may
>> produce more stack overflows depending on the size of the stack between the
>> exception and the exception handler.
>> So some more work is still required, mostly changing Pharo libraries to
>> properly support this. For example:
>> - should tests run in a fresh process with a fresh stack?
>> - should the exception mechanism use less recursion?
>> - resumable exceptions add stack pressure because they do not “unstack”
>> until the exception is finally handled, meaning that the stack used by
>> exception handling just adds up to the stack of the original code, can we
>> do better here?
>>
>> Probably there are more interesting questions here, that’s the “why"
>> behind this email.
>> I’m interested in opinions and scenarios you may come up with that should
>> be taken into account.
>>
>> Cheers,
>> Guille
>>
> _______________________________________________
> Pharo-vm mailing list -- pharo-vm@lists.pharo.org
> To unsubscribe send an email to pharo-vm-leave@lists.pharo.org
>
>
> --------------------------------------------
> Stéphane Ducasse
> http://stephane.ducasse.free.fr / http://www.pharo.org
> 03 59 35 87 52
> Assistant: Aurore Dalle
> FAX 03 59 57 78 50
> TEL 03 59 35 86 16
> S. Ducasse - Inria
> 40, avenue Halley,
> Parc Scientifique de la Haute Borne, Bât.A, Park Plaza
> Villeneuve d'Ascq 59650
> France
>
>
>
>
SD
stephane ducasse
Wed, Nov 15, 2023 9:48 AM
Thanks.
We will digest and discuss internally.
S
On 14 Nov 2023, at 07:19, Daniel Slomovits daniels220@gmail.com wrote:
Hello Stephane,
A simple example like that—really any case caused by common types of mistake, even if the method involved is more complex—wouldn't get to run for a couple seconds. It would error out after a small fraction of a second, since the interrupt is issued immediately upon exceeding the limit (which by default is—oops, 64k slots, 256kB, at most ~10k stack frames, but realistically less due to args and temps), and it just doesn't take that long to execute most recursive methods 10k levels deep.
At this point it would raise the limit a little for that process and signal a StackOverflow exception. In production this likely would mean killing the process but that's up to application code. In development you get a walkback like any other exception. Because the stack is kept to a relatively reasonable size, I've never had debugger performance be an issue—it's noticeably more sluggish, but we're talking about taking 200-300ms to respond instead of imperceptible. Similarly for GC pressure—even if you have dozens of processes all maxed out on stack, this would still be less than the memory used to start a base image.
Certainly there are scenarios that can cause problems—an error while printing an object will stop you from opening a debugger, yes. However a StackOverflow in particular isn't any worse than any other exception—in all cases you'll get a walkback, hit "Debug", and get another walkback instead of a debugger. If you hit "Terminate" at that point, it might leave the original process in a zombie state and/or a hidden debugger window, but you can kill it from the Process Monitor/close it from the Window menu and be fine. And even this is because Dolphin doesn't safeguard printing in the development tools the way Pharo does—if Pharo adopted stack overflow handling like Dolphin's, a stack overflow likely wouldn't even stop you from opening a debugger, it would just make it a little slow, then something would show up with "error printing: a StackOverflow" or the like.
Hope that helps.
Daniel
On Sun, Nov 12, 2023 at 4:53 AM stephane ducasse <stephane.ducasse@inria.fr mailto:stephane.ducasse@inria.fr> wrote:
Hi daniel
Thanks for the feedback.
May be you wrote it but I could not really understand.
How dolphin handled
That is let to run a couple of seconds?
Did they kill the process?
In Pharo we do have an interrupt but
But it could happen that,
- the stack is so big that the debugger is very sluggish (best-case scenario)
- the VM is just flooded doing GCs so maybe the Ctrl dot event does not even arrive at Pharo or the trigger
- if the recursion is hit when printing an object (which is more common than you could imagine), opening the debugger could trigger a new recursion and never give back the control to the user
On 11 Nov 2023, at 05:10, Daniel Slomovits <daniels220@gmail.com mailto:daniels220@gmail.com> wrote:
I think this is a great idea. I've mostly used Dolphin Smalltalk, which is actually a strict stack machine under the hood (it has a context-like introspection API but the stack is explicitly the canonical form), so it's more-or-less forced to implement a limit of some kind. When I started using Pharo I triggered a couple stack overflows by mistake, and was frustrated by the fact that at first what happened was...nothing, everything seemed fine, my code just didn't work. And then half a minute later Pharo gets extremely slow and I notice it's using 2GB of memory and by then it's too late and I have to kill the image. Getting a more-or-less immediate error would be much more user-friendly IMO.
A couple things to learn from Dolphin's implementation, I think:
When a stack overflow is detected, the resulting interrupt raises that process' stack limit by a significant amount (though by less than the original limit—IOW it doesn't double, but it's not just a couple more frames either) before signaling the exception, precisely so that exception handling can occur without triggering another stack-overflow event. A further refinement could be that if a second stack overflow is detected, we directly invoke more basic recovery—this could mean an emergency evaluator, terminating the offending process and opening a post-mortem with a textual stack dump (ugh! but at least it's predictable), etc.
I don't think we should worry too much about refining what exactly the limit is. 10x as much stack as 99% of code will ever use, is still a tiny amount compared to consuming all available memory with Contexts. At least, if I'm understanding the graph/data correctly. That's 36kB of stack space, right? Not 36k frames/contexts deep? With each context being six slots plus args/temps, 36kB is 500-750 frames on a 64-bit VM (in stack representation—contexts add object-header overhead but we don't reify them unless we have to). For reference, Dolphin's limit is 64kB, but that's a 32-bit VM, so the equivalent for 64-bit would be 128kB...but because Pharo can spill contexts to the stack, the limit could easily be 1MB, or a fixed number of frames designed to approximate that—still a tiny amount of memory overall, and still will be hit near-instantly by true infinite recursion, but lots of breathing room for most use cases.
Actually, this did get me to thinking...the stack depth of a Pharo process is not necessarily easy/cheap to compute in the general case, without caching a lot of information on intermediate contexts. In most cases the context chain acts as a proper stack, but not always—methods like Process>>on:do:, and some even more esoteric ones I forget off the top of my head, make modifications far away from the top context and may splice context chains together in odd ways. Perhaps a more flexible limit would be better—one that is triggered by allocating more than a certain number of contexts total, and examines running Processes in detail to find the culprit at that point.
On Thu, Nov 9, 2023 at 4:38 AM Guillermo Polito <guillermopolito@gmail.com mailto:guillermopolito@gmail.com> wrote:
Hi all,
We started (with many interruptions over the last months) working a bit with Stephane on understanding what is the (positive and negative) impact of stack-overflow support in Pharo.
The key idea is that if a process consumes too much stack (potentially because of an infinite recursion) then the process should stop with an exception.
Why we want better stack consumption control
This idea comes up to solve issues that are pretty common and hit especially newbies.
For example, imagine you accidentally write an accessor such as
A >> foo
^ self foo
Students do this all the time, and I’ve also seen it in experienced people who go too fast :).
More importantly, such recursions could happen also with not-so-obvious indirect recursions (a sends b, b sends c, c sends a), and these could hit anybody.
This is aggravated because the current execution model allows us to have infinite stacks —meaning: limited by available memory only.
This is indeed a nice feature for many use cases but it has its own drawbacks when one of these kind of recursions are hit:
- code just loops forever taking space in the stack
- when there is no more stack space, context objects are created and moved to the heap
- but those contexts are strongly held, so they are never GCed and take up extra space
- even worse! they are there adding more work to the GC every time and making the GC run more often looking for space that is not there
Why Ctrl-dot does not always work
Of course, super users know there is this “Ctrl dot” hidden feature that should help you recover from this.
First, let's take out of the equation that this is only known by super users.
Now, in this situation, when Ctrl-dot is hit it will trigger a handler that suspends the problematic process and opens a debugger on it.
But it could happen that,
- the stack is so big that the debugger is very sluggish (best-case scenario)
- the VM is just flooded doing GCs so maybe the Ctrl dot event does not even arrive at Pharo or the trigger
- if the recursion is hit when printing an object (which is more common than you could imagine), opening the debugger could trigger a new recursion and never give back the control to the user
What are we working on
The main idea here is: Can we have a simple and efficient way to prevent such kinds of situations?
After many discussions around detecting recursion, we kinda arrived at the simple solution of just detecting a stack overflow.
The solution is easy to understand (because it’s like other languages work) and easy to implement because there is already support for that.
But this leaves open two questions:
- what happens when people want to use the “infinite stack” feature?
- when should a process stack overflow? What is a sensitive default value?
Our draft implementation here https://github.com/pharo-project/pharo-vm/pull/710 does the following to cope with this:
- we can now parametrize the size of the stack (of each stack page to be more accurate) when the VM starts up
- the stack overflow check can be disabled per process
We also are running experiments to see what could be a sensitive stack size for our normal usages. Here, for example, we ran almost all test cases in Pharo separately (one suite per line below), and we observed how many tests broke (x-axis) with different stack sizes (y-axis).
Here we see that most test suites require at least 20-24k to run properly, some go up to 36k of stack before converging (i.e., the number of broken tests does not change).
<ImagenPegada-10.tiff>
You’ll notice in the graph that There are some scenarios that break all the time. This is because exception handling itself is recursive and may produce more stack overflows depending on the size of the stack between the exception and the exception handler.
So some more work is still required, mostly changing Pharo libraries to properly support this. For example:
- should tests run in a fresh process with a fresh stack?
- should the exception mechanism use less recursion?
- resumable exceptions add stack pressure because they do not “unstack” until the exception is finally handled, meaning that the stack used by exception handling just adds up to the stack of the original code, can we do better here?
Probably there are more interesting questions here, that’s the “why" behind this email.
I’m interested in opinions and scenarios you may come up with that should be taken into account.
Cheers,
Guille
Stéphane Ducasse
http://stephane.ducasse.free.fr / http://www.pharo.org
03 59 35 87 52
Assistant: Aurore Dalle
FAX 03 59 57 78 50
TEL 03 59 35 86 16
S. Ducasse - Inria
40, avenue Halley,
Parc Scientifique de la Haute Borne, Bât.A, Park Plaza
Villeneuve d'Ascq 59650
France
Thanks.
We will digest and discuss internally.
S
> On 14 Nov 2023, at 07:19, Daniel Slomovits <daniels220@gmail.com> wrote:
>
> Hello Stephane,
>
> A simple example like that—really any case caused by common types of mistake, even if the method involved is more complex—wouldn't get to run for a couple seconds. It would error out after a small fraction of a second, since the interrupt is issued immediately upon exceeding the limit (which by default is—oops, 64k slots, 256kB, at most ~10k stack frames, but realistically less due to args and temps), and it just doesn't take that long to execute most recursive methods 10k levels deep.
>
> At this point it would raise the limit a little for that process and signal a StackOverflow exception. In production this likely would mean killing the process but that's up to application code. In development you get a walkback like any other exception. Because the stack is kept to a relatively reasonable size, I've never had debugger performance be an issue—it's noticeably more sluggish, but we're talking about taking 200-300ms to respond instead of imperceptible. Similarly for GC pressure—even if you have dozens of processes all maxed out on stack, this would still be less than the memory used to start a base image.
>
> Certainly there are scenarios that can cause problems—an error while printing an object will stop you from opening a debugger, yes. However a StackOverflow in particular isn't any worse than any other exception—in all cases you'll get a walkback, hit "Debug", and get another walkback instead of a debugger. If you hit "Terminate" at that point, it might leave the original process in a zombie state and/or a hidden debugger window, but you can kill it from the Process Monitor/close it from the Window menu and be fine. And even this is because Dolphin doesn't safeguard printing in the development tools the way Pharo does—if Pharo adopted stack overflow handling like Dolphin's, a stack overflow likely wouldn't even stop you from opening a debugger, it would just make it a little slow, then something would show up with "error printing: a StackOverflow" or the like.
>
> Hope that helps.
>
> Daniel
>
> On Sun, Nov 12, 2023 at 4:53 AM stephane ducasse <stephane.ducasse@inria.fr <mailto:stephane.ducasse@inria.fr>> wrote:
>> Hi daniel
>>
>> Thanks for the feedback.
>> May be you wrote it but I could not really understand.
>>
>> How dolphin handled
>>
>>>> ```
>>>> A >> foo
>>>> ^ self foo
>>>> ```
>>
>> That is let to run a couple of seconds?
>> Did they kill the process?
>>
>> In Pharo we do have an interrupt but
>>
>>>> But it could happen that,
>>>> - the stack is so big that the debugger is very sluggish (best-case scenario)
>>>> - the VM is just flooded doing GCs so maybe the Ctrl dot event does not even arrive at Pharo or the trigger
>>>> - if the recursion is hit when printing an object (which is more common than you could imagine), opening the debugger could trigger a new recursion and never give back the control to the user
>>
>>
>> S
>>
>>> On 11 Nov 2023, at 05:10, Daniel Slomovits <daniels220@gmail.com <mailto:daniels220@gmail.com>> wrote:
>>>
>>> I think this is a great idea. I've mostly used Dolphin Smalltalk, which is actually a strict stack machine under the hood (it has a context-like introspection API but the stack is explicitly the canonical form), so it's more-or-less forced to implement a limit of some kind. When I started using Pharo I triggered a couple stack overflows by mistake, and was frustrated by the fact that at first what happened was...nothing, everything seemed fine, my code just didn't work. And then half a minute later Pharo gets extremely slow and I notice it's using 2GB of memory and by then it's too late and I have to kill the image. Getting a more-or-less immediate error would be much more user-friendly IMO.
>>>
>>> A couple things to learn from Dolphin's implementation, I think:
>>> When a stack overflow is detected, the resulting interrupt raises that process' stack limit by a significant amount (though by less than the original limit—IOW it doesn't double, but it's not just a couple more frames either) before signaling the exception, precisely so that exception handling can occur without triggering another stack-overflow event. A further refinement could be that if a second stack overflow is detected, we directly invoke more basic recovery—this could mean an emergency evaluator, terminating the offending process and opening a post-mortem with a textual stack dump (ugh! but at least it's predictable), etc.
>>> I don't think we should worry too much about refining what exactly the limit is. 10x as much stack as 99% of code will ever use, is still a tiny amount compared to consuming all available memory with Contexts. At least, if I'm understanding the graph/data correctly. That's 36kB of stack space, right? Not 36k frames/contexts deep? With each context being six slots plus args/temps, 36kB is 500-750 frames on a 64-bit VM (in stack representation—contexts add object-header overhead but we don't reify them unless we have to). For reference, Dolphin's limit is 64kB, but that's a 32-bit VM, so the equivalent for 64-bit would be 128kB...but because Pharo can spill contexts to the stack, the limit could easily be 1MB, or a fixed number of frames designed to approximate that—still a tiny amount of memory overall, and still will be hit near-instantly by true infinite recursion, but lots of breathing room for most use cases.
>>> Actually, this did get me to thinking...the stack depth of a Pharo process is not necessarily easy/cheap to compute in the general case, without caching a lot of information on intermediate contexts. In most cases the context chain acts as a proper stack, but not always—methods like Process>>on:do:, and some even more esoteric ones I forget off the top of my head, make modifications far away from the top context and may splice context chains together in odd ways. Perhaps a more flexible limit would be better—one that is triggered by allocating more than a certain number of contexts total, and examines running Processes in detail to find the culprit at that point.
>>>
>>> On Thu, Nov 9, 2023 at 4:38 AM Guillermo Polito <guillermopolito@gmail.com <mailto:guillermopolito@gmail.com>> wrote:
>>>> Hi all,
>>>>
>>>> We started (with many interruptions over the last months) working a bit with Stephane on understanding what is the (positive and negative) impact of stack-overflow support in Pharo.
>>>> The key idea is that if a process consumes too much stack (potentially because of an infinite recursion) then the process should stop with an exception.
>>>>
>>>> ## Why we want better stack consumption control
>>>>
>>>> This idea comes up to solve issues that are pretty common and hit especially newbies.
>>>> For example, imagine you accidentally write an accessor such as
>>>>
>>>> ```
>>>> A >> foo
>>>> ^ self foo
>>>> ```
>>>>
>>>> Students do this all the time, and I’ve also seen it in experienced people who go too fast :).
>>>> More importantly, such recursions could happen also with not-so-obvious indirect recursions (a sends b, b sends c, c sends a), and these could hit anybody.
>>>>
>>>> This is aggravated because the current execution model allows us to have infinite stacks —meaning: limited by available memory only.
>>>> This is indeed a nice feature for many use cases but it has its own drawbacks when one of these kind of recursions are hit:
>>>> - code just loops forever taking space in the stack
>>>> - when there is no more stack space, context objects are created and moved to the heap
>>>> - but those contexts are strongly held, so they are never GCed and take up extra space
>>>> - even worse! they are there adding more work to the GC every time and making the GC run more often looking for space that is not there
>>>>
>>>> ## Why Ctrl-dot does not always work
>>>>
>>>> Of course, super users know there is this “Ctrl dot” hidden feature that should help you recover from this.
>>>> First, let's take out of the equation that this is only known by super users.
>>>> Now, in this situation, when Ctrl-dot is hit it will trigger a handler that suspends the problematic process and opens a debugger on it.
>>>> But it could happen that,
>>>> - the stack is so big that the debugger is very sluggish (best-case scenario)
>>>> - the VM is just flooded doing GCs so maybe the Ctrl dot event does not even arrive at Pharo or the trigger
>>>> - if the recursion is hit when printing an object (which is more common than you could imagine), opening the debugger could trigger a new recursion and never give back the control to the user
>>>>
>>>> ## What are we working on
>>>>
>>>> The main idea here is: Can we have a simple and efficient way to prevent such kinds of situations?
>>>>
>>>> After many discussions around detecting recursion, we kinda arrived at the simple solution of just detecting a stack overflow.
>>>> The solution is easy to understand (because it’s like other languages work) and easy to implement because there is already support for that.
>>>> But this leaves open two questions:
>>>> - what happens when people want to use the “infinite stack” feature?
>>>> - when should a process stack overflow? What is a sensitive default value?
>>>>
>>>> Our draft implementation here https://github.com/pharo-project/pharo-vm/pull/710 does the following to cope with this:
>>>> - we can now parametrize the size of the stack (of each stack page to be more accurate) when the VM starts up
>>>> - the stack overflow check can be disabled per process
>>>>
>>>> We also are running experiments to see what could be a sensitive stack size for our normal usages. Here, for example, we ran almost all test cases in Pharo separately (one suite per line below), and we observed how many tests broke (x-axis) with different stack sizes (y-axis).
>>>> Here we see that most test suites require at least 20-24k to run properly, some go up to 36k of stack before converging (i.e., the number of broken tests does not change).
>>>>
>>>> <ImagenPegada-10.tiff>
>>>>
>>>> You’ll notice in the graph that There are some scenarios that break all the time. This is because exception handling itself is recursive and may produce more stack overflows depending on the size of the stack between the exception and the exception handler.
>>>> So some more work is still required, mostly changing Pharo libraries to properly support this. For example:
>>>> - should tests run in a fresh process with a fresh stack?
>>>> - should the exception mechanism use less recursion?
>>>> - resumable exceptions add stack pressure because they do not “unstack” until the exception is finally handled, meaning that the stack used by exception handling just adds up to the stack of the original code, can we do better here?
>>>>
>>>> Probably there are more interesting questions here, that’s the “why" behind this email.
>>>> I’m interested in opinions and scenarios you may come up with that should be taken into account.
>>>>
>>>> Cheers,
>>>> Guille
>>> _______________________________________________
>>> Pharo-vm mailing list -- pharo-vm@lists.pharo.org <mailto:pharo-vm@lists.pharo.org>
>>> To unsubscribe send an email to pharo-vm-leave@lists.pharo.org <mailto:pharo-vm-leave@lists.pharo.org>
>>
>> --------------------------------------------
>> Stéphane Ducasse
>> http://stephane.ducasse.free.fr <http://stephane.ducasse.free.fr/> / http://www.pharo.org <http://www.pharo.org/>
>> 03 59 35 87 52
>> Assistant: Aurore Dalle
>> FAX 03 59 57 78 50
>> TEL 03 59 35 86 16
>> S. Ducasse - Inria
>> 40, avenue Halley,
>> Parc Scientifique de la Haute Borne, Bât.A, Park Plaza
>> Villeneuve d'Ascq 59650
>> France
>>
>>
>>
> _______________________________________________
> Pharo-vm mailing list -- pharo-vm@lists.pharo.org
> To unsubscribe send an email to pharo-vm-leave@lists.pharo.org
--------------------------------------------
Stéphane Ducasse
http://stephane.ducasse.free.fr / http://www.pharo.org
03 59 35 87 52
Assistant: Aurore Dalle
FAX 03 59 57 78 50
TEL 03 59 35 86 16
S. Ducasse - Inria
40, avenue Halley,
Parc Scientifique de la Haute Borne, Bât.A, Park Plaza
Villeneuve d'Ascq 59650
France