On 4/25/2017 9:19 AM, Phil Martel wrote:
>>> So, let's say you're in the middle of calculating
>>> factorial(1,000,000,000,000) with algorithm 2.  Then you find out about
>>> algorithm 1 (or maybe decide that Stirling's approximation is close enough).
>>
>> You (as an executing client who has called upon the "factorial service"
>> to perform that calculation) don't "find out about" anything!  To *you*,
>> nothing appears remiss.  That's the whole point; as long as the API
>> hasn't changed, you shouldn't care that the service has been replaced
>> with an equivalent service.
>>
>> How the *system* ensures that illusion is maintained is the problem
>> being addressed.
>>
>> The remedy that "makes most sense" will vary with the design (and
>> functionality) of the service being upgraded, and with the approach the
>> maintainer chooses to address those "rolling updates".
>
> I used the word "you" to mean the system providing the service (including the
> programmer who implemented the new algorithm).  However, a factorial
> calculation is a poor example in that it is not persistent.

There are no conditions placed on what a service can provide.  E.g.,
my calculator is a service; in your world, it might be a library.

I use factorial as an example of a "job" that can take some "macroscopic"
amount of time -- rather than arguing about whether 10 microseconds or
200 hours is "too long" for a service to "linger" in the face of a
pending/desired upgrade.

> It may make sense
> for you (the system) to dump the work you've done and start over, or to
> continue with the old algorithm for this instance.

There are *many* possible courses of action that the developer could
apply to providing a rolling upgrade of his service.  The system can't
impose *one* -- without sharply constraining the types of services that
can be implemented as well as the "time" each takes to operate.

If a service has side-effects, then you can't (typically) start over
as you would have to consider which side effects had already taken
place.

Etc.

>> As it would be heavy-handed for the system to dictate how EVERY service
>> is coded AND the constraints placed upon their algorithms, the system
>> can only offer (prefabricated) *mechanisms* that the service designer
>> (and maintainer) can exploit to facilitate the upgrade.
>>
>> And, the system has to rely on the designer/maintainer to make best use
>> of the mechanisms that it provides -- because the designer/maintainer
>> has more intimate knowledge of the way the service is intended to work.
>>
>> A "lazy" designer may choose not to address live upgrade issues.
>> In which case, the system will resort to draconian measures when
>> an upgrade is installed:  it will KILL the running service and
>> let the clients deal with the resulting mess.  *Users* will then
>> either avoid products from that provider *or* will avoid upgrading
>> (if the consequences are too painful -- where "too" is a subjective
>> criterion defined by the user in question).
>>
>>> What *can* you do with the unfinished solution other than dump the work
>>> and restart the problem with the new algorithm or let it finish (and next
>>> time use the new algorithm)?
>>
>> I prepare a document using /WordProcessor25/.  The document can be seen
>> as a snapshot of the "conceptual document" that I seek to prepare.
>> I upgrade to /WordProcessor29/.  Is all of the work that I did prior
>> to that upgrade lost?  (Why not?  :>  )
>
> I think the example you're trying for is that you run a word processor service
> and that I'm a client.

Yes.

> I'm typing into a document using /WordProcessor25/
> (which I think of as /WordProcessor/).  You want to upgrade to
> /WordProcessor29/ while I'm typing right here.

I want to upgrade.  *When* is a separate issue.

The point I am making is that you *can* upgrade /WordProcessor/ and
not lose the "work in progress" -- even if it has been sitting on an
offline floppy for 2 years -- BECAUSE THE NEW WORDPROCESSOR KNOWS
HOW TO UNDERSTAND THE STATE STORED BY THE OLDER VERSION.

(i.e., yet another "upgrade strategy")
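
A minimal sketch of that strategy, assuming the saved state carries a version tag and the newer program ships a converter for each older format. The field names, the JSON encoding and the converter are hypothetical; only the version numbers 25 and 29 are borrowed from the example above.

```python
import json

# Hypothetical converter: upgrades a saved document from format 25 to 29.
def v25_to_v29(doc):
    # Assume v29 stores text as a list of paragraphs instead of one string.
    return {"version": 29, "paragraphs": doc["text"].split("\n"), "styles": {}}

CONVERTERS = {25: v25_to_v29}   # maps an old format to its upgrade function
CURRENT_VERSION = 29

def load_document(raw):
    """Load a saved document, migrating older formats forward as needed."""
    doc = json.loads(raw)
    while doc["version"] != CURRENT_VERSION:
        convert = CONVERTERS.get(doc["version"])
        if convert is None:
            raise ValueError(f"no upgrade path from version {doc['version']}")
        doc = convert(doc)
    return doc

old_file = json.dumps({"version": 25, "text": "Dear Phil,\nStill typing..."})
print(load_document(old_file)["paragraphs"])
```

The old program never has to know about the new one; all of the knowledge about the old format lives in the upgrade.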

> In this case, perhaps in most cases, there's some point where the system can
> save its state as a checkpoint, start the new software and continue.  If the
> system can do the change between user inputs, the change will be transparent.
> The case where the inputs come too fast is where it gets tricky and you may
> have to keep a copy of the old code running.

Again, the system can't know how the service behaves or how its clients
expect to use it.  So, you can't *impose* an upgrade strategy on the service.
Instead, you provide mechanisms that allow many approaches to be used and
count on the designer/maintainer to use their specific knowledge of the
service (THEIR service!) to decide which of them to exploit.

Part of installing the upgrade "software" is the specification of the
upgrade strategy mechanism to be used -- along with any ancillary
requirements FOR THE UPGRADE.  Make it easier for developers to
do what they "should" instead of forcing them to do ALL of the
heavy lifting (which would tend to result in more "please reboot, now"
scenarios)
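
As a rough illustration only: the upgrade could carry a small manifest naming the strategy it wants, with the system falling back to the "kill it" default when nothing is declared. The strategy names and the manifest keys below are my own placeholders, not a description of any existing system.

```python
from enum import Enum

class Strategy(Enum):
    KILL = "kill"              # fallback: stop the old instance outright
    DRAIN = "drain"            # old instance finishes its clients, takes no new ones
    CHECKPOINT = "checkpoint"  # old instance saves state, new instance resumes from it

def plan_upgrade(manifest):
    """Pick the upgrade strategy named in a (hypothetical) install manifest."""
    name = manifest.get("upgrade_strategy")
    if name is None:
        return Strategy.KILL   # designer declared nothing: draconian default
    return Strategy(name)      # raises ValueError for an unknown strategy name

print(plan_upgrade({"service": "factorial", "upgrade_strategy": "drain"}))
print(plan_upgrade({"service": "legacy"}))
```

Rejecting an unknown strategy name at install time is one cheap guard against a developer specifying something the system cannot honor.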

What I've been doing (since Friday evening) is codifying the different
strategies and drafting guidelines for when each is "preferred" along
with when each is contraindicated.  Then, I'll sort out the consequences
of a developer incorrectly specifying the "wrong" strategy/mechanism;
or, incorrectly implementing it in their upgrade:  how might this
screw up other aspects of the system and what can I do to guard against
that.

On 4/25/2017 0:37, Don Y wrote:
> On 4/24/2017 7:02 PM, Phil Martel wrote:
>>>> Provided you translate and replace the existing state vector also.
>>>
>>> That may not be practical.
>>>
>>> factorial(n: int) : int {
>>>    ASSERT( n >= 1 )
>>>    result := 1
>>>    while (n > 1) {
>>>       result *= n
>>>       n--
>>>    }
>>>    return result
>>> }
>>>
>>> factorial(n: int) : int {
>>>    ASSERT( n >= 1 )
>>>    if (n == 1)
>>>       return 1
>>>    return n * factorial(n-1)
>>> }
>>>
>>> have vastly different state vectors (assuming I haven't botched the
>>> implementations).
>>>
>>> So, just assuming you can <somehow> map one state vector into another
>>> won't give you a "fix".
>> So, let's say you're in the middle of calculating
>> factorial(1,000,000,000,000) with algorithm 2.  Then you find out about
>> algorithm 1 (or maybe decide that Stirling's approximation is close enough).
>
> You (as an executing client who has called upon the "factorial service"
> to perform that calculation) don't "find out about" anything!  To *you*,
> nothing appears remiss.  That's the whole point; as long as the API
> hasn't changed, you shouldn't care that the service has been replaced
> with an equivalent service.
>
> How the *system* ensures that illusion is maintained is the problem
> being addressed.
>
> The remedy that "makes most sense" will vary with the design (and
> functionality) of the service being upgraded, and with the approach the
> maintainer chooses to address those "rolling updates".
>

I used the word "you" to mean the system providing the service 
(including the programmer who implemented the new algorithm).  However, 
a factorial calculation is a poor example in that it is not persistent. 
It may make sense for you (the system) to dump the work you've done and 
start over, or to continue with the old algorithm for this instance.
> As it would be heavy-handed for the system to dictate how EVERY service
> is coded AND the constraints placed upon their algorithms, the system
> can only offer (prefabricated) *mechanisms* that the service designer
> (and maintainer) can exploit to facilitate the upgrade.
>
> And, the system has to rely on the designer/maintainer to make best use
> of the mechanisms that it provides -- because the designer/maintainer
> has more intimate knowledge of the way the service is intended to work.
>
> A "lazy" designer may choose not to address live upgrade issues.
> In which case, the system will resort to draconian measures when
> an upgrade is installed:  it will KILL the running service and
> let the clients deal with the resulting mess.  *Users* will then
> either avoid products from that provider *or* will avoid upgrading
> (if the consequences are too painful -- where "too" is a subjective
> criterion defined by the user in question).
>
>> What *can* you do with the unfinished solution other than dump the work
>> and restart the problem with the new algorithm or let it finish (and next
>> time use the new algorithm)?
>
> I prepare a document using /WordProcessor25/.  The document can be seen
> as a snapshot of the "conceptual document" that I seek to prepare.
> I upgrade to /WordProcessor29/.  Is all of the work that I did prior
> to that upgrade lost?  (Why not?  :>  )

I think the example you're trying for is that you run a word processor 
service and that I'm a client.  I'm typing into a document using 
/WordProcessor25/ (which I think of as /WordProcessor/).  You want to 
upgrade to /WordProcessor29/ while I'm typing right here.
                                                       ^
In this case, perhaps in most cases, there's some point where the system 
can save its state as a checkpoint, start the new software and continue. 
  If the system can do the change between user inputs, the change will 
be transparent.  The case where the inputs come too fast is where it 
gets tricky and you may have to keep a copy of the old code running.

Best wishes,
--Phil
pomartel At Comcast(ignore_this) dot net

On 4/24/2017 7:02 PM, Phil Martel wrote:
>>> Provided you translate and replace the existing state vector also.
>>
>> That may not be practical.
>>
>> factorial(n: int) : int {
>>    ASSERT( n >= 1 )
>>    result := 1
>>    while (n > 1) {
>>       result *= n
>>       n--
>>    }
>>    return result
>> }
>>
>> factorial(n: int) : int {
>>    ASSERT( n >= 1 )
>>    if (n == 1)
>>       return 1
>>    return n * factorial(n-1)
>> }
>>
>> have vastly different state vectors (assuming I haven't botched the
>> implementations).
>>
>> So, just assuming you can <somehow> map one state vector into another
>> won't give you a "fix".
> So, let's say you're in the middle of calculating factorial(1,000,000,000,000)
> with algorithm 2.  Then you find out about algorithm 1 (or maybe decide that
> Stirling's approximation is close enough).

You (as an executing client who has called upon the "factorial service"
to perform that calculation) don't "find out about" anything!  To *you*,
nothing appears remiss.  That's the whole point; as long as the API
hasn't changed, you shouldn't care that the service has been replaced
with an equivalent service.

How the *system* ensures that illusion is maintained is the problem
being addressed.

The remedy that "makes most sense" will vary with the design (and
functionality) of the service being upgraded, and with the approach the
maintainer chooses to address those "rolling updates".

As it would be heavy-handed for the system to dictate how EVERY service
is coded AND the constraints placed upon their algorithms, the system
can only offer (prefabricated) *mechanisms* that the service designer
(and maintainer) can exploit to facilitate the upgrade.

And, the system has to rely on the designer/maintainer to make best use
of the mechanisms that it provides -- because the designer/maintainer
has more intimate knowledge of the way the service is intended to work.

A "lazy" designer may choose not to address live upgrade issues.
In which case, the system will resort to draconian measures when
an upgrade is installed:  it will KILL the running service and
let the clients deal with the resulting mess.  *Users* will then
either avoid products from that provider *or* will avoid upgrading
(if the consequences are too painful -- where "too" is a subjective
criterion defined by the user in question).

> What *can* you do with the unfinished solution other than dump the work
> and restart the problem with the new algorithm or let it finish (and next
> time use the new algorithm)?

I prepare a document using /WordProcessor25/.  The document can be seen
as a snapshot of the "conceptual document" that I seek to prepare.
I upgrade to /WordProcessor29/.  Is all of the work that I did prior
to that upgrade lost?  (Why not?  :>  )

On 4/24/2017 15:33, Don Y wrote:
> On 4/24/2017 7:48 AM, Phil Martel wrote:
>>>>>> I still do not see what your actual problem is.
>>>>>
>>>>> Find a piece of software that is currently executing:  your
>>>>> microwave oven controller, your PC (consider it a *collection*
>>>>> of software), your calculator, your ....
>>>>>
>>>>> Now, WHILE it is "solving some particular problem for which it
>>>>> was designed", pause the clock and replace all the INSTRUCTIONS
>>>>> in the program(s) with a new, revised program (it does <whatever>
>>>>> only "better" (the 8 digit calculator now handles 12 digits; the
>>>>> microwave oven now has 6 other types of cycles; the PC is now
>>>>> running Windows 11 instead of DOS 3.3; etc.)
>>>>>
>>>>> Let the clock resume.  None of the actions that were running
>>>>> at the time the clock was PAUSED should have been affected by
>>>>> the upgrade.  I.e., if the calculator was in the middle of
>>>>> computing "14!", it should continue to completion -- from
>>>>> wherever it happened to have been, at the time -- yielding
>>>>> the correct result.
>>>>>
>>>>> Note, however, that the result should now be displayed as
>>>>> 8.71782912*10^10 or 87178291200 and NOT as 8.7178291*10^10
>>>>> to reflect the extra precision that it has internally
>>>>> as well as the extended "display"/reporting capability
>>>>> (assuming, of course, that the original executable was
>>>>> interrupted before any loss of precision).
>>>> Since this group is for embedded processing, it is fair to ask why the
>>>> original calculator would have a display with more than 8 significant
>>>> figures.
>>>
>>> Why does the calculator *function* have to be implemented in
>>> a calculator *package*?  Do you not use <math.h> in your
>>> embedded applications?
>>
>> Obviously it doesn't have to be, but it may be. Perhaps "calculator" is a
>> poor example of what you're trying to explain.
>
> I'm trying to pick examples of "programs" that people can easily
> understand.  A calculator evaluating a transcendental function
> (i.e., something with some "meat" in it) could approach the
> problem in different ways (Taylor series, CORDIC, etc.) in
> different "revisions"/versions.
>
> So, (*ignoring* the desire to upgrade due to a *flaw* in the
> implementation,) it is conceivable that you would want to
> upgrade the algorithm to adopt an approach that converges
> more quickly.
>
> And, because the algorithm would be iterative, it is likely
> that it could be "in progress" when you choose to upgrade the
> software (e.g., an 80b floating-point "FMUL" can be a single
> instruction but FTANH probably isn't!).
>
> Finally, the approaches can vary significantly in terms of
> their resource requirements (e.g., temporary storage) making
> a direct mapping of one to the other virtually impossible.
>
>>>>> Put something in your microwave oven.  Set the timer to X.
>>>>> After an arbitrary amount of time, pause the process (processor)
>>>>> and replace the ROMs.  Resume the process.  EXPECT the entire
>>>>> process -- start to finish -- to proceed exactly as it would
>>>>> have had you not replaced the ROMs!
>>>> This assumes that you can replace the ROMs by some hot-swap process that
>>>> does not kill power to the RAM/registers that hold the state and quickly
>>>> enough that the food will not cool substantially.
>>>
>>> Again, imagination suggests you could implement the ROMs (i.e., the
>>> program TEXT) in other media that *can* be (effectively) replaced "between
>>> one clock cycle and the next".  This is all old technology.  The problem
>>> lies in doing so while some consumer (client) might be ACTIVELY executing
>>> within that block of program TEXT.
>>
>> I'm not familiar with *how* these systems do what they do.  Keeping the old
>> copy running while clients are in the middle of a transaction and perhaps
>> warning them to finish up is an option.
>
> That assumes they *will* "finish up" (consider a "black box" service that
> is always receiving "log" information) and in the time frame that *you*
> consider appropriate.  If you're shutting down a node in a cluster for
> periodic maintenance, you can probably afford to wait seconds/minutes
> for everything to come to an orderly state.  But, you can't make that
> generalization about all clients and dependencies (recall, many clients
> are, typically, *agents* -- "serving" clients of their own!)
>
> You can always ensure no *new* clients avail themselves of the "old"
> instance of the service thereby (hopefully) expediting its "release".
>
>>>> Also, the old program state must be
>>>> coded so that the new ROMs read and operate on it properly.
>>>
>>> No, that isn't necessary.  In fact, different algorithms may use
>>> inconsistent state vectors so that mapping from one algorithm to
>>> another is not possible.  That doesn't preclude "interrupting"
>>> existing processing, replacing the TEXT and finishing the
>>> processing with the "new" algorithm.
>>
>> Provided you translate and replace the existing state vector also.
>
> That may not be practical.
>
> factorial(n: int) : int {
>    ASSERT( n >= 1 )
>    result := 1
>    while (n > 1) {
>       result *= n
>       n--
>    }
>    return result
> }
>
> factorial(n: int) : int {
>    ASSERT( n >= 1 )
>    if (n == 1)
>       return 1
>    return n * factorial(n-1)
> }
>
> have vastly different state vectors (assuming I haven't botched the
> implementations).
>
> So, just assuming you can <somehow> map one state vector into another
> won't give you a "fix".
So, let's say you're in the middle of calculating
factorial(1,000,000,000,000) with algorithm 2.  Then you find out about
algorithm 1 (or maybe decide that Stirling's approximation is close
enough).  What *can* you do with the unfinished solution other than dump
the work and restart the problem with the new algorithm or let it
finish (and next time use the new algorithm)?
>
>>>> It sounds like a lot of work.
>>>
>>> That's why things like Windows want you to reboot so often!  :>
>>>
>>> OTOH, web sites and enterprise systems regularly roll out
>>> updates WHILE still providing services -- because the cost
>>> of shutting the systems/services down for that update can
>>> be substantial ("We're sorry, but the on-line banking transaction
>>> that you are engaged in AT THIS MOMENT will be aborted.  Please
>>> try again later.")
>>
>> I'm not familiar with *how* these systems do what they do.  Keeping the old
>> copy running while clients are in the middle of a transaction and perhaps
>> warning them to finish up is an option.
>
> I think most of these types of services are short-lived and/or
> transactional.  And, for services with human interaction, you can
> always hope the human "client" is "understanding"/patient (which
> is possible IF these types of inconveniences aren't frequent)
>
>>> (Would you want to have to *stop* your car to have the code in the
>>> ABS system updated -- given that stopping the car might not be
>>> possible, reliably, given the current state of the ABS code?  :> )
>>
>> Would you want to rely on the company that wrote the bad ABS code to fix it and
>> do so while your car was moving?  I suspect that the "fix it live" problem is
>> tougher than the "ABS" problem.
>
> *Undoubtedly* tougher!  OTOH, if there was sufficient risk (death or injury)
> to applying the brakes *prior to* installing the upgrade, I'd much prefer
> <someone> invest in *that* solution!  You can't tell the Apollo 13 crew
> that you'll fix their problem -- AFTER they return home...  :>
>
>> FILAAS (Fix it live as a service) might be possible if the processor and system
>> the ABS was running on was standard, but what about your cardiac pacemaker?  Is
>> that running on the same processor?
>
> Pacemaker is a perfect example of upgrade /in situ/.  Of course, the chances
> of the pacemaker needing to perform its function during the upgrade AND being
> unable to do so AND the patient dying while the doctor is standing nearby are
> probably pretty slim.  And, the pacemaker designer undoubtedly considered
> this capability in their design of the product.
>
> We worked out a bunch of different approaches to the problem Friday night.
> Unfortunately, no *one* is a panacea.  So, I'm working through the costs
> (and consequences) of each approach.  I've got an off-site/retreat coming
> up RSN so I hope to bring my problem to the table, there.
>
> As I can't rely on others (writing code to run in my system) to design
> components with this capability in mind, I need a fall-back strategy that
> will allow me to upgrade *those* components in the least painful way possible
> (if those folks' products end up "looking bad" as a result, it's their "image"
> to attend to).


-- 
Best wishes,
--Phil
pomartel At Comcast(ignore_this) dot net

On 4/24/2017 7:48 AM, Phil Martel wrote:
>>>>> I still do not see what your actual problem is.
>>>>
>>>> Find a piece of software that is currently executing:  your
>>>> microwave oven controller, your PC (consider it a *collection*
>>>> of software), your calculator, your ....
>>>>
>>>> Now, WHILE it is "solving some particular problem for which it
>>>> was designed", pause the clock and replace all the INSTRUCTIONS
>>>> in the program(s) with a new, revised program (it does <whatever>
>>>> only "better" (the 8 digit calculator now handles 12 digits; the
>>>> microwave oven now has 6 other types of cycles; the PC is now
>>>> running Windows 11 instead of DOS 3.3; etc.)
>>>>
>>>> Let the clock resume.  None of the actions that were running
>>>> at the time the clock was PAUSED should have been affected by
>>>> the upgrade.  I.e., if the calculator was in the middle of
>>>> computing "14!", it should continue to completion -- from
>>>> wherever it happened to have been, at the time -- yielding
>>>> the correct result.
>>>>
>>>> Note, however, that the result should now be displayed as
>>>> 8.71782912*10^10 or 87178291200 and NOT as 8.7178291*10^10
>>>> to reflect the extra precision that it has internally
>>>> as well as the extended "display"/reporting capability
>>>> (assuming, of course, that the original executable was
>>>> interrupted before any loss of precision).
>>> Since this group is for embedded processing, it is fair to ask why the
>>> original calculator would have a display with more than 8 significant
>>> figures.
>>
>> Why does the calculator *function* have to be implemented in
>> a calculator *package*?  Do you not use <math.h> in your
>> embedded applications?
>
> Obviously it doesn't have to be, but it may be. Perhaps "calculator" is a poor
> example of what you're trying to explain.

I'm trying to pick examples of "programs" that people can easily
understand.  A calculator evaluating a transcendental function
(i.e., something with some "meat" in it) could approach the
problem in different ways (Taylor series, CORDIC, etc.) in
different "revisions"/versions.

So, (*ignoring* the desire to upgrade due to a *flaw* in the
implementation,) it is conceivable that you would want to
upgrade the algorithm to adopt an approach that converges
more quickly.

And, because the algorithm would be iterative, it is likely
that it could be "in progress" when you choose to upgrade the
software (e.g., an 80b floating-point "FMUL" can be a single
instruction but FTANH probably isn't!).

Finally, the approaches can vary significantly in terms of
their resource requirements (e.g., temporary storage) making
a direct mapping of one to the other virtually impossible.

>>>> Put something in your microwave oven.  Set the timer to X.
>>>> After an arbitrary amount of time, pause the process (processor)
>>>> and replace the ROMs.  Resume the process.  EXPECT the entire
>>>> process -- start to finish -- to proceed exactly as it would
>>>> have had you not replaced the ROMs!
>>> This assumes that you can replace the ROMs by some hot-swap process that
>>> does not kill power to the RAM/registers that hold the state and quickly
>>> enough that the food will not cool substantially.
>>
>> Again, imagination suggests you could implement the ROMs (i.e., the
>> program TEXT) in other media that *can* be (effectively) replaced "between
>> one clock cycle and the next".  This is all old technology.  The problem
>> lies in doing so while some consumer (client) might be ACTIVELY executing
>> within that block of program TEXT.
>
> I'm not familiar with *how* these systems do what they do.  Keeping the old
> copy running while clients are in the middle of a transaction and perhaps
> warning them to finish up is an option.

That assumes they *will* "finish up" (consider a "black box" service that
is always receiving "log" information) and in the time frame that *you*
consider appropriate.  If you're shutting down a node in a cluster for
periodic maintenance, you can probably afford to wait seconds/minutes
for everything to come to an orderly state.  But, you can't make that
generalization about all clients and dependencies (recall, many clients
are, typically, *agents* -- "serving" clients of their own!)

You can always ensure no *new* clients avail themselves of the "old"
instance of the service thereby (hopefully) expediting its "release".
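
A toy sketch of that "no new clients on the old instance" rule, with hypothetical class and method names. New connections always go to the newest instance; an old instance is retired only once its last client lets go.

```python
class Instance:
    def __init__(self, version):
        self.version = version
        self.clients = set()

class Service:
    """Routes new clients to the newest instance; old ones drain and retire."""
    def __init__(self, version):
        self.current = Instance(version)
        self.draining = []

    def connect(self, client):
        self.current.clients.add(client)   # never placed on a draining instance
        return self.current

    def upgrade(self, version):
        if self.current.clients:
            self.draining.append(self.current)
        self.current = Instance(version)

    def disconnect(self, client, instance):
        instance.clients.discard(client)
        # Retire old instances once their last client has gone.
        self.draining = [i for i in self.draining if i.clients]

svc = Service("v1")
old = svc.connect("alice")
svc.upgrade("v2")
new = svc.connect("bob")
print(old.version, new.version, len(svc.draining))
svc.disconnect("alice", old)
print(len(svc.draining))
```

Note that nothing here forces "alice" to ever disconnect, which is exactly the "black box" logger problem: draining only helps when clients eventually finish.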

>>> Also, the old program state must be
>>> coded so that the new ROMs read and operate on it properly.
>>
>> No, that isn't necessary.  In fact, different algorithms may use
>> inconsistent state vectors so that mapping from one algorithm to
>> another is not possible.  That doesn't preclude "interrupting"
>> existing processing, replacing the TEXT and finishing the
>> processing with the "new" algorithm.
>
> Provided you translate and replace the existing state vector also.

That may not be practical.

factorial(n: int) : int {
    ASSERT( n >= 1 )
    result := 1
    while (n > 1) {
       result *= n
       n--
    }
    return result
}

factorial(n: int) : int {
    ASSERT( n >= 1 )
    if (n == 1)
       return 1
    return n * factorial(n-1)
}

have vastly different state vectors (assuming I haven't botched the
implementations).

So, just assuming you can <somehow> map one state vector into another
won't give you a "fix".
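
To make the difference concrete, here is a small sketch that exposes what each version is "holding" part-way through. The helper names and the dictionary layout are mine, purely for illustration: the loop keeps a counter and a partial product, while the recursion keeps no product at all, only a stack of multiplications it has deferred.

```python
def iterative_state(n, steps):
    """State of the loop version after `steps` iterations: two integers."""
    result = 1
    for _ in range(steps):
        if n <= 1:
            break
        result *= n
        n -= 1
    return {"n": n, "result": result}

def recursive_state(n, depth):
    """State of the recursive version `depth` calls deep: a stack of pending n's."""
    stack = []
    for _ in range(depth):
        if n == 1:
            break
        stack.append(n)    # each frame waits to multiply by its own n
        n -= 1
    return {"n": n, "pending": stack}

print(iterative_state(6, 3))   # partial product already accumulated
print(recursive_state(6, 3))   # no product yet, only deferred multiplications
```

For a toy like this, someone who knows both algorithms could hand-write a converter (multiply out the pending values).  But that knowledge lives with the designer; the system has no generic way to derive it.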

>>> It sounds like a lot of work.
>>
>> That's why things like Windows want you to reboot so often!  :>
>>
>> OTOH, web sites and enterprise systems regularly roll out
>> updates WHILE still providing services -- because the cost
>> of shutting the systems/services down for that update can
>> be substantial ("We're sorry, but the on-line banking transaction
>> that you are engaged in AT THIS MOMENT will be aborted.  Please
>> try again later.")
>
> I'm not familiar with *how* these systems do what they do.  Keeping the old
> copy running while clients are in the middle of a transaction and perhaps
> warning them to finish up is an option.

I think most of these types of services are short-lived and/or
transactional.  And, for services with human interaction, you can
always hope the human "client" is "understanding"/patient (which
is possible IF these types of inconveniences aren't frequent)

>> (Would you want to have to *stop* your car to have the code in the
>> ABS system updated -- given that stopping the car might not be
>> possible, reliably, given the current state of the ABS code?  :> )
>
> Would you want to rely on the company that wrote the bad ABS code to fix it and
> do so while your car was moving?  I suspect that the "fix it live" problem is
> tougher than the "ABS" problem.

*Undoubtedly* tougher!  OTOH, if there was sufficient risk (death or injury)
to applying the brakes *prior to* installing the upgrade, I'd much prefer
<someone> invest in *that* solution!  You can't tell the Apollo 13 crew
that you'll fix their problem -- AFTER they return home...  :>

> FILAAS (Fix it live as a service) might be possible if the processor and system
> the ABS was running on was standard, but what about your cardiac pacemaker?  Is
> that running on the same processor?

Pacemaker is a perfect example of upgrade /in situ/.  Of course, the chances
of the pacemaker needing to perform its function during the upgrade AND being
unable to do so AND the patient dying while the doctor is standing nearby are
probably pretty slim.  And, the pacemaker designer undoubtedly considered
this capability in their design of the product.

We worked out a bunch of different approaches to the problem Friday night.
Unfortunately, no *one* is a panacea.  So, I'm working through the costs
(and consequences) of each approach.  I've got an off-site/retreat coming
up RSN so I hope to bring my problem to the table, there.

As I can't rely on others (writing code to run in my system) to design
components with this capability in mind, I need a fall-back strategy that
will allow me to upgrade *those* components in the least painful way possible
(if those folks' products end up "looking bad" as a result, it's their "image"
to attend to).

On 4/23/2017 13:52, Don Y wrote:
> On 4/23/2017 10:01 AM, Phil Martel wrote:
>> On 4/22/2017 11:55, Don Y wrote:
>>> On 4/22/2017 4:09 AM, upsidedown@downunder.com wrote:
>>>> I still do not see what your actual problem is.
>>>
>>> Find a piece of software that is currently executing:  your
>>> microwave oven controller, your PC (consider it a *collection*
>>> of software), your calculator, your ....
>>>
>>> Now, WHILE it is "solving some particular problem for which it
>>> was designed", pause the clock and replace all the INSTRUCTIONS
>>> in the program(s) with a new, revised program (it does <whatever>
>>> only "better" (the 8 digit calculator now handles 12 digits; the
>>> microwave oven now has 6 other types of cycles; the PC is now
>>> running Windows 11 instead of DOS 3.3; etc.)
>>>
>>> Let the clock resume.  None of the actions that were running
>>> at the time the clock was PAUSED should have been affected by
>>> the upgrade.  I.e., if the calculator was in the middle of
>>> computing "14!", it should continue to completion -- from
>>> wherever it happened to have been, at the time -- yielding
>>> the correct result.
>>>
>>> Note, however, that the result should now be displayed as
>>> 8.71782912*10^10 or 87178291200 and NOT as 8.7178291*10^10
>>> to reflect the extra precision that it has internally
>>> as well as the extended "display"/reporting capability
>>> (assuming, of course, that the original executable was
>>> interrupted before any loss of precision).
>> Since this group is for embedded processing, it is fair to ask why the
>> original calculator would have a display with more than 8 significant
>> figures.
>
> Why does the calculator *function* have to be implemented in
> a calculator *package*?  Do you not use <math.h> in your
> embedded applications?
>
Obviously it doesn't have to be, but it may be. Perhaps "calculator" is 
a poor example of what you're trying to explain.

> With the tiniest bit of imagination, one should be able to consider
> a new math library that had greater precision *or* different
> algorithms that converged faster than the previous implementation.
>
> Given that you (I) can not shut the application down "for
> maintenance", how would you replace the library (used by multiple
> modules) in the application while the system was powered up and
> operating?   (see my previous examples for steps)
>
> Replace "library" with "service" and you have my original question
> (i.e., most libraries can be implemented *as* services with the
> re-formalization of the interface communication overhead)
>
>>> Put something in your microwave oven.  Set the timer to X.
>>> After an arbitrary amount of time, pause the process (processor)
>>> and replace the ROMs.  Resume the process.  EXPECT the entire
>>> process -- start to finish -- to proceed exactly as it would
>>> have had you not replaced the ROMs!
>> This assumes that you can replace the ROMs by some hot-swap process
>> that does not kill power to the RAM/registers that hold the state and
>> quickly enough that the food will not cool substantially.
>
> Again, imagination suggests you could implement the ROMs (i.e., the
> program TEXT) in other media that *can* be (effectively) replaced "between
> one clock cycle and the next".  This is all old technology.  The problem
> lies in doing so while some consumer (client) might be ACTIVELY executing
> within that block of program TEXT.
>

>> Also, the old program state must be
>> coded so that the new ROMs read and operate on it properly.
>
> No, that isn't necessary.  In fact, different algorithms may use
> inconsistent state vectors so that mapping from one algorithm to
> another is not possible.  That doesn't preclude "interrupting"
> existing processing, replacing the TEXT and finishing the
> processing with the "new" algorithm.
>
Provided you translate and replace the existing state vector also.
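
As a rough illustration of what "translating the state vector" could
mean, here is a minimal Python sketch.  Everything in it is hypothetical
(the names OldState, NewState, translate, and the two state layouts are
assumptions made for the example, not anything from an actual system):
the old factorial routine counts its multipliers upward, the new one
counts downward over a [lo, hi] range, and translate() maps one layout
onto the other so the job can finish under the new code.

```python
from dataclasses import dataclass


@dataclass
class OldState:
    """Hypothetical state of the OLD routine: multipliers applied in ascending order."""
    n: int        # argument of n!
    next_k: int   # next multiplier to apply
    product: int  # 1 * 2 * ... * (next_k - 1)


@dataclass
class NewState:
    """Hypothetical state of the NEW routine: consumes multipliers lo..hi from the top."""
    lo: int       # smallest multiplier still to apply
    hi: int       # largest multiplier still to apply
    product: int  # product of everything applied so far


def old_step(s: OldState) -> None:
    """One unit of work under the old algorithm."""
    s.product *= s.next_k
    s.next_k += 1


def translate(s: OldState) -> NewState:
    """Re-express the old state in the new layout without losing work done so far."""
    return NewState(lo=s.next_k, hi=s.n, product=s.product)


def new_finish(s: NewState) -> int:
    """Run the new algorithm to completion from whatever state it is handed."""
    while s.hi >= s.lo:
        s.product *= s.hi
        s.hi -= 1
    return s.product


def demo() -> int:
    old = OldState(n=14, next_k=1, product=1)
    for _ in range(5):            # old code gets partway: product = 5! = 120
        old_step(old)
    # "Upgrade" happens here: the old state is translated, the new code finishes.
    return new_finish(translate(old))
```

This only works because the new layout happens to be general enough to
express "multipliers 6..14 remain"; if the new routine's state could not
represent the old routine's partial result, no such translate() would
exist.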

>> It sounds like a lot of work.
>
> That's why things like Windows want you to reboot so often!  :>
>
> OTOH, web sites and enterprise systems regularly roll out
> updates WHILE still providing services -- because the cost
> of shutting the systems/services down for that update can
> be substantial ("We're sorry, but the on-line banking transaction
> that you are engaged in AT THIS MOMENT will be aborted.  Please
> try again later.")

I'm not familiar with *how* these systems do what they do.  Keeping the 
old copy running while clients are in the middle of a transaction and 
perhaps warning them to finish up is an option.
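
A minimal Python sketch of the "keep the old copy running" option,
under stated assumptions: the Service and Registry classes and their
method names are invented for illustration and do not describe how any
particular enterprise system does it.  New connections are routed to
the new instance; the old instance retires itself once its last
existing client has finished.

```python
class Service:
    """Hypothetical service instance that tracks its open client sessions."""

    def __init__(self, version: str):
        self.version = version
        self.active = 0        # sessions currently open on this instance
        self.draining = False  # True once a replacement has been registered
        self.retired = False   # True once it is safe to unload this instance

    def open(self) -> "Service":
        self.active += 1
        return self

    def close(self) -> None:
        self.active -= 1
        self._maybe_retire()

    def _maybe_retire(self) -> None:
        # An instance may only go away when it is draining AND idle.
        if self.draining and self.active == 0:
            self.retired = True


class Registry:
    """Hypothetical name service: decides which instance NEW requests reach."""

    def __init__(self, service: Service):
        self.current = service

    def connect(self) -> Service:
        return self.current.open()

    def upgrade(self, new_service: Service) -> Service:
        old, self.current = self.current, new_service
        old.draining = True    # a real system might also warn clients here
        old._maybe_retire()    # retire at once if nobody was connected
        return old


def demo():
    old = Service("v1")
    registry = Registry(old)
    session_a = registry.connect()    # starts a transaction on v1
    registry.upgrade(Service("v2"))
    session_b = registry.connect()    # arrives after the upgrade: gets v2
    still_running = not old.retired   # v1 is kept alive for session_a
    session_a.close()                 # transaction completes
    return session_a.version, session_b.version, still_running, old.retired
```

Note that a client which never calls close() keeps the old instance
alive indefinitely, so this approach alone does not bound how long the
old code stays resident.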
>
> (Would you want to have to *stop* your car to have the code in the
> ABS system updated -- given that stopping the car might not be
> possible, reliably, given the current state of the ABS code?  :> )

Would you want to rely on the company that wrote the bad ABS code to fix
it and do so while your car was moving?  I suspect that the "fix it
live" problem is tougher than the "ABS" problem.

FILAAS (Fix it live as a service) might be possible if the processor and
system the ABS was running on were standard, but what about your cardiac
pacemaker?  Is that running on the same processor?

-- 
Best wishes,
--Phil
pomartel At Comcast(ignore_this) dot net

On 4/23/2017 10:01 AM, Phil Martel wrote:
> On 4/22/2017 11:55, Don Y wrote:
>> On 4/22/2017 4:09 AM, upsidedown@downunder.com wrote:
>>> I still do not see what your actual problem is.
>>
>> Find a piece of software that is currently executing:  your
>> microwave oven controller, your PC (consider it a *collection*
>> of software), your calculator, your ....
>>
>> Now, WHILE it is "solving some particular problem for which it
>> was designed", pause the clock and replace all the INSTRUCTIONS
>> in the program(s) with a new, revised program (it does <whatever>
>> only "better" (the 8 digit calculator now handles 12 digits; the
>> microwave oven now has 6 other types of cycles; the PC is now
>> running Windows 11 instead of DOS 3.3; etc.)
>>
>> Let the clock resume.  None of the actions that were running
>> at the time the clock was PAUSED should have been affected by
>> the upgrade.  I.e., if the calculator was in the middle of
>> computing "14!", it should continue to completion -- from
>> wherever it happened to have been, at the time -- yielding
>> the correct result.
>>
>> Note, however, that the result should now be displayed as
>> 8.71782912*10^10 or 87178291200 and NOT as 8.7178291*10^10
>> to reflect the extra precision that it has internally
>> as well as the extended "display"/reporting capability
>> (assuming, of course, that the original executable was
>> interrupted before any loss of precision).
> Since this group is for embedded processing, it is fair to ask why the original
> calculator would have a display with more than 8 significant figures.

Why does the calculator *function* have to be implemented in
a calculator *package*?  Do you not use <math.h> in your
embedded applications?

With the tiniest bit of imagination, one should be able to consider
a new math library that had greater precision *or* different
algorithms that converged faster than the previous implementation.

Given that you (I) can not shut the application down "for
maintenance", how would you replace the library (used by multiple
modules) in the application while the system was powered up and
operating?   (see my previous examples for steps)

Replace "library" with "service" and you have my original question
(i.e., most libraries can be implemented *as* services with the
re-formalization of the interface communication overhead)
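
One way to picture the library-as-service idea is a level of
indirection between callers and the implementation.  The Python sketch
below is only that, a sketch: MathService, old_sqrt and replace() are
hypothetical names, and the "old library" is simulated with a
deliberately imprecise Newton iteration.  It handles only stateless
calls; a call that is in progress simply finishes with the
implementation it started with.

```python
import math


def old_sqrt(x: float) -> float:
    """Stand-in for the OLD library: three Newton iterations, so limited precision.

    Assumes x > 0 (the initial guess is x itself).
    """
    guess = x
    for _ in range(3):
        guess = (guess + x / guess) / 2.0
    return guess


class MathService:
    """Hypothetical indirection layer: clients call the service, never the library."""

    def __init__(self, impl):
        self._impl = impl

    def sqrt(self, x: float) -> float:
        impl = self._impl  # snapshot, so this call completes with one implementation
        return impl(x)

    def replace(self, impl) -> None:
        self._impl = impl  # later calls are routed to the new implementation


def demo():
    svc = MathService(old_sqrt)
    before = svc.sqrt(2.0)   # answered by the old, less precise code
    svc.replace(math.sqrt)   # "upgrade" while the client keeps its handle
    after = svc.sqrt(2.0)    # same handle, same API, better answer
    return before, after
```

The client's handle (svc) never changes across the upgrade, which is
the property being asked for.  What the sketch does not address is a
request whose intermediate state lives inside the old implementation at
the moment of replacement.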

>> Put something in your microwave oven.  Set the timer to X.
>> After an arbitrary amount of time, pause the process (processor)
>> and replace the ROMs.  Resume the process.  EXPECT the entire
>> process -- start to finish -- to proceed exactly as it would
>> have had you not replaced the ROMs!
> This assumes that you can replace the ROMs by some hot-swap process that does
> not kill power to the RAM/registers that hold the state and quickly enough that
> the food will not cool substantially.

Again, imagination suggests you could implement the ROMs (i.e., the
program TEXT) in other media that *can* be (effectively) replaced "between
one clock cycle and the next".  This is all old technology.  The problem
lies in doing so while some consumer (client) might be ACTIVELY executing
within that block of program TEXT.

> Also, the old program state must be
> coded so that the new ROMs read and operate on it properly.

No, that isn't necessary.  In fact, different algorithms may use
inconsistent state vectors so that mapping from one algorithm to
another is not possible.  That doesn't preclude "interrupting"
existing processing, replacing the TEXT and finishing the
processing with the "new" algorithm.

> It sounds like a lot of work.

That's why things like Windows want you to reboot so often!  :>

OTOH, web sites and enterprise systems regularly roll out
updates WHILE still providing services -- because the cost
of shutting the systems/services down for that update can
be substantial ("We're sorry, but the on-line banking transaction
that you are engaged in AT THIS MOMENT will be aborted.  Please
try again later.")

(Would you want to have to *stop* your car to have the code in the
ABS system updated -- given that stopping the car might not be
possible, reliably, given the current state of the ABS code?  :> )

On 4/22/2017 11:55, Don Y wrote:
> On 4/22/2017 4:09 AM, upsidedown@downunder.com wrote:
>> I still do not see what your actual problem is.
>
> Find a piece of software that is currently executing:  your
> microwave oven controller, your PC (consider it a *collection*
> of software), your calculator, your ....
>
> Now, WHILE it is "solving some particular problem for which it
> was designed", pause the clock and replace all the INSTRUCTIONS
> in the program(s) with a new, revised program (it does <whatever>
> only "better" (the 8 digit calculator now handles 12 digits; the
> microwave oven now has 6 other types of cycles; the PC is now
> running Windows 11 instead of DOS 3.3; etc.)
>
> Let the clock resume.  None of the actions that were running
> at the time the clock was PAUSED should have been affected by
> the upgrade.  I.e., if the calculator was in the middle of
> computing "14!", it should continue to completion -- from
> wherever it happened to have been, at the time -- yielding
> the correct result.
>
> Note, however, that the result should now be displayed as
> 8.71782912*10^10 or 87178291200 and NOT as 8.7178291*10^10
> to reflect the extra precision that it has internally
> as well as the extended "display"/reporting capability
> (assuming, of course, that the original executable was
> interrupted before any loss of precision).
Since this group is for embedded processing, it is fair to ask why the
original calculator would have a display with more than 8 significant
figures.
>
> Put something in your microwave oven.  Set the timer to X.
> After an arbitrary amount of time, pause the process (processor)
> and replace the ROMs.  Resume the process.  EXPECT the entire
> process -- start to finish -- to proceed exactly as it would
> have had you not replaced the ROMs!
This assumes that you can replace the ROMs by some hot-swap process that 
does not kill power to the RAM/registers that hold the state and quickly 
enough that the food will not cool substantially.  Also, the old program 
state must be coded so that the new ROMs read and operate on it properly.

It sounds like a lot of work.
>
>> Just swap the MAC addresses between the activating and passivating
>> server, and neither the client nor the client application notices
>> anything special.
>>
>> On the server side with stateless protocols such as UDP and LAT things
>> are quite straightforward.
>
> Communication protocols aren't the only places where state is involved.
> Start counting out loud.  The next time you encounter a person, switch
> to another language.  I.e., the algorithm by which you determine the
> next ordinal to speak has changed.  But, you've still got to remember
> which was the *last* previously spoken!
>
>> With stateful protocols like TCP, things get hairy if the protocol
>> state is maintained in kernel mode and is not swapped out and back
>> in to another process along with the user mode code. With the TCP
>> stack in user mode, this should not be a big problem.
>>
>


-- 
Best wishes,
--Phil
pomartel At Comcast(ignore_this) dot net

On 4/22/2017 4:09 AM, upsidedown@downunder.com wrote:
> I still do not see what your actual problem is.

Find a piece of software that is currently executing:  your
microwave oven controller, your PC (consider it a *collection*
of software), your calculator, your ....

Now, WHILE it is "solving some particular problem for which it
was designed", pause the clock and replace all the INSTRUCTIONS
in the program(s) with a new, revised program (it does <whatever>
only "better" (the 8 digit calculator now handles 12 digits; the
microwave oven now has 6 other types of cycles; the PC is now
running Windows 11 instead of DOS 3.3; etc.)

Let the clock resume.  None of the actions that were running
at the time the clock was PAUSED should have been affected by
the upgrade.  I.e., if the calculator was in the middle of
computing "14!", it should continue to completion -- from
wherever it happened to have been, at the time -- yielding
the correct result.

Note, however, that the result should now be displayed as
8.71782912*10^10 or 87178291200 and NOT as 8.7178291*10^10
to reflect the extra precision that it has internally
as well as the extended "display"/reporting capability
(assuming, of course, that the original executable was
interrupted before any loss of precision).
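
The calculator scenario can be mimicked in a few lines of Python.  This
is a sketch under assumptions, not a description of any real firmware:
the job's state is a plain dict holding an exact integer accumulator
(so no precision has been lost at the point of interruption), and the
function names and the 8-digit and 12-digit display widths are chosen
to match the example above.

```python
def run_steps(state: dict, steps: int) -> dict:
    """Advance the factorial job by at most `steps` multiplications."""
    for _ in range(steps):
        if state["k"] > state["n"]:
            break                      # job already complete
        state["acc"] *= state["k"]
        state["k"] += 1
    return state


def display(value: int, digits: int) -> str:
    """Show a non-negative integer as d.ddd*10^e, truncated to `digits` significant digits."""
    s = str(value)
    mantissa = s[:digits].rstrip("0") or "0"   # drop trailing zeros of the mantissa
    frac = mantissa[1:]
    head = mantissa[0] + ("." + frac if frac else "")
    return f"{head}*10^{len(s) - 1}"


def demo():
    # OLD firmware starts 14! and is paused partway through (after 6 steps).
    state = {"n": 14, "k": 1, "acc": 1}
    run_steps(state, 6)

    # NEW firmware is swapped in and resumes from the very same state.
    run_steps(state, 14)

    old_style = display(state["acc"], 8)    # what the 8-digit unit would have shown
    new_style = display(state["acc"], 12)   # what the upgraded unit shows
    return old_style, new_style
```

Here demo() yields "8.7178291*10^10" for the old display and
"8.71782912*10^10" for the new one, computed from a single
uninterrupted-looking job.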

Put something in your microwave oven.  Set the timer to X.
After an arbitrary amount of time, pause the process (processor)
and replace the ROMs.  Resume the process.  EXPECT the entire
process -- start to finish -- to proceed exactly as it would
have had you not replaced the ROMs!

> Just swap the MAC addresses between the activating and passivating
> server, and neither the client nor the client application notices
> anything special.
>
> On the server side with stateless protocols such as UDP and LAT things
> are quite straightforward.

Communication protocols aren't the only places where state is involved.
Start counting out loud.  The next time you encounter a person, switch
to another language.  I.e., the algorithm by which you determine the
next ordinal to speak has changed.  But, you've still got to remember
which was the *last* previously spoken!
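
The counting analogy, as a small Python sketch (the Counter class and
the two word lists are illustrative assumptions only): the "algorithm"
for naming the next number is replaced mid-count, while the one piece
of state that matters, the last number spoken, carries across.

```python
ENGLISH = ["zero", "one", "two", "three", "four", "five"]
SPANISH = ["cero", "uno", "dos", "tres", "cuatro", "cinco"]


class Counter:
    """Hypothetical counter whose naming rule can be replaced between utterances."""

    def __init__(self, words):
        self.last = 0        # STATE: last number spoken, independent of language
        self.words = words   # ALGORITHM: how the next number is rendered

    def speak(self) -> str:
        self.last += 1
        return self.words[self.last]

    def switch(self, words) -> None:
        self.words = words   # replace the algorithm; self.last is left untouched


def demo():
    c = Counter(ENGLISH)
    spoken = [c.speak(), c.speak()]   # "one", "two"
    c.switch(SPANISH)                 # you encounter a person
    spoken.append(c.speak())          # continues at three, not back at one
    return spoken
```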

> With stateful protocols like TCP, things get hairy if the protocol
> state is maintained in kernel mode and is not swapped out and back
> in to another process along with the user mode code. With the TCP
> stack in user mode, this should not be a big problem.
>

On Fri, 21 Apr 2017 17:51:42 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 4/21/2017 1:06 PM, upsidedown@downunder.com wrote:
>> On Fri, 21 Apr 2017 10:43:28 -0700, Don Y
>> <blockedofcourse@foo.invalid> wrote:
>>
>>> On 4/21/2017 12:45 AM, upsidedown@downunder.com wrote:
>>>> On Thu, 20 Apr 2017 16:05:00 -0700, Don Y
>>>> <blockedofcourse@foo.invalid> wrote:
>>>>
>>>>> My application runs 24/7/365.  There's no "reset" switch.  So,
>>>>> components (hardware AND software) that are upgraded are done
>>>>> so while the system is live.
>>>>>
>>>>> [I suspect there are six nines services on-line that behave
>>>>> similarly.  But, if inspected "in the small", I wonder how
>>>>> clean the switchovers are?  And, how durable the connections?]
>>>>
>>>> Before (re)inventing the wheel, take a look how VAXcluster (now
>>>> VMScluster) has done it since the 1980's.
>>>
>>> In cluster environments, nodes tend to be indivisible entities.
>>> When you update the software on a node, you update the node,
>>> itself.  You "kick off" the processes that are running on the
>>> node just prior to the upgrade (even if that means migrating them
>>> to another node -- with an OLD copy of the service they are using
>>> at the time) and summarily replace the node (and the services it
>>> provides).
>>>
>>> [If you've migrated the existing connections to services that WERE
>>> running on that node to some other node, then you still have those
>>> clients running on that other node -- potentially indefinitely with
>>> the OLD server code!]
>>>
>>> Imagine, instead, that some processes are using the "file service"
>>> (whatever THAT is!) on *a* node.  You want to upgrade the file service
>>> code (without affecting any of the other services that are running
>>> on the node) WHILE the file service remains in use on that node.
>>>
>>> I.e., install the new service and start it running.  Change the
>>> service registry to reference the new server instance for *new*
>>> service requests (i.e., any files that are accessed AFTER this
>>> point will be handled by the NEW file service).  Allow the old
>>> file service instance to remain active to finish servicing any
>>> existing connections.
>>>
>>> *Eventually*, the preexisting connections will be completely serviced
>>> (those files closed, etc.).  Because all NEW requests are handled by
>>> the NEW service, the old service will eventually find itself with no
>>> work to do -- no active connections (clients).  At that point, it
>>> can terminate itself with no deleterious impact on the system.
>>>
>>> The problem comes with clients that "linger" on the old service longer
>>> than you'd like.  E.g., imagine a process that opens a dribble file and
>>> leaves it open FOREVER.  That would stake a continuous claim on the
>>> old service preventing it from ever being "replaced".
>>>
>>> Or, you might be in a *hurry* to replace a service -- before the
>>> clients currently using it are naturally *done* with it.
>>>
>>> So, you need a way of migrating the *active* connections to another
>>> server ALONG WITH THE INTERNAL STATE associated with each of those
>>> connections.
>>>
>>> For a file service, that state might include an inode number, access
>>> mode (R/W), current file offset (for read or write), any buffered data
>>> (to be written or already read-ahead), any media I/O actions "in progress",
>>> etc.
>>>
>>> But, the *new* service may associate different state with each connection
>>> as dictated by *its* implementation.  So, simply "copying" the state
>>> from the old service to the new service won't suffice; there needs to
>>> be some "state translation" that takes place to ensure the client's
>>> connection remains semantically intact across the transition between
>>> servers.
>>>
>>> Or, a modification of the server contract that allows any server to
>>> simply state, "I quit" and let the clients figure out how to recover
>>> or restore their use of that service (boo, hiss!).
>>>
>>> I'll be meeting up with some local colleagues, tonight, for 12oz curls.
>>> I'll see if any of the guys who work in "enterprises" can shed some light
>>> on the approach they take to this sort of thing.  Though I suspect their
>>> users are more "transient" than persistent.  So, they will leave a service in
>>> short order as a natural consequence of their operation (in which case,
>>> just registering the new service and waiting would suffice).
>>
>> It is more than a quarter of a century since I ran a
>> large VAXcluster with a dozen cabinet-size CPUs, but I will try to
>> recall some of the details.
>>
>> If you have multiple CPUs with shared (and mirrored) disks, switching
>> an active process from one CPU to another is quite easy. As long
>> as the OS supports process checkpointing or swapping out a complete
>> process to disk, things are easy. Instead of swapping in a process
>> from disk back into the original CPU, just swap it in to another
>> CPU :-)
>
>That's not the same thing.   That's *migrating* a process to a different
>CPU.  You're moving the entire state of the process to resume execution
>on another CPU.  All the "variables" AND all the instructions that
>interpret those variables!
>
>I want to "alter the executable" while it is running -- change the
>instructions and (somehow) tweak the variables so their current
>values "make sense" when interpreted by a different set of instructions!
>
>I'm typing a "followup" to your message using Thunderbird.  I
>(the human) can be regarded as a client of Thunderbird.  I am engaged
>in an interaction with it -- my CONNECTION to it persists continuously
>as I am typing this message.
>
>WHILE I AM TYPING, I want something to be able to sneak in and
>REPLACE the copy of Thunderbird that is executing in my computer's
>memory -- not just the copy that resides on the disk (which Windows
>won't examine until the next time I *load* Thunderbird) -- and to
>do so such that this message ends up intact as it is eventually
>posted to the NNTP server.
>
>I.e., to do this, you'd need to capture a copy of what I've typed
>up to the instant the upgrade is switched in *under* me.  It
>would have to know how the windows that the old Thunderbird
>instance was using were maintained by the OS, and the source
>of keystrokes and other user interface events.
>
>It would be messy and tedious to get it "right" -- but not impossible.
>
>A far easier goal would be to swap the executable bound to "Thunderbird.exe"
>so that the next time I invoked Thunderbird, I'd get the NEW executable;
>let my current interaction run to completion with the *old* executable!
>
>But, there's no guarantee that I will terminate this Thunderbird session
>anytime soon.  Or *ever*!
>
>The OS can forcibly move the user interface connections to another
>process running on the same -- or different -- node.  But, that doesn't
>mean the client's (i.e., user's) experience will be "continuous"
>or coherent.

I still do not see what your actual problem is.

Just swap the MAC addresses between the activating and passivating
server, and neither the client nor the client application notices
anything special.

On the server side with stateless protocols such as UDP and LAT things
are quite straightforward.

With stateful protocols like TCP, things get hairy if the protocol
state is maintained in kernel mode and is not swapped out and back
in to another process along with the user mode code. With the TCP
stack in user mode, this should not be a big problem.