EmbeddedRelated.com
The 2024 Embedded Online Conference

Dynamic upgrading/Hot-swapping a service

Started by Don Y April 20, 2017
On 4/23/2017 13:52, Don Y wrote:
> On 4/23/2017 10:01 AM, Phil Martel wrote: >> On 4/22/2017 11:55, Don Y wrote: >>> On 4/22/2017 4:09 AM, upsidedown@downunder.com wrote: >>>> I still do not see what your actual problem is. >>> >>> Find a piece of software that is currently executing: your >>> microwave oven controller, your PC (consider it a *collection* >>> of software), your calculator, your .... >>> >>> Now, WHILE it is "solving some particular problem for which it >>> was designed", pause the clock and replace all the INSTRUCTIONS >>> in the program(s) with a new, revised program (it does <whatever> >>> only "better" (the 8 digit calculator now handles 12 digits; the >>> microwave oven now has 6 other types of cycles; the PC is now >>> running Windows 11 instead of DOS 3.3; etc.) >>> >>> Let the clock resume. None of the actions that were running >>> at the time the clock was PAUSED should have been affected by >>> the upgrade. I.e., if the calculator was in the middle of >>> computing "14!", it should continue to completion -- from >>> wherever it happened to have been, at the time -- yielding >>> the correct result. >>> >>> Note, however, that the result should now be displayed as >>> 8.71782912*10^10 or 87178291200 and NOT as 8.7178291*10^10 >>> to reflect the extra precision that it has internally >>> as well as the extended "display"/reporting capability >>> (assuming, of course, that the original executable was >>> interrupted before any loss of precision). >> Since this group is for embedded processing, it is fair to ask why the >> original >> calculator would have a display with more that 8 significant figures. > > Why does the calculator *function* have to be implemented in > a calculator *package*? Do you not use <math.h> in your > embedded applications? >
Obviously it doesn't have to be, but it may be. Perhaps "calculator" is a poor example of what you're trying to explain.
> With the tiniest bit of imagination, one should be able to consider > a new math library that had greater precision *or* different > algorithms that converged faster than the previous implementation. > > Given that you (I) can not shut the application down "for > maintenance", how would you replace the library (used by multiple > modules) in the application while the system was powered up and > operating? (see my previous examples for steps) > > Replace "library" with "service" and you have my original question > (i.e., most libraries can be implemented *as* services with the > re-formalization of the interface communication overhead) > >>> Put something in your microwave oven. Set the timer to X. >>> After an arbitrary amount of time, pause the process (processor) >>> and replace the ROMs. Resume the process. EXPECT the entire >>> process -- start to finish -- to proceed exactly as it would >>> have had you not replaced the ROMs! >> This assumes that you can replace the ROMs by some hot-swap process >> that does >> not kill power to the RAM/registers that hold the state and quickly >> enough that >> the food will not cool substantially. > > Again, imagination suggests you could implement the ROMs (i.e., the > program TEXT) in other media that *can* be (effectively) replaced "between > one clock cycle and the next". This is all old technology. The problem > lies in doing so while some consumer (client) might be ACTIVELY executing > within that block of program TEXT. >
I'm not familiar with *how* these systems do what they do. Keeping the old copy running while clients are in the middle of a transaction and perhaps warning them to finish up is an option.
>> Also, the old program state must be >> coded so that the new ROMs read and operate on it properly. > > No, that isn't necessary. In fact, different algorithms may use > inconsistent state vectors so that mapping from one algorithm to > another is not possible. That doesn't preclude "interrupting" > existing processing, replacing the TEXT and finishing the > processing with the "new" algorithm. >
Provided you translate and replace the existing state vector also.
>> It sounds like a lot of work. > > That's why things like Windows want you to reboot so often! :> > > OTOH, web sites and enterprise systems regularly roll out > updates WHILE still providing services -- because the cost > of shutting the systems/services down for that update can > be substantial ("We're sorry, but the on-line banking transaction > that you are engaged in AT THIS MOMENT will be aborted. Please > try again later.")
I'm not familiar with *how* these systems do what they do. Keeping the old copy running while clients are in the middle of a transaction and perhaps warning them to finish up is an option.
> > (Would you want to have to *stop* your car to have the code in the > ABS system updated -- given that stopping the car might not be > possible, reliably, given the current state of the ABS code? :> )
Would you want to rely on the company that wrote the bad ABS code to fix it and do so while your car was moving? I suspect that the "fix it live" problem is tougher than the "ABS" problem.

FILAAS (Fix it live as a service) might be possible if the processor and system the ABS was running on was standard, but what about your cardiac pacemaker? Is that running on the same processor?

--
Best wishes,
--Phil
pomartel At Comcast(ignore_this) dot net
On 4/24/2017 7:48 AM, Phil Martel wrote:
>>>>> I still do not see what your actual problem is. >>>> >>>> Find a piece of software that is currently executing: your >>>> microwave oven controller, your PC (consider it a *collection* >>>> of software), your calculator, your .... >>>> >>>> Now, WHILE it is "solving some particular problem for which it >>>> was designed", pause the clock and replace all the INSTRUCTIONS >>>> in the program(s) with a new, revised program (it does <whatever> >>>> only "better" (the 8 digit calculator now handles 12 digits; the >>>> microwave oven now has 6 other types of cycles; the PC is now >>>> running Windows 11 instead of DOS 3.3; etc.) >>>> >>>> Let the clock resume. None of the actions that were running >>>> at the time the clock was PAUSED should have been affected by >>>> the upgrade. I.e., if the calculator was in the middle of >>>> computing "14!", it should continue to completion -- from >>>> wherever it happened to have been, at the time -- yielding >>>> the correct result. >>>> >>>> Note, however, that the result should now be displayed as >>>> 8.71782912*10^10 or 87178291200 and NOT as 8.7178291*10^10 >>>> to reflect the extra precision that it has internally >>>> as well as the extended "display"/reporting capability >>>> (assuming, of course, that the original executable was >>>> interrupted before any loss of precision). >>> Since this group is for embedded processing, it is fair to ask why the >>> original >>> calculator would have a display with more that 8 significant figures. >> >> Why does the calculator *function* have to be implemented in >> a calculator *package*? Do you not use <math.h> in your >> embedded applications? > > Obviously it doesn't have to be, but it may be. Perhaps "calculator" is a poor > example of what you're trying to explain
I'm trying to pick examples of "programs" that people can easily understand. A calculator evaluating a transcendental function (i.e., something with some "meat" in it) could approach the problem in different ways (Taylor series, CORDIC, etc.) in different "revisions"/versions.

So, (*ignoring* the desire to upgrade due to a *flaw* in the implementation,) it is conceivable that you would want to upgrade the algorithm to adopt an approach that converges more quickly.

And, because the algorithm would be iterative, it is likely that it could be "in progress" when you choose to upgrade the software (e.g., an 80b floating-point "FMUL" can be a single instruction but FTANH probably isn't!).

Finally, the approaches can vary significantly in terms of their resource requirements (e.g., temporary storage) making a direct mapping of one to the other virtually impossible.
>>>> Put something in your microwave oven. Set the timer to X. >>>> After an arbitrary amount of time, pause the process (processor) >>>> and replace the ROMs. Resume the process. EXPECT the entire >>>> process -- start to finish -- to proceed exactly as it would >>>> have had you not replaced the ROMs! >>> This assumes that you can replace the ROMs by some hot-swap process >>> that does >>> not kill power to the RAM/registers that hold the state and quickly >>> enough that >>> the food will not cool substantially. >> >> Again, imagination suggests you could implement the ROMs (i.e., the >> program TEXT) in other media that *can* be (effectively) replaced "between >> one clock cycle and the next". This is all old technology. The problem >> lies in doing so while some consumer (client) might be ACTIVELY executing >> within that block of program TEXT. > > I'm not familiar with *how* these systems do what they do. Keeping the old > copy running while clients are in the middle of a transaction and perhaps > warning them to finish up is an option.
That assumes they *will* "finish up" (consider a "black box" service that is always receiving "log" information) and in the time frame that *you* consider appropriate. If you're shutting down a node in a cluster for periodic maintenance, you can probably afford to wait seconds/minutes for everything to come to an orderly state. But, you can't make that generalization about all clients and dependencies (recall, many clients are, typically, *agents* -- "serving" clients of their own!)

You can always ensure no *new* clients avail themselves of the "old" instance of the service thereby (hopefully) expediting its "release".
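To make the "no new clients on the old instance" idea concrete, here is a minimal sketch. The Router class and its method names are hypothetical, invented for illustration; nothing like it has been described in this thread.

```python
class Router:
    """Hypothetical front door for a service: hands out instances to clients."""

    def __init__(self, instance):
        self.current = instance          # instance offered to new clients
        self.active = {instance: 0}      # instance -> clients still attached

    def connect(self):
        # New clients always bind to the current instance.
        self.active[self.current] += 1
        return self.current

    def disconnect(self, instance):
        self.active[instance] -= 1
        # A retired instance with no clients left can finally be released.
        if instance != self.current and self.active[instance] == 0:
            del self.active[instance]

    def upgrade(self, new_instance):
        # Stop offering the old instance; existing clients keep using it.
        old = self.current
        self.current = new_instance
        self.active[new_instance] = 0
        if self.active[old] == 0:
            del self.active[old]


router = Router("v1")
a = router.connect()                 # bound to v1
router.upgrade("v2")
b = router.connect()                 # bound to v2
lingering = sorted(router.active)    # v1 is still held by its client
router.disconnect(a)
remaining = sorted(router.active)    # v1 released once its last client leaves
```

Note that a client which never disconnects (the "black box" logger above) keeps the old instance alive indefinitely, which is exactly why this can only "hopefully" expedite the release.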
>>> Also, the old program state must be >>> coded so that the new ROMs read and operate on it properly. >> >> No, that isn't necessary. In fact, different algorithms may use >> inconsistent state vectors so that mapping from one algorithm to >> another is not possible. That doesn't preclude "interrupting" >> existing processing, replacing the TEXT and finishing the >> processing with the "new" algorithm. > > Provided you translate and replace the existing state vector also.
That may not be practical.

    factorial(n: int) : int {
        ASSERT( n >= 1 )
        result := 1
        while (n > 1) {
            result *= n
            n--
        }
        return result
    }

    factorial(n: int) : int {
        ASSERT( n >= 1 )
        if (n == 1)
            return 1
        return n * factorial(n-1)
    }

have vastly different state vectors (assuming I haven't botched the implementations).

So, just assuming you can <somehow> map one state vector into another won't give you a "fix".
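As a rough illustration of how different those state vectors are, the sketch below makes the live state of each version explicit. The function names and the dictionary layout are assumptions made for this example, and the recursive version's call stack is modelled as a plain list.

```python
def iterative_states(n):
    """Yield the live state of the loop version after each iteration."""
    result = 1
    while n > 1:
        result *= n
        n -= 1
        yield {"n": n, "result": result}


def recursive_states(n):
    """Yield the modelled call stack of the recursive version as it descends."""
    stack = []
    while n > 1:
        stack.append(n)      # frame waiting to compute n * factorial(n-1)
        n -= 1
        yield {"pending": list(stack), "n": n}


def finish_iterative(state):
    n, result = state["n"], state["result"]
    while n > 1:
        result *= n
        n -= 1
    return result


def finish_recursive(state):
    stack, n = list(state["pending"]), state["n"]
    while n > 1:             # keep descending to the base case
        stack.append(n)
        n -= 1
    result = 1               # value returned by the base case
    while stack:
        result *= stack.pop()   # unwind one frame
    return result


loop_state = list(iterative_states(14))[2]    # interrupted after three steps
stack_state = list(recursive_states(14))[2]
```

Interrupted at the same point in computing 14!, the loop version holds two integers while the recursive version holds a stack whose depth grows with n. Each state is only meaningful to the code that produced it.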
>>> It sounds like a lot of work. >> >> That's why things like Windows want you to reboot so often! :> >> >> OTOH, web sites and enterprise systems regularly roll out >> updates WHILE still providing services -- because the cost >> of shutting the systems/services down for that update can >> be substantial ("We're sorry, but the on-line banking transaction >> that you are engaged in AT THIS MOMENT will be aborted. Please >> try again later.") > > I'm not familiar with *how* these systems do what they do. Keeping the old > copy running while clients are in the middle of a transaction and perhaps > warning them to finish up is an option.
I think most of these types of services are short-lived and/or transactional. And, for services with human interaction, you can always hope the human "client" is "understanding"/patient (which is possible IF these types of inconveniences aren't frequent).
>> (Would you want to have to *stop* your car to have the code in the >> ABS system updated -- given that stopping the car might not be >> possible, reliably, given the current state of the ABS code? :> ) > > Would you want to rely on the company that wrote the bad ABS code to fix it and > do so while your car was moving? I suspect that the "fix it live" problem is > tougher that the "ABS" problem.
*Undoubtedly* tougher! OTOH, if there was sufficient risk (death or injury) to applying the brakes *prior to* installing the upgrade, I'd much prefer <someone> invest in *that* solution! You can't tell the Apollo 13 crew that you'll fix their problem -- AFTER they return home... :>
> FILAAS (Fix it live as a service) might be possible if the processor and system > the ABS was running on was standard, but what about your cardiac pacemaker? Is > that running on the same processor?
Pacemaker is a perfect example of upgrade /in situ/. Of course, the chances of the pacemaker needing to perform its function during the upgrade AND being unable to do so AND the patient dying while the doctor is standing nearby are probably pretty slim. And, the pacemaker designer undoubtedly considered this capability in their design of the product.

We worked out a bunch of different approaches to the problem Friday night. Unfortunately, no *one* is a panacea. So, I'm working through the costs (and consequences) of each approach. I've got an off-site/retreat coming up RSN so I hope to bring my problem to the table, there.

As I can't rely on others (writing code to run in my system) to design components with this capability in mind, I need a fall-back strategy that will allow me to upgrade *those* components in the least painful way possible (if those folks' products end up "looking bad" as a result, it's their "image" to attend to).
On 4/24/2017 15:33, Don Y wrote:
> On 4/24/2017 7:48 AM, Phil Martel wrote: >>>>>> I still do not see what your actual problem is. >>>>> >>>>> Find a piece of software that is currently executing: your >>>>> microwave oven controller, your PC (consider it a *collection* >>>>> of software), your calculator, your .... >>>>> >>>>> Now, WHILE it is "solving some particular problem for which it >>>>> was designed", pause the clock and replace all the INSTRUCTIONS >>>>> in the program(s) with a new, revised program (it does <whatever> >>>>> only "better" (the 8 digit calculator now handles 12 digits; the >>>>> microwave oven now has 6 other types of cycles; the PC is now >>>>> running Windows 11 instead of DOS 3.3; etc.) >>>>> >>>>> Let the clock resume. None of the actions that were running >>>>> at the time the clock was PAUSED should have been affected by >>>>> the upgrade. I.e., if the calculator was in the middle of >>>>> computing "14!", it should continue to completion -- from >>>>> wherever it happened to have been, at the time -- yielding >>>>> the correct result. >>>>> >>>>> Note, however, that the result should now be displayed as >>>>> 8.71782912*10^10 or 87178291200 and NOT as 8.7178291*10^10 >>>>> to reflect the extra precision that it has internally >>>>> as well as the extended "display"/reporting capability >>>>> (assuming, of course, that the original executable was >>>>> interrupted before any loss of precision). >>>> Since this group is for embedded processing, it is fair to ask why the >>>> original >>>> calculator would have a display with more that 8 significant figures. >>> >>> Why does the calculator *function* have to be implemented in >>> a calculator *package*? Do you not use <math.h> in your >>> embedded applications? >> >> Obviously it doesn't have to be, but it may be. Perhaps "calculator" >> is a poor >> example of what you're trying to explain > > I'm trying to pick examples of "programs" that people can easily > understand. 
A calculator evaluating a transcendental function > (i.e., something with some "meat" in it) could approach the > problem in different ways (Taylor series, CORDIC, etc.) in > different "revisions"/versions. > > So, (*ignoring* the desire to upgrade due to a *flaw* in the > implementation,) it is conceivable that you would want to > upgrade the algorithm to adopt an approach that converges > more quickly. > > And, because the algorithm would be iterative, it is likely > that it could be "in progress" when you choose to upgrade the > software (e.g., an 80b floating-point "FMUL" can be a single > instruction but FTANH probably isn't!). > > Finally, the approaches can vary significantly in terms of > their resource requirements (e.g., temporary storage) making > a direct mapping of one to the other virtually impossible. > >>>>> Put something in your microwave oven. Set the timer to X. >>>>> After an arbitrary amount of time, pause the process (processor) >>>>> and replace the ROMs. Resume the process. EXPECT the entire >>>>> process -- start to finish -- to proceed exactly as it would >>>>> have had you not replaced the ROMs! >>>> This assumes that you can replace the ROMs by some hot-swap process >>>> that does >>>> not kill power to the RAM/registers that hold the state and quickly >>>> enough that >>>> the food will not cool substantially. >>> >>> Again, imagination suggests you could implement the ROMs (i.e., the >>> program TEXT) in other media that *can* be (effectively) replaced >>> "between >>> one clock cycle and the next". This is all old technology. The problem >>> lies in doing so while some consumer (client) might be ACTIVELY >>> executing >>> within that block of program TEXT. >> >> I'm not familiar with *how* these systems do what they do. Keeping >> the old >> copy running while clients are in the middle of a transaction and perhaps >> warning them to finish up is an option. 
> > That assumes they *will* "finish up" (consider a "black box" service that > is always receiving "log" information) and in the time frame that *you* > consider appropriate. If you're shutting down a node in a cluster for > periodic maintenance, you can probably afford to wait seconds/minutes > for everything to come to an orderly state. But, you can't make that > generalization about all clients and dependencies (recall, many clients > are, typically, *agents* -- "serving" clients of their own!) > > You can always ensure no *new* clients avail themselves of the "old" > instance of the service thereby (hopefully) expediting its "release". > >>>> Also, the old program state must be >>>> coded so that the new ROMs read and operate on it properly. >>> >>> No, that isn't necessary. In fact, different algorithms may use >>> inconsistent state vectors so that mapping from one algorithm to >>> another is not possible. That doesn't preclude "interrupting" >>> existing processing, replacing the TEXT and finishing the >>> processing with the "new" algorithm. >> >> Provided you translate and replace the existing state vector also. > > That may not be practical. > > factorial(n: int) : int > ASSERT( n >= 1 ) > result := 1 > while (n > 1) { > result *= n > n-- > } > return result > } > > factorial(n: int) : int > ASSERT( n >= 1 ) > if (n == 1) > return 1 > return N * factorial(n-1) > ) > > have vastly different state vectors (assuming I haven't botched the > implementations). > > So, just assuming you can <somehow> map one state vector into another > won't give you a "fix".
So, let's say you're in the middle of calculating factorial(1,000,000,000,000) with algorithm 2. Then you find out about algorithm 1 (or maybe decide that Stirling's approximation is close enough). What *can* you do with the unfinished solution other than dump the work and restart the problem with the new algorithm, or let it finish (and use the new algorithm next time)?
> >>>> It sounds like a lot of work. >>> >>> That's why things like Windows want you to reboot so often! :> >>> >>> OTOH, web sites and enterprise systems regularly roll out >>> updates WHILE still providing services -- because the cost >>> of shutting the systems/services down for that update can >>> be substantial ("We're sorry, but the on-line banking transaction >>> that you are engaged in AT THIS MOMENT will be aborted. Please >>> try again later.") >> >> I'm not familiar with *how* these systems do what they do. Keeping >> the old >> copy running while clients are in the middle of a transaction and perhaps >> warning them to finish up is an option. > > I think most of these types of services are short-lived and/or > transactional. And, for services with human interaction, you can > always hope the human "client" is "understanding"/patient (which > is possible IF these types of inconveniences aren't frequent) > >>> (Would you want to have to *stop* your car to have the code in the >>> ABS system updated -- given that stopping the car might not be >>> possible, reliably, given the current state of the ABS code? :> ) >> >> Would you want to rely on the company that wrote the bad ABS code to >> fix it and >> do so while your car was moving? I suspect that the "fix it live" >> problem is >> tougher that the "ABS" problem. > > *Undoubtedly* tougher! OTOH, if there was sufficient risk (death or > injury) > to applying the brakes *prior to* installing the upgrade, I'd much prefer > <someone> invest in *that* solution! You can't tell the Apollo 13 crew > that you'll fix their problem -- AFTER they return home... :> > >> FILAAS (Fix it live as a service) might be possible if the processor >> and system >> the ABS was running on was standard, but what about your cardiac >> pacemaker? Is >> that running on the same processor? > > Pacemaker is a perfect example of upgrade /in situ/. 
Of course, the > chances > of the pacemaker needing to perform its function during the upgrade AND > being > unable to do so AND the patient dying while the doctor is standing > nearby is > probably pretty slim. And, the pace maker designer undoubtedly considered > this capability in their design of the product. > > We worked out a bunch of different approaches to the problem Friday night. > Unfortunately, no *one* is a panacea. So, I'm working through the costs > (and consequences) of each approach. I've got an off-site/retreat coming > up RSN so I hope to bring my problem to the table, there. > > As I can't rely on others (writing code to run in my system) to design > components with this capability in mind, I need a fall-back strategy that > will allow me to upgrade *those* components in the least painful way > possible > (if those folks' products end up "looking bad" as a result, its their > "image" > to attend to).
--
Best wishes,
--Phil
pomartel At Comcast(ignore_this) dot net
On 4/24/2017 7:02 PM, Phil Martel wrote:
>>> Provided you translate and replace the existing state vector also.
>>
>> That may not be practical.
>>
>>     factorial(n: int) : int {
>>         ASSERT( n >= 1 )
>>         result := 1
>>         while (n > 1) {
>>             result *= n
>>             n--
>>         }
>>         return result
>>     }
>>
>>     factorial(n: int) : int {
>>         ASSERT( n >= 1 )
>>         if (n == 1)
>>             return 1
>>         return n * factorial(n-1)
>>     }
>>
>> have vastly different state vectors (assuming I haven't botched the
>> implementations).
>>
>> So, just assuming you can <somehow> map one state vector into another
>> won't give you a "fix".
> So, let's say you're in the middle of calculating factorial(1,000,000,000,000)
> with algorithm 2. Then you find out about algorithm 1 (or maybe decide that
> Stirling's approximation is close enough).
You (as an executing client who has called upon the "factorial service" to perform that calculation) don't "find out about" anything! To *you*, nothing appears remiss. That's the whole point; as long as the API hasn't changed, you shouldn't care that the service has been replaced with an equivalent service.

How the *system* ensures that illusion is maintained is the problem being addressed.

The remedy that "makes most sense" will vary with the design (and functionality) of the service being upgraded. And, the approach the maintainer chooses to address those "rolling updates".

As it would be heavy-handed for the system to dictate how EVERY service is coded AND the constraints placed upon their algorithms, the system can only offer (prefabricated) *mechanisms* that the service designer (and maintainer) can exploit to facilitate the upgrade.

And, the system has to rely on the designer/maintainer to make best use of the mechanisms that it provides -- because the designer/maintainer has more intimate knowledge of the way the service is intended to work.

A "lazy" designer may choose not to address live upgrade issues. In which case, the system will resort to draconian measures when an upgrade is installed: it will KILL the running service and let the clients deal with the resulting mess. *Users* will then either avoid products from that provider *or* will avoid upgrading (if the consequences are too painful -- where "too" is a subjective criterion defined by the user in question).
> What *can* you do with the > unfinished solution other than dump the work and restart the problem with the > new algorithm or let it finish? (and next time use the new algorithm)?
I prepare a document using /WordProcessor25/. The document can be seen as a snapshot of the "conceptual document" that I seek to prepare. I upgrade to /WordProcessor29/. Is all of the work that I did prior to that upgrade lost? (Why not? :> )
On 4/25/2017 0:37, Don Y wrote:
> On 4/24/2017 7:02 PM, Phil Martel wrote: >>>> Provided you translate and replace the existing state vector also. >>> >>> That may not be practical. >>> >>> factorial(n: int) : int >>> ASSERT( n >= 1 ) >>> result := 1 >>> while (n > 1) { >>> result *= n >>> n-- >>> } >>> return result >>> } >>> >>> factorial(n: int) : int >>> ASSERT( n >= 1 ) >>> if (n == 1) >>> return 1 >>> return N * factorial(n-1) >>> ) >>> >>> have vastly different state vectors (assuming I haven't botched the >>> implementations). >>> >>> So, just assuming you can <somehow> map one state vector into another >>> won't give you a "fix". >> So, lets say you're in the middle of calculating >> factorial(1,000,000,000,000) >> with algorithm 2. Then you find out about algorithm 1 (or maybe >> decide that >> Stirling's approximation is close enough). > > You (as an executing client who has called upon the "factorial service" > to perform that calculation) don't "find out about" anything! To *you*, > nothing appears remiss. That's the whole point; as long as the API > hasn't changed, you shouldn't care that the service has been replaced > with an equivalent service. > > How the *system* ensures that illusion is maintained is the problem > being addressed. > > The remedy that "makes most sense" will vary with the design (and > functionality) of the service being upgraded. And, the approach the > maintainer chooses to address those "rolling updates" >
I used the word "you" to mean the system providing the service (including the programmer who implemented the new algorithm). However, a factorial calculation is a poor example in that it is not persistent. It may make sense for you (the system) to dump the work you've done and start over, or to continue with the old algorithm for this instance.
> As it would be heavy-handed for the system to dictate how EVERY service > is coded AND the constraints placed upon their algorithms, the system > can only offer (prefabricated) *mechanisms* that the service designer > (and maintainer) can exploit to facilitate the upgrade. > > And, the system has to rely on the designer/maintainer to make best use > of the mechanisms that it provides -- because the designer/maintainer > has more intimate knowledge of the way the service is intended to work. > > A "lazy" designer may choose not to address live upgrade issues. > In which case, the system will resort to draconian measures when > an upgrade is installed: it will KILL the running service and > let the clients deal with the resulting mess. *Users* will then > either avoid products from that provider *or* will avoid upgrading > (if the consequences are too painful -- where "too" is a subjective > criteria defined by the user in question). > >> What *can* you do with the >> unfinished solution other than dump the work and restart the problem >> with the >> new algorithm or let it finish? (and next time use the new algorithm)? > > I prepare a document using /WordProcessor25/. The document can be seen > as a snapshot of the "conceptual document" that I seek to prepare. > I upgrade to /WordProcessor29/. Is all of the work that I did prior > to that upgrade lost? (Why not? :> )
I think the example you're trying for is that you run a word processor service and that I'm a client. I'm typing into a document using /WordProcessor25/ (which I think of as /WordProcessor/). You want to upgrade to /WordProcessor29/ while I'm typing right here. ^

In this case, perhaps in most cases, there's some point where the system can save its state as a checkpoint, start the new software and continue. If the system can do the change between user inputs, the change will be transparent. The case where the inputs come too fast is where it gets tricky and you may have to keep a copy of the old code running.

Best wishes,
--Phil
pomartel At Comcast(ignore_this) dot net
On 4/25/2017 9:19 AM, Phil Martel wrote:
>>> So, lets say you're in the middle of calculating >>> factorial(1,000,000,000,000) >>> with algorithm 2. Then you find out about algorithm 1 (or maybe >>> decide that >>> Stirling's approximation is close enough). >> >> You (as an executing client who has called upon the "factorial service" >> to perform that calculation) don't "find out about" anything! To *you*, >> nothing appears remiss. That's the whole point; as long as the API >> hasn't changed, you shouldn't care that the service has been replaced >> with an equivalent service. >> >> How the *system* ensures that illusion is maintained is the problem >> being addressed. >> >> The remedy that "makes most sense" will vary with the design (and >> functionality) of the service being upgraded. And, the approach the >> maintainer chooses to address those "rolling updates" > > I used the word "you" to mean the system providing the service (including the > programmer who implemented the new algorithm). However, a factorial > calculation is a poor example in that it is not persistent.
There are no conditions placed on what a service can provide. E.g., my calculator is a service; in your world, it might be a library. I use factorial as an example of a "job" that can take some "macroscopic" amount of time -- rather than arguing about whether 10 microseconds or 200 hours is "too long" for a service to "linger" in the face of a pending/desired upgrade.
> It may make sense > for you (the system) to dump the work you've done and start over, or to > continue with the old algorithm for this instance.
There are *many* possible courses of action that the developer could apply to providing a rolling upgrade of his service. The system can't impose *one* -- without sharply constraining the types of services that can be implemented as well as the "time" each takes to operate. If a service has side-effects, then you can't (typically) start over as you would have to consider which side effects had already taken place. Etc.
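A small sketch of why "start over" gets awkward once side effects are involved. It assumes a hypothetical Journal that survives the upgrade and records which side effects have completed; the transfer job and its step names are made up for the example.

```python
class Journal:
    """Hypothetical record of side effects that have already happened."""

    def __init__(self):
        self.done = set()

    def once(self, key, action):
        # Perform the side effect only if no earlier attempt completed it.
        if key not in self.done:
            action()
            self.done.add(key)


def transfer(journal, ledger, crash_after=None):
    steps = [
        ("debit", lambda: ledger.append("debit A 100")),
        ("credit", lambda: ledger.append("credit B 100")),
    ]
    for i, (key, action) in enumerate(steps):
        if crash_after is not None and i == crash_after:
            raise RuntimeError("service replaced mid-job")
        journal.once(key, action)


ledger, journal = [], Journal()
try:
    transfer(journal, ledger, crash_after=1)   # old version dies after the debit
except RuntimeError:
    pass
transfer(journal, ledger)                      # new version "starts over"
```

Without the journal, the restart would debit account A twice. The sketch also glosses over making an action and its journal entry atomic, which is the hard part in practice.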
>> As it would be heavy-handed for the system to dictate how EVERY service >> is coded AND the constraints placed upon their algorithms, the system >> can only offer (prefabricated) *mechanisms* that the service designer >> (and maintainer) can exploit to facilitate the upgrade. >> >> And, the system has to rely on the designer/maintainer to make best use >> of the mechanisms that it provides -- because the designer/maintainer >> has more intimate knowledge of the way the service is intended to work. >> >> A "lazy" designer may choose not to address live upgrade issues. >> In which case, the system will resort to draconian measures when >> an upgrade is installed: it will KILL the running service and >> let the clients deal with the resulting mess. *Users* will then >> either avoid products from that provider *or* will avoid upgrading >> (if the consequences are too painful -- where "too" is a subjective >> criteria defined by the user in question). >> >>> What *can* you do with the >>> unfinished solution other than dump the work and restart the problem >>> with the >>> new algorithm or let it finish? (and next time use the new algorithm)? >> >> I prepare a document using /WordProcessor25/. The document can be seen >> as a snapshot of the "conceptual document" that I seek to prepare. >> I upgrade to /WordProcessor29/. Is all of the work that I did prior >> to that upgrade lost? (Why not? :> ) > > I think the example you're trying for is that you run a word processor service > and that I'm a client.
Yes.
> I'm typing into a document using /WordProcessor25/ > (which I think of as /WordProcessor/). You want to upgrade to > /WordProcessor29/ while I'm typing right here.
I want to upgrade. *When* is a separate issue. The point I am making is that you *can* upgrade /WordProcessor/ and not lose the "work in progress" -- even if it has been sitting on an offline floppy for 2 years -- BECAUSE THE NEW WORDPROCESSOR KNOWS HOW TO UNDERSTAND THE STATE STORED BY THE OLDER VERSION. (i.e., yet another "upgrade strategy")
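One common way to get that property is to tag the stored state with a format version and let the newer code lift old state forward one step at a time. The sketch below assumes a JSON file with a "version" field; the field names and converter functions are invented for illustration.

```python
import json

# Hypothetical converters: each lifts a saved document one format version forward.
def v1_to_v2(doc):
    return {"version": 2, "paragraphs": doc["text"].split("\n")}

def v2_to_v3(doc):
    return {"version": 3, "paragraphs": doc["paragraphs"], "styles": {}}

MIGRATIONS = {1: v1_to_v2, 2: v2_to_v3}
CURRENT_VERSION = 3

def load(saved):
    """Read state written by any older version and bring it up to date."""
    doc = json.loads(saved)
    while doc["version"] < CURRENT_VERSION:
        doc = MIGRATIONS[doc["version"]](doc)
    return doc

# A file written years ago by the oldest version still loads.
old_file = json.dumps({"version": 1, "text": "Dear Sir,\nPlease find enclosed."})
doc = load(old_file)
```

This only works because the author of the new version wrote the converters; the system itself has no way to know what the old state means.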
> In this case, perhaps in most cases, there's some point where the system can > save its state as a checkpoint, start the new software and continue. If the > system can do the change between user inputs, the change will be transparent. > The case where the inputs come too fast is where it gets tricky and you may > have to keep a copy of the old code running.
Again, the system can't know how the service behaves or how its clients expect to use it. So, you can't *impose* an upgrade strategy on the service. Instead, you provide mechanisms that allow many approaches to be used and count on the designer/maintainer to use their specific knowledge of the service (THEIR service!) to decide which of them to exploit.

Part of installing the upgrade "software" is the specification of the upgrade strategy mechanism to be used -- along with any ancillary requirements FOR THE UPGRADE.

Make it easier for developers to do what they "should" instead of forcing them to do ALL of the heavy lifting (which would tend to result in more "please reboot, now" scenarios).

What I've been doing (since Friday evening) is codifying the different strategies and drafting guidelines for when each is "preferred" along with when each is contraindicated. Then, I'll sort out the consequences of a developer incorrectly specifying the "wrong" strategy/mechanism; or, incorrectly implementing it in their upgrade: how might this screw up other aspects of the system and what can I do to guard against that.
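A sketch of what "the upgrade names its own strategy" might look like. The manifest format, the Strategy names and the step descriptions are all assumptions; the three strategies are just the ones that have come up in this thread, not the list being drafted.

```python
from enum import Enum

class Strategy(Enum):
    KILL = "kill"                # terminate the old instance; clients cope
    DRAIN = "drain"              # no new clients on the old instance; wait for release
    CHECKPOINT = "checkpoint"    # old instance saves state the new one can read

def plan_upgrade(manifest):
    """Return the steps the system would take for a hypothetical upgrade manifest."""
    try:
        strategy = Strategy(manifest.get("strategy"))
    except ValueError:
        strategy = Strategy.KILL     # unspecified or unknown: draconian default
    if strategy is Strategy.DRAIN:
        return ["route new clients to new", "wait for old clients", "release old"]
    if strategy is Strategy.CHECKPOINT:
        return ["ask old to checkpoint", "start new from checkpoint", "release old"]
    return ["kill old", "start new"]

lazy = plan_upgrade({})                            # designer specified nothing
careful = plan_upgrade({"strategy": "checkpoint"})
```

Falling back to KILL when nothing usable is specified mirrors the "lazy designer" case described earlier. It does nothing to detect a developer who names a strategy and then implements it incorrectly.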
