EmbeddedRelated.com
Forums

RPC/RMI in heterogeneous environments

Started by Don Y January 21, 2021
On 1/31/2021 12:39 PM, upsidedown@downunder.com wrote:
> On Thu, 21 Jan 2021 18:38:27 -0700, Don Y
> <blockedofcourse@foo.invalid> wrote:
>
>> I postpone automatic type conversion to the *servant*, rather
>> than embedding it in the client-side stub. My reasoning is
>> that this allows a wider window for the targeted node to change
>> *after* the RPC/RMI invocation and before its actual servicing.
>>
>> I.e., it would be silly to convert arguments for a target node
>> of type X, marshall them and then, before actual transport,
>> discover that the targeted *service* now resides on node Y
>> (which is of a different "character" than X). That would require
>> unmarshalling the already converted arguments (the originals having
>> been discarded after the first conversion), converting them from X
>> to Y representations, remarshalling and facing the same potential
>> "what if it moves, again?"
>>
>> However, this defers actually *checking* the data until after
>> transport. It also means transport errors take a higher priority
>> than data type errors (cuz the data aren't checked until they've
>> actually been transported!)
>>
>> Discussing this with colleagues, it seems a smarter approach might
>> be to split the activities between client- and server-side stubs.
>> E.g., check the data in the client-side stub so any errors in it
>> can be reported immediately (without waiting or relying on transport
>> for that to occur). Then, let the servant sort out the details of
>> conversion KNOWING that the data are "valid"/correct.
>>
>> [The downside of this is the servant must bear the cost of the
>> conversion; so, targeting an action to a resource-starved node is
>> costlier than it might otherwise have been.]
>>
>> The process obviously mirrors for return value(s) conversions.
>>
>> [I haven't, yet, considered if anything undesirable "leaks" in either
>> of these approaches]
>
> Did you look at the MQTT protocol https://en.wikipedia.org/wiki/MQTT
> which is often used with the IoT drivel. The MQTT Broker handles the
> marshaling between lightweight clients.
Different paradigm entirely. I'm trying to implement a "connection" between a specific instance of a specific client's invocation of a specific method on a specific object. I.e., the mechanism between:

result = object.method(params)

and

type method(params) {... return retval}

In a single processor environment (in a single process container), this mechanism would be the compiler's building of the stack frame and "CALL" to the targeted method (referencing the object specified).

I don't want to know about any other methods or objects. And, no one else should know about this object or its methods UNLESS THEY HAVE BEEN TOLD TO BE SO INFORMED.

As the (malevolent) developer can avoid the protections that the compiler provides (e.g., pass bogus data or a bogus object reference or method selector), that mechanism has to protect the codebase against such attacks. Do something you shouldn't and you die -- you've either got a latent bug (in which case, it's dubious as to whether the next line of code that you execute will be correct) *or* have malicious intent (in which case, I don't want to give you another chance to muck with things).

MQTT is more of an "active whiteboard" where agents exchange information.

I can create an app (object) that is responsible for turning on a light when the ambient light level falls below a particular level. The developer coding the app can KNOW that there may be hundreds of possible lights. And, many different ways of determining "ambient" (is it an outdoor determination? indoor? windowless basement? etc.). He may also know that there is an electric garage door opener that could be commanded to open and allow (intruders? accomplices??) to gain access to the property.

But, he can't use any of that information. He can only access the sensor that is provided to him and can only control the light that he is given control over. If he is the only application that is concerned with that light, then he's the only app that even knows that the light exists! And, the light is "safe" from other malicious/buggy acts.

Marshalling arguments is easy. Getting them in the right form (data type/encoding) and at the right *time* is the issue being addressed by "late conversion" (to coincide with the late reification).
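[For illustration only -- a minimal C sketch, with invented names (kernel_rmi(), LAMP_SET_LEVEL, etc.; none of this is Don's actual API), of what a client-side stub reduces to under this model: it packages the handle, method selector and raw arguments and traps into the kernel, which alone knows where the object currently lives:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef int handle_t;                /* a handle is just an int */

#define LAMP_SET_LEVEL 3             /* invented method selector */

typedef struct {
    handle_t object;                 /* which object (meaningful only to the
                                        kernel's per-process handle table) */
    int      method;                 /* method selector */
    size_t   len;                    /* bytes of marshalled arguments */
    uint8_t  args[64];               /* arguments, shipped "as is" (native form) */
} rmi_request_t;

extern int kernel_rmi(const rmi_request_t *req);   /* trap into the kernel */

/* client-side stub for:  result = lamp.set_level(level); */
int lamp_set_level(handle_t lamp, int level)
{
    rmi_request_t req = { .object = lamp, .method = LAMP_SET_LEVEL, .len = 0 };

    memcpy(req.args, &level, sizeof level);   /* no type conversion here */
    req.len = sizeof level;

    return kernel_rmi(&req);   /* kernel checks the handle/permissions,
                                  resolves the object's current node and
                                  transports; we only see result or error */
}

]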
On Sun, 31 Jan 2021 12:37:27 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 1/29/2021 11:33 AM, upsidedown@downunder.com wrote:
>> On Thu, 21 Jan 2021 18:38:27 -0700, Don Y
>> <blockedofcourse@foo.invalid> wrote:
>>
>>> I postpone automatic type conversion to the *servant*, rather
>>> than embedding it in the client-side stub. My reasoning is
>>> that this allows a wider window for the targeted node to change
>>> *after* the RPC/RMI invocation and before its actual servicing.
>>
>> You must be quite desperate (extreme data rate, flea power client,
>> very small memory client) for even considering functionality that
>> naturally belongs to the client.
>
>It's "power" (resource) related only in that time = 1/"power"
Computing power (and hence power consumption) is often exchangeable with time. Thus the _energy_ consumption of a complex operation can be nearly constant. This can be an issue in truly flea power systems, in which you may have to find algorithms with less complexity to reduce total energy consumption.
>
>> Any healthy clients can handle byte order, time zones, character sets
>> and languages.
>
>Convert Q10.6 to double
Q10.6 is already in binary representation fitting into a 16 bit word or two bytes. Just some shifting to do the hidden bit normalization and some exponent adjustment and you are done. Should be easy even with any 8 bit processor. Doing it with some primitive 4 bitters (e.g. 4004/4040) would require more swapping between internal registers.

Of course a decimal number with 10 integer digits and 6 fractional digits (either in BCD/ASCII/EBCDIC) requires much more work and scratchpad areas.
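[For concreteness, a minimal C sketch of the conversion under discussion -- with hardware or soft float, Q10.6 really is just a scale by 2^-6:

#include <stdint.h>

/* Q10.6: signed 16-bit fixed point, 10 integer bits, 6 fractional bits */
double q10_6_to_double(int16_t q)
{
    return (double)q / 64.0;                  /* 64 == 2^6 */
}

/* ...and back, with rounding (assumes the value fits the Q10.6 range) */
int16_t double_to_q10_6(double d)
{
    return (int16_t)(d * 64.0 + (d >= 0.0 ? 0.5 : -0.5));
}

]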
On Sun, 31 Jan 2021 11:44:22 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 1/28/2021 7:08 PM, George Neuner wrote:
>>
>> You're not seeing that the case where a service may change hosts is
>> quite similar to the case where the service can be provided by
>> multiple hosts. In either case, the client does not know which host
>> will answer until it tries to communicate.
>
>It's not applicable in my world. The client NEVER knows which host
>(which may be its own!) *will* service or *did* service a particular request
>because the client never knows where the object resides, at any point in
>time.
Terminology notwithstanding, *something* knows which host it is talking to: technically it may be a proxy/agent rather than an "end point" (client) per se, but [sans a completely generic encoding] something has to know in order to properly encode the data.
>The *kernel* does the actual transaction as it knows whether your handle
>ACTUALLY references an object of that "type" (a handle is just an int),
>which permissions you have for the object's methods, where the object is
>"physically" located, etc. (if those tests were done in the client's
>address space, the client could bypass them)
>
>"You" don't see the RMI as anything other than an expensive function
>invocation. You aren't involved in deciding how -- or where -- the
>action is performed; all you get to see is the result(s) of that
>invocation (or a data error or transport failure).
You're seeing lots of trees but you are missing the forest.

It DOES NOT MATTER who/what makes the actual call - the point is that the "other end" is not known until the call is made.

The situation is EXTREMELY SIMILAR (if not identical) to that of a conventional network client invoking a replicated remote service. And it is amenable to many of the same solutions. The cases of the service being co-hosted with the client or being hosted on a compatible CPU (thus needing no data translation) are simply variations that may be streamlined.
>>>> However, in older systems it was not uncommon for clients to negotiate
>>>> the protocol with whatever server responded, and for data in the
>>>> resulting messages to be sent in the native format of one side or the
>>>> other (or both, depending). Whether the client or the server (or
>>>> both) did some translation was one of the points that was negotiated.
>>>
>>> I've not considered deferring that decision to run-time. I
>>> make sure that the targeted node for a service CAN support the
>>> data types that its clients require. Then, the question is
>>> whether to do the conversion client- or server-side.
>>
>> Perhaps it is time to look into such a run-time solution.
>
>That's done at compile time (sorting out how to do the conversions)
>and at run time by the workload manager knowing which nodes are
>running which servers, on which types of hardware/processors, with
>what resource complements, etc. There's no "negotiation" involved;
>the workload manager simply conditions its selection of the "best"
>node on which to place a server based on the data types that will
>be required (if sited, there) and the conversion routines available
>for its clients.
Again terminology. "Negotiation" does not imply lots of back and forth hashing out nitty gritty details ... in network parlance it means simply that there is an exchange of meta-data which allows the systems to communicate more effectively.

At simplest the meta-data could be a single word indicating what kind of CPU is at the other end. At the most complex it could be a vector of codes representing what data translations could be needed and what each side is capable of doing. Messages then are encoded using the "lowest" common format(s) [for some definition of "lowest"] and each side translates only what it can and/or must.

In either case, it is a single handshake consisting of one "this is what I am / what I can do" message in each direction. It is typical for such host specific meta-data to be cached so that it is exchanged only if the remote host has not been seen previously or for some period of time.
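[A hypothetical sketch, in C, of the single handshake message being described -- field names are invented for illustration only:

#include <stdint.h>

/* one "this is what I am / what I can do" message, sent in each direction */
typedef struct {
    uint16_t cpu_kind;     /* enumerated CPU/ABI family of the sender        */
    uint16_t byte_order;   /* 0x1234 written natively; the receiver compares
                              it to detect the sender's endianness           */
    uint32_t encodings;    /* bit vector: IEEE-754, ASCII/UTF-8, EBCDIC, ... */
    uint32_t can_convert;  /* bit vector of translations this side is
                              able/willing to perform                        */
} host_caps_t;

]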
>>> I don't want to get into a situation where (potentially), an RMI
>>> stalls in the outbound stack because it is repeatedly massaged to
>>> adapt to a "target in motion".
>>
>> But what is the likelihood of that? You have said that the server can
>> only be started on a CPU that has sufficient resources ... so apart
>> from a failure WHY would it be started somewhere and then quickly
>> moved somewhere else?
>
>It may have been started 4 hours ago; that doesn't change the fact that
>there is a (possibly large) window between when the RMI is invoked and
>when the message actually hits the wire, destined for some PARTICULAR
>node (which may now be incorrect, given the data type conversions that
>were applied to the parameters WHEN INVOKED).
If you anticipate that clients routinely may spend unbounded time inside local RMI code without any messages being exchanged, then the system is fundamentally unstable: it is subject to livelock (as well as deadlock), and the global state can become perpetually inconsistent - which is a form of Byzantine failure.
>The *tough* failures are "someone unplugged node #37" (or powered it down,
>unintentionally). Suddenly EVERYTHING that was running on that node is *gone*.
>And, every reference INTO that node (as well as out of) is now meaningless,
>on every node in the system!
>
>Because a node can easily host hundreds/thousands of objects, there is a
>flurry of activity (within a node and between nodes) when a node goes down
>so unceremoniously.
>
>[It's actually entertaining to watch things move around, new nodes get powered
>up, etc.]
>
>So, the system sees a spike in utilization as it tries to recover.
>That increases the likelihood of deadlines not being met. Which, in
>turn, brings about more exception handling. And, services deciding
>to simply quit (HRT) -- which propagates to clients who expected those
>services to be "up", etc.
>
>[The challenge is ensuring this doesn't become chaotic]
Yes ... but in the face of Byzantine failures there is only so much that can be done automagically.

FWIW: when the system knows it is in a "recovery" mode, it may make sense to elide handling (or even throwing) of certain exceptions.
>As this isn't "normal" operation, the workload manager can't know what
>sort of NEW load factors will exist as they ALL try to recover from their
>missed deadline(s), broken RMIs, dropped endpoints, etc. It can only react
>after the fact. *Then*, start reshuffling objects to eliminate peaks.
>
> :
>
>There are also other "normal use cases" that result in a lot of
>dynamic reconfiguration. E.g., if power fails, the system has to
>start shedding load (powering down nodes that it deems nonessential).
>Doing that means those resources have to be shuffled to other nodes
>that will remain "up" -- at least until the next load shedding
>threshold (to prolong battery life). And, competing with users who
>may have their own ideas as to how to shift the load to address the
>power outage.
>
>And, of course, the normal acquiring and shedding of applications on
>the whims of users varies the instantaneous load on the system... which
>precipitates load balancing activities.
>
>Bottom line, things are ALWAYS jostling around.
Understood ... the problem is that in order to maintain stability, some central authority has to direct the changes - the typical networking techniques of relying on timeouts and retries will not work well (if they work at all) when the environment is very highly dynamic.

There are ways to evaluate global status in a dynamic system ... not continuously, but in snapshot ... and the goal then is to maintain that every snapshot shows the system to be consistent.
>I expect a *lot* of service invocations as EVERYTHING is an object
>(thus backed by a service). Applications just wire together services.
>The system sorts out how best to address the *current* demands
>being placed on it.
>
>In ages past, you'd just keep throwing a bigger and bigger processor at
>the problem as more "responsibilities" got added. And, in periods of
>less-than-peak demand, you'd bear the cost of an overprovisioned system.
>
>Eventually, you'd realize you couldn't continue scaling in this manner
>and would opt for additional processor(s). Then, you'd start sorting
>out how to partition the (existing!) system. And, how to rebuild
>applications to handle the introduction of new communication protocols
>between components on different processors.
>
>[One of my colleagues is working on an app that sprawls over close
>to a square mile!]
George
On 1/31/2021 11:50 PM, upsidedown@downunder.com wrote:
> On Sun, 31 Jan 2021 12:37:27 -0700, Don Y
> <blockedofcourse@foo.invalid> wrote:
>
>> On 1/29/2021 11:33 AM, upsidedown@downunder.com wrote:
>>> On Thu, 21 Jan 2021 18:38:27 -0700, Don Y
>>> <blockedofcourse@foo.invalid> wrote:
>>>
>>>> I postpone automatic type conversion to the *servant*, rather
>>>> than embedding it in the client-side stub. My reasoning is
>>>> that this allows a wider window for the targeted node to change
>>>> *after* the RPC/RMI invocation and before its actual servicing.
>>>
>>> You must be quite desperate (extreme data rate, flea power client,
>>> very small memory client) for even considering functionality that
>>> naturally belongs to the client.
>>
>> It's "power" (resource) related only in that time = 1/"power"
>
> Computing power (and hence power consumption) is often exchangeable
> with time. Thus the _energy_ consumption of a complex operation can be
> nearly constant. This can be an issue in truly flea power systems, in
> which you may have to find algorithms with less complexity to reduce
> total energy consumption.
By "power", I meant all resources, not just watts. A device can consume very little power (watts) but also be crippled or capable in what it can do in any given amount of time. E.g., I can perform fast multiplies without a hardware multiply -- if I can trade memory for time. [It's also possible to be a power HOG -- and STILL not be able to do much!]
>>> Any healthy clients can handle byte order, time zones, character sets
>>> and languages.
>>
>> Convert Q10.6 to double
>
> Q10.6 is already in binary representation fitting into a 16 bit word
> or two bytes. Just some shifting to do the hidden bit normalization
> and some exponent adjustment and you are done. Should be easy even
> with any 8 bit processor. Doing it with some primitive 4 bitters (e.g.
> 4004/4040) would require more swapping between internal registers.
It's not a question of how complex the operation is vs the capabilities of the processor; an MC14500 can EVENTUALLY pull it off! But, increasing complexity (with a given set of "capabilities") adds to "conversion delay". The issue is putting this conversion task in series with the routing of the packet.

If you convert parameters on the client-side, then the time spent converting them increases the time until the packet gets onto the wire and, ultimately, to the target node. (Note that you had to make assumptions as to WHICH node would be targeted in order to know which conversions would need to be made.)

But, while you are busy converting, a concurrent process (possibly on another node) can cause the targeted object to be relocated to a new node. So, the work done preparing the parameters for the original node is now wasted; a NEW set of conversions needs to be applied to fit the requirements of the NEW target node (because you've decided that these conversions need to be done at the *client*!). I.e., even though the wire is ready for your packet, your packet is no longer ready for the wire!

By contrast, if you ship the parameters "as is" and rely on the far end to perform any necessary conversions, then the time between the client invoking the RMI and the time the packet is ready to be placed on the wire is shortened. This reduces the likelihood of an "object moved" event rendering your conversions futile (though it still means your packet has to be routed to a different location than the one initially assumed).
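[A sketch of this "late conversion" idea in C -- types and names invented, not Don's actual wire format: the client ships the argument bits in its native encoding plus a tag saying what that encoding is, and the servant-side stub normalizes on receipt, so a last-minute change of target node never invalidates work already done:

#include <stdint.h>
#include <string.h>

enum arg_fmt { FMT_Q10_6, FMT_IEEE754_DOUBLE };   /* invented encoding tags */

typedef struct {
    uint8_t  fmt;          /* how the sender encoded the argument       */
    uint8_t  pad[7];
    uint64_t raw;          /* the bits, exactly as the sender produced  */
} wire_arg_t;

/* servant-side stub: convert whatever arrived into what the method wants */
static int as_double(const wire_arg_t *a, double *out)
{
    switch (a->fmt) {
    case FMT_Q10_6:
        *out = (double)(int16_t)a->raw / 64.0;   /* scale by 2^-6 */
        return 0;
    case FMT_IEEE754_DOUBLE:
        memcpy(out, &a->raw, sizeof *out);       /* assumes matching byte order */
        return 0;
    default:
        return -1;                               /* data type error */
    }
}

]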
> Of course a decimal number with 10 integer digits and 6 fractional
> digits (either in BCD/ASCII/EBCDIC) requires much more work and
> scratchpad areas.
Imagine shipping a 4K page of audio samples (because that is far more efficient than shipping individual samples!) in little endian form to a server that wants them in big endian form. Or, a TIFF to a server that wants a JPEG. Or, an array of 1000 floats to a server that wants IEEE 754 doubles. I.e., "power" is a relative term.

(Of course, there are practical limits to the sorts of conversions you can be expected to do in this process. But, you should be able to design a client or a service with reasonable implementation flexibility... and "fixup" for equally reasonable expectations from their counterparts)
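[E.g., a trivial servant-side fixup for the audio-page case, in C -- for a 4K page of 16-bit samples that is 2048 swaps, trivial on a capable node and a real cost on a lean one, which is the "power is relative" point:

#include <stddef.h>
#include <stdint.h>

/* byte-swap a block of 16-bit samples received in the opposite byte order */
void swap16_block(uint16_t *samples, size_t count)
{
    for (size_t i = 0; i < count; i++)
        samples[i] = (uint16_t)((samples[i] << 8) | (samples[i] >> 8));
}

]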
On 2/1/2021 1:50 PM, George Neuner wrote:
> On Sun, 31 Jan 2021 11:44:22 -0700, Don Y
> <blockedofcourse@foo.invalid> wrote:
>
>> On 1/28/2021 7:08 PM, George Neuner wrote:
>>>
>>> You're not seeing that the case where a service may change hosts is
>>> quite similar to the case where the service can be provided by
>>> multiple hosts. In either case, the client does not know which host
>>> will answer until it tries to communicate.
>>
>> It's not applicable in my world. The client NEVER knows which host
>> (which may be its own!) *will* service or *did* service a particular request
>> because the client never knows where the object resides, at any point in
>> time.
>
> Terminology notwithstanding, *something* knows which host it is
> talking to: technically it may be a proxy/agent rather than an "end
> point" (client) per se, but [sans a completely generic encoding]
> something has to know in order to properly encode the data.
>
>> The *kernel* does the actual transaction as it knows whether your handle
>> ACTUALLY references an object of that "type" (a handle is just an int),
>> which permissions you have for the object's methods, where the object is
>> "physically" located, etc. (if those tests were done in the client's
>> address space, the client could bypass them)
>>
>> "You" don't see the RMI as anything other than an expensive function
>> invocation. You aren't involved in deciding how -- or where -- the
>> action is performed; all you get to see is the result(s) of that
>> invocation (or a data error or transport failure).
>
> You're seeing lots of trees but you are missing the forest.
>
> It DOES NOT MATTER who/what makes the actual call - the point is that
> the "other end" is not known until the call is made.
No. The kernel ***KNOWS*** where the object is located BEFORE THE RMI IS EVEN INVOKED! There's no need to inquire as to its current location. Nor is there a mechanism for doing so as there is no global repository of object information; if you don't have a "hand" (handle) on an object you don't know that it exists (let alone WHERE it exists)!

So, in the absence of a last minute notification of the targeted object having moved "(very-very) recently", the kernel can be assured that it knows where the object is, presently -- exactly where it was "a while ago"!

By shrinking the time between when the kernel NEEDS to know where the object resides and the time the kernel can put the message on the wire (remembering that the wire may not be READY for the message at any given time), I minimize the window in which an "object-has-moved" notification can render the location information obsolete.

[You can't eliminate this possibility when you have TRUE parallelism. A uniprocessor can give the illusion of eliminating it by ensuring that only one sequence of instructions executes at a time -- locking out the move (or, alternatively, the reference) until the move is complete.]

As this is done in the kernel, the kernel can (almost literally) fill in the destination address AS it is putting the message on the wire. If I have to convert parameters BEFORE transport, then I have to look at that location information *earlier* -- in order to figure out WHICH conversions are appropriate for the targeted node.

The fact that the kernel is doing this means it gets done "right"; instead of relying on individual clients to keep track of every "object-has-moved" notification and do-the-right-thing with that info.

[In an early "process migration" implementation, I exposed all of these events to the clients/objects -- thinking that there might be some value that they could add, in certain special cases. I never found a way to exploit those notifications (cuz they still don't tell you anything about the old or new location!) So, they just added more work for each process; more stuff that the process developer could screw up!]

It also means the kernel can track the location of the object for ALL (local) clients that presently have references into it -- instead of having to notify multiple clients of a single object's movement, even if those clients would "do nothing" with the information.

Finally, it means ALL clients having references to a particular object can be notified "in real time" when anything happens to an object.
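[A toy C sketch -- names, table and locking all invented, nothing like the real kernel -- of "resolve the destination as late as possible": the handle-to-node lookup happens under the same lock that migration updates take, and only when the transport is actually ready, so the window in which an object-has-moved notification can stale the address is just the send itself:

#include <pthread.h>
#include <stdint.h>

#define MAX_HANDLES 256
#define NODE_NONE   0xFFFFFFFFu

static struct { uint32_t node; } object_table[MAX_HANDLES];
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* invoked when an "object-has-moved" notification arrives for a handle we hold */
void kernel_object_moved(int handle, uint32_t new_node)
{
    pthread_mutex_lock(&table_lock);
    object_table[handle].node = new_node;
    pthread_mutex_unlock(&table_lock);
}

/* called only when the wire is ready for the (already marshalled) message */
uint32_t kernel_resolve_destination(int handle)
{
    uint32_t dest = NODE_NONE;

    pthread_mutex_lock(&table_lock);
    if (handle >= 0 && handle < MAX_HANDLES)
        dest = object_table[handle].node;
    pthread_mutex_unlock(&table_lock);

    return dest;           /* caller stamps this into the packet and sends */
}

]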
> The situation is
> EXTREMELY SIMILAR (if not identical) to that of a conventional network
> client invoking a replicated remote service. And it is amenable to
> many of the same solutions. The cases of the service being co-hosted
> with the client or being hosted on a compatible CPU (thus needing no data
> translation) are simply variations that may be streamlined.
>
>>>>> However, in older systems it was not uncommon for clients to negotiate
>>>>> the protocol with whatever server responded, and for data in the
>>>>> resulting messages to be sent in the native format of one side or the
>>>>> other (or both, depending). Whether the client or the server (or
>>>>> both) did some translation was one of the points that was negotiated.
>>>>
>>>> I've not considered deferring that decision to run-time. I
>>>> make sure that the targeted node for a service CAN support the
>>>> data types that its clients require. Then, the question is
>>>> whether to do the conversion client- or server-side.
>>>
>>> Perhaps it is time to look into such a run-time solution.
>>
>> That's done at compile time (sorting out how to do the conversions)
>> and at run time by the workload manager knowing which nodes are
>> running which servers, on which types of hardware/processors, with
>> what resource complements, etc. There's no "negotiation" involved;
>> the workload manager simply conditions its selection of the "best"
>> node on which to place a server based on the data types that will
>> be required (if sited, there) and the conversion routines available
>> for its clients.
>
> Again terminology. "Negotiation" does not imply lots of back and forth
> hashing out nitty gritty details ... in network parlance it means
> simply that there is an exchange of meta-data which allows the systems
> to communicate more effectively.
There is no point in time where this process occurs. A node's kernel KNOWS where all of the objects that it references are located. It knows nothing about (and can find out nothing about) any OTHER objects; if you don't have a "hand" on the object, it doesn't exist (to you).

The kernel's notion of "where" is simply updated when the object moves. This is just an efficiency hack; if the kernel didn't know of the move, then it would deliver the message to the object's previous location -- and the server that previously handled requests for that particular object would forward it along to the new location -- it knows this (and has a "hand" on that new instance -- because *it* was tasked with forwarding the internal server state that represents that object).

Without the notification (location update), the kernel can't FIND the object. If the object is NEVER "used", the kernel *still* knows where it is!

It's only when the last reference (on that node) to the object is dropped that the kernel "forgets" -- about the location AND THE OBJECT! I.e., when the server that was previously handling requests for that object "goes away", any old handles to that object's existence, there, become unresolvable; if you haven't been informed of the new object instance's location, it's lost to you -- forever!
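[A hypothetical fragment of the forwarding fallback just described -- a server that no longer hosts an object, but was tasked with shipping its state, keeps a forwarding reference and passes stray requests along rather than failing them (names and sentinel invented):

#include <stdint.h>

#define NODE_LOCAL 0u      /* invented sentinel: object still hosted here */

struct served_obj {
    uint32_t forwarded_to; /* NODE_LOCAL, or the node the state was shipped to */
};

/* decide where a request for this object should actually be serviced */
uint32_t route_request(const struct served_obj *obj, uint32_t this_node)
{
    if (obj->forwarded_to == NODE_LOCAL)
        return this_node;             /* still ours: handle it here        */
    return obj->forwarded_to;         /* forward to the object's new home  */
}

]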
> At simplest the meta-data could be a single word indicating what kind
> of CPU is at the other end.
>
> At the most complex it could be a vector of codes representing what
> data translations could be needed and what each side is capable of
> doing. Messages then are encoded using the "lowest" common format(s)
> [for some definition of "lowest"] and each side translates only what
> it can and/or must.
>
> In either case, it is a single handshake consisting of one "this is
> what I am / what I can do" message in each direction. It is typical
> for such host specific meta-data to be cached so that it is exchanged
> only if the remote host has not been seen previously or for some
> period of time.
Again, there is no need to "ask" or "tell". The kernel already *knows*. And, knows that there *are* necessary conversion operators in place on <whatever> target node to ensure a message will be "translated" correctly -- otherwise, the object wouldn't have been allowed to move to that node! And, the target node already knows which conversions to apply to each incoming method invocation FOR that object (and others of its type).

A node can't magically change its hardware. So, whatever the characteristics of the node when it came on-line persist to this point in time. Why *ask* it if you already KNOW the answer?
>>>> I don't want to get into a situation where (potentially), an RMI
>>>> stalls in the outbound stack because it is repeatedly massaged to
>>>> adapt to a "target in motion".
>>>
>>> But what is the likelihood of that? You have said that the server can
>>> only be started on a CPU that has sufficient resources ... so apart
>>> from a failure WHY would it be started somewhere and then quickly
>>> moved somewhere else?
>>
>> It may have been started 4 hours ago; that doesn't change the fact that
>> there is a (possibly large) window between when the RMI is invoked and
>> when the message actually hits the wire, destined for some PARTICULAR
>> node (which may now be incorrect, given the data type conversions that
>> were applied to the parameters WHEN INVOKED).
>
> If you anticipate that clients routinely may spend unbounded time
> inside local RMI code without any messages being exchanged, then the
> system is fundamentally unstable: it is subject to livelock (as well
> as deadlock), and the global state can become perpetually inconsistent
> - which is a form of Byzantine failure.
That's always the case when you can't bound the problem space. The user can always opt to install applications that the hardware can't support (when considering the range of OTHER applications that could potentially be CO-executing, at any given time). The user can also opt to *reduce* the available hardware. <shrug>

[A PC user can install a suite of applications on his PC that could NEVER, practically, co-execute (due to resource limitations). But, the PC doesn't explicitly *say* "No, you can't do that!". Instead, it tries its best to support everything in the framework that was imposed when it was designed. It may lead to excessive thrashing, long response times, dropped events, etc. If that's not satisfactory, buy a faster PC (or change the application load). That's *your* problem, not the PC's! Or, any of the application developers'!]

This is very different than *most* embedded systems where the (functionality of the) code that executes after release-to-manufacturing is very similar to the code that exists the day the product is retired.

Think about ANY project that you've done. Imagine the end user coming along and installing some other bit of software on the hardware that's hosting your application. Or, deciding to remove some hardware resources (downgrade CPU, RAM, disk, etc.) Can you make any guarantees as to whether your original application will continue to run? Or, that the new complement of applications will peacefully coexist? Chances are, if confronted with this situation, you (or the product vendor) will simply disclaim that situation as the user has modified the product from its original sales condition; warranty void!

All *I* can do is try to minimize the scenarios that lead to this sort of behavior and hope that specific instances are "sufficiently damped" so they don't take the whole system down (but may lead to transient failures). And, make the OS services aware of the sorts of things that *can* happen. So, when the system starts to go chaotic, those heuristics can inherently prune the actions that would swamp a less inspired solution. E.g., if the process that should be scheduled next simply can't meet its deadline, then don't even bother giving it a chance -- schedule its deadline handler, instead (and eliminate that load from the processor).

Just like I can't prevent the user from unplugging something unilaterally without giving me a heads-up. I can crash, unceremoniously, and blame him for "not following the rules". But, it's unlikely that he's going to see it as HIS fault ("What a stupid system!")

[I was fiddling with the GPUs in one of my workstations, yesterday. As it is hard to "service" them, I apparently didn't seat one of the GPUs properly. The 1100W power supply didn't burn traces on the motherboard. The GPU didn't get toasted. Instead, I got an error code that essentially said "The power supply (which I know to have been operating properly a few minutes earlier) has a problem." What did I do (likely *incorrectly*), recently? Remove card. Reseat -- CAREFULLY. Error gone!

Obviously, PC and GPU vendors design with the EXPECTATION that these sorts of things WILL happen and add provisions to their designs to avoid costly consequences. Had I "lost" any piece of kit DUE TO MY ERROR, I would NOT have been happy with the GPU/PC vendor(s)!]
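[A toy sketch, in C, of the "don't even bother" heuristic mentioned above (names invented, not the actual scheduler): before dispatching the next ready task, check whether it can still meet its deadline given its remaining worst-case execution time; if not, run its deadline handler instead and shed the doomed work:

#include <stdint.h>

typedef struct task {
    uint64_t deadline;          /* absolute time, in ticks               */
    uint64_t wcet_remaining;    /* worst-case execution time still owed  */
    void   (*run)(struct task *);
    void   (*deadline_handler)(struct task *);
} task_t;

void dispatch(task_t *t, uint64_t now)
{
    if (now + t->wcet_remaining > t->deadline)
        t->deadline_handler(t);     /* can't make it: shed the load      */
    else
        t->run(t);                  /* still feasible: give it the CPU   */
}

]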
>> The *tough* failures are "someone unplugged node #37" (or powered it down,
>> unintentionally). Suddenly EVERYTHING that was running on that node is *gone*.
>> And, every reference INTO that node (as well as out of) is now meaningless,
>> on every node in the system!
>>
>> Because a node can easily host hundreds/thousands of objects, there is a
>> flurry of activity (within a node and between nodes) when a node goes down
>> so unceremoniously.
>>
>> [It's actually entertaining to watch things move around, new nodes get powered
>> up, etc.]
>>
>> So, the system sees a spike in utilization as it tries to recover.
>> That increases the likelihood of deadlines not being met. Which, in
>> turn, brings about more exception handling. And, services deciding
>> to simply quit (HRT) -- which propagates to clients who expected those
>> services to be "up", etc.
>>
>> [The challenge is ensuring this doesn't become chaotic]
>
> Yes ... but in the face of Byzantine failures there is only so much
> that can be done automagically.
Correct. You just try to minimize the impact "expected" (which is not the same as "desired" or "planned") events have on the system. E.g., if the transport media ran at 9600 baud, you'd expect lots more problems related to transport delays. If the network had considerably more hosts you'd expect media access delays. Etc.

Shrinking the window of vulnerability (to object movement) is just another attempt to tamp down the worst case consequences. E.g., you can buy metastable-hardened components, reduce event and/or clock frequencies, add multiple layers of synchronizers, etc. but that just minimizes the *probability* of a metastable condition eating your lunch; it doesn't (completely) eliminate it.
> FWIW: when the system knows it is in a "recovery" mode, it may make
> sense to elide handling (or even throwing) of certain exceptions.
How do you decide what should be notified and what shouldn't? Do you let application developers (each with their own self-interests) declare what's important? Do you implement a clearing house to review applications and impose its own notion of significance on them?

I opt for providing as much information as possible to clients/servers (cuz a server can actually be an *agent*). And, hoping folks who are diligent developers will take the effort to sort out suitable recovery strategies (that only THEY can imagine).

[This was the reasoning behind exposing all of the "object motion" events to the clients, originally]

If you don't want to handle the exception (or, don't have a robust strategy), then the default handler is invoked -- which just kills off your process! Don't want to check the result of malloc()? Fine! SIGSEGV works! :>

Of course, if current events indicate that the system can't perform some desired action, then it *can't*!
>> As this isn't "normal" operation, the workload manager can't know what
>> sort of NEW load factors will exist as they ALL try to recover from their
>> missed deadline(s), broken RMIs, dropped endpoints, etc. It can only react
>> after the fact. *Then*, start reshuffling objects to eliminate peaks.
>>
>> :
>>
>> There are also other "normal use cases" that result in a lot of
>> dynamic reconfiguration. E.g., if power fails, the system has to
>> start shedding load (powering down nodes that it deems nonessential).
>> Doing that means those resources have to be shuffled to other nodes
>> that will remain "up" -- at least until the next load shedding
>> threshold (to prolong battery life). And, competing with users who
>> may have their own ideas as to how to shift the load to address the
>> power outage.
>>
>> And, of course, the normal acquiring and shedding of applications on
>> the whims of users varies the instantaneous load on the system... which
>> precipitates load balancing activities.
>>
>> Bottom line, things are ALWAYS jostling around.
>
> Understood ... the problem is that in order to maintain stability,
> some central authority has to direct the changes - the typical
> networking techniques of relying on timeouts and retries will not work
> well (if they work at all) when the environment is very highly dynamic.
The responsibility is distributed. Individual nodes can decide they need to shed some load (and WHICH load to shed). Likewise, they can advertise what loads (strictly in terms of the resources they are willing to dole out to an imported load) they are willing to take on.

There's never *an* optimal solution; you just try to "make things a little better" given your current knowledge of the current situation. And, then try to make THAT *new* situation "a little better". I.e., expect object assignments to be in a constant state of flux. Damp the response so the system doesn't oscillate, shuffling objects back and forth with no actual benefit to the system's state.

E.g., many objects (processes) just idle "most of the time". A thoughtful developer doesn't claim all of the resources he will EVENTUALLY need -- when he is just idling. You *can* do so -- but, when the system decides to shed load, it is likely to see YOU as a nice, fat opportunity to bring that resource under control! Rather, when/if they need to "do work", they can consume a boatload of resources -- if they happen to be available (SOMEWHERE in the system).

I can learn when some of these things are *likely* to happen, if they are driven by the users. E.g., you might watch TV/movies in the evening so I can expect to need to have at least two network speakers and one network display "on-line", plus the associated CODECs, at that time. But, others are reactions to events that are "random"/unpredictable. E.g., I can't predict when the phone might ring or someone might come to the front door (though both are unlikely in the wee-hours of the morning -- so, that might be a good time to schedule the speech recognizer training based on the speech samples harvested from today's phone conversations!)

So, I can't *anticipate* a good binding of objects to nodes. Instead, I have to be able to react to changes as they occur. And, try not to break things AS they are being juggled!

[Of course, I can *notice* configurations/object-node bindings that seem to work better than others and *tend* towards promoting those configurations; it's silly NOT to notice good and bad situations and use them in influencing future object distribution choices. E.g., I start shuffling objects around when I detect the first ring of the phone -- *before* those objects are called on to actually "work" at answering the phone!]

Having the ability to move objects (processes) and bring other nodes on/off-line is my way of coping with peak loads without resorting to unilateral overprovisioning (or, arbitrarily limiting the functionality that can be used at any given time). Inherent in this is the fact that there will be transient periods where things are in flux.

[To avoid this, you could design individual DEDICATED appliances that have all of the resources that they need -- even if they happen to be idle (much of the time). The classic over-provisioning approach. Of course, that hardware then perpetually limits how the device can evolve! (how often does your smart TV get an update?)

Or, impose arbitrary limits on what a system (monolithic or distributed) can be asked to do at any given time. (under-utilizing?)

Or, hope some "big server" is always available (in a timely manner) to provide whatever resources you need. The "it's-someone-else's-problem" approach.

Or, let the user decide that things aren't performing AS HE EXPECTS and take manual actions to fix the situation (which is how folks deal with their PC's when they start thrashing, responding slowly, etc.)]
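[One more illustrative fragment, with invented thresholds: the "damp the response" point above amounts to only migrating an object when the expected improvement clears a hysteresis margin plus the estimated cost of the move, so objects don't ping-pong between nodes for marginal gains:

#include <stdbool.h>

#define HYSTERESIS_MARGIN  0.15   /* require >=15% improvement ...        */
#define MIGRATION_COST     0.05   /* ... beyond the cost of moving        */

/* load figures normalized to 0.0 (idle) .. 1.0 (saturated) */
bool should_migrate(double load_here, double load_there)
{
    double improvement = load_here - load_there;
    return improvement > HYSTERESIS_MARGIN + MIGRATION_COST;
}

]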
> There are ways to evaluate global status in a dynamic system ... not
> continuously, but in snapshot ... and the goal then is to maintain that
> every snapshot shows the system to be consistent.