
Common name for a "Task Loop"

Started by Tim Wescott June 24, 2016
On 16-07-18 07:08 , George Neuner wrote:
>>> On 16-07-10 12:41 , Don Y wrote:
>>>> It's no different than a uniprocessor implementation -- *if* you
>>>> "migrate" the "scheduling criteria" (in your case, "priorities")
>>>> *with* the communication... THROUGHOUT the system!
>
> Don is talking about both multicores and physically separate
> processors connected by network. As he has described it elsewhere -
> his system is able to migrate a running task to any suitable platform
> anywhere within the network.
As I have understood it, Don does not migrate "tasks", in the sense of
RTOS "task", but "jobs" or "computations". That is, what moves across
the net is the input for and output from a computation, together with
some scheduling criteria, and not a "task" in the sense of a TCB and
other context data, stack, variables, etc.

The migrated computation is performed by some task (in the RTOS sense)
that already exists on the remote node, or is perhaps created on that
node for doing this computation.

This is of course much simpler and more efficient than migrating a
task, and does not depend on any HW compatibility of the local and
remote nodes, and does not need deep access to task internals in the
local and remote RTOS.

--
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi . @ .
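Niklas's distinction can be sketched in C. A hypothetical job descriptor (all names here are invented for illustration, not taken from any real RTOS or from Don's system) shows what actually crosses the wire: marshalled inputs plus scheduling criteria, and nothing resembling a TCB, stack, or register state:

```c
/* Sketch of a migrated "computation": what crosses the network is a
 * self-contained job descriptor -- inputs plus scheduling criteria --
 * not a TCB, stack, or register state. All names are illustrative. */
#include <stdint.h>

typedef struct {
    uint32_t job_id;       /* correlates the eventual reply            */
    uint32_t deadline_us;  /* scheduling criterion travels WITH the job */
    uint8_t  priority;     /* mapped onto the remote node's own scale   */
    uint16_t input_len;
    uint8_t  input[64];    /* marshalled arguments, nothing host-specific */
} job_desc_t;

/* The remote node's pre-existing worker task services the request
 * at the imported criteria; no task state ever crossed the wire. */
uint32_t service_job(const job_desc_t *jd)
{
    uint32_t sum = 0;                 /* stand-in for the real work */
    for (uint16_t i = 0; i < jd->input_len; i++)
        sum += jd->input[i];
    return sum;                       /* result marshalled back to sender */
}
```

The point of the sketch is that the descriptor is host-neutral: the two nodes need agree only on the message layout, not on CPU architecture or RTOS internals.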
On Mon, 18 Jul 2016 21:16:15 +0300, Niklas Holsti
<niklas.holsti@tidorum.invalid> wrote:

>On 16-07-18 07:08 , George Neuner wrote:
>>>> On 16-07-10 12:41 , Don Y wrote:
>>>>> It's no different than a uniprocessor implementation -- *if* you
>>>>> "migrate" the "scheduling criteria" (in your case, "priorities")
>>>>> *with* the communication... THROUGHOUT the system!
>>
>> Don is talking about both multicores and physically separate
>> processors connected by network. As he has described it elsewhere -
>> his system is able to migrate a running task to any suitable platform
>> anywhere within the network.
>
>As I have understood it, Don does not migrate "tasks", in the sense of
>RTOS "task", but "jobs" or "computations". That is, what moves across
>the net is the input for and output from a computation, together with
>some scheduling criteria, and not a "task" in the sense of a TCB and
>other context data, stack, variables, etc.
I may be mistaken, but my understanding is that both occur in Don's
system: tasks (in the sense of running programs) can move between CPUs
as needed, as is discussed in this thread, and running computations can
jump from one CPU to another due to client/server interactions.

Don will see this and set us all straight soon enough 8-)

George
On Mon, 18 Jul 2016 23:54:39 -0400, George Neuner
<gneuner2@comcast.net> wrote:

>On Mon, 18 Jul 2016 21:16:15 +0300, Niklas Holsti
><niklas.holsti@tidorum.invalid> wrote:
>
>>On 16-07-18 07:08 , George Neuner wrote:
>>>>> On 16-07-10 12:41 , Don Y wrote:
>>>>>> It's no different than a uniprocessor implementation -- *if* you
>>>>>> "migrate" the "scheduling criteria" (in your case, "priorities")
>>>>>> *with* the communication... THROUGHOUT the system!
>>>
>>> Don is talking about both multicores and physically separate
>>> processors connected by network. As he has described it elsewhere -
>>> his system is able to migrate a running task to any suitable platform
>>> anywhere within the network.
Don might get his ideas across to readers better by limiting the
length of individual postings to something reasonable, and by trying
not to stray too far from the subject under discussion. More people
would read the postings all the way through and find out what the
main point is. This would benefit both Don and the readers.
>>As I have understood it, Don does not migrate "tasks", in the sense of
>>RTOS "task", but "jobs" or "computations". That is, what moves across
>>the net is the input for and output from a computation, together with
>>some scheduling criteria, and not a "task" in the sense of a TCB and
>>other context data, stack, variables, etc.
Distributed control systems have done this for decades. It is another
question whether it should be allowed to happen automatically, or
whether it should be commanded manually after careful analysis. For
any time-critical system, you would have to redo the timing analysis
before and after the move.
>I may be mistaken, but my understanding is that both occur in Don's
>system: tasks (in the sense of running programs) can move between CPUs
>as needed, and is discussed in this thread, running computations can
>jump from one CPU to another due to client/server interactions.
Sounds like a 1980's VAXcluster to me :-) Usable for non-RT
applications, but of course there could be no timing guarantees.
>
>Don will see this and set us all straight soon enough 8-)
>
>George
Hi George,

[SWMBO grumbling cuz her cookie jar has been empty for many days now.
Guess I'd best plan on baking, tonight, lest I have to deal with
The Frowny Face  (sigh).  Or, stop *improving* the Rx so she has
an incentive to buy store-bought, instead!!  ;-)  ]

On 7/17/2016 9:08 PM, George Neuner wrote:
> On Sun, 17 Jul 2016 09:56:32 +0300, upsidedown@downunder.com wrote:
>
>> On Sun, 10 Jul 2016 17:55:10 +0300, Niklas Holsti
>> <niklas.holsti@tidorum.invalid> wrote:
>>
>>> On 16-07-10 12:41 , Don Y wrote:
>>>> On 7/9/2016 5:46 AM, Niklas Holsti wrote:
>>>>> On 16-07-09 00:56 , Les Cargill wrote:
>>>>>> Niklas Holsti wrote:
>>>>>>> [snips]
>>>>>>> One example of interactions I find difficult is a shared I/O channel or
>>>>>>> bus that must be used by various tasks, for various purposes, with most
>>>>>>> transmissions being sporadic and such that the sending task must wait
>>>>>>> for and check a response transmission.
>>>>>>
>>>>>> So the sender sends asynchronously then blocks on a receive
>>>>>> ( presumably with a timeout ) , with other tasks/ISRs handling
>>>>>> the details. You may even have separate send and receive loops,
>>>>>> with state indicating the timeout for each receive.
>>>>>
>>>>> I have no problem *implementing* such things, my problem is
>>>>> *analysing* the timing to compute worst-case task response times
>>>>> under various load scenarios. This computation must also consider
>>>>> the possible latencies of response-generation at the remote end of
>>>>> the channel.
>>>>
>>>> It's no different than a uniprocessor implementation -- *if* you
>>>> "migrate" the "scheduling criteria" (in your case, "priorities")
>>>> *with* the communication... THROUGHOUT the system!
>>
>> If you have the luxury of multiple cores/processors, just set the task
>> affinities so that the critical tasks are locked on one
>> core/processor and do the timing analysis with that. The other
>> noncritical tasks will run on the remaining core(s)/processor(s).
>>
>> This is especially important in multiprocessor applications with large
>> private caches. If the scheduler is allowed to throw the task around
>> among all processors, there is a lot of cache invalidation.
>
> Don is talking about both multicores and physically separate
> processors connected by network. As he has described it elsewhere -
> his system is able to migrate a running task to any suitable platform
> anywhere within the network.
Actually, I only recently decided to support multicore processors;
mainly, they're too cheap *not* to! <frown>

But, I'm taking the "easy" way out, there -- assigning specific
subsystems to specific cores (e.g., like moving protected
"capabilities" around; something that every node needs to be able
to do!)

The other "migration" issues (bad choice of terms) have different
application domains.

E.g., migrating the "client thread" INTO the "server" (i.e., letting
it execute *in* the serving thread AS IF it still had the identity
of the client thread) applies when you do a *local* IPC as well as
when you do a *remote* RPC, obviously. I.e., even on a uniprocessor!
(amusing, then, that it seems to not be supported in COTS/FOSS OS
offerings! Oooops!)

IIRC, Alpha called these "distributed threads" or, less humbly,
"alpha threads" as a nod to the fact that the thread *conceptually*
wanders around the "system" (regardless of: multitasking on a
uniprocessor, multiprocessing via SMP or in a physically distributed
system).

The other sort of "migration" (relocation?) applies to "physically"
moving the process to a different node elsewhere in the network. In
this case, obviously only pertinent to a loosely coupled
multiprocessor system (NORMA). I.e., once "migrated", the original
host can get struck by lightning to no effect (wrt the task/process
in question).

[Note this is more involved than just packing up registers plus
address space!]
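As a rough illustration of the first kind of "migration" (the client thread executing *in* the server with the caller's identity), a server-side handler can adopt the caller's scheduling criteria for the duration of the request, whether the call arrived via local IPC or remote RPC. This is a hypothetical sketch; none of the names come from the Alpha kernel or from any shipping OS:

```c
/* Minimal sketch of a "distributed thread": the handler temporarily
 * executes under the CALLER's scheduling identity, then restores its
 * own on return. All names here are invented for illustration. */

typedef struct {
    int priority;  /* the caller's criterion, carried with the call */
} sched_ident_t;

int server_current_priority = 10;  /* the server's own base priority */

/* Handler executes AS IF it were the client: inherit, work, restore. */
int handle_request(const sched_ident_t *caller, int arg)
{
    int saved = server_current_priority;
    server_current_priority = caller->priority; /* adopt caller identity */
    int result = arg * 2;                       /* the actual service    */
    server_current_priority = saved;            /* drop it on return     */
    return result;
}
```

In a real kernel the "identity" would of course be applied by the scheduler, not by the handler itself; the sketch only shows the inherit-then-restore shape of the idea.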
> There have been some other operating systems capable of doing this.
> AIUI, the interesting thing about Don's system is that he is
> attempting to do real-time, real-world control ... not simply to
> distribute processing over a bunch of networked "compute servers".
Exactly. Previous systems were just "processor farms" -- typically
all "powered", just waiting for "workloads". The idea of *bringing*
another node on-line, ON-DEMAND, to address increasing needs wasn't
part of their scope (why should it be? Unless you're concerned with
power consumption!). Nor was there a concern over taking nodes
OFF-line when they weren't technically needed.

Finally, AFAICT, trying to meet timeliness constraints in such a
malleable system was an extra degree of freedom never addressed
(taking that into consideration AS you made these other
dispatching/scheduling decisions).

Cluster, Grid, NoW, Cloud, server farm, etc. -- none really address
the concept accurately (for different reasons).

IMO, designs will increasingly be moving in this direction as systems
become too complex to be addressed reliably on single processors
(even with multiple cores, memory becomes the bottleneck), as well as
toward physically more dispersed applications. How do you factor in
the activities of another "service" when *YOU* have to provide some
sort of timeliness guarantees?

[Should be relatively easy to see applications where a single
processor will fall flat!]
On 7/18/2016 11:16 AM, Niklas Holsti wrote:
> On 16-07-18 07:08 , George Neuner wrote:
>>>> On 16-07-10 12:41 , Don Y wrote:
>>>>> It's no different than a uniprocessor implementation -- *if* you
>>>>> "migrate" the "scheduling criteria" (in your case, "priorities")
>>>>> *with* the communication... THROUGHOUT the system!
>>
>> Don is talking about both multicores and physically separate
>> processors connected by network. As he has described it elsewhere -
>> his system is able to migrate a running task to any suitable platform
>> anywhere within the network.
>
> As I have understood it, Don does not migrate "tasks", in the sense of
> RTOS "task", but "jobs" or "computations". That is, what moves across
> the net is the input for and output from a computation, together with
> some scheduling criteria, and not a "task" in the sense of a TCB and
> other context data, stack, variables, etc.
No, as George mentioned, I do both. Unfortunately, there is a problem with terminology.
> The migrated computation is performed by some task (in the RTOS sense)
> that already exists on the remote node, or is perhaps created on that
> node for doing this computation.
These have been called "distributed threads" (/cf/ Alpha) to stress the fact that the "execution" is distributed across the system. Note that this can apply to a uniprocessor as well as a multiprocessor/distributed system. (I.e., IPC as well as RPC)
> This is of course much simpler and more efficient than migrating a
> task, and does not depend on any HW compatibility of the local and
> remote nodes, and does not need deep access to task internals in the
> local and remote RTOS.
You don't need HW compatibility to achieve this. <grin> But, you *do*
need to "get intimate" with the OS, as "registers+address space" are
insufficient. OTOH, you'd *expect* the OS to be the entity providing
this service!

Consider how you would address:
- computational requirements exceeding what you can buy in a SMP node
- a *HUGE* "dynamic range" of computational requirements
- hardware failures
On Mon, 25 Jul 2016 02:40:14 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>Hi George,
>
>[SWMBO grumbling cuz her cookie jar has been empty for many days now.
>Guess I'd best plan on baking, tonight, lest I have to deal with
>The Frowny Face (sigh). Or, stop *improving* the Rx so she has
>an incentive to buy store-bought, instead!! ;-) ]
Leave it empty for a while longer ... maybe she'll try making them herself. 8-)
>On 7/17/2016 9:08 PM, George Neuner wrote:
>
>> Don is talking about both multicores and physically separate
>> processors connected by network. As he has described it elsewhere -
>> his system is able to migrate a running task to any suitable platform
>> anywhere within the network.
>
>Actually, I only recently decided to support multicore processors;
>mainly, they're too cheap *not* to! <frown>
>
>But, I'm taking the "easy" way out, there -- assigning specific
>subsystems to specific cores (e.g., like moving protected
>"capabilities" around; something that every node needs to be able
>to do!)
>
>The other "migration" issues (bad choice of terms) have different
>application domains.
Just about all "process" related terminology has been so heavily
overloaded that it's hard to have a group discussion: unless you
rigorously define absolutely every term, everyone will have a [more
or less] different understanding based upon their own experience.

There was a time [not so long ago] when "thread" referred to flow of
control rather than a "scheduling entity", a single CPU was
"multi-programmed" to [appear to] do many things simultaneously,
"multi-programming" was distinct from "multi-processing", and
"multi-tasking" was a layman's term that could mean almost anything.

Ah, the good ol' days.
>E.g., migrating the "client thread" INTO the "server" (i.e., letting
>it execute *in* the serving thread AS IF it still had the identity
>of the client thread) applies when you do a *local* IPC as well as when
>you do a *remote* RPC, obviously. I.e., even on a uniprocessor!
>(amusing, then, that it seems to not be supported in COTS/FOSS OS
>offerings! Oooops!)
>
>IIRC, Alpha called these "distributed threads" or, less humbly, "alpha
>threads" as a nod to the fact that the thread *conceptually* wanders
>around the "system" (regardless of: multitasking on a uniprocessor,
>multiprocessing via SMP or in a physically distributed system).
Yes.
>The other sort of "migration" (relocation?) applies to "physically"
>moving the process to a different node elsewhere in the network.
>In this case, obviously only pertinent to a loosely coupled
>multiprocessor system (NORMA).
Not familiar with NORMA. WRT OS literature, "migration" is the usual term for execution moving to a different CPU (or nowadays to a different core).
>I.e., once "migrated", the original host can get struck by
>lightning to no effect (wrt the task/process in question)
Maybe. The "original host" might be another core on the same die.
>[Note this is more involved than just packing up registers plus
>address space!]
It isn't THAT hard: clustered mainframes in the 1960's had the ability
to migrate processes ... swap out here, swap in there. All it really
requires is virtual addressing capability and a way to transport the
code and runtime data.

[The page oriented addressing in modern OSes is space efficient, but
it actually complicates things versus simple base:offset segmentation
addressing.]

I know you're referring to the issues of (de)serializing runtime data
structures for shipment over network ... I'm just pointing out that
migration can be (and was!) accomplished relatively simply using
shared storage.
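George's "swap out here, swap in there" can be made concrete with a toy flat-memory process image. A real migration would also have to carry the kernel-side state (open endpoints, pending I/O) that Don alludes to later; all types here are purely illustrative:

```c
/* Toy model of swap-based migration: with virtual addressing and a
 * transport for code and data, a process image written to shared
 * storage can be resumed on any node that maps the same layout.
 * Real systems also carry kernel state, which this sketch ignores. */
#include <string.h>
#include <stdint.h>

typedef struct {
    uint32_t pc, sp;       /* register file, trivially small here */
    uint8_t  mem[128];     /* the process "address space"          */
} proc_image_t;

/* Swap out: copy the image to (shared) storage ... */
void swap_out(const proc_image_t *p, uint8_t *store) {
    memcpy(store, p, sizeof *p);
}

/* ... swap in on whichever node picks up the work. */
void swap_in(proc_image_t *p, const uint8_t *store) {
    memcpy(p, store, sizeof *p);
}
```

With base:offset segmentation, "store" could literally be the shared drum or disk; with paged virtual memory the copy is scattered across page frames, which is George's point about paging complicating things.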
>> There have been some other operating systems capable of doing this.
>> AIUI, the interesting thing about Don's system is that he is
>> attempting to do real-time, real-world control ... not simply to
>> distribute processing over a bunch of networked "compute servers".
>
>Exactly. Previous systems just were "processor farms" -- typically
>all "powered" just waiting for "workloads". The idea of *bringing*
>another node on-line, ON-DEMAND to address increasing needs wasn't
>part of their scope (why should it be? Unless you're concerned
>with power consumption!). Nor was there a concern over taking
>nodes OFF-line when they weren't technically needed.
The current crop of tera-scale computers consume megawatts, and the
largest peta-scale computers consume 10s of megawatts when all their
CPUs and attached IO devices are active. They have extensive power
control systems to manage the partitioning of active/inactive devices.

Even modern desktop CPUs have reached the point of needing to turn
on/off functional units on demand as the instruction stream dictates.
They need to do it because they are unable to dissipate heat
effectively enough and will self-destruct if too many circuits are
powered simultaneously. Your average quad-core processor now lives in
a perpetual state of "rolling blackout" with at most about 1/3 of its
circuitry powered up at any given instant. Many circuits are turned
on/off cycle by cycle.

[Server chips, on the whole, are no better - but having more circuitry
to work with means their powered up "1/3" can do more. Unless you're
nitrogen cooling your system, you really aren't able to use much of
what you theoretically paid for.]

And nobody has yet come up with a really good way to exploit massively
parallel hardware for general application programming. But that's a
different discussion.
>Finally, AFAICT, trying to meet timeliness constraints in such
>a malleable system was an extra degree of freedom never addressed.
>(and taking that into consideration AS you made these other
>dispatching/scheduling decisions)
Yes.
>Cluster, Grid, NoW, Cloud, server farm, etc. -- none really address
>the concept accurately (for different reasons).
>
>IMO, designs will increasingly be moving in this direction as
>systems become too complex to be addressed reliably on single
>processors (even with multiple cores, memory becomes the bottleneck)
>as well as physically more dispersed applications. How do you
>factor in the activities of another "service" when *YOU* have to
>provide some sort of timeliness guarantees?
>
>[Should be relatively easy to see applications where a single processor
>will fall flat!]
George
On Mon, 25 Jul 2016 16:56:42 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 7/25/2016 2:32 PM, George Neuner wrote:
>
>>> [Note [process migration] is more involved than just packing up
>>> registers plus address space!]
>>
>> It isn't THAT hard: clustered mainframes in the 1960's had the ability
>> to migrate processes ... swap out here, swap in there. All it really
>> requires is virtual addressing capability and a way to transport the
>> code and runtime data.
>
>Yes. But they already had the "extra bits" (of state) that were
>resident IN the OS's data structures. They either packed those
>up with the "(formal) process state" as it was swapped out
>*or* kept it in the kernel associated with the swapped out
>process.
Yes.
>For example, any network traffic that was active at the time the
>swap occurred still ended up with its endpoint on the current
>node. You didn't have to buffer any incoming messages intended
>for that "to be swapped" process and later forward them to the
>new destination when the process is "restored".
Yes ... but circuit switching was old hat in the 1940's and packet
switching evolved in the late 1950's [arguably it was first used in a
real system in 1961]. Through the 1960's few installations had more
than modems and terminals to worry about - easily handled using
controlled switching of either type. There were few packet networks
(as we know them), most experimental, and no communication standards:
there were as many different protocols as there were networks.

Almost all of the hardware and the software protocols that people
typically think of as being associated with "early" networking -
Aloha, PUP, X.25, ARCnet, Ethernet, etc. - really all date from the
1970's.

But your point is taken. <grin>
>>> ... Previous systems just were "processor farms" -- typically
>>> all "powered" just waiting for "workloads". The idea of *bringing*
>>> another node on-line, ON-DEMAND to address increasing needs wasn't
>>> part of their scope (why should it be? Unless you're concerned
>>> with power consumption!). Nor was there a concern over taking
>>> nodes OFF-line when they weren't technically needed.
>>
>> The current crop of tera-scale computers consume megawatts, and the
>> largest peta-scale computers consume 10s of megawatts when all their
>> CPUs and attached IO devices are active. They have extensive power
>> control systems to manage the partitioning of active/inactive devices.
>
>But their goal is to *use* all of that compute power, not let it
>idle. They tend to be more homogeneous environments with more
>"level" I/O usage. It's not like turning on CCTV cameras "because
>it's getting dark outside" and, as a result, *needing* that extra
>compute power to do video processing.
Supercomputing centers all are batch oriented just like mainframes
used to be. The difference is they shut off CPUs that aren't in use -
if any - to lower the power bills.

It's true that a lot of older machines have plenty of work to keep
them running ... but in the last 10-15 years, many newer ones have had
odd architectures that make writing software for them difficult and
time consuming. It is true that a lot of them use Intel or ARM
processors, but it isn't true that they all run Linux and can be
programmed using GCC/OpenMP. Some of the world's most powerful systems
sit idle much of the time, simply for lack of software.

And a lot of the software itself is surprisingly flexible. In most SCC
environments there is no multi-tasking: a set of CPUs is dedicated to
a program for its duration. But external factors may cause a program
to be halted before finishing. The stopped program may be restarted
later with a different number of CPUs according to the mix of programs
that are running at that time.

Programs which are expected to need many CPUs (or vast amounts of
memory, which very often is tied to the number of CPUs), or which are
expected to run for more than a few minutes - such programs often are
written to checkpoint intermediate processing, to be restartable from
saved checkpoints, and to adapt dynamically to the number of CPUs they
are given when (re)started.
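The checkpoint/restart pattern George describes reduces to a small loop shape: do a bounded amount of work, persist "where I got to", and resume from that record on the next (re)start, possibly with a different CPU allocation. A minimal single-CPU sketch with invented names:

```c
/* Sketch of checkpoint/restart: a long job periodically records the
 * iteration it reached, and on (re)start resumes from the checkpoint.
 * Illustrative only; a real job would persist the struct to storage
 * and re-partition remaining work over the CPUs granted this time. */
#include <stdint.h>

typedef struct {
    uint32_t next_iter;  /* first iteration not yet done     */
    uint64_t partial;    /* accumulated result so far        */
} ckpt_t;

/* Run until 'budget' iterations are spent (an external halt), or the
 * job finishes; the returned struct IS the checkpoint to persist. */
ckpt_t run_until(ckpt_t c, uint32_t total, uint32_t budget)
{
    while (c.next_iter < total && budget--) {
        c.partial += c.next_iter;   /* stand-in for the real work */
        c.next_iter++;
    }
    return c;
}
```

Two separate "sessions" over the same work, interrupted in the middle, reach the same result as one uninterrupted run, which is exactly what makes the halt/restart scheduling policy safe.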
>> And nobody has yet come up with a really good way to exploit massively
>> parallel hardware for general application programming. But that's a
>> different discussion.
>
>I don't believe we'll see any "effective" algorithms -- largely because
>there aren't many "available" installations. So, you have groups
>with very specific sorts of application sets trying to tackle the
>problem for *their* needs; not for "general" needs.
The point is that "massively parallel" is becoming the norm. There are
commodity server chips now with 16 and 32 cores, and this year's high
end server chip will be in a laptop in 5 years.

[ASUS now will happily sell you a water-cooled !! laptop with an
overclocked 6th-gen i7 paired with 2 Nvidia 1080 GPUs. If you
disconnect the water line it melts ... or maybe just slows down 50%.
But seriously, what do you expect for only $7K?]

In any event, more cores are fine for running more programs
simultaneously, but there's no good *general* way to leverage more
cores to make a single program run faster. The ways that are known to
be automatically exploitable (by a compiler) are largely limited to
parallelizing loops. Parallelizing non-looping code [which is most of
most programs that need it] invariably relies on the programmer to
recognize possible parallelism and write special code to take
advantage of it.

Why should anyone care? Good question. I don't have a good answer, but
a couple of data points:

Lots of experience has shown that the average programmer can't write
correct parallel code [or even just correct serial code, but that's
another discussion]. Automating parallelism - via compilers or smart
runtime systems - is the only way any significant percentage of
programs will be able to benefit.

Surveys have shown that for many people, the only "computer" they own
or routinely use is their smartphone. The average person quite soon
will reasonably expect their phone to be able to do anything: word
processing, spreadsheets, audio/video editing and presentation, i.e.
general business computing, and (for those few taking college STEM
courses) solving differential equations, performing circuit
simulations, virtual reality walkthroughs of pyramids, galaxies,
cadavers, etc. ... while snapchatting, tweeting, facebooking,
pinteresting, and still providing 40 hours of use on a battery charge.
Unless someone comes up with a way to pack kWh into an AA-sized
package [that can be recharged in under 30 minutes], making better use
of large numbers of low-powered cores is going to be the only way
forward.

As always, YMMV.
George
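George's point that compiler-exploitable parallelism is mostly loop-level is what OpenMP-style annotations capture: a reduction over a loop parallelizes mechanically, while the surrounding program logic does not. A sketch (if OpenMP is not enabled, the pragma is ignored and the loop runs serially with the same result):

```c
/* Loop-level parallelism, the case that IS automatically exploitable:
 * the pragma distributes iterations across cores and combines the
 * per-thread partial sums. Compiled without -fopenmp, the pragma is
 * ignored and the function still returns the identical answer. */
#include <stddef.h>

double dot(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Note the contrast: nothing like this pragma exists for the branchy, stateful code that makes up most of a program, which is exactly the gap George is pointing at.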
On Tue, 26 Jul 2016 06:27:42 -0400, George Neuner
<gneuner2@comcast.net> wrote:

>On Mon, 25 Jul 2016 16:56:42 -0700, Don Y
><blockedofcourse@foo.invalid> wrote:
>
>>On 7/25/2016 2:32 PM, George Neuner wrote:
>>
>>>> [Note [process migration] is more involved than just packing up
>>>> registers plus address space!]
>>>
>>> It isn't THAT hard: clustered mainframes in the 1960's had the ability
>>> to migrate processes ... swap out here, swap in there. All it really
>>> requires is virtual addressing capability and a way to transport the
>>> code and runtime data.
>>
>>Yes. But they already had the "extra bits" (of state) that were
>>resident IN the OS's data structures. They either packed those
>>up with the "(formal) process state" as it was swapped out
>>*or* kept it in the kernel associated with the swapped out
>>process.
>
>Yes.
>
>>For example, any network traffic that was active at the time the
>>swap occurred still ended up with its endpoint on the current
>>node. You didn't have to buffer any incoming messages intended
>>for that "to be swapped" process and later forward them to the
>>new destination when the process is "restored".
Is this any different from the situation in the old days, when you had
to swap out complete programs from core to make room for other
programs? If the task had some active I/O going on, it had some
I/O-buffers (DMA-buffers) locked in memory. The situation was quite
nasty, especially with slow I/O such as mag tape (possibly involving a
tape rewind).

There are several alternatives:

* lock the whole program in memory until I/O is complete (nasty)
* just lock the I/O buffers (possibly part of a small I/O program) and
  swap that out too, when I/O is completed
* abort the I/O and retry again when the program is swapped back into
  memory. Possible for read operations from mass storage

The last alternative is useful also with network traffic, provided
that the sender buffers the transmitted data until it is acknowledged
by the receiver.

With modern multicore processors with virtual memory, this should be
trivial as long as the processors share the same physical memory bus.
In a system with physically separate platforms, some network helper
programs are needed to transfer data between two buffers in different
platforms using the available transfer systems, such as Ethernet.
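The last alternative above (abort and retransmit, with the sender buffering until acknowledged) is essentially a sliding-window sender. A toy sequence-number buffer, purely illustrative, shows the invariant that makes the abort safe: anything not yet ACKed is still held and can simply be resent after the swap:

```c
/* Toy sender-side retransmit buffer: transmitted data is held until
 * the receiver acknowledges it, so an aborted or lost transfer can be
 * resent after the program is swapped back in. Sizes and names are
 * illustrative; real protocols add timers and window limits. */
#include <stdint.h>

enum { TXQ_LEN = 8 };

typedef struct {
    uint32_t seq;            /* oldest unacknowledged sequence number */
    uint32_t next;           /* next sequence number to assign        */
    uint8_t  data[TXQ_LEN];  /* payload per outstanding message (toy) */
} txq_t;

/* Queue a byte for transmission; returns its sequence number. */
uint32_t tx_send(txq_t *q, uint8_t payload) {
    q->data[q->next % TXQ_LEN] = payload;
    return q->next++;
}

/* A cumulative ACK releases everything up to and including 'seq'. */
void tx_ack(txq_t *q, uint32_t seq) {
    if (seq >= q->seq) q->seq = seq + 1;
}

/* How many messages would need retransmission after an abort. */
uint32_t tx_unacked(const txq_t *q) { return q->next - q->seq; }
```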
Hi George,

[Early morning meeting -- WTF?  Had *hoped* I'd get a nap in beforehand
but obviously got caught up in "stuff"...  <frown>]

>>>> ... Previous systems just were "processor farms" -- typically
>>>> all "powered" just waiting for "workloads". The idea of *bringing*
>>>> another node on-line, ON-DEMAND to address increasing needs wasn't
>>>> part of their scope (why should it be? Unless you're concerned
>>>> with power consumption!). Nor was there a concern over taking
>>>> nodes OFF-line when they weren't technically needed.
>>>
>>> The current crop of tera-scale computers consume megawatts, and the
>>> largest peta-scale computers consume 10s of megawatts when all their
>>> CPUs and attached IO devices are active. They have extensive power
>>> control systems to manage the partitioning of active/inactive devices.
>>
>> But their goal is to *use* all of that compute power, not let it
>> idle. They tend to be more homogeneous environments with more
>> "level" I/O usage. It's not like turning on CCTV cameras "because
>> it's getting dark outside" and, as a result, *needing* that extra
>> compute power to do video processing.
>
> Supercomputing centers all are batch oriented just like mainframes
> used to be. The difference is they shut off CPUs that aren't in use -
> if any - to lower the power bills.
>
> It's true that a lot of older machines have plenty of work to keep
> them running ... but in the last 10-15 years, many newer ones have
> had odd architectures that make writing software for them difficult
> and time consuming. It is true that a lot of them use Intel or ARM
> processors, but it isn't true that they all run Linux and can be
> programmed using GCC/OpenMP. Some of the world's most powerful
> systems sit idle much of the time, simply for lack of software.
>
> And a lot of the software itself is surprisingly flexible. In most
> SCC environments there is no multi-tasking: a set of CPUs is dedicated
> to a program for its duration. But external factors may cause a
> program to be halted before finishing. The stopped program may be
> restarted later with a different number of CPUs according to the mix
> of programs that are running at that time.
Different than being "paused" while "relocated" -- yet expecting the
rest of the "system" to accommodate their time "in transit".

Note that, conceptually, my code could run on a single CPU (with
enough resources) -- it isn't *inherently* parallel. I think this is
an essential aspect for development and future maintenance: you only
need to expect the *illusion* of parallelism.

[The other aspects we've been discussing are all hidden from the
application developer in much the same way as GC. I.e., if you think
about things, you realize "it's happening" but your code never really
understands why/where]
> Programs which are expected to need many CPUs (or vast amounts of
> memory which very often is tied to the number of CPUs), or which are
> expected to run for more than a few minutes - such programs often are
> written to checkpoint intermediate processing, to be restartable from
> saved checkpoints, and to adapt dynamically to the number of CPUs they
> are given when (re)started.
>
>>> And nobody has yet come up with a really good way to exploit massively
>>> parallel hardware for general application programming. But that's a
>>> different discussion.
>>
>> I don't believe we'll see any "effective" algorithms -- largely because
>> there aren't many "available" installations. So, you have groups
>> with very specific sorts of application sets trying to tackle the
>> problem for *their* needs; not for "general" needs.
>
> The point is that "massively parallel" is becoming the norm. There
> are commodity server chips now with 16 and 32 cores, and this year's
> high end server chip will be in a laptop in 5 years.
I consider SMP a different beast than distributed systems. E.g., my blade server has ~60 (?) cores in the same box -- but it's really 14 separate "machines" that really are only suited to working in a "server farm" sort of environment; tied to applications that are simply replicated with a split workload (instead of a single application that has been "diced up" to run on the many processors).
> [ASUS now will happily sell you a water-cooled !! laptop with an
> overclocked 6th-gen i7 paired with 2 Nvidia 1080 GPUs. If you
> disconnect the water line it melts ... or maybe just slows down 50%.
> But seriously, what do you expect for only $7K?]
>
> In any event, more cores are fine for running more programs
Exactly. Which is how I've exploited "CPUs" (and, now, cores). Conveniently sidestepping the issue of "how do you develop for this environment" by treating it as many "programs" instead of a single program that magically diced itself to pieces.
> simultaneously, but there's no good *general* way to leverage more > cores to make a single program run faster. The ways that are known to > be automatically exploitable (by a compiler) are largely limited to > parallelizing loops. Parallelizing non looping code [which is most of > most programs that need it] invariably relies on the programmer to > recognize possible parallelism and write special code to take > advantage of it. > > Why should anyone care? Good question. I don't have a good answer, > but a couple of data points: > > Lots of experience has shown that the average programmer can't write > correct parallel code [or even just correct serial code, but that's
I think that's inherent in the way that most people *think* of algorithms: "this, then that, followed by yet another thing". Being a hardware person, I implicitly see walking and chewing gum as "obvious"... why should this bit of hardware *pause* waiting for this other bit to finish ("either-or")?
> another discussion]. Automating parallelism - via compilers or smart > runtime systems - is the only way any significant percentage of > programs will be able to benefit.
I don't see big gains, there. E.g., I probably represent an "above average compute load" (in terms of what and how much I use machines for during development), yet am reasonably sure most of the apps that I run would not SIGNIFICANTLY benefit from even a *second* core (well, maybe autorouting a PCB while simultaneously doing something else, but that's the exception, not the rule). Note my previous comment re: how I've decided to use multicore CPUs in the HA system: by assigning specific jobs to specific cores KNOWING what those loads represent in my system *and* thereby being able to sidestep the core-scheduling issue. Will this partitioning be the most efficient use of the cores? Probably not. But, they're inexpensive and this *will* make it easier to ignore some of the costs that my design "naturally" incurs.
> Surveys have shown that for many people, the only "computer" they own > or routinely use is their smartphone. The average person quite soon > will reasonably expect their phone to able to do anything: word > processing, spreadsheets, audio/video editing and presentation, i.e. > general business computing, and (for those few taking college STEM > courses) solving differential equations, performing circuit > simulations, virtual reality walkthroughs of pyramids, galaxies, > cadavers, etc. ... > ... while snapchatting, tweeting, facebooking, pinteresting, and still > providing 40 hours of use on a battery charge.
I disagree with the "while" assumption (as in true concurrency). I still think people think about "work" serially. As long as you *appear* to make some progress on "task A" while they are busy with "task B", they have no real way to judge how *good* your effort happened to be. E.g., if I'm updating a schematic *while* autorouting (another) design, I only care that the autorouter made *some* progress while I had the machine "tied up" with my schematic entry activities. It's not like I review its progress and think: "Gee, it should be farther along than it is...". OTOH, if it had (obviously) *paused* "while I was looking elsewhere", I'd be annoyed. With this in mind, I think it makes it easier to develop application sets that can achieve "satisfactory" performance without aiming for "ideal" performance. The challenging (portions of) applications being those things that the user *expects* to run in "real time" (e.g., the "answering machine" application on your phone can't suddenly start speaking the OGM in slow motion just because your attention is focused on something else!). E.g., I can do "commercial detection" in recorded video "off-line"... as long as I can get it DONE before the user wants to view that video! (which, of course, can't be *while* it is streaming cuz it consumes more real time than it *occupies*). OTOH, I have to do motion detection (CCTV) in real time cuz that affects the latency of the actions triggered by that motion.
> Unless someone comes up with a way to pack kWh into a AA sized package > [that can be recharged in under 30 minutes], making better use of > large numbers of low powered cores is going to be the only way > forward.
Or, convince people that they don't need to "take it with them"! E.g., my UIs have far less power requirements than the "code" to run them implies. (OTOH, they're featherweight devices so the battery issue still applies). I don't think people care *where* the processing is performed; they just want to ACCESS it "locally" (wherever "locally" may be!) I look at how my HA system (and its successor) have evolved and that's the single most striking "optimization" in it all! Decouple the UI from the application. And, make the UI ubiquitous at the same time --> simpler, more desirable system! [Of course, for most people, that means reliance on someone else to provide that "service"... still baffles me to see what people pay for that "convenience" re: cell phones!] (sigh) Shower then gather up my prototypes and get my *ss out of here. THEN, perhaps, the nap I've been awaiting?
On 7/26/2016 5:59 AM, upsidedown@downunder.com wrote:
> On Tue, 26 Jul 2016 06:27:42 -0400, George Neuner > <gneuner2@comcast.net> wrote: > >> On Mon, 25 Jul 2016 16:56:42 -0700, Don Y >> <blockedofcourse@foo.invalid> wrote: >> >>> On 7/25/2016 2:32 PM, George Neuner wrote: >>> >>>>> [Note [process migration] is more involved than just packing up >>>>> registers plus address space!] >>>> >>>> It isn't THAT hard: clustered mainframes in the 1960's had the ability >>>> to migrate processes ... swap out here, swap in there. All it really >>>> requires is virtual addressing capability and a way to transport the >>>> code and runtime data. >>> >>> Yes. But they already had the "extra bits" (of state) that were >>> resident IN the OS's data structures. They either packed those >>> up with the "(formal) process state" as it was swapped out >>> *or* kept it in the kernel associated with the swapped out >>> process. >> >> Yes. >> >>> For example, any network traffic that was active at the time the >>> swap occurred still ended up with its endpoint on the current >>> node. You didn't have to buffer any incoming messages intended >>> for that "to be swapped" process and later forward them to the >>> new destination when the process is "restored". > > Is this any different from the situation in the old days, when you had > to swap out complete programs from core to make room for other > programs ? If the task had some active I/O going on, it had some > I/O-buffers (DMA-buffers) locked in memory. The situation was quite > nasty, especially with slow I/O such as mag tape (possibly involving a > tape rewind).
The I/O was associated with a specific task. You could simply defer swapping out its results until the I/O had completed (e.g., at the next IRG). This is no different than deferring the context switch of a "coprocessor" (most typically FPU) until a convenient point AFTER the body of the task had undergone its context switch: you know who the FPU's state belongs to, you just have to "remember" that when it comes time to attempt to USE the FPU after the "primary" context switch. Instead, imagine if the I/O started by a task in a multitasking system *stopped* when that task wasn't "running" (i.e., had control of the processor). Consider how you'd address that sort of implementation. I.e., the swapped out task is no longer able to provide SERVICES that other tasks are counting upon for their continued operation. How useful would "multitasking" be in that scenario?
> There are several alternatives: > > * lock the whole program in memory until I/O is complete (nasty) > > * just lock the I/O buffers (possibly part of a small I/O program) and > swap that out too, when I/O is completed > > * abort the I/O and retry again when the program is swapped back into > memory. Possible for read operations from mass storage > > The last alternative is useful also with network traffic, provided > that the sender buffers the transmitted data until it is acknowledged > by the receiver.
You're thinking about network traffic that *can* be stopped/paused. Imagine "suddenly" (from the standpoint of other applications) saying that "printf() will not be available" while the server that implements the printf() functionality is being relocated (or swapped out). Do all of the programs that have printf()'s have to learn how to deal with that situation? ("OK, I'll print the results, later..." when?) Or, does printf()'s unavailability automatically cause those other dependent tasks to block (indefinitely)?
> With modern multicore/processors with virtual memory this should be > trivial as long as the processors share the same physical memory bus.
Moving the contents of memory is trivial: one system call, in my case (object that references the task's memory space; object that references the destination node). The problem is all the other cruft that has to be gathered up (extricated from the OS) to go along with the "task" while it is in transition. - "Hello, Mr Task. Here are the results of that last RPC that you issued..." - "Hello, Mr. Task. Could you please perform this service for me?" - "Hey, Mr task! Where are you going?? You still haven't given me the results of that last service that I requested!!" - "Hey, Mr Task, you're holding some locks that I need! Please don't tell me you're going to continue holding them while you're being swapped out? That's just plain RUDE!" - "Um, while you're 'away', can I use the resources that you've RESERVED?" etc.
> In a system with physically separate platforms, some network helper > programs are needed to transfer data between two buffers in different > platforms with available transfer systems, such as Ethernet.