Reply by Don Y August 3, 2016
Hi George,

Rain, rain, rain...  We tend to forget what 90% RH feels like when it's
typically 9%!  :-/   (OTOH, never say 'no' to rain!)

On 7/27/2016 11:22 AM, George Neuner wrote:
> On Tue, 26 Jul 2016 22:07:13 -0700, Don Y > <blockedofcourse@foo.invalid> wrote: > >> On 7/26/2016 9:50 AM, George Neuner wrote: >>> >>> In actual fact, early tape based mainframes quickly developed block >>> based tape "filesystems" that allowed data "files" on the tape to be >>> discontiguous. To make it faster, multiple tapes were used as >>> RAID-like stripe sets - reading/writing one while seeking on another, >>> etc. A file only had to be contigous on tape when it entered or left >>> the system. >>> >>> When disks were introduced, the "filesystem" concept followed. >> >> When I first adopted NetBSD (early 90's), they had no support for >> 9T tape (which is what I was using for "portable" tape exchange >> as well as offline storage, at the time). Originally, an F880 >> (800/1600bpi, 100ips) >> >> As few folks *had* 9T transports, it fell to me to write a driver to >> support my controller. Adding the "block" device as well as the more >> traditional "character" device. So, I could build "tape filesystems" >> "just as easily" as disk-based! >> >> Conceptually, a piece of cake! The "strategy" routine took a bit >> of work to add optimizations like "read reverse" but it was kinda cool! >> >> However, the first time I tried to "read" a file system mounted on >> the device, the "violence" of the mechanism's motions was scary! >> Made me wonder if I was tearing things up with all the short >> back-and-forth motions (seeks) and aggressive acceleration and >> braking in the transport! Sort of like watching a pen-plotter running >> balls-out! >> >> And, that was with files laid down in contiguous tape blocks; >> the idea of the transport being commanded to seek blocks >> scattered around the medium (no buffering in the transport, >> formatter or controller!) led me to believe I'd quickly >> wear SOMETHING out in the transport -- even if only the head >> or tension arm roller! > > I don't doubt it. But I think you might have hit upon a worst case > scenario for a single tape.
I suspect a good part of it was due to the fact that neither the
transport, formatter nor controller did any sort of buffering.  I.e.,
bytes came right off the head into a latch on the ISA bus; get it NOW
or it's gone!  That, coupled with relatively large IRGs and low
densities (800/1600 PE), meant the transport had to make "noticeable"
movements to do *anything*!  (BSF/BSR, etc.)

By contrast, more modern "drives" have huge buffers, VTOCs, really
high densities and much less "mass" involved, so they don't have to
"work" as hard to get at data.  I think about ~50MB on a 10"
BlackWatch... compared to a sizeable fraction of a TB on a DLT-S4 (or
many TB on newer LTOs <shudder>)

OTOH, modern drives have more brains in their onboard firmware than
was available in the HOSTS of most 9Ts!
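For anyone who hasn't driven an unbuffered controller, a minimal
sketch of what that "get it NOW or it's gone" constraint looks like
from the host side -- the register addresses, status bits and inb()
helper are all hypothetical, not the actual F880/controller interface:

    #include <stdint.h>

    /* Hypothetical ISA I/O map -- NOT the real formatter/controller registers */
    #define DATA_LATCH   0x300            /* one byte; overwritten as the next byte comes off the head */
    #define STATUS_REG   0x301
    #define ST_BYTE_RDY  0x01             /* a fresh byte is sitting in the latch */
    #define ST_OVERRUN   0x02             /* host was too slow; that byte is gone */

    extern uint8_t inb(uint16_t port);    /* platform port-I/O primitive, assumed */

    /* Pull one tape block out of the latch by polling.
       Returns bytes read, or -1 if we missed a byte (overrun). */
    static int read_block(uint8_t *buf, int len)
    {
        for (int i = 0; i < len; i++) {
            uint8_t st;
            while (!((st = inb(STATUS_REG)) & ST_BYTE_RDY))
                ;                         /* spin -- there is no FIFO to fall back on */
            if (st & ST_OVERRUN)
                return -1;                /* data lost; only recovery is a BSR and re-read */
            buf[i] = inb(DATA_LATCH);     /* grab it before the next byte lands */
        }
        return len;
    }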
> Mainframe tape drivers - really just an I/O program - performed > request reordering and elevator seeking to minimize thrashing of the > tapes. Any particular tape in a big system was likely to have many > outstanding requests at any given time. > > Discontiguous files were only ever used on scratch tapes that were > temporary storage. Code images on "program" tapes always were > contiguous, as were any files on external "transport" tapes. Input > entering the system or output leaving it always was contiguous (for > portability). > > 1960 era mainframes had (by today's standards) severe memory > limitations: a machine with 256K words[*] of RAM was a *BIG* system > that would be juggling dozens of jobs. The average program used less > than 12K of memory and scripted execution of a "pipeline" of small > programs was very common. > > [*] https://en.wikipedia.org/wiki/Word_(computer_architecture)
Yeah, my first disk (RS08) had 256K words (12b?).  Drew something
like 500W (for the drive + controller) and could source data at a
whopping ~50KW/s.  If you were lucky, you could fit *four* of them in
a full size rack (for a 1 megaword store! <drool>)

But, it was cool in that it was WORD ADDRESSABLE! (!)
[And, had a boatload of blinkenlights!]
> Quite a lot of routine data manipulation was done by writing/reading > temporary files, as was all passing of data between programs. It > wasn't feasible to dedicate individual scratch tapes to every running > job, and often jobs needed multiple files simultaneously which was > hideously slow and inefficient unless the files all were on separate > tapes. > > And once multitasking systems became the norm, it was no longer > politically expedient for the mainframe operators to delay jobs of > "important" users until scratch tapes were available. > > Thus the discontiguous tape filesystem was born. > > On my shelf, I have a textbook of "external" - i.e. file based - data > processing algorithms. It was written in 1988, an indication that > file based techniques were still being taught in some CS/CIS programs > then. > > Modern "big data" processing has shown that file based techniques are > *still* relevant and that it was a mistake to stop teaching them: > programmers now often have no clue how to proceed when, for whatever > reason, their data can't be fit into memory. > [Or worse: their "in-memory" processing is much slower than they > expect because because their data is swapped out due to multitasking.]
Nothing *ever* "goes obsolete" -- much like fashion; you just have to
wait for folks to rediscover *old* techniques as *new* problems
manifest and prove to no longer fit the assumptions they've
artificially imposed on their solutions!

Had to do that just last week to (brute force) process some huge
"lists" to document how I'd built the diskless workstations for my
upcoming "class".  Annoying, as the "files" weren't even local (NFS)!
After finishing (quick and dirty), I realized a simpler way of getting
the same results...  :<
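As a refresher on the file-based techniques George mentions, here is a
bare-bones external merge sort skeleton in C -- run generation only,
with the k-way merge left as the obvious second pass.  The record
type, chunk size and file naming are arbitrary; it's a sketch of the
technique, not a tuned implementation:

    #include <stdio.h>
    #include <stdlib.h>

    #define RUN_LEN (1u << 20)            /* records held in memory at once (arbitrary) */

    static int cmp_u32(const void *a, const void *b)
    {
        unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
        return (x > y) - (x < y);
    }

    /* Pass 1 of an external merge sort: read the input in memory-sized
       chunks, sort each chunk, and spill it to a numbered "run" file.
       Pass 2 (not shown) k-way merges the runs with one buffer per run. */
    static int make_runs(FILE *in)
    {
        unsigned *buf = malloc(RUN_LEN * sizeof *buf);
        int runs = 0;
        size_t n;

        if (!buf)
            return -1;
        while ((n = fread(buf, sizeof *buf, RUN_LEN, in)) > 0) {
            char name[32];
            FILE *out;

            qsort(buf, n, sizeof *buf, cmp_u32);
            snprintf(name, sizeof name, "run%04d.tmp", runs++);
            out = fopen(name, "wb");
            if (!out || fwrite(buf, sizeof *buf, n, out) != n) {
                free(buf);
                return -1;
            }
            fclose(out);
        }
        free(buf);
        return runs;                      /* number of sorted runs produced */
    }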
> Not that still teaching file techniques today would have any great > impact on the world because the majority of programmers now have no > formal CS/CIS education.
We've had this discussion before:  "teaching what industry wants"
instead of "teaching what industry WILL want".  It's the "Water Cooler
Institute of Technology" -- devoid of detail and all the caveats that
come with a more "formal" education.

I had an early employer who made no bones about this in his hiring
decisions:  "I hire from University X when I am looking for an
employee that I need TODAY; I'll hire from University Y when I'm
looking for someone that might NOT be productive, today, but will be a
better LONG TERM investment (the University *X* employee requiring
renewed education to remain viable -- almost as soon as hired!)"

It was an interesting insight to see how (some?) employers approach
their human resource needs...
>> When I got the M990 (much higher density, much deeper pockets!), >> I was even *less* inclined to 'experiment'. :< >> >> I learned to settle for simple streaming... > > As always, YMMV. > George >
Reply by George Neuner July 27, 2016
On Tue, 26 Jul 2016 22:07:13 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 7/26/2016 9:50 AM, George Neuner wrote: >> >> In actual fact, early tape based mainframes quickly developed block >> based tape "filesystems" that allowed data "files" on the tape to be >> discontiguous. To make it faster, multiple tapes were used as >> RAID-like stripe sets - reading/writing one while seeking on another, >> etc. A file only had to be contigous on tape when it entered or left >> the system. >> >> When disks were introduced, the "filesystem" concept followed. > >When I first adopted NetBSD (early 90's), they had no support for >9T tape (which is what I was using for "portable" tape exchange >as well as offline storage, at the time). Originally, an F880 >(800/1600bpi, 100ips) > >As few folks *had* 9T transports, it fell to me to write a driver to >support my controller. Adding the "block" device as well as the more >traditional "character" device. So, I could build "tape filesystems" >"just as easily" as disk-based! > >Conceptually, a piece of cake! The "strategy" routine took a bit >of work to add optimizations like "read reverse" but it was kinda cool! > >However, the first time I tried to "read" a file system mounted on >the device, the "violence" of the mechanism's motions was scary! >Made me wonder if I was tearing things up with all the short >back-and-forth motions (seeks) and aggressive acceleration and >braking in the transport! Sort of like watching a pen-plotter running >balls-out! > >And, that was with files laid down in contiguous tape blocks; >the idea of the transport being commanded to seek blocks >scattered around the medium (no buffering in the transport, >formatter or controller!) led me to believe I'd quickly >wear SOMETHING out in the transport -- even if only the head >or tension arm roller!
I don't doubt it.  But I think you might have hit upon a worst case
scenario for a single tape.

Mainframe tape drivers - really just an I/O program - performed
request reordering and elevator seeking to minimize thrashing of the
tapes (a sketch follows below).  Any particular tape in a big system
was likely to have many outstanding requests at any given time.

Discontiguous files were only ever used on scratch tapes that were
temporary storage.  Code images on "program" tapes always were
contiguous, as were any files on external "transport" tapes.  Input
entering the system or output leaving it always was contiguous (for
portability).

1960s-era mainframes had (by today's standards) severe memory
limitations: a machine with 256K words[*] of RAM was a *BIG* system
that would be juggling dozens of jobs.  The average program used less
than 12K of memory, and scripted execution of a "pipeline" of small
programs was very common.

[*] https://en.wikipedia.org/wiki/Word_(computer_architecture)

Quite a lot of routine data manipulation was done by writing/reading
temporary files, as was all passing of data between programs.  It
wasn't feasible to dedicate individual scratch tapes to every running
job, and often jobs needed multiple files simultaneously, which was
hideously slow and inefficient unless the files all were on separate
tapes.

And once multitasking systems became the norm, it was no longer
politically expedient for the mainframe operators to delay jobs of
"important" users until scratch tapes were available.

Thus the discontiguous tape filesystem was born.

On my shelf, I have a textbook of "external" - i.e. file based - data
processing algorithms.  It was written in 1988, an indication that
file based techniques were still being taught in some CS/CIS programs
then.

Modern "big data" processing has shown that file based techniques are
*still* relevant and that it was a mistake to stop teaching them:
programmers now often have no clue how to proceed when, for whatever
reason, their data can't fit into memory.
[Or worse: their "in-memory" processing is much slower than they
expect because their data is swapped out due to multitasking.]

Not that still teaching file techniques today would have any great
impact on the world, because the majority of programmers now have no
formal CS/CIS education.
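A rough idea of what that request reordering looks like -- an elevator
pass over a queue of block requests, sorted by tape position so the
transport sweeps in one direction instead of thrashing back and forth.
The request structure and queue here are invented for illustration;
real channel programs were considerably more involved.

    #include <stdlib.h>

    struct tape_req {
        unsigned block;                   /* position on the tape, in blocks */
        /* buffer, length, completion callback, etc. omitted */
    };

    static int by_block(const void *a, const void *b)
    {
        unsigned x = ((const struct tape_req *)a)->block;
        unsigned y = ((const struct tape_req *)b)->block;
        return (x > y) - (x < y);
    }

    /* One elevator sweep: service every queued request at or beyond the
       current head position in ascending order, then the stragglers
       behind it on the way back.  Minimizes direction reversals. */
    static void elevator_sweep(struct tape_req *q, size_t n, unsigned head_pos,
                               void (*service)(struct tape_req *))
    {
        size_t i;

        qsort(q, n, sizeof *q, by_block);
        for (i = 0; i < n; i++)           /* forward leg */
            if (q[i].block >= head_pos)
                service(&q[i]);
        while (i-- > 0)                   /* reverse leg, descending order */
            if (q[i].block < head_pos)
                service(&q[i]);
    }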
>When I got the M990 (much higher density, much deeper pockets!), >I was even *less* inclined to 'experiment'. :< > >I learned to settle for simple streaming...
As always, YMMV. George
Reply by Don Y July 27, 2016
On 7/26/2016 9:50 AM, George Neuner wrote:
> > Wrong person. DonY wrote the piece you responded to. > > > On Tue, 26 Jul 2016 15:59:32 +0300, upsidedown@downunder.com wrote: > >> On Tue, 26 Jul 2016 06:27:42 -0400, George Neuner >> <gneuner2@comcast.net> wrote: >> >>> On Mon, 25 Jul 2016 16:56:42 -0700, Don Y >>> <blockedofcourse@foo.invalid> wrote: >>> >>>> : >>>> For example, any network traffic that was active at the time the >>>> swap occurred still ended up with its endpoint on the current >>>> node. You didn't have to buffer any incoming messages intended >>>> for that "to be swapped" process and later forward them to the >>>> new destination when the process is "restored". >> >> Is this any different from the situation in the old days, when you had >> to swap out complete programs from core to make room for other >> programs ? If the task had some active I/O going on, it had some >> I/O-buffers (DMA-buffers) locked in memory. The situation was quite >> nasty, especially with slow I/O such as mag tape (possibly involving a >> tape rewind). > > If you go back to the very early days of unblocked streaming tapes, > swapping processes then was all but impossible. But in the 60's > mainframe era, most tapes used hard block formatting, and better > drives could quickly pause and reposition the tape. > > Most machines had dedicated I/O processors separate from the main > CPU(s). The OS tracked the activity of these processors and did not > swap processes while an I/O transfer was in progress. > > But since most I/O was block oriented, a large logical transfer could > be done as a series of smaller physical ones - enabling the logical > transfer to be halted and resumed. > > In actual fact, early tape based mainframes quickly developed block > based tape "filesystems" that allowed data "files" on the tape to be > discontiguous. To make it faster, multiple tapes were used as > RAID-like stripe sets - reading/writing one while seeking on another, > etc. A file only had to be contigous on tape when it entered or left > the system. > > When disks were introduced, the "filesystem" concept followed.
When I first adopted NetBSD (early 90's), they had no support for
9T tape (which is what I was using for "portable" tape exchange
as well as offline storage, at the time).  Originally, an F880
(800/1600bpi, 100ips).

As few folks *had* 9T transports, it fell to me to write a driver to
support my controller -- adding the "block" device as well as the more
traditional "character" device.  So, I could build "tape filesystems"
"just as easily" as disk-based!

Conceptually, a piece of cake!  The "strategy" routine took a bit
of work to add optimizations like "read reverse" (sketched below),
but it was kinda cool!

However, the first time I tried to "read" a file system mounted on
the device, the "violence" of the mechanism's motions was scary!
Made me wonder if I was tearing things up with all the short
back-and-forth motions (seeks) and aggressive acceleration and
braking in the transport!  Sort of like watching a pen-plotter running
balls-out!

And, that was with files laid down in contiguous tape blocks;
the idea of the transport being commanded to seek blocks
scattered around the medium (no buffering in the transport,
formatter or controller!) led me to believe I'd quickly
wear SOMETHING out in the transport -- even if only the head
or tension arm roller!

When I got the M990 (much higher density, much deeper pockets!),
I was even *less* inclined to 'experiment'.  :<

I learned to settle for simple streaming...
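A sketch of the kind of decision a "strategy" routine makes -- this is
not the NetBSD driver code, just an illustration of choosing "read
reverse" when the target block sits immediately behind the head.  The
command names and the position model are made up:

    /* Hypothetical transport commands -- not the real controller/formatter set */
    enum tape_cmd { CMD_READ_FWD, CMD_READ_REV, CMD_SPACE_FWD, CMD_SPACE_REV };

    struct xfer {
        enum tape_cmd cmd;
        unsigned count;                   /* blocks to space over, if any */
    };

    /* Decide how to reach 'want' given the head sits just before 'head'.
       Reading in reverse avoids a back-space followed by a forward read
       when the wanted block is the one immediately behind the head. */
    static struct xfer plan_read(unsigned head, unsigned want)
    {
        struct xfer x;

        if (want == head) {                       /* next block: just read it */
            x.cmd = CMD_READ_FWD;  x.count = 0;
        } else if (want + 1 == head) {            /* just behind: read backwards */
            x.cmd = CMD_READ_REV;  x.count = 0;
        } else if (want > head) {                 /* ahead: space forward, then read */
            x.cmd = CMD_SPACE_FWD; x.count = want - head;
        } else {                                  /* well behind: space back, then read */
            x.cmd = CMD_SPACE_REV; x.count = head - want;
        }
        return x;
    }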
Reply by George Neuner July 26, 2016
Wrong person.  DonY wrote the piece you responded to.


On Tue, 26 Jul 2016 15:59:32 +0300, upsidedown@downunder.com wrote:

>On Tue, 26 Jul 2016 06:27:42 -0400, George Neuner ><gneuner2@comcast.net> wrote: > >>On Mon, 25 Jul 2016 16:56:42 -0700, Don Y >><blockedofcourse@foo.invalid> wrote: >> >>> : >>>For example, any network traffic that was active at the time the >>>swap occurred still ended up with its endpoint on the current >>>node. You didn't have to buffer any incoming messages intended >>>for that "to be swapped" process and later forward them to the >>>new destination when the process is "restored". > >Is this any different from the situation in the old days, when you had >to swap out complete programs from core to make room for other >programs ? If the task had some active I/O going on, it had some >I/O-buffers (DMA-buffers) locked in memory. The situation was quite >nasty, especially with slow I/O such as mag tape (possibly involving a >tape rewind).
If you go back to the very early days of unblocked streaming tapes,
swapping processes then was all but impossible.  But in the 60's
mainframe era, most tapes used hard block formatting, and better
drives could quickly pause and reposition the tape.

Most machines had dedicated I/O processors separate from the main
CPU(s).  The OS tracked the activity of these processors and did not
swap processes while an I/O transfer was in progress.

But since most I/O was block oriented, a large logical transfer could
be done as a series of smaller physical ones - enabling the logical
transfer to be halted and resumed (see the sketch below).

In actual fact, early tape based mainframes quickly developed block
based tape "filesystems" that allowed data "files" on the tape to be
discontiguous.  To make it faster, multiple tapes were used as
RAID-like stripe sets - reading/writing one while seeking on another,
etc.  A file only had to be contiguous on tape when it entered or left
the system.

When disks were introduced, the "filesystem" concept followed.
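In code, breaking one logical transfer into swappable pieces is
roughly this -- a loop over physical blocks with a scheduling point
between them.  The transfer and scheduler hooks are stand-ins, not any
particular OS's API:

    #include <stddef.h>

    /* Stand-in hooks -- assumed to be supplied by a (hypothetical) kernel */
    extern int  do_physical_block(void *dst, size_t blkno, size_t blksize);
    extern void maybe_swap_or_reschedule(void);   /* safe point between blocks */

    /* One logical read of 'nblocks' blocks, done as nblocks physical reads.
       The process may be swapped out at any of the gaps between blocks;
       only the single block in flight pins its buffer in memory. */
    static int logical_read(void *dst, size_t first, size_t nblocks, size_t blksize)
    {
        char *p = dst;

        for (size_t i = 0; i < nblocks; i++) {
            if (do_physical_block(p + i * blksize, first + i, blksize) < 0)
                return -1;
            maybe_swap_or_reschedule();           /* halt/resume point */
        }
        return 0;
    }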
>There are several alternatives: > >* lock the whole program in memory until I/O is complete (nasty)
Which is what was done, but distinguishing physical I/O from logical.
>* just lock the I/O buffers (possibly part of a small I/O program) and >swap that out too, when I/O is completed
Too complicated.

The code/data/bss/stack segmentation of Unix programs mirrors how
early mainframes (and minis) actually worked: pure segmentation with
base:offset addressing [the segment bases being known only to the
operating system; sketched below].

I/O processors ran a single program - effectively the device driver -
which was never swapped.  Any local buffering the program may have
needed likewise would be in the processor's partitioned space.

But many I/O processors simply accessed the program's data space
buffer directly.  The buffer could not be swapped out *while* the I/O
processor was accessing it.  But again, the physical I/O could be done
incrementally with opportunity to swap between transfers.
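For reference, base:offset segmentation amounts to nothing more than
this check-and-add, which is why whole-segment swapping was so simple:
the OS only has to rewrite one base per segment when the segment
moves.  A schematic sketch, not any particular machine's MMU:

    #include <stdint.h>
    #include <stdbool.h>

    struct segment {
        uint32_t base;                    /* physical address; known only to the OS */
        uint32_t limit;                   /* segment length in bytes */
    };

    /* Translate a program-visible offset to a physical address.
       Swapping a segment back in at a different spot just changes 'base';
       nothing inside the program's address space has to be fixed up. */
    static bool translate(const struct segment *seg, uint32_t offset,
                          uint32_t *phys)
    {
        if (offset >= seg->limit)
            return false;                 /* segmentation violation */
        *phys = seg->base + offset;
        return true;
    }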
>* abort the I/O and retry again when the program is swapped back into >memory. Possible for read operations from mass storage
Sometimes had to be done with unblocked streaming tapes. Once block devices became the norm, aborting I/O for swapping never had to be done.
>The last alternative is useful also with network traffic, provided >that the sender buffers the transmitted data until it is acknowledged >by the receiver. > >With modern multicore/processors with virtual memory this should be >trivial as long as the processors share the same physical memory bus. > >In a system with physically separate platforms, some network helper >programs are needed to transfer data between two buffers in different >platforms with available transfer systems, such as Ethernet.
YMMV, George
Reply by Dimiter_Popoff July 26, 2016
On 26.7.2016 г. 15:59, upsidedown@downunder.com wrote:
> On Tue, 26 Jul 2016 06:27:42 -0400, George Neuner > <gneuner2@comcast.net> wrote: > >> On Mon, 25 Jul 2016 16:56:42 -0700, Don Y >> <blockedofcourse@foo.invalid> wrote: >> >>> On 7/25/2016 2:32 PM, George Neuner wrote: >>> >>>>> [Note [process migration] is more involved than just packing up >>>>> registers plus address space!] >>>> >>>> It isn't THAT hard: clustered mainframes in the 1960's had the ability >>>> to migrate processes ... swap out here, swap in there. All it really >>>> requires is virtual addressing capability and a way to transport the >>>> code and runtime data. >>> >>> Yes. But they already had the "extra bits" (of state) that were >>> resident IN the OS's data structures. They either packed those >>> up with the "(formal) process state" as it was swapped out >>> *or* kept it in the kernel associated with the swapped out >>> process. >> >> Yes. >> >>> For example, any network traffic that was active at the time the >>> swap occurred still ended up with its endpoint on the current >>> node. You didn't have to buffer any incoming messages intended >>> for that "to be swapped" process and later forward them to the >>> new destination when the process is "restored". > > Is this any different from the situation in the old days, when you had > to swap out complete programs from core to make room for other > programs ? If the task had some active I/O going on, it had some > I/O-buffers (DMA-buffers) locked in memory. The situation was quite > nasty, especially with slow I/O such as mag tape (possibly involving a > tape rewind). > > There are several alternatives: > > * lock the whole program in memory until I/O is complete (nasty) > > * just lock the I/O buffers (possibly part of a small I/O program) and > swap that out too, when I/O is completed > > * abort the I/O and retry again when the program is swapped back into > memory. Possible for read operations from mass storage > > The last alternative is useful also with network traffic, provided > that the sender buffers the transmitted data until it is acknowledged > by the receiver. > > With modern multicore/processors with virtual memory this should be > trivial as long as the processors share the same physical memory bus.
Yes indeed.  Then you grab your phone and discover things can be
really messed up in this context even with *gigabytes* of RAM....

Sorry for the rant, I know this is obvious to all of us here.

Dimiter
Reply by Don Y July 26, 2016
On 7/26/2016 5:59 AM, upsidedown@downunder.com wrote:
> On Tue, 26 Jul 2016 06:27:42 -0400, George Neuner > <gneuner2@comcast.net> wrote: > >> On Mon, 25 Jul 2016 16:56:42 -0700, Don Y >> <blockedofcourse@foo.invalid> wrote: >> >>> On 7/25/2016 2:32 PM, George Neuner wrote: >>> >>>>> [Note [process migration] is more involved than just packing up >>>>> registers plus address space!] >>>> >>>> It isn't THAT hard: clustered mainframes in the 1960's had the ability >>>> to migrate processes ... swap out here, swap in there. All it really >>>> requires is virtual addressing capability and a way to transport the >>>> code and runtime data. >>> >>> Yes. But they already had the "extra bits" (of state) that were >>> resident IN the OS's data structures. They either packed those >>> up with the "(formal) process state" as it was swapped out >>> *or* kept it in the kernel associated with the swapped out >>> process. >> >> Yes. >> >>> For example, any network traffic that was active at the time the >>> swap occurred still ended up with its endpoint on the current >>> node. You didn't have to buffer any incoming messages intended >>> for that "to be swapped" process and later forward them to the >>> new destination when the process is "restored". > > Is this any different from the situation in the old days, when you had > to swap out complete programs from core to make room for other > programs ? If the task had some active I/O going on, it had some > I/O-buffers (DMA-buffers) locked in memory. The situation was quite > nasty, especially with slow I/O such as mag tape (possibly involving a > tape rewind).
The I/O was associated with a specific task.  You could simply defer
swapping out its results until the I/O had completed (e.g., at the
next IRG).

This is no different than deferring the context switch of a
"coprocessor" (most typically FPU) until a convenient point AFTER the
body of the task had undergone its context switch (see the sketch
below):  you know who the FPU's state belongs to, you just have to
"remember" that when it comes time to attempt to USE the FPU after the
"primary" context switch.

Instead, imagine if the I/O started by a task in a multitasking system
*stopped* when that task wasn't "running" (i.e., had control of the
processor).  Consider how you'd address that sort of implementation.
I.e., the swapped out task is no longer able to provide SERVICES that
other tasks are counting upon for their continued operation.  How
useful would "multitasking" be in that scenario?
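The deferred-coprocessor trick looks something like this:  mark the
FPU unavailable at every context switch, and only swap FPU state in
the trap that fires when the *new* task actually touches it.  The
hardware hooks here are placeholders; the real mechanism is, e.g.,
CR0.TS and the device-not-available trap on x86.

    /* Placeholder hardware hooks -- on x86 these would be CR0.TS / #NM */
    extern void fpu_disable(void);        /* make the next FPU instruction trap */
    extern void fpu_enable(void);
    extern void fpu_save(void *state);
    extern void fpu_restore(const void *state);

    struct task { void *fpu_state; /* ... */ };

    static struct task *fpu_owner;        /* whose state is live in the FPU */

    /* Called on every "primary" context switch: don't touch the FPU at all,
       just arrange for a trap if the incoming task uses it. */
    void context_switch_hook(void)
    {
        fpu_disable();
    }

    /* Trap handler: first FPU use after a switch.  Only now do we pay for
       saving the old owner's state and loading the current task's. */
    void fpu_unavailable_trap(struct task *current)
    {
        if (fpu_owner != current) {
            if (fpu_owner)
                fpu_save(fpu_owner->fpu_state);
            fpu_restore(current->fpu_state);
            fpu_owner = current;
        }
        fpu_enable();                     /* let the faulting instruction retry */
    }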
> There are several alternatives: > > * lock the whole program in memory until I/O is complete (nasty) > > * just lock the I/O buffers (possibly part of a small I/O program) and > swap that out too, when I/O is completed > > * abort the I/O and retry again when the program is swapped back into > memory. Possible for read operations from mass storage > > The last alternative is useful also with network traffic, provided > that the sender buffers the transmitted data until it is acknowledged > by the receiver.
You're thinking about network traffic that *can* be stopped/paused.

Imagine "suddenly" (from the standpoint of other applications) saying
that "printf() will not be available" (while the server that
implements the printf() functionality is being relocated (or swapped
out)).  Do all of the programs that have printf()'s have to learn how
to deal with that situation?  ("OK, I'll print the results, later..."
when?)  Or, does printf()'s unavailability automatically cause those
other dependent tasks to block (indefinitely)?
> With modern multicore/processors with virtual memory this should be > trivial as long as the processors share the same physical memory bus.
Moving the contents of memory is trivial:  one system call, in my
case (object that references the task's memory space; object that
references the destination node).

The problem is all the other cruft that has to be gathered up
(extricated from the OS) to go along with the "task" while it is in
transition:

- "Hello, Mr Task.  Here are the results of that last RPC that you
  issued..."
- "Hello, Mr. Task.  Could you please perform this service for me?"
- "Hey, Mr task!  Where are you going??  You still haven't given me
  the results of that last service that I requested!!"
- "Hey, Mr Task, you're holding some locks that I need!  Please don't
  tell me you're going to continue holding them while you're being
  swapped out?  That's just plain RUDE!"
- "Um, while you're 'away', can I use the resources that you've
  RESERVED?"

etc.
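To make the "cruft" concrete, this is roughly the shape of what has to
travel with (or be re-homed for) a migrating task, over and above its
memory image.  All of the names are invented for illustration and
don't correspond to any actual kernel interface:

    #include <stddef.h>

    struct pending_rpc  { int xid;     /* reply still owed TO the task       */ };
    struct inbound_req  { int client;  /* request the task hasn't served yet */ };
    struct held_lock    { int lock_id; /* lock other tasks are waiting on    */ };
    struct reservation  { int res_id;  /* resources set aside for the task   */ };

    /* The migration record: the memory image is one system call, but all
       of this OS-resident state must either move with the task or be
       proxied back to the original node until the task is re-established. */
    struct migration_state {
        struct pending_rpc *rpcs_outstanding;  size_t n_rpcs;
        struct inbound_req *requests_queued;   size_t n_reqs;
        struct held_lock   *locks_held;        size_t n_locks;
        struct reservation *reservations;      size_t n_res;
    };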
> In a system with physically separate platforms, some network helper > programs are needed to transfer data between two buffers in different > platforms with available transfer systems, such as Ethernet.
Reply by Don Y July 26, 2016
Hi George,

[Early morning meeting -- WTF?  Had *hoped* I'd get a nap in beforehand
but obviously got caught up in "stuff"...  <frown>]

 >>>> ... Previous systems just were "processor farms" -- typically
>>>> all "powered" just waiting for "workloads". The idea of *bringing* >>>> another node on-line, ON-DEMAND to address increasing needs wasn't >>>> part of their scope (why should it be? Unless you're concerned >>>> with power consumption!). Nor was there a concern over taking >>>> nodes OFF-line when they weren't technically needed. >>> >>> The current crop of tera-scale computers consume megawatts, and the >>> largest peta-scale computers consume 10s of megawatts when all their >>> CPUs and attached IO devices are active. They have extensive power >>> control systems to manage the partitioning of active/inactive devices. >> >> But their goal is to *use* all of that compute power, not let it >> idle. They tend to be more homogeneous environments with more >> "level" I/O usage. It's not like turning on CCTV cameras "because >> it's getting dark outside" and, as a result, *needing* that extra compute >> power to do video processing. > > Supercomputing centers all are batch oriented just like mainframes > used to be. The difference is they shut off CPUs that aren't in use - > if any - to lower the power bills. > > It's true that a lot of older machines have plenty of work to keep > them runnning ... but in the last 10-15 years, many newer ones have > had odd architectures that make writing software for them difficult > and time consuming. It is true that a lot of them use Intel or ARM > processors, but it isn't true that they all run Linux and can be > programmed using GCC/OpenMP. Some of the world's most powerful > systems sit idle much of the time, simply for lack of software. > > And a lot of the software itself is surprisingly flexible. In most > SCC environments there is no multi-tasking: a set of CPUs is dedicated > to a program for its duration. But external factors may cause a > program to be halted before finishing. The stopped program may be > restarted later with a different number of CPUs according to the mix > of programs that are running at that time.
Different than being "paused" while "relocated" -- yet expecting the
rest of the "system" to accommodate their time "in transit".

Note that, conceptually, my code could run on a single CPU (with
enough resources) -- it isn't *inherently* parallel.  I think this is
an essential aspect for development and future maintenance:  you only
need to expect the *illusion* of parallelism.

[The other aspects we've been discussing are all hidden from the
application developer in much the same way as GC.  I.e., if you think
about things, you realize "it's happening" but your code never really
understands why/where]
> Programs which are expected to need many CPUs (or vast amounts of > memory which very often is tied to the number of CPUs), or which are > expected to run for more than a few minutes - such programs often are > written to checkpoint intermediate processing, to be restartable from > saved checkpoints, and to adapt dynamically to the number of CPUs they > are given when (re)started. > >>> And nobody has yet come up with a really good way to exploit massively >>> parallel hardware for general application programming. But that's a >>> different discussion. >> >> I don't believe we'll see any "effective" algorithms -- largely because >> there aren't many "available" installations. So, you have groups >> with very specific sorts of application sets trying to tackle the >> problem for *their* needs; not for "general" needs. > > The point is that "massively parallel" is becoming the norm. There > are commodity server chips now with 16 and 32 cores, and this year's > high end server chip will be in a laptop in 5 years.
I consider SMP a different beast than distributed systems. E.g., my blade server has ~60 (?) cores in the same box -- but it's really 14 separate "machines" that really are only suited to working in a "server farm" sort of environment; tied to applications that are simply replicated with a split workload (instead of a single application that has been "diced up" to run on the many processors).
> [ASUS now will happily sell you a water-cooled !! laptop with an > overclocked 6th-gen i7 paired with 2 Nvidia 1080 GPUs. If you > disconnect the water line it melts ... or maybe just slows down 50%. > But seriously, what do you expect for only $7K?] > > In any event, more cores are fine for running more programs
Exactly. Which is how I've exploited "CPUs" (and, now, cores). Conveniently sidestepping the issue of "how do you develop for this environment" by treating it as many "programs" instead of a single program that magically diced itself to pieces.
> simultaneously, but there's no good *general* way to leverage more > cores to make a single program run faster. The ways that are known to > be automatically exploitable (by a compiler) are largely limited to > parallelizing loops. Parallelizing non looping code [which is most of > most programs that need it] invariably relies on the programmer to > recognize possible parallelism and write special code to take > advantage of it. > > Why should anyone care? Good question. I don't have a good answer, > but a couple of data points: > > Lots of experience has shown that the average programmer can't write > correct parallel code [or even just correct serial code, but that's
I think that's inherent in the way that most people *think* of algorithms: "this, then that, followed by yet another thing". Being a hardware person, I implicitly see walking and chewing gum as "obvious"... why should this bit of hardware *pause* waiting for this other bit to finish ("either-or")?
> another discussion]. Automating parallelism - via compilers or smart > runtime systems - is the only way any significant percentage of > programs will be able to benefit.
I don't see big gains, there.  E.g., I probably represent an "above
average compute load" (in terms of what and how much I use machines
for during development), yet am reasonably sure most of the apps that
I run would not SIGNIFICANTLY benefit from even a *second* core (well,
maybe autorouting a PCB while simultaneously doing something else, but
that's the exception, not the rule).

Note my previous comment re: how I've decided to use multicore CPUs in
the HA system:  by assigning specific jobs to specific cores, KNOWING
what those loads represent in my system *and* thereby being able to
sidestep the core-scheduling issue.

Will this partitioning be the most efficient use of the cores?
Probably not.  But, they're inexpensive and this *will* make it easier
to ignore some of the costs that my design "naturally" incurs.
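On a COTS OS the "assign this job to that core" decision is a
one-liner; under Linux/glibc it looks like the following (the code
around the call is boilerplate for illustration, and the choice of
core 1 is arbitrary):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread to one core.  Whether this is the most
       efficient use of the cores is another question -- but it makes the
       per-core load completely predictable. */
    static int pin_to_core(int core)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    }

    int main(void)
    {
        if (pin_to_core(1) != 0)          /* e.g., dedicate core 1 to this job */
            fprintf(stderr, "could not pin to core 1\n");
        /* ... run the dedicated workload here ... */
        return 0;
    }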
> Surveys have shown that for many people, the only "computer" they own > or routinely use is their smartphone. The average person quite soon > will reasonably expect their phone to able to do anything: word > processing, spreadsheets, audio/video editing and presentation, i.e. > general business computing, and (for those few taking college STEM > courses) solving differential equations, performing circuit > simulations, virtual reality walkthroughs of pyramids, galaxies, > cadavers, etc. ... > ... while snapchatting, tweeting, facebooking, pinteresting, and still > providing 40 hours of use on a battery charge.
I disagree with the "while" assumption (as in true concurrency).  I
still think people think about "work" serially.  As long as you
*appear* to make some progress on "task A" while they are busy with
"task B", they're unable to understand how *good* your effort happened
to be.

E.g., if I'm updating a schematic *while* autorouting (another)
design, I only care that the autorouter made *some* progress while I
had the machine "tied up" with my schematic entry activities.  It's
not like I review its progress and think:  "Gee, it should be farther
along than it is...".  OTOH, if it had (obviously) *paused* "while I
was looking elsewhere", I'd be annoyed.

With this in mind, I think it makes it easier to develop application
sets that can achieve "satisfactory" performance without aiming for
"ideal" performance.  The challenging (portions of) applications being
those things that the user *expects* to run in "real time" (e.g., the
"answering machine" application on your phone can't suddenly start
speaking the OGM in slow motion just because your attention is focused
on something else!).

E.g., I can do "commercial detection" in recorded video "off-line"...
as long as I can get it DONE before the user wants to view that video!
(which, of course, can't be *while* it is streaming cuz it consumes
more real time than it *occupies*).  OTOH, I have to do motion
detection (CCTV) in real time cuz that affects the latency of the
actions triggered by that motion.
> Unless someone comes up with a way to pack kWh into a AA sized package > [that can be recharged in under 30 minutes], making better use of > large numbers of low powered cores is going to be the only way > forward.
Or, convince people that they don't need to "take it with them"!
E.g., my UIs have far lower power requirements than the "code" to run
them implies.  (OTOH, they're featherweight devices so the battery
issue still applies).  I don't think people care *where* the
processing is performed; they just want to ACCESS it "locally"
(wherever "locally" may be!)

I look at how my HA system (and its successor) have evolved and that's
the single-most striking "optimization" in it all!  Decouple the UI
from the application.  And, make the UI ubiquitous at the same time
--> simpler, more desirable system!

[Of course, for most people, that means reliance on someone else to
provide that "service"... still baffles me to see what people pay for
that "convenience" re: cell phones!]

(sigh)  Shower, then gather up my prototypes and get my *ss out of
here.  THEN, perhaps, the nap I've been awaiting?
Reply by upsidedown@downunder.com July 26, 2016
On Tue, 26 Jul 2016 06:27:42 -0400, George Neuner
<gneuner2@comcast.net> wrote:

>On Mon, 25 Jul 2016 16:56:42 -0700, Don Y ><blockedofcourse@foo.invalid> wrote: > >>On 7/25/2016 2:32 PM, George Neuner wrote: >> >>>> [Note [process migration] is more involved than just packing up >>>> registers plus address space!] >>> >>> It isn't THAT hard: clustered mainframes in the 1960's had the ability >>> to migrate processes ... swap out here, swap in there. All it really >>> requires is virtual addressing capability and a way to transport the >>> code and runtime data. >> >>Yes. But they already had the "extra bits" (of state) that were >>resident IN the OS's data structures. They either packed those >>up with the "(formal) process state" as it was swapped out >>*or* kept it in the kernel associated with the swapped out >>process. > >Yes. > >>For example, any network traffic that was active at the time the >>swap occurred still ended up with its endpoint on the current >>node. You didn't have to buffer any incoming messages intended >>for that "to be swapped" process and later forward them to the >>new destination when the process is "restored".
Is this any different from the situation in the old days, when you had
to swap out complete programs from core to make room for other
programs?  If the task had some active I/O going on, it had some
I/O-buffers (DMA-buffers) locked in memory.  The situation was quite
nasty, especially with slow I/O such as mag tape (possibly involving a
tape rewind).

There are several alternatives:

* lock the whole program in memory until I/O is complete (nasty)

* just lock the I/O buffers (possibly part of a small I/O program) and
swap that out too, when I/O is completed

* abort the I/O and retry again when the program is swapped back into
memory.  Possible for read operations from mass storage

The last alternative is useful also with network traffic, provided
that the sender buffers the transmitted data until it is acknowledged
by the receiver.

With modern multicore/processors with virtual memory this should be
trivial as long as the processors share the same physical memory bus.

In a system with physically separate platforms, some network helper
programs are needed to transfer data between two buffers in different
platforms with available transfer systems, such as Ethernet.
Reply by George Neuner July 26, 2016
On Mon, 25 Jul 2016 16:56:42 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 7/25/2016 2:32 PM, George Neuner wrote: > >>> [Note [process migration] is more involved than just packing up >>> registers plus address space!] >> >> It isn't THAT hard: clustered mainframes in the 1960's had the ability >> to migrate processes ... swap out here, swap in there. All it really >> requires is virtual addressing capability and a way to transport the >> code and runtime data. > >Yes. But they already had the "extra bits" (of state) that were >resident IN the OS's data structures. They either packed those >up with the "(formal) process state" as it was swapped out >*or* kept it in the kernel associated with the swapped out >process.
Yes.
>For example, any network traffic that was active at the time the >swap occurred still ended up with its endpoint on the current >node. You didn't have to buffer any incoming messages intended >for that "to be swapped" process and later forward them to the >new destination when the process is "restored".
Yes ... but circuit switching was old hat in the 1940's and packet
switching evolved in the late 1950's [arguably it was first used in a
real system in 1961].

Through the 1960's few installations had more than modems and
terminals to worry about - easily automated using controlled switching
of either type.  There were few packet networks (as we know them),
most experimental, and no communication standards:  there were as many
different protocols as there were networks.

Almost all of the hardware and the software protocols that people
typically think of as being associated with "early" networking -
Aloha, PUP, X.25, ARCnet, Ethernet, etc. - really all date from the
1970's.

But your point is taken.  <grin>
>>> ... Previous systems just were "processor farms" -- typically >>> all "powered" just waiting for "workloads". The idea of *bringing* >>> another node on-line, ON-DEMAND to address increasing needs wasn't >>> part of their scope (why should it be? Unless you're concerned >>> with power consumption!). Nor was there a concern over taking >>> nodes OFF-line when they weren't technically needed. >> >> The current crop of tera-scale computers consume megawatts, and the >> largest peta-scale computers consume 10s of megawatts when all their >> CPUs and attached IO devices are active. They have extensive power >> control systems to manage the partitioning of active/inactive devices. > >But their goal is to *use* all of that compute power, not let it >idle. They tend to be more homogeneous environments with more >"level" I/O usage. It's not like turning on CCTV cameras "because >it's getting dark outside" and, as a result, *needing* that extra compute >power to do video processing.
Supercomputing centers all are batch oriented, just like mainframes
used to be.  The difference is they shut off CPUs that aren't in use -
if any - to lower the power bills.

It's true that a lot of older machines have plenty of work to keep
them running ... but in the last 10-15 years, many newer ones have had
odd architectures that make writing software for them difficult and
time consuming.  It is true that a lot of them use Intel or ARM
processors, but it isn't true that they all run Linux and can be
programmed using GCC/OpenMP.  Some of the world's most powerful
systems sit idle much of the time, simply for lack of software.

And a lot of the software itself is surprisingly flexible.  In most
SCC environments there is no multi-tasking:  a set of CPUs is
dedicated to a program for its duration.  But external factors may
cause a program to be halted before finishing.  The stopped program
may be restarted later with a different number of CPUs according to
the mix of programs that are running at that time.

Programs which are expected to need many CPUs (or vast amounts of
memory, which very often is tied to the number of CPUs), or which are
expected to run for more than a few minutes - such programs often are
written to checkpoint intermediate processing, to be restartable from
saved checkpoints, and to adapt dynamically to the number of CPUs they
are given when (re)started.
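The checkpoint/restart pattern reduces to something like this --
periodically dump enough state to resume, and on startup look for that
dump before beginning from scratch.  The file name, interval, and the
single loop counter standing in for "state" are all placeholders:

    #include <stdio.h>

    #define CKPT_FILE  "job.ckpt"         /* placeholder name */
    #define CKPT_EVERY 1000               /* iterations between checkpoints */

    /* Resume from the last checkpoint if one exists, otherwise start at 0. */
    static long restore(void)
    {
        long i = 0;
        FILE *f = fopen(CKPT_FILE, "rb");
        if (f) {
            if (fread(&i, sizeof i, 1, f) != 1)
                i = 0;
            fclose(f);
        }
        return i;
    }

    static void checkpoint(long i)
    {
        FILE *f = fopen(CKPT_FILE, "wb");
        if (f) {
            fwrite(&i, sizeof i, 1, f);
            fclose(f);                    /* a real job would fsync and rename */
        }
    }

    int main(void)
    {
        for (long i = restore(); i < 1000000; i++) {
            /* ... one unit of work; CPU count can differ on each restart ... */
            if (i % CKPT_EVERY == 0)
                checkpoint(i);
        }
        return 0;
    }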
>> And nobody has yet come up with a really good way to exploit massively >> parallel hardware for general application programming. But that's a >> different discussion. > >I don't believe we'll see any "effective" algorithms -- largely because >there aren't many "available" installations. So, you have groups >with very specific sorts of application sets trying to tackle the >problem for *their* needs; not for "general" needs.
The point is that "massively parallel" is becoming the norm.  There
are commodity server chips now with 16 and 32 cores, and this year's
high end server chip will be in a laptop in 5 years.

[ASUS now will happily sell you a water-cooled !! laptop with an
overclocked 6th-gen i7 paired with 2 Nvidia 1080 GPUs.  If you
disconnect the water line it melts ... or maybe just slows down 50%.
But seriously, what do you expect for only $7K?]

In any event, more cores are fine for running more programs
simultaneously, but there's no good *general* way to leverage more
cores to make a single program run faster.  The ways that are known to
be automatically exploitable (by a compiler) are largely limited to
parallelizing loops (see the sketch at the end of this post).
Parallelizing non-looping code [which is most of most programs that
need it] invariably relies on the programmer to recognize possible
parallelism and write special code to take advantage of it.

Why should anyone care?  Good question.  I don't have a good answer,
but a couple of data points:

Lots of experience has shown that the average programmer can't write
correct parallel code [or even just correct serial code, but that's
another discussion].  Automating parallelism - via compilers or smart
runtime systems - is the only way any significant percentage of
programs will be able to benefit.

Surveys have shown that for many people, the only "computer" they own
or routinely use is their smartphone.  The average person quite soon
will reasonably expect their phone to be able to do anything:  word
processing, spreadsheets, audio/video editing and presentation, i.e.
general business computing, and (for those few taking college STEM
courses) solving differential equations, performing circuit
simulations, virtual reality walkthroughs of pyramids, galaxies,
cadavers, etc. ...

... while snapchatting, tweeting, facebooking, pinteresting, and still
providing 40 hours of use on a battery charge.

Unless someone comes up with a way to pack kWh into a AA sized package
[that can be recharged in under 30 minutes], making better use of
large numbers of low powered cores is going to be the only way
forward.

As always, YMMV.
George
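The "parallelizing loops" case mentioned above is essentially what
OpenMP packages up:  one annotation, and the runtime farms the
iterations out over however many cores it is given.  Build with, e.g.,
gcc -fopenmp; everything else about the program stays serial.  (The
harmonic-sum workload is just a stand-in.)

    #include <stdio.h>
    #include <omp.h>

    /* One annotated loop; the iterations are split across the available
       cores and the partial sums combined.  Nothing outside the loop
       needs to know how many threads ran. */
    int main(void)
    {
        enum { N = 100000000 };
        double sum = 0.0;

        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < N; i++)
            sum += 1.0 / (double)(i + 1);

        printf("harmonic(%d) = %f using up to %d threads\n",
               N, sum, omp_get_max_threads());
        return 0;
    }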
Reply by George Neuner July 25, 2016
On Mon, 25 Jul 2016 02:40:14 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>Hi George, > >[SWMBO grumbling cuz her cookie jar has been empty for many days now. >Guess I'd best plan on baking, tonight, lest I have to deal with >The Frowny Face (sigh). Or, stop *improving* the Rx so she has >an incentive to buy store-bought, instead!! ;-) ]
Leave it empty for a while longer ... maybe she'll try making them herself. 8-)
>On 7/17/2016 9:08 PM, George Neuner wrote: > >> Don is talking about both multicores and physically separate >> processors connected by network. As he has described it elsewhere - >> his system is able to migrate a running task to any suitable platform >> anywhere within the network. > >Actually, I only recently decided to support multicore processors; >mainly, they're too cheap *not* to! <frown> > >But, I'm taking the "easy" way out, there -- assigning specific >subsystems to specific cores (e.g., like moving protected >"capabilities" around; something that every node needs to be able >to do!) > >The other "migration" issues (bad choice of terms) have different >application domains.
Just about all "process" related terminology has been so heavily
overloaded that it's hard to have a group discussion:  unless you
rigorously define absolutely every term, everyone will have a [more or
less] different understanding based upon their own experience.

There was a time [not so long ago] when "thread" referred to flow of
control rather than a "scheduling entity", a single CPU was
"multi-programmed" to [appear to] do many things simultaneously,
"multi-programming" was distinct from "multi-processing", and
"multi-tasking" was a layman's term that could mean almost anything.

Ah, the good ol' days.
>E.g., migrating the "client thread" INTO the "server" (i.e., letting >it execute *in* the serving thread AS IF it still had the identity >of the client thread) applies when you do a *local* IPC as well as when >you do a *remote* RPC, obviously. I.e., even on a uniprocessor! >(amusing, then, that it seems to not be supported in COTS/FOSS OS >offerings! Oooops!) > >IIRC, Alpha called these "distributed threads" or, less humbly, "alpha >threads" as a nod to the fact that the thread *conceptually* wanders >around the "system" (regardless of: multitasking on a uniprocessor, >multiprocessing via SMP or in a physically distributed system).
Yes.
>The other sort of "migration" (relocation?) applies to "physically" >moving the process to a different node elsewhere in the network. >In this case, obviously only pertinent to a loosely coupled multiprocessor >system (NORMA).
Not familiar with NORMA. WRT OS literature, "migration" is the usual term for execution moving to a different CPU (or nowadays to a different core).
>I.e., once "migrated", the original host can get struck by >lightning to no effect (wrt the task/process in question)
Maybe. The "original host" might be another core on the same die.
>[Note this is more involved than just packing up registers plus >address space!]
It isn't THAT hard:  clustered mainframes in the 1960's had the
ability to migrate processes ... swap out here, swap in there.  All it
really requires is virtual addressing capability and a way to
transport the code and runtime data.

[The page oriented addressing in modern OSes is space efficient, but
it actually complicates things versus simple base:offset segmentation
addressing.]

I know you're referring to the issues of (de)serializing runtime data
structures for shipment over network ... I'm just pointing out that
migration can be (and was!) accomplished relatively simply using
shared storage.
>> There have been some other operating systems capable of doing this. >> AIUI, the interesting thing about Don's system is that he is >> attempting to do real-time, real-world control ... not simply to >> distribute processing over a bunch of networked "compute servers". > >Exactly. Previous systems just were "processor farms" -- typically >all "powered" just waiting for "workloads". The idea of *bringing* >another node on-line, ON-DEMAND to address increasing needs wasn't >part of their scope (why should it be? Unless you're concerned >with power consumption!). Nor was there a concern over taking >nodes OFF-line when they weren't technically needed.
The current crop of tera-scale computers consume megawatts, and the
largest peta-scale computers consume 10s of megawatts when all their
CPUs and attached IO devices are active.  They have extensive power
control systems to manage the partitioning of active/inactive devices.

Even modern desktop CPUs have reached the point of needing to turn
on/off functional units on demand as the instruction stream dictates.
They need to do it because they are unable to dissipate heat
effectively enough and will self destruct if too many circuits are
powered simultaneously.  Your average quad-core processor now lives in
a perpetual state of "rolling blackout" with at most about 1/3 of its
circuitry powered up at any given instant.  Many circuits are turned
on/off cycle by cycle.

[Server chips, on the whole, are no better - but having more circuitry
to work with means their powered up "1/3" can do more.  Unless you're
nitrogen cooling your system, you really aren't able to use much of
what you theoretically paid for.]

And nobody has yet come up with a really good way to exploit massively
parallel hardware for general application programming.  But that's a
different discussion.
>Finally, AFAICT, trying to meet timeliness constraints in such >a malleable system was an extra degree of freedom never addressed. >(and taking that into consideration AS you made these other >dispatching/scheduling decisions)
Yes.
>Cluster, Grid, NoW, Cloud, server farm, etc. -- none really address >the concept accurately (for different reasons). > >IMO, designs will increasingly be moving in this direction as >systems become too complex to be addressed reliably on single >processors (even with multiple cores, memory becomes the bottleneck) >as well as physically more dispersed applications. How do you >factor in the activities of another "service" when *YOU* have to >provide some sort of timeliness guarantees? > >[Should be relatively easy to see applications where a single processor >will fall flat!]
George