Reply by Don Y August 3, 2016
Hi George,

Rain, rain, rain...  We tend to forget what 90% RH feels like when it's
typically 9%!  :-/   (OTOH, never say 'no' to rain!)

On 7/27/2016 11:22 AM, George Neuner wrote:
> On Tue, 26 Jul 2016 22:07:13 -0700, Don Y > <blockedofcourse@foo.invalid> wrote: > >> On 7/26/2016 9:50 AM, George Neuner wrote: >>> >>> In actual fact, early tape based mainframes quickly developed block >>> based tape "filesystems" that allowed data "files" on the tape to be >>> discontiguous. To make it faster, multiple tapes were used as >>> RAID-like stripe sets - reading/writing one while seeking on another, >>> etc. A file only had to be contigous on tape when it entered or left >>> the system. >>> >>> When disks were introduced, the "filesystem" concept followed. >> >> When I first adopted NetBSD (early 90's), they had no support for >> 9T tape (which is what I was using for "portable" tape exchange >> as well as offline storage, at the time). Originally, an F880 >> (800/1600bpi, 100ips) >> >> As few folks *had* 9T transports, it fell to me to write a driver to >> support my controller. Adding the "block" device as well as the more >> traditional "character" device. So, I could build "tape filesystems" >> "just as easily" as disk-based! >> >> Conceptually, a piece of cake! The "strategy" routine took a bit >> of work to add optimizations like "read reverse" but it was kinda cool! >> >> However, the first time I tried to "read" a file system mounted on >> the device, the "violence" of the mechanism's motions was scary! >> Made me wonder if I was tearing things up with all the short >> back-and-forth motions (seeks) and aggressive acceleration and >> braking in the transport! Sort of like watching a pen-plotter running >> balls-out! >> >> And, that was with files laid down in contiguous tape blocks; >> the idea of the transport being commanded to seek blocks >> scattered around the medium (no buffering in the transport, >> formatter or controller!) led me to believe I'd quickly >> wear SOMETHING out in the transport -- even if only the head >> or tension arm roller! > > I don't doubt it. But I think you might have hit upon a worst case > scenario for a single tape.
I suspect a good part of it was due to the fact that neither the
transport, formatter nor controller did any sort of buffering.  I.e.,
bytes came right off the head into a latch on the ISA bus; get it NOW
or it's gone!  That, coupled with relatively large IRGs and low
densities (800/1600 PE), meant the transport had to make "noticeable"
movements to do *anything*!  (BSF/BSR, etc.)

By contrast, more modern "drives" have huge buffers, VTOCs, really
high densities and much less "mass" involved, so they don't have to
"work" as hard to get at data.  I think about ~50MB on a 10"
BlackWatch... compared to a sizeable fraction of a TB on a DLT-S4 (or
many TB on newer LTOs <shudder>)

OTOH, modern drives have more brains in their onboard firmware than
was available in the HOSTS of most 9Ts!
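For anyone who hasn't driven an unbuffered controller, a minimal
sketch of what that "get it NOW or it's gone" constraint looks like
from the host side -- the register addresses, status bits and inb()
helper are all hypothetical, not the actual F880/controller interface:

    #include <stdint.h>

    /* Hypothetical ISA I/O map -- NOT the real formatter/controller registers */
    #define DATA_LATCH   0x300            /* one byte; overwritten as the next byte comes off the head */
    #define STATUS_REG   0x301
    #define ST_BYTE_RDY  0x01             /* a fresh byte is sitting in the latch */
    #define ST_OVERRUN   0x02             /* host was too slow; that byte is gone */

    extern uint8_t inb(uint16_t port);    /* platform port-I/O primitive, assumed */

    /* Pull one tape block out of the latch by polling.
       Returns bytes read, or -1 if we missed a byte (overrun). */
    static int read_block(uint8_t *buf, int len)
    {
        for (int i = 0; i < len; i++) {
            uint8_t st;
            while (!((st = inb(STATUS_REG)) & ST_BYTE_RDY))
                ;                         /* spin -- there is no FIFO to fall back on */
            if (st & ST_OVERRUN)
                return -1;                /* data lost; only recovery is a BSR and re-read */
            buf[i] = inb(DATA_LATCH);     /* grab it before the next byte lands */
        }
        return len;
    }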
> Mainframe tape drivers - really just an I/O program - performed > request reordering and elevator seeking to minimize thrashing of the > tapes. Any particular tape in a big system was likely to have many > outstanding requests at any given time. > > Discontiguous files were only ever used on scratch tapes that were > temporary storage. Code images on "program" tapes always were > contiguous, as were any files on external "transport" tapes. Input > entering the system or output leaving it always was contiguous (for > portability). > > 1960 era mainframes had (by today's standards) severe memory > limitations: a machine with 256K words[*] of RAM was a *BIG* system > that would be juggling dozens of jobs. The average program used less > than 12K of memory and scripted execution of a "pipeline" of small > programs was very common. > > [*] https://en.wikipedia.org/wiki/Word_(computer_architecture)
Yeah, my first disk (RS08) had 256K words (12b?).  Drew something
like 500W (for the drive + controller) and could source data at a
whopping ~50KW/s.  If you were lucky, you could fit *four* of them in
a full size rack (for a 1 megaword store! <drool>)

But, it was cool in that it was WORD ADDRESSABLE! (!)
[And, had a boatload of blinkenlights!]
> Quite a lot of routine data manipulation was done by writing/reading > temporary files, as was all passing of data between programs. It > wasn't feasible to dedicate individual scratch tapes to every running > job, and often jobs needed multiple files simultaneously which was > hideously slow and inefficient unless the files all were on separate > tapes. > > And once multitasking systems became the norm, it was no longer > politically expedient for the mainframe operators to delay jobs of > "important" users until scratch tapes were available. > > Thus the discontiguous tape filesystem was born. > > On my shelf, I have a textbook of "external" - i.e. file based - data > processing algorithms. It was written in 1988, an indication that > file based techniques were still being taught in some CS/CIS programs > then. > > Modern "big data" processing has shown that file based techniques are > *still* relevant and that it was a mistake to stop teaching them: > programmers now often have no clue how to proceed when, for whatever > reason, their data can't be fit into memory. > [Or worse: their "in-memory" processing is much slower than they > expect because because their data is swapped out due to multitasking.]
Nothing *ever* "goes obsolete" -- much like fashion; you just have to
wait for folks to rediscover *old* techniques as *new* problems
manifest and prove to no longer fit the assumptions they've
artificially imposed on their solutions!

Had to do that just last week to (brute force) process some huge
"lists" to document how I'd built the diskless workstations for my
upcoming "class".  Annoying, as the "files" weren't even local (NFS)!
After finishing (quick and dirty), I realized a simpler way of getting
the same results...  :<
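As a refresher on the file-based techniques George mentions, here is a
bare-bones external merge sort skeleton in C -- run generation only,
with the k-way merge left as the obvious second pass.  The record
type, chunk size and file naming are arbitrary; it's a sketch of the
technique, not a tuned implementation:

    #include <stdio.h>
    #include <stdlib.h>

    #define RUN_LEN (1u << 20)            /* records held in memory at once (arbitrary) */

    static int cmp_u32(const void *a, const void *b)
    {
        unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
        return (x > y) - (x < y);
    }

    /* Pass 1 of an external merge sort: read the input in memory-sized
       chunks, sort each chunk, and spill it to a numbered "run" file.
       Pass 2 (not shown) k-way merges the runs with one buffer per run. */
    static int make_runs(FILE *in)
    {
        unsigned *buf = malloc(RUN_LEN * sizeof *buf);
        int runs = 0;
        size_t n;

        if (!buf)
            return -1;
        while ((n = fread(buf, sizeof *buf, RUN_LEN, in)) > 0) {
            char name[32];
            FILE *out;

            qsort(buf, n, sizeof *buf, cmp_u32);
            snprintf(name, sizeof name, "run%04d.tmp", runs++);
            out = fopen(name, "wb");
            if (!out || fwrite(buf, sizeof *buf, n, out) != n) {
                free(buf);
                return -1;
            }
            fclose(out);
        }
        free(buf);
        return runs;                      /* number of sorted runs produced */
    }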
> Not that still teaching file techniques today would have any great > impact on the world because the majority of programmers now have no > formal CS/CIS education.
We've had this discussion before:  "teaching what industry wants"
instead of "teaching what industry WILL want".  It's the "Water Cooler
Institute of Technology" -- devoid of detail and all the caveats that
come with a more "formal" education.

I had an early employer who made no bones about this in his hiring
decisions:  "I hire from University X when I am looking for an
employee that I need TODAY; I'll hire from University Y when I'm
looking for someone that might NOT be productive, today, but will be a
better LONG TERM investment (the University *X* employee requiring
renewed education to remain viable -- almost as soon as hired!)"

It was an interesting insight to see how (some?) employers approach
their human resource needs...
>> When I got the M990 (much higher density, much deeper pockets!), >> I was even *less* inclined to 'experiment'. :< >> >> I learned to settle for simple streaming... > > As always, YMMV. > George >
Reply by George Neuner July 27, 2016
On Tue, 26 Jul 2016 22:07:13 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 7/26/2016 9:50 AM, George Neuner wrote: >> >> In actual fact, early tape based mainframes quickly developed block >> based tape "filesystems" that allowed data "files" on the tape to be >> discontiguous. To make it faster, multiple tapes were used as >> RAID-like stripe sets - reading/writing one while seeking on another, >> etc. A file only had to be contigous on tape when it entered or left >> the system. >> >> When disks were introduced, the "filesystem" concept followed. > >When I first adopted NetBSD (early 90's), they had no support for >9T tape (which is what I was using for "portable" tape exchange >as well as offline storage, at the time). Originally, an F880 >(800/1600bpi, 100ips) > >As few folks *had* 9T transports, it fell to me to write a driver to >support my controller. Adding the "block" device as well as the more >traditional "character" device. So, I could build "tape filesystems" >"just as easily" as disk-based! > >Conceptually, a piece of cake! The "strategy" routine took a bit >of work to add optimizations like "read reverse" but it was kinda cool! > >However, the first time I tried to "read" a file system mounted on >the device, the "violence" of the mechanism's motions was scary! >Made me wonder if I was tearing things up with all the short >back-and-forth motions (seeks) and aggressive acceleration and >braking in the transport! Sort of like watching a pen-plotter running >balls-out! > >And, that was with files laid down in contiguous tape blocks; >the idea of the transport being commanded to seek blocks >scattered around the medium (no buffering in the transport, >formatter or controller!) led me to believe I'd quickly >wear SOMETHING out in the transport -- even if only the head >or tension arm roller!
I don't doubt it.  But I think you might have hit upon a worst case
scenario for a single tape.

Mainframe tape drivers - really just an I/O program - performed
request reordering and elevator seeking to minimize thrashing of the
tapes (a sketch follows below).  Any particular tape in a big system
was likely to have many outstanding requests at any given time.

Discontiguous files were only ever used on scratch tapes that were
temporary storage.  Code images on "program" tapes always were
contiguous, as were any files on external "transport" tapes.  Input
entering the system or output leaving it always was contiguous (for
portability).

1960s-era mainframes had (by today's standards) severe memory
limitations: a machine with 256K words[*] of RAM was a *BIG* system
that would be juggling dozens of jobs.  The average program used less
than 12K of memory, and scripted execution of a "pipeline" of small
programs was very common.

[*] https://en.wikipedia.org/wiki/Word_(computer_architecture)

Quite a lot of routine data manipulation was done by writing/reading
temporary files, as was all passing of data between programs.  It
wasn't feasible to dedicate individual scratch tapes to every running
job, and often jobs needed multiple files simultaneously, which was
hideously slow and inefficient unless the files all were on separate
tapes.

And once multitasking systems became the norm, it was no longer
politically expedient for the mainframe operators to delay jobs of
"important" users until scratch tapes were available.

Thus the discontiguous tape filesystem was born.

On my shelf, I have a textbook of "external" - i.e. file based - data
processing algorithms.  It was written in 1988, an indication that
file based techniques were still being taught in some CS/CIS programs
then.

Modern "big data" processing has shown that file based techniques are
*still* relevant and that it was a mistake to stop teaching them:
programmers now often have no clue how to proceed when, for whatever
reason, their data can't fit into memory.
[Or worse: their "in-memory" processing is much slower than they
expect because their data is swapped out due to multitasking.]

Not that still teaching file techniques today would have any great
impact on the world, because the majority of programmers now have no
formal CS/CIS education.
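A rough idea of what that request reordering looks like -- an elevator
pass over a queue of block requests, sorted by tape position so the
transport sweeps in one direction instead of thrashing back and forth.
The request structure and queue here are invented for illustration;
real channel programs were considerably more involved.

    #include <stdlib.h>

    struct tape_req {
        unsigned block;                   /* position on the tape, in blocks */
        /* buffer, length, completion callback, etc. omitted */
    };

    static int by_block(const void *a, const void *b)
    {
        unsigned x = ((const struct tape_req *)a)->block;
        unsigned y = ((const struct tape_req *)b)->block;
        return (x > y) - (x < y);
    }

    /* One elevator sweep: service every queued request at or beyond the
       current head position in ascending order, then the stragglers
       behind it on the way back.  Minimizes direction reversals. */
    static void elevator_sweep(struct tape_req *q, size_t n, unsigned head_pos,
                               void (*service)(struct tape_req *))
    {
        size_t i;

        qsort(q, n, sizeof *q, by_block);
        for (i = 0; i < n; i++)           /* forward leg */
            if (q[i].block >= head_pos)
                service(&q[i]);
        while (i-- > 0)                   /* reverse leg, descending order */
            if (q[i].block < head_pos)
                service(&q[i]);
    }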
>When I got the M990 (much higher density, much deeper pockets!), >I was even *less* inclined to 'experiment'. :< > >I learned to settle for simple streaming...
As always, YMMV. George
Reply by Don Y July 27, 2016
On 7/26/2016 9:50 AM, George Neuner wrote:
> > Wrong person. DonY wrote the piece you responded to. > > > On Tue, 26 Jul 2016 15:59:32 +0300, upsidedown@downunder.com wrote: > >> On Tue, 26 Jul 2016 06:27:42 -0400, George Neuner >> <gneuner2@comcast.net> wrote: >> >>> On Mon, 25 Jul 2016 16:56:42 -0700, Don Y >>> <blockedofcourse@foo.invalid> wrote: >>> >>>> : >>>> For example, any network traffic that was active at the time the >>>> swap occurred still ended up with its endpoint on the current >>>> node. You didn't have to buffer any incoming messages intended >>>> for that "to be swapped" process and later forward them to the >>>> new destination when the process is "restored". >> >> Is this any different from the situation in the old days, when you had >> to swap out complete programs from core to make room for other >> programs ? If the task had some active I/O going on, it had some >> I/O-buffers (DMA-buffers) locked in memory. The situation was quite >> nasty, especially with slow I/O such as mag tape (possibly involving a >> tape rewind). > > If you go back to the very early days of unblocked streaming tapes, > swapping processes then was all but impossible. But in the 60's > mainframe era, most tapes used hard block formatting, and better > drives could quickly pause and reposition the tape. > > Most machines had dedicated I/O processors separate from the main > CPU(s). The OS tracked the activity of these processors and did not > swap processes while an I/O transfer was in progress. > > But since most I/O was block oriented, a large logical transfer could > be done as a series of smaller physical ones - enabling the logical > transfer to be halted and resumed. > > In actual fact, early tape based mainframes quickly developed block > based tape "filesystems" that allowed data "files" on the tape to be > discontiguous. To make it faster, multiple tapes were used as > RAID-like stripe sets - reading/writing one while seeking on another, > etc. A file only had to be contigous on tape when it entered or left > the system. > > When disks were introduced, the "filesystem" concept followed.
When I first adopted NetBSD (early 90's), they had no support for
9T tape (which is what I was using for "portable" tape exchange
as well as offline storage, at the time).  Originally, an F880
(800/1600bpi, 100ips).

As few folks *had* 9T transports, it fell to me to write a driver to
support my controller -- adding the "block" device as well as the more
traditional "character" device.  So, I could build "tape filesystems"
"just as easily" as disk-based!

Conceptually, a piece of cake!  The "strategy" routine took a bit
of work to add optimizations like "read reverse" (sketched below),
but it was kinda cool!

However, the first time I tried to "read" a file system mounted on
the device, the "violence" of the mechanism's motions was scary!
Made me wonder if I was tearing things up with all the short
back-and-forth motions (seeks) and aggressive acceleration and
braking in the transport!  Sort of like watching a pen-plotter running
balls-out!

And, that was with files laid down in contiguous tape blocks;
the idea of the transport being commanded to seek blocks
scattered around the medium (no buffering in the transport,
formatter or controller!) led me to believe I'd quickly
wear SOMETHING out in the transport -- even if only the head
or tension arm roller!

When I got the M990 (much higher density, much deeper pockets!),
I was even *less* inclined to 'experiment'.  :<

I learned to settle for simple streaming...
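A sketch of the kind of decision a "strategy" routine makes -- this is
not the NetBSD driver code, just an illustration of choosing "read
reverse" when the target block sits immediately behind the head.  The
command names and the position model are made up:

    /* Hypothetical transport commands -- not the real controller/formatter set */
    enum tape_cmd { CMD_READ_FWD, CMD_READ_REV, CMD_SPACE_FWD, CMD_SPACE_REV };

    struct xfer {
        enum tape_cmd cmd;
        unsigned count;                   /* blocks to space over, if any */
    };

    /* Decide how to reach 'want' given the head sits just before 'head'.
       Reading in reverse avoids a back-space followed by a forward read
       when the wanted block is the one immediately behind the head. */
    static struct xfer plan_read(unsigned head, unsigned want)
    {
        struct xfer x;

        if (want == head) {                       /* next block: just read it */
            x.cmd = CMD_READ_FWD;  x.count = 0;
        } else if (want + 1 == head) {            /* just behind: read backwards */
            x.cmd = CMD_READ_REV;  x.count = 0;
        } else if (want > head) {                 /* ahead: space forward, then read */
            x.cmd = CMD_SPACE_FWD; x.count = want - head;
        } else {                                  /* well behind: space back, then read */
            x.cmd = CMD_SPACE_REV; x.count = head - want;
        }
        return x;
    }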
Reply by George Neuner July 26, 2016
Wrong person.  DonY wrote the piece you responded to.


On Tue, 26 Jul 2016 15:59:32 +0300, upsidedown@downunder.com wrote:

>On Tue, 26 Jul 2016 06:27:42 -0400, George Neuner ><gneuner2@comcast.net> wrote: > >>On Mon, 25 Jul 2016 16:56:42 -0700, Don Y >><blockedofcourse@foo.invalid> wrote: >> >>> : >>>For example, any network traffic that was active at the time the >>>swap occurred still ended up with its endpoint on the current >>>node. You didn't have to buffer any incoming messages intended >>>for that "to be swapped" process and later forward them to the >>>new destination when the process is "restored". > >Is this any different from the situation in the old days, when you had >to swap out complete programs from core to make room for other >programs ? If the task had some active I/O going on, it had some >I/O-buffers (DMA-buffers) locked in memory. The situation was quite >nasty, especially with slow I/O such as mag tape (possibly involving a >tape rewind).
If you go back to the very early days of unblocked streaming tapes,
swapping processes then was all but impossible.  But in the 60's
mainframe era, most tapes used hard block formatting, and better
drives could quickly pause and reposition the tape.

Most machines had dedicated I/O processors separate from the main
CPU(s).  The OS tracked the activity of these processors and did not
swap processes while an I/O transfer was in progress.

But since most I/O was block oriented, a large logical transfer could
be done as a series of smaller physical ones - enabling the logical
transfer to be halted and resumed (see the sketch below).

In actual fact, early tape based mainframes quickly developed block
based tape "filesystems" that allowed data "files" on the tape to be
discontiguous.  To make it faster, multiple tapes were used as
RAID-like stripe sets - reading/writing one while seeking on another,
etc.  A file only had to be contiguous on tape when it entered or left
the system.

When disks were introduced, the "filesystem" concept followed.
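In code, breaking one logical transfer into swappable pieces is
roughly this -- a loop over physical blocks with a scheduling point
between them.  The transfer and scheduler hooks are stand-ins, not any
particular OS's API:

    #include <stddef.h>

    /* Stand-in hooks -- assumed to be supplied by a (hypothetical) kernel */
    extern int  do_physical_block(void *dst, size_t blkno, size_t blksize);
    extern void maybe_swap_or_reschedule(void);   /* safe point between blocks */

    /* One logical read of 'nblocks' blocks, done as nblocks physical reads.
       The process may be swapped out at any of the gaps between blocks;
       only the single block in flight pins its buffer in memory. */
    static int logical_read(void *dst, size_t first, size_t nblocks, size_t blksize)
    {
        char *p = dst;

        for (size_t i = 0; i < nblocks; i++) {
            if (do_physical_block(p + i * blksize, first + i, blksize) < 0)
                return -1;
            maybe_swap_or_reschedule();           /* halt/resume point */
        }
        return 0;
    }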
>There are several alternatives: > >* lock the whole program in memory until I/O is complete (nasty)
Which is what was done, but distinguishing physical I/O from logical.
>* just lock the I/O buffers (possibly part of a small I/O program) and >swap that out too, when I/O is completed
Too complicated.

The code/data/bss/stack segmentation of Unix programs mirrors how
early mainframes (and minis) actually worked: pure segmentation with
base:offset addressing [the segment bases being known only to the
operating system; sketched below].

I/O processors ran a single program - effectively the device driver -
which was never swapped.  Any local buffering the program may have
needed likewise would be in the processor's partitioned space.

But many I/O processors simply accessed the program's data space
buffer directly.  The buffer could not be swapped out *while* the I/O
processor was accessing it.  But again, the physical I/O could be done
incrementally with opportunity to swap between transfers.
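For reference, base:offset segmentation amounts to nothing more than
this check-and-add, which is why whole-segment swapping was so simple:
the OS only has to rewrite one base per segment when the segment
moves.  A schematic sketch, not any particular machine's MMU:

    #include <stdint.h>
    #include <stdbool.h>

    struct segment {
        uint32_t base;                    /* physical address; known only to the OS */
        uint32_t limit;                   /* segment length in bytes */
    };

    /* Translate a program-visible offset to a physical address.
       Swapping a segment back in at a different spot just changes 'base';
       nothing inside the program's address space has to be fixed up. */
    static bool translate(const struct segment *seg, uint32_t offset,
                          uint32_t *phys)
    {
        if (offset >= seg->limit)
            return false;                 /* segmentation violation */
        *phys = seg->base + offset;
        return true;
    }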
>* abort the I/O and retry again when the program is swapped back into >memory. Possible for read operations from mass storage
Sometimes had to be done with unblocked streaming tapes. Once block devices became the norm, aborting I/O for swapping never had to be done.
>The last alternative is useful also with network traffic, provided >that the sender buffers the transmitted data until it is acknowledged >by the receiver. > >With modern multicore/processors with virtual memory this should be >trivial as long as the processors share the same physical memory bus. > >In a system with physically separate platforms, some network helper >programs are needed to transfer data between two buffers in different >platforms with available transfer systems, such as Ethernet.
YMMV, George
Reply by Dimiter_Popoff July 26, 2016
On 26.7.2016 г. 15:59, upsidedown@downunder.com wrote:
> On Tue, 26 Jul 2016 06:27:42 -0400, George Neuner > <gneuner2@comcast.net> wrote: > >> On Mon, 25 Jul 2016 16:56:42 -0700, Don Y >> <blockedofcourse@foo.invalid> wrote: >> >>> On 7/25/2016 2:32 PM, George Neuner wrote: >>> >>>>> [Note [process migration] is more involved than just packing up >>>>> registers plus address space!] >>>> >>>> It isn't THAT hard: clustered mainframes in the 1960's had the ability >>>> to migrate processes ... swap out here, swap in there. All it really >>>> requires is virtual addressing capability and a way to transport the >>>> code and runtime data. >>> >>> Yes. But they already had the "extra bits" (of state) that were >>> resident IN the OS's data structures. They either packed those >>> up with the "(formal) process state" as it was swapped out >>> *or* kept it in the kernel associated with the swapped out >>> process. >> >> Yes. >> >>> For example, any network traffic that was active at the time the >>> swap occurred still ended up with its endpoint on the current >>> node. You didn't have to buffer any incoming messages intended >>> for that "to be swapped" process and later forward them to the >>> new destination when the process is "restored". > > Is this any different from the situation in the old days, when you had > to swap out complete programs from core to make room for other > programs ? If the task had some active I/O going on, it had some > I/O-buffers (DMA-buffers) locked in memory. The situation was quite > nasty, especially with slow I/O such as mag tape (possibly involving a > tape rewind). > > There are several alternatives: > > * lock the whole program in memory until I/O is complete (nasty) > > * just lock the I/O buffers (possibly part of a small I/O program) and > swap that out too, when I/O is completed > > * abort the I/O and retry again when the program is swapped back into > memory. Possible for read operations from mass storage > > The last alternative is useful also with network traffic, provided > that the sender buffers the transmitted data until it is acknowledged > by the receiver. > > With modern multicore/processors with virtual memory this should be > trivial as long as the processors share the same physical memory bus.
Yes indeed.  Then you grab your phone and discover things can be
really messed up in this context even with *gigabytes* of RAM....

Sorry for the rant, I know this is obvious to all of us here.

Dimiter
Reply by Don Y July 26, 2016
On 7/26/2016 5:59 AM, upsidedown@downunder.com wrote:
> On Tue, 26 Jul 2016 06:27:42 -0400, George Neuner > <gneuner2@comcast.net> wrote: > >> On Mon, 25 Jul 2016 16:56:42 -0700, Don Y >> <blockedofcourse@foo.invalid> wrote: >> >>> On 7/25/2016 2:32 PM, George Neuner wrote: >>> >>>>> [Note [process migration] is more involved than just packing up >>>>> registers plus address space!] >>>> >>>> It isn't THAT hard: clustered mainframes in the 1960's had the ability >>>> to migrate processes ... swap out here, swap in there. All it really >>>> requires is virtual addressing capability and a way to transport the >>>> code and runtime data. >>> >>> Yes. But they already had the "extra bits" (of state) that were >>> resident IN the OS's data structures. They either packed those >>> up with the "(formal) process state" as it was swapped out >>> *or* kept it in the kernel associated with the swapped out >>> process. >> >> Yes. >> >>> For example, any network traffic that was active at the time the >>> swap occurred still ended up with its endpoint on the current >>> node. You didn't have to buffer any incoming messages intended >>> for that "to be swapped" process and later forward them to the >>> new destination when the process is "restored". > > Is this any different from the situation in the old days, when you had > to swap out complete programs from core to make room for other > programs ? If the task had some active I/O going on, it had some > I/O-buffers (DMA-buffers) locked in memory. The situation was quite > nasty, especially with slow I/O such as mag tape (possibly involving a > tape rewind).
The I/O was associated with a specific task.  You could simply defer
swapping out its results until the I/O had completed (e.g., at the
next IRG).

This is no different than deferring the context switch of a
"coprocessor" (most typically FPU) until a convenient point AFTER the
body of the task had undergone its context switch (see the sketch
below):  you know who the FPU's state belongs to, you just have to
"remember" that when it comes time to attempt to USE the FPU after the
"primary" context switch.

Instead, imagine if the I/O started by a task in a multitasking system
*stopped* when that task wasn't "running" (i.e., had control of the
processor).  Consider how you'd address that sort of implementation.
I.e., the swapped out task is no longer able to provide SERVICES that
other tasks are counting upon for their continued operation.  How
useful would "multitasking" be in that scenario?
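The deferred-coprocessor trick looks something like this:  mark the
FPU unavailable at every context switch, and only swap FPU state in
the trap that fires when the *new* task actually touches it.  The
hardware hooks here are placeholders; the real mechanism is, e.g.,
CR0.TS and the device-not-available trap on x86.

    /* Placeholder hardware hooks -- on x86 these would be CR0.TS / #NM */
    extern void fpu_disable(void);        /* make the next FPU instruction trap */
    extern void fpu_enable(void);
    extern void fpu_save(void *state);
    extern void fpu_restore(const void *state);

    struct task { void *fpu_state; /* ... */ };

    static struct task *fpu_owner;        /* whose state is live in the FPU */

    /* Called on every "primary" context switch: don't touch the FPU at all,
       just arrange for a trap if the incoming task uses it. */
    void context_switch_hook(void)
    {
        fpu_disable();
    }

    /* Trap handler: first FPU use after a switch.  Only now do we pay for
       saving the old owner's state and loading the current task's. */
    void fpu_unavailable_trap(struct task *current)
    {
        if (fpu_owner != current) {
            if (fpu_owner)
                fpu_save(fpu_owner->fpu_state);
            fpu_restore(current->fpu_state);
            fpu_owner = current;
        }
        fpu_enable();                     /* let the faulting instruction retry */
    }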
> There are several alternatives: > > * lock the whole program in memory until I/O is complete (nasty) > > * just lock the I/O buffers (possibly part of a small I/O program) and > swap that out too, when I/O is completed > > * abort the I/O and retry again when the program is swapped back into > memory. Possible for read operations from mass storage > > The last alternative is useful also with network traffic, provided > that the sender buffers the transmitted data until it is acknowledged > by the receiver.
You're thinking about network traffic that *can* be stopped/paused.

Imagine "suddenly" (from the standpoint of other applications) saying
that "printf() will not be available" (while the server that
implements the printf() functionality is being relocated (or swapped
out)).  Do all of the programs that have printf()'s have to learn how
to deal with that situation?  ("OK, I'll print the results, later..."
when?)  Or, does printf()'s unavailability automatically cause those
other dependent tasks to block (indefinitely)?
> With modern multicore/processors with virtual memory this should be > trivial as long as the processors share the same physical memory bus.
Moving the contents of memory is trivial:  one system call, in my
case (object that references the task's memory space; object that
references the destination node).

The problem is all the other cruft that has to be gathered up
(extricated from the OS) to go along with the "task" while it is in
transition:

- "Hello, Mr Task.  Here are the results of that last RPC that you
  issued..."
- "Hello, Mr. Task.  Could you please perform this service for me?"
- "Hey, Mr task!  Where are you going??  You still haven't given me
  the results of that last service that I requested!!"
- "Hey, Mr Task, you're holding some locks that I need!  Please don't
  tell me you're going to continue holding them while you're being
  swapped out?  That's just plain RUDE!"
- "Um, while you're 'away', can I use the resources that you've
  RESERVED?"

etc.
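To make the "cruft" concrete, this is roughly the shape of what has to
travel with (or be re-homed for) a migrating task, over and above its
memory image.  All of the names are invented for illustration and
don't correspond to any actual kernel interface:

    #include <stddef.h>

    struct pending_rpc  { int xid;     /* reply still owed TO the task       */ };
    struct inbound_req  { int client;  /* request the task hasn't served yet */ };
    struct held_lock    { int lock_id; /* lock other tasks are waiting on    */ };
    struct reservation  { int res_id;  /* resources set aside for the task   */ };

    /* The migration record: the memory image is one system call, but all
       of this OS-resident state must either move with the task or be
       proxied back to the original node until the task is re-established. */
    struct migration_state {
        struct pending_rpc *rpcs_outstanding;  size_t n_rpcs;
        struct inbound_req *requests_queued;   size_t n_reqs;
        struct held_lock   *locks_held;        size_t n_locks;
        struct reservation *reservations;      size_t n_res;
    };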
> In a system with physically separate platforms, some network helper > programs are needed to transfer data between two buffers in different > platforms with available transfer systems, such as Ethernet.
Reply by Don Y July 26, 2016
Hi George,

[Early morning meeting -- WTF?  Had *hoped* I'd get a nap in beforehand
but obviously got caught up in "stuff"...  <frown>]

 >>>> ... Previous systems just were "processor farms" -- typically
>>>> all "powered" just waiting for "workloads". The idea of *bringing* >>>> another node on-line, ON-DEMAND to address increasing needs wasn't >>>> part of their scope (why should it be? Unless you're concerned >>>> with power consumption!). Nor was there a concern over taking >>>> nodes OFF-line when they weren't technically needed. >>> >>> The current crop of tera-scale computers consume megawatts, and the >>> largest peta-scale computers consume 10s of megawatts when all their >>> CPUs and attached IO devices are active. They have extensive power >>> control systems to manage the partitioning of active/inactive devices. >> >> But their goal is to *use* all of that compute power, not let it >> idle. They tend to be more homogeneous environments with more >> "level" I/O usage. It's not like turning on CCTV cameras "because >> it's getting dark outside" and, as a result, *needing* that extra compute >> power to do video processing. > > Supercomputing centers all are batch oriented just like mainframes > used to be. The difference is they shut off CPUs that aren't in use - > if any - to lower the power bills. > > It's true that a lot of older machines have plenty of work to keep > them runnning ... but in the last 10-15 years, many newer ones have > had odd architectures that make writing software for them difficult > and time consuming. It is true that a lot of them use Intel or ARM > processors, but it isn't true that they all run Linux and can be > programmed using GCC/OpenMP. Some of the world's most powerful > systems sit idle much of the time, simply for lack of software. > > And a lot of the software itself is surprisingly flexible. In most > SCC environments there is no multi-tasking: a set of CPUs is dedicated > to a program for its duration. But external factors may cause a > program to be halted before finishing. The stopped program may be > restarted later with a different number of CPUs according to the mix > of programs that are running at that time.
Different than being "paused" while "relocated" -- yet expecting the
rest of the "system" to accommodate their time "in transit".

Note that, conceptually, my code could run on a single CPU (with
enough resources) -- it isn't *inherently* parallel.  I think this is
an essential aspect for development and future maintenance:  you only
need to expect the *illusion* of parallelism.

[The other aspects we've been discussing are all hidden from the
application developer in much the same way as GC.  I.e., if you think
about things, you realize "it's happening" but your code never really
understands why/where]
> Programs which are expected to need many CPUs (or vast amounts of > memory which very often is tied to the number of CPUs), or which are > expected to run for more than a few minutes - such programs often are > written to checkpoint intermediate processing, to be restartable from > saved checkpoints, and to adapt dynamically to the number of CPUs they > are given when (re)started. > >>> And nobody has yet come up with a really good way to exploit massively >>> parallel hardware for general application programming. But that's a >>> different discussion. >> >> I don't believe we'll see any "effective" algorithms -- largely because >> there aren't many "available" installations. So, you have groups >> with very specific sorts of application sets trying to tackle the >> problem for *their* needs; not for "general" needs. > > The point is that "massively parallel" is becoming the norm. There > are commodity server chips now with 16 and 32 cores, and this year's > high end server chip will be in a laptop in 5 years.
I consider SMP a different beast than distributed systems. E.g., my blade server has ~60 (?) cores in the same box -- but it's really 14 separate "machines" that really are only suited to working in a "server farm" sort of environment; tied to applications that are simply replicated with a split workload (instead of a single application that has been "diced up" to run on the many processors).
> [ASUS now will happily sell you a water-cooled !! laptop with an > overclocked 6th-gen i7 paired with 2 Nvidia 1080 GPUs. If you > disconnect the water line it melts ... or maybe just slows down 50%. > But seriously, what do you expect for only $7K?] > > In any event, more cores are fine for running more programs
Exactly. Which is how I've exploited "CPUs" (and, now, cores). Conveniently sidestepping the issue of "how do you develop for this environment" by treating it as many "programs" instead of a single program that magically diced itself to pieces.
> simultaneously, but there's no good *general* way to leverage more > cores to make a single program run faster. The ways that are known to > be automatically exploitable (by a compiler) are largely limited to > parallelizing loops. Parallelizing non looping code [which is most of > most programs that need it] invariably relies on the programmer to > recognize possible parallelism and write special code to take > advantage of it. > > Why should anyone care? Good question. I don't have a good answer, > but a couple of data points: > > Lots of experience has shown that the average programmer can't write > correct parallel code [or even just correct serial code, but that's
I think that's inherent in the way that most people *think* of algorithms: "this, then that, followed by yet another thing". Being a hardware person, I implicitly see walking and chewing gum as "obvious"... why should this bit of hardware *pause* waiting for this other bit to finish ("either-or")?
> another discussion]. Automating parallelism - via compilers or smart > runtime systems - is the only way any significant percentage of > programs will be able to benefit.
I don't see big gains, there.  E.g., I probably represent an "above
average compute load" (in terms of what and how much I use machines
for during development), yet am reasonably sure most of the apps that
I run would not SIGNIFICANTLY benefit from even a *second* core (well,
maybe autorouting a PCB while simultaneously doing something else, but
that's the exception, not the rule).

Note my previous comment re: how I've decided to use multicore CPUs in
the HA system:  by assigning specific jobs to specific cores, KNOWING
what those loads represent in my system *and* thereby being able to
sidestep the core-scheduling issue.

Will this partitioning be the most efficient use of the cores?
Probably not.  But, they're inexpensive and this *will* make it easier
to ignore some of the costs that my design "naturally" incurs.
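On a COTS OS the "assign this job to that core" decision is a
one-liner; under Linux/glibc it looks like the following (the code
around the call is boilerplate for illustration, and the choice of
core 1 is arbitrary):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread to one core.  Whether this is the most
       efficient use of the cores is another question -- but it makes the
       per-core load completely predictable. */
    static int pin_to_core(int core)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    }

    int main(void)
    {
        if (pin_to_core(1) != 0)          /* e.g., dedicate core 1 to this job */
            fprintf(stderr, "could not pin to core 1\n");
        /* ... run the dedicated workload here ... */
        return 0;
    }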
> Surveys have shown that for many people, the only "computer" they own > or routinely use is their smartphone. The average person quite soon > will reasonably expect their phone to able to do anything: word > processing, spreadsheets, audio/video editing and presentation, i.e. > general business computing, and (for those few taking college STEM > courses) solving differential equations, performing circuit > simulations, virtual reality walkthroughs of pyramids, galaxies, > cadavers, etc. ... > ... while snapchatting, tweeting, facebooking, pinteresting, and still > providing 40 hours of use on a battery charge.
I disagree with the "while" assumption (as in true concurrency).  I
still think people think about "work" serially.  As long as you
*appear* to make some progress on "task A" while they are busy with
"task B", they're unable to understand how *good* your effort happened
to be.

E.g., if I'm updating a schematic *while* autorouting (another)
design, I only care that the autorouter made *some* progress while I
had the machine "tied up" with my schematic entry activities.  It's
not like I review its progress and think:  "Gee, it should be farther
along than it is...".  OTOH, if it had (obviously) *paused* "while I
was looking elsewhere", I'd be annoyed.

With this in mind, I think it makes it easier to develop application
sets that can achieve "satisfactory" performance without aiming for
"ideal" performance.  The challenging (portions of) applications being
those things that the user *expects* to run in "real time" (e.g., the
"answering machine" application on your phone can't suddenly start
speaking the OGM in slow motion just because your attention is focused
on something else!).

E.g., I can do "commercial detection" in recorded video "off-line"...
as long as I can get it DONE before the user wants to view that video!
(which, of course, can't be *while* it is streaming cuz it consumes
more real time than it *occupies*).  OTOH, I have to do motion
detection (CCTV) in real time cuz that affects the latency of the
actions triggered by that motion.
> Unless someone comes up with a way to pack kWh into a AA sized package > [that can be recharged in under 30 minutes], making better use of > large numbers of low powered cores is going to be the only way > forward.
Or, convince people that they don't need to "take it with them"!
E.g., my UIs have far lower power requirements than the "code" to run
them implies.  (OTOH, they're featherweight devices so the battery
issue still applies).  I don't think people care *where* the
processing is performed; they just want to ACCESS it "locally"
(wherever "locally" may be!)

I look at how my HA system (and its successor) have evolved and that's
the single-most striking "optimization" in it all!  Decouple the UI
from the application.  And, make the UI ubiquitous at the same time
--> simpler, more desirable system!

[Of course, for most people, that means reliance on someone else to
provide that "service"... still baffles me to see what people pay for
that "convenience" re: cell phones!]

(sigh)  Shower, then gather up my prototypes and get my *ss out of
here.  THEN, perhaps, the nap I've been awaiting?
Reply by upsidedown@downunder.com July 26, 2016
On Tue, 26 Jul 2016 06:27:42 -0400, George Neuner
<gneuner2@comcast.net> wrote:

>On Mon, 25 Jul 2016 16:56:42 -0700, Don Y ><blockedofcourse@foo.invalid> wrote: > >>On 7/25/2016 2:32 PM, George Neuner wrote: >> >>>> [Note [process migration] is more involved than just packing up >>>> registers plus address space!] >>> >>> It isn't THAT hard: clustered mainframes in the 1960's had the ability >>> to migrate processes ... swap out here, swap in there. All it really >>> requires is virtual addressing capability and a way to transport the >>> code and runtime data. >> >>Yes. But they already had the "extra bits" (of state) that were >>resident IN the OS's data structures. They either packed those >>up with the "(formal) process state" as it was swapped out >>*or* kept it in the kernel associated with the swapped out >>process. > >Yes. > >>For example, any network traffic that was active at the time the >>swap occurred still ended up with its endpoint on the current >>node. You didn't have to buffer any incoming messages intended >>for that "to be swapped" process and later forward them to the >>new destination when the process is "restored".
Is this any different from the situation in the old days, when you had
to swap out complete programs from core to make room for other
programs?  If the task had some active I/O going on, it had some
I/O-buffers (DMA-buffers) locked in memory.  The situation was quite
nasty, especially with slow I/O such as mag tape (possibly involving a
tape rewind).

There are several alternatives:

* lock the whole program in memory until I/O is complete (nasty)

* just lock the I/O buffers (possibly part of a small I/O program) and
swap that out too, when I/O is completed

* abort the I/O and retry again when the program is swapped back into
memory.  Possible for read operations from mass storage

The last alternative is useful also with network traffic, provided
that the sender buffers the transmitted data until it is acknowledged
by the receiver.

With modern multicore/processors with virtual memory this should be
trivial as long as the processors share the same physical memory bus.

In a system with physically separate platforms, some network helper
programs are needed to transfer data between two buffers in different
platforms with available transfer systems, such as Ethernet.
Reply by George Neuner July 26, 2016
On Mon, 25 Jul 2016 16:56:42 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 7/25/2016 2:32 PM, George Neuner wrote: > >>> [Note [process migration] is more involved than just packing up >>> registers plus address space!] >> >> It isn't THAT hard: clustered mainframes in the 1960's had the ability >> to migrate processes ... swap out here, swap in there. All it really >> requires is virtual addressing capability and a way to transport the >> code and runtime data. > >Yes. But they already had the "extra bits" (of state) that were >resident IN the OS's data structures. They either packed those >up with the "(formal) process state" as it was swapped out >*or* kept it in the kernel associated with the swapped out >process.
Yes.
>For example, any network traffic that was active at the time the >swap occurred still ended up with its endpoint on the current >node. You didn't have to buffer any incoming messages intended >for that "to be swapped" process and later forward them to the >new destination when the process is "restored".
Yes ... but circuit switching was old hat in the 1940's and packet
switching evolved in the late 1950's [arguably it was first used in a
real system in 1961].

Through the 1960's few installations had more than modems and
terminals to worry about - easily automated using controlled switching
of either type.  There were few packet networks (as we know them),
most experimental, and no communication standards:  there were as many
different protocols as there were networks.

Almost all of the hardware and the software protocols that people
typically think of as being associated with "early" networking -
Aloha, PUP, X.25, ARCnet, Ethernet, etc. - really all date from the
1970's.

But your point is taken.  <grin>
>>> ... Previous systems just were "processor farms" -- typically >>> all "powered" just waiting for "workloads". The idea of *bringing* >>> another node on-line, ON-DEMAND to address increasing needs wasn't >>> part of their scope (why should it be? Unless you're concerned >>> with power consumption!). Nor was there a concern over taking >>> nodes OFF-line when they weren't technically needed. >> >> The current crop of tera-scale computers consume megawatts, and the >> largest peta-scale computers consume 10s of megawatts when all their >> CPUs and attached IO devices are active. They have extensive power >> control systems to manage the partitioning of active/inactive devices. > >But their goal is to *use* all of that compute power, not let it >idle. They tend to be more homogeneous environments with more >"level" I/O usage. It's not like turning on CCTV cameras "because >it's getting dark outside" and, as a result, *needing* that extra compute >power to do video processing.
Supercomputing centers all are batch oriented, just like mainframes
used to be.  The difference is they shut off CPUs that aren't in use -
if any - to lower the power bills.

It's true that a lot of older machines have plenty of work to keep
them running ... but in the last 10-15 years, many newer ones have had
odd architectures that make writing software for them difficult and
time consuming.  It is true that a lot of them use Intel or ARM
processors, but it isn't true that they all run Linux and can be
programmed using GCC/OpenMP.  Some of the world's most powerful
systems sit idle much of the time, simply for lack of software.

And a lot of the software itself is surprisingly flexible.  In most
SCC environments there is no multi-tasking:  a set of CPUs is
dedicated to a program for its duration.  But external factors may
cause a program to be halted before finishing.  The stopped program
may be restarted later with a different number of CPUs according to
the mix of programs that are running at that time.

Programs which are expected to need many CPUs (or vast amounts of
memory, which very often is tied to the number of CPUs), or which are
expected to run for more than a few minutes - such programs often are
written to checkpoint intermediate processing, to be restartable from
saved checkpoints, and to adapt dynamically to the number of CPUs they
are given when (re)started.
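The checkpoint/restart pattern reduces to something like this --
periodically dump enough state to resume, and on startup look for that
dump before beginning from scratch.  The file name, interval, and the
single loop counter standing in for "state" are all placeholders:

    #include <stdio.h>

    #define CKPT_FILE  "job.ckpt"         /* placeholder name */
    #define CKPT_EVERY 1000               /* iterations between checkpoints */

    /* Resume from the last checkpoint if one exists, otherwise start at 0. */
    static long restore(void)
    {
        long i = 0;
        FILE *f = fopen(CKPT_FILE, "rb");
        if (f) {
            if (fread(&i, sizeof i, 1, f) != 1)
                i = 0;
            fclose(f);
        }
        return i;
    }

    static void checkpoint(long i)
    {
        FILE *f = fopen(CKPT_FILE, "wb");
        if (f) {
            fwrite(&i, sizeof i, 1, f);
            fclose(f);                    /* a real job would fsync and rename */
        }
    }

    int main(void)
    {
        for (long i = restore(); i < 1000000; i++) {
            /* ... one unit of work; CPU count can differ on each restart ... */
            if (i % CKPT_EVERY == 0)
                checkpoint(i);
        }
        return 0;
    }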
>> And nobody has yet come up with a really good way to exploit massively >> parallel hardware for general application programming. But that's a >> different discussion. > >I don't believe we'll see any "effective" algorithms -- largely because >there aren't many "available" installations. So, you have groups >with very specific sorts of application sets trying to tackle the >problem for *their* needs; not for "general" needs.
The point is that "massively parallel" is becoming the norm.  There
are commodity server chips now with 16 and 32 cores, and this year's
high end server chip will be in a laptop in 5 years.

[ASUS now will happily sell you a water-cooled !! laptop with an
overclocked 6th-gen i7 paired with 2 Nvidia 1080 GPUs.  If you
disconnect the water line it melts ... or maybe just slows down 50%.
But seriously, what do you expect for only $7K?]

In any event, more cores are fine for running more programs
simultaneously, but there's no good *general* way to leverage more
cores to make a single program run faster.  The ways that are known to
be automatically exploitable (by a compiler) are largely limited to
parallelizing loops (see the sketch at the end of this post).
Parallelizing non-looping code [which is most of most programs that
need it] invariably relies on the programmer to recognize possible
parallelism and write special code to take advantage of it.

Why should anyone care?  Good question.  I don't have a good answer,
but a couple of data points:

Lots of experience has shown that the average programmer can't write
correct parallel code [or even just correct serial code, but that's
another discussion].  Automating parallelism - via compilers or smart
runtime systems - is the only way any significant percentage of
programs will be able to benefit.

Surveys have shown that for many people, the only "computer" they own
or routinely use is their smartphone.  The average person quite soon
will reasonably expect their phone to be able to do anything:  word
processing, spreadsheets, audio/video editing and presentation, i.e.
general business computing, and (for those few taking college STEM
courses) solving differential equations, performing circuit
simulations, virtual reality walkthroughs of pyramids, galaxies,
cadavers, etc. ...

... while snapchatting, tweeting, facebooking, pinteresting, and still
providing 40 hours of use on a battery charge.

Unless someone comes up with a way to pack kWh into a AA sized package
[that can be recharged in under 30 minutes], making better use of
large numbers of low powered cores is going to be the only way
forward.

As always, YMMV.
George
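The "parallelizing loops" case mentioned above is essentially what
OpenMP packages up:  one annotation, and the runtime farms the
iterations out over however many cores it is given.  Build with, e.g.,
gcc -fopenmp; everything else about the program stays serial.  (The
harmonic-sum workload is just a stand-in.)

    #include <stdio.h>
    #include <omp.h>

    /* One annotated loop; the iterations are split across the available
       cores and the partial sums combined.  Nothing outside the loop
       needs to know how many threads ran. */
    int main(void)
    {
        enum { N = 100000000 };
        double sum = 0.0;

        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < N; i++)
            sum += 1.0 / (double)(i + 1);

        printf("harmonic(%d) = %f using up to %d threads\n",
               N, sum, omp_get_max_threads());
        return 0;
    }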
Reply by George Neuner July 25, 2016
On Mon, 25 Jul 2016 02:40:14 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>Hi George, > >[SWMBO grumbling cuz her cookie jar has been empty for many days now. >Guess I'd best plan on baking, tonight, lest I have to deal with >The Frowny Face (sigh). Or, stop *improving* the Rx so she has >an incentive to buy store-bought, instead!! ;-) ]
Leave it empty for a while longer ... maybe she'll try making them herself. 8-)
>On 7/17/2016 9:08 PM, George Neuner wrote: > >> Don is talking about both multicores and physically separate >> processors connected by network. As he has described it elsewhere - >> his system is able to migrate a running task to any suitable platform >> anywhere within the network. > >Actually, I only recently decided to support multicore processors; >mainly, they're too cheap *not* to! <frown> > >But, I'm taking the "easy" way out, there -- assigning specific >subsystems to specific cores (e.g., like moving protected >"capabilities" around; something that every node needs to be able >to do!) > >The other "migration" issues (bad choice of terms) have different >application domains.
Just about all "process" related terminology has been so heavily
overloaded that it's hard to have a group discussion:  unless you
rigorously define absolutely every term, everyone will have a [more or
less] different understanding based upon their own experience.

There was a time [not so long ago] when "thread" referred to flow of
control rather than a "scheduling entity", a single CPU was
"multi-programmed" to [appear to] do many things simultaneously,
"multi-programming" was distinct from "multi-processing", and
"multi-tasking" was a layman's term that could mean almost anything.

Ah, the good ol' days.
>E.g., migrating the "client thread" INTO the "server" (i.e., letting >it execute *in* the serving thread AS IF it still had the identity >of the client thread) applies when you do a *local* IPC as well as when >you do a *remote* RPC, obviously. I.e., even on a uniprocessor! >(amusing, then, that it seems to not be supported in COTS/FOSS OS >offerings! Oooops!) > >IIRC, Alpha called these "distributed threads" or, less humbly, "alpha >threads" as a nod to the fact that the thread *conceptually* wanders >around the "system" (regardless of: multitasking on a uniprocessor, >multiprocessing via SMP or in a physically distributed system).
Yes.
>The other sort of "migration" (relocation?) applies to "physically" >moving the process to a different node elsewhere in the network. >In this case, obviously only pertinent to a loosely coupled multiprocessor >system (NORMA).
Not familiar with NORMA. WRT OS literature, "migration" is the usual term for execution moving to a different CPU (or nowadays to a different core).
>I.e., once "migrated", the original host can get struck by >lightning to no effect (wrt the task/process in question)
Maybe. The "original host" might be another core on the same die.
>[Note this is more involved than just packing up registers plus >address space!]
It isn't THAT hard:  clustered mainframes in the 1960's had the
ability to migrate processes ... swap out here, swap in there.  All it
really requires is virtual addressing capability and a way to
transport the code and runtime data.

[The page oriented addressing in modern OSes is space efficient, but
it actually complicates things versus simple base:offset segmentation
addressing.]

I know you're referring to the issues of (de)serializing runtime data
structures for shipment over network ... I'm just pointing out that
migration can be (and was!) accomplished relatively simply using
shared storage.
>> There have been some other operating systems capable of doing this. >> AIUI, the interesting thing about Don's system is that he is >> attempting to do real-time, real-world control ... not simply to >> distribute processing over a bunch of networked "compute servers". > >Exactly. Previous systems just were "processor farms" -- typically >all "powered" just waiting for "workloads". The idea of *bringing* >another node on-line, ON-DEMAND to address increasing needs wasn't >part of their scope (why should it be? Unless you're concerned >with power consumption!). Nor was there a concern over taking >nodes OFF-line when they weren't technically needed.
The current crop of tera-scale computers consume megawatts, and the
largest peta-scale computers consume 10s of megawatts when all their
CPUs and attached IO devices are active.  They have extensive power
control systems to manage the partitioning of active/inactive devices.

Even modern desktop CPUs have reached the point of needing to turn
on/off functional units on demand as the instruction stream dictates.
They need to do it because they are unable to dissipate heat
effectively enough and will self destruct if too many circuits are
powered simultaneously.  Your average quad-core processor now lives in
a perpetual state of "rolling blackout" with at most about 1/3 of its
circuitry powered up at any given instant.  Many circuits are turned
on/off cycle by cycle.

[Server chips, on the whole, are no better - but having more circuitry
to work with means their powered up "1/3" can do more.  Unless you're
nitrogen cooling your system, you really aren't able to use much of
what you theoretically paid for.]

And nobody has yet come up with a really good way to exploit massively
parallel hardware for general application programming.  But that's a
different discussion.
>Finally, AFAICT, trying to meet timeliness constraints in such >a malleable system was an extra degree of freedom never addressed. >(and taking that into consideration AS you made these other >dispatching/scheduling decisions)
Yes.
>Cluster, Grid, NoW, Cloud, server farm, etc. -- none really address >the concept accurately (for different reasons). > >IMO, designs will increasingly be moving in this direction as >systems become too complex to be addressed reliably on single >processors (even with multiple cores, memory becomes the bottleneck) >as well as physically more dispersed applications. How do you >factor in the activities of another "service" when *YOU* have to >provide some sort of timeliness guarantees? > >[Should be relatively easy to see applications where a single processor >will fall flat!]
George