
64-bit embedded computing is here and now

Started by James Brakefield June 7, 2021
On 09.06.2021 10:40, David Brown wrote:
> On 09/06/2021 06:16, George Neuner wrote:
>> Since (at least) the Pentium 4 x86 really are a CISC decoder bolted
>> onto the front of what essentially is a load/store RISC.
... and at about that time they also abandoned the last traces of their original von-Neumann architecture. The actual core is quite strictly Harvard now, treating the external RAM banks more like mass storage devices than an actual combined code+data memory.
> Absolutely. But from the user viewpoint, it is the ISA that matters -
That depends rather a lot on who gets to be called the "user". x86 are quite strictly limited to the PC ecosystem these days: boxes and laptops built for Mac OS or Windows, some of them running Linux instead. There the "user" is somebody buying hardware and software from completely unrelated suppliers. I.e. unlike in the embedded world we discuss here, the persons writing software for those things had no say at all what type of CPU is used. They're thus not really the "user." If they were, they probably wouldn't be using an x86. ;-) The actual x86 users couldn't care less about the ISA --- the overwhelming majority of them haven't the slightest idea what an ISA even is. Some of them used to have a vague idea that there was some 32bit vs. a 64bit whatchamacallit somewhere in there, but even that has surely faded away by now, as users no longer even face the decision between them.
On 6/9/2021 12:58 PM, Paul Rubin wrote:
> Phil Hobbs <pcdhSpamMeSenseless@electrooptical.net> writes:
>> But if you're using a RasPi or Beaglebone or something like that, you
>> need a reasonably well-upholstered Linux distro, which has to be
>> patched regularly. At very least it'll need a kernel, and kernel
>> patches affecting security are not exactly rare.
>
> You're in the same situation with almost anything else connected to the
> internet. Think of the notorious "smart light bulbs".
No, that's only if you didn't adequately prepare for such "exposure".

How many Linux/Windows boxes are running un-NEEDED services? Have ports open that shouldn't be? How much emphasis was spent on eking out a few percent extra performance from the network stack that could have, instead, been spent on making it more robust? How many folks RUNNING something like Linux/Windows in their product actually know much of anything about what's under the hood? Do they even know how to BUILD a kernel, let alone sort out what it's doing (wrong)?

Exposed to the 'net, you are always at the mercy of DoS attacks consuming your inbound bandwidth (assuming you have no control of upstream traffic/routing). But even a saturated network connection doesn't have to crash your device. OTOH, if your box is dutifully trying to respond to incoming packets that may be malicious, then you'd better hope that response is "correct" (or at least SAFE) in EVERY case.

For any of these mainstream OSs, an adversary can play with an exact copy of yours 24/7/365 to determine its vulnerabilities before ever approaching your device. And even dig through the sources (of some) to see how a potential attack could unfold. Your device will likely advertise exactly what version of the kernel (and network stack) it is running.

[An adversary can also BUY one of YOUR devices and do the same off-line analysis -- but that analysis will only apply to YOUR device (if you have a proprietary OS/stack) and not a multitude of other exposed devices.]
> On the other hand, you are in reasonable shape if the raspberry pi
> running your fish tank is only reachable through a LAN or VPN.
> Non-networked low end linux boards are also a thing.
Exactly. But that limits utility/accessibility.

If you only need moderate/occasional access, you can implement a "stealth mode" that lets the server hide, "unprotected". Or, require all accesses to be initiated from that server (*to* the remote client) -- similar to a call-back modem (see the sketch below). And, of course, you can place constraints on what can be done over that connection instead of just treating it as "God Mode".

[No, you can't set the heat to 105 degrees in the summer time; I don't care if you happen to have appropriate credentials! And, no, you can't install an update without my verifying you and the update through other mechanisms...]
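For illustration, a minimal sketch of that call-back pattern, assuming a POSIX socket stack; host name and port are hypothetical placeholders and the authentication step is elided. The device never listens -- it dials OUT to a known manager and only then accepts commands over the connection *it* initiated:

#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

/* Dial OUT to a known management host; no listening socket ever exists. */
static int dial_out(const char *host, const char *port)
{
    struct addrinfo hints, *res, *p;
    int fd = -1;

    memset(&hints, 0, sizeof hints);
    hints.ai_family   = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;

    for (p = res; p != NULL; p = p->ai_next) {
        fd = socket(p->ai_family, p->ai_socktype, p->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, p->ai_addr, p->ai_addrlen) == 0)
            break;                     /* session is device-initiated */
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;   /* caller authenticates the peer, then accepts commands */
}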
On 6/9/2021 10:34 AM, Paul Rubin wrote:
> Theo <theom+news@chiark.greenend.org.uk> writes:
>>> Buy yourself a Raspberry Pi 4 and set it up to run your fish tank via a
>>> remote web browser. There's your 64 bit embedded system.
>> I suppose there's a question of what embedded tasks intrinsically require
>> >4GiB RAM, and those that do so because it makes programmers' lives easier?
>
> You can buy a Raspberry Pi 4 with up to 8gb of ram, but the most common
> configuration is 2gb. The cpu is 64 bit anyway because why not?
Exactly. Are they going to give you a *discount* for a 32b version? (Here, you can have this one for half of 'FREE'...)
>> There are obviously plenty of computer systems doing that, but the
>> question I don't know is what applications can be said to be
>> 'embedded' but need that kind of RAM.
>
> Lots of stuff is using 32 bit cpus with a few KB of ram these days. 32
> bits is displacing 8 bits in the MCU world.
>
> Is 64 bit displacing 32 bit in application processors like the Raspberry
> Pi, even when less than 4GB of ram is involved? I think yes, at least
> to some extent, and it will continue. My fairly low end mobile phone
> has 2GB of ram and a 64 bit 4-core processor, I think.
>
> Will 64 bit MCU's displace 32 bit MCUs? I don't know, maybe not.
Some due to need but, I suspect, most due to pricing or other features not available in the 32b world. Just like you don't find PMMUs on 8/16b devices nor in-built NICs.
> Are application processors displacing MCU's in embedded systems? Not
> much in portable and wearable stuff (other than phones) at least for
> now, but in larger devices I think yes, at least somewhat for now, and
> probably more going forward. Even if you're not using networking, it
> makes software and UI development a heck of a lot easier.
This -------------------------------^^^^^^^^^^^^^^^^^^^^^^

Elbow room always takes some of the stress out of design. You don't worry (as much) about bumping into limits and, instead, concentrate on solving the problem at hand.

The idea of packing 8 'bools' into a byte (cuz I only had a hundred or so of them available) is SO behind me, now! Just use something "more convenient"... eight of them!

I pass pages between processes as an efficiency hack -- even if I'm only using a fraction of the page. In smaller processors, I'd be "upset" by this blatant "waste". Instead, I shrug it off and note that it gives me a uniform way of moving data around (instead of having to tweak interfaces to LIMIT the amount of data that I move; or "massage" the data JUST for transport).

My "calculator service" uses BigRationals -- because it's easier than trying to explain to users writing scripts that arithmetic can overflow, suffer rounding errors, that order of operations is important, etc.
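For what it's worth, the two styles side by side -- a toy C sketch, flag names invented:

#include <stdbool.h>
#include <stdint.h>

/* RAM-starved style: eight flags share one byte */
#define FLAG_MOTOR_ON   (1u << 0)
#define FLAG_DOOR_OPEN  (1u << 1)

static uint8_t packed;

static inline void set_flag(uint8_t m)  { packed |= m; }
static inline bool test_flag(uint8_t m) { return (packed & m) != 0; }

/* "Elbow room" style: one bool apiece -- eight times the RAM, but
   simpler code and no read-modify-write on shared bits */
static bool motor_on, door_open;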
On 6/9/2021 10:07 AM, Theo wrote:
> Paul Rubin <no.email@nospam.invalid> wrote:
>> James Brakefield <jim.brakefield@ieee.org> writes:
>>> Am trying to puzzle out what a 64-bit embedded processor should look like.
>>
>> Buy yourself a Raspberry Pi 4 and set it up to run your fish tank via a
>> remote web browser. There's your 64 bit embedded system.
>
> I suppose there's a question of what embedded tasks intrinsically require
> >4GiB RAM, and those that do so because it makes programmers' lives easier?
>
> In other words, you /can/ write a function to detect if your fish tank is
> hot or cold in Javascript that runs in a web app on top of Chromium on top
> of Linux. Or you could make it out of a 6502, or a pair of logic gates.
>
> That's complexity that's not fundamental to the application. OTOH
> maintaining a database that's larger than 4GB physically won't work without
> that amount of memory (or storage, etc).
>
> There are obviously plenty of computer systems doing that, but the question
> I don't know is what applications can be said to be 'embedded' but need
> that kind of RAM.
Transcoding multiple video sources (for concurrent clients) in a single appliance?

I have ~30 cameras, here. Had I naively designed with them all connected to a "camera processor", I suspect memory would be the least of my concerns (motion and scene recognition in 30 places simultaneously?). Instead, it was "easier" to give each camera its own processor. And, gain extended "remotability" as part of the process.

Remember, the 32b address space has to simultaneously hold EVERYTHING that will need to be accessible to your application -- the OS, its memory requirements, the application(s)' tasks, the stacks/heaps for the threads they contain, the data to be processed (in and out), the memory-mapped I/Os consumed by the SoC itself, etc.

When you HAVE a capability/resource, it somehow ALWAYS gets used! ;-)
On 6/9/2021 10:56 AM, Dimiter_Popoff wrote:
> On 6/9/2021 4:29, Don Y wrote:
>> On 6/8/2021 3:01 PM, Dimiter_Popoff wrote:
>>
>>>> Am trying to puzzle out what a 64-bit embedded processor should look like.
>>>> At the low end, yeah, a simple RISC processor. And support for complex
>>>> arithmetic using 32-bit floats? And support for pixel alpha blending
>>>> using quad 16-bit numbers?
>>>> 32-bit pointers into the software?
>>>
>>> The real value in 64 bit integer registers and 64 bit address space is
>>> just that, having an orthogonal "endless" space (well I remember some
>>> 30 years ago 32 bits seemed sort of "endless" to me...).
>>>
>>> Not needing to assign overlapping logical addresses to anything
>>> can make a big difference to how the OS is done.
>>
>> That depends on what you expect from the OS. If you are
>> comfortable with the possibility of bugs propagating between
>> different subsystems, then you can live with a logical address
>> space that exactly coincides with a physical address space.
>
> So how does the linear 64 bit address space get in the way of
> any protection you want to implement? Pages are still 4 k and
> each has its own protection attributes governed by the OS,
> it is like that with 32 bit processors as well (I talk power, I am
> not interested in half baked stuff like ARM, risc-v etc., I don't
> know if there could be a problem like that with one of these).
With a linear address space, you typically have to link EVERYTHING as a single image to place each thing in its own piece of memory (or use segment based addressing).

I can share code between tasks without conflicting addressing; the "data" for one instance of the app is isolated from other instances while the code is untouched -- the code doesn't even need to know that it is being invoked on different "data" from one timeslice to the next. In a flat address space, you'd need the equivalent of a "context pointer" that you'd have to pass to the "shared code" (see the sketch below). And, have to hope that all of your context could be represented in a single such reference! (I can rearrange physical pages so they each appear "where expected" to a bit of const CODE.)

Similarly, the data passed (or shared) from one task (process) to another can "appear" at entirely different logical addresses "at the same time" as befitting the needs of each task WITHOUT CONCERN (or awareness) of the existence of the other task. Again, I don't need to pass a pointer to the data; the address space has been manipulated to make sure it's where it should be.

The needs of a task can be met by resources "harvested" from some other task. E.g., where is the stack for your TaskA? How large is it? How much of it is in-use *now*? How much can it GROW before it bumps into something (because that something occupies space in "its" address space)?

I start a task (thread) with a single page of stack. And, a limit on how much it is allowed to consume during its execution. Then, when it pushes something "off the end" of that page, I fault a new page in and map it at the faulting address. This continues as the task's stack needs grow. When I run out of available pages, I do a GC cycle to reclaim pages from (other?) tasks that are no longer using them.

In this way, I can effectively SHARE a stack (or heap) between multiple tasks -- without having to give any consideration for where, in memory, they (or the stacks!) reside. I can move a page from one task (full of data) to another task at some place that the destination task finds "convenient". I can import a page from another network device or export one *to* another device.

Because each task's address space is effectively empty/sparse, mapping a page doesn't require much effort to find a "free" place for it. I can put constraints on each such mapping -- and then runtime checks to ensure "things are as I expect": "Why is this NIC buffer residing in this particular portion of the address space?"

With a task bound to a semicontiguous portion of memory, it can deal with that region as if it were a smaller virtual region. I can store 32b pointers to things if I know that my addresses are based from 0x000 and the task never extends beyond a 4GB region. If available, I can exploit "shorter" addressing modes.
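To make the contrast concrete, here is a minimal sketch of the context-pointer style that a flat address space forces on shared code (the struct contents are invented):

#include <stddef.h>

struct instance_ctx {     /* everything one instance of the app needs */
    int  count;
    char name[16];
};

/* One copy of the code serves many instances, but EVERY caller (and
   every callee below it) must thread the context through explicitly. */
static void step(struct instance_ctx *ctx)
{
    ctx->count++;
}

With per-task mapping, step() could instead reference a fixed logical address whose physical backing differs per task -- no parameter at all.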
> There is *nothing* to gain on a 64 bit machine from segmentation, assigning
> overlapping address spaces to tasks etc.
What do you gain by NOT using it? You're still dicking with the MMU.

(If you aren't, then what value is the MMU in your "logical" space? Map each physical page to a corresponding logical page and never talk to the MMU again; store const page tables and let your OS just tweak the base pointer for the TLBs to use for THIS task.)

You still have to "position" physical resources in particular places (and you have to deal with the constraints of all tasks, simultaneously, instead of just those constraints imposed by the "current task").
> Notice I am talking *logical* addresses, I was explicit about
> that.
On 09/06/2021 22:52, Hans-Bernhard Bröker wrote:
> On 09.06.2021 10:40, David Brown wrote:
>> On 09/06/2021 06:16, George Neuner wrote:
>
>>> Since (at least) the Pentium 4 x86 really are a CISC decoder bolted
>>> onto the front of what essentially is a load/store RISC.
>
> ... and at about that time they also abandoned the last traces of their
> original von-Neumann architecture. The actual core is quite strictly
> Harvard now, treating the external RAM banks more like mass storage
> devices than an actual combined code+data memory.
>
>> Absolutely. But from the user viewpoint, it is the ISA that matters -
>
> That depends rather a lot on who gets to be called the "user".
>
I meant "the person using the ISA" - i.e., the programmer. And even then, I meant low-level programmers who have to understand things like memory models, cache thrashing, coding for vectors and SIMD, etc. These are the people who see the ISA. I was not talking about the person wiggling the mouse and watching youtube!
On 6/10/2021 3:12, Don Y wrote:
> On 6/9/2021 10:56 AM, Dimiter_Popoff wrote:
>> On 6/9/2021 4:29, Don Y wrote:
>>> On 6/8/2021 3:01 PM, Dimiter_Popoff wrote:
>>>
>>>>> Am trying to puzzle out what a 64-bit embedded processor should
>>>>> look like.
>>>>> At the low end, yeah, a simple RISC processor. And support for
>>>>> complex arithmetic using 32-bit floats? And support for pixel
>>>>> alpha blending using quad 16-bit numbers?
>>>>> 32-bit pointers into the software?
>>>>
>>>> The real value in 64 bit integer registers and 64 bit address space is
>>>> just that, having an orthogonal "endless" space (well I remember some
>>>> 30 years ago 32 bits seemed sort of "endless" to me...).
>>>>
>>>> Not needing to assign overlapping logical addresses to anything
>>>> can make a big difference to how the OS is done.
>>>
>>> That depends on what you expect from the OS. If you are
>>> comfortable with the possibility of bugs propagating between
>>> different subsystems, then you can live with a logical address
>>> space that exactly coincides with a physical address space.
>>
>> So how does the linear 64 bit address space get in the way of
>> any protection you want to implement? Pages are still 4 k and
>> each has its own protection attributes governed by the OS,
>> it is like that with 32 bit processors as well (I talk power, I am
>> not interested in half baked stuff like ARM, risc-v etc., I don't
>> know if there could be a problem like that with one of these).
>
> With a linear address space, you typically have to link EVERYTHING
> as a single image to place each thing in its own piece of memory
> (or use segment based addressing).
Nothing could be further from the truth. What kind of crippled environment can make you think that? Code can be position independent on processors which are not dead by design nowadays.

When I started dps some 27 years ago I allowed program modules to demand a fixed address at which they would reside. This exists to this day and has been used 0 (zero) times. Same about object descriptors, program library modules etc.

The first system call I wrote is called "allocm$", allocate memory. You request a number of bytes and you get back an address and the actual number of bytes you were given (it comes rounded up by the memory cluster size, typically 4k, a page). This was the *first* thing I did. And yes, all allocation is done using a worst-fit strategy, sometimes enhanced worst fit - things the now popular OS-s have yet to get to; they still have to defragment their disks, LOL.
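As a rough illustration (this is NOT dps code -- just a toy with the same shape: ask for n bytes, get back an address plus the cluster-rounded size, carved worst-fit so the leftover fragments stay as large as possible):

#include <stddef.h>
#include <stdint.h>

#define CLUSTER 4096u                /* cluster size per the description */

struct free_blk {
    size_t           size;           /* bytes, multiple of CLUSTER */
    struct free_blk *next;
};

static struct free_blk *free_list;   /* seeded elsewhere with all free RAM */

void *alloc_worst_fit(size_t want, size_t *got)
{
    size_t need = (want + CLUSTER - 1) & ~(size_t)(CLUSTER - 1);
    struct free_blk **pp, **worst = NULL;

    /* worst fit: satisfy the request from the LARGEST suitable block */
    for (pp = &free_list; *pp != NULL; pp = &(*pp)->next)
        if ((*pp)->size >= need &&
            (worst == NULL || (*pp)->size > (*worst)->size))
            worst = pp;
    if (worst == NULL)
        return NULL;                 /* no block big enough */

    struct free_blk *blk = *worst;
    *got = need;
    if (blk->size == need) {         /* exact fit: unlink the whole block */
        *worst = blk->next;
        return blk;
    }
    blk->size -= need;               /* otherwise carve the tail off */
    return (uint8_t *)blk + blk->size;
}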
>
> I can share code between tasks without conflicting addressing;
> the "data" for one instance of the app is isolated from other
> instances while the code is untouched -- the code doesn't even
> need to know that it is being invoked on different "data"
> from one timeslice to the next. In a flat address space,
> you'd need the equivalent of a "context pointer" that you'd
> have to pass to the "shared code". And, have to hope that
> all of your context could be represented in a single such
> reference! (I can rearrange physical pages so they each
> appear "where expected" to a bit of const CODE).
>
> Similarly, the data passed (or shared) from one task (process) to
> another can "appear" at entirely different logical addresses
> "at the same time" as befitting the needs of each task WITHOUT
> CONCERN (or awareness) of the existence of the other task.
> Again, I don't need to pass a pointer to the data; the address
> space has been manipulated to make sure it's where it should be.
So how do you pass the offset from the page beginning if you do not pass an address? And how is page manipulation simpler and/or safer than just passing an address? Sounds like a recipe for quite a mess to me.

In a 64 bit address space there is nothing stopping you from passing addresses or not passing them, and allowing access to areas you want to while disallowing it elsewhere.

Other than that there is nothing to be gained by a 64 bit architecture really; on 32 bit machines you do have FPUs, vector units etc. doing calculation probably faster than the integer unit of a 64 bit processor. The *whole point* of a 64 bit core is the 64 bit address space.
>
> The needs of a task can be met by resources "harvested" from
> some other task. E.g., where is the stack for your TaskA?
> How large is it? How much of it is in-use *now*? How much
> can it GROW before it bumps into something (because that something
> occupies space in "its" address space).
This is the beauty of 64 bit logical address space. You allocate enough logical memory and then you allocate physical on demand, this is what MMUs are there for. If you want to grow your stack indefinitely - the messy C style - you can just allocate it a few gigabytes of logical memory and use the first few kilobytes of it at no waste of resources. Of course there are much slicker ways to deal with memory allocation.
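That reserve-logical/commit-physical-on-demand pattern, sketched with POSIX mmap on a 64-bit hosted system (MAP_NORESERVE is a Linux-ism; the numbers are illustrative):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t reserve = 4UL << 30;          /* 4 GiB of *logical* space */

    char *region = mmap(NULL, reserve, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (region == MAP_FAILED) { perror("mmap"); return 1; }

    region[reserve - 1] = 42;            /* first touch commits ONE page */
    printf("4 GiB reserved, roughly one page actually resident\n");

    munmap(region, reserve);
    return 0;
}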
>
> I start a task (thread) with a single page of stack. And, a
> limit on how much it is allowed to consume during its execution.
> Then, when it pushes something "off the end" of that page,
> I fault a new page in and map it at the faulting address.
> This continues as the task's stack needs grow.
This is called "allocate on demand" and has been around for times immemorial, check my former paragraph.
>
> When I run out of available pages, I do a GC cycle to
> reclaim pages from (other?) tasks that are no longer using
> them.
This is called "memory swapping", also for times immemorial. For the case when there is no physical memory to reclaim, that is. The first version of dps - some decades ago - ran on a CPU32 (a 68340). It had no MMU so I implemented "memory blocks", a task can declare a piece a swap-able block and allow/disallow its swapping. Those blocks would then be shared or written to disk when more memory was needed etc., memory swapping without an MMU. Worked fine, must be still working for code I have not touched since on my power machines, all those decades later.
>
> In this way, I can effectively SHARE a stack (or heap)
> between multiple tasks -- without having to give any
> consideration for where, in memory, they (or the stacks!)
> reside.
You can do this in a linear address space, too - this is what the MMU is for.
>
> I can move a page from one task (full of data) to another
> task at some place that the destination task finds "convenient".
> I can import a page from another network device or export
> one *to* another device.
So instead of simply passing an address you have to switch page translation entries, adjust them on each task switch, flush and sync whatever it takes - does not sound very efficient to me.
>
> Because each task's address space is effectively empty/sparse,
> mapping a page doesn't require much effort to find a "free"
> place for it.
This is the beauty of having the 64 bit address space, you always
have enough logical memory. The "64 bit address space per task"
buys you *nothing*.

Dimiter

======================================================
Dimiter Popoff, TGI             http://www.tgi-sci.com
======================================================
http://www.flickr.com/photos/didi_tgi/
On 6/10/2021 3:45 AM, Dimiter_Popoff wrote:

[attrs elided]

>>>>> Not needing to assign overlapping logical addresses to anything
>>>>> can make a big difference to how the OS is done.
>>>>
>>>> That depends on what you expect from the OS. If you are
>>>> comfortable with the possibility of bugs propagating between
>>>> different subsystems, then you can live with a logical address
>>>> space that exactly coincides with a physical address space.
>>>
>>> So how does the linear 64 bit address space get in the way of
>>> any protection you want to implement? Pages are still 4 k and
>>> each has its own protection attributes governed by the OS,
>>> it is like that with 32 bit processors as well (I talk power, I am
>>> not interested in half baked stuff like ARM, risc-v etc., I don't
>>> know if there could be a problem like that with one of these).
>>
>> With a linear address space, you typically have to link EVERYTHING
>> as a single image to place each thing in its own piece of memory
>> (or use segment based addressing).
>
> Nothing could be further from the truth. What kind of crippled
> environment can make you think that? Code can be position
> independent on processors which are not dead by design nowadays.
>
> When I started dps some 27 years ago I allowed program modules
> to demand a fixed address at which they would reside. This exists
> to this day and has been used 0 (zero) times. Same about object
> descriptors, program library modules etc.
>
> The first system call I wrote is called "allocm$", allocate memory.
> You request a number of bytes and you get back an address and the
> actual number of bytes you were given (it comes rounded up by the
> memory cluster size, typically 4k, a page). This was the *first*
> thing I did. And yes, all allocation is done using a worst-fit
> strategy, sometimes enhanced worst fit - things the now popular
> OS-s have yet to get to; they still have to defragment their
> disks, LOL.
You missed my point -- possibly because this issue was raised BEFORE pointing out how much DYNAMIC management of the MMU (typically an OS-delegated activity) "buys you":

  "That depends on what you expect from the OS."

If you can ignore the MMU *completely*, then the OS is greatly simplified. YOU (developer) take on the responsibilities of remembering what is where, etc. EVERYTHING is visible to EVERYONE and at EVERY time. The OS doesn't have to get involved in the management of objects/tasks/etc. That's YOUR responsibility to ensure your taskA doesn't go dicking around with taskB's resources. Welcome to the 8/16b world!

The next step up is to statically deploy the MMU. You build a SINGLE logical address space to suit your liking. Then, map the underlying physical resources to it as best fits. And, this never needs to change -- memory doesn't "move around", it doesn't change characteristics (readable, writeable, executable, accessible-by-X, etc.)! But, you can't then change permissions based on which task is executing -- unless you want to dick with the MMU dynamically (or swap between N discrete sets of STATIC page tables that define the many different ways M tasks can share permissions).

So, you *just* use the MMU as a Memory Protection Unit; you mark sections of memory that have CODE in them as no-write, you mark regions with DATA as no-execute, and everything else as no-access. And that's the way it stays for EVERY task! This lets you convert RAM to ROM and prevents "fetches" from "DATA" memory. It ensures your code is never overwritten, that the processor never tries to execute out of "data memory", and that NOTHING tries to access address regions that are "empty"! You've implemented a 1980's vintage protection scheme (this is how we designed arcade pieces, back then, as you wanted your CODE and FRAME BUFFER to occupy the same limited range of addresses). <yawn>

Once you start using the MMU to dynamically *manage* memory (which includes altering protections and re-mapping), then the cost of the OS increases -- because these are typically things that are delegated *to* the OS. Whether you have overlapping address spaces or a single flat address space is immaterial -- you need to dynamically manage separate page tables for each task in either scheme.

You can't argue that the OS doesn't need to dick with the MMU "because it's a flat address space" -- unless you forfeit those abilities (that I illustrated in my post). If you want to compare a less-able OS to one that is more featured, then it's disingenuous to blame that on overlapping address spaces; the real "blame" lies in the support of more advanced features.

The goal of an OS should be to make writing *correct* code easier by providing features as enhancements. It's why the OS typically reads disk files instead of replicating that file system and driver code into each task that needs to do so. Or, why it implements delays/timers -- so each task doesn't reinvent the wheel (with its own unique set of bugs).

You can live without an OS. But, typically only for a trivial application. And, you're not likely to use a 64b processor just to count characters received on a serial port! Or as an egg timer!
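That "set it once and forget it" MPU-style configuration amounts to a couple of calls; a hedged sketch using POSIX mprotect, with the usual (toolchain-dependent, names illustrative) linker-script section symbols:

#include <stddef.h>
#include <sys/mman.h>

/* Section boundaries supplied by the linker script; assumed page-aligned. */
extern char __text_start[], __text_end[];
extern char __data_start[], __data_end[];

static void lock_down_once(void)
{
    /* code: read + execute, never writable again ("RAM becomes ROM") */
    mprotect(__text_start, (size_t)(__text_end - __text_start),
             PROT_READ | PROT_EXEC);

    /* data: read + write, never executable */
    mprotect(__data_start, (size_t)(__data_end - __data_start),
             PROT_READ | PROT_WRITE);

    /* anything left unmapped stays no-access: "empty" addresses fault */
}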
>> I can share code between tasks without conflicting addressing;
>> the "data" for one instance of the app is isolated from other
>> instances while the code is untouched -- the code doesn't even
>> need to know that it is being invoked on different "data"
>> from one timeslice to the next. In a flat address space,
>> you'd need the equivalent of a "context pointer" that you'd
>> have to pass to the "shared code". And, have to hope that
>> all of your context could be represented in a single such
>> reference! (I can rearrange physical pages so they each
>> appear "where expected" to a bit of const CODE).
>>
>> Similarly, the data passed (or shared) from one task (process) to
>> another can "appear" at entirely different logical addresses
>> "at the same time" as befitting the needs of each task WITHOUT
>> CONCERN (or awareness) of the existence of the other task.
>> Again, I don't need to pass a pointer to the data; the address
>> space has been manipulated to make sure it's where it should be.
>
> So how do you pass the offset from the page beginning if you do
> not pass an address?
YOU pass an object to the OS and let the OS map it where *it* wants, with possible hints from the targeted task (logical address space). I routinely pass multiple-page-sized objects around the system:

"Here's a 20MB telephone recording, memory mapped (to wherever YOU, its recipient, want it). Because it is memory mapped and has its own pager, the actual amount of physical memory that is in use at any given time can vary -- based on the resource allocation you've been granted and the current resource availability in the system. E.g., there may be as little as one page of physical data present at any given time -- and that page may "move" to back a different logical address based on WHERE you are presently looking! Go through and sort out when Bob is speaking and when Tom is speaking. "Return" an object of UNKNOWN length that lists each of these time intervals along with the speaker assumed to be talking in each. Tell me where you (the OS) decided it would best fit into my logical address space, after consulting the hint I provided (but that you may not have been able to honor because the result ended up *bigger* than the "hole" I had imagined it fitting into). No need to tell me how big it really is, as I will be able to parse it (cuz I know how you will have built that list) and the OS will track the memory that it uses, so all I have to do is free() it (it may be built out of 1K pages, 4K pages, 16MB pages)!"

How is this HARDER to do when a single task has an entire 64b address space instead of when it has to SHARE *a* single address space among all tasks/objects?
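On a hosted system the nearest off-the-shelf analogue is POSIX shared memory: the sender publishes an object, the recipient maps it wherever suits the recipient, and the address "hint" is purely advisory. A sketch (the object name is invented):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Recipient side: map the passed object where it suits THIS task. */
void *receive_object(const char *name, size_t len, void *hint)
{
    int fd = shm_open(name, O_RDONLY, 0);
    if (fd < 0)
        return NULL;

    /* 'hint' is advisory; the kernel may place the mapping elsewhere */
    void *p = mmap(hint, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                    /* the mapping outlives the descriptor */
    return (p == MAP_FAILED) ? NULL : p;
}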
> And how is page manipulation simpler and/or safer than just passing
> an address? Sounds like a recipe for quite a mess to me.
The MMU has made that mapping a "permanent" part of THIS task's address space. It isn't visible to any other task -- why *should* it be? Why does the pointer need to indirectly reflect the fact that portions of that SINGLE address space are ineligible to contain said object because of OTHER unrelated (to this task) objects??
> In a 64 bit address space there is nothing stopping you from
> passing addresses or not passing them, and allowing access to areas
> you want to while disallowing it elsewhere.
And I can't do that in N overlapping 64b address spaces? The only "win" you get is by exposing everything to everyone. That's not the way software is evolving. Compartmentalization (to protect from other actors), opacity (to hide implementation details), accessors (instead of exposing actual data), etc. This comes at a cost -- in performance as well as OS design. But, *seems* to be worth the effort, given how "mainstream" development is heading.
> Other than that there is nothing to be gained by a 64 bit architecture
> really; on 32 bit machines you do have FPUs, vector units etc.
> doing calculation probably faster than the integer unit of a
> 64 bit processor.
> The *whole point* of a 64 bit core is the 64 bit address space.
No, the whole point of a 64b core is the 64b registers. You can package a 64b CPU so that only 20 (!) address lines are bonded out. This limits the physical address space to 20b. What value is there in making the logical address space bigger -- so you can leave gaps for expansion between objects??
>> The needs of a task can be met by resources "harvested" from
>> some other task. E.g., where is the stack for your TaskA?
>> How large is it? How much of it is in-use *now*? How much
>> can it GROW before it bumps into something (because that something
>> occupies space in "its" address space).
>
> This is the beauty of 64 bit logical address space. You allocate
> enough logical memory and then you allocate physical on demand,
> this is what MMUs are there for. If you want to grow your stack
> indefinitely - the messy C style - you can just allocate it
> a few gigabytes of logical memory and use the first few kilobytes
> of it at no waste of resources. Of course there are much slicker
> ways to deal with memory allocation.
Again, how is this any harder with "overlapping" 64b address spaces? Or, how is it EASIER with nonoverlap?
>> I start a task (thread) with a single page of stack. And, a
>> limit on how much it is allowed to consume during its execution.
>> Then, when it pushes something "off the end" of that page,
>> I fault a new page in and map it at the faulting address.
>> This continues as the task's stack needs grow.
>
> This is called "allocate on demand" and has been around
> since time immemorial; check my former paragraph.
I'm not trying to be "novel". Rather, I'm showing that these features come from the MMU -- not a "nonoverlapping" (or overlapping!) address space. I.e., the takeaway from all this is that the MMU is the win AND the cost for the OS. Without it, the OS gets simpler... and less capable!
>> When I run out of available pages, I do a GC cycle to
>> reclaim pages from (other?) tasks that are no longer using
>> them.
>
> This is called "memory swapping", also since time immemorial.
> For the case when there is no physical memory to reclaim, that
> is.
> The first version of dps - some decades ago - ran on a CPU32
> (a 68340). It had no MMU so I implemented "memory blocks":
> a task can declare a piece of memory a swappable block and
> allow/disallow its swapping. Those blocks would then be shared
> or written to disk when more memory was needed etc. - memory
> swapping without an MMU. Worked fine; must be still working for
> code I have not touched since on my power machines, all those
> decades later.
There's no disk involved. The amount of physical memory is limited to what's on-board (unless I try to move resources to another node or -- *gack* -- use a scratch table in the RDBMS as a backing store).

Recovering "no longer in use" portions of stack is "low hanging fruit"; look at the task's stack pointer and you know how much allocated stack is no longer in use. Try to recover it (of course, the task may immediately fault another page back into play, but that's an optimization issue).

If there is no "low hanging fruit", then I ask tasks to voluntarily relinquish memory. Some tasks may have requested "extra" memory in order to precompute results for future requests/activities. If it was available -- and if the task wanted to "pay" for it -- then the OS would grant the allocation (knowing that it could eventually revoke it!). They could relinquish those resources at the expense of having to recompute those things at a later date ("on demand" *or* when memory is again available).

If I can't recover enough resources "voluntarily", then I *take* memory away from a (selected) task and inform it (raise an exception that it will handle as soon as it gets a timeslice) of that "theft". It will either recover from the loss (because it was being greedy and didn't elect to forfeit excess memory that it had allocated when I asked, earlier) *or* it will crash. <shrug> When you run out of resources, SOMETHING has to give (and the OS is better suited to determining WHAT than the individual tasks are... they ALL think *they* are important!)

Again, "what do you expect from your OS?"
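That "low hanging fruit" pass really is small. A hedged sketch for a downward-growing stack (madvise stands in for whatever the kernel's own page recycler would actually do):

#include <stdint.h>
#include <sys/mman.h>

#define PAGE 4096UL

/* Stack occupies [limit, base) and grows DOWN toward 'limit'; pages
   wholly below the one holding SP were pushed into once but are dead now. */
void reclaim_dead_stack(uintptr_t limit, uintptr_t sp)
{
    uintptr_t live = sp & ~(PAGE - 1);     /* page containing SP: keep it */

    if (live > limit)
        madvise((void *)limit, live - limit, MADV_DONTNEED);
}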
>> In this way, I can effectively SHARE a stack (or heap)
>> between multiple tasks -- without having to give any
>> consideration for where, in memory, they (or the stacks!)
>> reside.
>
> You can do this in a linear address space, too - this is what
> the MMU is for.
Yes, see? There's nothing special about a flat address space!
>> I can move a page from one task (full of data) to another
>> task at some place that the destination task finds "convenient".
>> I can import a page from another network device or export
>> one *to* another device.
>
> So instead of simply passing an address you have to switch page
> translation entries, adjust them on each task switch, flush and
> sync whatever it takes - does not sound very efficient to me.
It's not intended to be fast/efficient. It's intended to ensure that the recipient -- AND ONLY THE RECIPIENT -- is *now* granted access to that page's contents. Depending on semantics, it can create a copy of an object or "move" the object, leaving a "hole" in the original location.

[I.e., with move semantics, the original owner shouldn't be trying to access something that he's "given away"! Any access, by him, to that memory region should signal a fatal exception!]

If you don't care who sees what, then you don't need the MMU! And we're back to my initial paragraph of this reply! :>
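The revocation half of those move semantics is a one-liner on a hosted system -- a sketch, assuming a page-aligned region:

#include <stddef.h>
#include <sys/mman.h>

/* After the hand-off, strip the donor's rights: any later touch by the
   donor takes the fatal exception described above. */
int revoke_donor_access(void *region, size_t len)
{
    return mprotect(region, len, PROT_NONE);
}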
>> Because each task's address space is effectively empty/sparse,
>> mapping a page doesn't require much effort to find a "free"
>> place for it.
>
> This is the beauty of having the 64 bit address space, you always
> have enough logical memory. The "64 bit address space per task"
> buys you *nothing*.
If "always having enough logical memory" is such a great thing, isn't having MORE logical memory (because you've moved other things into OVERLAPPING portions of that memory space) an EVEN BETTER thing? Again, what does your flat addressing BUY the OS in terms of complexity reduction? (your initial assumption) "...a big difference to how the OS is done"
On 6/10/2021 16:55, Don Y wrote:
> On 6/10/2021 3:45 AM, Dimiter_Popoff wrote:
>
> [attrs elided]
Don, this becomes way too lengthy and repetitive.

You keep on saying that a linear 64 bit address space means exposing everything to everybody, after I explained this is not true at all.

You keep on claiming this or that about how I do things without bothering to understand what I said - like your claim that I use the MMU for "protection only". NO, this is not true either. On 32 bit machines - as mine in production are - mapping 4G of logical space into, say, 128M of physical memory goes all the way through page translation, block translation for regions where page translation would be impractical, etc. You sound the way I would have sounded before I had written, and built on for years, what is now dps. The devil is in the detail :-).

You pass "objects", pages etc. Well guess what, it *always* boils down to an *address* for the CPU. The rest is generic talk. And if you choose to have overlapping address spaces, when you pass a pointer from one task to another the OS has to deal with this at a significant cost. In a linear address space, you pass the pointer *as is*, so the OS does not have to deal with anything except access restrictions. In dps, you can send a message to another task - the message being data the OS will copy into that task's memory, the data being perfectly able to be an address of something in another task's memory. If a task accesses an address it is not supposed to, the user is notified and allowed to press CR to kill that task. Then there are common data sections for groups of tasks etc.; it is pretty huge really.

The concept "one entire address space to all tasks" is from the 60s if not earlier (I just don't know and don't care to check now) and it has done a good job while it was necessary, mostly on 16 bit CPUs. For today's processors this means just making them run with the handbrake on; *nothing* is gained because of that - no more security (please don't repeat that "expose everything" nonsense), just burning more CPU power, constantly having to remap addresses etc.

Dimiter

======================================================
Dimiter Popoff, TGI             http://www.tgi-sci.com
======================================================
http://www.flickr.com/photos/didi_tgi/
On 6/10/2021 8:32 AM, Dimiter_Popoff wrote:
> On 6/10/2021 16:55, Don Y wrote:
>> On 6/10/2021 3:45 AM, Dimiter_Popoff wrote:
>>
>> [attrs elided]
>
> Don, this becomes way too lengthy and repetitive.
>
> You keep on saying that a linear 64 bit address space means exposing
> everything to everybody, after I explained this is not true at all.
TaskA has built a structure -- a page worth of data residing at 0x123456. It wants to pass this to TaskB so that TaskB can perform some operations on it.

Can TaskB access the data at 0x123456 *before* TaskA has told it to do so?

Can TaskB access the data at 0x123456 WHILE TaskA is manipulating it?

Can TaskA alter the data at 0x123456 *after* it has "passed it along" to TaskB -- possibly while TaskB is still using it?
> You keep on claiming this or that about how I do things without
> bothering to understand what I said - like your claim that I use the MMU
> for "protection only".
I didn't say that YOU did that. I said that to be able to ignore the MMU after setting it up, you can ONLY use it to protect code from alteration, data from execution, etc. The "permissions" that it applies have to be invariant over the execution time of ALL of the code.

So, if you DON'T use it "for protection only", then you are admitting to having to dynamically tweak it. *THIS* is the cost that the OS incurs -- and having a flat address space doesn't make it any easier! If you aren't incurring that cost, then you're not protecting something.
> NO, this is not true either. On 32 bit machines - as mine in
> production are - mapping 4G of logical space into, say, 128M of
> physical memory goes all the way through page translation, block
> translation for regions where page translation would be impractical,
> etc. You sound the way I would have sounded before I had written,
> and built on for years, what is now dps. The devil is in the
> detail :-).
>
> You pass "objects", pages etc. Well guess what, it *always* boils
> down to an *address* for the CPU. The rest is generic talk.
Yes, the question is "who manages the protocol for sharing". Since forever, you could pass pointers around and let anyone access anything they wanted. You could impose -- but not ENFORCE -- schemes that ensured data was shared properly (e.g., so YOU wouldn't be altering data that *I* was using).

[Monitors can provide some structure to that sharing, but are costly when you consider the number of things that may potentially need to be shared. And, you can still poke directly at the data being shared, bypassing the monitor, if you want to (or have a bug).]

But, you had to rely on programming discipline to ensure this worked. Just like you have to rely on discipline to ensure code is "bugfree" (how's that worked for the industry?)
> And if you choose to have overlapping address spaces, when you
> pass a pointer from one task to another the OS has to deal with this
> at a significant cost.
How does your system handle the above example? How do you "pass" the pointer from TaskA to TaskB -- if not via the OS? Do you expose a shared memory region that both tasks can use to exchange data and hope they follow some rules? Always use synchronization primitives for each data exchange? RELY on the developer to get it right? ALWAYS?

Once you've passed the pointer, how does TaskB access that data WITHOUT having to update the MMU? Or, has TaskB had access to the data all along? What happens when B wants to pass the modified data to C? Does the MMU have to be updated (C's tables) to grant that access? Or, like B, has C had access all along? And, has C had to remain disciplined enough not to go mucking around with that region of memory until A *and* B are done modifying it?

I don't allow anyone to see anything -- until the owner of that thing explicitly grants access. If you try to access something before it's been made available for your access, the OS traps and aborts your process -- you've violated the discipline and the OS is going to enforce it! In an orderly manner that doesn't penalize other tasks that have behaved properly.
> In a linear address space, you pass the pointer *as is*, so the OS does
> not have to deal with anything except access restrictions.
> In dps, you can send a message to another task - the message being
> data the OS will copy into that task's memory, the data being
> perfectly able to be an address of something in another task's
So, you don't use the MMU to protect TaskA's resources from TaskB (or TaskC!) access. You expect LESS from your OS.
> memory. If a task accesses an address it is not supposed to,
> the user is notified and allowed to press CR to kill that task.
What are the addresses "it's not supposed to" access? Some *subset* of the addresses that "belong" to other tasks? Perhaps I can access a buffer that belongs to TaskB but not TaskB's code? Or, some OTHER buffer that TaskB doesn't want me to see?

Do you explicitly have to locate ("org") each buffer so that you can place SOME in protected portions of the address space and others in shared areas? How do you change these distinctions dynamically -- or, do you do a lot of data copying from "protected" space to "shared" space?
> Then there are common data sections for groups of tasks etc.;
> it is pretty huge really.
Again, you expose things by default -- even if only a subset of things. You create shared memory regions where there are no protections and then rely on your application to behave and not access data (that has been exposed for its access) until it *should*. Everybody does this. And everyone has bugs as a result.

You are relying on the developer to *repeatedly* implement the sharing protocol -- instead of relying on the OS to enforce that for you. It's like putting tons of globals in your application -- to make data sharing easier (and, thus, more prone to bugs). You expect less of your OS.

My tasks are free to do whatever they want in their own protection domain. They KNOW that nothing can SEE the data they are manipulating *or* observe HOW they are manipulating it or *influence* their manipulation of it. Until they want to expose that data. And, then, only to those entities that they think SHOULD see it.

They can give (hand off) data to another entity -- much like call-by-value semantics -- and have the other entity know that NOTHING the original "donor" can do AFTER that handoff will affect the data that has been "passed" to them. Yet, they can still manipulate that data -- update it or reuse that memory region -- for the next "client". The OS enforces these guarantees. Much more than just passing along a pointer to the data!

Trying to track down the donor's alteration of data while the recipient is concurrently accessing it (multiple tasks, multiple cores, multiple CPUs) is a nightmare proposition. And, making an *unnecessary* copy of it is a waste of resources (esp. if the two parties actually ARE well-behaved).
> The concept "one entire address space to all tasks" is from the 60s
> if not earlier (I just don't know and don't care to check now) and it
> has done a good job while it was necessary, mostly on 16 bit CPUs.
> For today's processors this means just making them run with the
> handbrake on; *nothing* is gained because of that - no more security
> (please don't repeat that "expose everything" nonsense), just
> burning more CPU power, constantly having to remap addresses etc.
Remapping is done in hardware. The protection overhead is a matter of updating page table entries.

*You* gain nothing by creating a flat address space because *you* aren't trying to compartmentalize different tasks and subsystems. You likely protect the kernel's code/data from direct interference from "userland" (U/S bit) but want the costs of sharing between tasks to be low -- at the expense of forfeiting protections between them.

*Most* of the world consists of imperfect coders. *Most* of us have to deal with colleagues (of varying abilities) before, after and during our tenure running code on the same CPU as our applications. "The bug is (never!) in my code! So, it MUST be in YOURS!" You can either stare at each other, confident in the correctness of your own code. Or, find the bug IN THE OTHER GUY'S CODE (you can't prove yours is correct any more than he can; so you have to find the bug SOMEWHERE to make your point), effectively doing his debugging *for* him.

Why do you think desktop OSs go to such lengths to compartmentalize applications? Aren't the coders of application A just as competent as those who coded application B? Why would you think application A might stomp on some resource belonging to application B? Wouldn't that be a violation of DISCIPLINE (and outright RUDE)?

You've been isolated from this for far too long. So, you don't see what it's like to have to deal with another(s)' code impacting the same product that *you* are working on. Encapsulation and opacity are the best ways to ensure all interactions with your code/data are through permitted interfaces.

"Who overwrote my location 0x123456? I know *I* didn't..."
"Who turned on power to the motor? I'm the only one who should do so!"
"Who deleted the log file?"

There's a reason we eschew globals!

I can ensure TaskB can't delete the log file -- by simply denying him access to logfile.delete(). But, letting him use logfile.append() as much as he wants! At the same time, allowing TaskA to delete or logfile.rollover() as it sees fit -- because I've verified that TaskA does this appropriately as part of its contract. And, there's no NEED for TaskB to ever do so -- it's not B's responsibility (so why allow him the opportunity to ERRONEOUSLY do so -- and then have to chase down how this happened?)

If TaskB *tries* to access logfile.delete(), I can trap to make his violation obvious: "Reason for process termination: illegal access". And, I don't need to do this with pointers or hardware protection of the pages in which logfile.delete() resides! I just don't let him invoke *that* method!

I *expect* my OS to provide these mechanisms to the developer to make his job easier AND the resulting code more robust. There is a cost to all this. But, *if* something misbehaves, it leaves visible evidence of its DIRECT actions; you don't have to wonder WHEN (in the past) some datum was corrupted that NOW manifests as an error in some, possibly unrelated, manner.

Of course, you don't need any of this if you're a perfect coder.

You don't expose the internals of your OS to your tasks, do you? Why? Don't you TRUST them to observe proper discipline in their interactions with it? You trust them to observe those same disciplines when interacting with each other... Why can't TaskA see the preserved state for TaskB? Don't you TRUST it to only modify it if it truly knows what it's doing? Not the result of resolving some errant pointer?

Welcome to the 70's!
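One lightweight way to get that per-task method granting, sketched in C (all names invented): hand each task its own capability table and simply leave out what it wasn't granted.

#include <stddef.h>

/* Implemented elsewhere (hypothetical) */
int log_append(const char *msg);
int log_rollover(void);
int log_delete(void);

struct logfile_cap {
    int (*append)(const char *msg);
    int (*rollover)(void);        /* NULL: not granted */
    int (*del)(void);             /* NULL: not granted */
};

/* TaskA's contract was audited: full grant */
static const struct logfile_cap taskA_cap =
    { log_append, log_rollover, log_delete };

/* TaskB only ever needs to append -- so that's ALL it can do;
   an attempt to use a NULL slot is where the "illegal access" trap goes */
static const struct logfile_cap taskB_cap = { log_append, NULL, NULL };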
