
Page sizes

Started by Don Y June 27, 2020
On 6/29/2020 8:43 PM, upsidedown@downunder.com wrote:
>> But, here, we're not just talking about it as an addressing/protection mode
>> but, also, as a mechanism for VMM.  Unless you back the segment hardware
>> with some ADDITIONAL mechanism to perform that demand-based activity,
>> you're implicitly stating that every "object" must be completely mapped
>> (or not at all)... you have no notion of a smaller mapping unit beneath
>> the segment interface.
>
> Actually swapping out a whole object can be a good idea, assuming each
> "DLL" is stored in a separate object, each with its own code and data
> segments.
This is only possible if you have secondary store. Or, if you know the object is immutable and can be reloaded from the IPL store (or equivalent).
> In case of congestion the OS knows which DLLs are currently active.
> If an object is not currently active, it is a good candidate for
> replacement.  The code segment can be discarded directly and the dirty
> data segment should be swapped out.
>
> Later on, if there is a new reference to a specific DLL, load the code
> and data segments at once as a unit.
I do essentially this -- but up to a scale including whole processors. If a processor is unneeded -- because its services can be hosted elsewhere -- then I move the services onto another node and shut down the processor. Likewise, if I have a shortage of resources, I bring "cold" processors back online and migrate services onto them.

On a smaller scale, if a service is idle, then I can opt to kill it and reload it from the permanent store, when needed (as killing it might let me idle that hosting processor).
> The problem with many paged systems is selecting which pages should be
> replaced, unless there is good LRU (Least Recently Used) hardware
> support.  Lacking sufficient hardware support, pages to be replaced are
> selected at random.
The problem will always exist because the page scheduling policy is typically hard-coded into the kernel. The OS is unaware of the needs of the application domain that it is hosting. So, an idle process may really WANT to "sit watching" and not be swapped out -- esp if there are performance (or correctness) issues associated with how quickly it responds to the "next" call for service!
> This is a problem especially with OSes that support(ed) multiple
> hardware platforms.  Pure pages can be discarded, but dirty pages need
> to be written to the page file.  For instance, in WinNT dirty pages in
> the working set are selected at random and moved to a queue of pages
> to be written into the page file.  If there is a new reference to the
> page in the queue, it is moved back to the working set and removed
> from the queue.  If there are no recent references, the page is
> written to the page file and removed from the queue.  Not very
> optimal.
I let processes manage the memory object(s) that they own. The kernel communicates a need for resources to a "policy process" that has knowledge of all of the resources held locally (memory being just one of those). [There are no *policy* decisions made in the kernel.] It makes a decision as to where to materialize any "spare" resources.

It then contacts the targeted process which, in turn, directs its "memory management process(es)" to free up some resources. However, it is under no obligation to do so. The downside of failing to honor a REQUEST to relinquish resources is that the system can opt to reclaim ALL of your resources (by killing you off!). It can also notice that this was necessary and make a note in the event that your process is reloaded at some later date (i.e., increase your process's "cost" to effectively discourage its use).
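To make the hand-off concrete, here's a minimal sketch of the message types such a protocol might carry -- all of these names are hypothetical, invented for illustration, not taken from the actual codebase:

    /* Hypothetical message types for the reclamation protocol described
       above (kernel -> policy process -> owning process).  C is used
       only for illustration. */
    typedef enum {
        MEM_PRESSURE,     /* kernel -> policy: "materialize N spare pages"  */
        RELINQUISH_REQ,   /* policy -> target: "please free some resources" */
        RELINQUISH_ACK,   /* target -> policy: freed this much (maybe zero) */
        RECLAIM_ALL       /* policy verdict: request refused; kill the
                             holder, reclaim everything, bump its "cost"    */
    } reclaim_kind;

    typedef struct {
        reclaim_kind kind;
        unsigned     pages;    /* requested, or actually released        */
        int          holder;   /* id of the process holding the resources */
    } reclaim_msg;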
On Mon, 29 Jun 2020 23:24:14 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 6/29/2020 9:05 PM, George Neuner wrote:
>> On Sat, 27 Jun 2020 23:50:50 -0700, Don Y
>> <blockedofcourse@foo.invalid> wrote:
>>
>>> On 6/27/2020 10:01 PM, George Neuner wrote:
>>>>>> If you want a *useful* segmenting MMU, you probably need to design it
>>>>>> yourself.  Historically there were some units that did it (what I
>>>>>> would call) right, but none are scalable to modern memory sizes.
>>>>>>
>>>>>> Whatever you do, you want the program to work with flat addresses and
>>>>>> have segmentation applied transparently (like paging) during memory
>>>>>> access.  You certainly DO NOT want to follow the x86 example of
>>>>>> exposing segments in the addressing.
>>>>>
>>>>> Agreed.  Segments were meant to address a different problem.
>>>>>
>>>>> OTOH, exposing them to the instruction set removes any potential
>>>>> ambiguity if two (or more) "general purpose" segments could
>>>>> overlap at a particular spot in the address space; the opcode
>>>>> acts as a disambiguator.
>>>>
>>>> ??? Not following.
>>>
>>> In a large, flat address space, it is conceivable that "general purpose"
>>> segments could overlap.  So, in such an environment, an address presented
>>> to the memory subsystem would have to resolve to SOME particular physical
>>> address, "behind" the segment hardware.  The hardware would have to resolve
>>> any possible ambiguities.  (how do you design the HARDWARE to prevent
>>> ambiguities from arising without increasing its complexity even more??)
>>
>> Unless you refer to x86, I still don't understand what "ambiguity" you
>> are speaking of.
>
>Yes.
>
>> x86 addresses using segments were ambiguous because x86 segmentation
>> was implemented poorly, with the segment being part of the address.
>>
>> A segment should ONLY be a protection zone, never an address
>> remapping.  Segments should be defined only on the flat address space,
>> and the address should be completely resolved before checking it
>> against segment boundaries.
>
>I'd prefer the VMM system to be based on "variable sized pages" (akin
>to segments) as you can emulate "variable sized protection zones" as
>collections of one or more such "pages".  Though I don't claim to need
>"single byte resolution" on such page sizes.
The idea of having multiple page sizes in a single process at least solves the page table size problem. Assume the page table is divided into 3 hierarchical levels, each handling a number of bits of the virtual address. This would make it possible to have Huge, Big and Small pages. The page size bits would have to be moved from the processor status word to each page table entry.

For Huge pages, the top level page table would directly select the Huge page and the remaining virtual address bits would be the offset within the Huge page. However, if the top level page table entry contains the Big page flag, the top level entry would point to the second level page table, which then contains either a pointer directly to the Big page or a pointer to the low level page table to address the Small page. The worst case total page table size (including all three levels) would be a few hundred entries total.

In fact, this would make it possible to have an 8 to 16 bit task ID field to the left of the process specific virtual address going through the virtual memory hierarchy (task:address). Since a reasonable number of bits should be handled by each page table level, the size difference between the different page sizes would have to be hundreds if not thousands.
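As a sketch of what such a walk could look like in C -- the 9-bits-per-level split, the flag names, and the types are all assumptions made for illustration, not any real MMU's layout:

    /* Hypothetical 3-level walk with mixed page sizes: the size flag
       lives in each table entry (PTE_LEAF at a given level means Huge,
       Big or Small), not in a processor status word. */
    #include <stdint.h>

    #define PTE_VALID (1u << 0)
    #define PTE_LEAF  (1u << 1)   /* entry maps a page at this level */

    typedef struct {
        uint64_t next_or_frame;   /* next-level table, or frame base */
        uint32_t flags;
    } pte_t;

    /* Assumed split: level 0 leaves are 1G "Huge" pages, level 1
       leaves are 2M "Big", level 2 leaves are 4K "Small"; 9 index
       bits per level (a 39-bit virtual address, for simplicity). */
    uint64_t translate(pte_t *top, uint64_t va)
    {
        static const int shift[3] = { 30, 21, 12 };  /* 1G, 2M, 4K */
        pte_t *t = top;

        for (int level = 0; level < 3; level++) {
            pte_t *e = &t[(va >> shift[level]) & 0x1FF];
            if (!(e->flags & PTE_VALID))
                return (uint64_t)-1;                 /* page fault */
            if (e->flags & PTE_LEAF)                 /* leaf at this size */
                return e->next_or_frame +
                       (va & ((1ull << shift[level]) - 1));
            t = (pte_t *)(uintptr_t)e->next_or_frame; /* descend */
        }
        return (uint64_t)-1;
    }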
On Mon, 29 Jun 2020 23:24:14 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>> On 6/27/2020 10:01 PM, George Neuner wrote:
>I'd prefer the VMM system to be based on "variable sized pages" (akin
>to segments) as you can emulate "variable sized protection zones" as
>collections of one or more such "pages".  Though I don't claim to need
>"single byte resolution" on such page sizes.
Mill allocates by cache lines (though what that means is model dependent). Is that a granularity you could live with?
>But, the present trend towards larger page sizes renders them less
>useful for many things.  E.g., the 512B VAX page would be an oddity
>in today's world -- I think ARM has some products that offer "tiny"
>1KB pages but even those are off-the-beaten track.
Anything under 4KB is an oddball today. Some 64-bit chips have 1GB pages now ... handy for huge databases and science simulations but not much else.
>> So long as segments are only used as
>> protection zones within the address space, they can overlap in any
>> way.
>
>Then you need a means of resolving which segment has priority at a
>particular logical address.  Do you expect that to be free?
Yes, and the right way to do that is to maintain a segment stack: e.g.,

  - the whole process space
  - the process data space
  - some heap
  -   :
  - this 100 byte buffer

etc., with the protection being the intersection of all permissions present in the stack. Since this naturally coincides with program scopes, the API only needs to manipulate the top of the stack.

For most programming a segment stack needs only a handful of entries. Even 8 entries likely is overkill for the most demanding cases.

But this is a protection mechanism separate from address space allocation, which can be done by pages of any convenient size.
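As a rough illustration of the intersection rule -- the types, sizes, and permission bits here are invented for the sketch, not a real SMMU's format:

    /* Effective rights at an address = bitwise AND of every enclosing
       stack entry.  A real design would fold the intersection in when
       an entry is pushed, so only the top need be checked per access. */
    #include <stdint.h>
    #include <stdbool.h>

    enum { PERM_R = 1, PERM_W = 2, PERM_X = 4 };

    typedef struct {
        uint64_t base, limit;   /* [base, base+limit) in flat space */
        uint8_t  perms;
    } seg_entry;

    typedef struct {
        seg_entry entry[8];     /* "even 8 entries likely is overkill" */
        int       depth;
    } seg_stack;

    bool seg_check(const seg_stack *s, uint64_t addr, uint8_t want)
    {
        uint8_t eff = PERM_R | PERM_W | PERM_X;
        for (int i = 0; i < s->depth; i++) {
            const seg_entry *e = &s->entry[i];
            if (addr < e->base || addr >= e->base + e->limit)
                return false;   /* outside an enclosing scope */
            eff &= e->perms;    /* intersect permissions of all levels */
        }
        return (eff & want) == want;
    }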
>>> The advantage that fixed size (even if there is a selection of sizes
>>> to choose from) pages offers is each page has a particular location
>>> into which it fits.  You don't have to worry that some *other* page
>>> partially overlaps it or that it will overlap another.
>>
>> You do if the space is shared.
>
>So, all processes share a single address space?  And, segments resolve
>whether or not the "current process" can access the particular segment's
>contents?  How does that scale?
No, I mean if the same location is shared between processes [or even threads in your model] that are using different page sizes.
>I think the model of separate address spaces (for processes) is a lot
>easier for folks to grok.  The fact that a particular "memory unit" can
>coexist in more than one (with permissions appropriate to each specific
>process) is a lot easier to manage.
Easier than what? We haven't talked about different address models.
>If I want a particular process to have access to the payload of a network
>packet, do I have to create a "subsegment" that encompasses JUST the payload
>so that I can leave the framing intact -- yet inaccessible?  If that process
>tries to access the portion of the packet that is "locked", it gets
>blocked -- and doesn't that leak information (i.e., the fact that there IS
>framing information surrounding the packet implies that it wasn't sourced
>by a *local* process)?
A subsegment of the program's space certainly. How deep to make the hierarchy largely is up to the programmer and how much she wants the hardware to check. Yes, it may reveal there is something the programmer can't look at. So what? People have been frustrated by locked doors for thousands of years. Besides which, your system is open source, so anybody can go in, peek behind the curtain, and potentially remove any restrictions they don't like.
>[By contrast, if I could create "memory units" that were of any particular
>size, I'd create one that was sized to shrink-wrap to the payload with
>another that encompassed the entire packet.  The first would be mapped
>into the aforementioned process's address space IMMEDIATELY ADJACENT to
>any other packets that were part of the message (as an example).  The
>larger unit would be mapped into the address space of the process that
>was concerned with the framing information.]
I don't see how that's an improvement. Unless you provide byte sized "pages" then mapping over an existing data structure - e.g., to copy it - potentially will leak stuff at the ends. It works for your hypothetical received packet, but not in general.
>>> But, with support for different (fixed) page sizes -- and attendant
>>> performance consequences thereof -- the application needs to hint
>>> the OS on how it plans/needs to use memory in order to make optimum
>>> use of memory system bandwidth.  Silly for the OS to naively choose
>>> a page size for a process based on some crude metric like "size of
>>> object".  That can result in excessive resources being bound that
>>> aren't *typically* USED by that object -- fault in those portions AS
>>> they are needed (why do I need a -- potentially large -- portion of
>>> the object residing in mapped memory if it is only accessed very
>>> infrequently?)
>>>
>>> OTOH, a finer-grained choice (allowing smaller pieces of the object
>>> to be mapped at a time) reduces TLB reach as well as consuming OTHER
>>> resources (e.g., TLB misses) for an object with poor locality of
>>> reference (here-a-hit, there-a-hit, everywhere-a-hit-hit...)
>>
>> Exactly.  The latency of TLB misses is the very reason for the
>> existence of "large" pages in modern operating systems.
>
>But they assume there will be "something large" occupying that physical
>resource -- or not.  I.e., with a 16MB superpage, you really want/need
>something that is "close to" 16MB (or, just treat memory as costless).
The thing is that segments aren't managed by TLB ... they are managed by SLB which can do whatever it wants.
>I wonder how "big" most processes are (in c.a.e products) -- assuming
>the whole process can be mapped into a single contiguous page?
>
>Conversely, I wonder how many "smaller objects" need to be moved between
>address spaces in such products (assuming, of course, that they operate
>under those sorts of protection mechanisms)?
I suspect many are single applications in a single address space.
>Regardless (or, "Irregardless", as sayeth The Rat!), I'm stuck with
>the fixed size pages that vendors currently offer.  So, there can't
>be any "policy" inherent in the crafting of my code as it can't know
>whether it will be able to avail itself of "tiny" pages or if it
>will be packaged in a more wasteful container.
Then what was the point of *this* discussion?  Exercise?

George
On 6/30/2020 2:45 AM, George Neuner wrote:
> On Mon, 29 Jun 2020 23:24:14 -0700, Don Y
> <blockedofcourse@foo.invalid> wrote:
>
>>> On 6/27/2020 10:01 PM, George Neuner wrote:
>
>> I'd prefer the VMM system to be based on "variable sized pages" (akin
>> to segments) as you can emulate "variable sized protection zones" as
>> collections of one or more such "pages".  Though I don't claim to need
>> "single byte resolution" on such page sizes.
>
> Mill allocates by cache lines (though what that means is model
> dependent).  Is that a granularity you could live with?
That depends. At the end of the day, the cache determines performance so anything finer seems wasted. But, I'm not concerned with performance as much as functionality; I'd like to be able to do vm_allocate()s in the same way that I can build buffer pools -- nothing prevents me from building 48-byte buffers, so why shouldn't I (in a world where hardware could do whatever I wanted) be able to create 48-byte "pages"? Then, a 4100-byte page to hold this executable module and a set of 1000 9000-byte pages to hold jumbo packets?

I.e., I'm not artificially constrained by what I can do in software... just by what the hardware will accommodate to match my needs.
>> But, the present trend towards larger page sizes renders them less
>> useful for many things.  E.g., the 512B VAX page would be an oddity
>> in today's world -- I think ARM has some products that offer "tiny"
>> 1KB pages but even those are off-the-beaten track.
>
> Anything under 4KB is an oddball today.  Some 64-bit chips have 1GB
> pages now ... handy for huge databases and science simulations but not
> much else.
Exactly. And, even if you have 1GB objects, there's no guarantee that you would want to dedicate resources to having it completely mapped at any given time. I.e., if you only wanted to map a quarter of it, you have to move to smaller page sizes --> more levels of page-tables. (presumably, you could discipline your software to only access parts that it KNOWS are mapped... but, doesn't that sort of defeat the purpose?)
>>> So long as segments are only used as
>>> protection zones within the address space, they can overlap in any
>>> way.
>>
>> Then you need a means of resolving which segment has priority at a
>> particular logical address.  Do you expect that to be free?
>
> Yes, and the right way to do that is to maintain a segment stack:
> e.g.,
>
>   - the whole process space
>   - the process data space
>   - some heap
>   -   :
>   - this 100 byte buffer
>
> etc., with the protection being the intersection of all permissions
> present in the stack.  Since this naturally coincides with program
> scopes, the API only needs to manipulate the top of the stack.
But you need this for every disjoint 100 (or 300!) byte buffer (or similar object managed as segment). I.e., every accessible segment has to be visible/resolvable in that structure in order to know what ACLs apply to it, NOW.
> For most programming a segment stack needs only a handful of entries.
> Even 8 entries likely is overkill for the most demanding cases.
>
> But this is a protection mechanism separate from address space
> allocation, which can be done by pages of any convenient size.
>
>>>> The advantage that fixed size (even if there is a selection of sizes
>>>> to choose from) pages offers is each page has a particular location
>>>> into which it fits.  You don't have to worry that some *other* page
>>>> partially overlaps it or that it will overlap another.
>>>
>>> You do if the space is shared.
>>
>> So, all processes share a single address space?  And, segments resolve
>> whether or not the "current process" can access the particular segment's
>> contents?  How does that scale?
>
> No, I mean if the same location is shared between processes [or even
> threads in your model] that are using different page sizes.
[threads exist in a shared container so always have the same address space]

But nothing CAN overlap it that isn't intended to be accessible in a given process. E.g., if "foo" resides in a 16K page in process A and an 8K portion of that same physical memory is mapped into process B, then A and B can each access foo -- at potentially different logical addresses. The other 8K of the 16K that is accessible in A need not be mapped in B -- some other (8K) page can appear in that relative location.
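Today's OSes can already express this particular example with standard POSIX calls; a sketch (error handling omitted, and a single program standing in for both "A" and "B" views):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = shm_open("/foo", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, 16 * 1024);

        /* "Process A": map the whole 16K read-write. */
        char *a = mmap(NULL, 16 * 1024, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);

        /* "Process B": map only the first 8K, read-only, at whatever
           address the kernel picks -- the other 8K of A's view simply
           doesn't exist in B's address space. */
        char *b = mmap(NULL, 8 * 1024, PROT_READ, MAP_SHARED, fd, 0);

        a[100] = 42;                   /* visible through both views...  */
        shm_unlink("/foo");
        return b[100] == 42 ? 0 : 1;   /* ...at different addresses      */
    }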
>> If I want a particular process to have access to the payload of a network
>> packet, do I have to create a "subsegment" that encompasses JUST the payload
>> so that I can leave the framing intact -- yet inaccessible?  If that process
>> tries to access the portion of the packet that is "locked", it gets
>> blocked -- and doesn't that leak information (i.e., the fact that there IS
>> framing information surrounding the packet implies that it wasn't sourced
>> by a *local* process)?
>
> A subsegment of the program's space certainly.  How deep to make the
> hierarchy largely is up to the programmer and how much she wants the
> hardware to check.
We're talking about hypothetical hardware so there's no reason it shouldn't "do it all"!
> Yes, it may reveal there is something the programmer can't look at.  So
> what?  People have been frustrated by locked doors for thousands of
> years.  Besides which, your system is open source, so anybody can go
> in, peek behind the curtain, and potentially remove any restrictions
> they don't like.
Wrong point. You don't want the code -- at run time -- to be able to deduce anything that isn't explicitly disclosed to it (FOSS just makes this more damning).
>> [By contrast, if I could create "memory units" that were of any particular
>> size, I'd create one that was sized to shrink-wrap to the payload with
>> another that encompassed the entire packet.  The first would be mapped
>> into the aforementioned process's address space IMMEDIATELY ADJACENT to
>> any other packets that were part of the message (as an example).  The
>> larger unit would be mapped into the address space of the process that
>> was concerned with the framing information.]
>
> I don't see how that's an improvement.  Unless you provide byte sized
> "pages" then mapping over an existing data structure - e.g., to copy
> it - potentially will leak stuff at the ends.
Yes. So pages that are typically considerably larger than the sorts of buffers you are inclined to use will tend to leak MORE.

You can work around this by scrubbing pages (and buffers) after use. But, you still have to rely on discipline to ensure a buffer doesn't get rewritten before being considered "done" (when it can be scrubbed). E.g., I scrub all "messages" at the end of each RPC and return them to the "page pool" to ensure nothing leaks between uses. So, pages that are significantly larger than what is needed for a message represent wasted effort (you have to scrub the whole page because you don't know if the callee scribbled something on it).

But, I can't guarantee pages passed out-of-band won't leak information (though I'm working on an architectural change to address that).
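The scrub itself is trivial -- the discipline is the hard part. A sketch, where page_pool_put() is a hypothetical stand-in for the pool return, not a real API:

    #include <string.h>

    #define PAGE_SIZE 4096

    extern void page_pool_put(void *page);   /* hypothetical pool return */

    void rpc_message_release(void *page)
    {
        /* A plain memset of "dead" memory can be optimized away;
           explicit_bzero (BSDs, glibc >= 2.25) guarantees the scrub
           of the WHOLE page actually happens. */
        explicit_bzero(page, PAGE_SIZE);
        page_pool_put(page);
    }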
>> I wonder how "big" most processes are (in c.a.e products) -- assuming
>> the whole process can be mapped into a single contiguous page?
>>
>> Conversely, I wonder how many "smaller objects" need to be moved between
>> address spaces in such products (assuming, of course, that they operate
>> under those sorts of protection mechanisms)?
>
> I suspect many are single applications in a single address space.
By contrast, I've been working with disjoint address spaces for much of my career (though usually not with hardware protection of those address spaces). E.g., bank-switching TEXT and BSS/DATA/STACK on a per-task basis so each task appears to have its own address space separated from (most of) the other tasks (if a task is small enough, it can share an address space with another task(s)) while the kernel hides in another "hidden" bank.

But, even those COULD benefit from some resource reclamation -- if executing out of RAM (loaded from FLASH). E.g., reclaiming the memory that had been used for initialization (i.e., if you need a KB or so to set things up, then you could reuse that KB for your pushdown stack or run-time buffers).
>> Regardless (or, "Irregardless", as sayeth The Rat!), I'm stuck with
>> the fixed size pages that vendors currently offer.  So, there can't
>> be any "policy" inherent in the crafting of my code as it can't know
>> whether it will be able to avail itself of "tiny" pages or if it
>> will be packaged in a more wasteful container.
>
> Then what was the point of *this* discussion?  Exercise?
Indicating why I think variable size pages are of value. E.g., the "architectural change" that I alluded to, above, will effectively emulate the variable sized pages that I desire. But, will do so at the expense of CPU cycles. <shrug> I've adopted that philosophy throughout the design -- performance always improves (for a given cost) so why not "spend" it on features and mechanisms that make coding easier and more robust? Christ, the system will STILL spend most of its time twiddling its thumbs!

It's the same sort of reasoning that lets my processes decide which of their pages to "swap out" instead of letting the kernel make those decisions blindly. (it's just more opcode fetches!)

[But, using 4K -- or larger -- pages for 500-byte objects just reeks of waste.]

I can't purchase a battery-backed, solar-powered 120-port network switch with PoE (2000W) and PTP support (along with protection against malevolent actors trying to physically damage the switch via exposed connectors). But, I can EMULATE one using COTS parts (and leverage that as an opportunity to add *other* value!).

"Some day" the hardware (CPU, switch) will move in a direction that is more accommodating than present day. If not, my current hardware will only end up FASTER, which means the current implementation will just more closely emulate (performance-wise) that conceptual hardware that might have been available "today"!

[Time for C's morning walk -- while it's still < 90F.]
On Tue, 30 Jun 2020 04:13:23 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 6/30/2020 2:45 AM, George Neuner wrote:
>> On Mon, 29 Jun 2020 23:24:14 -0700, Don Y
>> <blockedofcourse@foo.invalid> wrote:
>> ... the right way to do that is to maintain a segment stack:
>> e.g.,
>>
>>   - the whole process space
>>   - the process data space
>>   - some heap
>>   -   :
>>   - this 100 byte buffer
>>
>> etc., with the protection being the intersection of all permissions
>> present in the stack.  Since this naturally coincides with program
>> scopes, the API only needs to manipulate the top of the stack.
>
>But you need this for every disjoint 100 (or 300!) byte buffer
>(or similar object managed as segment).  I.e., every accessible
>segment has to be visible/resolvable in that structure in order
>to know what ACLs apply to it, NOW.
You need a segment descriptor for every region THAT YOU WANT PROTECTED. So what?

At most any given thread can only be using one or two buffers at a time, and the entire stack does not have to be examined: you can look only at the top level if, when a descriptor is loaded, its permissions are modified to the intersection of itself and all the underlying descriptors.

The hardware would be a (small) stack with an additional top level entry to handle the case of 2 buffers in use simultaneously (e.g., a copy operation).

    [ A ]   [ B ]
        \   /
       [     ]
       [     ]
       [     ]

The control API needs to distinguish 2 top level descriptors and allow at least one of them to push/pop through to the underlying stack.

Each descriptor needs ... probably an ACL ... to indicate what threads can use it - but when any particular thread loads the descriptor to use it, only that thread's id needs to be in the actual MMU hardware.

It isn't as complex as you might think initially - though, obviously, the hardware has to be designed carefully.
>> For most programming a segment stack needs only a handful of entries.
>> Even 8 entries likely is overkill for the most demanding cases.
>>
>> But this is a protection mechanism separate from address space
>> allocation, which can be done by pages of any convenient size.
>>>>> The advantage that fixed size (even if there is a selection of sizes
>>>>> to choose from) pages offers is each page has a particular location
>>>>> into which it fits.  You don't have to worry that some *other* page
>>>>> partially overlaps it or that it will overlap another.
>>>>
>>>> You do if the space is shared.
>>>
>>> So, all processes share a single address space?  And, segments resolve
>>> whether or not the "current process" can access the particular segment's
>>> contents?  How does that scale?
>>
>> ... I mean if the same location is shared between processes [or even
>> threads in your model] that are using different page sizes.
>
>[threads exist in a shared container so always have the same address space]
>
>But nothing CAN overlap it that isn't intended to be accessible
>in a given process.  E.g., if "foo" resides in a 16K page in process A
>and an 8K portion of that same physical memory is mapped into process B,
>then A and B can each access foo -- at potentially different logical
>addresses.  The other 8K of the 16K that is accessible in A need not
>be mapped in B -- some other (8K) page can appear in that relative
>location.
So no different than what OSes are doing now.
>> [segments] may reveal there is something the programmer can't look at.  So
>> what?  People have been frustrated by locked doors for thousands of
>> years.  Besides which, your system is open source, so anybody can go
>> in, peek behind the curtain, and potentially remove any restrictions
>> they don't like.
>
>Wrong point.
>
>You don't want the code -- at run time -- to be able to deduce anything that
>isn't explicitly disclosed to it (FOSS just makes this more damning).
That's a straw man ... code can deduce that it's running in a sandbox simply by failure to deduce certain things about its environment. Note that "failure" here includes both getting no answer and getting an answer that is unreasonable for real hardware.

Since your aim AIUI is to build a control system, there is a limit to how far you can go in sandboxing programs.
>"Some day" the hardware (CPU, switch) will move in a direction that is >more accommodating than present day.
Doubtful. CPU vendors have given up on segmentation, and paging units are allowing larger (and larger) pages to match ever growing physical memories.

What you seem to want is a real capability machine ... which is something no vendor will ever produce while C continues to be the language of system programming.

Like paging, segmentation is at least something that can be added (at some cost of latency) to an existing CPU: you just need to design your SMMU and insert it into the memory access path.

Historically there were some examples of doing this: e.g., augmenting a 68000/10/12 with a 68451 SMMU. But since interest in segmentation (properly done) died with 32-bit CPUs in the "micro" arena, there is not much to look at. Most strong uses of segmentation are only to be found in mainframe CPUs from the 60s and 70s.

George
On 6/30/2020 1:00 AM, upsidedown@downunder.com wrote:
>> I'd prefer the VMM system to be based on "variable sized pages" (akin
>> to segments) as you can emulate "variable sized protection zones" as
>> collections of one or more such "pages".  Though I don't claim to need
>> "single byte resolution" on such page sizes.
>
> The idea of having multiple size pages in a single process at least
> solves the page table size problem.
Unless you can move all/most of the page table into the TLB, making it "smaller" (more manageable) doesn't buy you much. Esp if the page size is larger than needed (i.e., the "work" is dealing with the need for multiple "too big" pages).

Advances in (paged) MMU performance will come from larger TLBs. For applications, TLB size is typically not a driving issue. They "do something" and keep on doing it -- before moving on to something else. They tend not to "hop around" memory.

It's OSs that suffer from piss-poor data locality. But, you can't predict how much time an application will spend IN the OS so can't make sweeping generalizations as to when/if you should lock cache to GAIN performance without risking LOSING performance.
On 6/30/2020 9:49 AM, George Neuner wrote:
>>> ... the right way to do that is to maintain a segment stack:
>>> e.g.,
>>>
>>>   - the whole process space
>>>   - the process data space
>>>   - some heap
>>>   -   :
>>>   - this 100 byte buffer
>>>
>>> etc., with the protection being the intersection of all permissions
>>> present in the stack.  Since this naturally coincides with program
>>> scopes, the API only needs to manipulate the top of the stack.
>>
>> But you need this for every disjoint 100 (or 300!) byte buffer
>> (or similar object managed as segment).  I.e., every accessible
>> segment has to be visible/resolvable in that structure in order
>> to know what ACLs apply to it, NOW.
>
> You need a segment descriptor for every region THAT YOU WANT
> PROTECTED.  So what?
You need a segment STACK for each segment that is exposed.

  - the whole process space
  - the process data space
  - some heap
  -   :
  - this 100 byte buffer

and

  - the whole process space
  - the process data space
  - some heap
  -   :
  - some OTHER 400 byte buffer

and

  - the whole process space
  - the process data space
  - some OTHER heap
  -   :
  - yet ANOTHER 100 byte buffer

and

  - some OTHER process space
  - the process data space
  - some heap
  -   :
  - yet ANOTHER 100 byte buffer

etc.

Each time you create a segment, you'd have to "recompile" the segment stack(s) for the different environments in which it is accessible (or, be forced to evaluate each on-the-fly).
> At most any given thread can only be using one or two buffers at a
> time ...
huh? If by "at a time" you mean "in a trivially tiny interval of time" then I might agree. But, a thread could be accessing lots of buffers with no readily predictable pattern of the order in which they are accessed.
> time, and the entire stack does not have to be examined: you can look
> only at the top level if, when a descriptor is loaded, its permissions
> are modified to the intersection of itself and all the underlying
> descriptors.
The underlying descriptors aren't static. When process X is executing, it likely doesn't have the same permissions as when process Y was executing. So, on a process swap, you'd have to rebuild (recompile) each stack to reflect the current process's permissions AT EACH LEVEL in the stack.
> The hardware would be a (small) stack with an additional top level
> entry to handle the case of 2 buffers in use simultaneously (e.g., a
> copy operation).
>
>     [ A ]   [ B ]
>         \   /
>        [     ]
>        [     ]
>        [     ]
>
> The control API needs to distinguish 2 top level descriptors and allow
> at least one of them to push/pop through to the underlying stack.
And, in the very next opcode, DIFFERENT buffers (segments) can be in use. Your "hardware stack" needs to be rebuilt -- or "instantly reloaded" -- for each such segment/set of segments. I don't see how this can be faster than allowing for DISJOINT "segments" of varying sizes.
> Each descriptor needs ... probably an ACL ... to indicate what threads
> can use it - but when any particular thread loads the descriptor to
> use it, only that thread's id needs to be in the actual MMU hardware.
So, you build this nested structure off in memory someplace (like page tables) and expect the hardware to find the stack(s) that need to be active for any given operation, conditioned by the ID of the process currently executing. As each stack may be of different depths, you would have to manage differing sized chunks of memory in that "structure space" (or, force all to have a maximum depth and manage fixed size "stacks" regardless of actual content)
> It isn't as complex as you might think initially - though, obviously,
> the hardware has to be designed carefully.
>>> For most programming a segment stack needs only a handful of entries.
>>> Even 8 entries likely is overkill for the most demanding cases.
>>>
>>> But this is a protection mechanism separate from address space
>>> allocation, which can be done by pages of any convenient size.
>
>>>>>> The advantage that fixed size (even if there is a selection of sizes
>>>>>> to choose from) pages offers is each page has a particular location
>>>>>> into which it fits.  You don't have to worry that some *other* page
>>>>>> partially overlaps it or that it will overlap another.
>>>>>
>>>>> You do if the space is shared.
>>>>
>>>> So, all processes share a single address space?  And, segments resolve
>>>> whether or not the "current process" can access the particular segment's
>>>> contents?  How does that scale?
>>>
>>> ... I mean if the same location is shared between processes [or even
>>> threads in your model] that are using different page sizes.
>>
>> [threads exist in a shared container so always have the same address space]
>>
>> But nothing CAN overlap it that isn't intended to be accessible
>> in a given process.  E.g., if "foo" resides in a 16K page in process A
>> and an 8K portion of that same physical memory is mapped into process B,
>> then A and B can each access foo -- at potentially different logical
>> addresses.  The other 8K of the 16K that is accessible in A need not
>> be mapped in B -- some other (8K) page can appear in that relative
>> location.
>
> So no different than what OSes are doing now.
Exactly. So, "no problem" regardless of whether it is a "shared space" or not.
>>> [segments] may reveal there is something the programmer can't look at.  So
>>> what?  People have been frustrated by locked doors for thousands of
>>> years.  Besides which, your system is open source, so anybody can go
>>> in, peek behind the curtain, and potentially remove any restrictions
>>> they don't like.
>>
>> Wrong point.
>>
>> You don't want the code -- at run time -- to be able to deduce anything that
>> isn't explicitly disclosed to it (FOSS just makes this more damning).
>
> That's a straw man ... code can deduce that it's running in a sandbox
> simply by failure to deduce certain things about its environment.  Note
> that "failure" here includes both getting no answer and getting an
> answer that is unreasonable for real hardware.
With variable size "pages", I can FILL your address space with stuff you SHOULD be able to see and provide no clues about how much stuff you CAN'T see -- because the unseeable portions of the address space don't consume any part of your logical address space. EVERY address resolves to data that you can see, so how do you know of the existence of unseeable data, except in the theoretical sense?

It's like my per-process namespaces... you KNOW there are things out there that you can't "see", but how can you exploit that to the detriment of the system? You can waste YOUR resources trying every name imaginable -- but none of them (other than those mapped into your context) will resolve to anything! Measure how much power the CPU uses while you are making these attempts. Measure how much time the OS takes to return your NAK. There's nothing revealed in a side-channel.

[you might be able to deduce if a task switch occurred "while away". Or, if the service is hosted locally vs. remotely. But, you can't gain any information about the objects hidden from you and the names that won't resolve]

OTOH, if you stumble upon 5 asterisks in a latent image of the "password" textbox, you can deduce that the most recently entered password had 5 characters, even though you can't *see* any of them.
> Since your aim AIUI is to build a control system, there is a limit to > how far you can go in sandboxing programs.
Current app has no bearing on future apps. Some of my colleagues are using my codebase for very different applications. This has led to many additional mechanisms being included that complicate my application but are necessary for theirs (I've been able to implement all as "bolt-ons" due to my choice of architecture)
>> "Some day" the hardware (CPU, switch) will move in a direction that is
>> more accommodating than present day.
>
> Doubtful.  CPU vendors have given up on segmentation, and paging units
> are allowing larger (and larger) pages to match ever growing physical
> memories.
Modern processors emulate legacy processors. Who's to say the future trend won't expose more of those capabilities to the developer so he can emulate some *other* (abstract) processor characteristics?

Motogorilla's RGP implemented Bresenham's algorithm in the instruction set. Give me access to the microcode in a modern processor and I can emulate *that* processor's instruction set (though perhaps with different instruction timings).

In the 70's, I designed a graphics processor that treated pixel arrays as objects that could be stacked in three dimensions (in front of/behind other objects) as well as signal when any two objects "overlapped" ("collision detection" in a video game -- "bullet hits enemy"). Again, let me rewrite a modern CPU's microcode and I can provide the same feature interface while taking advantage of all the fabrication advancements in the past 50 years!

The 99000 maintained its register complement in main memory ("workspaces"). Nothing stops me from emulating that capability, today. Etc.

Look at how readily old processors are emulated by userland SOFTWARE (e.g., MAME, Bochs, etc.) Note that the excess capabilities of modern processors need not reflect additional capabilities in the emulated applications! While you COULD run Defender's codebase 100 times faster, you'd end up with an unplayable game!

Just because a vendor doesn't offer a particular solution OTS doesn't mean you can't MAKE the solution that you want. Presently, FPGAs are too general-purpose to efficiently implement these more complex designs (whereas microcoded CPUs have already optimized-away some of the flexibility that isn't NEEDED for a CPU). But, there's no reason to think that future FPGAs might not be tailored to more specific market segments. Or, that foundries won't offer services that allow for "programmed designs" to be more economically committed to fixed silicon. This was already in the works decades ago when I did my last "custom".

I'm out. I've got too much work to do. I'm hoping to have the next off-site here, when "social distancing" eases and I can host my colleagues (so they can see some of my toys in use!). I don't think any of them are eager to deal with the extent of our "outbreak", nor the insane heat!
> What you seem to want is a real capability machine ... which is
> something no vendor will ever produce while C continues to be the
> language of system programming.
>
> Like paging, segmentation is at least something that can be added (at
> some cost of latency) to an existing CPU: you just need to design your
> SMMU and insert it into the memory access path.
>
> Historically there were some examples of doing this: e.g., augmenting
> a 68000/10/12 with a 68451 SMMU.  But since interest in segmentation
> (properly done) died with 32-bit CPUs in the "micro" arena, there is
> not much to look at.  Most strong uses of segmentation are only to be
> found in mainframe CPUs from the 60s and 70s.
>
> George
On Tue, 30 Jun 2020 12:23:17 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 6/30/2020 1:00 AM, upsidedown@downunder.com wrote:
>>> I'd prefer the VMM system to be based on "variable sized pages" (akin
>>> to segments) as you can emulate "variable sized protection zones" as
>>> collections of one or more such "pages".  Though I don't claim to need
>>> "single byte resolution" on such page sizes.
>>
>> The idea of having multiple size pages in a single process at least
>> solves the page table size problem.
>
>Unless you can move all/most of the page table into the TLB,
>making it "smaller" (more manageable) doesn't buy you much.
>Esp if the page size is larger than needed (i.e., the "work"
>is dealing with the need for multiple "too big" pages)
I just tried to show you how to implement the variable size pages that you preferred.

Assume an x86-64 style system, with available page sizes of 4 K, 2 M and 1 G, and assume you want to map a 3.123 G object. This will require:

- 780750 page table entries (PTEs) with 4 K pages, or
- 1562 PTEs with 2 M page size (with 1 M lost), or
- 4 PTEs if 1 G pages are used (with 877 M lost).

Assuming the page size can be mixed in the same process, mapping the 3.123 G object would require:

- 3 PTEs with 1 G pages _and_
- 61 PTEs of 2 M pages _and_
- 250 PTEs of 4 K pages,

a total of 314 PTEs, which should be easier to fit into TLBs than the thousands required by fixed 4 K or 2 M pages.

Note: For illustrative purposes it is assumed that K is 1000, M is 1E6 and G is 1E9 in the calculations above.
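The same arithmetic in a few lines of C, for anyone who wants to check it (decimal K/M/G as noted above):

    #include <stdio.h>

    int main(void)
    {
        long long obj = 3123000000LL;          /* 3.123 G object     */
        long long g   = obj / 1000000000LL;    /* 3 one-G pages      */
        long long rem = obj % 1000000000LL;    /* 123 M remaining    */
        long long m   = rem / 2000000LL;       /* 61 two-M pages     */
        rem %= 2000000LL;                      /* 1 M remaining      */
        long long k   = rem / 4000LL;          /* 250 four-K pages   */

        printf("%lld + %lld + %lld = %lld entries\n", g, m, k, g + m + k);
        return 0;   /* prints: 3 + 61 + 250 = 314 entries */
    }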
On 6/30/2020 22:23, Don Y wrote:
> On 6/30/2020 1:00 AM, upsidedown@downunder.com wrote:
>>> I'd prefer the VMM system to be based on "variable sized pages" (akin
>>> to segments) as you can emulate "variable sized protection zones" as
>>> collections of one or more such "pages".  Though I don't claim to need
>>> "single byte resolution" on such page sizes.
>>
>> The idea of having multiple size pages in a single process at least
>> solves the page table size problem.
>
> Unless you can move all/most of the page table into the TLB,
> making it "smaller" (more manageable) doesn't buy you much.
> Esp if the page size is larger than needed (i.e., the "work"
> is dealing with the need for multiple "too big" pages)
>
> Advances in (paged) MMU performance will come from larger TLBs.
> For applications, TLB size is typically not a driving issue.
> They "do something" and keep on doing it -- before moving on
> to something else.  They tend not to "hop around" memory.
>
> It's OSs that suffer from piss-poor data locality.  But, you
> can't predict how much time an application will spend IN the OS
> so can't make sweeping generalizations as to when/if you should
> lock cache to GAIN performance without risking LOSING performance.
Hi Don,

I think you are chasing your own tail with that. Why do you want to have all pages in a TLB at the same time? Going through, say, 4G of memory at 2 ns per access will take well above a second through a 64 bit bus. So even if you waste a whole ms on tablewalks in the process, you will still have wasted only 0.1% of the time.

Having large pages or BAT translation makes things easy enough; nobody seems to have been looking for newer solutions for the last few decades because there is no need for them. Just accept the granularity. If you want the extra predictability and/or stability for your system, just make sure you have enough physical memory so you can map all the used logical memory without ever needing to swap, and you are done.

If 4k granularity is too much -- it can be even in large systems -- just make your own pools of smaller granularity in logical address space; this is how I have been doing it for decades now in DPS.

Dimiter

======================================================
Dimiter Popoff, TGI             http://www.tgi-sci.com
======================================================
http://www.flickr.com/photos/didi_tgi/
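A minimal free-list sketch of the kind of sub-page pool described above -- this is illustrative C with invented names, not the actual DPS code, and it assumes a single thread:

    /* Carve odd-sized "pages" (48-byte buffers) out of one normal
       hardware page, in software -- what the MMU won't do for you. */
    #include <stddef.h>

    #define BUF_SIZE  48
    #define POOL_SIZE 4096                /* one conventional page */

    typedef struct free_node { struct free_node *next; } free_node;

    static free_node *free_list;

    void pool_init(void *page)            /* thread page into buffers */
    {
        char *p = page;
        for (size_t off = 0; off + BUF_SIZE <= POOL_SIZE; off += BUF_SIZE) {
            free_node *n = (free_node *)(p + off);
            n->next = free_list;
            free_list = n;
        }
    }

    void *pool_alloc(void)                /* NULL when pool is empty */
    {
        free_node *n = free_list;
        if (n) free_list = n->next;
        return n;
    }

    void pool_free(void *buf)
    {
        free_node *n = buf;
        n->next = free_list;
        free_list = n;
    }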
On Tue, 30 Jun 2020 12:44:17 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 6/30/2020 9:49 AM, George Neuner wrote:
>>>>
>>>> ... the right way to do that is to maintain a segment stack:
>>>> e.g.,
>>>>
>>>>   - the whole process space
>>>>   - the process data space
>>>>   - some heap
>>>>   -   :
>>>>   - this 100 byte buffer
>>>>
>>>> etc., with the protection being the intersection of all permissions
>>>> present in the stack.  Since this naturally coincides with program
>>>> scopes, the API only needs to manipulate the top of the stack.
>>>
>>> But you need this for every disjoint 100 (or 300!) byte buffer
>>> (or similar object managed as segment).  I.e., every accessible
>>> segment has to be visible/resolvable in that structure in order
>>> to know what ACLs apply to it, NOW.
>>
>> You need a segment descriptor for every region THAT YOU WANT
>> PROTECTED.  So what?
>
>You need a segment STACK for each segment that is exposed.
>
>  - the whole process space
>  - the process data space
>  - some heap
>  -   :
>  - this 100 byte buffer
>
>and
>
>  - the whole process space
>  - the process data space
>  - some heap
>  -   :
>  - some OTHER 400 byte buffer
>
>and ...
For application programming in general you need one stack per process/thread. Application "objects" are disjoint only at top level ... underneath there are large common segments which follow the scoping contour of the program.

    object  *        object *     object *
               :
    heap    ***********
    dseg    ***        ****************************
    process ****************************************************

All the "objects" are subsegments of a heap, which is a subsegment of the process's data segment, which is a subsegment of the process's total address space.

Yes, this is simplistic, but also realistic. Many applications will never go farther than this. Yes, a single thread may want to work simultaneously with objects that were allocated from different heaps, and THAT would require dealing with multiple stacks ... but it isn't a typical behavior for many applications.

Obviously kernel programming is a different matter, and typical applications in *your* system may not conform to these behavioral generalities. The point is that it can be handled, and not by needing a stack for each object. See below.
>Each time you create a segment, you'd have to "recompile" the
>segment stack(s) for the different environments in which it is
>accessible (or, be forced to evaluate each on-the-fly)
Most use of segments will be strongly hierarchically scoped. As an *analogy*, recall Pascal's record access using WITH:

    with the_process_space do
      with the_data_segment do
        with my_heap do
          :
          with objA, objB do
            :
          end;
          :
        end;
      end;
    end;

[note: the multiple WITH above isn't legal Pascal, but something like it that allowed an arbitrary number of identifiers would be a useful construct for working with segments.]

Yes, it is a runtime evaluation, but think about it: all that is really happening is comparing a few ids, limits and access rights to see if they are compatible, and then shoving descriptors into SMMU cache.

Also remember that, e.g., the "my_heap" segment above may control most/all of the data accesses of entire threads and be in scope in the process for long periods of time. And an *application* typically would have no way to refer to the descriptors for its process and data space - those would come from the OS and be implied within the application.

And yes, there would have to be some variant to deal with really disjoint spaces: different heaps for an application, different processes for the kernel, etc. But that's just syntax over the mechanism.
>> At most any given thread can only be using one or two buffers at a
>
>huh?  If by "at a time" you mean "in a trivially tiny interval of time"
>then I might agree.  But, a thread could be accessing lots of buffers
>with no readily predictable pattern of the order in which they are
>accessed.
Yes, but the segment "cache" doesn't need to be restricted to minimums [that actually would not be very smart]. The SMMU doesn't have to maintain stacks of descriptors if you're willing to go to memory when they change (but sometimes that will be necessary anyway - just like in paging). But there will be segments that are in force for long periods and those should be cached for fast access.

So a more realistic design might look something like:

    [ ]...[ ]   [ ]...[ ]   ...   [ ]...[ ]
       [ ]         [ ]               [ ]
       [ ]         [ ]               [ ]
        :           :                 :
       [ ]         [ ]               [ ]

in which there are some number of (small) stacks, each with some number of top-level "cache" lines to deal with transitory objects controlled by their underlying stack.

[An alternative would be to make the whole top level a unified cache and be able to validate/subselect from any of the underlying stacks. It would make it easier to work with, but (maybe far) more complicated to actually implement.]

Ideally there would be enough stacks to handle some reasonable number of processes/threads - working on the assumption that any given thread will stay within the same (lower-level) segment scope most of the time. When a thread is idled, its stack can be used for some other [similar to how PMMUs deal with process context switches].
>> time, and the entire stack does not have to be examined: you can look
>> only at the top level if, when a descriptor is loaded, its permissions
>> are modified to the intersection of itself and all the underlying
>> descriptors.
>
>The underlying descriptors aren't static.  When process X is executing,
>it likely doesn't have the same permissions as when process Y was
>executing.  So, on a process swap, you'd have to rebuild (recompile)
>each stack to reflect the current process's permissions AT EACH LEVEL
>in the stack.
Or keep multiple stacks. PMMUs deal with multiple processes by including the process id with the page descriptors. Context switches cause descriptors to be spilled/filled if they aren't where they need to be.
>> The hardware would be a (small) stack with an additional top level
>> entry to handle the case of 2 buffers in use simultaneously (e.g., a
>> copy operation).
>>
>>     [ A ]   [ B ]
>>         \   /
>>        [     ]
>>        [     ]
>>        [     ]
>>
>> The control API needs to distinguish 2 top level descriptors and allow
>> at least one of them to push/pop through to the underlying stack.
>
>And, in the very next opcode, DIFFERENT buffers (segments) can be in use.
>Your "hardware stack" needs to be rebuilt -- or "instantly reloaded" -- for
>each such segment/set of segments.  I don't see how this can be faster
>than allowing for DISJOINT "segments" of varying sizes.
You (Don) have to realize that I'm describing the mechanism at a high level and in a simple way. This is NOT a schematic. I haven't thought about this stuff for ages. There always are details that have to be explored - but the general principle is fairly easy to grasp.
>> Each descriptor needs ... probably an ACL ... to indicate what threads
>> can use it - but when any particular thread loads the descriptor to
>> use it, only that thread's id needs to be in the actual MMU hardware.
>
>So, you build this nested structure off in memory someplace (like
>page tables) and expect the hardware to find the stack(s) that need
>to be active for any given operation, conditioned by the ID of the
>process currently executing.  As each stack may be of different
>depths, you would have to manage differing sized chunks of memory
>in that "structure space" (or, force all to have a maximum depth and
>manage fixed size "stacks" regardless of actual content)
Stacks in the SMMU can be spilled/filled linearly - it's only the top level "cache" of transitory descriptors that needs some special handling - probably a hash table to find them quickly when needed.

For the OS part, the segment analog of page tables is an "interval" tree:
https://en.wikipedia.org/wiki/Interval_tree
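For flavor, here's a simplified lookup -- it assumes the segments are sorted and NON-overlapping, which a real interval tree (per the link) would not require; all names are invented:

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint64_t lo, hi;    /* segment covers [lo, hi) */
        void    *desc;      /* descriptor to load into the SMMU */
    } interval;

    /* Binary search over intervals sorted by lo; because they don't
       overlap, addr >= hi[mid] means the match (if any) lies later. */
    void *interval_find(const interval *v, size_t n, uint64_t addr)
    {
        size_t a = 0, b = n;
        while (a < b) {
            size_t mid = a + (b - a) / 2;
            if (addr < v[mid].lo)       b = mid;
            else if (addr >= v[mid].hi) a = mid + 1;
            else return v[mid].desc;
        }
        return NULL;        /* no segment covers addr -> fault */
    }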
>> It isn't as complex as you might think initially - though, obviously,
>> the hardware has to be designed carefully.
>>> "Some day" the hardware (CPU, switch) will move in a direction that is
>>> more accommodating than present day.
>>
>> Doubtful.  CPU vendors have given up on segmentation, and paging units
>> are allowing larger (and larger) pages to match ever growing physical
>> memories.
>
>Modern processors emulate legacy processors.  Who's to say the future
>trend won't expose more of those capabilities to the developer so he
>can emulate some *other* (abstract) processor characteristics?
Progressively, on *average*, each new generation of programmers is less skilled than those that came before. Most current programmers expect the hardware and languages they use to protect them from mistakes. Giving them more sharp tools to cut themselves with will not be a win.
>Look at how readily old processors are emulated by userland SOFTWARE
>(e.g., MAME, Bochs, etc.)  Note that the excess capabilities of modern
>processors need not reflect additional capabilities in the emulated
>applications!  While you COULD run Defender's codebase 100 times faster,
>you'd end up with an unplayable game!
And look at how slow those emulations are [relative to native code].

I believe you were the one who told me that you couldn't rely on virtual machines because, at some point in the future, they might no longer support the instruction sets and devices you needed.

George
