
Page sizes

Started by Don Y June 27, 2020
On 6/29/2020 8:43 PM, upsidedown@downunder.com wrote:
>> But, here, we're not just talking about it as an addressing/protection mode
>> but, also, as a mechanism for VMM.  Unless you back the segment hardware
>> with some ADDITIONAL mechanism to perform that demand-based activity,
>> you're implicitly stating that every "object" must be completely mapped
>> (or not at all)... you have no notion of a smaller mapping unit beneath
>> the segment interface.
>
> Actually swapping out a whole object can be a good idea, assuming each
> "DLL" is stored in a separate object, each with its own code and data
> segments.
This is only possible if you have secondary store. Or, if you know the object is immutable and can be reloaded from the IPL store (or equivalent).
> In case of congestion the OS knows which DLLs are currently active.
> If an object is not currently active, it is a good candidate for
> replacement.  The code segment can be discarded directly and the dirty
> data segment should be swapped out.
>
> Later on, if there is a new reference to a specific DLL, load the code
> and data segments at once as a unit.
I do essentially this -- but up to a scale including whole processors. If a processor is unneeded -- because its services can be hosted elsewhere -- then I move the services onto another node and shut down the processor. Likewise, if I have a shortage of resources, I bring "cold" processors back online and migrate services onto them.

On a smaller scale, if a service is idle, then I can opt to kill it and reload it from the permanent store, when needed (as killing it might let me idle that hosting processor).
> The problem with many paged systems is selecting which pages should be
> replaced, unless there is good LRU (Least Recently Used) hardware
> support.  Lacking sufficient hardware support, pages to be replaced are
> selected at random.
The problem will always exist because the page scheduling policy is typically hard-coded into the kernel. The OS is unaware of the needs of the application domain that it is hosting. So, an idle process may really WANT to "sit watching" and not be swapped out -- esp if there are performance (or correctness) issues associated with how quickly it responds to the "next" call for service!
> This is a problem especially with OSes that support(ed) multiple
> hardware platforms.  Pure pages can be discarded, but dirty pages need
> to be written to the page file.  For instance, in WinNT dirty pages in
> the working set are selected at random and moved to a queue of pages
> to be written into the page file.  If there is a new reference to the
> page in the queue, it is moved back to the working set and removed
> from the queue.  If there are no recent references, the page is
> written to the page file and removed from the queue.  Not very
> optimal.
I let processes manage the memory object(s) that they own. The kernel communicates a need for resources to a "policy process" that has knowledge of all of the resources held locally (memory being just one of those). [There are no *policy* decisions made in the kernel.] It makes a decision as to where to materialize any "spare" resources.

It then contacts the targeted process which, in turn, directs its "memory management process(es)" to free up some resources. However, it is under no obligation to do so. The downside of failing to honor a REQUEST to relinquish resources is that the system can opt to reclaim ALL of your resources (by killing you off!). It can also notice that this was necessary and make a note in the event that your process is reloaded at some later date (i.e., increase your process's "cost" to effectively discourage its use).
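To make the hand-off concrete, here's a minimal sketch of the message types such a protocol might carry -- all of these names are hypothetical, invented for illustration, not taken from the actual codebase:

    /* Hypothetical message types for the reclamation protocol described
       above (kernel -> policy process -> owning process).  C is used
       only for illustration. */
    typedef enum {
        MEM_PRESSURE,     /* kernel -> policy: "materialize N spare pages"  */
        RELINQUISH_REQ,   /* policy -> target: "please free some resources" */
        RELINQUISH_ACK,   /* target -> policy: freed this much (maybe zero) */
        RECLAIM_ALL       /* policy verdict: request refused; kill the
                             holder, reclaim everything, bump its "cost"    */
    } reclaim_kind;

    typedef struct {
        reclaim_kind kind;
        unsigned     pages;    /* requested, or actually released        */
        int          holder;   /* id of the process holding the resources */
    } reclaim_msg;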
On Mon, 29 Jun 2020 23:24:14 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 6/29/2020 9:05 PM, George Neuner wrote:
>> On Sat, 27 Jun 2020 23:50:50 -0700, Don Y
>> <blockedofcourse@foo.invalid> wrote:
>>
>>> On 6/27/2020 10:01 PM, George Neuner wrote:
>>>>>> If you want a *useful* segmenting MMU, you probably need to design it
>>>>>> yourself.  Historically there were some units that did it (what I
>>>>>> would call) right, but none are scalable to modern memory sizes.
>>>>>>
>>>>>> Whatever you do, you want the program to work with flat addresses and
>>>>>> have segmentation applied transparently (like paging) during memory
>>>>>> access.  You certainly DO NOT want to follow the x86 example of
>>>>>> exposing segments in the addressing.
>>>>>
>>>>> Agreed.  Segments were meant to address a different problem.
>>>>>
>>>>> OTOH, exposing them to the instruction set removes any potential
>>>>> ambiguity if two (or more) "general purpose" segments could
>>>>> overlap at a particular spot in the address space; the opcode
>>>>> acts as a disambiguator.
>>>>
>>>> ??? Not following.
>>>
>>> In a large, flat address space, it is conceivable that "general purpose"
>>> segments could overlap.  So, in such an environment, an address presented
>>> to the memory subsystem would have to resolve to SOME particular physical
>>> address, "behind" the segment hardware.  The hardware would have to resolve
>>> any possible ambiguities.  (how do you design the HARDWARE to prevent
>>> ambiguities from arising without increasing its complexity even more??)
>>
>> Unless you refer to x86, I still don't understand what "ambiguity" you
>> are speaking of.
>
>Yes.
>
>> x86 addresses using segments were ambiguous because x86 segmentation
>> was implemented poorly, with the segment being part of the address.
>>
>> A segment should ONLY be a protection zone, never an address
>> remapping.  Segments should be defined only on the flat address space,
>> and the address should be completely resolved before checking it
>> against segment boundaries.
>
>I'd prefer the VMM system to be based on "variable sized pages" (akin
>to segments) as you can emulate "variable sized protection zones" as
>collections of one or more such "pages".  Though I don't claim to need
>"single byte resolution" on such page sizes.
The idea of having multiple page sizes in a single process at least solves the page table size problem. Assume the page table is divided into 3 hierarchical levels, each handling a number of bits of the virtual address. This would make it possible to have Huge, Big and Small pages. The page size bits would have to be moved from the processor status word to each page table entry.

For Huge pages, the top level page table would directly select the Huge page and the remaining virtual address bits would be the offset within the Huge page. However, if the top level page table entry contains the Big page flag, the top level entry would point to the second level page table, which then contains either a pointer directly to the Big page or a pointer to the low level page table to address the Small page. The worst case total page table size (including all three levels) would be a few hundred entries total.

In fact, this would make it possible to have an 8 to 16 bit task ID field to the left of the process specific virtual address going through the virtual memory hierarchy (task:address). Since a reasonable number of bits should be handled by each page table level, the size difference between the different page sizes would have to be hundreds if not thousands.
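As a sketch of what such a walk could look like in C -- the 9-bits-per-level split, the flag names, and the types are all assumptions made for illustration, not any real MMU's layout:

    /* Hypothetical 3-level walk with mixed page sizes: the size flag
       lives in each table entry (PTE_LEAF at a given level means Huge,
       Big or Small), not in a processor status word. */
    #include <stdint.h>

    #define PTE_VALID (1u << 0)
    #define PTE_LEAF  (1u << 1)   /* entry maps a page at this level */

    typedef struct {
        uint64_t next_or_frame;   /* next-level table, or frame base */
        uint32_t flags;
    } pte_t;

    /* Assumed split: level 0 leaves are 1G "Huge" pages, level 1
       leaves are 2M "Big", level 2 leaves are 4K "Small"; 9 index
       bits per level (a 39-bit virtual address, for simplicity). */
    uint64_t translate(pte_t *top, uint64_t va)
    {
        static const int shift[3] = { 30, 21, 12 };  /* 1G, 2M, 4K */
        pte_t *t = top;

        for (int level = 0; level < 3; level++) {
            pte_t *e = &t[(va >> shift[level]) & 0x1FF];
            if (!(e->flags & PTE_VALID))
                return (uint64_t)-1;                 /* page fault */
            if (e->flags & PTE_LEAF)                 /* leaf at this size */
                return e->next_or_frame +
                       (va & ((1ull << shift[level]) - 1));
            t = (pte_t *)(uintptr_t)e->next_or_frame; /* descend */
        }
        return (uint64_t)-1;
    }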
On Mon, 29 Jun 2020 23:24:14 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>> On 6/27/2020 10:01 PM, George Neuner wrote:
>I'd prefer the VMM system to be based on "variable sized pages" (akin
>to segments) as you can emulate "variable sized protection zones" as
>collections of one or more such "pages".  Though I don't claim to need
>"single byte resolution" on such page sizes.
Mill allocates by cache lines (though what that means is model dependent). Is that a granularity you could live with?
>But, the present trend towards larger page sizes renders them less
>useful for many things.  E.g., the 512B VAX page would be an oddity
>in today's world -- I think ARM has some products that offer "tiny"
>1KB pages but even those are off-the-beaten track.
Anything under 4KB is an oddball today. Some 64-bit chips have 1GB pages now ... handy for huge databases and science simulations but not much else.
>> So long as segments are only used as
>> protection zones within the address space, they can overlap in any
>> way.
>
>Then you need a means of resolving which segment has priority at a
>particular logical address.  Do you expect that to be free?
Yes, and the right way to do that is to maintain a segment stack: e.g.,

  - the whole process space
  - the process data space
  - some heap
  -   :
  - this 100 byte buffer

etc., with the protection being the intersection of all permissions present in the stack. Since this naturally coincides with program scopes, the API only needs to manipulate the top of the stack.

For most programming a segment stack needs only a handful of entries. Even 8 entries likely is overkill for the most demanding cases.

But this is a protection mechanism separate from address space allocation, which can be done by pages of any convenient size.
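As a rough illustration of the intersection rule -- the types, sizes, and permission bits here are invented for the sketch, not a real SMMU's format:

    /* Effective rights at an address = bitwise AND of every enclosing
       stack entry.  A real design would fold the intersection in when
       an entry is pushed, so only the top need be checked per access. */
    #include <stdint.h>
    #include <stdbool.h>

    enum { PERM_R = 1, PERM_W = 2, PERM_X = 4 };

    typedef struct {
        uint64_t base, limit;   /* [base, base+limit) in flat space */
        uint8_t  perms;
    } seg_entry;

    typedef struct {
        seg_entry entry[8];     /* "even 8 entries likely is overkill" */
        int       depth;
    } seg_stack;

    bool seg_check(const seg_stack *s, uint64_t addr, uint8_t want)
    {
        uint8_t eff = PERM_R | PERM_W | PERM_X;
        for (int i = 0; i < s->depth; i++) {
            const seg_entry *e = &s->entry[i];
            if (addr < e->base || addr >= e->base + e->limit)
                return false;   /* outside an enclosing scope */
            eff &= e->perms;    /* intersect permissions of all levels */
        }
        return (eff & want) == want;
    }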
>>> The advantage that fixed size (even if there is a selection of sizes
>>> to choose from) pages offers is each page has a particular location
>>> into which it fits.  You don't have to worry that some *other* page
>>> partially overlaps it or that it will overlap another.
>>
>> You do if the space is shared.
>
>So, all processes share a single address space?  And, segments resolve
>whether or not the "current process" can access the particular segment's
>contents?  How does that scale?
No, I mean if the same location is shared between processes [or even threads in your model] that are using different page sizes.
>I think the model of separate address spaces (for processes) is a lot
>easier for folks to grok.  The fact that a particular "memory unit" can
>coexist in more than one (with permissions appropriate to each specific
>process) is a lot easier to manage.
Easier than what? We haven't talked about different address models.
>If I want a particular process to have access to the payload of a network
>packet, do I have to create a "subsegment" that encompasses JUST the payload
>so that I can leave the framing intact -- yet inaccessible?  If that process
>tries to access the portion of the packet that is "locked", it gets
>blocked -- and doesn't that leak information (i.e., the fact that there IS
>framing information surrounding the packet implies that it wasn't sourced
>by a *local* process)?
A subsegment of the program's space certainly. How deep to make the hierarchy largely is up to the programmer and how much she wants the hardware to check. Yes, it may reveal there is something the programmer can't look at. So what? People have been frustrated by locked doors for thousands of years. Besides which, your system is open source, so anybody can go in, peek behind the curtain, and potentially remove any restrictions they don't like.
>[By contrast, if I could create "memory units" that were of any particular
>size, I'd create one that was sized to shrink-wrap to the payload with
>another that encompassed the entire packet.  The first would be mapped
>into the aforementioned process's address space IMMEDIATELY ADJACENT to
>any other packets that were part of the message (as an example).  The
>larger unit would be mapped into the address space of the process that
>was concerned with the framing information.]
I don't see how that's an improvement. Unless you provide byte sized "pages" then mapping over an existing data structure - e.g., to copy it - potentially will leak stuff at the ends. It works for your hypothetical received packet, but not in general.
>>> But, with support for different (fixed) page sizes -- and attendant
>>> performance consequences thereof -- the application needs to hint
>>> the OS on how it plans/needs to use memory in order to make optimum
>>> use of memory system bandwidth.  Silly for the OS to naively choose
>>> a page size for a process based on some crude metric like "size of
>>> object".  That can result in excessive resources being bound that
>>> aren't *typically* USED by that object -- fault in those portions AS
>>> they are needed (why do I need a -- potentially large -- portion of
>>> the object residing in mapped memory if it is only accessed very
>>> infrequently?)
>>>
>>> OTOH, a finer-grained choice (allowing smaller pieces of the object
>>> to be mapped at a time) reduces TLB reach as well as consuming OTHER
>>> resources (e.g., TLB misses) for an object with poor locality of
>>> reference (here-a-hit, there-a-hit, everywhere-a-hit-hit...)
>>
>> Exactly.  The latency of TLB misses is the very reason for the
>> existence of "large" pages in modern operating systems.
>
>But they assume there will be "something large" occupying that physical
>resource -- or not.  I.e., with a 16MB superpage, you really want/need
>something that is "close to" 16MB (or, just treat memory as costless).
The thing is that segments aren't managed by TLB ... they are managed by SLB which can do whatever it wants.
>I wonder how "big" most processes are (in c.a.e products) -- assuming
>the whole process can be mapped into a single contiguous page?
>
>Conversely, I wonder how many "smaller objects" need to be moved between
>address spaces in such products (assuming, of course, that they operate
>under those sorts of protection mechanisms)?
I suspect many are single applications in a single address space.
>Regardless (or, "Irregardless", as sayeth The Rat!), I'm stuck with
>the fixed size pages that vendors currently offer.  So, there can't
>be any "policy" inherent in the crafting of my code as it can't know
>whether it will be able to avail itself of "tiny" pages or if it
>will be packaged in a more wasteful container.
Then what was the point of *this* discussion?  Exercise?

George
On 6/30/2020 2:45 AM, George Neuner wrote:
> On Mon, 29 Jun 2020 23:24:14 -0700, Don Y
> <blockedofcourse@foo.invalid> wrote:
>
>>> On 6/27/2020 10:01 PM, George Neuner wrote:
>
>> I'd prefer the VMM system to be based on "variable sized pages" (akin
>> to segments) as you can emulate "variable sized protection zones" as
>> collections of one or more such "pages".  Though I don't claim to need
>> "single byte resolution" on such page sizes.
>
> Mill allocates by cache lines (though what that means is model
> dependent).  Is that a granularity you could live with?
That depends. At the end of the day, the cache determines performance so anything finer seems wasted. But, I'm not concerned with performance as much as functionality; I'd like to be able to do vm_allocate()s in the same way that I can build buffer pools -- nothing prevents me from building 48-byte buffers, so why shouldn't I (in a world where hardware could do whatever I wanted) be able to create 48-byte "pages"? Then, a 4100-byte page to hold this executable module and a set of 1000 9000-byte pages to hold jumbo packets?

I.e., I'm not artificially constrained by what I can do in software... just by what the hardware will accommodate to match my needs.
>> But, the present trend towards larger page sizes renders them less
>> useful for many things.  E.g., the 512B VAX page would be an oddity
>> in today's world -- I think ARM has some products that offer "tiny"
>> 1KB pages but even those are off-the-beaten track.
>
> Anything under 4KB is an oddball today.  Some 64-bit chips have 1GB
> pages now ... handy for huge databases and science simulations but not
> much else.
Exactly. And, even if you have 1GB objects, there's no guarantee that you would want to dedicate resources to having it completely mapped at any given time. I.e., if you only wanted to map a quarter of it, you have to move to smaller page sizes --> more levels of page-tables. (presumably, you could discipline your software to only access parts that it KNOWS are mapped... but, doesn't that sort of defeat the purpose?)
>>> So long as segments are only used as
>>> protection zones within the address space, they can overlap in any
>>> way.
>>
>> Then you need a means of resolving which segment has priority at a
>> particular logical address.  Do you expect that to be free?
>
> Yes, and the right way to do that is to maintain a segment stack:
> e.g.,
>
>   - the whole process space
>   - the process data space
>   - some heap
>   -   :
>   - this 100 byte buffer
>
> etc., with the protection being the intersection of all permissions
> present in the stack.  Since this naturally coincides with program
> scopes, the API only needs to manipulate the top of the stack.
But you need this for every disjoint 100 (or 300!) byte buffer (or similar object managed as segment). I.e., every accessible segment has to be visible/resolvable in that structure in order to know what ACLs apply to it, NOW.
> For most programming a segment stack needs only a handful of entries.
> Even 8 entries likely is overkill for the most demanding cases.
>
> But this is a protection mechanism separate from address space
> allocation, which can be done by pages of any convenient size.
>
>>>> The advantage that fixed size (even if there is a selection of sizes
>>>> to choose from) pages offers is each page has a particular location
>>>> into which it fits.  You don't have to worry that some *other* page
>>>> partially overlaps it or that it will overlap another.
>>>
>>> You do if the space is shared.
>>
>> So, all processes share a single address space?  And, segments resolve
>> whether or not the "current process" can access the particular segment's
>> contents?  How does that scale?
>
> No, I mean if the same location is shared between processes [or even
> threads in your model] that are using different page sizes.
[threads exist in a shared container so always have the same address space]

But nothing CAN overlap it that isn't intended to be accessible in a given process. E.g., if "foo" resides in a 16K page in process A and an 8K portion of that same physical memory is mapped into process B, then A and B can each access foo -- at potentially different logical addresses. The other 8K of the 16K that is accessible in A need not be mapped in B -- some other (8K) page can appear in that relative location.
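Today's OSes can already express this particular example with standard POSIX calls; a sketch (error handling omitted, and a single program standing in for both "A" and "B" views):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = shm_open("/foo", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, 16 * 1024);

        /* "Process A": map the whole 16K read-write. */
        char *a = mmap(NULL, 16 * 1024, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);

        /* "Process B": map only the first 8K, read-only, at whatever
           address the kernel picks -- the other 8K of A's view simply
           doesn't exist in B's address space. */
        char *b = mmap(NULL, 8 * 1024, PROT_READ, MAP_SHARED, fd, 0);

        a[100] = 42;                   /* visible through both views...  */
        shm_unlink("/foo");
        return b[100] == 42 ? 0 : 1;   /* ...at different addresses      */
    }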
>> If I want a particular process to have access to the payload of a network
>> packet, do I have to create a "subsegment" that encompasses JUST the payload
>> so that I can leave the framing intact -- yet inaccessible?  If that process
>> tries to access the portion of the packet that is "locked", it gets
>> blocked -- and doesn't that leak information (i.e., the fact that there IS
>> framing information surrounding the packet implies that it wasn't sourced
>> by a *local* process)?
>
> A subsegment of the program's space certainly.  How deep to make the
> hierarchy largely is up to the programmer and how much she wants the
> hardware to check.
We're talking about hypothetical hardware so there's no reason it shouldn't "do it all"!
> Yes, it may reveal there is something the programmer can't look at.  So
> what?  People have been frustrated by locked doors for thousands of
> years.  Besides which, your system is open source, so anybody can go
> in, peek behind the curtain, and potentially remove any restrictions
> they don't like.
Wrong point. You don't want the code -- at run time -- to be able to deduce anything that isn't explicitly disclosed to it (FOSS just makes this more damning).
>> [By contrast, if I could create "memory units" that were of any particular
>> size, I'd create one that was sized to shrink-wrap to the payload with
>> another that encompassed the entire packet.  The first would be mapped
>> into the aforementioned process's address space IMMEDIATELY ADJACENT to
>> any other packets that were part of the message (as an example).  The
>> larger unit would be mapped into the address space of the process that
>> was concerned with the framing information.]
>
> I don't see how that's an improvement.  Unless you provide byte sized
> "pages" then mapping over an existing data structure - e.g., to copy
> it - potentially will leak stuff at the ends.
Yes. So pages that are typically considerably larger than the sorts of buffers you are inclined to use will tend to leak MORE.

You can work around this by scrubbing pages (and buffers) after use. But, you still have to rely on discipline to ensure a buffer doesn't get rewritten before being considered "done" (when it can be scrubbed). E.g., I scrub all "messages" at the end of each RPC and return them to the "page pool" to ensure nothing leaks between uses. So, pages that are significantly larger than what is needed for a message represent wasted effort (you have to scrub the whole page because you don't know if the callee scribbled something on it).

But, I can't guarantee pages passed out-of-band won't leak information (though I'm working on an architectural change to address that).
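The scrub itself is trivial -- the discipline is the hard part. A sketch, where page_pool_put() is a hypothetical stand-in for the pool return, not a real API:

    #include <string.h>

    #define PAGE_SIZE 4096

    extern void page_pool_put(void *page);   /* hypothetical pool return */

    void rpc_message_release(void *page)
    {
        /* A plain memset of "dead" memory can be optimized away;
           explicit_bzero (BSDs, glibc >= 2.25) guarantees the scrub
           of the WHOLE page actually happens. */
        explicit_bzero(page, PAGE_SIZE);
        page_pool_put(page);
    }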
>> I wonder how "big" most processes are (in c.a.e products) -- assuming
>> the whole process can be mapped into a single contiguous page?
>>
>> Conversely, I wonder how many "smaller objects" need to be moved between
>> address spaces in such products (assuming, of course, that they operate
>> under those sorts of protection mechanisms)?
>
> I suspect many are single applications in a single address space.
By contrast, I've been working with disjoint address spaces for much of my career (though usually not with hardware protection of those address spaces). E.g., bank-switching TEXT and BSS/DATA/STACK on a per-task basis so each task appears to have its own address space separated from (most of) the other tasks (if a task is small enough, it can share an address space with another task(s)) while the kernel hides in another "hidden" bank.

But, even those COULD benefit from some resource reclamation -- if executing out of RAM (loaded from FLASH). E.g., reclaiming the memory that had been used for initialization (i.e., if you need a KB or so to set things up, then you could reuse that KB for your pushdown stack or run-time buffers).
>> Regardless (or, "Irregardless", as sayeth The Rat!), I'm stuck with
>> the fixed size pages that vendors currently offer.  So, there can't
>> be any "policy" inherent in the crafting of my code as it can't know
>> whether it will be able to avail itself of "tiny" pages or if it
>> will be packaged in a more wasteful container.
>
> Then what was the point of *this* discussion?  Exercise?
Indicating why I think variable size pages are of value. E.g., the "architectural change" that I alluded to, above, will effectively emulate the variable sized pages that I desire. But, will do so at the expense of CPU cycles. <shrug> I've adopted that philosophy throughout the design -- performance always improves (for a given cost) so why not "spend" it on features and mechanisms that make coding easier and more robust? Christ, the system will STILL spend most of its time twiddling its thumbs!

It's the same sort of reasoning that lets my processes decide which of their pages to "swap out" instead of letting the kernel make those decisions blindly. (it's just more opcode fetches!)

[But, using 4K -- or larger -- pages for 500-byte objects just reeks of waste.]

I can't purchase a battery-backed, solar-powered 120-port network switch with PoE (2000W) and PTP support (along with protection against malevolent actors trying to physically damage the switch via exposed connectors). But, I can EMULATE one using COTS parts (and leverage that as an opportunity to add *other* value!).

"Some day" the hardware (CPU, switch) will move in a direction that is more accommodating than present day. If not, my current hardware will only end up FASTER, which means the current implementation will just more closely emulate (performance-wise) that conceptual hardware that might have been available "today"!

[Time for C's morning walk -- while it's still < 90F.]
On Tue, 30 Jun 2020 04:13:23 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 6/30/2020 2:45 AM, George Neuner wrote:
>> On Mon, 29 Jun 2020 23:24:14 -0700, Don Y
>> <blockedofcourse@foo.invalid> wrote:
>> ... the right way to do that is to maintain a segment stack:
>> e.g.,
>>
>>   - the whole process space
>>   - the process data space
>>   - some heap
>>   -   :
>>   - this 100 byte buffer
>>
>> etc., with the protection being the intersection of all permissions
>> present in the stack.  Since this naturally coincides with program
>> scopes, the API only needs to manipulate the top of the stack.
>
>But you need this for every disjoint 100 (or 300!) byte buffer
>(or similar object managed as segment).  I.e., every accessible
>segment has to be visible/resolvable in that structure in order
>to know what ACLs apply to it, NOW.
You need a segment descriptor for every region THAT YOU WANT PROTECTED. So what?

At most any given thread can only be using one or two buffers at a time, and the entire stack does not have to be examined: you can look only at the top level if, when a descriptor is loaded, its permissions are modified to the intersection of itself and all the underlying descriptors.

The hardware would be a (small) stack with an additional top level entry to handle the case of 2 buffers in use simultaneously (e.g., a copy operation).

    [ A ]   [ B ]
        \   /
       [     ]
       [     ]
       [     ]

The control API needs to distinguish 2 top level descriptors and allow at least one of them to push/pop through to the underlying stack.

Each descriptor needs ... probably an ACL ... to indicate what threads can use it - but when any particular thread loads the descriptor to use it, only that thread's id needs to be in the actual MMU hardware.

It isn't as complex as you might think initially - though, obviously, the hardware has to be designed carefully.
>> For most programming a segment stack needs only a handful of entries.
>> Even 8 entries likely is overkill for the most demanding cases.
>>
>> But this is a protection mechanism separate from address space
>> allocation, which can be done by pages of any convenient size.
>>>>> The advantage that fixed size (even if there is a selection of sizes
>>>>> to choose from) pages offers is each page has a particular location
>>>>> into which it fits.  You don't have to worry that some *other* page
>>>>> partially overlaps it or that it will overlap another.
>>>>
>>>> You do if the space is shared.
>>>
>>> So, all processes share a single address space?  And, segments resolve
>>> whether or not the "current process" can access the particular segment's
>>> contents?  How does that scale?
>>
>> ... I mean if the same location is shared between processes [or even
>> threads in your model] that are using different page sizes.
>
>[threads exist in a shared container so always have the same address space]
>
>But nothing CAN overlap it that isn't intended to be accessible
>in a given process.  E.g., if "foo" resides in a 16K page in process A
>and an 8K portion of that same physical memory is mapped into process B,
>then A and B can each access foo -- at potentially different logical
>addresses.  The other 8K of the 16K that is accessible in A need not
>be mapped in B -- some other (8K) page can appear in that relative
>location.
So no different than what OSes are doing now.
>> [segments] may reveal there is something the programmer can't look at.  So
>> what?  People have been frustrated by locked doors for thousands of
>> years.  Besides which, your system is open source, so anybody can go
>> in, peek behind the curtain, and potentially remove any restrictions
>> they don't like.
>
>Wrong point.
>
>You don't want the code -- at run time -- to be able to deduce anything that
>isn't explicitly disclosed to it (FOSS just makes this more damning).
That's a straw man ... code can deduce that it's running in a sandbox simply by failure to deduce certain things about its environment. Note that "failure" here includes both getting no answer and getting an answer that is unreasonable for real hardware.

Since your aim AIUI is to build a control system, there is a limit to how far you can go in sandboxing programs.
>"Some day" the hardware (CPU, switch) will move in a direction that is >more accommodating than present day.
Doubtful. CPU vendors have given up on segmentation, and paging units are allowing larger (and larger) pages to match ever growing physical memories.

What you seem to want is a real capability machine ... which is something no vendor will ever produce while C continues to be the language of system programming.

Like paging, segmentation is at least something that can be added (at some cost of latency) to an existing CPU: you just need to design your SMMU and insert it into the memory access path.

Historically there were some examples of doing this: e.g., augmenting a 68000/10/12 with a 68451 SMMU. But since interest in segmentation (properly done) died with 32-bit CPUs in the "micro" arena, there is not much to look at. Most strong uses of segmentation are only to be found in mainframe CPUs from the 60s and 70s.

George
On 6/30/2020 1:00 AM, upsidedown@downunder.com wrote:
>> I'd prefer the VMM system to be based on "variable sized pages" (akin
>> to segments) as you can emulate "variable sized protection zones" as
>> collections of one or more such "pages".  Though I don't claim to need
>> "single byte resolution" on such page sizes.
>
> The idea of having multiple size pages in a single process at least
> solves the page table size problem.
Unless you can move all/most of the page table into the TLB, making it "smaller" (more manageable) doesn't buy you much. Esp if the page size is larger than needed (i.e., the "work" is dealing with the need for multiple "too big" pages).

Advances in (paged) MMU performance will come from larger TLBs. For applications, TLB size is typically not a driving issue. They "do something" and keep on doing it -- before moving on to something else. They tend not to "hop around" memory.

It's OSs that suffer from piss-poor data locality. But, you can't predict how much time an application will spend IN the OS so can't make sweeping generalizations as to when/if you should lock cache to GAIN performance without risking LOSING performance.
On 6/30/2020 9:49 AM, George Neuner wrote:
>>> ... the right way to do that is to maintain a segment stack:
>>> e.g.,
>>>
>>>   - the whole process space
>>>   - the process data space
>>>   - some heap
>>>   -   :
>>>   - this 100 byte buffer
>>>
>>> etc., with the protection being the intersection of all permissions
>>> present in the stack.  Since this naturally coincides with program
>>> scopes, the API only needs to manipulate the top of the stack.
>>
>> But you need this for every disjoint 100 (or 300!) byte buffer
>> (or similar object managed as segment).  I.e., every accessible
>> segment has to be visible/resolvable in that structure in order
>> to know what ACLs apply to it, NOW.
>
> You need a segment descriptor for every region THAT YOU WANT
> PROTECTED.  So what?
You need a segment STACK for each segment that is exposed.

  - the whole process space
  - the process data space
  - some heap
  -   :
  - this 100 byte buffer

and

  - the whole process space
  - the process data space
  - some heap
  -   :
  - some OTHER 400 byte buffer

and

  - the whole process space
  - the process data space
  - some OTHER heap
  -   :
  - yet ANOTHER 100 byte buffer

and

  - some OTHER process space
  - the process data space
  - some heap
  -   :
  - yet ANOTHER 100 byte buffer

etc.

Each time you create a segment, you'd have to "recompile" the segment stack(s) for the different environments in which it is accessible (or, be forced to evaluate each on-the-fly).
> At most any given thread can only be using one or two buffers at a
> time ...
huh? If by "at a time" you mean "in a trivially tiny interval of time" then I might agree. But, a thread could be accessing lots of buffers with no readily predictable pattern of the order in which they are accessed.
> time, and the entire stack does not have to be examined: you can look
> only at the top level if, when a descriptor is loaded, its permissions
> are modified to the intersection of itself and all the underlying
> descriptors.
The underlying descriptors aren't static. When process X is executing, it likely doesn't have the same permissions as when process Y was executing. So, on a process swap, you'd have to rebuild (recompile) each stack to reflect the current process's permissions AT EACH LEVEL in the stack.
> The hardware would be a (small) stack with an additional top level
> entry to handle the case of 2 buffers in use simultaneously (e.g., a
> copy operation).
>
>     [ A ]   [ B ]
>         \   /
>        [     ]
>        [     ]
>        [     ]
>
> The control API needs to distinguish 2 top level descriptors and allow
> at least one of them to push/pop through to the underlying stack.
And, in the very next opcode, DIFFERENT buffers (segments) can be in use. Your "hardware stack" needs to be rebuilt -- or "instantly reloaded" -- for each such segment/set of segments. I don't see how this can be faster than allowing for DISJOINT "segments" of varying sizes.
> Each descriptor needs ... probably an ACL ... to indicate what threads
> can use it - but when any particular thread loads the descriptor to
> use it, only that thread's id needs to be in the actual MMU hardware.
So, you build this nested structure off in memory someplace (like page tables) and expect the hardware to find the stack(s) that need to be active for any given operation, conditioned by the ID of the process currently executing. As each stack may be of different depths, you would have to manage differing sized chunks of memory in that "structure space" (or, force all to have a maximum depth and manage fixed size "stacks" regardless of actual content)
> It isn't as complex as you might think initially - though, obviously,
> the hardware has to be designed carefully.
>>> For most programming a segment stack needs only a handful of entries.
>>> Even 8 entries likely is overkill for the most demanding cases.
>>>
>>> But this is a protection mechanism separate from address space
>>> allocation, which can be done by pages of any convenient size.
>
>>>>>> The advantage that fixed size (even if there is a selection of sizes
>>>>>> to choose from) pages offers is each page has a particular location
>>>>>> into which it fits.  You don't have to worry that some *other* page
>>>>>> partially overlaps it or that it will overlap another.
>>>>>
>>>>> You do if the space is shared.
>>>>
>>>> So, all processes share a single address space?  And, segments resolve
>>>> whether or not the "current process" can access the particular segment's
>>>> contents?  How does that scale?
>>>
>>> ... I mean if the same location is shared between processes [or even
>>> threads in your model] that are using different page sizes.
>>
>> [threads exist in a shared container so always have the same address space]
>>
>> But nothing CAN overlap it that isn't intended to be accessible
>> in a given process.  E.g., if "foo" resides in a 16K page in process A
>> and an 8K portion of that same physical memory is mapped into process B,
>> then A and B can each access foo -- at potentially different logical
>> addresses.  The other 8K of the 16K that is accessible in A need not
>> be mapped in B -- some other (8K) page can appear in that relative
>> location.
>
> So no different than what OSes are doing now.
Exactly. So, "no problem" regardless of whether it is a "shared space" or not.
>>> [segments] may reveal there is something the programmer can't look at.  So
>>> what?  People have been frustrated by locked doors for thousands of
>>> years.  Besides which, your system is open source, so anybody can go
>>> in, peek behind the curtain, and potentially remove any restrictions
>>> they don't like.
>>
>> Wrong point.
>>
>> You don't want the code -- at run time -- to be able to deduce anything that
>> isn't explicitly disclosed to it (FOSS just makes this more damning).
>
> That's a straw man ... code can deduce that it's running in a sandbox
> simply by failure to deduce certain things about its environment.  Note
> that "failure" here includes both getting no answer and getting an
> answer that is unreasonable for real hardware.
With variable size "pages", I can FILL your address space with stuff you SHOULD be able to see and provide no clues about how much stuff you CAN'T see -- because the unseeable portions of the address space don't consume any part of your logical address space. EVERY address resolves to data that you can see, so how do you know of the existence of unseeable data, except in the theoretical sense?

It's like my per-process namespaces... you KNOW there are things out there that you can't "see", but how can you exploit that to the detriment of the system? You can waste YOUR resources trying every name imaginable -- but none of them (other than those mapped into your context) will resolve to anything! Measure how much power the CPU uses while you are making these attempts. Measure how much time the OS takes to return your NAK. There's nothing revealed in a side-channel.

[you might be able to deduce if a task switch occurred "while away". Or, if the service is hosted locally vs. remotely. But, you can't gain any information about the objects hidden from you and the names that won't resolve]

OTOH, if you stumble upon 5 asterisks in a latent image of the "password" textbox, you can deduce that the most recently entered password had 5 characters, even though you can't *see* any of them.
> Since your aim AIUI is to build a control system, there is a limit to > how far you can go in sandboxing programs.
Current app has no bearing on future apps. Some of my colleagues are using my codebase for very different applications. This has led to many additional mechanisms being included that complicate my application but are necessary for theirs (I've been able to implement all as "bolt-ons" due to my choice of architecture)
>> "Some day" the hardware (CPU, switch) will move in a direction that is
>> more accommodating than present day.
>
> Doubtful.  CPU vendors have given up on segmentation, and paging units
> are allowing larger (and larger) pages to match ever growing physical
> memories.
Modern processors emulate legacy processors. Who's to say the future trend won't expose more of those capabilities to the developer so he can emulate some *other* (abstract) processor characteristics?

Motogorilla's RGP implemented Bresenham's algorithm in the instruction set. Give me access to the microcode in a modern processor and I can emulate *that* processor's instruction set (though perhaps with different instruction timings).

In the 70's, I designed a graphics processor that treated pixel arrays as objects that could be stacked in three dimensions (in front of/behind other objects) as well as signal when any two objects "overlapped" ("collision detection" in a video game -- "bullet hits enemy"). Again, let me rewrite a modern CPU's microcode and I can provide the same feature interface while taking advantage of all the fabrication advancements in the past 50 years!

The 99000 maintained its register complement in main memory ("workspaces"). Nothing stops me from emulating that capability, today. Etc.

Look at how readily old processors are emulated by userland SOFTWARE (e.g., MAME, Bochs, etc.) Note that the excess capabilities of modern processors need not reflect additional capabilities in the emulated applications! While you COULD run Defender's codebase 100 times faster, you'd end up with an unplayable game!

Just because a vendor doesn't offer a particular solution OTS doesn't mean you can't MAKE the solution that you want. Presently, FPGAs are too general-purpose to efficiently implement these more complex designs (whereas microcoded CPUs have already optimized-away some of the flexibility that isn't NEEDED for a CPU). But, there's no reason to think that future FPGAs might not be tailored to more specific market segments. Or, that foundries won't offer services that allow for "programmed designs" to be more economically committed to fixed silicon. This was already in the works decades ago when I did my last "custom".

I'm out. I've got too much work to do. I'm hoping to have the next off-site here, when "social distancing" eases and I can host my colleagues (so they can see some of my toys in use!). I don't think any of them are eager to deal with the extent of our "outbreak", nor the insane heat!
> What you seem to want is a real capability machine ... which is
> something no vendor will ever produce while C continues to be the
> language of system programming.
>
> Like paging, segmentation is at least something that can be added (at
> some cost of latency) to an existing CPU: you just need to design your
> SMMU and insert it into the memory access path.
>
> Historically there were some examples of doing this: e.g., augmenting
> a 68000/10/12 with a 68451 SMMU.  But since interest in segmentation
> (properly done) died with 32-bit CPUs in the "micro" arena, there is
> not much to look at.  Most strong uses of segmentation are only to be
> found in mainframe CPUs from the 60s and 70s.
>
> George
On Tue, 30 Jun 2020 12:23:17 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 6/30/2020 1:00 AM, upsidedown@downunder.com wrote:
>>> I'd prefer the VMM system to be based on "variable sized pages" (akin
>>> to segments) as you can emulate "variable sized protection zones" as
>>> collections of one or more such "pages".  Though I don't claim to need
>>> "single byte resolution" on such page sizes.
>>
>> The idea of having multiple size pages in a single process at least
>> solves the page table size problem.
>
>Unless you can move all/most of the page table into the TLB,
>making it "smaller" (more manageable) doesn't buy you much.
>Esp if the page size is larger than needed (i.e., the "work"
>is dealing with the need for multiple "too big" pages)
I just tried to show you how to implement the variable size pages that you preferred.

Assume an x86-64 style system, with available page sizes of 4 K, 2 M and 1 G, and assume you want to map a 3.123 G object. This will require:

- 780750 page table entries (PTEs) with 4 K pages, or
- 1562 PTEs with 2 M page size (with 1 M lost), or
- 4 PTEs if 1 G pages are used (with 877 M lost).

Assuming the page size can be mixed in the same process, mapping the 3.123 G object would require:

- 3 PTEs with 1 G pages _and_
- 61 PTEs of 2 M pages _and_
- 250 PTEs of 4 K pages,

a total of 314 PTEs, which should be easier to fit into TLBs than the thousands required by fixed 4 K or 2 M pages.

Note: For illustrative purposes it is assumed that K is 1000, M is 1E6 and G is 1E9 in the calculations above.
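The same arithmetic in a few lines of C, for anyone who wants to check it (decimal K/M/G as noted above):

    #include <stdio.h>

    int main(void)
    {
        long long obj = 3123000000LL;          /* 3.123 G object     */
        long long g   = obj / 1000000000LL;    /* 3 one-G pages      */
        long long rem = obj % 1000000000LL;    /* 123 M remaining    */
        long long m   = rem / 2000000LL;       /* 61 two-M pages     */
        rem %= 2000000LL;                      /* 1 M remaining      */
        long long k   = rem / 4000LL;          /* 250 four-K pages   */

        printf("%lld + %lld + %lld = %lld entries\n", g, m, k, g + m + k);
        return 0;   /* prints: 3 + 61 + 250 = 314 entries */
    }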
On 6/30/2020 22:23, Don Y wrote:
> On 6/30/2020 1:00 AM, upsidedown@downunder.com wrote:
>>> I'd prefer the VMM system to be based on "variable sized pages" (akin
>>> to segments) as you can emulate "variable sized protection zones" as
>>> collections of one or more such "pages".  Though I don't claim to need
>>> "single byte resolution" on such page sizes.
>>
>> The idea of having multiple size pages in a single process at least
>> solves the page table size problem.
>
> Unless you can move all/most of the page table into the TLB,
> making it "smaller" (more manageable) doesn't buy you much.
> Esp if the page size is larger than needed (i.e., the "work"
> is dealing with the need for multiple "too big" pages)
>
> Advances in (paged) MMU performance will come from larger TLBs.
> For applications, TLB size is typically not a driving issue.
> They "do something" and keep on doing it -- before moving on
> to something else.  They tend not to "hop around" memory.
>
> It's OSs that suffer from piss-poor data locality.  But, you
> can't predict how much time an application will spend IN the OS
> so can't make sweeping generalizations as to when/if you should
> lock cache to GAIN performance without risking LOSING performance.
Hi Don,

I think you are chasing your own tail with that. Why do you want to have all pages in a TLB at the same time? Going through, say, 4G of memory at 2 ns per access will take well above a second through a 64 bit bus. So even if you waste a whole ms on tablewalks in the process, you will still have wasted only 0.1% of the time.

Having large pages or BAT translation makes things easy enough; nobody seems to have been looking for newer solutions for the last few decades because there is no need for them. Just accept the granularity. If you want the extra predictability and/or stability for your system, just make sure you have enough physical memory so you can map all the used logical memory without ever needing to swap, and you are done.

If 4k granularity is too much -- it can be even in large systems -- just make your own pools of smaller granularity in logical address space; this is how I have been doing it for decades now in DPS.

Dimiter

======================================================
Dimiter Popoff, TGI             http://www.tgi-sci.com
======================================================
http://www.flickr.com/photos/didi_tgi/
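A minimal free-list sketch of the kind of sub-page pool described above -- this is illustrative C with invented names, not the actual DPS code, and it assumes a single thread:

    /* Carve odd-sized "pages" (48-byte buffers) out of one normal
       hardware page, in software -- what the MMU won't do for you. */
    #include <stddef.h>

    #define BUF_SIZE  48
    #define POOL_SIZE 4096                /* one conventional page */

    typedef struct free_node { struct free_node *next; } free_node;

    static free_node *free_list;

    void pool_init(void *page)            /* thread page into buffers */
    {
        char *p = page;
        for (size_t off = 0; off + BUF_SIZE <= POOL_SIZE; off += BUF_SIZE) {
            free_node *n = (free_node *)(p + off);
            n->next = free_list;
            free_list = n;
        }
    }

    void *pool_alloc(void)                /* NULL when pool is empty */
    {
        free_node *n = free_list;
        if (n) free_list = n->next;
        return n;
    }

    void pool_free(void *buf)
    {
        free_node *n = buf;
        n->next = free_list;
        free_list = n;
    }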
On Tue, 30 Jun 2020 12:44:17 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 6/30/2020 9:49 AM, George Neuner wrote:
>>>>
>>>> ... the right way to do that is to maintain a segment stack:
>>>> e.g.,
>>>>
>>>>   - the whole process space
>>>>   - the process data space
>>>>   - some heap
>>>>   -   :
>>>>   - this 100 byte buffer
>>>>
>>>> etc., with the protection being the intersection of all permissions
>>>> present in the stack.  Since this naturally coincides with program
>>>> scopes, the API only needs to manipulate the top of the stack.
>>>
>>> But you need this for every disjoint 100 (or 300!) byte buffer
>>> (or similar object managed as segment).  I.e., every accessible
>>> segment has to be visible/resolvable in that structure in order
>>> to know what ACLs apply to it, NOW.
>>
>> You need a segment descriptor for every region THAT YOU WANT
>> PROTECTED.  So what?
>
>You need a segment STACK for each segment that is exposed.
>
>  - the whole process space
>  - the process data space
>  - some heap
>  -   :
>  - this 100 byte buffer
>
>and
>
>  - the whole process space
>  - the process data space
>  - some heap
>  -   :
>  - some OTHER 400 byte buffer
>
>and ...
For application programming in general you need one stack per process/thread. Application "objects" are disjoint only at top level ... underneath there are large common segments which follow the scoping contour of the program.

    object  *        object *     object *
               :
    heap    ***********
    dseg    ***        ****************************
    process ****************************************************

All the "objects" are subsegments of a heap, which is a subsegment of the process's data segment, which is a subsegment of the process's total address space.

Yes, this is simplistic, but also realistic. Many applications will never go farther than this. Yes, a single thread may want to work simultaneously with objects that were allocated from different heaps, and THAT would require dealing with multiple stacks ... but it isn't a typical behavior for many applications.

Obviously kernel programming is a different matter, and typical applications in *your* system may not conform to these behavioral generalities. The point is that it can be handled, and not by needing a stack for each object. See below.
>Each time you create a segment, you'd have to "recompile" the
>segment stack(s) for the different environments in which it is
>accessible (or, be forced to evaluate each on-the-fly)
Most use of segments will be strongly hierarchically scoped. As an *analogy*, recall Pascal's record access using WITH:

    with the_process_space do
      with the_data_segment do
        with my_heap do
          :
          with objA, objB do
            :
          end;
          :
        end;
      end;
    end;

[note: the multiple WITH above isn't legal Pascal, but something like it that allowed an arbitrary number of identifiers would be a useful construct for working with segments.]

Yes, it is a runtime evaluation, but think about it: all that is really happening is comparing a few ids, limits and access rights to see if they are compatible, and then shoving descriptors into SMMU cache.

Also remember that, e.g., the "my_heap" segment above may control most/all of the data accesses of entire threads and be in scope in the process for long periods of time. And an *application* typically would have no way to refer to the descriptors for its process and data space - those would come from the OS and be implied within the application.

And yes, there would have to be some variant to deal with really disjoint spaces: different heaps for an application, different processes for the kernel, etc. But that's just syntax over the mechanism.
>> At most any given thread can only be using one or two buffers at a
>
>huh?  If by "at a time" you mean "in a trivially tiny interval of time"
>then I might agree.  But, a thread could be accessing lots of buffers
>with no readily predictable pattern of the order in which they are
>accessed.
Yes, but the segment "cache" doesn't need to be restricted to minimums [that actually would not be very smart]. The SMMU doesn't have to maintain stacks of descriptors if you're willing to go to memory when they change (but sometimes that will be necessary anyway - just like in paging). But there will be segments that are in force for long periods and those should be cached for fast access.

So a more realistic design might look something like:

    [ ]...[ ]   [ ]...[ ]   ...   [ ]...[ ]
       [ ]         [ ]               [ ]
       [ ]         [ ]               [ ]
        :           :                 :
       [ ]         [ ]               [ ]

in which there are some number of (small) stacks, each with some number of top-level "cache" lines to deal with transitory objects controlled by their underlying stack.

[An alternative would be to make the whole top level a unified cache and be able to validate/subselect from any of the underlying stacks. It would make it easier to work with, but (maybe far) more complicated to actually implement.]

Ideally there would be enough stacks to handle some reasonable number of processes/threads - working on the assumption that any given thread will stay within the same (lower-level) segment scope most of the time. When a thread is idled, its stack can be used for some other [similar to how PMMUs deal with process context switches].
>> time, and the entire stack does not have to be examined: you can look
>> only at the top level if, when a descriptor is loaded, its permissions
>> are modified to the intersection of itself and all the underlying
>> descriptors.
>
>The underlying descriptors aren't static.  When process X is executing,
>it likely doesn't have the same permissions as when process Y was
>executing.  So, on a process swap, you'd have to rebuild (recompile)
>each stack to reflect the current process's permissions AT EACH LEVEL
>in the stack.
Or keep multiple stacks. PMMUs deal with multiple processes by including the process id with the page descriptors. Context switches cause descriptors to be spilled/filled if they aren't where they need to be.
>> The hardware would be a (small) stack with an additional top level
>> entry to handle the case of 2 buffers in use simultaneously (e.g., a
>> copy operation).
>>
>>     [ A ]   [ B ]
>>         \   /
>>        [     ]
>>        [     ]
>>        [     ]
>>
>> The control API needs to distinguish 2 top level descriptors and allow
>> at least one of them to push/pop through to the underlying stack.
>
>And, in the very next opcode, DIFFERENT buffers (segments) can be in use.
>Your "hardware stack" needs to be rebuilt -- or "instantly reloaded" -- for
>each such segment/set of segments.  I don't see how this can be faster
>than allowing for DISJOINT "segments" of varying sizes.
You (Don) have to realize that I'm describing the mechanism at a high level and in a simple way. This is NOT a schematic. I haven't thought about this stuff for ages. There always are details that have to be explored - but the general principle is fairly easy to grasp.
>> Each descriptor needs ... probably an ACL ... to indicate what threads
>> can use it - but when any particular thread loads the descriptor to
>> use it, only that thread's id needs to be in the actual MMU hardware.
>
>So, you build this nested structure off in memory someplace (like
>page tables) and expect the hardware to find the stack(s) that need
>to be active for any given operation, conditioned by the ID of the
>process currently executing.  As each stack may be of different
>depths, you would have to manage differing sized chunks of memory
>in that "structure space" (or, force all to have a maximum depth and
>manage fixed size "stacks" regardless of actual content)
Stacks in the SMMU can be spilled/filled linearly - it's only the top level "cache" of transitory descriptors that needs some special handling - probably a hash table to find them quickly when needed.

For the OS part, the segment analog of page tables is an "interval" tree:
https://en.wikipedia.org/wiki/Interval_tree
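For flavor, here's a simplified lookup -- it assumes the segments are sorted and NON-overlapping, which a real interval tree (per the link) would not require; all names are invented:

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint64_t lo, hi;    /* segment covers [lo, hi) */
        void    *desc;      /* descriptor to load into the SMMU */
    } interval;

    /* Binary search over intervals sorted by lo; because they don't
       overlap, addr >= hi[mid] means the match (if any) lies later. */
    void *interval_find(const interval *v, size_t n, uint64_t addr)
    {
        size_t a = 0, b = n;
        while (a < b) {
            size_t mid = a + (b - a) / 2;
            if (addr < v[mid].lo)       b = mid;
            else if (addr >= v[mid].hi) a = mid + 1;
            else return v[mid].desc;
        }
        return NULL;        /* no segment covers addr -> fault */
    }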
>> It isn't as complex as you might think initially - though, obviously,
>> the hardware has to be designed carefully.
>>> "Some day" the hardware (CPU, switch) will move in a direction that is
>>> more accommodating than present day.
>>
>> Doubtful.  CPU vendors have given up on segmentation, and paging units
>> are allowing larger (and larger) pages to match ever growing physical
>> memories.
>
>Modern processors emulate legacy processors.  Who's to say the future
>trend won't expose more of those capabilities to the developer so he
>can emulate some *other* (abstract) processor characteristics?
Progressively, on *average*, each new generation of programmers is less skilled than those that came before. Most current programmers expect the hardware and languages they use to protect them from mistakes. Giving them more sharp tools to cut themselves with will not be a win.
>Look at how readily old processors are emulated by userland SOFTWARE
>(e.g., MAME, Bochs, etc.)  Note that the excess capabilities of modern
>processors need not reflect additional capabilities in the emulated
>applications!  While you COULD run Defender's codebase 100 times faster,
>you'd end up with an unplayable game!
And look at how slow those emulations are [relative to native code].

I believe you were the one who told me that you couldn't rely on virtual machines because, at some point in the future, they might no longer support the instruction sets and devices you needed.

George
