
Page sizes

Started by Don Y June 27, 2020
On Sunday, June 28, 2020 at 2:24:35 AM UTC-4, upsid...@downunder.com wrote:
> On Sat, 27 Jun 2020 18:35:57 -0400, George Neuner > <gneuner2@comcast.net> wrote: > > >On Sat, 27 Jun 2020 15:10:22 -0700, Don Y > ><blockedofcourse@foo.invalid> wrote: > > > >>Hi George, > >> > >>Hope you are keeping well... bad pandemic response, here; really > >>high temperatures; increasing humidity; and lots of smoke in the > >>air (but really cool "displays" at night!) :< Time to make some > >>ice cream and enjoy the ride! :> > >> > >>On 6/27/2020 2:37 PM, George Neuner wrote: > >>> Hi Don, > >>> On Fri, 26 Jun 2020 21:45:54 -0700, Don Y > >>> <blockedofcourse@foo.invalid> wrote: > >>> > >>>> Are there any processors/PMMUs for which the following would be > >>>> true (nonzero)? > >>>> > >>>> (pagesize - 1) & pagesize > >>> > >>> Not anything you can buy. > >> > >>I'm wondering if some of the "classic" designs might scale to newer > >>device geometries better than some of the newer architectures? > >> > >>E.g., supporting ~100 (variable sized) segments concurrently and > >>binding each to a particular "object" (for want of a better word). > >>If the segment management hardware automatically reloads (in a manner > >>similar to the TLBs functionality), then this should yield better > >>(or comparable) performance to the fixed page-size approach (if > >>you assume the fixed pages poorly "fit" the types of "objects" > >>that you are mapping) > >> > >>[I think we discussed this -- or something similar -- a while ago] > > > >About ~10 years ago 8-) > > > >But you asked about "pages" here, which invariably are fixed sized > >entities. Arbitrarily sized "segments" are a different subject. > > > >If you want a *useful* segmenting MMU, you probably need to design it > >yourself. Historically there were some units that did it (what I > >would call) right, but none are scalable to modern memory sizes. > > > >Whatever you do, you want the program to work with flat addresses and > >have segmentation applied transparently (like paging) during memory > >access. You certainly DO NOT want to follow the x86 example of > >exposing segments in the addressing. > > > The problem with segmented access in x86 is the far too small number > of segment registers. In addition on 8086 the problem was the small > maximum segment size (64 KiB). A small segment size is not a problem > for code, since subroutines are generally much smaller than that, but > data access to a large arrays is a pain. > > Segments are nice if you are going to use shared loadable libraries > ("DLLs"). Just load it and use original link time addresses, no need > for fix-ups at load time. > > In a single 386 style single code space, loading a shared library > needs fix-ups at load time (it is not always possible to make > everything position independent). Also if two libraries are linked for > the same virtual address, at least the other library needs to be > rebased at a different virtual address to avoid the conflict. > > Making fix-ups into the code, means that the fixed page becomes dirty > and can't be shared by multiple processed in the system, by ether > making a copy of the whole library and making fix-ups to the private > copy or at least store the dirty pages in the process specific page > file. > > In a good segmented system (with sufficient segment registers) can > directly share the same library in multiple processes. Since all code > pages are read-only, no need to store it to a page file if running out > of memory,
I believe the x86 segments can be 1 MB each. At the time the x86 architecture was developed, 1 MB of memory was HUGE! I want to say 16 Mb DRAM chips were either just invented, or not yet invented. Either way, it would take a lot of chips and money to fully populate an IBM PC with the full 640 kB of RAM. So 1 MB segments appeared very large and reasonable at the time.

I recall, some years after the x86 and the IBM PC burst onto the scene, when DRAM was slightly less expensive and you could get as much as 16 MB in a PC, the people writing FPGA development software were telling me 16 MB was nothing and the software would never run well until machines got a lot more memory. My laptop now has 32 GB of RAM. FPGA design goes quite well.

--
Rick C.
- Get 1,000 miles of free Supercharging - Tesla referral code - https://ts.la/richard11209
On Sat, 27 Jun 2020 23:50:50 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 6/27/2020 10:01 PM, George Neuner wrote: >>>> If you want a *useful* segmenting MMU, you probably need to design it >>>> yourself. Historically there were some units that did it (what I >>>> would call) right, but none are scalable to modern memory sizes. >>>> >>>> Whatever you do, you want the program to work with flat addresses and >>>> have segmentation applied transparently (like paging) during memory >>>> access. You certainly DO NOT want to follow the x86 example of >>>> exposing segments in the addressing. >>> >>> Agreed. Segments were meant to address a different problem. >>> >>> OTOH, exposing them to the instruction set removes any potential >>> ambiguity if two (or more) "general purpose" segments could >>> overlap at a particular spot in the address space; the opcode >>> acts as a disambiguator. >> >> ??? Not following. > >In a large, flat address space, it is conceivable that "general purpose" >segments could overlap. So, in such an environment, an address presented >to the memory subsystem would have to resolve to SOME particular physical >address, "behind" the segment hardware. The hardware would have to resolve >any possible ambiguities. (how do you design the HARDWARE to prevent >ambiguities from arising without increasing its complexity even more??). > >If, instead, the segments are exposed to the programmer, then the >choice of opcode determines which segment (hardware) is consulted >to resolve the reference(s). Any "overlap" becomes unimportant. > >>> The PMMU approach sidesteps this issue by rigidly defining where >>> (in the physical and virtual address spaces) a new page CAN begin. >>> It's bank-switching-on-steroids... >>> >>> [IIRC, I had previously concluded that variable sizes were impractical >>> for reasons like this] >> >> The problem is that you're thinking only about the protection aspect >> ... it's the subdivision management of the address space that is made >> slow and difficult if you allow mapping arbitrarily sized regions. > >Perhaps you missed: > > 'You still have a "packing problem" but with a virtual address space > per process, you'd only have to address the "objects" with which a > particular process interacted in any particular address space. > And, that binding (for PIC) could be done at compile time *or* > load time (the latter being more flexible) -- or even RUN-time!' > >You have N "modules" in a typical application. The linkage editor mashes >them together into a single binary to be loaded, ensuring that they don't >overlap each other (d'uh!). Easy-peasy. > >You have the comparable problem with each segment representing a >discrete "object" being made to coexist disjointedly in a single >address space. > >If the "objects" never change, over time, then this is no harder to >address than the linkage editor problem (assuming any segment can >being at any location and have any size). Especially for PIC. > >But, if segments can be added/removed/resized dynamically, then >you're essentially dealing with the same sort of fragmentation >problem that arises in heap management AND the same sort of >algorithm choices for selecting WHERE to create the next requested >segment (unless you pass that off to the application to handle >as IT knows what its current and future needs will be).
The ancient method of increasing a process's address space was to swap the whole process out to the swap file, increase its size descriptor, and then let the program loader find a new area in physical memory. Of course this is a heavy operation and should not be done for every malloc() call :-). By monitoring physical memory usage it was easy to tell when a process had extended its virtual memory allocation: you would see the momentary swap-outs.
On Sun, 28 Jun 2020 00:11:06 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 6/27/2020 11:24 PM, upsidedown@downunder.com wrote: >> The problem with segmented access in x86 is the far too small number >> of segment registers. In addition on 8086 the problem was the small >> maximum segment size (64 KiB). A small segment size is not a problem >> for code, since subroutines are generally much smaller than that, but >> data access to a large arrays is a pain. > >The problem with segments is they are a hack to work-around a previous >constraint that was arbitrarily imposed on CPU architectures. When will >we find a 32b space insufficient to represent run-time objects? (it's >already insufficient for filesystems)
The 8086 gave segmentation a bad reputation. More modern implementations talk about "objects". Intel tried this with the iAPX 432, but the technology of the day was inadequate. IIRC, the IBM AS/400 supermini also used some kind of object-based features.
>> Segments are nice if you are going to use shared loadable libraries >> ("DLLs"). Just load it and use original link time addresses, no need >> for fix-ups at load time. > >Note that you can get, effectively, the same capability by putting the >object (.so) in a separate (virtual) address space. But, then incur >the costs of IPC for all references.
If you need to transfer large amounts of data, you are going to need shared pages. Make sure that the shared memory doesn't contain any pointers; otherwise you have to map the shared areas at the same virtual addresses in all processes or, even worse, rebase all of the pointers in the other processes.
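One common way to keep a shared region pointer-free (a sketch with illustrative names, not from the post): store offsets relative to the region base, so each process can map the region at whatever virtual address it happens to get.

/* Pointer-free linked list in a shared region: only offsets are stored,
 * so the mapping address doesn't matter. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t next_off;   /* offset of next node from the region base; 0 = end */
    uint32_t value;
} shm_node;

/* Turn an offset stored in shared memory into a pointer valid in *this*
 * process, given where the region happens to be mapped here. */
static shm_node *node_at(unsigned char *base, uint32_t off)
{
    return off ? (shm_node *)(base + off) : NULL;
}

static uint32_t sum_list(unsigned char *base, uint32_t head_off)
{
    uint32_t sum = 0;
    for (shm_node *n = node_at(base, head_off); n; n = node_at(base, n->next_off))
        sum += n->value;
    return sum;
}

int main(void)
{
    static uint32_t backing[1024];              /* stand-in for a shared mapping */
    unsigned char *region = (unsigned char *)backing;

    shm_node *a = (shm_node *)(region + 64);
    shm_node *b = (shm_node *)(region + 128);
    a->value = 1; a->next_off = 128;
    b->value = 2; b->next_off = 0;

    printf("%u\n", (unsigned)sum_list(region, 64));   /* prints 3 */
    return 0;
}

The walk works no matter where each process maps the region, which is exactly the property raw pointers would break.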
> >[Alpha did this for it's notion of passive objects requiring ins and >outs to be located in special accompanying pages passed to the object] > >> In a single 386 style single code space, loading a shared library >> needs fix-ups at load time (it is not always possible to make >> everything position independent). Also if two libraries are linked for >> the same virtual address, at least the other library needs to be >> rebased at a different virtual address to avoid the conflict. >> >> Making fix-ups into the code, means that the fixed page becomes dirty >> and can't be shared by multiple processed in the system, by ether >> making a copy of the whole library and making fix-ups to the private >> copy or at least store the dirty pages in the process specific page >> file. >> >> In a good segmented system (with sufficient segment registers) can >> directly share the same library in multiple processes. Since all code >> pages are read-only, no need to store it to a page file if running out >> of memory, > >But any management scheme requires a fast cache for the parameters >pertinent to the objects being managed by the hardware in THIS process >instance. When does storing a tuple (logical start, physical start, size) >outweigh the savings of using *1* segment (per object) over *many* pages >(per object)? I.e., if an object is always small enough to fit in a page, >then a single TLB entry is sufficient to manage it with the page size >"hard-wired". > >You can do the same in a paged system by mapping a single copy of the >object into each consumer's address space, as needed. Fixups and >"local data" can be deliberately situated in a separate page(s) that >accompanies the object -- but is uniquely instantiated for each >consumer (instead of being shared).
On a WinNT system, you can use PEdump.exe to look at the fix-ups in .EXE and .DLL files.
>The "code" page(s) can be discarded when physical memory is scarce >IF they can be reloaded from their original media (disk, flash, etc.)
On Sun, 28 Jun 2020 01:18:00 -0700 (PDT), Rick C
<gnuarm.deletethisbit@gmail.com> wrote:

>On Sunday, June 28, 2020 at 2:24:35 AM UTC-4, upsid...@downunder.com wrote:
>> [...]
>> In a good segmented system (with sufficient segment registers) can
>> directly share the same library in multiple processes. Since all code
>> pages are read-only, no need to store it to a page file if running out
>> of memory,
>
>I believe the x86 segments can be 1 MB each.
In 16-bit 8086 mode, the maximum segment size was 64 KiB and the maximum total physical memory was 1 MiB. In the 32-bit x86 modes, a segment can be 4 GiB, but physical memory was initially limited to well below 4 GiB. More recent address extensions allow more than 4 GiB of physical memory.
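For readers who never fought real mode: the 8086 forms a 20-bit physical address as segment * 16 + offset, so many segment:offset pairs alias the same byte, while any one segment spans only 64 KiB. A small illustration (not from the post):

#include <stdint.h>
#include <stdio.h>

/* 8086 real mode: 20-bit physical address from a 16-bit segment and offset. */
static uint32_t phys_8086(uint16_t seg, uint16_t off)
{
    return (((uint32_t)seg << 4) + off) & 0xFFFFFu;   /* wraps at 1 MiB */
}

int main(void)
{
    /* Two different segment:offset pairs naming the same byte... */
    printf("%05X\n", (unsigned)phys_8086(0x1234, 0x0010));   /* 12350 */
    printf("%05X\n", (unsigned)phys_8086(0x1235, 0x0000));   /* 12350 */
    /* ...while a single segment register reaches only 64 KiB at a time. */
    return 0;
}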
>At the time the x86 architecture was developed, 1 MB of memory was HUGE!
Depends who you ask. I just checked the "Intel memory systems 1977 catalog", which contained memory cards mainly for PDPs. The largest module could hold up to 1 MiB, using 16 Kibit memory chips. We initially had one such box, which made it possible to have the full 4 MiB of physical memory in four boxes in a single cabinet. You could not populate the full 4 MiB with core memory, since that would have required multiple cabinets of core and would have violated the memory bus length limit.

The 8086 was released in 1978. Since the same company made both processors and memories, they must have known that they had 64 Kibit and larger memory chips in the development pipeline.
>I want to say 16 Mb DRAM chips were either just invented, or not yet invented. Either way it would take a lot of chips and money to fully populate an IBM PC with the full 640 kB of RAM.
The IBM PC was not released until 1981.
>So 1 MB segments appeared very large and reasonable at the time.
The 64 KiB segment size was already small at the time.
On 6/28/2020 3:27 AM, upsidedown@downunder.com wrote:
>> But, if segments can be added/removed/resized dynamically, then >> you're essentially dealing with the same sort of fragmentation >> problem that arises in heap management AND the same sort of >> algorithm choices for selecting WHERE to create the next requested >> segment (unless you pass that off to the application to handle >> as IT knows what its current and future needs will be). > > The ancient method for increasing the process address space is to swap > out the whole process into the swap file, increase the size descriptor > and then let the program loader find a new memory area in the physical > memory. Of course this is a heavy operation and should not be done for > every malloc() call :-).
In a desktop environment, you typically only have to worry about how patient the user will be. And, how tolerant he will be if the system crashes because it got into a "sorry, I can't fix this" condition. In an embedded environment (c.a.E), the application has to continue to operate without that possibility of user intervention. And, often has to satisfy timeliness guarantees (though this isn't c.R). All said, you want algorithms and approaches that have a more predictable degree of success regardless of the "current situation", at the time.
On 6/28/2020 5:59 AM, upsidedown@downunder.com wrote:
> On Sun, 28 Jun 2020 00:11:06 -0700, Don Y > <blockedofcourse@foo.invalid> wrote: > >> On 6/27/2020 11:24 PM, upsidedown@downunder.com wrote: >>> The problem with segmented access in x86 is the far too small number >>> of segment registers. In addition on 8086 the problem was the small >>> maximum segment size (64 KiB). A small segment size is not a problem >>> for code, since subroutines are generally much smaller than that, but >>> data access to a large arrays is a pain. >> >> The problem with segments is they are a hack to work-around a previous >> constraint that was arbitrarily imposed on CPU architectures. When will >> we find a 32b space insufficient to represent run-time objects? (it's >> already insufficient for filesystems) > > The 8086 gave segmentation a bad reputation.
Because of the belief that there was so much existing code that relied on the 16b architecture. The past is not a good predictor of the future's needs -- and, saddling the future with assumptions from the past just ADDS to development costs and hinders advancement. [Why are we building embedded systems designed around concepts that were developed for mainframes?? Does your microwave oven NEED filesystem support? Yeah, you can opt to store "settings" in various "files" (./Defrost, ./ReheatBeverage, ./Cook, etc.) but is that really how you'd approach the problem in the absence of BLOATED filesystem support?]
> More modern implementations talk about "objects". > > Intel tried to make the iAPX432, but the technology of the day was > inadequate. > > IIRC, IBM AS/400 supermini also used some kind of object features.
But, here, we're not just talking about it as an addressing/protection mode but, also, as a mechanism for VMM. Unless you back the segment hardware with some ADDITIONAL mechanism to perform that demand-based activity, you're implicitly stating that every "object" must be completely mapped (or not at all)... you have no notion of a smaller mapping unit beneath the segment interface.
>>> Segments are nice if you are going to use shared loadable libraries >>> ("DLLs"). Just load it and use original link time addresses, no need >>> for fix-ups at load time. >> >> Note that you can get, effectively, the same capability by putting the >> object (.so) in a separate (virtual) address space. But, then incur >> the costs of IPC for all references. > > If you need to transfer large amounts of data, you are going to need > shared pages. Make sure that the shared memory doesn't contain any > pointers or you have to map the shared areas into the sane virtual > addresses in all processes or even worse rebase all pointers in other > processes.
You pass data out-of-band to maintain zero-copy semantics. I.e., tell the recipient (effectively) how to map it into its own address space. [I use this to pass large objects (e.g., audio and video frames/streams) between processes -- both local and remote -- without having to waste effort on copyin()/copyout()]

Of course, this can be conditioned with access attributes (e.g., read-only if you don't want the other party to tinker with the contents). And, it is a natural mechanism to expedite fork(), as you can just map the TEXT of the current process into the newly created one.

Lastly, the "demand" capability lets you flag portions of the address space to notify you (or a pager that you control) when the process is tinkering in areas that don't yet exist (e.g., time to expand the stack) or SHOULDN'T exist (dereferencing an errant pointer).

[This could also be possible in a segment-based VMM system -- if each segment signaled when it had resolved a presented address, the absence of such a signal (like a missing DTACK) could trigger an "undefined memory" trap. It could also alert to the presence of overlapping segments (two such signals coinciding) if the hardware didn't explicitly prevent this condition from arising.]
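On a conventional POSIX system, the out-of-band handoff looks roughly like this: create a named shared-memory object, fill it once, and pass only the name and length over ordinary IPC so the consumer maps (rather than copies) the data. A producer-side sketch; the function and object names are illustrative, not from the post.

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int publish_frame(const void *frame, size_t len)
{
    const char *name = "/frame0";               /* illustrative object name */
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                                  /* the mapping stays valid */
    if (p == MAP_FAILED)
        return -1;

    memcpy(p, frame, len);                      /* fill once; consumers map, not copy */
    munmap(p, len);

    /* Now send "name" and len to the consumer out-of-band (socket, queue, ...).
     * It does shm_open(name, O_RDONLY, 0) + mmap(..., PROT_READ, ...) and reads
     * the frame with no copyin()/copyout(). */
    return 0;
}

The read-only mapping on the consumer side is the "access attributes" conditioning mentioned above.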
On 6/28/2020 6:29 AM, upsidedown@downunder.com wrote:
> On the 16 bit 8086 mode, the maximum segment size was 64 KiB, the > maximum total physical memory size was 1 MiB.
And, tools available at the time weren't capable of seamlessly exceeding this limitation.

I had to "decode" a 16R8 PAL (PLA) that implemented a little state machine (8b state register, 8 inputs -> next state). The easy way to do this was to build a set of 2D arrays:

   state_t nextstate[present_state][inputs]

where each entry indicated the state entered from the specified "present state" with the application of a particular set of "inputs". [You also have to remember "I don't yet know"]

This allows you to see which sets of inputs have yet to be tried in a given present state. You can then apply them (to the PAL) and observe the result. Having arrived at a particular "next state", you may discover that it is completely "resolved". So, you need to transit to some other state that is incompletely resolved, whereupon you can try another unresolved set of "inputs". To do this efficiently, you keep track of the distance from any given state to any other state:

   distance_t hops[start_state][destination_state]

of course, also accounting for "I don't yet know". [N.B. distance_t ~= state_t as it can't be farther to a destination than the number of possible destinations supported!] And, iteratively examine this to bring you progressively closer to your desired "destination" -- by applying the inputs dictated in the first array.

Doing this under DOS was a PITA. OTOH, run the code under UNIX (NS32k) and you just declare everything "naturally" and let the compiler and VMM handle the fact that you may not have enough virtual memory to support the objects. Easy peasy. [It's actually a TINY piece of code! "single sheet of paper"]
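For concreteness, the two tables described above might look like this in C (a sketch with illustrative names; an 8-bit state register and 8 input lines give 256 x 256 entries per table):

#include <stdint.h>
#include <string.h>

#define NSTATES 256          /* 8-bit state register */
#define NINPUTS 256          /* 8 input lines -> 256 input vectors */
#define UNKNOWN 0xFFFFu      /* "I don't yet know" sentinel */

typedef uint16_t state_t;    /* wide enough to also hold UNKNOWN */
typedef uint16_t distance_t; /* can't exceed NSTATES, so the same width works */

/* nextstate[s][i]: state observed after applying input vector i in state s. */
static state_t nextstate[NSTATES][NINPUTS];

/* hops[a][b]: fewest observed transitions from state a to state b. */
static distance_t hops[NSTATES][NSTATES];

void init_tables(void)
{
    memset(nextstate, 0xFF, sizeof nextstate);   /* everything starts UNKNOWN */
    memset(hops, 0xFF, sizeof hops);
    for (int s = 0; s < NSTATES; s++)
        hops[s][s] = 0;                          /* already there */
}

/* Record one observation and keep the distance table usable. */
void record(state_t from, uint8_t inputs, state_t to)
{
    nextstate[from][inputs] = to;
    if (hops[from][to] == UNKNOWN || hops[from][to] > 1)
        hops[from][to] = 1;
    /* A full tool would also relax longer paths so hops[] always points
     * toward the nearest state that still has untried inputs. */
}

At two bytes per entry, each table is 128 KiB -- exactly the sort of object that won't fit a single 64 KiB 8086 data segment, but that "just works" in a flat, demand-paged address space.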
On Sun, 28 Jun 2020 12:04:49 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 6/28/2020 5:59 AM, upsidedown@downunder.com wrote: >> On Sun, 28 Jun 2020 00:11:06 -0700, Don Y >> <blockedofcourse@foo.invalid> wrote: >> >>> On 6/27/2020 11:24 PM, upsidedown@downunder.com wrote: >>>> The problem with segmented access in x86 is the far too small number >>>> of segment registers. In addition on 8086 the problem was the small >>>> maximum segment size (64 KiB). A small segment size is not a problem >>>> for code, since subroutines are generally much smaller than that, but >>>> data access to a large arrays is a pain. >>>
>> More modern implementations talk about "objects". >> >> Intel tried to make the iAPX432, but the technology of the day was >> inadequate. >> >> IIRC, IBM AS/400 supermini also used some kind of object features. > >But, here, we're not just talking about it as an addressing/protection mode >but, also, as a mechanism for VMM. Unless you back the segment hardware >with some ADDITIONAL mechanism to perform that demand-based activity, >you're implicitly stating that every "object" must be completely mapped >(or not at all)... you have no notion of a smaller mapping unit beneath >the segment interface.
Actually, swapping out a whole object can be a good idea, assuming each "DLL" is stored in a separate object, each with its own code and data segments. In case of congestion, the OS knows which DLLs are currently active. If an object is not currently active, it is a good candidate for replacement: the code segment can be discarded directly and the dirty data segment swapped out. Later on, if there is a new reference to that DLL, the code and data segments are loaded back at once, as a unit.

The problem with many paged systems is selecting which pages should be replaced, unless there is good LRU (Least Recently Used) hardware support. Lacking sufficient hardware support, the pages to be replaced are selected at random. This is a problem especially with OSes that support(ed) multiple hardware platforms. Pure (clean) pages can simply be discarded, but dirty pages need to be written to the page file.

For instance, in WinNT dirty pages in the working set are selected at random and moved to a queue of pages to be written into the page file. If there is a new reference to a page in the queue, it is moved back to the working set and removed from the queue. If there are no recent references, the page is written to the page file and removed from the queue. Not very optimal.
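For contrast with random selection, the textbook software approximation of LRU is the clock (second-chance) algorithm, which needs only a per-page "referenced" bit. A sketch (illustrative structure, not how WinNT does it):

#include <stdbool.h>
#include <stddef.h>

struct frame {
    bool referenced;   /* set by hardware (or a soft fault) on access */
    bool dirty;        /* must go to the page file before reuse */
    /* ...owning page, backing store location, etc. */
};

/* Sweep the frames like a clock hand: give each referenced frame a
 * second chance, evict the first frame found with referenced == false. */
size_t clock_evict(struct frame *frames, size_t nframes, size_t *hand)
{
    for (;;) {
        struct frame *f = &frames[*hand];
        size_t victim = *hand;
        *hand = (*hand + 1) % nframes;
        if (f->referenced) {
            f->referenced = false;     /* second chance */
        } else {
            return victim;             /* clean pages are dropped; dirty
                                          ones are queued for the page file */
        }
    }
}

Even one reference bit per page (which some MMUs lack) is enough to make this far better than a purely random pick.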
On Sat, 27 Jun 2020 23:50:50 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 6/27/2020 10:01 PM, George Neuner wrote: >>>> If you want a *useful* segmenting MMU, you probably need to design it >>>> yourself. Historically there were some units that did it (what I >>>> would call) right, but none are scalable to modern memory sizes. >>>> >>>> Whatever you do, you want the program to work with flat addresses and >>>> have segmentation applied transparently (like paging) during memory >>>> access. You certainly DO NOT want to follow the x86 example of >>>> exposing segments in the addressing. >>> >>> Agreed. Segments were meant to address a different problem. >>> >>> OTOH, exposing them to the instruction set removes any potential >>> ambiguity if two (or more) "general purpose" segments could >>> overlap at a particular spot in the address space; the opcode >>> acts as a disambiguator. >> >> ??? Not following. > >In a large, flat address space, it is conceivable that "general purpose" >segments could overlap. So, in such an environment, an address presented >to the memory subsystem would have to resolve to SOME particular physical >address, "behind" the segment hardware. The hardware would have to resolve >any possible ambiguities. (how do you design the HARDWARE to prevent >ambiguities from arising without increasing its complexity even more??).
Unless you refer to x86, I still don't understand what "ambiguity" you are speaking of. x86 addresses using segments were ambiguous because x86 segmentation was implemented poorly, with the segment being part of the address. A segment should ONLY be a protection zone, never an address remapping. Segments should be defined only on the flat address space, and the address should be completely resolved before checking it against segment boundaries.
>If, instead, the segments are exposed to the programmer, then the >choice of opcode determines which segment (hardware) is consulted >to resolve the reference(s). Any "overlap" becomes unimportant.
Overlap is always important, and the x86 method took that away. Multiple segments can refer to the same address, and segments can be safely stacked with the top level permission being the intersection of all the underlying segments. At any given time, the program needs control only over what is the "topmost" segment.
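As a software illustration of that model (names and structure are mine, not a description of any real MMU): the address resolves on the flat space first, and overlapping protection zones simply intersect their permissions.

#include <stdbool.h>
#include <stdint.h>

enum { PERM_R = 1, PERM_W = 2, PERM_X = 4 };

struct zone {
    uint64_t base, limit;   /* flat addresses: [base, limit) */
    unsigned perms;
};

/* Overlapping zones are allowed: the effective permission at an address
 * is the intersection of every zone covering it, and an address covered
 * by no zone is not accessible at all. */
bool access_ok(const struct zone *z, int nzones,
               uint64_t addr, unsigned wanted)
{
    unsigned eff = PERM_R | PERM_W | PERM_X;
    bool covered = false;
    for (int i = 0; i < nzones; i++) {
        if (addr >= z[i].base && addr < z[i].limit) {
            eff &= z[i].perms;
            covered = true;
        }
    }
    return covered && (eff & wanted) == wanted;
}

Because the zones never remap anything, there is nothing to "disambiguate" -- only a check to pass or fail.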
>>> The PMMU approach sidesteps this issue by rigidly defining where >>> (in the physical and virtual address spaces) a new page CAN begin. >>> It's bank-switching-on-steroids... >>> >>> [IIRC, I had previously concluded that variable sizes were impractical >>> for reasons like this] >> >> The problem is that you're thinking only about the protection aspect >> ... it's the subdivision management of the address space that is made >> slow and difficult if you allow mapping arbitrarily sized regions. > >Perhaps you missed: > > 'You still have a "packing problem" but with a virtual address space > per process, you'd only have to address the "objects" with which a > particular process interacted in any particular address space. > And, that binding (for PIC) could be done at compile time *or* > load time (the latter being more flexible) -- or even RUN-time!' > >You have N "modules" in a typical application. The linkage editor mashes >them together into a single binary to be loaded, ensuring that they don't >overlap each other (d'uh!). Easy-peasy. > >You have the comparable problem with each segment representing a >discrete "object" being made to coexist disjointedly in a single >address space. > >If the "objects" never change, over time, then this is no harder to >address than the linkage editor problem (assuming any segment can >being at any location and have any size). Especially for PIC. > >But, if segments can be added/removed/resized dynamically, then >you're essentially dealing with the same sort of fragmentation >problem that arises in heap management AND the same sort of >algorithm choices for selecting WHERE to create the next requested >segment (unless you pass that off to the application to handle >as IT knows what its current and future needs will be).
No, I saw your reference to the "packing problem". My point is that it isn't a problem unless you try to use segments as the basis for allocation or swapping. So long as segments are used only as protection zones within the address space, they can overlap in any way.
>> You have to separate the concerns to do either one efficiently. >> >> That's why pure segment-only MMUs quickly were superceded by >> combination page+segment units with segmenting relegated to protection >> while paging handled address space. And now many CPUs don't even >> bother with segments any more. > >The advantage that fixed size (even if there is a selection of sizes >to choose from) pages offers is each page has a particular location >into which it fits. You don't have to worry that some *other* page >partially overlaps it or that it will overlap another.
You do if the space is shared.
>But, with support for different (fixed) page sizes -- and attendant >performance consequences thereof -- the application needs to hint >the OS on how it plans/needs to use memory in order to make optimum >use of memory system bandwidth. Silly for the OS to naively choose >a page size for a process based on some crude metric like "size of >object". That can result in excessive resources being bound that >aren't *typically* USED by that object -- fault in those portions AS >they are needed (why do I need a -- potentially large -- portion of >the object residing in mapped memory if it is only accessed very >infrequently?) > >OTOH, a finer-grained choice (allowing smaller pieces of the object >to be mapped at a time) reduces TLB reach as well as consuming OTHER >resources (e.g., TLB misses) for an object with poor locality of >reference (here-a-hit, there-a-hit, everywhere-a-hit-hit...)
Exactly. The latency of TLB misses is the very reason for the existence of "large" pages in modern operating systems.
>So, there needs to be a conversation between the OS and the application >regarding how, best, to map the application onto the hardware -- with >"suitable" defaults in place for applications that aren't aware of >the significance of these issues. This is particularly true if the >application binary can be hosted on different hardware -- or >MIGRATED to different hardware while executing!
>Obviously makes sense to design that API in a way that is only as >general as it needs to be; WHY SUPPORT POSSIBILITIES THAT DON'T EXIST? >(Or, that aren't *likely* to exist in COTS hardware?) IOW, you >can KNOW that: > ASSERT( !( (pagesize - 1) & pagesize ) ) >for all supported "pagesize", and code accordingly! > >Paraphrasing: "Make something as simple as it can be -- and no simpler"
George
On 6/29/2020 9:05 PM, George Neuner wrote:
> On Sat, 27 Jun 2020 23:50:50 -0700, Don Y > <blockedofcourse@foo.invalid> wrote: > >> On 6/27/2020 10:01 PM, George Neuner wrote: >>>>> If you want a *useful* segmenting MMU, you probably need to design it >>>>> yourself. Historically there were some units that did it (what I >>>>> would call) right, but none are scalable to modern memory sizes. >>>>> >>>>> Whatever you do, you want the program to work with flat addresses and >>>>> have segmentation applied transparently (like paging) during memory >>>>> access. You certainly DO NOT want to follow the x86 example of >>>>> exposing segments in the addressing. >>>> >>>> Agreed. Segments were meant to address a different problem. >>>> >>>> OTOH, exposing them to the instruction set removes any potential >>>> ambiguity if two (or more) "general purpose" segments could >>>> overlap at a particular spot in the address space; the opcode >>>> acts as a disambiguator. >>> >>> ??? Not following. >> >> In a large, flat address space, it is conceivable that "general purpose" >> segments could overlap. So, in such an environment, an address presented >> to the memory subsystem would have to resolve to SOME particular physical >> address, "behind" the segment hardware. The hardware would have to resolve >> any possible ambiguities. (how do you design the HARDWARE to prevent >> ambiguities from arising without increasing its complexity even more??). > > Unless you refer to x86, I still don't understand what "ambiguity" you > are speaking of.
Yes.
> x86 addresses using segments were ambigious because x86 segmentation > was implemented poorly with the segment being part of the address. > > A segment should ONLY be a protection zone, never an address > remapping. Segments should be defined only on the flat address space, > and the address should be completely resolved before checking it > against segment boundaries.
I'd prefer the VMM system to be based on "variable sized pages" (akin to segments), as you can emulate "variable sized protection zones" as collections of one or more such "pages". Though I don't claim to need "single byte resolution" on such page sizes.

But, the present trend towards larger page sizes renders them less useful for many things. E.g., the 512B VAX page would be an oddity in today's world -- I think ARM has some products that offer "tiny" 1KB pages, but even those are off the beaten track.

When you're stuck with 4K (and even larger!) pages, it affects how useful they are for moving/sharing objects between virtual address spaces. E.g., you can't take a network packet and just map it into another address space (without dragging along lots of "empty page space" -- wasted physical memory!). And, if you allocate pages from a pool as buffers, then you need to scrub the excess capacity in any given page lest you disclose previous contents, etc.

If, instead, you could call on the VMM system to create an X-byte "page" (segment?) and then map that to some arbitrary place in *an* address space, you could create "right-sized" pages for your particular application -- instead of trying to come up with a way to minimize the "waste" inherent in using a particular page size over which you have no control.

[Yes, it makes the hardware more complicated. But, hardware is always evolving -- getting faster and smaller and more capable. In the 70's, to do a 16x16 multiply in anything faster than a few hundred cycles, I'd buy a TRW multiplier. Now, it (and more!) is part of many CPUs. How much logic does a second (fourth, sixth...) core displace??]
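The "scrub the excess" chore in concrete terms -- a sketch assuming a POSIX page-size query and a power-of-two page size (the helper name is illustrative):

#include <string.h>
#include <unistd.h>

/* Before mapping a pool buffer into another address space, zero the slack
 * between the end of the payload and the end of the last page so stale
 * contents from a previous use aren't disclosed. */
void scrub_page_slack(void *buf, size_t payload_len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);            /* e.g. 4096 */
    size_t mapped = (payload_len + page - 1) & ~(page - 1); /* round up; assumes power of two */
    memset((char *)buf + payload_len, 0, mapped - payload_len);
}

For a 1500-byte packet in a 4 KiB page that's roughly 2.5 KB of zeroed (and otherwise wasted) space per buffer -- the overhead a right-sized "page" would avoid.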
>> If the "objects" never change, over time, then this is no harder to >> address than the linkage editor problem (assuming any segment can >> being at any location and have any size). Especially for PIC. >> >> But, if segments can be added/removed/resized dynamically, then >> you're essentially dealing with the same sort of fragmentation >> problem that arises in heap management AND the same sort of >> algorithm choices for selecting WHERE to create the next requested >> segment (unless you pass that off to the application to handle >> as IT knows what its current and future needs will be). > > No I saw your reference to "packing problem". My point is that it > isn't a problem unless you try to use segments as the basis for > allocation, or swapping.
Exactly.
> So long as segments are only used as > protection zones within the address space, they can overlap in any > way.
Then you need a means of resolving which segment has priority at a particular logical address. Do you expect that to be free?
>>> You have to separate the concerns to do either one efficiently. >>> >>> That's why pure segment-only MMUs quickly were superceded by >>> combination page+segment units with segmenting relegated to protection >>> while paging handled address space. And now many CPUs don't even >>> bother with segments any more. >> >> The advantage that fixed size (even if there is a selection of sizes >> to choose from) pages offers is each page has a particular location >> into which it fits. You don't have to worry that some *other* page >> partially overlaps it or that it will overlap another. > > You do if the space is shared.
So, all processes share a single address space? And segments resolve whether or not the "current process" can access a particular segment's contents? How does that scale?

I think the model of separate address spaces (for processes) is a lot easier for folks to grok. The fact that a particular "memory unit" can coexist in more than one (with permissions appropriate to each specific process) is a lot easier to manage.

If I want a particular process to have access to the payload of a network packet, do I have to create a "subsegment" that encompasses JUST the payload so that I can leave the framing intact -- yet inaccessible? If that process tries to access the portion of the packet that is "locked", it gets blocked -- and doesn't that leak information (i.e., the fact that there IS framing information surrounding the packet implies that it wasn't sourced by a *local* process)?

[By contrast, if I could create "memory units" of any particular size, I'd create one sized to shrink-wrap the payload and another that encompassed the entire packet. The first would be mapped into the aforementioned process's address space IMMEDIATELY ADJACENT to any other packets that were part of the message (as an example). The larger unit would be mapped into the address space of the process that was concerned with the framing information.]
>> But, with support for different (fixed) page sizes -- and attendant >> performance consequences thereof -- the application needs to hint >> the OS on how it plans/needs to use memory in order to make optimum >> use of memory system bandwidth. Silly for the OS to naively choose >> a page size for a process based on some crude metric like "size of >> object". That can result in excessive resources being bound that >> aren't *typically* USED by that object -- fault in those portions AS >> they are needed (why do I need a -- potentially large -- portion of >> the object residing in mapped memory if it is only accessed very >> infrequently?) >> >> OTOH, a finer-grained choice (allowing smaller pieces of the object >> to be mapped at a time) reduces TLB reach as well as consuming OTHER >> resources (e.g., TLB misses) for an object with poor locality of >> reference (here-a-hit, there-a-hit, everywhere-a-hit-hit...) > > Exactly. The latency of TLB misses are the very reason for the > existence of "large" pages in modern operating systems.
But they assume there will be "something large" occupying that physical resource -- or not. I.e., with a 16MB superpage, you really want/need something that is "close to" 16MB (or, just treat memory as costless).

I wonder how "big" most processes are (in c.a.e products) -- assuming the whole process can be mapped into a single contiguous page? Conversely, I wonder how many "smaller objects" need to be moved between address spaces in such products (assuming, of course, that they operate under those sorts of protection mechanisms)?

Regardless (or, "Irregardless", as sayeth The Rat!), I'm stuck with the fixed size pages that vendors currently offer. So, there can't be any "policy" inherent in the crafting of my code as it can't know whether it will be able to avail itself of "tiny" pages or if it will be packaged in a more wasteful container.

[E.g., silly to partition process initialization into an isolatable page (which can be unallocated as soon as initialization is complete!) if the whole process (TEXT) can fit in said page! Likewise, why add code to grow the stack, on demand, if you're stuck with huge pages? Why ask a process to release "unneeded" resources if the minimum set of resources already exceeds its needs?!]

So, beyond the "conversation" (below), there also needs to be some interaction with the developer at *design* time to best utilize the resources that MIGHT be available (without prior knowledge of the hosting hardware!). I'll have to consider how I can condition the code to adapt to these differences. Meanwhile, I'll continue cataloging (fixed) page sizes with an eye towards noticing potential trends...
>> So, there needs to be a conversation between the OS and the application >> regarding how, best, to map the application onto the hardware -- with >> "suitable" defaults in place for applications that aren't aware of >> the significance of these issues. This is particularly true if the >> application binary can be hosted on different hardware -- or >> MIGRATED to different hardware while executing! > >> Obviously makes sense to design that API in a way that is only as >> general as it needs to be; WHY SUPPORT POSSIBILITIES THAT DON'T EXIST? >> (Or, that aren't *likely* to exist in COTS hardware?) IOW, you >> can KNOW that: >> ASSERT( !( (pagesize - 1) & pagesize ) ) >> for all supported "pagesize", and code accordingly! >> >> Paraphrasing: "Make something as simple as it can be -- and no simpler"
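The quoted ASSERT is the standard power-of-two test. As a minimal sketch of how a runtime might sanity-check it, assuming a POSIX page-size query:

#include <assert.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);

    /* Page sizes on COTS MMUs are powers of two, so pagesize - 1 clears
     * exactly the one set bit and the AND comes out zero. */
    assert(pagesize > 0 && ((pagesize - 1) & pagesize) == 0);

    printf("page size: %ld bytes\n", pagesize);
    return 0;
}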
Time to go check out the fire...
