
Page sizes

Started by Don Y June 27, 2020
On 6/27/2020 3:21 AM, Bernd Linsel wrote:
> On 27.06.2020 09:32, Don Y wrote:
>> But, I don't see any VALUE in other sizes as it needlessly complicates
>> any *hardware* implementation (though poses little barrier to a software
>> implementation!)
>
> Commonly (in all current mainstream processor architectures like iA32,
> AMD64, ARM, MIPS etc.) an MMU divides a logical address at bit boundaries
> into a page address and an offset.
> As a result, the page address on these platforms is always a power of two.
Yes, that's what my survey has produced. There are "preferred" page sizes across architectures, and the range of sizes is constrained by (no doubt) practical implementation issues.
> Yes, there _may_ exist some exotic MMUs that let you choose protection
> areas (to avoid the term 'pages') with arbitrary base addresses and
> sizes. This
But this wasn't always the case. Much of the "adventurism" that was prevalent in CPU design in the 80's seems to have been winnowed down ("electronic Darwinism?") to the fixed page size implementations that are commonplace, today (esp wrt devices supporting DPVMM). And, as an implicit acknowledgement that this isn't "quite sufficient", we see the introduction of superpages, subblocks, page size choices, etc. to further complicate the mess. All targeted to increase TLB reach as working sets get larger.
> flexibility requires heavily increased hardware effort and cost and
> complicates an OS's memory management, so it's unlikely to be used at all.
> One example was the i286/i386's Protected Mode segments, but even there was
> a granularity of 4K/1M, so the assertion 'segment base address is a power
> of two' was also true, you just couldn't be sure each segment had the same
> size. Setting up and maintaining the segment descriptor tables was so
> complicated that mainstream OS's on i386 (NT, Linux) only set up the most
> necessary segments and went on using a flat 4GB address space and the page
> tables of the additional MMU.
> Furthermore, using segments slowed down hardware memory accesses
> considerably, so the '486 and successors added Segment Descriptor Caches
> etc etc.
>
> Conclusion: No, you cannot fundamentally assume that page sizes on any
> existing MMU are powers of two. Hardware designers can implement whatever
> weird and complicated addressing patterns they like.
Yes, but -- as above -- the trend seems to be towards reducing page-size choice (flexibility) in the hope that performance hits can be mitigated with larger TLBs (or smarter resource scheduling). On the surface, this may (?) be the right approach -- barring a fundamental change in how developers approach system/application development. It's certainly one that silicon developers can more easily wrap their heads around!
On Sat, 27 Jun 2020 12:21:08 +0200, Bernd Linsel <bl1@gmx.com> wrote:

>[...]
>
>Conclusion: No, you cannot fundamentally assume that page sizes on any
>existing MMU are powers of two. Hardware designers can implement
>whatever weird and complicated addressing patterns they like.
Minor quibble: You can't assume the minimum protection zone is
power-of-2, but some systems separate the notion of the protection zone
from the allocation unit. Every MMU I am aware of has allocation /
management units that are power-of-2.

George
On 6/27/20 12:45 AM, Don Y wrote:
> Are there any processors/PMMUs for which the following would be true
> (nonzero)?
>
>     (pagesize - 1) & pagesize
The simple thing to see is that the easiest way for the hardware to address memory is to break the address up into page_number bits and page_address bits, which for a binary machine implies a power-of-2 page size.

It is theoretically possible to design a system using an arbitrary page size and compute the page number as page_number = address / pagesize and the offset as page_address = address % pagesize; but unless pagesize is a power of two, these are not simple to compute, so there would need to be a VERY good reason to add the complexity.
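
For illustration, a minimal C sketch of both decompositions, assuming a hypothetical 4 KiB page size; the shift/mask forms are valid only because that size is a power of two:

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    #define PAGESIZE 4096u              /* hypothetical page size */

    int main(void)
    {
        uintptr_t address = 0x12345678u;

        /* The power-of-two test from the original question. */
        assert(((PAGESIZE - 1) & PAGESIZE) == 0);

        /* General form: works for any page size, needs a divider. */
        uintptr_t page_number  = address / PAGESIZE;
        uintptr_t page_address = address % PAGESIZE;

        /* Power-of-two form: just splitting the address bits. */
        assert(page_number  == address >> 12);      /* log2(4096) = 12 */
        assert(page_address == (address & (PAGESIZE - 1)));

        printf("page %#lx, offset %#lx\n",
               (unsigned long)page_number, (unsigned long)page_address);
        return 0;
    }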
On Sat, 27 Jun 2020 15:10:22 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>Hi George,
>
>Hope you are keeping well... bad pandemic response, here; really
>high temperatures; increasing humidity; and lots of smoke in the
>air (but really cool "displays" at night!) :< Time to make some
>ice cream and enjoy the ride! :>
>
>On 6/27/2020 2:37 PM, George Neuner wrote:
>> Hi Don,
>> On Fri, 26 Jun 2020 21:45:54 -0700, Don Y
>> <blockedofcourse@foo.invalid> wrote:
>>
>>> Are there any processors/PMMUs for which the following would be
>>> true (nonzero)?
>>>
>>>     (pagesize - 1) & pagesize
>>
>> Not anything you can buy.
>
>I'm wondering if some of the "classic" designs might scale to newer
>device geometries better than some of the newer architectures?
>
>E.g., supporting ~100 (variable sized) segments concurrently and
>binding each to a particular "object" (for want of a better word).
>If the segment management hardware automatically reloads (in a manner
>similar to the TLB's functionality), then this should yield performance
>better than (or comparable to) the fixed page-size approach (if
>you assume the fixed pages poorly "fit" the types of "objects"
>that you are mapping).
>
>[I think we discussed this -- or something similar -- a while ago]
About ~10 years ago 8-)

But you asked about "pages" here, which invariably are fixed-size
entities. Arbitrarily sized "segments" are a different subject.

If you want a *useful* segmenting MMU, you probably need to design it
yourself. Historically there were some units that did it (what I would
call) right, but none are scalable to modern memory sizes.

Whatever you do, you want the program to work with flat addresses and
have segmentation applied transparently (like paging) during memory
access. You certainly DO NOT want to follow the x86 example of exposing
segments in the addressing.
>You still have a "packing problem" but with a virtual address space
>per process, you'd only have to address the "objects" with which a
>particular process interacted in any particular address space.
>And, that binding (for PIC) could be done at compile time *or*
>load time (the latter being more flexible) -- or even RUN-time!
George
On 6/27/2020 3:35 PM, George Neuner wrote:
> But you asked about "pages" here, which invariably are fixed-size
> entities. Arbitrarily sized "segments" are a different subject.
Yes -- but you note that some "modern" CPUs now allow multiple (fixed) page sizes to coexist in the same address space. So, it's a matter of degrees...
> If you want a *useful* segmenting MMU, you probably need to design it
> yourself. Historically there were some units that did it (what I
> would call) right, but none are scalable to modern memory sizes.
>
> Whatever you do, you want the program to work with flat addresses and
> have segmentation applied transparently (like paging) during memory
> access. You certainly DO NOT want to follow the x86 example of
> exposing segments in the addressing.
Agreed. Segments were meant to address a different problem.

OTOH, exposing them to the instruction set removes any potential
ambiguity if two (or more) "general purpose" segments could overlap at
a particular spot in the address space; the opcode acts as a
disambiguator.

The PMMU approach sidesteps this issue by rigidly defining where (in
the physical and virtual address spaces) a new page CAN begin. It's
bank-switching-on-steroids...

[IIRC, I had previously concluded that variable sizes were impractical
for reasons like this]
On Sat, 27 Jun 2020 16:36:59 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 6/27/2020 3:35 PM, George Neuner wrote:
>> But you asked about "pages" here, which invariably are fixed-size
>> entities. Arbitrarily sized "segments" are a different subject.
>
>Yes -- but you note that some "modern" CPUs now allow multiple (fixed)
>page sizes to coexist in the same address space. So, it's a matter
>of degrees...
Not really. Allowing this process to have 4KB pages and that process
to have 16KB pages and yet a third process to have 1MB pages (or
whatever) is light years from allowing this process to have 109 bytes
here and 3002 bytes there, and that process to have 1061 bytes of which
53 overlap the other process's memory but with different protection.

That isn't "paging". Segmenting MMUs could/can do sh-t like that, but
most don't provide enough segments - per process or in total - to make
it worthwhile to subdivide memory at such fine granularity. Only Mill
claims this capability at sufficient scale for a large memory ... but
you can't buy a Mill.
>> If you want a *useful* segmenting MMU, you probably need to design it
>> yourself. Historically there were some units that did it (what I
>> would call) right, but none are scalable to modern memory sizes.
>>
>> Whatever you do, you want the program to work with flat addresses and
>> have segmentation applied transparently (like paging) during memory
>> access. You certainly DO NOT want to follow the x86 example of
>> exposing segments in the addressing.
>
>Agreed. Segments were meant to address a different problem.
>
>OTOH, exposing them to the instruction set removes any potential
>ambiguity if two (or more) "general purpose" segments could
>overlap at a particular spot in the address space; the opcode
>acts as a disambiguator.
??? Not following.
>The PMMU approach sidesteps this issue by rigidly defining where
>(in the physical and virtual address spaces) a new page CAN begin.
>It's bank-switching-on-steroids...
>
>[IIRC, I had previously concluded that variable sizes were impractical
>for reasons like this]
The problem is that you're thinking only about the protection aspect
... it's the subdivision management of the address space that is made
slow and difficult if you allow mapping arbitrarily sized regions. You
have to separate the concerns to do either one efficiently.

That's why pure segment-only MMUs quickly were superseded by
combination page+segment units, with segmenting relegated to protection
while paging handled the address space. And now many CPUs don't even
bother with segments any more.

George
On 28/6/20 1:21 am, David Brown wrote:
> On 27/06/2020 09:32, Don Y wrote:
>> On 6/26/2020 11:33 PM, Bernd Linsel wrote:
>>> On 27.06.2020 06:45, Don Y wrote:
>>>> Are there any processors/PMMUs for which the following would be true
>>>> (nonzero)?
>>>>
>>>>     (pagesize - 1) & pagesize
>>>
>>> That would imply that the page size is not an integral power of 2.
>>
>> Yes, that was the point of the question.
>
> If that was the point, why didn't you write that?
David, it is Don Y you're addressing.
On Sat, 27 Jun 2020 18:35:57 -0400, George Neuner
<gneuner2@comcast.net> wrote:

>[...]
>
>Whatever you do, you want the program to work with flat addresses and
>have segmentation applied transparently (like paging) during memory
>access. You certainly DO NOT want to follow the x86 example of
>exposing segments in the addressing.
The problem with segmented access on x86 is the far too small number of segment registers. In addition, on the 8086 the problem was the small maximum segment size (64 KiB). A small segment size is not a problem for code, since subroutines are generally much smaller than that, but data access to large arrays is a pain.

Segments are nice if you are going to use shared loadable libraries ("DLLs"). Just load the library and use the original link-time addresses; no need for fix-ups at load time.

In a single 386-style flat code space, loading a shared library needs fix-ups at load time (it is not always possible to make everything position independent). Also, if two libraries are linked for the same virtual address, at least one of them needs to be rebased to a different virtual address to avoid the conflict.

Making fix-ups in the code means that the fixed page becomes dirty and can't be shared by multiple processes in the system: you either make a copy of the whole library and apply fix-ups to the private copy, or at least store the dirty pages in a process-specific page file.

A good segmented system (with sufficient segment registers) can directly share the same library between multiple processes. Since all code pages are read-only, there is no need to write them to a page file when running out of memory.
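
A minimal C sketch of the kind of load-time fix-up pass being described (the image layout and all names are hypothetical): each absolute-address slot that gets patched dirties the page containing it, which is exactly what breaks page sharing.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical image descriptor: the library was linked to run at
     * link_base, and reloc[] lists the offsets of slots that hold
     * absolute addresses. */
    typedef struct {
        uintptr_t link_base;
        size_t    nrelocs;
        uint32_t  reloc[];
    } image_info;

    /* Patch every absolute-address slot for the actual load address.
     * Each write copy-on-write faults the page holding the slot, so
     * those pages can no longer be shared between processes. */
    static void apply_fixups(uint8_t *load_base, const image_info *info)
    {
        uintptr_t delta = (uintptr_t)load_base - info->link_base;
        for (size_t i = 0; i < info->nrelocs; i++) {
            uintptr_t *slot = (uintptr_t *)(load_base + info->reloc[i]);
            *slot += delta;
        }
    }

With segment-relative addressing (or fully position-independent code) delta never has to be applied to the image at all, so the pages stay clean and shareable.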
On 6/27/2020 10:01 PM, George Neuner wrote:
>>> If you want a *useful* segmenting MMU, you probably need to design it
>>> yourself. Historically there were some units that did it (what I
>>> would call) right, but none are scalable to modern memory sizes.
>>>
>>> Whatever you do, you want the program to work with flat addresses and
>>> have segmentation applied transparently (like paging) during memory
>>> access. You certainly DO NOT want to follow the x86 example of
>>> exposing segments in the addressing.
>>
>> Agreed. Segments were meant to address a different problem.
>>
>> OTOH, exposing them to the instruction set removes any potential
>> ambiguity if two (or more) "general purpose" segments could
>> overlap at a particular spot in the address space; the opcode
>> acts as a disambiguator.
>
> ??? Not following.
In a large, flat address space, it is conceivable that "general purpose"
segments could overlap. So, in such an environment, an address presented
to the memory subsystem would have to resolve to SOME particular physical
address "behind" the segment hardware; the hardware would have to resolve
any possible ambiguities. (How do you design the HARDWARE to prevent
ambiguities from arising without increasing its complexity even more??)

If, instead, the segments are exposed to the programmer, then the choice
of opcode determines which segment (hardware) is consulted to resolve the
reference(s). Any "overlap" becomes unimportant.
>> The PMMU approach sidesteps this issue by rigidly defining where
>> (in the physical and virtual address spaces) a new page CAN begin.
>> It's bank-switching-on-steroids...
>>
>> [IIRC, I had previously concluded that variable sizes were impractical
>> for reasons like this]
>
> The problem is that you're thinking only about the protection aspect
> ... it's the subdivision management of the address space that is made
> slow and difficult if you allow mapping arbitrarily sized regions.
Perhaps you missed:

'You still have a "packing problem" but with a virtual address space
per process, you'd only have to address the "objects" with which a
particular process interacted in any particular address space.
And, that binding (for PIC) could be done at compile time *or*
load time (the latter being more flexible) -- or even RUN-time!'

You have N "modules" in a typical application. The linkage editor mashes
them together into a single binary to be loaded, ensuring that they don't
overlap each other (d'uh!). Easy-peasy.

You have the comparable problem with each segment representing a discrete
"object" being made to coexist disjointly in a single address space. If
the "objects" never change, over time, then this is no harder to address
than the linkage editor problem (assuming any segment can begin at any
location and have any size). Especially for PIC.

But, if segments can be added/removed/resized dynamically, then you're
essentially dealing with the same sort of fragmentation problem that
arises in heap management AND the same sort of algorithm choices for
selecting WHERE to create the next requested segment (unless you pass
that off to the application to handle, as IT knows what its current and
future needs will be). A sketch of that choice follows.
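
For illustration, a minimal first-fit placement sketch in C (all names hypothetical) of the decision a dynamic segment allocator faces; it is the same decision a heap allocator makes, just over virtual address ranges instead of heap blocks:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical descriptor for one mapped segment in a process's
     * virtual address space. */
    typedef struct {
        uintptr_t base;
        size_t    size;
    } segment;

    /* First-fit: return a base address for a new segment of `size`
     * bytes, or 0 if no gap is large enough.  `segs` holds `n`
     * non-overlapping segments sorted by base; the usable space runs
     * from `lo` to `hi`. */
    static uintptr_t place_segment(const segment *segs, size_t n,
                                   uintptr_t lo, uintptr_t hi,
                                   size_t size)
    {
        uintptr_t cursor = lo;
        for (size_t i = 0; i < n; i++) {
            if (segs[i].base - cursor >= size)
                return cursor;              /* gap before segs[i] fits */
            cursor = segs[i].base + segs[i].size;
        }
        return (hi - cursor >= size) ? cursor : 0;  /* trailing gap */
    }

Best-fit, buddy, or address-ordered policies trade external fragmentation against search cost in exactly the way heap allocators do.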
> You have to separate the concerns to do either one efficiently.
>
> That's why pure segment-only MMUs quickly were superseded by
> combination page+segment units with segmenting relegated to protection
> while paging handled address space. And now many CPUs don't even
> bother with segments any more.
The advantage that fixed-size pages offer (even if there is a selection
of sizes to choose from) is that each page has a particular location into
which it fits. You don't have to worry that some *other* page partially
overlaps it, or that it will overlap another.

But, with support for different (fixed) page sizes -- and the attendant
performance consequences thereof -- the application needs to hint the OS
on how it plans/needs to use memory in order to make optimum use of
memory system bandwidth.

Silly for the OS to naively choose a page size for a process based on
some crude metric like "size of object". That can result in excessive
resources being bound that aren't *typically* USED by that object --
fault in those portions AS they are needed (why do I need a --
potentially large -- portion of the object residing in mapped memory if
it is only accessed very infrequently?). OTOH, a finer-grained choice
(allowing smaller pieces of the object to be mapped at a time) reduces
TLB reach as well as consuming OTHER resources (e.g., TLB misses) for an
object with poor locality of reference (here-a-hit, there-a-hit,
everywhere-a-hit-hit...)

So, there needs to be a conversation between the OS and the application
regarding how, best, to map the application onto the hardware -- with
"suitable" defaults in place for applications that aren't aware of the
significance of these issues. This is particularly true if the
application binary can be hosted on different hardware -- or MIGRATED to
different hardware while executing!

Obviously it makes sense to design that API in a way that is only as
general as it needs to be; WHY SUPPORT POSSIBILITIES THAT DON'T EXIST?
(Or, that aren't *likely* to exist in COTS hardware?) IOW, you can KNOW
that:

    ASSERT( !( (pagesize - 1) & pagesize ) )

for all supported "pagesize", and code accordingly! Paraphrasing: "Make
something as simple as it can be -- and no simpler"

[Time to check the daily briefing on the fire and then go out and take a
look at it...]
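
A minimal sketch of coding against that assertion on a POSIX-ish system (the helper names are hypothetical); once the power-of-two property is verified, page rounding collapses to bit masking:

    #include <assert.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Fetch the page size and verify the power-of-two assumption that
     * the rest of the code relies on. */
    static uintptr_t page_mask(void)
    {
        uintptr_t pagesize = (uintptr_t)sysconf(_SC_PAGESIZE);
        assert(((pagesize - 1) & pagesize) == 0);   /* power of two */
        return pagesize - 1;
    }

    /* Round an address down/up to a page boundary -- pure masking, no
     * divides, precisely because of the assertion above. */
    static uintptr_t page_floor(uintptr_t a) { return a & ~page_mask(); }
    static uintptr_t page_ceil(uintptr_t a)
    {
        return (a + page_mask()) & ~page_mask();
    }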
On 6/27/2020 11:24 PM, upsidedown@downunder.com wrote:
> The problem with segmented access on x86 is the far too small number
> of segment registers. In addition, on the 8086 the problem was the
> small maximum segment size (64 KiB). A small segment size is not a
> problem for code, since subroutines are generally much smaller than
> that, but data access to large arrays is a pain.
The problem with segments is that they are a hack to work around a previous constraint that was arbitrarily imposed on CPU architectures. When will we find a 32b space insufficient to represent run-time objects? (It's already insufficient for filesystems.)
> Segments are nice if you are going to use shared loadable libraries
> ("DLLs"). Just load the library and use the original link-time
> addresses; no need for fix-ups at load time.
Note that you can get, effectively, the same capability by putting the object (.so) in a separate (virtual) address space. But, then, you incur the costs of IPC for all references.

[Alpha did this for its notion of passive objects, requiring ins and outs to be located in special accompanying pages passed to the object]
> In a single 386-style flat code space, loading a shared library needs
> fix-ups at load time (it is not always possible to make everything
> position independent). Also, if two libraries are linked for the same
> virtual address, at least one of them needs to be rebased to a
> different virtual address to avoid the conflict.
>
> Making fix-ups in the code means that the fixed page becomes dirty and
> can't be shared by multiple processes in the system: you either make a
> copy of the whole library and apply fix-ups to the private copy, or at
> least store the dirty pages in a process-specific page file.
>
> A good segmented system (with sufficient segment registers) can
> directly share the same library between multiple processes. Since all
> code pages are read-only, there is no need to write them to a page
> file when running out of memory.
But any management scheme requires a fast cache for the parameters
pertinent to the objects being managed by the hardware in THIS process
instance. When does storing a tuple (logical start, physical start,
size) outweigh the savings of using *1* segment (per object) over *many*
pages (per object)? I.e., if an object is always small enough to fit in
a page, then a single TLB entry is sufficient to manage it with the page
size "hard-wired".

You can do the same in a paged system by mapping a single copy of the
object into each consumer's address space, as needed. Fixups and "local
data" can be deliberately situated in separate page(s) that accompany
the object -- but are uniquely instantiated for each consumer (instead
of being shared). The "code" page(s) can be discarded when physical
memory is scarce IF they can be reloaded from their original media
(disk, flash, etc.)
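
A minimal POSIX sketch of that arrangement (the path, offsets, and sizes are hypothetical): the code pages are mapped shared and read-only, so every consumer hits the same physical copy and the kernel can simply drop them under memory pressure, while the fix-up/data pages are mapped copy-on-write per consumer.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define CODE_SIZE (64 * 4096)     /* hypothetical sizes, page-aligned */
    #define DATA_SIZE (4 * 4096)

    int main(void)
    {
        int fd = open("/tmp/object.so", O_RDONLY); /* hypothetical object */
        if (fd < 0) { perror("open"); return 1; }

        /* Code: shared, read+execute.  One physical copy serves every
         * process; clean pages can be discarded and reloaded from the
         * file rather than written to swap. */
        void *code = mmap(NULL, CODE_SIZE, PROT_READ | PROT_EXEC,
                          MAP_SHARED, fd, 0);

        /* Per-consumer fix-up/data area: copy-on-write, so writes land
         * in private pages, not in the shared image. */
        void *data = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE, fd, CODE_SIZE);

        if (code == MAP_FAILED || data == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        printf("code at %p (shared), data at %p (private)\n", code, data);
        return 0;
    }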
