
Page sizes

Started by Don Y June 27, 2020
On 6/27/2020 3:21 AM, Bernd Linsel wrote:
> On 27.06.2020 09:32, Don Y wrote:
>> But, I don't see any VALUE in other sizes as it needlessly complicates
>> any *hardware* implementation (though poses little barrier to a software
>> implementation!)
>
> Commonly (in all current mainstream processor architectures like iA32,
> AMD64, ARM, MIPS etc.) an MMU divides a logical address at bit boundaries
> into a page address and an offset.
> As a result, the page address on these platforms is always a power of two.
Yes, that's what my survey has produced. There are "preferred" page sizes across architectures, and the range of sizes is constrained by (no doubt) practical implementation issues.
> Yes, there _may_ exist some exotic MMUs that let you choose protection
> areas (to avoid the term 'pages') with arbitrary base addresses and
> sizes. This
But this wasn't always the case. Much of the "adventurism" that was prevalent in CPU design in the 80's seems to have been winnowed down ("electronic Darwinism?") to the fixed page size implementations that are commonplace, today (esp wrt devices supporting DPVMM). And, as an implicit acknowledgement that this isn't "quite sufficient", we see the introduction of superpages, subblocks, page size choices, etc. to further complicate the mess. All targeted to increase TLB reach as working sets get larger.
> flexibility requires heavily increased hardware effort and cost and
> complicates an OS's memory management, so it's unlikely to be used at all.
> One example was the i286/i386's Protected Mode segments, but even there was
> a granularity of 4K/1M, so the assertion 'segment base address is a power
> of two' was also true, you just couldn't be sure each segment had the same
> size. Setting up and maintaining the segment descriptor tables was so
> complicated that mainstream OS's on i386 (NT, Linux) only set up the most
> necessary segments and went on using a flat 4GB address space and the page
> tables of the additional MMU.
> Furthermore, using segments slowed down hardware memory accesses
> considerably, so the '486 and successors added Segment Descriptor Caches
> etc etc.
>
> Conclusion: No, you cannot fundamentally assume that page sizes on any
> existing MMU are powers of two. Hardware designers can implement whatever
> weird and complicated addressing patterns they like.
Yes, but -- as above -- the trend seems to be towards reducing page-size choice (flexibility) in the hope that performance hits can be mitigated with larger TLBs (or smarter resource scheduling). On the surface, this may (?) be the right approach -- barring a fundamental change in how developers approach system/application development. It's certainly one that silicon developers can more easily wrap their heads around!
On Sat, 27 Jun 2020 12:21:08 +0200, Bernd Linsel <bl1@gmx.com> wrote:

>[...]
>
>Conclusion: No, you cannot fundamentally assume that page sizes on any
>existing MMU are powers of two. Hardware designers can implement
>whatever weird and complicated addressing patterns they like.
Minor quibble: You can't assume the minimum protection zone is
power-of-2, but some systems separate the notion of the protection zone
from the allocation unit. Every MMU I am aware of has allocation /
management units that are power-of-2.

George
On 6/27/20 12:45 AM, Don Y wrote:
> Are there any processors/PMMUs for which the following would be true
> (nonzero)?
>
>     (pagesize - 1) & pagesize
The simple thing to see is that the easiest way for the hardware to address memory is to break the address up into page_number bits and page_address bits, which for a binary machine implies a power-of-2 page size.

It is theoretically possible to design a system using an arbitrary page size and compute the page number as page_number = address / pagesize and the offset as page_address = address % pagesize; but unless pagesize is a power of two, these are not simple to compute, so there would need to be a VERY good reason to add the complexity.
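
For illustration, a minimal C sketch of both decompositions, assuming a hypothetical 4 KiB page size; the shift/mask forms are valid only because that size is a power of two:

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    #define PAGESIZE 4096u              /* hypothetical page size */

    int main(void)
    {
        uintptr_t address = 0x12345678u;

        /* The power-of-two test from the original question. */
        assert(((PAGESIZE - 1) & PAGESIZE) == 0);

        /* General form: works for any page size, needs a divider. */
        uintptr_t page_number  = address / PAGESIZE;
        uintptr_t page_address = address % PAGESIZE;

        /* Power-of-two form: just splitting the address bits. */
        assert(page_number  == address >> 12);      /* log2(4096) = 12 */
        assert(page_address == (address & (PAGESIZE - 1)));

        printf("page %#lx, offset %#lx\n",
               (unsigned long)page_number, (unsigned long)page_address);
        return 0;
    }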
On Sat, 27 Jun 2020 15:10:22 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>Hi George,
>
>Hope you are keeping well... bad pandemic response, here; really
>high temperatures; increasing humidity; and lots of smoke in the
>air (but really cool "displays" at night!) :< Time to make some
>ice cream and enjoy the ride! :>
>
>On 6/27/2020 2:37 PM, George Neuner wrote:
>> Hi Don,
>> On Fri, 26 Jun 2020 21:45:54 -0700, Don Y
>> <blockedofcourse@foo.invalid> wrote:
>>
>>> Are there any processors/PMMUs for which the following would be
>>> true (nonzero)?
>>>
>>>     (pagesize - 1) & pagesize
>>
>> Not anything you can buy.
>
>I'm wondering if some of the "classic" designs might scale to newer
>device geometries better than some of the newer architectures?
>
>E.g., supporting ~100 (variable sized) segments concurrently and
>binding each to a particular "object" (for want of a better word).
>If the segment management hardware automatically reloads (in a manner
>similar to the TLB's functionality), then this should yield performance
>better than (or comparable to) the fixed page-size approach (if
>you assume the fixed pages poorly "fit" the types of "objects"
>that you are mapping).
>
>[I think we discussed this -- or something similar -- a while ago]
About ~10 years ago 8-)

But you asked about "pages" here, which invariably are fixed-size
entities. Arbitrarily sized "segments" are a different subject.

If you want a *useful* segmenting MMU, you probably need to design it
yourself. Historically there were some units that did it (what I would
call) right, but none are scalable to modern memory sizes.

Whatever you do, you want the program to work with flat addresses and
have segmentation applied transparently (like paging) during memory
access. You certainly DO NOT want to follow the x86 example of exposing
segments in the addressing.
>You still have a "packing problem" but with a virtual address space
>per process, you'd only have to address the "objects" with which a
>particular process interacted in any particular address space.
>And, that binding (for PIC) could be done at compile time *or*
>load time (the latter being more flexible) -- or even RUN-time!
George
On 6/27/2020 3:35 PM, George Neuner wrote:
> But you asked about "pages" here, which invariably are fixed-size
> entities. Arbitrarily sized "segments" are a different subject.
Yes -- but you note that some "modern" CPUs now allow multiple (fixed) page sizes to coexist in the same address space. So, it's a matter of degrees...
> If you want a *useful* segmenting MMU, you probably need to design it
> yourself. Historically there were some units that did it (what I
> would call) right, but none are scalable to modern memory sizes.
>
> Whatever you do, you want the program to work with flat addresses and
> have segmentation applied transparently (like paging) during memory
> access. You certainly DO NOT want to follow the x86 example of
> exposing segments in the addressing.
Agreed. Segments were meant to address a different problem.

OTOH, exposing them to the instruction set removes any potential
ambiguity if two (or more) "general purpose" segments could overlap at
a particular spot in the address space; the opcode acts as a
disambiguator.

The PMMU approach sidesteps this issue by rigidly defining where (in
the physical and virtual address spaces) a new page CAN begin. It's
bank-switching-on-steroids...

[IIRC, I had previously concluded that variable sizes were impractical
for reasons like this]
On Sat, 27 Jun 2020 16:36:59 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 6/27/2020 3:35 PM, George Neuner wrote:
>> But you asked about "pages" here, which invariably are fixed-size
>> entities. Arbitrarily sized "segments" are a different subject.
>
>Yes -- but you note that some "modern" CPUs now allow multiple (fixed)
>page sizes to coexist in the same address space. So, it's a matter
>of degrees...
Not really. Allowing this process to have 4KB pages and that process
to have 16KB pages and yet a third process to have 1MB pages (or
whatever) is light years from allowing this process to have 109 bytes
here and 3002 bytes there, and that process to have 1061 bytes of which
53 overlap the other process's memory but with different protection.

That isn't "paging". Segmenting MMUs could/can do sh-t like that, but
most don't provide enough segments - per process or in total - to make
it worthwhile to subdivide memory at such fine granularity. Only Mill
claims this capability at sufficient scale for a large memory ... but
you can't buy a Mill.
>> If you want a *useful* segmenting MMU, you probably need to design it
>> yourself. Historically there were some units that did it (what I
>> would call) right, but none are scalable to modern memory sizes.
>>
>> Whatever you do, you want the program to work with flat addresses and
>> have segmentation applied transparently (like paging) during memory
>> access. You certainly DO NOT want to follow the x86 example of
>> exposing segments in the addressing.
>
>Agreed. Segments were meant to address a different problem.
>
>OTOH, exposing them to the instruction set removes any potential
>ambiguity if two (or more) "general purpose" segments could
>overlap at a particular spot in the address space; the opcode
>acts as a disambiguator.
??? Not following.
>The PMMU approach sidesteps this issue by rigidly defining where
>(in the physical and virtual address spaces) a new page CAN begin.
>It's bank-switching-on-steroids...
>
>[IIRC, I had previously concluded that variable sizes were impractical
>for reasons like this]
The problem is that you're thinking only about the protection aspect
... it's the subdivision management of the address space that is made
slow and difficult if you allow mapping arbitrarily sized regions. You
have to separate the concerns to do either one efficiently.

That's why pure segment-only MMUs quickly were superseded by
combination page+segment units, with segmenting relegated to protection
while paging handled the address space. And now many CPUs don't even
bother with segments any more.

George
On 28/6/20 1:21 am, David Brown wrote:
> On 27/06/2020 09:32, Don Y wrote:
>> On 6/26/2020 11:33 PM, Bernd Linsel wrote:
>>> On 27.06.2020 06:45, Don Y wrote:
>>>> Are there any processors/PMMUs for which the following would be true
>>>> (nonzero)?
>>>>
>>>>     (pagesize - 1) & pagesize
>>>
>>> That would imply that the page size is not an integral power of 2.
>>
>> Yes, that was the point of the question.
>
> If that was the point, why didn't you write that?
David, it is Don Y you're addressing.
On Sat, 27 Jun 2020 18:35:57 -0400, George Neuner
<gneuner2@comcast.net> wrote:

>[...]
>
>Whatever you do, you want the program to work with flat addresses and
>have segmentation applied transparently (like paging) during memory
>access. You certainly DO NOT want to follow the x86 example of
>exposing segments in the addressing.
The problem with segmented access on x86 is the far too small number of segment registers. In addition, on the 8086 the problem was the small maximum segment size (64 KiB). A small segment size is not a problem for code, since subroutines are generally much smaller than that, but data access to large arrays is a pain.

Segments are nice if you are going to use shared loadable libraries ("DLLs"). Just load the library and use the original link-time addresses; no need for fix-ups at load time.

In a single 386-style flat code space, loading a shared library needs fix-ups at load time (it is not always possible to make everything position independent). Also, if two libraries are linked for the same virtual address, at least one of them needs to be rebased to a different virtual address to avoid the conflict.

Making fix-ups in the code means that the fixed page becomes dirty and can't be shared by multiple processes in the system: you either make a copy of the whole library and apply fix-ups to the private copy, or at least store the dirty pages in a process-specific page file.

A good segmented system (with sufficient segment registers) can directly share the same library between multiple processes. Since all code pages are read-only, there is no need to write them to a page file when running out of memory.
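
A minimal C sketch of the kind of load-time fix-up pass being described (the image layout and all names are hypothetical): each absolute-address slot that gets patched dirties the page containing it, which is exactly what breaks page sharing.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical image descriptor: the library was linked to run at
     * link_base, and reloc[] lists the offsets of slots that hold
     * absolute addresses. */
    typedef struct {
        uintptr_t link_base;
        size_t    nrelocs;
        uint32_t  reloc[];
    } image_info;

    /* Patch every absolute-address slot for the actual load address.
     * Each write copy-on-write faults the page holding the slot, so
     * those pages can no longer be shared between processes. */
    static void apply_fixups(uint8_t *load_base, const image_info *info)
    {
        uintptr_t delta = (uintptr_t)load_base - info->link_base;
        for (size_t i = 0; i < info->nrelocs; i++) {
            uintptr_t *slot = (uintptr_t *)(load_base + info->reloc[i]);
            *slot += delta;
        }
    }

With segment-relative addressing (or fully position-independent code) delta never has to be applied to the image at all, so the pages stay clean and shareable.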
On 6/27/2020 10:01 PM, George Neuner wrote:
>>> If you want a *useful* segmenting MMU, you probably need to design it
>>> yourself. Historically there were some units that did it (what I
>>> would call) right, but none are scalable to modern memory sizes.
>>>
>>> Whatever you do, you want the program to work with flat addresses and
>>> have segmentation applied transparently (like paging) during memory
>>> access. You certainly DO NOT want to follow the x86 example of
>>> exposing segments in the addressing.
>>
>> Agreed. Segments were meant to address a different problem.
>>
>> OTOH, exposing them to the instruction set removes any potential
>> ambiguity if two (or more) "general purpose" segments could
>> overlap at a particular spot in the address space; the opcode
>> acts as a disambiguator.
>
> ??? Not following.
In a large, flat address space, it is conceivable that "general purpose"
segments could overlap. So, in such an environment, an address presented
to the memory subsystem would have to resolve to SOME particular physical
address "behind" the segment hardware; the hardware would have to resolve
any possible ambiguities. (How do you design the HARDWARE to prevent
ambiguities from arising without increasing its complexity even more??)

If, instead, the segments are exposed to the programmer, then the choice
of opcode determines which segment (hardware) is consulted to resolve the
reference(s). Any "overlap" becomes unimportant.
>> The PMMU approach sidesteps this issue by rigidly defining where
>> (in the physical and virtual address spaces) a new page CAN begin.
>> It's bank-switching-on-steroids...
>>
>> [IIRC, I had previously concluded that variable sizes were impractical
>> for reasons like this]
>
> The problem is that you're thinking only about the protection aspect
> ... it's the subdivision management of the address space that is made
> slow and difficult if you allow mapping arbitrarily sized regions.
Perhaps you missed:

'You still have a "packing problem" but with a virtual address space
per process, you'd only have to address the "objects" with which a
particular process interacted in any particular address space.
And, that binding (for PIC) could be done at compile time *or*
load time (the latter being more flexible) -- or even RUN-time!'

You have N "modules" in a typical application. The linkage editor mashes
them together into a single binary to be loaded, ensuring that they don't
overlap each other (d'uh!). Easy-peasy.

You have the comparable problem with each segment representing a discrete
"object" being made to coexist disjointly in a single address space. If
the "objects" never change, over time, then this is no harder to address
than the linkage editor problem (assuming any segment can begin at any
location and have any size). Especially for PIC.

But, if segments can be added/removed/resized dynamically, then you're
essentially dealing with the same sort of fragmentation problem that
arises in heap management AND the same sort of algorithm choices for
selecting WHERE to create the next requested segment (unless you pass
that off to the application to handle, as IT knows what its current and
future needs will be). A sketch of that choice follows.
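
For illustration, a minimal first-fit placement sketch in C (all names hypothetical) of the decision a dynamic segment allocator faces; it is the same decision a heap allocator makes, just over virtual address ranges instead of heap blocks:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical descriptor for one mapped segment in a process's
     * virtual address space. */
    typedef struct {
        uintptr_t base;
        size_t    size;
    } segment;

    /* First-fit: return a base address for a new segment of `size`
     * bytes, or 0 if no gap is large enough.  `segs` holds `n`
     * non-overlapping segments sorted by base; the usable space runs
     * from `lo` to `hi`. */
    static uintptr_t place_segment(const segment *segs, size_t n,
                                   uintptr_t lo, uintptr_t hi,
                                   size_t size)
    {
        uintptr_t cursor = lo;
        for (size_t i = 0; i < n; i++) {
            if (segs[i].base - cursor >= size)
                return cursor;              /* gap before segs[i] fits */
            cursor = segs[i].base + segs[i].size;
        }
        return (hi - cursor >= size) ? cursor : 0;  /* trailing gap */
    }

Best-fit, buddy, or address-ordered policies trade external fragmentation against search cost in exactly the way heap allocators do.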
> You have to separate the concerns to do either one efficiently.
>
> That's why pure segment-only MMUs quickly were superseded by
> combination page+segment units with segmenting relegated to protection
> while paging handled address space. And now many CPUs don't even
> bother with segments any more.
The advantage that fixed-size pages offer (even if there is a selection
of sizes to choose from) is that each page has a particular location into
which it fits. You don't have to worry that some *other* page partially
overlaps it, or that it will overlap another.

But, with support for different (fixed) page sizes -- and the attendant
performance consequences thereof -- the application needs to hint the OS
on how it plans/needs to use memory in order to make optimum use of
memory system bandwidth.

Silly for the OS to naively choose a page size for a process based on
some crude metric like "size of object". That can result in excessive
resources being bound that aren't *typically* USED by that object --
fault in those portions AS they are needed (why do I need a --
potentially large -- portion of the object residing in mapped memory if
it is only accessed very infrequently?). OTOH, a finer-grained choice
(allowing smaller pieces of the object to be mapped at a time) reduces
TLB reach as well as consuming OTHER resources (e.g., TLB misses) for an
object with poor locality of reference (here-a-hit, there-a-hit,
everywhere-a-hit-hit...)

So, there needs to be a conversation between the OS and the application
regarding how, best, to map the application onto the hardware -- with
"suitable" defaults in place for applications that aren't aware of the
significance of these issues. This is particularly true if the
application binary can be hosted on different hardware -- or MIGRATED to
different hardware while executing!

Obviously it makes sense to design that API in a way that is only as
general as it needs to be; WHY SUPPORT POSSIBILITIES THAT DON'T EXIST?
(Or, that aren't *likely* to exist in COTS hardware?) IOW, you can KNOW
that:

    ASSERT( !( (pagesize - 1) & pagesize ) )

for all supported "pagesize", and code accordingly! Paraphrasing: "Make
something as simple as it can be -- and no simpler"

[Time to check the daily briefing on the fire and then go out and take a
look at it...]
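
A minimal sketch of coding against that assertion on a POSIX-ish system (the helper names are hypothetical); once the power-of-two property is verified, page rounding collapses to bit masking:

    #include <assert.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Fetch the page size and verify the power-of-two assumption that
     * the rest of the code relies on. */
    static uintptr_t page_mask(void)
    {
        uintptr_t pagesize = (uintptr_t)sysconf(_SC_PAGESIZE);
        assert(((pagesize - 1) & pagesize) == 0);   /* power of two */
        return pagesize - 1;
    }

    /* Round an address down/up to a page boundary -- pure masking, no
     * divides, precisely because of the assertion above. */
    static uintptr_t page_floor(uintptr_t a) { return a & ~page_mask(); }
    static uintptr_t page_ceil(uintptr_t a)
    {
        return (a + page_mask()) & ~page_mask();
    }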
On 6/27/2020 11:24 PM, upsidedown@downunder.com wrote:
> The problem with segmented access on x86 is the far too small number
> of segment registers. In addition, on the 8086 the problem was the
> small maximum segment size (64 KiB). A small segment size is not a
> problem for code, since subroutines are generally much smaller than
> that, but data access to large arrays is a pain.
The problem with segments is that they are a hack to work around a previous constraint that was arbitrarily imposed on CPU architectures. When will we find a 32b space insufficient to represent run-time objects? (It's already insufficient for filesystems.)
> Segments are nice if you are going to use shared loadable libraries
> ("DLLs"). Just load the library and use the original link-time
> addresses; no need for fix-ups at load time.
Note that you can get, effectively, the same capability by putting the object (.so) in a separate (virtual) address space. But, then, you incur the costs of IPC for all references.

[Alpha did this for its notion of passive objects, requiring ins and outs to be located in special accompanying pages passed to the object]
> In a single 386-style flat code space, loading a shared library needs
> fix-ups at load time (it is not always possible to make everything
> position independent). Also, if two libraries are linked for the same
> virtual address, at least one of them needs to be rebased to a
> different virtual address to avoid the conflict.
>
> Making fix-ups in the code means that the fixed page becomes dirty and
> can't be shared by multiple processes in the system: you either make a
> copy of the whole library and apply fix-ups to the private copy, or at
> least store the dirty pages in a process-specific page file.
>
> A good segmented system (with sufficient segment registers) can
> directly share the same library between multiple processes. Since all
> code pages are read-only, there is no need to write them to a page
> file when running out of memory.
But any management scheme requires a fast cache for the parameters
pertinent to the objects being managed by the hardware in THIS process
instance. When does storing a tuple (logical start, physical start,
size) outweigh the savings of using *1* segment (per object) over *many*
pages (per object)? I.e., if an object is always small enough to fit in
a page, then a single TLB entry is sufficient to manage it with the page
size "hard-wired".

You can do the same in a paged system by mapping a single copy of the
object into each consumer's address space, as needed. Fixups and "local
data" can be deliberately situated in separate page(s) that accompany
the object -- but are uniquely instantiated for each consumer (instead
of being shared). The "code" page(s) can be discarded when physical
memory is scarce IF they can be reloaded from their original media
(disk, flash, etc.)
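
A minimal POSIX sketch of that arrangement (the path, offsets, and sizes are hypothetical): the code pages are mapped shared and read-only, so every consumer hits the same physical copy and the kernel can simply drop them under memory pressure, while the fix-up/data pages are mapped copy-on-write per consumer.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define CODE_SIZE (64 * 4096)     /* hypothetical sizes, page-aligned */
    #define DATA_SIZE (4 * 4096)

    int main(void)
    {
        int fd = open("/tmp/object.so", O_RDONLY); /* hypothetical object */
        if (fd < 0) { perror("open"); return 1; }

        /* Code: shared, read+execute.  One physical copy serves every
         * process; clean pages can be discarded and reloaded from the
         * file rather than written to swap. */
        void *code = mmap(NULL, CODE_SIZE, PROT_READ | PROT_EXEC,
                          MAP_SHARED, fd, 0);

        /* Per-consumer fix-up/data area: copy-on-write, so writes land
         * in private pages, not in the shared image. */
        void *data = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE, fd, CODE_SIZE);

        if (code == MAP_FAILED || data == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        printf("code at %p (shared), data at %p (private)\n", code, data);
        return 0;
    }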
