Reply by Dimiter_Popoff October 31, 2021
On 10/31/2021 21:21, antispam@math.uni.wroc.pl wrote:
> Dimiter_Popoff <dp@tgi-sci.com> wrote: >> >> Now this is where the worst fit allocation strategy becomes the game >> changer. A newly created file is almost certainly contiguously >> allocated; fragmentation occurs when it is appended to (and the >> space past its last block has been already allocated in the meantime). >> I think I saw somewhere (never really got interested) that mainstream >> operating systems of today do just first fit - which means once you >> delete a file, no matter how small, its space will be allocated as >> part of the next request etc., no idea why they do it (if they do so, >> my memory on that is not very certain) in such a primitive manner >> but here they are. > > There was some research and first fit turned out to be pretty good. > But it is implemented slightly differently: there is a moving pointer, and the > search advances it. So, after deallocation you wait until the moving > pointer arrives at the hole. That is way better than immediately filling the > hole: there is a reasonable chance that holes will coalesce into a > bigger free area. Another point is that you refuse allocation > when the disc is too full (say at 95% utilization). The two things > together mean that normally fragmentation is not a problem. >
This might be somewhat better than plain first fit but it is no match for worst fit. They probably don't do worst fit because they would have to dig through huge amounts of memory, due to poor filesystem design (they have to go through 32 bits or more for each cluster). In DPS the CAT is one bit per cluster, which makes worst fit doable without any problems like that. I enhanced it a little (many years ago) after noticing that a partition could come to a point where it has two almost equally sized empty parts; now you append to a file and it gets some clusters from one of these; the next time you append to it the *other* one is the largest and it takes from that one... :). So now dps first checks whether there are empty clusters in succession to those already allocated to the file; if there are none, it does worst fit.
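A minimal sketch of that two-step policy, assuming a one-bit-per-cluster CAT held in RAM as a plain byte array (bit set = cluster allocated); the names and the linear scan are illustrative, not actual dps code:

    #include <stdint.h>

    /* One bit per cluster, 1 = allocated (illustrative layout, not dps's). */
    static int cat_test(const uint8_t *cat, uint32_t c)
    {
        return (cat[c >> 3] >> (c & 7)) & 1;
    }

    static void cat_set(uint8_t *cat, uint32_t c)
    {
        cat[c >> 3] |= (uint8_t)(1u << (c & 7));
    }

    /* Step 1: try the cluster right after the file's current last cluster,
     * so appends stay contiguous.  Step 2: otherwise do worst fit -- scan
     * the whole bitmap and take the first cluster of the largest free run.
     * Returns UINT32_MAX when nothing is free. */
    uint32_t alloc_cluster(uint8_t *cat, uint32_t nclusters, uint32_t file_last)
    {
        if (file_last + 1 < nclusters && !cat_test(cat, file_last + 1)) {
            cat_set(cat, file_last + 1);
            return file_last + 1;
        }

        uint32_t best = UINT32_MAX, best_len = 0, run = 0, run_start = 0;
        for (uint32_t c = 0; c < nclusters; c++) {
            if (!cat_test(cat, c)) {
                if (run++ == 0)
                    run_start = c;
                if (run > best_len) {
                    best_len = run;
                    best = run_start;
                }
            } else {
                run = 0;
            }
        }
        if (best != UINT32_MAX)
            cat_set(cat, best);
        return best;
    }

Scanning a bitmap this small is cheap, which is the point above about why one bit per cluster makes worst fit affordable where a 32-bits-per-cluster table would not be.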
Reply by Waldek Hebisch October 31, 2021
Dimiter_Popoff <dp@tgi-sci.com> wrote:
> > Now this is where the worst fit allocation strategy becomes the game > changer. A newly created file is almost certainly contiguously > allocated; fragmentation occurs when it is appended to (and the > space past its last block has been already allocated in the meantime). > I think I saw somewhere (never really got interested) that mainstream > operating systems of today do just first fit - which means once you > delete a file, no matter how small, its space will be allocated as > part of the next request etc., no idea why they do it (if they do so, > my memory on that is not very certain) in such a primitive manner > but here they are.
There was some research and first fit turned out to be pretty good. But it is implemented slightly differently: there is a moving pointer, and the search advances it. So, after deallocation you wait until the moving pointer arrives at the hole. That is way better than immediately filling the hole: there is a reasonable chance that holes will coalesce into a bigger free area. Another point is that you refuse allocation when the disc is too full (say at 95% utilization). The two things together mean that normally fragmentation is not a problem. -- Waldek Hebisch
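For comparison, a sketch of the scheme described here -- first fit with a roving pointer, usually called "next fit" -- over a hypothetical one-bit-per-block bitmap, refusing allocation above a utilization threshold. The structure and names are assumptions for illustration:

    #include <stdint.h>

    #define UTIL_LIMIT_PCT 95        /* refuse allocation when this full */

    struct bmap {
        uint8_t  *bits;              /* one bit per block, 1 = allocated */
        uint32_t  nblocks;
        uint32_t  nused;
        uint32_t  rover;             /* the moving pointer */
    };

    static int bm_test(const struct bmap *m, uint32_t b)
    {
        return (m->bits[b >> 3] >> (b & 7)) & 1;
    }

    /* Next fit: scan forward from the rover, wrapping around once.  A hole
     * left by a deallocation is therefore not reused until the rover comes
     * back around to it, giving neighbouring holes time to coalesce. */
    int64_t nextfit_alloc(struct bmap *m)
    {
        if ((uint64_t)m->nused * 100 >= (uint64_t)m->nblocks * UTIL_LIMIT_PCT)
            return -1;                                /* "disc too full" */

        for (uint32_t i = 0; i < m->nblocks; i++) {
            uint32_t b = (m->rover + i) % m->nblocks;
            if (!bm_test(m, b)) {
                m->bits[b >> 3] |= (uint8_t)(1u << (b & 7));
                m->nused++;
                m->rover = (b + 1) % m->nblocks;      /* keep moving forward */
                return (int64_t)b;
            }
        }
        return -1;                                    /* nothing free */
    }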
Reply by Don Y October 26, 2021
On 10/25/2021 5:23 PM, Richard Damon wrote:
> On 10/25/21 7:19 AM, Don Y wrote: >> On 10/18/2021 1:46 PM, Don Y wrote: >>>> I think this will give you plenty of an idea how to go about it. >>>> Once you know the limit you can run at some reasonable figure >>>> below it and be happy. Getting more precise figures about all >>>> that is neither easy nor will it buy you anything. >>> >>> I suspect "1" is going to end up as the "best compromise". So, >>> I'm treating this as an exercise in *validating* that assumption. >>> I'll see if I can salvage some of the performance monitoring code >>> from the sanitizer to give me details from which I might be able >>> to ferret out "opportunities". If I start by restricting my >>> observations to non-destructive synthetic loads, then I can >>> pull a drive and see how it fares in a different host while >>> running the same code, etc. >> >> Actually, '2' turns out to be marginally better than '1'. >> Beyond that, it's hard to generalize without controlling some >> of the other variables. >> >> '2' wins because there is always the potential to make the >> disk busy, again, just after it satisfies the access for >> the 1st thread (which is now busy using the data, etc.) >> >> But, if the first thread finishes up before the second thread's >> request has been satisfied, then the presence of a THIRD thread >> would just be clutter. (i.e., the work performed correlates >> with the number of threads that can have value) > > Actually, 2 might be slower than 1, because the new request from the second > thread is apt to need a seek, while a single thread making all the calls is > more apt to sequentially read much more of the disk.
Second thread is another instance of first thread; same code, same data. So, it is just "looking ahead" -- i.e., it WILL do what the first thread WOULD do if the second thread hadn't gotten to it, first! The strategy is predicated on tightly coupled actors so they each can see what the other has done/is doing. If the layout of the object that thread 1 is processing calls for a seek, then thread 2 will perform that seek as if thread 1 were to do it "when done chewing on the past data". If the disk caches data, then thread 2 will reap the benefits of that cache AS IF it was thread 1 acting later. Etc. A second thread just hides the cost of the data processing.
> The controller, if not given a new chained command, might choose to > automatically start reading the next sector of the cylinder, which is > likely to be the next one asked for. > > The real optimum is likely a single process doing asynchronous requests, queuing > up a series of requests, and then distributing the data as it comes in to > processing threads to do whatever crunching needs to be done. > > These threads then send the data to a single thread that does asynchronous > writes of the results. > > You can easily tell if the input or output processes are I/O bound or not and > use that to adjust the number of crunching threads in the middle. >
Reply by Richard Damon October 25, 2021
On 10/25/21 7:19 AM, Don Y wrote:
> On 10/18/2021 1:46 PM, Don Y wrote: >>> I think this will give you plenty of an idea how to go about it. >>> Once you know the limit you can run at some reasonable figure >>> below it and be happy. Getting more precise figures about all >>> that is neither easy nor will it buy you anything. >> >> I suspect "1" is going to end up as the "best compromise". So, >> I'm treating this as an exercise in *validating* that assumption. >> I'll see if I can salvage some of the performance monitoring code >> from the sanitizer to give me details from which I might be able >> to ferret out "opportunities". If I start by restricting my >> observations to non-destructive synthetic loads, then I can >> pull a drive and see how it fares in a different host while >> running the same code, etc. > > Actually, '2' turns out to be marginally better than '1'. > Beyond that, it's hard to generalize without controlling some > of the other variables. > > '2' wins because there is always the potential to make the > disk busy, again, just after it satisfies the access for > the 1st thread (which is now busy using the data, etc.) > > But, if the first thread finishes up before the second thread's > request has been satisfied, then the presence of a THIRD thread > would just be clutter. (i.e., the work performed correlates > with the number of threads that can have value)
Actually, 2 might be slower than 1, because the new request from the second thread is apt to need a seek, while a single thread making all the calls is more apt to sequentially read much more of the disk. The controller, if not given a new chained command, might choose to automatically start reading the next sector of the cylinder, which is likely to be the next one asked for. The real optimum is likely a single process doing asynchronous requests, queuing up a series of requests, and then distributing the data as it comes in to processing threads to do whatever crunching needs to be done. These threads then send the data to a single thread that does asynchronous writes of the results. You can easily tell if the input or output processes are I/O bound or not and use that to adjust the number of crunching threads in the middle.
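A rough shape of that pipeline in POSIX threads, as one possible reading of it: one reader deciding the order of disk requests, a small pool of crunchers, one writer. Queue depth, buffer size, worker count and the read_next()/crunch()/emit() hooks are all placeholders, and error handling is omitted:

    #include <pthread.h>
    #include <stdlib.h>

    #define QDEPTH   8
    #define BUFSZ    (64 * 1024)
    #define NWORKERS 2              /* tune against measured I/O vs. CPU time */

    struct buf { char data[BUFSZ]; size_t len; };

    struct queue {
        struct buf     *slot[QDEPTH];
        int             head, tail, count, done;
        pthread_mutex_t mtx;
        pthread_cond_t  notempty, notfull;
    };

    #define QUEUE_INIT { {0}, 0, 0, 0, 0, PTHREAD_MUTEX_INITIALIZER, \
                         PTHREAD_COND_INITIALIZER, PTHREAD_COND_INITIALIZER }

    static struct queue in_q  = QUEUE_INIT;
    static struct queue out_q = QUEUE_INIT;

    static void q_put(struct queue *q, struct buf *b)
    {
        pthread_mutex_lock(&q->mtx);
        while (q->count == QDEPTH)
            pthread_cond_wait(&q->notfull, &q->mtx);
        q->slot[q->tail] = b;
        q->tail = (q->tail + 1) % QDEPTH;
        q->count++;
        pthread_cond_signal(&q->notempty);
        pthread_mutex_unlock(&q->mtx);
    }

    static struct buf *q_get(struct queue *q)   /* NULL => producer finished */
    {
        struct buf *b = NULL;
        pthread_mutex_lock(&q->mtx);
        while (q->count == 0 && !q->done)
            pthread_cond_wait(&q->notempty, &q->mtx);
        if (q->count) {
            b = q->slot[q->head];
            q->head = (q->head + 1) % QDEPTH;
            q->count--;
            pthread_cond_signal(&q->notfull);
        }
        pthread_mutex_unlock(&q->mtx);
        return b;
    }

    static void q_finish(struct queue *q)
    {
        pthread_mutex_lock(&q->mtx);
        q->done = 1;
        pthread_cond_broadcast(&q->notempty);
        pthread_mutex_unlock(&q->mtx);
    }

    struct buf *read_next(void);   /* placeholder: next disk read (malloc'd) */
    void crunch(struct buf *b);    /* placeholder: whatever crunching is due */
    void emit(struct buf *b);      /* placeholder: write out one result      */

    static void *reader(void *arg) /* keeps requests queued ahead of workers */
    {
        struct buf *b;
        while ((b = read_next()) != NULL)
            q_put(&in_q, b);
        q_finish(&in_q);
        return arg;
    }

    static void *worker(void *arg)
    {
        struct buf *b;
        while ((b = q_get(&in_q)) != NULL) {
            crunch(b);
            q_put(&out_q, b);
        }
        return arg;
    }

    static void *writer(void *arg) /* serializes the output */
    {
        struct buf *b;
        while ((b = q_get(&out_q)) != NULL) {
            emit(b);
            free(b);
        }
        return arg;
    }

    /* main(): spawn reader, NWORKERS workers, writer; join reader, join the
     * workers, then q_finish(&out_q) and join the writer (left out here). */

The point of the shape is that the reader, not the crunchers, decides the order the disk sees, so the access pattern stays as sequential as the layout allows; NWORKERS is then the only knob, and the last paragraph above is the rule for turning it.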
Reply by Don Y October 25, 2021
On 10/18/2021 1:46 PM, Don Y wrote:
>> I think this will give you plenty of an idea how to go about it. >> Once you know the limit you can run at some reasonable figure >> below it and be happy. Getting more precise figures about all >> that is neither easy nor will it buy you anything. > > I suspect "1" is going to end up as the "best compromise". So, > I'm treating this as an exercise in *validating* that assumption. > I'll see if I can salvage some of the performance monitoring code > from the sanitizer to give me details from which I might be able > to ferret out "opportunities". If I start by restricting my > observations to non-destructive synthetic loads, then I can > pull a drive and see how it fares in a different host while > running the same code, etc.
Actually, '2' turns out to be marginally better than '1'. Beyond that, it's hard to generalize without controlling some of the other variables. '2' wins because there is always the potential to make the disk busy, again, just after it satisfies the access for the 1st thread (which is now busy using the data, etc.) But, if the first thread finishes up before the second thread's request has been satisfied, then the presence of a THIRD thread would just be clutter. (i.e., the work performed correlates with the number of threads that can have value)
Reply by Don Y October 18, 2021
On 10/18/2021 2:46 PM, Dimiter_Popoff wrote:
>> By contrast, the other disks in the machine all see a fair bit of >> turnover as things get created, revised and deleted. > > Now this is where the worst fit allocation strategy becomes the game > changer. A newly created file is almost certainly contiguously > allocated; fragmentation occurs when it is appended to (and the > space past its last block has been already allocated in the meantime). > I think I saw somewhere (never really got interested) that mainstream > operating systems of today do just first fit
Dunno. There are *lots* of different "file systems". And, as most systems use the filesystem for their global namespace, the term is sadly overloaded (in ways that have nothing to do with storage media). I don't use large secondary stores in my designs. Where "big storage" is required, it is accessed remotely (network) so the problem of dealing with it is not mine. E.g., in my current project, there is no file system exposed; any persistent storage is done via "tables" that are created, as needed (by the system and its apps) in a remote database server. [This lets me delegate the issue of data retention to a specific box. And, also lets me impart structure to the data that is stored. So, for example, I can find the most recent firmware image for a particular hardware module by just issuing a query and waiting on the result instead of having to parse a bunch of names, somewhere in a filesystem/namespace, looking for the one that has the highest rev level IN its name: firmware_1.0.1, firmware_1.0.2, firmware_5.3, firmware 6.1 (typo! the '_' was omitted -- will the parse algorithm choke on that omission??)]
> - which means once you > delete a file, no matter how small, its space will be allocated as > part of the next request etc., no idea why they do it (if they do so, > my memory on that is not very certain) in such a primitive manner > but here they are.
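As an illustration only (the table and columns below are invented for the example, not the actual schema), the "most recent firmware" lookup described above collapses to one ordered query, e.g. held as a string and handed to whatever database client the system uses:

    /* Hypothetical table: firmware(module, major, minor, patch, image).
     * The newest image for a module is one ORDER BY away -- no filename
     * parsing, and a missing '_' can't break anything. */
    static const char latest_fw_query[] =
        "SELECT image FROM firmware "
        "WHERE module = ? "
        "ORDER BY major DESC, minor DESC, patch DESC "
        "LIMIT 1;";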
Reply by Dimiter_Popoff October 18, 2021
On 10/18/2021 23:46, Don Y wrote:
> On 10/18/2021 9:25 AM, Dimiter_Popoff wrote: >> The devil is not that black (literally translating a Bulgarian saying) >> as you see. Worst fit allocation is of course crucial to get to >> such figures, the mainstream OS-s don't do it and things there >> must be much worse. > > I think a lot depends on the amount of "churn" the filesystem > experiences, in normal operation. E.g., the "system" disk on > the workstation I'm using today has about 800G in use of 1T total. > But, the vast majority of it is immutable -- binaries, libraries, > etc. So, there's very low fragmentation (because I "build" the > disk in one shot, instead of incrementally revising and > "updating" its contents)
Typically so on most systems, I expect.
> > By contrast, the other disks in the machine all see a fair bit of > turnover as things get created, revised and deleted.
Now this is where the worst fit allocation strategy becomes the game changer. A newly created file is almost certainly contiguously allocated; fragmentation occurs when it is appended to (and the space past its last block has been already allocated in the meantime). I think I saw somewhere (never really got interested) that mainstream operating systems of today do just first fit - which means once you delete a file, no matter how small, its space will be allocated as part of the next request etc., no idea why they do it (if they do so, my memory on that is not very certain) in such a primitive manner but here they are. ====================================================== Dimiter Popoff, TGI http://www.tgi-sci.com ====================================================== http://www.flickr.com/photos/didi_tgi/
Reply by Don Y October 18, 2021
On 10/18/2021 9:25 AM, Dimiter_Popoff wrote:
> The devil is not that black (literally translating a Bulgarian saying) > as you see. Worst fit allocation is of course crucial to get to > such figures, the mainstream OS-s don't do it and things there > must be much worse.
I think a lot depends on the amount of "churn" the filesystem experiences, in normal operation. E.g., the "system" disk on the workstation I'm using today has about 800G in use of 1T total. But, the vast majority of it is immutable -- binaries, libraries, etc. So, there's very low fragmentation (because I "build" the disk in one shot, instead of incrementally revising and "updating" its contents) By contrast, the other disks in the machine all see a fair bit of turnover as things get created, revised and deleted.
>>> Even on popular filesystems which have long forgotten how to do worst >>> fit allocation and have to defragment their disks not so infrequently. >>> But I think they have to access at least 3 locations to get to a file; >>> the directory entry, some kind of FAT-like thing, then the file. >>> Unlike dps, where 2 accesses are enough. And of course dps does >>> worst fit allocation so defragmenting is just unnecessary. >> >> I think directories are cached. And, possibly entire drive structures >> (depending on how much physical RAM you have available). > > Well of course they must be caching them, especially since there are > gigabytes of RAM available.
Yes, but *which*? And how many (how "much")? I would assume memory set aside for caching files and file system structures is dynamically managed -- if a process looks at lots of directories but few files vs. a process that looks at few directories but many files...
> I know what dps does: it caches longnamed > directories which coexist with the old 8.4 ones in the same filesystem > and work faster than the 8.4 ones which typically don't get cached > (these were done to work well even on floppies, a directory entry > update writes back only the sector (or sectors, if it crosses a boundary) it occupies, > etc.). Then in dps the CAT (cluster allocation tables) are cached all > the time (do that for a 500G partition and enjoy reading all the > 4 megabytes each time the CAT is needed to allocate new space... > it can be done, in fact the caches are enabled upon boot explicitly > on a per LUN/partition basis). > >> E.g., my disk sanitizer times each (fixed size) access to profile the >> drive's performance as well as looking for trouble spots on the media. >> But, things like recal cycles or remapping bad sectors introduce >> completely unpredictable blips in the throughput. So much so that >> I've had to implement a fair bit of logic to identify whether a >> "delay" was part of normal operation *or* a sign of an exceptional >> event. >> >> [But, the sanitizer has a very predictable access pattern so >> there's no filesystem/content -specific issues involved; just >> process sectors as fast as possible. (also, there is no >> need to have multiple threads per spindle; just a thread *per* >> spindle -- plus some overhead threads) >> >> And, the sanitizer isn't as concerned with throughput as the >> human operator is the bottleneck (I can crank out a 500+GB drive >> every few minutes).] > > I did something similar many years ago, when the largest drive > a nukeman had was 200 (230 IIRC) megabytes, i.e. before > magnetoresistive heads came into the world. It did develop > bad sectors and did not do much internally about it (1993). > So I wrote the "lockout" command (still available, I see I have > recompiled it for power, last change 2016 - can't remember if it > did anything useful, nor why I did that). It accessed sector by > sector the LUN it was told to and built the lockout CAT on > its filesystem (LCAT being ORed to CAT prior to the LUN being > usable for the OS). Took quite some time on that drive but > did the job back then.
Yes, support for "bad block management" can be done outside the drive. In the sanitizer's case, it has to report on whether or not it was able to successfully "scrub" every PHYSICAL sector that might contain user data (for some "fussy" users). So, if it appears that a sector may have been remapped (visible by a drop in instantaneous access rate), I query the drive's bad sector statistics to see if I should just abort the process now and mark the drive to be (physically) shredded -- *if* it belongs to a "fussy" user. Regardless (for fussy users), I will query those stats at the end of the operation to see if they have changed during the process. But, again, that's a very specific application with different prospects for optimization. E.g., there's no file system support required as the disk is just a bunch of data blocks (sectors) having no particular structure nor meaning. (So, no need for filesystem code at all! One can scrub a SPARC disk just as easily as a Mac!)
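One portable-ish way to read such statistics from outside the sanitizer is to ask smartmontools for SMART attribute 5; the parsing below is deliberately crude, and the real code presumably talks to the drive directly, so treat this as a sketch:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Return the drive's Reallocated_Sector_Ct raw value, or -1 on failure.
     * Assumes smartctl (smartmontools) is installed and we may run it. */
    long reallocated_sectors(const char *dev)
    {
        char cmd[256], line[512];
        long value = -1;

        snprintf(cmd, sizeof cmd, "smartctl -A %s", dev);
        FILE *p = popen(cmd, "r");
        if (!p)
            return -1;

        while (fgets(line, sizeof line, p)) {
            if (strstr(line, "Reallocated_Sector_Ct")) {
                char *last = strrchr(line, ' ');   /* raw value: last field */
                if (last)
                    value = strtol(last + 1, NULL, 10);
            }
        }
        pclose(p);
        return value;
    }

Sampling this before and after a pass tells you whether sectors were remapped during the run, which is the check described above for the "fussy" users.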
>> I'll mock up some synthetic loads and try various thread-spawning >> strategies to see the sorts of performance I *might* be able >> to get -- with different "preexisting" media (to minimize my >> impact on that). >> >> I'm sure I can round up a dozen or more platforms to try -- just >> from stuff I have lying around here! :> > > I think this will give you plenty of an idea how to go about it. > Once you know the limit you can run at some reasonable figure > below it and be happy. Getting more precise figures about all > that is neither easy nor will it buy you anything.
I suspect "1" is going to end up as the "best compromise". So, I'm treating this as an exercise in *validating* that assumption. I'll see if I can salvage some of the performance monitoring code from the sanitizer to give me details from which I might be able to ferret out "opportunities". If I start by restricting my observations to non-destructive synthetic loads, then I can pull a drive and see how it fares in a different host while running the same code, etc.
Reply by Dimiter_Popoff October 18, 2021
On 10/18/2021 18:05, Don Y wrote:
> On 10/17/2021 3:09 PM, Dimiter_Popoff wrote: >>> You're assuming files are laid out contiguously -- that no seeks are >>> needed >>> "between sectors". >> >> This is the typical case anyway, most files are contiguously allocated. > > I'm not sure that is the case for files that have been modified on a > medium.
It is not the case for files which are appended to, say logs etc. But files like that do not make up such a high percentage. I looked at a log file which logs some activities several times a day; it has grown to 52 megabytes (ASCII text) over something like 15 months (first entry is June 1 2020). It is spread over 32 segments, as one would expect. Then I looked at the directory where I archive emails, 311 files (one of them being Don.txt :-). Just one or two were in two segments, the rest were all contiguous. The devil is not that black (literally translating a Bulgarian saying) as you see. Worst fit allocation is of course crucial to get to such figures, the mainstream OS-s don't do it and things there must be much worse. > ....
>> Even on popular filesystems which have long forgotten how to do worst >> fit allocation and have to defragment their disks not so infrequently. >> But I think they have to access at least 3 locations to get to a file; >> the directory entry, some kind of FAT-like thing, then the file. >> Unlike dps, where 2 accesses are enough. And of course dps does >> worst fit allocation so defragmenting is just unnecessary. > > I think directories are cached. And, possibly entire drive structures > (depending on how much physical RAM you have available).
Well of course they must be caching them, especially since there are gigabytes of RAM available. I know what dps does: it caches longnamed directories which coexist with the old 8.4 ones in the same filesystem and work faster than the 8.4 ones which typically don't get cached (these were done to work well even on floppies, a directory entry update writes back only the sector (or sectors, if it crosses a boundary) it occupies, etc.). Then in dps the CAT (cluster allocation tables) are cached all the time (do that for a 500G partition and enjoy reading all the 4 megabytes each time the CAT is needed to allocate new space... it can be done, in fact the caches are enabled upon boot explicitly on a per LUN/partition basis). > ...
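To put a number on "all the 4 megabytes": assuming, purely for the arithmetic, 16 KiB clusters, a 512 GiB partition holds 2^39 / 2^14 = 2^25 (about 33.5 million) clusters, and at one bit per cluster that is 2^25 bits = 4 MiB of CAT, which matches the figure quoted here; larger clusters shrink it proportionally.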
> > E.g., my disk sanitizer times each (fixed size) access to profile the > drive's performance as well as looking for trouble spots on the media. > But, things like recal cycles or remapping bad sectors introduce > completely unpredictable blips in the throughput. So much so that > I've had to implement a fair bit of logic to identify whether a > "delay" was part of normal operation *or* a sign of an exceptional > event. > > [But, the sanitizer has a very predictable access pattern so > there's no filesystem/content -specific issues involved; just > process sectors as fast as possible. (also, there is no > need to have multiple threads per spindle; just a thread *per* > spindle -- plus some overhead threads) > > And, the sanitizer isn't as concerned with throughput as the > human operator is the bottleneck (I can crank out a 500+GB drive > every few minutes).]
I did something similar many years ago, when the largest drive a nukeman had was 200 (230 IIRC) megabytes, i.e. before magnetoresistive heads came into the world. It did develop bad sectors and did not do much internally about it (1993). So I wrote the "lockout" command (still available, I see I have recompiled it for power, last change 2016 - can't remember if it did anything useful, nor why I did that). It accessed sector by sector the LUN it was told to and built the lockout CAT on its filesystem (LCAT being ORed to CAT prior to the LUN being usable for the OS). Took quite some time on that drive but did the job back then.
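The "LCAT being ORed to CAT" step is small enough to show; the types and names here are assumptions rather than the dps layout:

    #include <stdint.h>
    #include <stddef.h>

    /* Merge the lockout table into the live allocation table at mount time:
     * a cluster recorded as bad already looks allocated, so the allocator
     * can never hand it out again. */
    void apply_lockout(uint8_t *cat, const uint8_t *lcat, size_t nbytes)
    {
        for (size_t i = 0; i < nbytes; i++)
            cat[i] |= lcat[i];
    }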
> > I'll mock up some synthetic loads and try various thread-spawning > strategies to see the sorts of performance I *might* be able > to get -- with different "preexisting" media (to minimize my > impact on that). > > I'm sure I can round up a dozen or more platforms to try -- just > from stuff I have lying around here! :>
I think this will give you plenty of an idea how to go about it. Once you know the limit you can run at some reasonable figure below it and be happy. Getting more precise figures about all that is neither easy nor will it buy you anything. ====================================================== Dimiter Popoff, TGI http://www.tgi-sci.com ====================================================== http://www.flickr.com/photos/didi_tgi/
Reply by Don Y October 18, 2021
On 10/17/2021 3:09 PM, Dimiter_Popoff wrote:
>> You're assuming files are laid out contiguously -- that no seeks are needed >> "between sectors". > > This is the typical case anyway, most files are contiguously allocated.
I'm not sure that is the case for files that have been modified on a medium. Write once, read multiple MAY support that sort of approach (assuming there is enough contiguous space for that first write). But, if you are appending to a file or overwriting it, I don't know what guarantees you can expect; there are a LOT of different file systems out there! I would assume CD9660 would be the friendliest, in this regard. And, it is typically immutable so once the tracks are laid, they can persist. [IIRC, CD9660 also has a "summary VToC" of sorts, so you don't have to seek to individual subdirectories to find things from the medium's root]
> Even on popular filesystems which have long forgotten how to do worst > fit allocation and have to defragment their disks not so infrequently. > But I think they have to access at least 3 locations to get to a file; > the directory entry, some kind of FAT-like thing, then the file. > Unlike dps, where 2 accesses are enough. And of course dps does > worst fit allocation so defragmenting is just unnecessary.
I think directories are cached. And, possibly entire drive structures (depending on how much physical RAM you have available). I still can't see an easy threading strategy that can be applied without more details of the target hardware/OS, filesystem layout, specifics of the drives/volumes, etc. E.g., my disk sanitizer times each (fixed size) access to profile the drive's performance as well as looking for trouble spots on the media. But, things like recal cycles or remapping bad sectors introduce completely unpredictable blips in the throughput. So much so that I've had to implement a fair bit of logic to identify whether a "delay" was part of normal operation *or* a sign of an exceptional event. [But, the sanitizer has a very predictable access pattern so there's no filesystem/content -specific issues involved; just process sectors as fast as possible. (also, there is no need to have multiple threads per spindle; just a thread *per* spindle -- plus some overhead threads) And, the sanitizer isn't as concerned with throughput as the human operator is the bottleneck (I can crank out a 500+GB drive every few minutes).] I'll mock up some synthetic loads and try various thread-spawning strategies to see the sorts of performance I *might* be able to get -- with different "preexisting" media (to minimize my impact on that). I'm sure I can round up a dozen or more platforms to try -- just from stuff I have lying around here! :>
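The "times each (fixed size) access" part might look roughly like this on a POSIX host; the chunk size, the pread() loop and the crude running-average threshold are stand-ins, not the sanitizer's actual logic:

    #include <fcntl.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    #define CHUNK (1024 * 1024)          /* fixed-size access being profiled */

    static double now_s(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    /* Read the raw device in fixed-size chunks, track a running mean of the
     * per-chunk time, and flag anything far above it (recal? remap? weak spot). */
    void profile_device(const char *dev)
    {
        static char buf[CHUNK];
        int    fd  = open(dev, O_RDONLY);
        double avg = 0.0;
        long   n   = 0;
        off_t  off = 0;

        if (fd < 0) {
            perror(dev);
            return;
        }
        for (;;) {
            double  t0  = now_s();
            ssize_t got = pread(fd, buf, CHUNK, off);
            double  dt  = now_s() - t0;

            if (got <= 0)                      /* end of device (or error) */
                break;
            n++;
            avg += (dt - avg) / n;             /* running mean */
            if (n > 16 && dt > 8.0 * avg)      /* crude "blip" detector */
                printf("blip at offset %lld: %.1f ms (mean %.1f ms)\n",
                       (long long)off, dt * 1e3, avg * 1e3);
            off += got;
        }
        close(fd);
    }

Deciding whether a flagged blip is "normal operation *or* a sign of an exceptional event" is then a filtering problem on those samples, which is the "fair bit of logic" referred to above.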