Reply by Dimiter_Popoff October 31, 2021
On 10/31/2021 21:21, antispam@math.uni.wroc.pl wrote:
> Dimiter_Popoff <dp@tgi-sci.com> wrote: >> >> Now this is where the worst fit allocation strategy becomes the game >> changer. A newly created file is almost certainly contiguously >> allocated; fragmentation occurs when it is appended to (and the >> space past its last block has been already allocated in the meantime). >> I think I saw somewhere (never really got interested) that mainstream >> operating systems of today do just first fit - which means once you >> delete a file, no matter how small, its space will be allocated as >> part of the next request etc., no idea why they do it (if they do so, >> my memory on that is not very certain) in such a primitive manner >> but here they are. > > There was some research and first fit turned out to be pretty good. > But it is implemented slightly differently: there is a moving pointer, and the > search advances it. So, after deallocation you wait until the moving > pointer arrives at the hole. That is way better than immediately filling the > hole: there is a reasonable chance that holes will coalesce into a > bigger free area. Another point is that you refuse allocation > when the disc is too full (say at 95% utilization). The two things > together mean that normally fragmentation is not a problem. >
This might be somewhat better than plain first fit but it is no match for worst fit. They probably don't do worst fit because they would have to dig through huge amounts of memory, due to poor filesystem design (they have to go through 32 bits or more for each cluster). In DPS the CAT is one bit per cluster, which makes worst fit doable without any problems like that. I enhanced it a little (many years ago) after noticing that a partition could come to a point where it has two almost equally sized empty parts; now you append to a file and it gets some clusters from one of these; the next time you append to it the *other* one is the largest and it takes from that one... :). So now dps first checks whether there are empty clusters in succession to those already allocated to the file; if there are none, it does worst fit.
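A minimal sketch of that two-step policy, assuming a one-bit-per-cluster CAT held in RAM as a plain byte array (bit set = cluster allocated); the names and the linear scan are illustrative, not actual dps code:

    #include <stdint.h>

    /* One bit per cluster, 1 = allocated (illustrative layout, not dps's). */
    static int cat_test(const uint8_t *cat, uint32_t c)
    {
        return (cat[c >> 3] >> (c & 7)) & 1;
    }

    static void cat_set(uint8_t *cat, uint32_t c)
    {
        cat[c >> 3] |= (uint8_t)(1u << (c & 7));
    }

    /* Step 1: try the cluster right after the file's current last cluster,
     * so appends stay contiguous.  Step 2: otherwise do worst fit -- scan
     * the whole bitmap and take the first cluster of the largest free run.
     * Returns UINT32_MAX when nothing is free. */
    uint32_t alloc_cluster(uint8_t *cat, uint32_t nclusters, uint32_t file_last)
    {
        if (file_last + 1 < nclusters && !cat_test(cat, file_last + 1)) {
            cat_set(cat, file_last + 1);
            return file_last + 1;
        }

        uint32_t best = UINT32_MAX, best_len = 0, run = 0, run_start = 0;
        for (uint32_t c = 0; c < nclusters; c++) {
            if (!cat_test(cat, c)) {
                if (run++ == 0)
                    run_start = c;
                if (run > best_len) {
                    best_len = run;
                    best = run_start;
                }
            } else {
                run = 0;
            }
        }
        if (best != UINT32_MAX)
            cat_set(cat, best);
        return best;
    }

Scanning a bitmap this small is cheap, which is the point above about why one bit per cluster makes worst fit affordable where a 32-bits-per-cluster table would not be.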
Reply by Waldek Hebisch October 31, 2021
Dimiter_Popoff <dp@tgi-sci.com> wrote:
> > Now this is where the worst fit allocation strategy becomes the game > changer. A newly created file is almost certainly contiguously > allocated; fragmentation occurs when it is appended to (and the > space past its last block has been already allocated in the meantime). > I think I saw somewhere (never really got interested) that mainstream > operating systems of today do just first fit - which means once you > delete a file, no matter how small, its space will be allocated as > part of the next request etc., no idea why they do it (if they do so, > my memory on that is not very certain) in such a primitive manner > but here they are.
There was some research and first fit turned out to be pretty good. But it is implemented slightly differently: there is a moving pointer, and the search advances it. So, after deallocation you wait until the moving pointer arrives at the hole. That is way better than immediately filling the hole: there is a reasonable chance that holes will coalesce into a bigger free area. Another point is that you refuse allocation when the disc is too full (say at 95% utilization). The two things together mean that normally fragmentation is not a problem. -- Waldek Hebisch
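For comparison, a sketch of the scheme described here -- first fit with a roving pointer, usually called "next fit" -- over a hypothetical one-bit-per-block bitmap, refusing allocation above a utilization threshold. The structure and names are assumptions for illustration:

    #include <stdint.h>

    #define UTIL_LIMIT_PCT 95        /* refuse allocation when this full */

    struct bmap {
        uint8_t  *bits;              /* one bit per block, 1 = allocated */
        uint32_t  nblocks;
        uint32_t  nused;
        uint32_t  rover;             /* the moving pointer */
    };

    static int bm_test(const struct bmap *m, uint32_t b)
    {
        return (m->bits[b >> 3] >> (b & 7)) & 1;
    }

    /* Next fit: scan forward from the rover, wrapping around once.  A hole
     * left by a deallocation is therefore not reused until the rover comes
     * back around to it, giving neighbouring holes time to coalesce. */
    int64_t nextfit_alloc(struct bmap *m)
    {
        if ((uint64_t)m->nused * 100 >= (uint64_t)m->nblocks * UTIL_LIMIT_PCT)
            return -1;                                /* "disc too full" */

        for (uint32_t i = 0; i < m->nblocks; i++) {
            uint32_t b = (m->rover + i) % m->nblocks;
            if (!bm_test(m, b)) {
                m->bits[b >> 3] |= (uint8_t)(1u << (b & 7));
                m->nused++;
                m->rover = (b + 1) % m->nblocks;      /* keep moving forward */
                return (int64_t)b;
            }
        }
        return -1;                                    /* nothing free */
    }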
Reply by Don Y October 26, 2021
On 10/25/2021 5:23 PM, Richard Damon wrote:
> On 10/25/21 7:19 AM, Don Y wrote: >> On 10/18/2021 1:46 PM, Don Y wrote: >>>> I think this will give you plenty of an idea how to go about it. >>>> Once you know the limit you can run at some reasonable figure >>>> below it and be happy. Getting more precise figures about all >>>> that is neither easy nor will it buy you anything. >>> >>> I suspect "1" is going to end up as the "best compromise". So, >>> I'm treating this as an exercise in *validating* that assumption. >>> I'll see if I can salvage some of the performance monitoring code >>> from the sanitizer to give me details from which I might be able >>> to ferret out "opportunities". If I start by restricting my >>> observations to non-destructive synthetic loads, then I can >>> pull a drive and see how it fares in a different host while >>> running the same code, etc. >> >> Actually, '2' turns out to be marginally better than '1'. >> Beyond that, it's hard to generalize without controlling some >> of the other variables. >> >> '2' wins because there is always the potential to make the >> disk busy, again, just after it satisfies the access for >> the 1st thread (which is now busy using the data, etc.) >> >> But, if the first thread finishes up before the second thread's >> request has been satisfied, then the presence of a THIRD thread >> would just be clutter. (i.e., the work performed correlates >> with the number of threads that can have value) > > Actually, 2 might be slower than 1, because the new request from the second > thread is apt to need a seek, while a single thread making all the calls is > more apt to sequentially read much more of the disk.
Second thread is another instance of first thread; same code, same data. So, it is just "looking ahead" -- i.e., it WILL do what the first thread WOULD do if the second thread hadn't gotten to it, first! The strategy is predicated on tightly coupled actors so they each can see what the other has done/is doing. If the layout of the object that thread 1 is processing calls for a seek, then thread 2 will perform that seek as if thread 1 were to do it "when done chewing on the past data". If the disk caches data, then thread 2 will reap the benefits of that cache AS IF it was thread 1 acting later. Etc. A second thread just hides the cost of the data processing.
> The controller, if not given a new chained command, might choose to > automatically start reading the next sector of the cylinder, which is > likely to be the next one asked for. > > The real optimum is likely a single process doing asynchronous requests, queuing > up a series of requests, and then distributing the data as it comes in to > processing threads to do whatever crunching needs to be done. > > These threads then send the data to a single thread that does asynchronous > writes of the results. > > You can easily tell if the input or output processes are I/O bound or not and > use that to adjust the number of crunching threads in the middle. >
Reply by Richard Damon October 25, 2021
On 10/25/21 7:19 AM, Don Y wrote:
> On 10/18/2021 1:46 PM, Don Y wrote: >>> I think this will give you plenty of an idea how to go about it. >>> Once you know the limit you can run at some reasonable figure >>> below it and be happy. Getting more precise figures about all >>> that is neither easy nor will it buy you anything. >> >> I suspect "1" is going to end up as the "best compromise". So, >> I'm treating this as an exercise in *validating* that assumption. >> I'll see if I can salvage some of the performance monitoring code >> from the sanitizer to give me details from which I might be able >> to ferret out "opportunities". If I start by restricting my >> observations to non-destructive synthetic loads, then I can >> pull a drive and see how it fares in a different host while >> running the same code, etc. > > Actually, '2' turns out to be marginally better than '1'. > Beyond that, it's hard to generalize without controlling some > of the other variables. > > '2' wins because there is always the potential to make the > disk busy, again, just after it satisfies the access for > the 1st thread (which is now busy using the data, etc.) > > But, if the first thread finishes up before the second thread's > request has been satisfied, then the presence of a THIRD thread > would just be clutter. (i.e., the work performed correlates > with the number of threads that can have value)
Actually, 2 might be slower than 1, because the new request from the second thread is apt to need a seek, while a single thread making all the calls is more apt to sequentially read much more of the disk. The controller, if not given a new chained command, might choose to automatically start reading the next sector of the cylinder, which is likely to be the next one asked for. The real optimum is likely a single process doing asynchronous requests, queuing up a series of requests, and then distributing the data as it comes in to processing threads to do whatever crunching needs to be done. These threads then send the data to a single thread that does asynchronous writes of the results. You can easily tell if the input or output processes are I/O bound or not and use that to adjust the number of crunching threads in the middle.
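A rough shape of that pipeline in POSIX threads, as one possible reading of it: one reader deciding the order of disk requests, a small pool of crunchers, one writer. Queue depth, buffer size, worker count and the read_next()/crunch()/emit() hooks are all placeholders, and error handling is omitted:

    #include <pthread.h>
    #include <stdlib.h>

    #define QDEPTH   8
    #define BUFSZ    (64 * 1024)
    #define NWORKERS 2              /* tune against measured I/O vs. CPU time */

    struct buf { char data[BUFSZ]; size_t len; };

    struct queue {
        struct buf     *slot[QDEPTH];
        int             head, tail, count, done;
        pthread_mutex_t mtx;
        pthread_cond_t  notempty, notfull;
    };

    #define QUEUE_INIT { {0}, 0, 0, 0, 0, PTHREAD_MUTEX_INITIALIZER, \
                         PTHREAD_COND_INITIALIZER, PTHREAD_COND_INITIALIZER }

    static struct queue in_q  = QUEUE_INIT;
    static struct queue out_q = QUEUE_INIT;

    static void q_put(struct queue *q, struct buf *b)
    {
        pthread_mutex_lock(&q->mtx);
        while (q->count == QDEPTH)
            pthread_cond_wait(&q->notfull, &q->mtx);
        q->slot[q->tail] = b;
        q->tail = (q->tail + 1) % QDEPTH;
        q->count++;
        pthread_cond_signal(&q->notempty);
        pthread_mutex_unlock(&q->mtx);
    }

    static struct buf *q_get(struct queue *q)   /* NULL => producer finished */
    {
        struct buf *b = NULL;
        pthread_mutex_lock(&q->mtx);
        while (q->count == 0 && !q->done)
            pthread_cond_wait(&q->notempty, &q->mtx);
        if (q->count) {
            b = q->slot[q->head];
            q->head = (q->head + 1) % QDEPTH;
            q->count--;
            pthread_cond_signal(&q->notfull);
        }
        pthread_mutex_unlock(&q->mtx);
        return b;
    }

    static void q_finish(struct queue *q)
    {
        pthread_mutex_lock(&q->mtx);
        q->done = 1;
        pthread_cond_broadcast(&q->notempty);
        pthread_mutex_unlock(&q->mtx);
    }

    struct buf *read_next(void);   /* placeholder: next disk read (malloc'd) */
    void crunch(struct buf *b);    /* placeholder: whatever crunching is due */
    void emit(struct buf *b);      /* placeholder: write out one result      */

    static void *reader(void *arg) /* keeps requests queued ahead of workers */
    {
        struct buf *b;
        while ((b = read_next()) != NULL)
            q_put(&in_q, b);
        q_finish(&in_q);
        return arg;
    }

    static void *worker(void *arg)
    {
        struct buf *b;
        while ((b = q_get(&in_q)) != NULL) {
            crunch(b);
            q_put(&out_q, b);
        }
        return arg;
    }

    static void *writer(void *arg) /* serializes the output */
    {
        struct buf *b;
        while ((b = q_get(&out_q)) != NULL) {
            emit(b);
            free(b);
        }
        return arg;
    }

    /* main(): spawn reader, NWORKERS workers, writer; join reader, join the
     * workers, then q_finish(&out_q) and join the writer (left out here). */

The point of the shape is that the reader, not the crunchers, decides the order the disk sees, so the access pattern stays as sequential as the layout allows; NWORKERS is then the only knob, and the last paragraph above is the rule for turning it.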
Reply by Don Y October 25, 2021
On 10/18/2021 1:46 PM, Don Y wrote:
>> I think this will give you plenty of an idea how to go about it. >> Once you know the limit you can run at some reasonable figure >> below it and be happy. Getting more precise figures about all >> that is neither easy nor will it buy you anything. > > I suspect "1" is going to end up as the "best compromise". So, > I'm treating this as an exercise in *validating* that assumption. > I'll see if I can salvage some of the performance monitoring code > from the sanitizer to give me details from which I might be able > to ferret out "opportunities". If I start by restricting my > observations to non-destructive synthetic loads, then I can > pull a drive and see how it fares in a different host while > running the same code, etc.
Actually, '2' turns out to be marginally better than '1'. Beyond that, it's hard to generalize without controlling some of the other variables. '2' wins because there is always the potential to make the disk busy, again, just after it satisfies the access for the 1st thread (which is now busy using the data, etc.) But, if the first thread finishes up before the second thread's request has been satisfied, then the presence of a THIRD thread would just be clutter. (i.e., the work performed correlates with the number of threads that can have value)
Reply by Don Y October 18, 2021
On 10/18/2021 2:46 PM, Dimiter_Popoff wrote:
>> By contrast, the other disks in the machine all see a fair bit of >> turnover as things get created, revised and deleted. > > Now this is where the worst fit allocation strategy becomes the game > changer. A newly created file is almost certainly contiguously > allocated; fragmentation occurs when it is appended to (and the > space past its last block has been already allocated in the meantime). > I think I saw somewhere (never really got interested) that mainstream > operating systems of today do just first fit
Dunno. There are *lots* of different "file systems". And, as most systems use the filesystem for their global namespace, the term is sadly overloaded (in ways that have nothing to do with storage media). I don't use large secondary stores in my designs. Where "big storage" is required, it is accessed remotely (network) so the problem of dealing with it is not mine. E.g., in my current project, there is no file system exposed; any persistent storage is done via "tables" that are created, as needed (by the system and its apps) in a remote database server. [This lets me delegate the issue of data retention to a specific box. And, also lets me impart structure to the data that is stored. So, for example, I can find the most recent firmware image for a particular hardware module by just issuing a query and waiting on the result instead of having to parse a bunch of names, somewhere in a filesystem/namespace, looking for the one that has the highest rev level IN its name: firmware_1.0.1, firmware_1.0.2, firmware_5.3, firmware 6.1 (typo! the '_' was omitted -- will the parse algorithm choke on that omission??)]
> - which means once you > delete a file, no matter how small, its space will be allocated as > part of the next request etc., no idea why they do it (if they do so, > my memory on that is not very certain) in such a primitive manner > but here they are.
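As an illustration only (the table and columns below are invented for the example, not the actual schema), the "most recent firmware" lookup described above collapses to one ordered query, e.g. held as a string and handed to whatever database client the system uses:

    /* Hypothetical table: firmware(module, major, minor, patch, image).
     * The newest image for a module is one ORDER BY away -- no filename
     * parsing, and a missing '_' can't break anything. */
    static const char latest_fw_query[] =
        "SELECT image FROM firmware "
        "WHERE module = ? "
        "ORDER BY major DESC, minor DESC, patch DESC "
        "LIMIT 1;";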
Reply by Dimiter_Popoff October 18, 2021
On 10/18/2021 23:46, Don Y wrote:
> On 10/18/2021 9:25 AM, Dimiter_Popoff wrote: >> The devil is not that black (literally translating a Bulgarian saying) >> as you see. Worst fit allocation is of course crucial to get to >> such figures, the mainstream OS-s don't do it and things there >> must be much worse. > > I think a lot depends on the amount of "churn" the filesystem > experiences, in normal operation. E.g., the "system" disk on > the workstation I'm using today has about 800G in use of 1T total. > But, the vast majority of it is immutable -- binaries, libraries, > etc. So, there's very low fragmentation (because I "build" the > disk in one shot, instead of incrementally revising and > "updating" its contents)
Typically so on most systems, I expect.
> > By contrast, the other disks in the machine all see a fair bit of > turnover as things get created, revised and deleted.
Now this is where the worst fit allocation strategy becomes the game changer. A newly created file is almost certainly contiguously allocated; fragmentation occurs when it is appended to (and the space past its last block has been already allocated in the meantime). I think I saw somewhere (never really got interested) that mainstream operating systems of today do just first fit - which means once you delete a file, no matter how small, its space will be allocated as part of the next request etc., no idea why they do it (if they do so, my memory on that is not very certain) in such a primitive manner but here they are. ====================================================== Dimiter Popoff, TGI http://www.tgi-sci.com ====================================================== http://www.flickr.com/photos/didi_tgi/
Reply by Don Y October 18, 2021
On 10/18/2021 9:25 AM, Dimiter_Popoff wrote:
> The devil is not that black (literally translating a Bulgarian saying) > as you see. Worst fit allocation is of course crucial to get to > such figures, the mainstream OS-s don't do it and things there > must be much worse.
I think a lot depends on the amount of "churn" the filesystem experiences, in normal operation. E.g., the "system" disk on the workstation I'm using today has about 800G in use of 1T total. But, the vast majority of it is immutable -- binaries, libraries, etc. So, there's very low fragmentation (because I "build" the disk in one shot, instead of incrementally revising and "updating" its contents) By contrast, the other disks in the machine all see a fair bit of turnover as things get created, revised and deleted.
>>> Even on popular filesystems which have long forgotten how to do worst >>> fit allocation and have to defragment their disks not so infrequently. >>> But I think they have to access at least 3 locations to get to a file; >>> the directory entry, some kind of FAT-like thing, then the file. >>> Unlike dps, where 2 accesses are enough. And of course dps does >>> worst fit allocation so defragmenting is just unnecessary. >> >> I think directories are cached. And, possibly entire drive structures >> (depending on how much physical RAM you have available). > > Well of course they must be caching them, especially since there are > gigabytes of RAM available.
Yes, but *which*? And how many (how "much")? I would assume memory set aside for caching files and file system structures is dynamically managed -- if a process looks at lots of directories but few files vs. a process that looks at few directories but many files...
> I know what dps does: it caches longnamed > directories which coexist with the old 8.4 ones in the same filesystem > and work faster than the 8.4 ones which typically don't get cached > (these were done to work well even on floppies, a directory entry > update writes back only the sector (or sectors, if it crosses a boundary) it occupies, > etc.). Then in dps the CAT (cluster allocation tables) are cached all > the time (do that for a 500G partition and enjoy reading all the > 4 megabytes each time the CAT is needed to allocate new space... > it can be done, in fact the caches are enabled upon boot explicitly > on a per LUN/partition basis). > >> E.g., my disk sanitizer times each (fixed size) access to profile the >> drive's performance as well as looking for trouble spots on the media. >> But, things like recal cycles or remapping bad sectors introduce >> completely unpredictable blips in the throughput. So much so that >> I've had to implement a fair bit of logic to identify whether a >> "delay" was part of normal operation *or* a sign of an exceptional >> event. >> >> [But, the sanitizer has a very predictable access pattern so >> there's no filesystem/content -specific issues involved; just >> process sectors as fast as possible. (also, there is no >> need to have multiple threads per spindle; just a thread *per* >> spindle -- plus some overhead threads) >> >> And, the sanitizer isn't as concerned with throughput as the >> human operator is the bottleneck (I can crank out a 500+GB drive >> every few minutes).] > > I did something similar many years ago, when the largest drive > a nukeman had was 200 (230 IIRC) megabytes, i.e. before > magnetoresistive heads came into the world. It did develop > bad sectors and did not do much internally about it (1993). > So I wrote the "lockout" command (still available, I see I have > recompiled it for power, last change 2016 - can't remember if it > did anything useful, nor why I did that). It accessed sector by > sector the LUN it was told to and built the lockout CAT on > its filesystem (LCAT being ORed to CAT prior to the LUN being > usable for the OS). Took quite some time on that drive but > did the job back then.
Yes, support for "bad block management" can be done outside the drive. In the sanitizer's case, it has to report on whether or not it was able to successfully "scrub" every PHYSICAL sector that might contain user data (for some "fussy" users). So, if it appears that a sector may have been remapped (visible by a drop in instantaneous access rate), I query the drive's bad sector statistics to see if I should just abort the process now and mark the drive to be (physically) shredded -- *if* it belongs to a "fussy" user. Regardless (for fussy users), I will query those stats at the end of the operation to see if they have changed during the process. But, again, that's a very specific application with different prospects for optimization. E.g., there's no file system support required as the disk is just a bunch of data blocks (sectors) having no particular structure nor meaning. (So, no need for filesystem code at all! One can scrub a SPARC disk just as easily as a Mac!)
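One portable-ish way to read such statistics from outside the sanitizer is to ask smartmontools for SMART attribute 5; the parsing below is deliberately crude, and the real code presumably talks to the drive directly, so treat this as a sketch:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Return the drive's Reallocated_Sector_Ct raw value, or -1 on failure.
     * Assumes smartctl (smartmontools) is installed and we may run it. */
    long reallocated_sectors(const char *dev)
    {
        char cmd[256], line[512];
        long value = -1;

        snprintf(cmd, sizeof cmd, "smartctl -A %s", dev);
        FILE *p = popen(cmd, "r");
        if (!p)
            return -1;

        while (fgets(line, sizeof line, p)) {
            if (strstr(line, "Reallocated_Sector_Ct")) {
                char *last = strrchr(line, ' ');   /* raw value: last field */
                if (last)
                    value = strtol(last + 1, NULL, 10);
            }
        }
        pclose(p);
        return value;
    }

Sampling this before and after a pass tells you whether sectors were remapped during the run, which is the check described above for the "fussy" users.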
>> I'll mock up some synthetic loads and try various thread-spawning >> strategies to see the sorts of performance I *might* be able >> to get -- with different "preexisting" media (to minimize my >> impact on that). >> >> I'm sure I can round up a dozen or more platforms to try -- just >> from stuff I have lying around here! :> > > I think this will give you plenty of an idea how to go about it. > Once you know the limit you can run at some reasonable figure > below it and be happy. Getting more precise figures about all > that is neither easy nor will it buy you anything.
I suspect "1" is going to end up as the "best compromise". So, I'm treating this as an exercise in *validating* that assumption. I'll see if I can salvage some of the performance monitoring code from the sanitizer to give me details from which I might be able to ferret out "opportunities". If I start by restricting my observations to non-destructive synthetic loads, then I can pull a drive and see how it fares in a different host while running the same code, etc.
Reply by Dimiter_Popoff October 18, 2021
On 10/18/2021 18:05, Don Y wrote:
> On 10/17/2021 3:09 PM, Dimiter_Popoff wrote: >>> You're assuming files are laid out contiguously -- that no seeks are >>> needed >>> "between sectors". >> >> This is the typical case anyway, most files are contiguously allocated. > > I'm not sure that is the case for files that have been modified on a > medium.
It is not the case for files which are appended to, say logs etc. But files like that do not make up such a high percentage. I looked at a log file which logs some activities several times a day; it has grown to 52 megabytes (ASCII text) over something like 15 months (first entry is June 1 2020). It is spread over 32 segments, as one would expect. Then I looked at the directory where I archive emails, 311 files (one of them being Don.txt :-). Just one or two were in two segments, the rest were all contiguous. The devil is not that black (literally translating a Bulgarian saying) as you see. Worst fit allocation is of course crucial to get to such figures, the mainstream OS-s don't do it and things there must be much worse. > ....
>> Even on popular filesystems which have long forgotten how to do worst >> fit allocation and have to defragment their disks not so infrequently. >> But I think they have to access at least 3 locations to get to a file; >> the directory entry, some kind of FAT-like thing, then the file. >> Unlike dps, where 2 accesses are enough. And of course dps does >> worst fit allocation so defragmenting is just unnecessary. > > I think directories are cached. And, possibly entire drive structures > (depending on how much physical RAM you have available).
Well of course they must be caching them, especially since there are gigabytes of RAM available. I know what dps does: it caches longnamed directories which coexist with the old 8.4 ones in the same filesystem and work faster than the 8.4 ones which typically don't get cached (these were done to work well even on floppies, a directory entry update writes back only the sector (or sectors, if it crosses a boundary) it occupies, etc.). Then in dps the CAT (cluster allocation tables) are cached all the time (do that for a 500G partition and enjoy reading all the 4 megabytes each time the CAT is needed to allocate new space... it can be done, in fact the caches are enabled upon boot explicitly on a per LUN/partition basis). > ...
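To put a number on "all the 4 megabytes": assuming, purely for the arithmetic, 16 KiB clusters, a 512 GiB partition holds 2^39 / 2^14 = 2^25 (about 33.5 million) clusters, and at one bit per cluster that is 2^25 bits = 4 MiB of CAT, which matches the figure quoted here; larger clusters shrink it proportionally.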
> > E.g., my disk sanitizer times each (fixed size) access to profile the > drive's performance as well as looking for trouble spots on the media. > But, things like recal cycles or remapping bad sectors introduce > completely unpredictable blips in the throughput. So much so that > I've had to implement a fair bit of logic to identify whether a > "delay" was part of normal operation *or* a sign of an exceptional > event. > > [But, the sanitizer has a very predictable access pattern so > there's no filesystem/content -specific issues involved; just > process sectors as fast as possible. (also, there is no > need to have multiple threads per spindle; just a thread *per* > spindle -- plus some overhead threads) > > And, the sanitizer isn't as concerned with throughput as the > human operator is the bottleneck (I can crank out a 500+GB drive > every few minutes).]
I did something similar many years ago, when the largest drive a nukeman had was 200 (230 IIRC) megabytes, i.e. before magnetoresistive heads came into the world. It did develop bad sectors and did not do much internally about it (1993). So I wrote the "lockout" command (still available, I see I have recompiled it for power, last change 2016 - can't remember if it did anything useful, nor why I did that). It accessed sector by sector the LUN it was told to and built the lockout CAT on its filesystem (LCAT being ORed to CAT prior to the LUN being usable for the OS). Took quite some time on that drive but did the job back then.
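The "LCAT being ORed to CAT" step is small enough to show; the types and names here are assumptions rather than the dps layout:

    #include <stdint.h>
    #include <stddef.h>

    /* Merge the lockout table into the live allocation table at mount time:
     * a cluster recorded as bad already looks allocated, so the allocator
     * can never hand it out again. */
    void apply_lockout(uint8_t *cat, const uint8_t *lcat, size_t nbytes)
    {
        for (size_t i = 0; i < nbytes; i++)
            cat[i] |= lcat[i];
    }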
> > I'll mock up some synthetic loads and try various thread-spawning > strategies to see the sorts of performance I *might* be able > to get -- with different "preexisting" media (to minimize my > impact on that). > > I'm sure I can round up a dozen or more platforms to try -- just > from stuff I have lying around here! :>
I think this will give you plenty of an idea how to go about it. Once you know the limit you can run at some reasonable figure below it and be happy. Getting more precise figures about all that is neither easy nor will it buy you anything. ====================================================== Dimiter Popoff, TGI http://www.tgi-sci.com ====================================================== http://www.flickr.com/photos/didi_tgi/
Reply by Don Y October 18, 2021
On 10/17/2021 3:09 PM, Dimiter_Popoff wrote:
>> You're assuming files are laid out contiguously -- that no seeks are needed >> "between sectors". > > This is the typical case anyway, most files are contiguously allocated.
I'm not sure that is the case for files that have been modified on a medium. Write once, read multiple MAY support that sort of approach (assuming there is enough contiguous space for that first write). But, if you are appending to a file or overwriting it, I don't know what guarantees you can expect; there are a LOT of different file systems out there! I would assume CD9660 would be the friendliest, in this regard. And, it is typically immutable so once the tracks are laid, they can persist. [IIRC, CD9660 also has a "summary VToC" of sorts, so you don't have to seek to individual subdirectories to find things from the medium's root]
> Even on popular filesystems which have long forgotten how to do worst > fit allocation and have to defragment their disks not so infrequently. > But I think they have to access at least 3 locations to get to a file; > the directory entry, some kind of FAT-like thing, then the file. > Unlike dps, where 2 accesses are enough. And of course dps does > worst fit allocation so defragmenting is just unnecessary.
I think directories are cached. And, possibly entire drive structures (depending on how much physical RAM you have available). I still can't see an easy threading strategy that can be applied without more details of the target hardware/OS, filesystem layout, specifics of the drives/volumes, etc. E.g., my disk sanitizer times each (fixed size) access to profile the drive's performance as well as looking for trouble spots on the media. But, things like recal cycles or remapping bad sectors introduce completely unpredictable blips in the throughput. So much so that I've had to implement a fair bit of logic to identify whether a "delay" was part of normal operation *or* a sign of an exceptional event. [But, the sanitizer has a very predictable access pattern so there's no filesystem/content -specific issues involved; just process sectors as fast as possible. (also, there is no need to have multiple threads per spindle; just a thread *per* spindle -- plus some overhead threads) And, the sanitizer isn't as concerned with throughput as the human operator is the bottleneck (I can crank out a 500+GB drive every few minutes).] I'll mock up some synthetic loads and try various thread-spawning strategies to see the sorts of performance I *might* be able to get -- with different "preexisting" media (to minimize my impact on that). I'm sure I can round up a dozen or more platforms to try -- just from stuff I have lying around here! :>
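The "times each (fixed size) access" part might look roughly like this on a POSIX host; the chunk size, the pread() loop and the crude running-average threshold are stand-ins, not the sanitizer's actual logic:

    #include <fcntl.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    #define CHUNK (1024 * 1024)          /* fixed-size access being profiled */

    static double now_s(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    /* Read the raw device in fixed-size chunks, track a running mean of the
     * per-chunk time, and flag anything far above it (recal? remap? weak spot). */
    void profile_device(const char *dev)
    {
        static char buf[CHUNK];
        int    fd  = open(dev, O_RDONLY);
        double avg = 0.0;
        long   n   = 0;
        off_t  off = 0;

        if (fd < 0) {
            perror(dev);
            return;
        }
        for (;;) {
            double  t0  = now_s();
            ssize_t got = pread(fd, buf, CHUNK, off);
            double  dt  = now_s() - t0;

            if (got <= 0)                      /* end of device (or error) */
                break;
            n++;
            avg += (dt - avg) / n;             /* running mean */
            if (n > 16 && dt > 8.0 * avg)      /* crude "blip" detector */
                printf("blip at offset %lld: %.1f ms (mean %.1f ms)\n",
                       (long long)off, dt * 1e3, avg * 1e3);
            off += got;
        }
        close(fd);
    }

Deciding whether a flagged blip is "normal operation *or* a sign of an exceptional event" is then a filtering problem on those samples, which is the "fair bit of logic" referred to above.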