I currently "verify" my data archive with an algorithm similar to:
ONCE:    con 0;          # do {...} while (ONCE) executes the body exactly once
Success: con 0;
Removed, Inaccessible, NoRecord, BadSize, BadHash: con Success + 1 + iota;

reason: type int;        # failure codes above are plain ints here

# NotFound and BadRead are status codes returned by examine_file() and
# query_DBMS(); their definitions live elsewhere.

verify(files: list of string): list of (string, reason)
{
    failures: list of (string, reason);

    while (files != nil) {
        theFile := hd files;
        (size, signature, result) := examine_file(theFile);
        (Size, Signature, Result) := query_DBMS(theFile);
        do {
            if (NotFound == result)
                { Result = Removed; break; }
            if (BadRead == result)
                { Result = Inaccessible; break; }
            # theFile exists!
            if (NotFound == Result)
                { Result = NoRecord; break; }
            if (Size != size)
                { Result = BadSize; break; }
            if (Signature != signature)
                { Result = BadHash; break; }
        } while (ONCE);
        if (Success != Result)
            failures = (theFile, Result) :: failures;
        files = tl files;
    }
    return failures;
}
[apologies for any typos; too early in the day to be working!]
The problem with this is that examine_file() is expensive -- especially
for large numbers of large files!
I'd like to short-circuit this to provide an early indication of
possible problems. E.g., stat(2) each "theFile" to identify any
that are "Removed" without incurring the cost of examining the
file in its entirety.
[Of course, all files MUST be eventually examined as you can't
verify accessibility (BadRead), actual size (BadSize) or correct
signature (BadHash) without looking at every byte! But, you can
get the simple checks out of the way expeditiously to alert you
to "obvious" problems!]
I can get an approximate indication of "actual size" just by
stat(2)-ing the file, as well -- relying on examine_file() to
later confirm that figure.
The hash, of course, can't be guesstimated without trudging through
the file in its entirety.
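For concreteness, here is a minimal sketch of that pre-pass in the same
pseudocode style as the listing above. prescan() and query_DBMS_size()
are made-up names (the latter standing in for whatever DBMS lookup
returns the recorded size); it assumes the usual sys module setup and
only reports the problems that are visible from directory metadata:

prescan(files: list of string): list of (string, reason)
{
    suspects: list of (string, reason);

    for (; files != nil; files = tl files) {
        theFile := hd files;
        (ok, dir) := sys->stat(theFile);        # directory entry only; no data read
        if (ok < 0) {
            suspects = (theFile, Removed) :: suspects;
            continue;
        }
        (Size, found) := query_DBMS_size(theFile);  # hypothetical: recorded size (big)
        if (!found)
            suspects = (theFile, NoRecord) :: suspects;
        else if (dir.length != Size)
            suspects = (theFile, BadSize) :: suspects;
        # Inaccessible (BadRead) and BadHash still require the full
        # examine_file() pass over every byte.
    }
    return suspects;
}

Nothing here replaces the full verification; it just surfaces the
"obvious" problems before the expensive hashing starts.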
The problem with *this* approach is that I'd have to lock the entire
object to ensure that the "early results" agree with the "later
results" (the original approach only requires taking a lock on the
file being examined *while* it is being examined).
Are there any "tricks" I can use, here? E.g., the equivalent of
CoW would ensure the integrity of my data (for the *original*
instance) in the presence of such a global lock. But, I think only
the Bullet Server would give me any behavior approaching this...?
Also, how wary should one be of relying on the files' timestamps as a
(crude) indication of "alteration"? (It seems far too easy to
alter a timestamp without altering content/size, so I've been leery
of relying on that as having any "worth".)
Checking large numbers of files
Started by ●October 27, 2016
Reply by ●October 27, 2016
On 28/10/16 09:23, Don Y wrote:
> I currently "verify" my data archive with an algorithm similar to:
> ...
> Are there any "tricks" I can use, here?

Git is very fast at verifying files. You might find that "git annex" is
just right for maintaining your archives. It's built for just this, and more.
Reply by ●October 27, 2016
On 10/27/2016 3:48 PM, Clifford Heath wrote:
> On 28/10/16 09:23, Don Y wrote:
>> I currently "verify" my data archive with an algorithm similar to:
> ...
>> Are there any "tricks" I can use, here?
>
> Git is very fast at verifying files. You might find that
> "git annex" is just right for maintaining your archives.
> It's built for just this, and more.

There's more to it than just verifying a file hasn't changed (I can do
that by comparing its current hash -- pick your favorite algorithm -- to
a stored hash). There are innumerable tools for doing that, synchronizing
different copies on different volumes (even if a volume is offline), etc.

The point of my post was to avoid computing hashes of terabytes of files
when I can discover obvious changes by looking at file sizes, dates,
names, etc.

OTOH, as certain metadata can be too easily corrupted (e.g., dates),
incorporating them in a strategy can result in too many false positives:
"The timestamp on 'foo' has changed. I know you 'verified' the contents
of the entire file YESTERDAY, but the changed timestamp suggests you
might want to reverify it, today!"

[I also want the "process" to automatically pause and resume as I
asynchronously add and remove volumes -- possibly idling for weeks or
months before having access to a volume, again]
Reply by ●October 27, 2016
On 28.10.2016 г. 02:48, Don Y wrote:
> On 10/27/2016 3:48 PM, Clifford Heath wrote:
>> On 28/10/16 09:23, Don Y wrote:
>>> I currently "verify" my data archive with an algorithm similar to:
>> ...
>>> Are there any "tricks" I can use, here?
>>
>> Git is very fast at verifying files. You might find that
>> "git annex" is just right for maintaining your archives.
>> It's built for just this, and more.
>
> There's more to it than just verifying a file hasn't changed
> (I can do that by comparing its current hash -- pick your
> favorite algorithm -- to a stored hash). There are innumerable
> tools for doing that, synchronizing different copies on
> different volumes (even if a volume is offline), etc.
>
> The point of my post was to avoid computing hashes of terabytes
> of files when I can discover obvious changes by looking at file
> sizes, dates, names, etc.
>
> OTOH, as certain metadata can be too easily corrupted (e.g., dates),
> incorporating them in a strategy can result in too many false
> positives: "The timestamp on 'foo' has changed. I know you
> 'verified' the contents of the entire file YESTERDAY, but the
> changed timestamp suggests you might want to reverify it, today!"

Hi Don,

why can't you trust the timestamp? Obviously I can speak about dps
only, where a timestamp in a directory entry will only get modified
if the file has been modified itself, but I see no reason this would
differ on other systems so I am curious.

Apart from the obvious timestamp and file size you could perhaps
also check the file allocation data for changes/errors - and, as you
obviously know, from there on you will have to read the entire
file, what else can you do.

[under DPS I run a "repair" utility which goes through all directories
to rebuild the unit CAT (cluster allocation table), takes under
a minute for a 200G partition. I don't remember a file having been
corrupted somehow without some other accompanying mess on the disk
which repair would not catch. BTW it took me maybe 20 years to
catch the cause of the occasional space leak... (was negligible but
repair would catch it and fix it, IIRC the cause was some rounding
error the deallocate function was making)].

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI             http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/

> [I also want the "process" to automatically pause and resume as
> I asynchronously add and remove volumes -- possibly idling
> for weeks or months before having access to a volume, again]
Reply by ●October 27, 2016
On 28/10/16 10:48, Don Y wrote:
> On 10/27/2016 3:48 PM, Clifford Heath wrote:
>> On 28/10/16 09:23, Don Y wrote:
>>> I currently "verify" my data archive with an algorithm similar to:
>> ...
>>> Are there any "tricks" I can use, here?
>>
>> Git is very fast at verifying files. You might find that
>> "git annex" is just right for maintaining your archives.
>> It's built for just this, and more.
>
> There's more to it than just verifying a file hasn't changed
> (I can do that by comparing its current hash -- pick your
> favorite algorithm -- to a stored hash). There are innumerable
> tools for doing that, synchronizing different copies on
> different volumes (even if a volume is offline), etc.
>
> The point of my post was to avoid computing hashes of terabytes
> of files when I can discover obvious changes by looking at file
> sizes, dates, names, etc.
>
> OTOH, as certain metadata can be too easily corrupted (e.g., dates),
> incorporating them in a strategy can result in too many false
> positives: "The timestamp on 'foo' has changed. I know you
> 'verified' the contents of the entire file YESTERDAY, but the
> changed timestamp suggests you might want to reverify it, today!"
>
> [I also want the "process" to automatically pause and resume as
> I asynchronously add and remove volumes -- possibly idling
> for weeks or months before having access to a volume, again]

I assume that you thought it was quicker to respond like this than
it was to, for example, go and actually read about "git annex"?

Sorry, I responded to try to help, not enter a prolonged discussion.
I know there are many ways to address this problem. It seems you'd
rather talk about it than actually do that though. Not me.
Reply by ●October 27, 2016
Hi Dimiter,

[Still 35C! Cripes!!]

On 10/27/2016 5:17 PM, Dimiter_Popoff wrote:
>> OTOH, as certain metadata can be too easily corrupted (e.g., dates),
>> incorporating them in a strategy can result in too many false
>> positives: "The timestamp on 'foo' has changed. I know you
>> 'verified' the contents of the entire file YESTERDAY, but the
>> changed timestamp suggests you might want to reverify it, today!"
>
> why can't you trust the timestamp? Obviously I can speak about dps
> only where a timestamp in a directory entry will only get modified
> if the file has been modified itself but I see no reason this would
> differ on other systems so I am curious.

First, timestamps on "containers" (e.g., things like "folders")
don't mean anything. If /foo/A and /foo/B are "being tracked"
but other contents of /foo are not, then the timestamp of /foo
can change even though NEITHER A nor B have been altered!

Also, I access/mount the volumes from a variety of different hosts
and "maintain" the files with a variety of different utilities.
The clocks on all of these hosts aren't guaranteed to be synchronized.
So, you can't infer chronology from the timestamps.

And, the mechanisms that I use to move files around may not
preserve original timestamps, accurately.

E.g., I may move files onto a volume over FTP -- using a client
that doesn't propagate the timestamp of the *original* file.
So, that instance of the file now has a different timestamp
from the original FROM WHICH IT WAS COPIED. Yet, I *know*
they are identical, in contents.

> Apart from the obvious timestamp and file size you could perhaps
> also check the file allocation data for changes/errors - and, as you
> obviously know, from there on you will have to read the entire
> file, what else can you do.

Yes. I'm looking to rearrange the tests in my current algorithm to
give me earlier indications of problems. I.e., I would much rather
wait a few seconds to discover that a file NAME is not present where
it should be (at some random point in the file system hierarchy) than
to have to wait an hour to discover it AFTER the previous 600,000
files have had their *contents* verified!

[Imagine a scenario where a volume is failing; you'd much rather know
about a file that is NOT present on a "backup copy" than be reassured
that the contents of all the OTHER files are intact; you KNOW you
will lose data due to the missing file but MIGHT not lose anything
from those other files if you've not yet wasted time checking them!]

> [under DPS I run a "repair" utility which goes through all directories
> to rebuild the unit CAT (cluster allocation table), takes under
> a minute for a 200G partition. I don't remember a file having been
> corrupted somehow without some other accompanying mess on the disk
> which repair would not catch. BTW it took me maybe 20 years to
> catch the cause of the occasional space leak... (was negligible but
> repair would catch it and fix it, IIRC the cause was some rounding
> error the deallocate function was making)].

The volumes I'm addressing reside on a variety of different machines.
In the past, I would periodically (i.e., subject to the reliability of
my meatware memory) power up each RAID5 array and copy its contents to
/dev/null (i.e., force the array to read everything and SILENTLY repair
any errors it encountered). When I moved to RAID1, I'd repeat the
exercise comparing the contents of one volume to its mirror. Etc.

After that, I blurred the boundaries between volumes, file systems,
etc. I relied on my own discipline to know that *this* portion of
*this* file hierarchy is mirrored in some particular other place.
So, I would MANUALLY do the comparisons (rsync, etc.) "whenever I
remembered to do so".

Now, I am automating that so I don't have to remember what is where;
let the RDBMS keep track of all that stuff. Let *it* know if a file
has changed by using its data to verify each file's contents.

So, I can have:

    /VolumeA/.../expenses/2015/taxes/federal
    /VolumeB/.../2015/expenses/taxes/federal
    /VolumeC/.../2016/personal/federal/taxes

ALL be different instances of the same "object" (same size and hashes
but different names and "containers").

If a routine examination of /VolumeC/.../2016/personal/federal/taxes
notes that its hash/size has changed, then I can be alerted. I can be
*advised* that this MIGHT be the same object as the other two mentioned
above. OTOH, it may not (depends on the strength of the hash *and* my
intent for that file -- i.e., I may just be using it as a template to
create my *2016* tax records and the reason it is NOW different is
because I have explicitly changed it!) As a result, I have to intervene
and indicate IF and HOW a file's previous contents should be "recovered".

This scheme lets me just bring hosts on-line and mount volumes as I see
fit -- without having to take deliberate action to "inform something"
that I want the contents of these volumes checked (if they were checked
2 hours ago, why would I want to waste effort CASUALLY checking them
*again*??)

So, when I power up a workstation and it exports some particular
SMB/NFS share, *if* that share has content that is important to this
"archive system", then the archive system will surreptitiously sneak in
and check those contents, IF NECESSARY. And, if I happen to power down
that host before it has a chance to finish checking, it will remember
how much progress it made and where it should resume along with WHEN
such a check would be "overdue". In that case, it will email me a
notice telling me which volumes I need to mount (not *where* they need
to be mounted) so that it can verify the associated contents.

[When I was using tape for backups, it was a chore to remember which
media needed to be recycled, which needed to be retensioned, which
files needed to be backed-up, etc. This handles all of that for me
without forcing me to take deliberate steps (beyond powering up a
suitable host and placing the necessary media in a compatible
drive/transport)]

It also makes it easier for me to decide which media I can discard
and/or erase. E.g., I discarded two spindles of recorded optical media
last week and KNOW that I have sufficient copies of the files on all of
those media to not be at risk by that action:

    for file in list_of_files
        number = count_instances(file)
        if (number < N)
            echo "$file is precious!" > log

WITHOUT having to dig through stacks of media wondering what's on each
one!

[I have two PC's that I'll be scrapping this week. I'll let this system
catalog their disks. Then, have *it* tell me if there is anything that
I need to rescue before scrapping them]
Reply by ●October 27, 2016
On 10/27/2016 5:40 PM, Clifford Heath wrote:
> On 28/10/16 10:48, Don Y wrote:
>> On 10/27/2016 3:48 PM, Clifford Heath wrote:
>>> On 28/10/16 09:23, Don Y wrote:
>>>> I currently "verify" my data archive with an algorithm similar to:
>>> ...
>>>> Are there any "tricks" I can use, here?
>>>
>>> Git is very fast at verifying files. You might find that
>>> "git annex" is just right for maintaining your archives.
>>> It's built for just this, and more.
>>
>> There's more to it than just verifying a file hasn't changed
>> (I can do that by comparing its current hash -- pick your
>> favorite algorithm -- to a stored hash). There are innumerable
>> tools for doing that, synchronizing different copies on
>> different volumes (even if a volume is offline), etc.
>>
>> The point of my post was to avoid computing hashes of terabytes
>> of files when I can discover obvious changes by looking at file
>> sizes, dates, names, etc.
>>
>> OTOH, as certain metadata can be too easily corrupted (e.g., dates),
>> incorporating them in a strategy can result in too many false
>> positives: "The timestamp on 'foo' has changed. I know you
>> 'verified' the contents of the entire file YESTERDAY, but the
>> changed timestamp suggests you might want to reverify it, today!"
>>
>> [I also want the "process" to automatically pause and resume as
>> I asynchronously add and remove volumes -- possibly idling
>> for weeks or months before having access to a volume, again]
>
> I assume that you thought it was quicker to respond like this than
> it was to, for example, go and actually read about "git annex"?

Yeah, the web page is still OPEN in my browser. So, you know what
ASSuming does...

> Sorry, I responded to try to help, not enter a prolonged discussion.
> I know there are many ways to address this problem. It seems you'd
> rather talk about it than actually do that though. Not me.

You seem to ASSume other people don't do their homework before posting
questions. I've NOT gone into a description of the innumerable ways
git annex FAILS to address my problem. Just as your answer failed to
address the question I posed!
Reply by ●October 27, 2016
On 28.10.2016 г. 04:50, Don Y wrote:
> Hi Dimiter,
>
> [Still 35C! Cripes!!]

Hi Don,

we switched from walking around in shorts to winter uniform and
house heating almost a month ago...

> First, timestamps on "containers" (e.g., things like "folders")
> don't mean anything. If /foo/A and /foo/B are "being tracked"
> but other contents of /foo are not, then the timestamp of /foo
> can change even though NEITHER A nor B have been altered!

Well yes, the last modification date of a directory may be
impractical to be kept up to date. I made this optional (by changing
the directory file type) some 20+ years ago...

But the file "last modification date" is pretty stable, can't see
it change for no good reason. I often do copy with the option to
overwrite only files which are older than the source to be copied,
has not bitten me so far.

> Also, I access/mount the volumes from a variety of different hosts
> and "maintain" the files with a variety of different utilities.
> The clocks on all of these hosts aren't guaranteed to be synchronized.
> So, you can't infer chronology from the timestamps.

Hmmm, when copying a file its timestamp should normally be copied
as well so local clocks should not matter re file contents.

> And, the mechanisms that I use to move files around may not
> preserve original timestamps, accurately.
>
> E.g., I may move files onto a volume over FTP -- using a client
> that doesn't propagate the timestamp of the *original* file.

This is an issue as old as ftp is, yes... If you use GET with
ftp the file timestamp can come correctly across (MLST). Still
no standard way to propagate the correct timestamp when you "STOR".

> So, that instance of the file now has a different timestamp
> from the original FROM WHICH IT WAS COPIED. Yet, I *know*
> they are identical, in contents.

Well yes, using ftp "put" does that indeed - but I can't think
of another cause so it is sort of avoidable.

> [Imagine a scenario where a volume is failing; you'd much rather know
> about a file that is NOT present on a "backup copy" than be reassured
> that the contents of all the OTHER files are intact; you KNOW you
> will lose data due to the missing file but MIGHT not lose anything
> from those other files if you've not yet wasted time checking them!]

Well searching the whole tree for a particular name would be painful
but still faster than copying it all. I personally tend to know which
file belongs where so locating it is easy - but this is not much help
if you look into an archive you are not familiar with.

Dimiter
Reply by ●October 28, 2016
>> [Still 35C! Cripes!!]
>
> we switched from walking around in shorts to winter uniform and
> house heating almost a month ago...

We're still running the air conditioning. Unusual to be using it this
long. I think the *average* high temp for October, this year, was 33C.
I know we've had 20+ days above 90F (and usually WELL above) and it's
only the 27th...

>> First, timestamps on "containers" (e.g., things like "folders")
>> don't mean anything. If /foo/A and /foo/B are "being tracked"
>> but other contents of /foo are not, then the timestamp of /foo
>> can change even though NEITHER A nor B have been altered!
>
> Well yes, the last modification date of a directory may be
> impractical to be kept up to date. I made this optional (by changing
> the directory file type) some 20+ years ago...

And, even if ALL of the directory's contents are "being tracked",
you can do:

    mv * ../X
    mv ../X/* .

to effectively leave the directory "as it was" -- yet the timestamp is
modified.

> But the file "last modification date" is pretty stable, can't see
> it change for no good reason. I often do copy with the option to
> overwrite only files which are older than the source to be copied,
> has not bitten me so far.

Pull the disk out of your computer and put it in another computer
(which is what you effectively do when you access it over the wire *or*
physically move an external medium). Or, "touch *" -- none of the
files' contents/names have been altered but all of their timestamps
have. Which timestamp is now correct: the timestamp of the touched
file(s) or the "original" timestamps (that have been lost)?

Remember, any application that accesses the files can also dick with
the timestamps. *You* can vouch for all of YOUR applications (having
written them all). But, *I* am not keen on trying to quantify that
behavior for those, here.

>> Also, I access/mount the volumes from a variety of different hosts
>> and "maintain" the files with a variety of different utilities.
>> The clocks on all of these hosts aren't guaranteed to be synchronized.
>> So, you can't infer chronology from the timestamps.
>
> Hmmm, when copying a file its timestamp should normally be copied
> as well so local clocks should not matter re file contents.

Yes, if you are copying it on a *local* machine. But, what happens to
the timestamp of an object that you fetch from a web server? :>

>> And, the mechanisms that I use to move files around may not
>> preserve original timestamps, accurately.
>>
>> E.g., I may move files onto a volume over FTP -- using a client
>> that doesn't propagate the timestamp of the *original* file.
>
> This is an issue as old as ftp is, yes... If you use GET with
> ftp the file timestamp can come correctly across (MLST). Still
> no standard way to propagate the correct timestamp when you "STOR".

So, do I "certify" every web/ftp/scp/fxp/etc client? Or, manually
verify the timestamps are "approximately" correct (i.e., 201610272110
is PROBABLY wrong for a local file)?

One of the graphical file synchronization tools I use highlights
differences with color. It's not uncommon for me to see *mirrored*
filestores claiming to have multiple "discrepancies". On closer
examination, it's almost always a timestamp error.

FAT12 forces timestamps to a 2 second granularity (my PROM programmer
still uses floppies, as does at least one of my logic analyzers and a
DSO) -- something to remember when comparing files on floppies to the
original "archive images"!

>> So, that instance of the file now has a different timestamp
>> from the original FROM WHICH IT WAS COPIED. Yet, I *know*
>> they are identical, in contents.
>
> Well yes, using ftp "put" does that indeed - but I can't think
> of another cause so it is sort of avoidable.
>
>> [Imagine a scenario where a volume is failing; you'd much rather know
>> about a file that is NOT present on a "backup copy" than be reassured
>> that the contents of all the OTHER files are intact; you KNOW you
>> will lose data due to the missing file but MIGHT not lose anything
>> from those other files if you've not yet wasted time checking them!]
>
> Well searching the whole tree for a particular name would be painful
> but still faster than copying it all. I personally tend to know which
> file belongs where so locating it is easy - but this is not much help
> if you look into an archive you are not familiar with.

I have one half of the dataset readily available in the DBMS: I know
which files should exist at each level of the hierarchy, their sizes,
names, containers, and signatures -- BEFORE the drive is even on-line.

So, I can just walk through the file hierarchy on the medium and build
an unordered list of pathnames/filenames. Then, sort this and compare
it to the same sorted list from the DBMS. Superficial differences
(names, dates, sizes) will stand out (the DBMS query can find them all
for me faster than I could find them myself as it can exploit indexes
it has precomputed).

The real costs come in:
- verifying the *actual* size of the file (i.e., a disk error may
  effectively truncate it or otherwise render parts of it inaccessible)
- verifying the actual signature (hash) of the file

So, I should "do the easy checks" first (for ALL contents) instead of
processing files COMPLETELY, but one-at-a-time -- *if* I want to
expeditiously locate potential problems. (I just think timestamps would
result in me chasing problems that weren't actually problems)

Check your mail for more detail (look at Subject lines as I've been
changing email accounts, lately)
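A minimal sketch of that name-level comparison, in the same pseudocode
style as the earlier listings (the function and parameter names are
made up for illustration): merge two sorted pathname lists -- one built
by walking the medium, one pulled from the DBMS -- and report what is
missing from either side before any contents are read.

compare_names(ondisk, ondb: list of string): (list of string, list of string)
{
    missing, unexpected: list of string;    # in DBMS only / on medium only

    # both lists are assumed sorted ascending
    while (ondisk != nil && ondb != nil) {
        (d, r) := (hd ondisk, hd ondb);
        if (d == r) {
            ondisk = tl ondisk;             # name present on both sides
            ondb = tl ondb;
        } else if (d < r) {
            unexpected = d :: unexpected;   # on the medium, not in the DBMS
            ondisk = tl ondisk;
        } else {
            missing = r :: missing;         # in the DBMS, not on the medium
            ondb = tl ondb;
        }
    }
    for (; ondisk != nil; ondisk = tl ondisk)
        unexpected = hd ondisk :: unexpected;
    for (; ondb != nil; ondb = tl ondb)
        missing = hd ondb :: missing;

    return (missing, unexpected);
}

The per-file costs (actual size, hash) only need to be paid after a
pass like this has flagged the cheap discrepancies.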
Reply by ●October 28, 2016
For a very crude check you can verify that the file size hasn't changed,
in addition to the file time stamp. Then, you could possibly compute a
fast hash for each file and use it as the first line of verification
instead of using more expensive hash algorithms. If you just need to
check whether a file has been altered, a simple 64-bit CRC might be
usable. You should of course evaluate the implications of using hash
algorithms that can produce collisions and that are not considered safe
and robust.

Some file systems provide hooks that can monitor file alterations. By
creating an observer which monitors file changes, you will be notified
if any file is altered.

Br, Kalvin
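As a sketch of that "first line of verification" idea, in the same
pseudocode style as the earlier listings (this is not Kalvin's code:
cheap_sum() is a made-up name, and a plain 64-bit byte sum stands in
for a real CRC, so it catches gross corruption but is trivial to fool):

cheap_sum(path: string): (int, big)
{
    fd := sys->open(path, Sys->OREAD);
    if (fd == nil)
        return (-1, big 0);                 # can't open: BadRead territory
    sum := big 0;
    buf := array[Sys->ATOMICIO] of byte;
    for (;;) {
        n := sys->read(fd, buf, len buf);
        if (n < 0)
            return (-1, big 0);             # read error part-way through
        if (n == 0)
            break;                          # end of file
        for (i := 0; i < n; i++)
            sum += big buf[i];              # weak 64-bit running sum
    }
    return (0, sum);
}

Only files whose cheap sum (or size/timestamp) disagrees with the
stored record would then be re-checked with the expensive,
collision-resistant hash.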







