>>>> Wanna know the IT professionals' approach to all this? Install
>>>> a giant filestore that is up 24/7. Use a redundant technology
>>>> (RAID, ZFS, etc.) Copy EVERY medium onto it.
>>>
>>> OK so far - anything else is madness.
>>
>> No. It's not acceptable, to me, to leave everything online and accessible
>> given that it is rarely accessed. It invites failure -- even if that
>> failure is just human error (cuz it's so easy and "natural" to
>> access EVERYTHING)
>
> Harddisks fail on the shelf too. I haven't seen statistics, but I have
> seen it in practice. Starting up disks that have been off for a long
> time involves a certain risk, and clearly when the archive is offline
> there is no way to check its integrity.

Sure. And disks in offsite storage and fireproof safes aren't guaranteed
to spin up, either! Having a system that prompts you to let *it* examine
offline media (because it doesn't have hands and arms) gives you that
"check". Have you never had to discipline yourself to "retension" tapes,
periodically, to avoid print-through?

> Most human error is easily avoided by making things read-only normally,
> and only re-mounting read-write when you actually need to. (There are,
> of course, other forms of human error - it is bounded only by your
> imagination!)

If you need to update the medium, the write protect doesn't help. The
only protection is discipline and/or a transactional filesystem. [Or,
an approach similar to the bullet server]

>>>> Put everything
>>>> under a VCS.
>>>
>>> Everything that can sensibly go under a version control system (all
>>> development files, in particular), should be in a VCS system. That is
>>> not for redundancy, availability, etc. - it is so that you have control
>>> of your development process, you can see who did what, when, and why,
>>> and you can roll back changes as needed.
>>
>> That doesn't apply to things that don't change. What's the revision
>> history of this MP3 file? Or, this technical journal?
>> Should I
>> try to mimic the history of third party applications -- ensure I
>> keep copies of every release so I can move back and forth among them?
>
> The main thing is to keep /your/ data and files in the VCS. Fixed files
> and third-party files with no history (at least, none of relevance to
> you) don't have to go in a VCS. But even then, it is often convenient.
> We regularly have things like important datasheets or tools as part of
> a repository for projects. It makes the repositories bigger than
> necessary, but it means we can take a blank computer, do a check-out of
> the project, and have everything in place.

I have all of my tools (including OS install disks) and documentation
"archived". I don't put the system, itself, under VCS as I have no
need to track every change to a development system ("Gee, I changed
the wallpaper, I'd better check this in...")

I track "original (IP) objects" that *will* be changing so that I can
move back and forth through time to see what my schematic looked like
yesterday. Or, what the documentation looked like two weeks ago,
before we started mucking with it.

But, my 2015 taxes don't need to be under VCS; they are what they are
and will always be that way. If I revise them, then there will be new
forms that will be ADDED to the document set; the originals will
persist as they were.

Likewise, a photo that I took is "just a file". If I use that photo
in a publication and start massaging it, then *that* instance will be
tracked in the VCS -- so I can recreate the photo as it was before I
started "photoshopping" it.

Different files/data have different tracking, control and integrity
needs. That video of a kitten chasing its tail can disappear and I
won't even miss it!

>> Instead, I only track *my* development efforts and their history.
>>
>> Products that I inherit or "embrace"/adopt I leave in the VCS they were
>> developed under -- assuming that information is part of the file dump
>> made available by those projects.
>> (it's too painful to try to convert
>> from one to another)
>
> Sometimes that is the best, if there is useful history in the old VCS.
> Sometimes it is best to transfer the latest version into your main VCS,
> and leave the old one only for reference.

I only do this when I "adopt" something AS IF it was my own creation.
It is usually very important for me to be able to walk backwards
through time to see when and why a particular feature was added or an
implementation changed. If I switch VCS's, then there is a point in
time beyond which I can't move (at least, not easily). If I'm chasing
down something like that, I'm probably far more focused on my
detective work than the fact that I am rapidly approaching a point in
time where the current VCS mechanism will fall down.

>>>> Add scripts to automate scrubbing and the notifications
>>>> outlined above.
>>>
>>> Such scripts will be part of a normal RAID installation.
>>
>> So, your RAID array lets you tell it to scrub .JPG's at a
>> different rate than .BMP's? Or, different portions of
>> the file hierarchy at different rates?
>
> Why would I want to do that? Do you think the sector storing a jpg file
> is likely to decay at a different rate from a sector storing a bmp file?

No. But I probably won't want to waste time scrubbing files that are
less important to me! (replace .BMP with .C and reread the question).
Remember, I'm *not* keeping all of this stuff on-line. So, resources
directed to preserving (duplication, scrubbing, etc.) information that
is of lesser interest redirect those resources from things that are
MORE important. A RAID array treats everything with the same degree
of importance; it has no way of knowing that your financial records
are more important than your "fall foliage" photos.

> A sensible time-frame for scrubbing is anywhere between a week and 3
> months. How long the scrub will take depends on the size of the array
> and the priority you give it.
> But even with a very big array and very
> low priority, it will usually be done fine over a weekend.

In my case, I have to physically mount all the data. How do I "scrub"
a spindle of CD's? If they were "printed" media, I may just spread
that effort out over the course of a few years: any time the system
"sees" such a medium (cuz I am accessing it), it can undertake to
examine its contents. The scrubbing doesn't burden me with the task
of loading CD's into drives all day long.

[Why keep the CD's? Copy them onto magnetic disk and then discard
them. But, now you've discarded a BACKUP medium! Instead, you can
decide that mass printed CD's probably only need to be checked every
few years. And, can probably be skipped! Especially if you have
something that is automatically verifying them every time it
encounters one: "Ah, that CD was intact as of January 2016..."]

>>>> Install VCS clients on all hosts.
>>>
>>> For anything using the VCS, yes.
>>
>> And for those things that don't?
>
> Then don't. Was that a rhetorical question?

So, different solutions for different problems/data?

>>>> Ensure all
>>>> users are disciplined to check things in and out of the repository
>>>> as required and WHEN required.
>>>
>>> For anything using the VCS, yes.
>>>
>>> Again, that is independent of your archiving or storage system - it is
>>> basic good practice for development work.
>>
>> It's associated with "development work". Do you keep your photographs in
>> a VCS? Your music? Text books? etc.
>>
>> Note this paragraph was introduced with:
>> "Wanna know the IT professionals' approach to all this?"
>> They wouldn't consider whether they were storing photos and songs
>> OR "the company books". One size fits all. What other solution do
>> they have to offer? ("You're responsible for your own files"?)
>
> I use a straight file system, shared by samba, and with dated snapshots
> (also accessible as straight file systems) for files that are not in a
> VCS.

In my case, those simple filesystems can ALSO implement redundancy,
across (and within) volumes. That's the whole point of this! If I
want to keep 3 copies of my 2015 taxes in three different places on
*2* different spindles, I can access any of those instances AS IF they
were one and the same. And, this "system" ensures that they are not
corrupted; that they are, in fact, the same thing (even on different
volumes and having different names).

If I wanted to look at those tax forms *now*, I would know (from
memory) where ONE copy was located (or, what it was called, etc). A
query to the database tells me where all other instances of it are
located (it knows this because it has hinted to me that these
instances MIGHT be the same object -- size+hash -- and I confirmed
that; so, now it has an extra piece of information about them as a
set).

If I happen to be seated at a workstation that it indicates as having
an instance, I can access it directly. If an instance is not
available on that workstation but exists on another, I can power up
that workstation to access it. OTOH, if another instance exists on a
remote volume, then THAT would be the easier path (it's already
spinning, just access it over the wire!). *I* decide which instance
is best to pursue. If, for example, I planned on using that "other"
workstation later today, then I might just elect to leave myself a
note to fetch the taxes file at that time. It gives me the
flexibility of NOT having a unified (redundant) file store yet having
"NUMA" access to one.

>> The point of my discussion is to note how the solution is not
>> paired with the problem. Just like using high performance
>> drives that spin 24/7 -- and are accessed for minutes per month --
>> is a silly solution.
>
> Who said "high performance" drives?
> I would recommend "red" or "NAS"
> harddrives with high capacity per dollar ratings. IMHO, if someone
> thinks they need fast SAS harddrives, you have misunderstood the problem
> or the solution - they are better off with more cheap drives (for
> redundancy), SSD's (if speed really is that relevant), or more RAM (to
> /actually/ give you greater speed in most cases).

High performance drives are needed in some of my workstations. Why
install LOW performance drives?

You're not seeing the sum total of all these files as the "archive"
that I am maintaining. Isn't a VM essentially an archival element?
Likely created to support some image of a system (that may never have
existed)? Why shouldn't I be able to monitor and catalog (and
backup!) the files that are part of that system? Or, part of
C:\Playpen on *this* host? Why should I have to move those files onto
some other machine/appliance just to keep track of them?

>> But, hey, if you've got the money and the manpower to throw at it,
>> more power to you! I've seen people paid to dig holes and fill them
>> back in!
>
> The setup I would use would be about $200 for an HP Microserver, and
> another $300 for 4 large SATA disks, plus perhaps an hour's work for
> putting it together and installing the system. But then, I am not an IT
> consultant, and my IT management work is on the side of my main embedded
> development work.

I increase my archive with a ~$200 additional investment each year.
This year, that will add another 8-10TB of capacity (of which, I will
only access ~80%). So, if I ALLOCATE that to storing redundant copies
of things, it gives me an additional ~3-4TB this year. If I am
willing to gamble on the reliability of the drives I can store twice
that amount.

[E.g., I snapshot my production database each day. But, I sure don't
need to keep redundant copies of each of those snapshots! I alternate
between two volumes so a volume failure means I lose all the "odd"
(or "even") days.
<shrug> Do I want to spend twice as much money to be able to maintain
(at some degree of uncertainty) access to ALL snapshots? How often am
I going to reinstall an old snapshot?]

>>>> Limit which users can access each
>>>> object and perform actions on the repository. Put in place procedures
>>>> whereby staff can request additional capabilities for specific users.
>>>
>>> No, give everyone root access to each machine, and let them use whatever
>>> devices they want for whatever purpose. Tell them it's fine to lend
>>> company laptops to their kids.
>>>
>>> Of /course/ you need procedures in place for giving the right people
>>> access to the right systems! You are trying to run a business, not a
>>> stag party.
>>
>> An IT department would spend more energy trying to decide who should
>> NOT have access to some "department data" than they would trying to
>> facilitate the access of those who *should*.
>
> Your logic seems to be "An IT department would recommend X. All IT
> departments are incompetent bordering on evil. Therefore X is bad." Is
> that a fair summary?

No. I'm saying IT departments have preconceived notions of how things
should be done and who should be able to do them. They are notorious
for NOT understanding their customers but, rather, expecting their
customers (users) to adapt to their policies. And, because they have
the keys to the kingdom, there's usually no way to do otherwise!

> I am sure there are IT departments, and IT managers, who consider
> Dilbert to be their guide. And maybe you have been unlucky and seen the
> worst cases. But I think most do what they can to provide a useful
> service and let people access the data as they need.

My experience with IT departments in recent memory is that of folks
who haven't a clue as to the underlying technologies but try to use
blanket policies to address what (my devices) can and can't be allowed
to do within the confines of their infrastructure.
And, when I've worked around that -- e.g., by setting up an isolated
network -- they are incensed at the fact that there is "computer tech"
that they have no influence/control over.

[In these cases, it pays to have a VP as your client so he can take
the case to the top of the foodchain: "Well, IT claimed we couldn't
install these devices on *THEIR* network (taking great pains to stress
"their"). So, we created an isolated network to ensure these devices
wouldn't interfere with THEIR network. Maybe THEY can suggest an
alternate solution? But, these devices WILL be deployed, here..."]

>> I've seen companies where free standing computers were expensed just to
>> get around the rules of their own IT departments. SWMBO tracked
>> multimillion dollar construction projects in an MSAccess database
>> that she designed -- while the IT department and outside vendors
>> tried (and failed) to develop similar functionality over the course
>> of several years. When she retired, their ability to track those
>> projects disappeared.
>
> Using MSAccess is like holding your car together with chewing gum and
> duct tape. It may appear to work for a while, but it is not a solution.
> (That does not mean that the "professional" solution was any better -
> it could easily have been far worse in a different direction.)

When there is no other alternative, what do you do? Prior to her
assuming the job, they resorted to 4x6" index cards to track
construction expenses. (read that statement again -- and my
description of her job; then look at the date on your calendar!)

After she put her solution in place, the boss would ask her an
arbitrary question: "how much has CompanyX billed us for ProjectY and
what do we have left in the entire project's budget for decorating
expenses?" Within the hour, he'd have an answer with detailed
"backup" to take to HIS boss. Prior to her system, they couldn't tell
you what any project cost!
They just made wild estimates to get approvals and no one was ever
held accountable for the actual costs -- because no one knew what they
were! Once her system was in place, folks got so accustomed to having
the information available at their fingertips that all of the
supervisors left her database "open" on their workstations ALL DAY!

She delivered her schema, forms and reports to the New Commercial
Vendor hired to reimplement the accounting infrastructure. I.e.,
here's a working solution, albeit on an unfortunate platform. What
better "specification" can you create? THIS WORKS. DUPLICATE IT (on
your big, fancy CAPABLE system).

Nope. Instead, they eventually coerced the company to alter their
business practices to match their existing product offering. Despite
the fact that it was a horrible match! (the folks making the
decisions don't understand the issues!)

>>>> Budget a healthy amount to replace drives that are kept spinning
>>>> despite only being accessed "rarely". Increase budget for electricity
>>>> to keep them all spinning -- along with additional cooling, etc.
>>>
>>> A spinning disk takes a couple of watts. If the electricity for that
>>> shows up in your budget, you have /big/ problems.
>>
>> Now that would depend on the type of disk, interface and how much
>> time it spends "active".
>
> Then let's be generous and say 3 W, taking the SATA interface into
> account.
>
> Note that I am talking about /new/ big, normal-speed SATA drives - not
> ancient high-rev parallel SCSI devices.

My servers use SAS drives. 5T in one box and 4T in another. The Sun
boxen have no alternatives: FC-AL and SCA (port the SPARC OS's to
VM's? How much of my time for that??)

>> My FC-AL drives draw about 19W. The same is true for my SAS drives.
>> I can replace the FC drives -- if I want to discard my Sun workstations
>> and the tools that are only supported on them. I can similarly
>> replace the SAS drives if I want to dump the server that hosts my VM's.
>
> My recommendation is to dump the old hardware, unless you really need it
> for support of particular old projects.

I no longer need to support client projects, there. But, I'd also
have to abandon work I've been doing under that environment (some
tools just aren't portable to Wintel without sources)

>> But, that will mean finding another suitable box, installing the
>> software and porting all the VM's. I can buy a lot of electricity for
>> the price of several hours (days?) of my time!
>
> You don't need VMs for a file server or VCS server, nor do you need any
> software beyond the installation CD/USB for Debian, Red Hat, Ubuntu
> Server, FreeBSD, or whatever suits your preferences. And if you /do/
> want VMs, Linux containers (or FreeBSD jails) is often plenty - and
> available out of the box, for a cost of almost nothing in disk space and
> ram.

Context: moving my VM server to another box (just so I can BUY newer
disk drives) means moving all those VMs off the existing media where
they currently reside. And, to do that to save on electricity costs?
For a box that is only running when those VM's need to be accessed?

It was a huge investment to port all the guest OS's and applications
to that machine -- which I had to purchase just to be able to discard
some older machines (spend time and money to save space?). Now,
replace it AGAIN to save power? Why not wait until next year's models
are available -- or the year after -- so I don't have to KEEP
"upgrading" just to STAY STILL?

>>>> Create multiple backups if disinclined to resort to alternative
>>>> backup media. Put a copy off-site. Ensure adequate fire protection.
>>>
>>> Seems reasonable to me.
>>
>> Do *you* have a halon extinguisher in YOUR bedroom? Or, anywhere
>> in your house?
>
> I have off-site backup copies of the data.
> That protects the /data/
> from fire, which is the whole point.

So, I have to keep my systems on-line (potentially hacked) just to
keep a path to that remote site accessible (even if I keep a local
copy)?

If I have to contend with a fire, here, the data is probably the least
of my concerns -- *if* it is lost in the incident. I know a lot of
people. I've known of exactly one who experienced a fire in their
home: because the fire department FAILED to completely extinguish a
fire that started in her vehicle. When it reignited, the house was a
total loss. And, the *city* paid for its replacement!

Risk always has to be evaluated in terms of likelihoods and "expected
values" (costs). My biggest "risk" is *me* accidentally deleting
something. Or, dropping something (media). Or, failing to follow
some particular procedure to ensure data integrity. That's why I
design solutions that address THOSE risks and not the less likely ones
(a client could go bankrupt; getting my "next payment" might require
hiring an attorney! So, arrange for the payments to be small enough
and frequent enough that I can eat the loss instead of having to throw
more money after it to guard against that "risk")

"Yes, it's a NICE project! Potentially very profitable. But, I'd be
assuming most of the risk if I agreed to your terms so I'll just 'no
bid'. Sorry. (I really am!)"

> And I don't keep my server in my bedroom - it lives in the cellar.

We don't have cellars. Or attics. Anything stored or operated in the
garage will lead to rapid failure (at 110-140F "cold aisle" for much
of the year). And, as this isn't a "man cave", any kit CAN'T be
sitting out where it is an eyesore (women tend to be "funny" about
this stuff!)

>> And the reason IT departments should have effective backup procedures.
>> Yet, I can bump into any number of folks in different businesses
>> and industries and hear horror stories re: their IT departments and
>> "simple" things like this!
>
> You don't need to tell me about what other people get wrong - you just
> need to get it right yourself.

I need to be appreciative of the solutions that others come up with;
especially those who are SUPPOSED to be fluent in this. OTOH, I need
to be able to recognize how and why their solutions might not be
applicable to my circumstances.

We were chatting with some friends over the holidays a few years back.
And, lamenting how hard it is to find "qualified help" (handymen) --
everyone CLAIMS they have certain abilities until you actually SEE
examples of those! <shudder> Friends commented that they use
a-well-known-national-chain for their needs; local firms contracted by
said national-chain. We were openly doubtful of their recommendation
but made note of it. As we delved into their past experiences with
these locally referred contractors, they rambled on and on about all
the problems they had had over the years! "And, you're RECOMMENDING
these folks to us???"

Understand the capabilities and limitations and assumptions of the
folks whose ideas you're potentially embracing.

>>>> SWMBO would come home with similar horror stories from her
>>>> colleagues almost weekly: "They -- IT -- have got a frigging ARMY
>>>> wandering around the place. What the hell do they do if they can't
>>>> even address this BASIC need?" She quickly learned to keep her own
>>>> backups archived locally to safeguard against their folly!
>>>
>>> I can't quite figure out what you are running here. Is it a home, or a
>>> company, or an antique computer museum and graveyard? Is your wife an
>>> employee? If not, why is she relevant to your company IT setup?
>>
>> I am a sole proprietor. I am *my* IT department.
>
> I thought IT departments were evil and incompetent?
>
> I still don't know if you have a company or a computer museum at home.

I have three "museum pieces": an ASR-33, a Compaq Portable 386 (for
its ISA bus) and a Sun Voyager.
It's debatable whether the U60 or SB2000 are technically museum pieces
as they are probably the "most recent" bits of hardware that will
still run Solaris/SPARC. You can likewise argue that the Unisite is
an antique -- though it is still supported even though folks now use
FLASH, instead. Of course, you can also take the "18 month update
cycle" view and consider everything older than 18 months as antique or
obsolescent. (and any car older than ~7 years)

>> I'm looking for a cheap way of getting what I need without having
>> to change MY development/business style to coincide with one IMPOSED
>> by some other "solution".
>>
>> Is there some TECHNICAL reason that my approach won't work?
>> Or, does it just offend your sensibilities?
>
> I didn't realise you /had/ an approach to the problem. I thought you
> were just asking here for the impossible, then complaining about what
> "IT professionals" would do instead, and finding arguments against any
> useful suggestions you got.

I have a system that behaves as I've described. My initial question
was VERY SPECIFIC and didn't REQUIRE any of this extra detail to be
provided. But, the inevitable "why do you want to do that" always
creeps up -- and I am too respectful to IGNORE such questions as
"unimportant" in the context of the original question.

I presented pseudocode for my current "file verification" algorithm.
Then, explained its drawbacks (i.e., you can't quickly answer the
question "Has ANYTHING changed, here?" even when it is fairly obvious
that SOME things have -- because those filenames are missing or
because their sizes have been altered). I *clearly* understand that
you can't come up with an exhaustive list of ALL changes until you
have checked all of these things AND verified all file hashes -- as is
evident in the code I presented.

The LIKELY suggestions I anticipated were:
- reorder the tests to expose the "something HAS failed" condition
  when it is obvious from metadata (what is (1+2/3*7%8^9)*0?)
- replace the hash function with one that operates at a higher rate
  (though the media access rate is the effective bottleneck)
- use the timestamp to indicate when a change has occurred

I believe I have addressed each of these in earlier replies without
"needing" to understand HOW I am applying this algorithm. No one has
suggested anything different.

[Why not cast all the data onto nickel-plated disks and store the data
that way? Then, never have to worry about changes! I.e., valid
answer to a DIFFERENT question; as is "Use RAID"]

> But if you have figured out what you need to do for your system, that's
> great.
>
>>
>>> setup. Scrap the lot of it is the only professional answer. So that
>>> leaves you as an amateur to play with your own systems in whatever way
>>> you want, and to shout and rant at anyone who is willing to try and
>>> offer useful suggestions.
>>
>> As expected, instead of answering the original question I ASKED, this
>> discussion has digressed into the realm of "why do you want to do that".
>
> Well, yes - that is how people usually help others in a newsgroup. We
> give suggestions about alternative approaches. This is particularly the
> case when the OP appears to be asking for the impossible ("How do I
> check the integrity of an entire file without accessing the entire
> file?").

No, the fault is yours for misreading the question:

  The problem this [sample implementation] has is examine-file() is
  expensive -- for large numbers of large files! I'd like to
  short-circuit this to provide an EARLY INDICATION OF POSSIBLE
  PROBLEMS. E.g., stat(2) each "theFile" to identify any that are
  "Removed" without incurring the cost of examining the file in its
  entirety.

I frankly can't understand how this is NOT incredibly specific and
obvious.
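To make the short-circuit concrete: here is a minimal sketch (Python, with
invented names -- "manifest", quick_scan() and full_verify() are my own
stand-ins, not the pseudocode actually posted) of the two-phase check being
asked about. Phase 1 is the cheap stat(2) pass that gives the EARLY
INDICATION; phase 2 is the expensive exhaustive pass that nobody disputes is
still required for a full answer:

```python
import hashlib
import os

def quick_scan(manifest):
    """Phase 1: metadata only.  stat() each cataloged file and report
    anything missing or resized -- an early indication of trouble,
    without reading a single byte of file content."""
    suspects = []
    for path, (size, _digest) in manifest.items():
        try:
            st = os.stat(path)
        except FileNotFoundError:
            suspects.append((path, "removed"))
            continue
        if st.st_size != size:
            suspects.append((path, "size changed"))
    return suspects

def full_verify(manifest):
    """Phase 2: the expensive pass.  Hash every file to catch silent
    corruption that metadata alone cannot reveal."""
    bad = []
    for path, (size, digest) in manifest.items():
        try:
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            if h.hexdigest() != digest:
                bad.append((path, "hash mismatch"))
        except FileNotFoundError:
            bad.append((path, "removed"))
    return bad
```

quick_scan() runs at directory-traversal speed, so "Removed" and "Resized"
files surface in seconds even when full_verify() over the same media would
take hours. It proves nothing about files it does NOT flag -- that is the
whole point of the original question.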
It is clear IN MY INITIAL POST that I understood you can't say "ALL
FILES ARE INTACT" until you have examined all aspects of all files
(given that you don't have a golden master against which to compare
byte-for-byte). So, any reply that says "you can't know that
everything is intact without examining everything" is a nonsense
reply; I've already said that in explaining what I am, instead,
seeking.

> Whenever someone has specific ideas that seem hard, the best
> approach is to take a step back and try to look at the whole picture and
> see things from a different angle.

No, that's a way of saying either of the following:
- Silly boy, you clearly don't know what you're talking about!
- Hmmm, I have such an incredibly limited imagination that I can't
  possibly see how your question could be pertinent!

> This is an open newsgroup. I am not a professional consultant that you
> have hired to consider your setup in detail and attempt to come up with
> a new archiving method to suit your needs. I am someone who knows more
> than most in this group about storage systems, and does more IT
> administration than most in this group, and I am sharing some ideas from
> that. Maybe they help you, maybe not - and maybe the discussions that
> people have here are of interest to others and not you.
>
>> And, because my criteria differ from those others can wrap their heads
>> around, the fault is obviously mine.
>
> There are a number of issues here. Your descriptions of your criteria
> have changed under way, giving more information that was relevant from
> the start.

How has my PROBLEM STATEMENT changed? If I publish all of my source
code for you to examine in excruciating detail, would it help you
answer the question I posed? Instead, you redirect your questions and
comments to issues that I've not put on the table as "negotiable".
Would you like, also, to advise me on my choice of implementation
languages? Or, the time of day that I devote to work?
Or, the foodstuffs that I consume? I'm sure someone could rationalize
"helping" with each of these issues. But, I didn't ask about ANY of
them. And, was respectful and considerate in offering ever more
detailed explanations in response to questions and challenges --
instead of simply dismissing them as "not pertinent" to the EXPLICIT
question I posed.

Seriously. (seriously!) How do you deal with specifications from
clients/employers? Do you make your boss justify every requirement
in those documents? Do you insist that he share the company's
business strategy with you in detail -- so you can advise them of
"mistakes" that you THINK they might be making? Or, about to make if
you blindly did what you were told? Do you insist on interviewing
potential customers and doing your own assessment of THEIR needs?
Maybe you'll come up with some clever new feature or design approach.
Or, maybe you'll just be wasting other folks' time and patience!

[Rethink that paragraph and run through your past interactions with
your employers/clients. Seriously!]

I assume my clients/employers know their businesses better than me.
And, I realize they owe me nothing beyond the agreed upon terms IF I
chose to take on the job. If their requirements seem crazy to me or
unrealistic, I politely beg off. I have no desire to figure out what
they REALLY need if it's not readily discernible from what they've
exposed to me.

When my MD recommends a course of treatment for me, I ask "why" --
because I want to know what "problem" he's trying to solve. I don't
expect an education on anatomy, physiology or disease processes. I
don't expect him to educate me on all of the potential options
available and have him justify why he has selected a particular one.
Similarly, if he asks me a specific question, I don't demand an
explanation of his motivation for asking. If he inquires as to my
"sex life", I assume he NEEDS to know that information.
If I didn't think him competent and responsible, I wouldn't work with
him!

> Your writing about it is very verbose and unstructured - it
> is extremely difficult to get a good understanding of what you have and
> what you need.

How would you abbreviate my description -- to someone who hasn't (and
will never!) read the above? Remember, you have to be TERSE and
STRUCTURED in your description. Sort of like writing a bit of
pseudocode and then commenting on its shortcomings! (gee, what a
clever idea!)

My writings are verbose and unstructured because I am trying to
answer questions that do little to clarify the ORIGINAL QUESTION.
You might be curious as to what I'm doing. I might be curious as to
what a CLIENT is doing. Or, you may not be able to wrap your head
around it (as I might not be able to wrap my head around a client's
requirements).

I work on a lot of interesting projects and take unique approaches to
many problems. I'm tired of having to waste time DEFENDING these
things when it does little to get ME to the information/advice I
seek. "Mary has 3 oranges..." is a perfect summary of the behavior
of these respondents. Colleagues who watch these exchanges ask me why
I even bother asking: "They won't be able to add anything to your own
solution so why bother?"

> You have a very strong idea that you are different from
> everyone else - that your needs are different, and solutions are
> different - and you prefer to stick to that uniqueness concept instead
> of listening to how other people handle the storage and considering how
> you could use known working solutions.
>
> (I realise I am not being entirely complimentary to you here. I have
> great respect for your knowledge and experience - but I think it is best
> to be honest about how you appear in this thread.)

Sorry, I'm just tired of this nonsense. I have been steadily
redirecting my conversations to other venues where this sort of "Why
oranges? Why not apples?" isn't such a pervasive aspect of replies.
From the decreasing volume of traffic here, it seems like others have already come to that conclusion!

So, a promise: I'll post less and NOT respond to requests for clarifications that I think unnecessary. I'm not trying to be rude. But, I'm also not getting anything by trying to be cooperative or instructive. If you don't know that Mary is left with 2 oranges, then I don't plan on explaining it! If you are concerned with her possible affections towards Bobby, ask HER!

>> So, I should just adopt the discipline of not straying from a strict
>> focus on my original question(s). When asked "why", I should NOT be
>> considerate and offer an explanation; instead, "just because" should
>> suffice. If you can't answer with that sort of rationale, then
>> you won't be able to answer with a *detailed* one.
>>
>> "Mary has 3 oranges. She gives one to Johnny. How many oranges does
>> Mary have?"
>>
>> "Why oranges? Why not apples? What kind of oranges? Are they juice
>> oranges or eating oranges? Are they ripe, yet? Why did she give one
>> to Johnny? Why *just* one? Why not two? Or, all three? Is she sweet
>> on Johnny? What about Bobby? ..."
>>
>> All the while missing the obvious answer: two.
>
> If there was an obvious "answer" here, you would have found it yourself
> and not started the thread in the first place.

I'll keep that in mind before I ask other questions. It seems that that is a recurrent pattern: I've already DONE all the analysis and little can be gained from posting questions -- so don't!

>> (sigh) Makes me wonder if any of you can work from a specification
>> or if you need your hands held all the way from inception to deliverables!
>
> I obviously can't answer for others, but /I/ can work from a
> specification.
> But this is not development from specification - this is
> a brainstorming meeting from before the start of the project.

No, it's a project that is working but has found an opportunity for enhancing the implementation to address other questions that could be asked of the tracked objects.

Take your RAID solution and ask the question: How can I get notification that SOMETHING *has* changed in this array before going through the entire array and checking it exhaustively?

"Gee, why do you want to know that?"
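[One cheap first-pass answer to that "has ANYTHING changed?" question can be sketched in a few lines. This is an illustration, not anyone's actual setup: hash the tree's metadata (relative paths, sizes, mtimes) into a single digest and compare it with the digest stored after the last exhaustive check. A match means no metadata-visible change, so the full content scan can be deferred. Note it will NOT catch silent bit-rot, which never touches metadata -- that still needs a content-level scrub.]

```python
import hashlib
import os

def tree_fingerprint(root):
    """Single digest over every path, size and mtime under root.

    Any added, removed, renamed or rewritten file changes the digest,
    so comparing one stored hex string answers "has SOMETHING changed?"
    without reading any file contents.
    """
    h = hashlib.sha256()
    # sorted() makes the walk order deterministic across runs.
    for dirpath, _, filenames in sorted(os.walk(root)):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            rel = os.path.relpath(path, root)
            h.update(f"{rel}\0{st.st_size}\0{st.st_mtime_ns}\n".encode())
    return h.hexdigest()
```

Usage: store `tree_fingerprint("/archive")` alongside the array; on the next visit, recompute and compare before deciding whether an exhaustive check is warranted.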
Checking large numbers of files
Started by ●October 27, 2016
Reply by ●November 1, 2016
Reply by ●November 1, 2016
On 01/11/16 12:36, Don Y wrote:

>>>>> Wanna know the IT professionals approach to all this? Install
>>>>> a giant filestore that is up 24/7. Use a redundant technology
>>>>> (RAID, ZFS, etc.) Copy EVERY medium onto it.
>>>>
>>>> OK so far - anything else is madness.
>>>
>>> No. It's not acceptable, to me, to leave everything online and
>>> accessible given that it is rarely accessed. It invites failure --
>>> even if that failure is just human error (cuz its so easy and
>>> "natural" to access EVERYTHING)
>>
>> Harddisks fail on the shelf too. I haven't seen statistics, but I have
>> seen it in practice. Starting up disks that have been off for a long
>> time involves a certain risk, and clearly when the archive is offline
>> there is no way to check its integrity.
>
> Sure. And disks in offsite storage and fireproof safes aren't guaranteed
> to spin up, either!
>
> Having a system that prompts you to let *it* examine offline media
> (because it doesn't have hands and arms) gives you that "check".
> Have you never had to discipline yourself to "retension" tapes,
> periodically, to avoid print-through?

I have used tapes - and I long ago realised that the way to have reliable backups is to have them handled automatically. If you have humans in the loop, you will /always/ have cases where copies are skipped, re-tensioning is delayed, you have run out of blank DVDs, you don't have time to swap the tapes, etc., etc. Disk space is cheap. Put your copies online. Keep your backup server running (with its disks idle, if you want).

>> Most human error is easily avoided by making things read-only normally,
>> and only re-mounting read-write when you actually need to. (There are,
>> of course, other forms of human error - it is bounded only by your
>> imagination!)
>
> If you need to update the medium, the write protect doesn't help.
> The only protection is discipline and/or a transactional filesystem.
> [Or, an approach similar to the bullet server]

Your snapshots should be write-protected.
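[The read-only-by-default discipline being argued here can be sketched at the file level. This is only an illustration of the idea -- a real setup would more likely use read-only filesystem snapshots or a read-only remount -- and the directory layout is hypothetical. The point: the archive is writable only inside an explicit, bracketed update, which answers the "write protect doesn't help when you need to update" objection.]

```python
import os
import stat
from contextlib import contextmanager

def _walk_files(root):
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)

def protect(root):
    """Strip write permission from every file in the archive."""
    for path in _walk_files(root):
        os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

@contextmanager
def updating(root):
    """Temporarily re-enable writes, then re-protect on the way out,
    so casual "human error" writes are blocked the rest of the time."""
    for path in _walk_files(root):
        os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    try:
        yield
    finally:
        protect(root)
```

Usage: `with updating("/archive"): ...` around the deliberate update; everything outside that block sees a read-only tree.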
And it is not particularly hard to have your archives mostly read-only, and only enable write access on those occasions when you want to update them (if that's the way you want to go).

>>>>> Add scripts to automate scrubbing and the notifications
>>>>> outlined above.
>>>>
>>>> Such scripts will be part of a normal RAID installation.
>>>
>>> So, your RAID array lets you tell it to scrub .JPG's at a
>>> different rate than .BMP's? Or, different portions of
>>> the file hierarchy at different rates?
>>
>> Why would I want to do that? Do you think the sector storing a jpg file
>> is likely to decay at a different rate from a sector storing a bmp file?
>
> No. But I probably won't want to waste time scrubbing files that are
> less important to me! (replace .BMP with .C and reread the question).

The answer is the same.

> Remember, I'm *not* keeping all of this stuff on-line. So, resources
> directed to preserving (duplication, scrubbing, etc.) information that
> is of lesser interest redirects those resources from things that are
> MORE important.

I believe you have started this with a fixed idea about off-line archiving that is not helpful - it is severely restricting your ability to make a simple, clean solution, and forcing you to introduce separations and manual work.

What you should realise is that it does not matter if you treat your less important files in the same way as the important ones. It does not matter if you treat your rarely accessed files in the same way as you treat your high-use files.

If you have a "working set" of 2 TB of project data from the last few years, you want that on-line in an accessible, reliable manner with frequent backups, RAID, etc. You may also have 1 TB of project data from older projects that you will rarely use. And 100 GB of even older data that you expect never to read again. And 100 GB of data that you really don't care about, but can't be bothered sorting out or throwing away.
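[Don's wish to scrub different file classes at different rates is easy to express at the file level, whatever one thinks of its necessity. A minimal sketch, with a hypothetical layout: a RAID or ZFS scrub works below the filesystem and cannot filter by extension -- which is the point of the answer above -- but a user-level checksum manifest can.]

```python
import hashlib
import json
import os

def _sha256(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root, manifest_path):
    """Record a content checksum for every file under root."""
    sums = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            sums[os.path.relpath(path, root)] = _sha256(path)
    with open(manifest_path, "w") as f:
        json.dump(sums, f)

def scrub(root, manifest_path, suffixes=None):
    """Re-read files and report those whose checksum no longer matches.

    If suffixes is given (e.g. {'.c'}), only those classes are checked,
    so "important" files can be scrubbed more often than the rest.
    """
    with open(manifest_path) as f:
        sums = json.load(f)
    bad = []
    for rel, digest in sums.items():
        if suffixes and os.path.splitext(rel)[1].lower() not in suffixes:
            continue
        if _sha256(os.path.join(root, rel)) != digest:
            bad.append(rel)
    return bad
```

Run the filtered scrub weekly and the full scrub quarterly (or, per the advice above, just scrub everything and stop worrying -- the computer doesn't mind).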
What do you lose by putting it /all/ in a single RAID system (plus backups)? It means you need 3 TB worth of disk space rather than 2 TB worth. So /what/?

What do you gain by having complicated layered systems with some parts backed up on external hard disks, others on tape, some using optical storage, some on this RAID array from the last century, some on your new box, some with automatic duplication, some with manual duplication, some integrity checking done online, some integrity checking done with manual scripts run once a month? Your only benefit is job security - you can never be fired, because no one would have a clue how it all works if you disappeared. The pennies you save on disk space or electricity bills are multiplied a thousand-fold in effort and extra complications.

I have files from projects I did 20 years ago, in folders on the disk in my current computer. For most of those projects, the last time I looked at the files was 20 years ago - yet I am quite happy having them on the same disk and the same filesystem as projects I am doing at the moment.

> A RAID array treats everything with the same degree of importance;
> it has no way of knowing that your financial records are more
> important than your "fall foliage" photos.

Exactly. And why care? Treat it all as important, and stop worrying.

>> A sensible time-frame for scrubbing is anywhere between a week and 3
>> months. How long the scrub will take depends on the size of the array
>> and the priority you give it. But even with a very big array and very
>> low priority, it will usually be done fine over a weekend.
>
> In my case, I have to physically mount all the data. How do I
> "scrub" a spindle of CD's? If they were "printed" media, I
> may just spread that effort out over the course of a few years:
> any time the system "sees" such a medium (cuz I am accessing it),
> it can undertake to examine its contents.
> The scrubbing doesn't
> burden me with the task of loading CD's into drives all day long.

You don't need to scrub your CD's if you don't use CD's to hold your data!

> [Why keep the CD's? Copy them onto magnetic disk and then discard
> them. But, now you've discarded a BACKUP medium! Instead, you can
> decide that mass printed CD's probably only need to be checked
> every few years. And, can probably be skipped! Especially if you
> have something that is automatically verifying them every time it
> encounters one: "Ah, that CD was intact as of January 2016..."]
>
>>>>> Install VCS clients on all hosts.
>>>>
>>>> For anything using the VCS, yes.
>>>
>>> And for those things that don't?
>>
>> Then don't. Was that a rhetorical question?
>
> So, different solutions for different problems/data?

Yes, when it makes sense because it is better for the /user/. You don't need to go out of your way to make things "easier" for a computer or your equipment - your computer is quite happy mindlessly scrubbing the same RAID data every week without complaining that it is not really necessary for these particular files. But /users/ should have the access that works best for them.
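[The "Ah, that CD was intact as of January 2016..." bookkeeping quoted above can be sketched as a small ledger: record when each offline medium was last seen intact (e.g. whenever it happens to be mounted, so verification rides along with normal use), and let the system nag for the ones that are overdue. File name and medium identifiers here are hypothetical.]

```python
import datetime
import json
import os

def _load(path):
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def record_check(medium_id, ok, path, when=None):
    """Note that a medium was examined and whether it verified intact."""
    ledger = _load(path)
    when = when or datetime.date.today().isoformat()
    ledger[medium_id] = {"last_checked": when, "ok": ok}
    with open(path, "w") as f:
        json.dump(ledger, f)

def overdue(max_age_days, path, today=None):
    """Media the system should prompt a human to mount and verify:
    anything unchecked for too long, or that last failed its check."""
    today = today or datetime.date.today()
    out = []
    for medium, info in _load(path).items():
        last = datetime.date.fromisoformat(info["last_checked"])
        if (today - last).days > max_age_days or not info["ok"]:
            out.append(medium)
    return sorted(out)
```

This is the "system that prompts you to let *it* examine offline media" reduced to its essentials: the machine tracks staleness, the human supplies the hands and arms.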
Reply by ●November 1, 2016
On 01/11/16 12:36, Don Y wrote:

Don, I am sorry I can't make any more useful responses here. Trying to read and analyse all you write is simply too much - it would be too time consuming to put together all the information you have written in the various posts and combine it into a structured form, and then to try to figure out what you are /actually/ asking for.

If anything I have written in this thread has been helpful to you, then great. If it has not, well, feel free to forget it. And if you have specific questions about the kind of technological solutions I recommend, then ask away.
Reply by ●November 2, 2016
On 11/1/2016 7:16 AM, David Brown wrote:

> On 01/11/16 12:36, Don Y wrote:
>
> Don, I am sorry I can't make any more useful responses here. Trying to
> read and analyse all you write is simply too much - it would be too time
> consuming to put together all the information you have written in the
> various posts and combine it into a structured form, and then to try to
> figure out what you are /actually/ asking for.

The fault is mine: I should not have answered the "what are you trying to do" query. Instead, I should have responded with "please reread my initial post. It contains everything you need to solve the problem I am posing."

> If anything I have written in this thread has been helpful to you, then
> great. If it has not, well, feel free to forget it. And if you have
> specific questions about the kind of technological solutions I
> recommend, then ask away.

I'm not looking for another way of solving my "application". My solution perfectly fits my needs and constraints. It lets me adopt the storage, backup, scrubbing, redundancy, etc. policies that I want *as* I want them. My initial implementation simply suffered from yielding certain classes of results after lengthy delays.

How can I speed up this calculation:
    x = (1+2*3/4%5^6)*0
Ans: rewrite as:
    x = 0*(1+2*3/4%5^6)







