
Disk imaging strategy

Started by Don Y November 2, 2014
On 02/11/14 23:30, Don Y wrote:
> Hi David,
>
> On 11/2/2014 3:16 PM, David Brown wrote:
>> On 02/11/14 18:42, Don Y wrote:
>>> On 11/2/2014 10:09 AM, glen herrmannsfeldt wrote:
>>>
>>>>> A naive approach to this would be to plumb dd to a compressor
>>>>> -- running both OUTSIDE the native OS.  But, for large/dirty
>>>>> volumes, this gives you an unacceptably large resulting image
>>>>> -- because you end up having to store "discarded data" which
>>>>> could potentially be HUGE (consider a large volume that has
>>>>> seen lots of write/delete cycles) esp in comparison with the
>>>>> actual precious data!
>>>>
>>>> For older disks, that are usually relatively small, that is
>>>> probably the best choice.
>>>
>>> The problem comes with newer disks.  E.g., I keep ~1T on each
>>> workstation and *only* drag "current projects" onto them counting
>>> on the file servers to maintain most of my stuff "semi-offline".
>>>
>>> So, you can easily have just 5 or 10% of the disk "in use" but
>>> 90% of it "dirty".  Dirt doesn't compress nearly as well as
>>> "virgin media"  :>
>>
>> How about replacing the 1 TB harddisk with an 80 GB SSD?  Then you
>> enforce the rule that only a small amount is on the disk locally,
>> you can use a simple dd with compression, and everything works
>> faster.
>
> How does that give me 1T of storage?
It doesn't give you 1 TB on each machine - that's the point. Keep your main data safe in some big system (with raid, backups, etc., in whatever way you see fit) and just have the software and /necessary/ working sets on the local machines. So instead of having 5-10% of 1 TB "in use" and the rest "dirty" or "semi-offline", you have 90% of 100 GB in use. Such a system won't suit everyone, but when it works, it simplifies backup and machine independence significantly. It also makes it a lot easier to track versions and "current" data, instead of having local copies and server copies that are a bit different - you have /only/ server copies and tracked backups of them.
Hi David,

On 11/2/2014 5:08 PM, David Brown wrote:
> On 02/11/14 23:30, Don Y wrote:
>> On 11/2/2014 3:16 PM, David Brown wrote:
>>> [snip]
>>>
>>> How about replacing the 1 TB harddisk with an 80 GB SSD?  Then you
>>> enforce the rule that only a small amount is on the disk locally,
>>> you can use a simple dd with compression, and everything works
>>> faster.
>>
>> How does that give me 1T of storage?
>
> It doesn't give you 1 TB on each machine - that's the point.  Keep your
> main data safe in some big system (with raid, backups, etc., in
> whatever way you see fit) and just have the software and /necessary/
> working sets on the local machines.  So instead of having 5-10% of 1 TB
> "in use" and the rest "dirty" or "semi-offline", you have 90% of 100 GB
> in use.
That's exactly what *I* do:

The problem comes with newer disks.  E.g., I keep ~1T on each
workstation and *only* drag "current projects" onto them counting on
the file servers to maintain most of my stuff "semi-offline".

Executables (and their documentation, support, etc.) are typically in
the ~100G ballpark.  The balance of the 1T is for whatever documents
and "originals", libraries, etc. that I happen to be working on at the
time.

[dynamically loading executables from an off-line store just doesn't
work on many machines.  And, none of this would work for a student's
laptop!]

The point of my 1T example (pick ANY number for "total system capacity")
is that most of the sectors can be "dirty" -- have "seen" data at some
point in the past -- so you can't assume that "empty" would mean
"compress readily" (as would be the case for a solution that was FS
*aware*!)

I.e., the advantage of a FS-aware approach is you know which portions
of the medium are significant -- "worth preserving" -- so the balance
can compress to take NO space in the image.
> Such a system won't suit everyone, but when it works, it simplifies
> backup and machine independence significantly.  It also makes it a lot
> easier to track versions and "current" data, instead of having local
> copies and server copies that are a bit different - you have /only/
> server copies and tracked backups of them.
That's exactly what I do -- though I may keep multiple branches on the machine while I am working on them, so I can spin down the archive until I really need to check something back *in* (assuming I *will* do so).
On Sun, 02 Nov 2014 15:36:11 -0700, Don Y <this@is.not.me.com> Gave us:

>On 11/2/2014 3:13 PM, Hul Tytus wrote:
>> With an unprotected system like MSDOS booted on a floppy or a flash disk,
>> a disk editor can copy the sectors on one partition to another. Simtel\msdos
>> has those editors, I believe, but searching for a simtel site is required.
>> The simplest procedure is to format the first half of a disk and use the
>> other half for the backup image.
>
>What I currently do is similar -- except no DOS, etc. (just write a boot
>loader that effectively does the decompress & copy without the overhead
>of a "real OS").
>
>Not using compression is highly wasteful of disk space (for the "restore
>image").  If the image is to co-reside on the medium with the live data,
>then it'd be nice not to have to "throw away" half the medium for this
>"feature"
>
>E.g., the laptops that I build for students tend to have ~160G drives
>that I can cut into a "system" partition (which I want to be able to
>restore on-demand) as well as a "data" partition (which I will leave
>to the student to maintain... if their data gets clobbered, that's
>THEIR problem; at least the machine will still be runnable after
>recovery)
Have them each get one of these.  Look at the frequently bought
together section, and also there are USB enclosures.

They could even boot and run from the detachable drive, and back up to
the internal, fully bootable mirror, and compressed backup volume(s).

That way, a guy could put his dead laptop down, and put his drive into
another student's laptop, and boot up *HIS* own system and finish
working until his charges back up.  Each laptop could even be further
configured to have a guest partition and back up session data for guest
sessions there.  That one would be tougher though as hardware IDs would
have to be utilized.

Nifty like... Like... nifty, man.

Run from the detachable, and back up to the laptop itself.  A person
would protect his "system on a leash" like any other valued item, such
as a wallet.

This also makes the laptop itself a bit more ubiquitous for the
student, should it get stolen.  He gets a new one, and keeps running,
while "the system" hunts down the stolen job, which carries non-erasable
ID info in a couple places on it, which are even electronically
accessible.  MAC ID, of course, as you know, and others too.  The ones
the NSA would like to use to backdoor you, plant evidence, gather it
back up... use it against you, etc.

You know their drill.  Just ask Monica's friend.
On Sun, 02 Nov 2014 18:32:06 -0800, DecadentLinuxUserNumeroUno
<DLU1@DecadentLinuxUser.org> Gave us:

>[snip -- full quote of previous message]
Ooops... forgot the link. http://www.amazon.com/Transcend-MTS400-Solid-State-TS256GMTS400/dp/B00KLTPUG4
On Sun, 02 Nov 2014 12:02:44 -0700, Don Y <this@is.not.me.com> wrote:


>If the machine can access the medium, then what do I care about the
>hardware interface?
The problem with dd is that there is no actual guarantee that what you read will work if written back. Even under the "raw" block devices there is a lot of translation going on.
>For every (?) machine, there are certain assumptions that can be made
>at IPL: one of which is that it can retrieve some number of bytes
>from the boot volume.  What you load is largely a matter of your
>discretion.
More like a matter of "necessity". You get exactly 1 block - anything beyond that is your own responsibility.
>> Raw block access is OS agnostic, but it has a lot of drawbacks: it
>> happily replicates damaged filesystems and a naive image can't be
>> restored to a smaller volume even if the volume could hold the live
>> data.
>
>Point isn't to restore to arbitrary media but, rather, to the medium
>from which it came.
The media itself may be damaged. It's less worrisome now due to sector remapping, but there's still a possibility that the restore may not work.
>As to "damaged filesystems", deal with that BEFORE you make the image!
You can't fix damaged media.
>Note, also, that it isn't a "backup" mechanism but, rather, a "restore"
>mechanism.  I.e., once the image is created, it's only ever *restored*
>(if you want to update the image, it's an expensive operation)
Calling it an "emergency partition" rather than a "backup" is just semantics.
>> Also, quite a few filesystems shortcut writes of all zeros and just
>> mark the affected blocks empty in their headers.  So to be safe you
>> have to fill with some other value.
>
>All you need to do is ensure that whatever you write is very compressible.
>Much more so than "unconstrained DEADBEEF".  E.g., you could tailor your
>compressor to recognize the 512 byte sequence:
>   "123234u349tuepdfjg;skjdgpa9sufwrtd....sdklfsopriujh"
>and replace it with a one byte "sector is empty" code (where "empty"
>really means "contains the aforementioned 512 byte sequence")
And also means "doesn't need to be restored". George
Hi George,

On 11/2/2014 8:11 PM, George Neuner wrote:
> On Sun, 02 Nov 2014 12:02:44 -0700, Don Y <this@is.not.me.com> wrote:
>> If the machine can access the medium, then what do I care about the
>> hardware interface?
>
> The problem with dd is that there is no actual guarantee that what you
> read will work if written back.  Even under the "raw" block devices
> there is a lot of translation going on.
dd(1) was just shorthand for "low-level access to the block device".
I'm not running a UN*X (or any other OS).
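[For illustration, a minimal sketch of the kind of bare-iron
"decompress & copy" restore loop being described.  Everything here is
an assumption, not from the thread: read_image_byte()/read_image()/
write_sector() stand in for whatever primitives the boot environment
supplies, the tag values are arbitrary, the filler pattern is an
arbitrary stand-in for the "123234u349tu..." sequence mentioned above,
and general-purpose compression of the literal sectors is omitted.]

  #include <stdint.h>

  #define SECTOR_SIZE 512
  #define TAG_RAW     0x00   /* a literal 512-byte sector follows */
  #define TAG_EMPTY   0x01   /* sector held the well-known filler */

  extern int  read_image_byte(void);               /* next image byte */
  extern void read_image(void *buf, uint32_t len); /* bulk image read */
  extern void write_sector(uint32_t lba, const void *buf);

  static uint8_t filler[SECTOR_SIZE];

  void restore(uint32_t first_lba, uint32_t n_sectors)
  {
      uint8_t buf[SECTOR_SIZE];

      /* Regenerate the agreed-upon "empty" pattern (a stand-in for
       * the magic 512-byte sequence discussed above). */
      for (unsigned i = 0; i < SECTOR_SIZE; i++)
          filler[i] = (uint8_t)(i * 167 + 13);

      for (uint32_t lba = first_lba; lba < first_lba + n_sectors; lba++) {
          switch (read_image_byte()) {
          case TAG_EMPTY:                 /* 1 image byte -> 512 on disk */
              write_sector(lba, filler);
              break;
          case TAG_RAW:
              read_image(buf, SECTOR_SIZE);
              write_sector(lba, buf);
              break;
          default:
              for (;;) ;                  /* corrupt image: halt */
          }
      }
  }

[Writing the filler back keeps the volume byte-for-byte reproducible;
skipping the write entirely would also work -- and be faster -- per
George's point above that such sectors don't need to be restored.]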
>> For every (?) machine, there are certain assumptions that can be made
>> at IPL: one of which is that it can retrieve some number of bytes
>> from the boot volume.  What you load is largely a matter of your
>> discretion.
>
> More like a matter of "necessity".  You get exactly 1 block - anything
> beyond that is your own responsibility.
Yup.
>>> Raw block access is OS agnostic, but it has a lot of drawbacks: it
>>> happily replicates damaged filesystems and a naive image can't be
>>> restored to a smaller volume even if the volume could hold the live
>>> data.
>>
>> Point isn't to restore to arbitrary media but, rather, to the medium
>> from which it came.
>
> The media itself may be damaged.  It's less worrisome now due to
> sector remapping, but there's still a possibility that the restore may
> not work.
If the medium is damaged, then it needs to be replaced. I.e., there is no need for a "restore" to work when the hardware doesn't.
>> As to "damaged filesystems", deal with that BEFORE you make the image! > > You can't fix damaged media.
Exactly. In the "students" case, the chances are the restore will be necessitated from their system getting munged with spyware, downloaded cruft, etc. Expecting/requiring me (or someone else) to rebuild their system because they were irresponsible is silly. Give them the ability to do it... and, the COST of doing it (i.e., the potential to lose anything that THEY don't explicitly save before the restore)
>> Note, also, that it isn't a "backup" mechanism but, rather, a "restore"
>> mechanism.  I.e., once the image is created, it's only ever *restored*
>> (if you want to update the image, it's an expensive operation)
>
> Calling it an "emergency partition" rather than a "backup" is just
> semantics.
There's a difference in expectations. I do "backups" all the time. I *seldom* do "restores". This mechanism is intended for folks who never do *backups* but often/sometimes do "restores"! (i.e., *I* will never be building a new "image" for a machine once it has left my hands)
>>> Also, quite a few filesystems shortcut writes of all zeros and just
>>> mark the affected blocks empty in their headers.  So to be safe you
>>> have to fill with some other value.
>>
>> All you need to do is ensure that whatever you write is very compressible.
>> Much more so than "unconstrained DEADBEEF".  E.g., you could tailor your
>> compressor to recognize the 512 byte sequence:
>>    "123234u349tuepdfjg;skjdgpa9sufwrtd....sdklfsopriujh"
>> and replace it with a one byte "sector is empty" code (where "empty"
>> really means "contains the aforementioned 512 byte sequence")
>
> And also means "doesn't need to be restored".
Well, *likely* doesn't need to be restored (depends on how unique that string can be -- "Copyright 2014 Microsoft" would probably be a bad choice...)
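[The imaging-side counterpart might look like the following -- a sketch
of the "recognize the magic 512-byte sequence" pre-pass, written as an
ordinary hosted filter.  The tag values and generated pattern match the
restore sketch above and are illustrative only; it assumes the device
is a whole number of sectors, and a stock compressor downstream
squeezes the literal sectors further.]

  #include <stdio.h>
  #include <string.h>

  #define SECTOR_SIZE 512
  #define TAG_RAW     0x00   /* a literal 512-byte sector follows */
  #define TAG_EMPTY   0x01   /* sector held the well-known filler */

  int main(void)
  {
      static unsigned char filler[SECTOR_SIZE];
      unsigned char sector[SECTOR_SIZE];
      size_t i;

      /* The agreed-upon "empty" pattern -- anything unlikely to occur
       * in real data (see the uniqueness caveat above). */
      for (i = 0; i < SECTOR_SIZE; i++)
          filler[i] = (unsigned char)(i * 167 + 13);

      /* stdin = raw device, stdout = tagged stream for the compressor */
      while (fread(sector, 1, SECTOR_SIZE, stdin) == SECTOR_SIZE) {
          if (memcmp(sector, filler, SECTOR_SIZE) == 0) {
              putchar(TAG_EMPTY);               /* 512 bytes -> 1 byte */
          } else {
              putchar(TAG_RAW);
              fwrite(sector, 1, SECTOR_SIZE, stdout);
          }
      }
      return 0;
  }

[Usage would be along the lines of "tagger < raw-device | compress >
image.Z" -- names illustrative.]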
On Sun, 02 Nov 2014 10:42:55 -0700, Don Y <this@is.not.me.com> wrote:

>On 11/2/2014 10:09 AM, glen herrmannsfeldt wrote:
>
>>> A naive approach to this would be to plumb dd to a compressor -- running
>>> both OUTSIDE the native OS.  But, for large/dirty volumes, this gives you
>>> an unacceptably large resulting image -- because you end up having to store
>>> "discarded data" which could potentially be HUGE (consider a large volume
>>> that has seen lots of write/delete cycles) esp in comparison with the
>>> actual precious data!
>>
>> For older disks, that are usually relatively small, that is probably
>> the best choice.
>
>The problem comes with newer disks.  E.g., I keep ~1T on each workstation
>and *only* drag "current projects" onto them counting on the file servers
>to maintain most of my stuff "semi-offline".
>
>So, you can easily have just 5 or 10% of the disk "in use" but 90% of it
>"dirty".  Dirt doesn't compress nearly as well as "virgin media"  :>
>
>>> [I'd like to be able to store the image on a (set of) optical media and/or
>>> an unused "partition" somewhere]
>>
>>> I.e., without knowledge of the specific filesystem(s) involved, you don't
>>> know how to recognize live data from deleted data.
>>
>>> The *hack* that I am currently evaluating is to invoke a trivial executable
>>> UNDER THE NATIVE OS that simply creates large "blank" (i.e., highly
>>> compressible) files until the volume is "full", then unlinks them all.
>>> Doing this while the system is reasonably quiescent isn't guaranteed to
>>> "vacuum" all available space but would make a big dent in it (if the
>>> system is brought down shortly thereafter).
>>
>> Be sure to fsck or chkdsk first.
>
>Yes, of course.  The point is that I am willing to expend a fair bit of
>effort -- including "unscripted" actions -- to get the initial "master"
>disk image "Correct".  But, want most of that effort to be in the native OS
>instead of having to implement hooks for every conceivable file system type.
FWIW, Windows has had an option ("/w") on the "cipher" command to wipe
all unused areas on an NTFS volume since the W2K days, and it will do
that on a live volume.

I'm not sure what it leaves in the empty space, but at least a decade
ago it took several passes (writing zeros, writing ones, writing random
numbers, etc.) over all of the unallocated space on the volume.  For
your purposes* it would hopefully not have that random data pass as the
last one.  I also don't know if that ever worked on non-NTFS volumes.

IIRC, this was an add-on you had to download from MS in W2K, and part
of the standard installation of *some* XP versions (Pro and server, I
think), and has been standard on all Vista, Win7 and Win8 versions.

*The purpose of the command/option is to prevent people from recovering
data from deallocated space, not prepping a volume for (image)
compression.
On 11/2/2014 10:11 PM, Robert Wessel wrote:
> On Sun, 02 Nov 2014 10:42:55 -0700, Don Y <this@is.not.me.com> wrote:
>
>> [snip]
>
> FWIW, Windows has had an option ("/w") on the "cipher" command to wipe
> all unused areas on an NTFS volume since the W2K days, and it will do
> that on a live volume.
Hmmm... interesting!
> I'm not sure what it leaves in the empty space, but at least a decade
> ago it took several passes (writing zeros, writing ones, writing random
> numbers, etc.) over all of the unallocated space on the volume.  For
> your purposes* it would hopefully not have that random data pass as
> the last one.  I also don't know if that ever worked on non-NTFS
> volumes.
>
> IIRC, this was an add-on you had to download from MS in W2K, and part
> of the standard installation of *some* XP versions (Pro and server, I
> think), and has been standard on all Vista, Win7 and Win8 versions.
I will look for it out of curiosity.
> *The purpose of the command/option is to prevent people from
> recovering data from deallocated space, not prepping a volume for
> (image) compression.
That makes sense.  Poor man's approach to a self-scrubbing filesystem.

I think I really want to pursue a more universal strategy.  E.g., I can
"fill" the unused areas on my NAS devices by mounting the shares/exports
and playing the create/fill/unlink game with the same sort of results as
on a local filesystem.  Regardless of level of RAID in place, etc.
Writing files is a relatively common activity (for storage devices! :> ).
Anything beyond that gets iffy...

[Of course, for the NAS *appliances* I'd then have to physically remove
the drive and install it somewhere that I could run the imaging
executable]
On 2014-11-02, Don Y <this@is.not.me.com> wrote:

 [app that fills the disk with a file full of zeroes]
> Then, dd | compress (on bare iron).
> Again, not ideal but probably the best bang for the least buck?
Yeah, unless you can find a mount option or equivalent that does
"overwrite with zeros on unlink", filling the free space with zeroes
could take a lot of time if you have a lot free.

--
umop apisdn
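[The create/fill/unlink "vacuum" hack under discussion is only a few
lines -- a sketch, with an arbitrary scratch-file name and chunk size,
writing the same marker pattern as the sketches above rather than
zeros (per George's caveat about filesystems shortcutting all-zero
writes).  Real use would want several files, to dodge per-file size
limits such as FAT32's 4 GB.]

  #include <stdio.h>
  #include <stdlib.h>

  #define SECTOR_SIZE 512
  #define CHUNK (2048 * SECTOR_SIZE)           /* 1 MB per fwrite()  */

  int main(void)
  {
      static unsigned char chunk[CHUNK];
      size_t i;
      FILE *f = fopen("fill.tmp", "wb");

      if (f == NULL)
          return EXIT_FAILURE;

      /* Replicate the 512-byte marker across the chunk so each freed
       * sector later collapses to one byte in the image. */
      for (i = 0; i < CHUNK; i++)
          chunk[i] = (unsigned char)((i % SECTOR_SIZE) * 167 + 13);

      /* Keep writing until the volume fills (fwrite comes up short). */
      while (fwrite(chunk, 1, CHUNK, f) == CHUNK)
          ;

      fclose(f);
      remove("fill.tmp");      /* give the space back, now "clean" */
      return EXIT_SUCCESS;
  }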
On 02.11.2014 г. 17:25, Don Y wrote:
> Hi,
>
> I'm writing a bit of code to image disk contents REGARDLESS OF THE
> FILESYSTEM(s) contained thereon.
>
> This doesn't have to be "ideal" (defined as "effortless", "minimal
> image size", etc.) but should be pretty close.
>
> It is not intended to be performed often -- "write once, read multiple"
> (i.e., RESTORE *far* more often than IMAGE).
>
> The challenge comes in the filesystem(s) neutral aspect.  E.g., I
> should be able to image a disk containing FAT32, NTFS, FFSv1/2, QFS,
> individual RAID* volumes, little/BIG endian, etc. -- with the same
> executable!
Hi Don,

since obviously there is no common solution to all filesystems (unless
you want to copy the entire medium, which is impractical), your best
bet is to go minimalistic about it.  Recognize which file system this
is, then find your way to the allocated space and store it in some
indexed format - such that you can subsequently recover it.

On some filesystems it will be easier than on others - e.g. on DPS you
will need to locate a file in the root directory, unitcat.syst, which
is a bitmap of the allocated clusters; and you have to read logic block
0 to see how large the "device" (i.e. partition) is, what block size it
assumes, and how many blocks there are per cluster.  On FAT it will be
easier I think (no need to do root or any directory).  But you can't
get around this minimum I suppose.

Then there are not that many filesystems in mass use anyway (I think
George Neuner already said that), so the effort will not be that huge.

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI             http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/sets/72157600228621276/
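[A sketch of that minimal approach, once the filesystem-specific
legwork has produced a cluster-allocation bitmap (unitcat.syst on DPS,
derived from the FAT itself on FAT volumes).  All of the types and I/O
hooks here are assumptions for illustration, not from the thread.]

  #include <stdint.h>
  #include <stdio.h>

  extern void     read_cluster(uint32_t idx, void *buf, uint32_t size);
  extern uint32_t cluster_size;   /* from the volume's boot block   */
  extern uint32_t n_clusters;
  extern uint8_t  bitmap[];       /* 1 bit per cluster, 1 = in use  */

  static int allocated(uint32_t idx)
  {
      return (bitmap[idx / 8] >> (idx % 8)) & 1;
  }

  /* Emit only the allocated clusters, each prefixed with its index
   * (native byte order here; a real format would pin the endianness)
   * so a restore knows where each one goes back. */
  void image_volume(FILE *out, void *buf)
  {
      for (uint32_t idx = 0; idx < n_clusters; idx++) {
          if (!allocated(idx))
              continue;           /* free cluster: costs zero bytes */
          read_cluster(idx, buf, cluster_size);
          fwrite(&idx, sizeof idx, 1, out);
          fwrite(buf, 1, cluster_size, out);
      }
  }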