EmbeddedRelated.com
Forums

Embedding a Checksum in an Image File

Started by Rick C April 19, 2023
On 4/22/23 10:07 AM, Brian Cockburn wrote:
> Rick,
>
>> So far, it looks like a simple checksum is the way to go. Include the
>> checksum and the 2's complement of the checksum (in locations that were
>> zeros), and the checksum will not change.
>
> How will the checksum 'not change'? It will be different for every build
> won't it?
>
> Cheers, Brian.
He means the checksum of the file for a given build after the modification will be the same as the checksum of the file before the modification.
On 22/04/2023 02:29, Don Y wrote:
> On 4/21/2023 7:50 AM, David Brown wrote:
>> On 21/04/2023 13:39, Don Y wrote:
>>> On 4/21/2023 3:43 AM, David Brown wrote:
>>>>> Note that you want to choose a polynomial that doesn't
>>>>> give you a "win" result for "obviously" corrupt data.
>>>>> E.g., if data is all zeros or all 0xFF (as these sorts of
>>>>> conditions can happen with hardware failures) you probably
>>>>> wouldn't want a "success" indication!
>>>>
>>>> No, that is pointless for something like a code image.  It just adds
>>>> needless complexity to your CRC algorithm.
>>>
>>> Perhaps you've forgotten that you don't just use CRCs (secure hashes,
>>> etc.) on "code images"?
>>
>> No - but "code images" is the topic here.
>
> So, anything unrelated to CRC's as applied to code images is off limits...
> per order of the "Internet Police"?
No, it's fine to discuss them - threads on Usenet often wander, and that's often good. (At least, that's my opinion - some people get their knickers in a twist if people stray from answering their original question.) But you have to assume that people are on topic unless it's clear that the topic is being expanded. We were discussing CRC's for code images, and so it is appropriate to take advantage of the features of code images. If you want to expand and talk about other uses of CRC's, I've no problem with that - but you need to say so.
> If *all* you use CRCs for is checking *a* code image at POST, you're
> wasting a valuable resource.
>
> Do you not think data/parameters need to be safeguarded?  Program images?
> Communication protocols?
Sure. Many things need integrity checks. And CRC's are flexible enough to be useful in many circumstances.
> Or, do you develop yet another technique for *each* of those?
Sometimes, yes. CRC's are, as I wrote, flexible. But they don't cover everything. Maybe you need a specific type of check to match existing protocols or requirements. Maybe you want forward error correction, not just error detection. Maybe you are guarding against malicious interference. Maybe you are guarding against different kinds of errors - CRC's are great for spotting a few damaged bits, but a poor choice if the risk is dropped bytes in transmission. But often CRC's will be a first choice, because they are simple and effective in a wide range of uses.
>> However, in almost every case where CRC's might be useful, you have
>> additional checks of the sanity of the data, and an all-zero or
>> all-one data block would be rejected.  For example, Ethernet packets
>> use CRC for integrity checking, but an attempt to send a packet type 0
>> from MAC address 00:00:00:00:00:00 to address 00:00:00:00:00:00, of
>> length 0, would be rejected anyway.
>
> Why look at "data" -- which may be suspect -- and *then* check its CRC?
> Run the CRC first.  If it fails, decide how you are going to proceed
> or recover.
That is usually the order, yes. Sometimes you want "fail fast", such as dropping a packet that was not addressed to you (it doesn't matter if it was received correctly but for someone else, or it was addressed to you but the receiver address was corrupted - you are dropping the packet either way). But usually you will run the CRC then look at the data. But the order doesn't matter - either way, you are still checking for valid data, and if the data is invalid, it does not matter if the CRC only passed by luck or by all zeros.
> ["Data" can be code or parameters]
>
> I treat blocks of "data" (carefully arranged) with individual CRCs,
> based on their relative importance to the operation.  If the CRC is
> corrupt, I have no idea *where* the error lies -- as it could
> be anything in the checked block.  So, one has to (typically)
> restore some defaults (or, invoke a reconfigure operation) which
> recreates *a* valid dataset.
>
> This is particularly useful when power to a device can be
> removed at arbitrary points in time (or, some other abrupt
> crash).  Before altering anything in a block, take deliberate
> steps to invalidate the CRC, make your changes, then "fix"
> the CRC.  So, an interrupted process causes the CRC to fail
> and remedial action taken.
>
> Note that replacing a FLASH image (mostly code) falls under
> such a mechanism.
That's all standard stuff. (Maybe it's new to some people in this group - although most of the regular posters here are experienced embedded developers, it's nice to think there might be some people reading these posts and learning!) If you have the space in your flash, eeprom, etc., then it is also common to have two slots for your configuration data or code. You don't "invalidate" anything - you keep a version counter with your data, and write your new data to the slot with the oldest version. When your system starts, it checks both slots - and uses the one with the newest version for which the CRC check passes.
>> I can't think of any use-cases where you would be passing around a
>> block of "pure" data that could reasonably take absolutely any value,
>> without any type of "envelope" information, and where you would think
>> a CRC check is appropriate.
>
> I append a *version specific* CRC to each packet of marshalled data
> in my RMIs.  If the data is corrupted in transit *or* if the
> wrong version API ends up targeted, the operation will abend
> because we know the data "isn't right".
Using a version-specific CRC sounds silly. Put the version information in the packet.
> I *could* put a header saying "this is version 4.2".  And, that
> tells me nothing about the integrity of the rest of the data.
> OTOH, ensuring the CRC reflects "4.2" does -- if the recipient
> expects it to be so.
Now you don't know if the data is corrupted, or for the wrong version - or occasionally, corrupted /and/ the wrong version but passing the CRC anyway. Unless you are absolutely desperate to save every bit you can, your system will be simpler, clearer, and more reliable if you separate your purposes.
>>>>> You can also "salt" the calculation so that the residual
>>>>> is deliberately nonzero.  So, for example, "success" is
>>>>> indicated by a residual of 0x474E.  :>
>>>>
>>>> Again, pointless.
>>>>
>>>> Salt is important for security-related hashes (like password
>>>> hashes), not for integrity checks.
>>>
>>> You've missed the point.  The correct "sum" can be anything.
>>> Why is "0" more special than any other value?  As the value is
>>> typically meaningless to anything other than the code that verifies
>>> it, you couldn't look at an image (or the output of the verifier)
>>> and gain anything from seeing that obscure value.
>>
>> Do you actually know what is meant by "salt" in the context of hashes,
>> and why it is useful in some circumstances?  Do you understand that
>> "salt" is added (usually prepended, or occasionally mixed in in some
>> other way) to the data /before/ the hash is calculated?
>
> What term would you have me use to indicate a "bias" applied to a CRC
> algorithm?
Well, first I'd note that any kind of modification to the basic CRC algorithm is pointless from the viewpoint of its use as an integrity check. (There have been, mostly historically, some justifications in terms of implementation efficiency. For example, bit and byte re-ordering could be done to suit hardware bit-wise implementations.) Otherwise I'd say you are picking a specific initial value if that is what you are doing, or modifying the final value (inverting it or xor'ing it with a fixed value). There is, AFAIK, no specific term for these - and I don't see any benefit in having one. Misusing the term "salt" from cryptography is certainly not helpful.
>> I have not given the slightest indication to suggest that "0" is a
>> special value.  I fully agree that the value you get from the checking
>> algorithm does not have to be 0 - I already suggested it could be
>> compared to the stored value.  I.e., you build your image file as
>> "data ++ crc(data)", and check it by re-calculating "crc(data)" on the
>> received image and comparing the result to the received crc.  There is
>> no necessity or benefit in having a crc calculated over the
>> received data plus the received crc being 0.
>>
>> "Salt" is used in cases where the original data must be kept secret,
>> and only the hashes are transmitted or accessible - by adding salt to
>> the original data before hashing it, you avoid a direct correspondence
>> between the hash and the original data.  The prime use-case is to stop
>> people being able to figure out a password by looking up the hash in a
>> list of pre-computed hashes of common passwords.
>
> See above.
>
>>> OTOH, if the CRC yields something familiar -- or useful -- then
>>> it can tell you something about the image.  E.g., salt the algorithm
>>> with the product code, version number, your initials, 0xDEADBEEF, etc.
>>
>> You are making no sense at all.  Are you suggesting that it would be a
>> good idea to add some value to the start of the image so that the
>> resulting crc calculation gives a nice recognisable product code?
>> This "salt" would be different for each program image, and calculated
>> by trial and error.  If you want a product code, version number, etc.,
>> in the program image (and it's a good idea), just put these in the
>> program image!
>
> Again, that tells you nothing about the rest of the image!
Again, you are making no sense - not to me, anyway. If you want something in the image to tell you about the image, add such metadata - versions, dates, whatever. If you want an integrity check of the image, make one - such as appending a CRC. Trying to combine these two orthogonal tasks into one is not going to be good for either purpose.
> See the RMI description.
I'm sorry, I have no idea what "RMI" is or where it is described. You've mentioned that abbreviation twice, but I can't figure it out.
> [Note that the OP is expecting the checksum to help *him*
> identify versions:  "Just put these in the program image!"  Eh?]
No. The OP is looking for a way to be sure that two program images are the same. He wants to be sure that if he (or whoever makes the image) forgets to update the version number when making a change to the software, the difference between the images is easily detectable or identifiable without doing a byte-for-byte compare of the images. The answer to that is a hash of some sort - and a CRC of appropriate size is a simple hash that will work well against mistakes (but not necessarily malicious changes). But a hash will not give you a version number. It will let you see that two images are different, but it will not tell you that one of them is version 1.20.304 and the other is 1.21.308. What he will see is that if two files say they are version 1.20.304, but are actually different, someone has screwed up - the CRC hash makes such checks possible without having to read through the entire images.
>>>>>> So now you have a new extended block   |....data....|crc|
>>>>>>
>>>>>> Now if you compute a new CRC on the extended block, the resulting
>>>>>> value /should/ come out to zero.  If it doesn't, either your data or
>>>>>> the original CRC value appended to it has been changed/corrupted.
>>>>>
>>>>> As there is usually a lack of originality in the algorithms
>>>>> chosen, you have to consider if you are also hoping to use
>>>>> this to safeguard the *integrity* of your image (i.e.,
>>>>> against intentional modification).
>>>>
>>>> "Integrity" has nothing to do with the motivation for change.
>>>> /Security/ is concerned with intentional modifications that
>>>> deliberately attempt to defeat /integrity/ checks.  Integrity is
>>>> about detecting any changes.
>>>>
>>>> If you are concerned about the possibility of intentional malicious
>>>> changes,
>>>
>>> Changes don't have to be malicious.
>>
>> Accidental changes (such as human error, noise during data transfer,
>> memory cell errors, etc.) do not pass integrity tests unnoticed.
>
> That's not true.  The role of the *test* is to notice these.  If the test
> is blind to the types of errors that are likely to occur, then it CAN'T
> notice them.
I assumed it was unnecessary to say that an integrity test needs to be appropriate for the type of data and transfer in question.
> A CRC (hash, etc.) reduces a large block of data to a small bit of
> data.  So, by definition, there are multiple DIFFERENT sets of data that
> map to the same CRC/hash/etc.  (2^(data_size - crc_size) of them)
Correct. That's why you need to pick an appropriate size for your CRC. For a telegram of a dozen bytes, an 8-bit CRC is probably fine. For a program image, a 32-bit CRC is usually more appropriate - a one in four billion chance of an undetected error is reasonable for most uses. If you want to be more paranoid, go for 64-bit CRC - you should now be far more worried about meteors wiping out humanity than undetected errors. (More commonly, if a 32-bit CRC is not enough, it's because you have security concerns - so switch to a SHA hash.)
> E.g., simply summing the values in a block of memory will yield "0"
> for ANY condition that results in the block having identical values
> for ALL members, if the block size is a power of 2.  So, a block
> of 0xFF, 0x00, 0xFE, 0x27, 0x88, etc. will all yield the same sum.
> Clearly a bad choice of test!
Correct. That's why simple sums are not usually considered very good integrity tests. A CRC has a spreading effect. Every bit in the data contributes with approximately equal weight to every bit in the CRC. This is a common feature for good hash functions.
> OTOH, "salting" the calculation so that it is expected to yield
> a value of 0x13 means *those* situations will be flagged as errors
> (and a different set of situations will sneak by, undetected).
And that gives you exactly /zero/ benefit. You run your hash algorithm, and check for the single value that indicates no errors. It does not matter if that number is 0, 0x13, or - often more conveniently - the number attached at the end of the image as the expected result of the hash of the rest of the data.
> The trick (engineering) is to figure out which types of
> failures/faults/errors are most common to occur and guard
> against them.
Yes, that is absolutely the case. And CRC's have the convenience of being particularly good at certain kinds of errors that are feasible in a lot of data transmissions. But they are not ideal for everything, and other kinds of checks can be better when you know more about the realistic errors.
>> To be more accurate, the chances of them passing unnoticed are of the
>> order of 1 in 2^n, for a good n-bit check such as a CRC check.
>> Certain types of error are always detectable, such as single and
>> double bit errors.  That is the point of using a checksum or hash for
>> integrity checking.
>>
>> /Intentional/ changes are a different matter.  If a hacker changes the
>> program image, they can change the transmitted hash to their own
>> calculated hash.  Or for a small CRC, they could change a different
>> part of the image until the original checksum matched - for a 16-bit
>> CRC, that only takes 65,535 attempts in the worst case.
>
> If the approach used is "typical", then you need far fewer attempts to
> produce a correct image -- without EVER knowing where the CRC is stored.
It is difficult to know what you are trying to say here, but if you believe that different initial values in a CRC algorithm makes it harder to modify an image to make it pass the integrity test, you are simply wrong.
>> That is why you need to distinguish between the two possibilities.  If
>> you don't have to worry about malicious attacks, a 32-bit CRC takes a
>> dozen lines of C code and a 1 KB table, all running extremely
>> efficiently.  If security is an issue, you need digital signatures -
>> an RSA-based signature system is orders of magnitude more effort in
>> both development time and in run time.
>
> It's considerably more expensive AND not fool-proof -- esp if the
> attacker knows you are signing binaries.  "OK, now I need to find
> WHERE the signature is verified and just patch that "CALL" out
> of the code".
I'm not sure if that is a straw-man argument, or just showing your ignorance of the topic. Do you really think security checks are done by the program you are trying to send securely? That would be like trying to have building security where people entering the building look at their own security cards.
>>> I altered the test procedure for a
>>> piece of military gear we were building simply to skip some lengthy
>>> tests that I *knew* would pass.  (I don't want to inject an extra 20
>>> minutes of wait time just to get through a lengthy test I already know
>>> works before I can get to the test of interest to me, now.)
>>>
>>> I failed to undo the change before the official signoff on the device.
>>>
>>> The only evidence of this was the fact that I had also patched the
>>> startup message to say "Go for coffee..." -- which remained on the
>>> screen for the duration of the lengthy (even with the long test
>>> elided) procedure...
>>>
>>> ...which alerted folks to the fact that this *probably* wasn't the
>>> original image.  (The computer running the test suite on the DUT had
>>> no problem accepting my patched binary.)
>>
>> And what, exactly, do you think that anecdote tells us about CRC
>> checks for image files?  It reminds us that we are all fallible, but
>> does no more than that.
>
> That *was* the point.  Because the folks who designed the test computer
> relied on common techniques to safeguard the image.
There was a human error - procedures were not good enough, or were not followed. It happens, and you learn from it and make better procedures. The fault was in what people did, not in an automated integrity check. It is completely unrelated.
> The counterfeiting example I cited indicates how "obscurity/secrecy"
> is far more effective (yet you dismiss it out-of-hand).
No, it does nothing of the sort. There is no connection at all.
>>>> CRC's alone are useless.  All the attacker needs to do after
>>>> modifying the image is calculate the CRC themselves, and replace the
>>>> original checksum with their own.
>>>
>>> That assumes the "alterer" knows how to replace the checksum, how it
>>> is computed, where it is embedded in the image, etc.  I modified the
>>> Compaq portable mentioned without ever knowing where the checksum was
>>> stored or *if* it was explicitly stored.  I had no desire to
>>> disassemble the BIOS ROMs (though could obviously do so as there was
>>> no "proprietary hardware" limiting access to their contents and the
>>> instruction set of the processor is well known!).
>>>
>>> Instead, I did this by *guessing* how they would implement such a check
>>> in a bit of kit from that era (EPROMs aren't easily modified by malware
>>> so it wasn't likely that they would go to great lengths to "protect" the
>>> image).  And, if my guess had been incorrect, I could always reinstall
>>> the original EPROMs -- nothing lost, nothing gained.
>>>
>>> Had much experience with folks counterfeiting your products and making
>>> "simple" changes to the binaries?  Like changing the copyright notice
>>> or splash screen?
>>>
>>> Then, bringing the (accused) counterfeit of YOUR product into a
>>> courtroom and revealing the *hidden* checksum that the counterfeiter
>>> wasn't aware of?
>>>
>>> "Gee, why does YOUR (alleged) device have *my* name in it -- in addition
>>> to behaving exactly like mine??"
>>>
>>> [I guess obscurity has its place!]
>>
>> Security by obscurity is not security.  Having a hidden signature or
>> other mark can be useful for proving ownership (making an intentional
>> mistake is another common tactic - such as commercial maps having a
>> few subtle spelling errors).  But that is not security.
>
> Of course it is!  If *you* check the "hidden signature" at runtime
> and then alter "your" operation such that an altered copy fails
> to perform properly, then you have secured it.
That is not security. "Security" means that the program that starts the updated program checks the /entire/ image according to its digital signature, and rejects it /entirely/ if it does not match. What you are talking about here is the sort of cat-and-mouse nonsense computer games producers did with intentional disk errors to stop copying. It annoys legitimate users and does almost nothing to hinder the bad guys.
> Would you want to use a check-writing program if the account
> balances it maintains were subtly (but not consistently)
> incorrect?
Again, you make no sense. What has this got to do with integrity checks or security?
> OTOH, if the (altered) program threw up a splash screen and
> said "Unlicensed copy detected" and refused to operate, the
> "program" is still "secured" -- but, now you've provided an
> easy indicator of whether or not the security has been
> defeated.
>
> We started doing this in the heyday of video (arcade) gaming;
> a counterfeiter would have a clone of YOUR game on the market
> (at substantially reduced prices) in a matter of *weeks*.
> As Operators have no foreknowledge of which games will be
> moneymakers and which will be "90 day wonders" (literally,
> no longer played after 90 days of exposure!), what incentive
> to pay for a genuine article?
>
> If all a counterfeiter had to do was alter the copyright
> notice (even if it was stored in some coded form), or alter
> some graphics (name of game, colors/shapes of characters)
> that's *no* impediment -- given how often and quickly
> it could be done.
>
> Games would not just look at their images during POST
> but, also, verify that routineX() had some particular
> side-effect that could be tested, etc.  Counterfeiters
> would go to lengths to ensure even THESE tests would pass.
>
> Because the game would *complain*, otherwise!  (so, keep
> looking for more tests until the game stops throwing an
> alarm).
>
> OTOH, if you *hide* the checks in the runtime and alter
> the game's performance subtly by folding expected values
> into key calculations such that values derived from
> altered code differ, you can annoy the player:  "why did
> my guy just turn blue and run off the edge of the screen?"
> An annoyed player stops putting money into a game.
> A game that doesn't earn money -- regardless of how
> inexpensive it was to purchase -- quickly teaches the
> Owner not to invest in such "buggy" games.
>
> This is much better than taking the counterfeiter to court and
> proving the code is a copy of yours!  (and, "FlyByNight
> Games Counterfeiters" simply closes up shop and opens up,
> next door)
>
> And, because there is no "drop dead" point in the code or
> the game's behavior, the counterfeiter never knows when
> he's found all the protection mechanisms.
>
> Checking signatures, CRCs, licensing schemes, etc. all are used
> in a "drop dead" fashion so considerably easier to defeat.
> Witness the number of "products" available as warez...
Look, it is all /really/ simple. And the year is 2023, not 1973. If you want to check the integrity of a file against accidental changes, a CRC is usually fine. If you want security, and to protect against malicious changes, use a digital signature. This must be checked by the program that /starts/ the updated code, or that downloaded and stored it - not by the program itself!
>>> Use a non-secret approach and you invite folks to alter it, as well.
>>>
>>>> Using non-standard algorithms for security is a simple way to get
>>>> things completely wrong.  "Security by obscurity" is very rarely the
>>>> right answer.  In reality, good security algorithms, and good
>>>> implementations, are difficult and specialised tasks, best left to
>>>> people who know what they are doing.
>>>>
>>>> To make something secure, you have to ensure that the check
>>>> algorithms depend on a key that you know, but that the attacker does
>>>> not have.  That's the basis of digital signatures (though you use a
>>>> secure hash algorithm rather than a simple CRC).
>>>
>>> If you can remove the check, then what value the key's secrecy?  By your
>>> criteria, the adversary KNOWS how you are implementing your security
>>> so he knows exactly what to remove to bypass your checks and allow his
>>> altered image to operate in its place.
>>>
>>> Ever notice how manufacturers don't PUBLICLY disclose their security
>>> hooks (without an NDA)?  If "security by obscurity" was not important,
>>> they would publish these details INVITING challenges (instead of
>>> trying to limit the knowledge to people with whom they've officially
>>> contracted).
>>
>> Any serious manufacturer /does/ invite challenges to their security.
>>
>> There are multiple reasons why a manufacturer (such as a semiconductor
>> manufacturer) might be guarded about the details of their security
>> systems.  They can be avoiding giving hints to competitors.  Maybe they
>> know their systems aren't really very secure, because their keys are
>> too short or they can be read out in some way.
>>
>> But I think the main reasons are often:
>>
>> They want to be able to change the details, and that's far easier if
>> there are only a few people who have read the information.
>
> So, a legitimate customer is subjected to arbitrary changes in
> the product's implementation?
Yes. It may come as a shock to you, but welcome to the real world.
>> They don't want endless support questions from amateurs.
>
> Only answer with a support contract.
Oh, sure - the amateurs who have some of the information but not enough details, skill or knowledge to get things working will /never/ fill forums with questions, complaints or bad reviews that bother your support staff or scare away real sales.
>> They are limited by idiotic government export restrictions made by
>> ignorant politicians who don't understand cryptography.
>
> Protections don't always have to be cryptographic.
Correct, but - as with a lot of what you write - completely irrelevant to the subject at hand. Why can't companies give out information about the security systems used in their microcontrollers (for example) ? Because some geriatric ignoramuses think banning "export" of such information to certain countries will stop those countries knowing about security and cryptography.
> The "Fortress" payphone is remarkably well hardened to direct
> physical (brute force) attacks -- money is involved.
> Ditto many slot machines (again, CASH money).  Yet, all
> have vulnerabilities.  "Expose this portion of the die
> to ultraviolet light to reset the memory protection bits"
> Etc.
>
>> Some things benefit from being kept hidden, or under restricted
>> access.  The details of the CRC algorithm you use to catch accidental
>> errors in your image file is /not/ one of them.  If you think hiding
>> it has the remotest hint of a benefit, you are doing things wrong -
>> you need a /security/ check, not a simple /integrity/ check.
>>
>> And then once you have switched to a security check - a digital
>> signature - there's no need to keep that choice hidden either, because
>> it is the /key/ that is important, not the type of lock.
>
> Again, meaningless if the attacker can interfere with the *enforcement*
> of that check.  Using something "well known" just means he already knows
> what to look for in your code.  Or, how to interfere with your
> intended implementation in ways that you may have not anticipated
> (confident that your "security" can't be MATHEMATICALLY broken).
If the attacker can interfere with the enforcement of the check, then it doesn't matter what checks you have. Keeping the design of a building's locks secret does not help you if the bad guys have bribed the security guard /inside/ the building!
> I had a discussion with a friend who knew just enough about "computers"
> to THINK he understood that world.  I mentioned my NOT using ecommerce.
> He laughed at me as "naive":  "There's 40 bit encryption on those
> connections!  No one is going to eavesdrop on your financial data!"
>
> [Really, Jerry?  You think, as an OLD accountant, you know more
> than I do as a young engineer practicing in that field?  Ok...]
>
> "Yeah, and are you 100% sure something isn't already *on* your computer
> looking at your keystrokes BEFORE they head down that encrypted tunnel?"
>
> Guess he hadn't really thought out the problem to that level of detail
> as his confidence quickly melted away to one of worry ("I wonder if
> I've already been hacked??")
>
> People implementing security almost always focus on the wrong
> aspects of the problem and walk away THINKING they can rest easy.
> Vulnerabilities are often so blatantly obvious, after the fact,
> as to be embarrassing:  "You're not supposed to do that!"
> "Then, why did your product LET ME?"
>
> I use *many* layers of security in my current design and STILL
> expect them (at least the ones that are accessible) to all
> be subverted.  So, ultimately rely on controlling *what*
> the devices can do so that, even compromised, they can't
> cause undetectable failures or information leaks.
>
> "Here's my source code.  Here are my schematics.  Here's the
> name of the guy who oversees production (bribe him to gain
> access to the keys stored in the TPM).  Now, what are you
> gonna *do* with all that?"
The first two should be fine - if people can break your security after looking at your source code or schematics, your security is /bad/. As for the third one, if they can break your security by going through the production guy, your production procedures are bad.
On 22/04/2023 01:56, Brian Cockburn wrote:
> On Saturday, April 22, 2023 at 1:02:28 AM UTC+10, David Brown wrote:
>> On 21/04/2023 14:12, Rick C wrote:
>>>
>>> This is simply to be able to say this version is unique,
>>> regardless of what the version number says.  Version numbers are
>>> set manually and not always done correctly.  I'm looking for
>>> something as a backup so that if the checksums are different, I
>>> can be sure the versions are not the same.
>>>
>>> The less work involved, the better.
>>>
>> Run a simple 32-bit crc over the image.  The result is a hash of
>> the image.  Any change in the image will show up as a change in the
>> crc.
>
> David, a hash and a CRC are not the same thing.
A CRC is a type of hash - but hash is a more generic term.
> They both produce a
> reasonably unique result though.  Any change would show in either
> (unless as a result of intentional tampering).
Exactly. Thus a CRC is a hash. It is not a cryptographically secure hash, and is not suitable for protecting against intentional tampering. But it /is/ a hash.
On 22/04/2023 05:14, Rick C wrote:
> On Friday, April 21, 2023 at 11:02:28 AM UTC-4, David Brown wrote:
>> On 21/04/2023 14:12, Rick C wrote:
>>>
>>> This is simply to be able to say this version is unique,
>>> regardless of what the version number says.  Version numbers are
>>> set manually and not always done correctly.  I'm looking for
>>> something as a backup so that if the checksums are different, I
>>> can be sure the versions are not the same.
>>>
>>> The less work involved, the better.
>>>
>> Run a simple 32-bit crc over the image.  The result is a hash of
>> the image.  Any change in the image will show up as a change in the
>> crc.
>
> No one is trying to detect changes in the image.  I'm trying to label
> the image in a way that can be read in operation.  I'm using the
> checksum simply because that is easy to generate.  I've had problems
> with version numbering in the past.  It will be used, but I want it
> supplemented with a number that will change every time the design
> changes, at least with a high probability, such as 1 in 64k.
Again - use a CRC. It will give you what you want.

You might want to go for a 32-bit CRC rather than a 16-bit CRC, depending on the kind of program, how often you build it, and what consequences a hash collision could have. With a 16-bit CRC, you have about a 5% chance of a collision after 82 builds. If collisions only matter for releases, and you only release a couple of updates, fine - but if they matter during development builds, you are running a more significant risk. Since a 32-bit CRC is quick and easy, it's worth using.
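[The 5% figure above follows from the birthday bound. A quick sketch of the calculation, assuming the hash output is uniformly distributed (which a CRC approximates well enough for this purpose):]

```python
import math

def collision_probability(num_builds: int, hash_bits: int) -> float:
    """Birthday-bound approximation of the probability that at least
    two of num_builds images share the same hash value."""
    space = 2.0 ** hash_bits
    return 1.0 - math.exp(-num_builds * (num_builds - 1) / (2.0 * space))

# 16-bit hash: roughly a 5% collision chance after 82 builds
print(f"{collision_probability(82, 16):.3f}")
# 32-bit hash: negligible for any realistic number of builds
print(f"{collision_probability(82, 32):.2e}")
```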
On Saturday, April 22, 2023 at 10:07:37 AM UTC-4, Brian Cockburn wrote:
> Rick,
>
>>> Rick, so you want the executable to, as part of its execution, print on the console the 'checksum' of itself? Or do you want to be able to inspect the executable with some other tool to calculate its 'checksum'? For the latter there are lots of tools to do that (your OS or PROM programmer for instance), for the former you need to embed the calculation code into the executable (along with the length over which to calculate) and run this when asked. Neither of these involve embedding the 'checksum' value.
>>>
>>> And just to be sure I understand what you wrote in a somewhat convoluted way. When you have two binary executables that report the same version number you want to be able to distinguish them with a 'checksum', right?
>>
>> Yes, I want the checksum to be readable while operating. Calculation code??? Not going to happen. That's why I want to embed the checksum.
>
> Can you expand on what you mean or expect by 'readable while operating' please? Are you planning to use some sort of tool to inspect the executing binary to 'read' this thing, or provoke output to the console in some way like:
>
> $ run my-binary-thing --checksum
> 10FD
> $
>
> This would be as distinct from:
>
> $ run my-binary-thing --version
> -52
> $
More like

$ run my-binary-thing
Hello, master. Would you like to achieve world domination today?
> No, thank you, can you display the contents of registers 26 and 27 in hex please?
That would be X0FE38
> Thank you.
>> Yes, two compiled files which ended up with the same version number by error. We are using an 8 bit version number, so two hex digits. Negative numbers are lab versions, positive numbers are releases, so 64 of each.
>
> Signed 8-bit numbers range from -128 to +127 (0x80 to 0x7F) so probably a few more than 64.
See? This is why I need the checksum. I make mistakes.
>> ... sometimes, in the lab, the rev number is not bumped when it should be.
>
> This may be an indicator that better procedures are needed for code review-for-release. And an independent pair of eyes should be doing the review against an agreed check list.
Or that I need a checksum. This is a lab compile, not a release. Let's try to stay on task.
>> So far, it looks like a simple checksum is the way to go. Include the checksum and the 2's complement of the checksum (in locations that were zeros), and the checksum will not change.
>
> How will the checksum 'not change'? It will be different for every build won't it?
It won't be changed by including the checksum and the complement because they add up to zero.

-- 

Rick C.

-+- Get 1,000 miles of free Supercharging
-+- Tesla referral code - https://ts.la/richard11209
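[Rick's scheme can be demonstrated in a few lines. This is an illustrative sketch, not his actual tool flow: it assumes a simple additive checksum over 16-bit words and two reserved word locations that were zero in the original image.]

```python
def checksum16(words):
    """Additive checksum: sum of 16-bit words, modulo 2**16."""
    return sum(words) & 0xFFFF

# Image with two reserved 16-bit locations left as zero.
image = [0x1234, 0xABCD, 0x0F0F, 0x0000, 0x0000]

s = checksum16(image)
image[3] = s              # embed the checksum itself
image[4] = (-s) & 0xFFFF  # and its two's complement

# The embedded pair sums to zero mod 2**16, so the overall
# checksum of the image is unchanged by the embedding.
assert checksum16(image) == s
```

Note this self-cancelling trick is specific to additive checksums; embedding a CRC without changing the CRC requires a different (and more involved) construction.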
On Saturday, April 22, 2023 at 11:13:32 AM UTC-4, David Brown wrote:
> On 22/04/2023 05:14, Rick C wrote:
> > On Friday, April 21, 2023 at 11:02:28 AM UTC-4, David Brown wrote:
> >> On 21/04/2023 14:12, Rick C wrote:
> >>>
> >>> This is simply to be able to say this version is unique,
> >>> regardless of what the version number says. Version numbers are
> >>> set manually and not always done correctly. I'm looking for
> >>> something as a backup so that if the checksums are different, I
> >>> can be sure the versions are not the same.
> >>>
> >>> The less work involved, the better.
> >>>
> >> Run a simple 32-bit crc over the image. The result is a hash of
> >> the image. Any change in the image will show up as a change in the
> >> crc.
> >
> > No one is trying to detect changes in the image. I'm trying to label
> > the image in a way that can be read in operation. I'm using the
> > checksum simply because that is easy to generate. I've had problems
> > with version numbering in the past. It will be used, but I want it
> > supplemented with a number that will change every time the design
> > changes, at least with a high probability, such as 1 in 64k.
> >
> Again - use a CRC. It will give you what you want.
Again - as will a simple addition checksum.
> You might want to go for 32-bit CRC rather than a 16-bit CRC, depending
> on the kind of program, how often you build it, and what consequences a
> hash collision could have. With a 16-bit CRC, you have a 5% chance of a
> collision after 82 builds. If collisions only matter for releases, and
> you only release a couple of updates, fine - but if they matter during
> development builds, you are getting a more significant risk. Since a
> 32-bit CRC is quick and easy, it's worth using.
Or, I might want to go with a simple checksum.

Thanks for your comments.

-- 

Rick C.

-++ Get 1,000 miles of free Supercharging
-++ Tesla referral code - https://ts.la/richard11209
On 22/04/2023 18:56, Rick C wrote:
> On Saturday, April 22, 2023 at 11:13:32 AM UTC-4, David Brown wrote:
>> On 22/04/2023 05:14, Rick C wrote:
>>> On Friday, April 21, 2023 at 11:02:28 AM UTC-4, David Brown wrote:
>>>> On 21/04/2023 14:12, Rick C wrote:
>>>>>
>>>>> This is simply to be able to say this version is unique,
>>>>> regardless of what the version number says. Version numbers are
>>>>> set manually and not always done correctly. I'm looking for
>>>>> something as a backup so that if the checksums are different, I
>>>>> can be sure the versions are not the same.
>>>>>
>>>>> The less work involved, the better.
>>>>>
>>>> Run a simple 32-bit crc over the image. The result is a hash of
>>>> the image. Any change in the image will show up as a change in the
>>>> crc.
>>>
>>> No one is trying to detect changes in the image. I'm trying to label
>>> the image in a way that can be read in operation. I'm using the
>>> checksum simply because that is easy to generate. I've had problems
>>> with version numbering in the past. It will be used, but I want it
>>> supplemented with a number that will change every time the design
>>> changes, at least with a high probability, such as 1 in 64k.
>>>
>> Again - use a CRC. It will give you what you want.
>
> Again - as will a simple addition checksum.
A simple addition checksum might be okay much of the time, but it doesn't have the resolving power of a CRC. If the source code changes "a = 1; b = 2;" to "a = 2; b = 1;", the addition checksum is likely to be exactly the same despite the change in the source. In general, you will have a much higher chance of collisions, though I think it would be very hard to quantify that.

Maybe it will be good enough for you. Simple checksums were popular once, and can still make sense if you are very short on program space. But there are good reasons why they fell out of favour in many uses.
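[The reordering point can be shown directly: the two byte strings below contain exactly the same bytes in a different order, so any purely additive checksum agrees on them, while a CRC does not. A small sketch using the standard-library CRC-32:]

```python
import binascii

def add_checksum(data: bytes) -> int:
    """Additive checksum: sum of byte values, modulo 2**16."""
    return sum(data) & 0xFFFF

first = b"a = 1; b = 2;"
second = b"a = 2; b = 1;"

# Same bytes, different order: the additive checksum is identical...
assert add_checksum(first) == add_checksum(second)
# ...but the CRC-32 distinguishes the two sequences.
assert binascii.crc32(first) != binascii.crc32(second)
```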
>
>> You might want to go for 32-bit CRC rather than a 16-bit CRC, depending
>> on the kind of program, how often you build it, and what consequences a
>> hash collision could have. With a 16-bit CRC, you have a 5% chance of a
>> collision after 82 builds. If collisions only matter for releases, and
>> you only release a couple of updates, fine - but if they matter during
>> development builds, you are getting a more significant risk. Since a
>> 32-bit CRC is quick and easy, it's worth using.
>
> Or, I might want to go with a simple checksum.
>
> Thanks for your comments.
>
It's your choice (obviously). I only point out the weaknesses in case anyone else is listening in to the thread.

If you like, I can post code for a 32-bit CRC. It's a table, and a few lines of C code.
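[The kind of routine being offered here is typically a 256-entry lookup table plus a short loop. A Python sketch of the same table-driven approach, using the standard reflected CRC-32 polynomial 0xEDB88320 (a C version is structurally identical - the table and the per-byte update are the same):]

```python
# Build the 256-entry table for the reflected CRC-32 polynomial
# 0xEDB88320 (the CRC used by Ethernet, zip, zlib, etc.).
TABLE = []
for n in range(256):
    c = n
    for _ in range(8):
        c = (c >> 1) ^ 0xEDB88320 if c & 1 else c >> 1
    TABLE.append(c)

def crc32(data: bytes) -> int:
    """Table-driven CRC-32: one table lookup per input byte."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = (crc >> 8) ^ TABLE[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF

# 0xCBF43926 is the well-known CRC-32 check value for "123456789".
assert crc32(b"123456789") == 0xCBF43926
```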
On 2023-04-22, David Brown <david.brown@hesbynett.no> wrote:

> A simple addition checksum might be okay much of the time, but it
> doesn't have the resolving power of a CRC. If the source code changes
> "a = 1; b = 2;" to "a = 2; b = 1;", the addition checksum is likely to
> be exactly the same despite the change in the source. In general, you
> will have much higher chance of collisions, though I think it would be
> very hard to quantify that.
I remember a long discussion about this a few decades ago. An N-bit additive checksum maps the source data into the same hash space as an N-bit CRC.

Therefore, for two randomly chosen sets of input bits, they both have a 1 in 2^N chance of a collision. I think that means that for random changes to an input set of unspecified properties, they would both have the same chance that the hash is unchanged.

However... IIRC, somebody (probably at somewhere like Bell Labs) noticed that errors in data transmitted over media like phone lines and microwave links are _not_ random. Errors tend to be "bursty" and can be statistically characterized. And it was shown that for the common error modes for _those_ media, CRCs were better at detecting real-world failures than an additive checksum. And (this is also important) a CRC is far, far simpler to implement in hardware than an additive checksum. For the same reasons, CRCs tend to get used for things like Ethernet frames, disc sectors, etc.

Later people seem to have adopted CRCs for detecting failures in other very dissimilar media (e.g. EPROMs) where implementing a CRC is _more_ work than an additive checksum. If the failure modes for EPROMs are similar to those studied at <wherever> when CRCs were chosen, then CRCs are probably also a good choice for EPROMs despite the additional overhead. If the failure modes for EPROMs are significantly different, then CRCs might be both sub-optimal and unnecessarily expensive.

I have no hard data either way, but it was never obvious to me that the arguments people use in favor of CRCs (better at detecting burst errors on transmission media) necessarily applied to EPROMs.

That said, I do use CRCs rather than additive checksums for things like EPROM and flash.

--
Grant
On Sat, 22 Apr 2023 19:54:54 +0200, David Brown
<david.brown@hesbynett.no> wrote:

> On 22/04/2023 18:56, Rick C wrote:
>> On Saturday, April 22, 2023 at 11:13:32 AM UTC-4, David Brown wrote:
>>> On 22/04/2023 05:14, Rick C wrote:
>>>> On Friday, April 21, 2023 at 11:02:28 AM UTC-4, David Brown wrote:
>>>>> On 21/04/2023 14:12, Rick C wrote:
>>>>>>
>>>>>> This is simply to be able to say this version is unique,
>>>>>> regardless of what the version number says. Version numbers are
>>>>>> set manually and not always done correctly. I'm looking for
>>>>>> something as a backup so that if the checksums are different, I
>>>>>> can be sure the versions are not the same.
>>>>>>
>>>>>> The less work involved, the better.
>>>>>>
>>>>> Run a simple 32-bit crc over the image. The result is a hash of
>>>>> the image. Any change in the image will show up as a change in the
>>>>> crc.
>>>>
>>>> No one is trying to detect changes in the image. I'm trying to label
>>>> the image in a way that can be read in operation. I'm using the
>>>> checksum simply because that is easy to generate. I've had problems
>>>> with version numbering in the past. It will be used, but I want it
>>>> supplemented with a number that will change every time the design
>>>> changes, at least with a high probability, such as 1 in 64k.
>>>>
>>> Again - use a CRC. It will give you what you want.
>>
>> Again - as will a simple addition checksum.
>
> A simple addition checksum might be okay much of the time, but it
> doesn't have the resolving power of a CRC. If the source code changes
> "a = 1; b = 2;" to "a = 2; b = 1;", the addition checksum is likely to
> be exactly the same despite the change in the source. In general, you
> will have much higher chance of collisions, though I think it would be
> very hard to quantify that.
>
> Maybe it will be good enough for you. Simple checksums were popular
> once, and can still make sense if you are very short on program space.
> But there are good reasons why they fell out of favour in many uses.
>
>>> You might want to go for 32-bit CRC rather than a 16-bit CRC, depending
>>> on the kind of program, how often you build it, and what consequences a
>>> hash collision could have. With a 16-bit CRC, you have a 5% chance of a
>>> collision after 82 builds. If collisions only matter for releases, and
>>> you only release a couple of updates, fine - but if they matter during
>>> development builds, you are getting a more significant risk. Since a
>>> 32-bit CRC is quick and easy, it's worth using.
Totally agree! I stopped using simple checksums years ago. Many processors these days also have a CRC peripheral that makes it easy to use. And I can simply chop that off to 16 bits if I don't want to transmit all 32 bits. Or even 24 bits.

boB
>> Or, I might want to go with a simple checksum.
>>
>> Thanks for your comments.
>>
>
> It's your choice (obviously). I only point out the weaknesses in case
> anyone else is listening in to the thread.
>
> If you like, I can post code for a 32-bit CRC. It's a table, and a few
> lines of C code.
On 22/04/2023 22:05, Grant Edwards wrote:
> On 2023-04-22, David Brown <david.brown@hesbynett.no> wrote:
>
>> A simple addition checksum might be okay much of the time, but it
>> doesn't have the resolving power of a CRC. If the source code changes
>> "a = 1; b = 2;" to "a = 2; b = 1;", the addition checksum is likely to
>> be exactly the same despite the change in the source. In general, you
>> will have much higher chance of collisions, though I think it would be
>> very hard to quantify that.
>
> I remember a long discussion about this a few decades ago. An N bit
> additive checksum maps the source data into the same hash space
> as an N-bit crc.
>
> Therefore, for two randomly chosen sets of input bits, they both have
> a 1 in 2^N chance of a collision. I think that means that for random
> changes to an input set of unspecified properties, they would both
> have the same chance that the hash is unchanged.
>
> However... IIRC, somebody (probably at somewhere like Bell Labs)
> noticed that errors in data transmitted over media like phone lines
> and microwave links are _not_ random. Errors tend to be "bursty" and
> can be statistically characterized. And it was shown that for the
> common error modes for _those_ media, CRCs were better at detecting
> real-world failures than an additive checksum. And (this is also
> important) a CRC is far, far simpler to implement in hardware than an
> additive checksum. For the same reasons, CRCs tend to get used for
> things like Ethernet frames, disc sectors, etc.
>
> Later people seem to have adopted CRCs for detecting failures in other
> very dissimilar media (e.g. EPROMs) where implementing a CRC is _more_
> work than an additive checksum. If the failure modes for EPROMs are
> similar to those studied at <wherever> when CRCs were chosen, then
> CRCs are probably also a good choice for EPROMs despite the additional
> overhead. If the failure modes for EPROMs are significantly different,
> then CRCs might be both sub-optimal and unnecessarily expensive.
>
> I have no hard data either way, but it was never obvious to me that
> the arguments people use in favor of CRCs (better at detecting burst
> errors on transmission media) necessarily applied to EPROMs.
>
> That said, I do use CRCs rather than additive checksums for things
> like EPROM and flash.
>
That's a lot of good points.

You are absolutely correct that CRC's are better for the types of errors that are often seen in transmission systems. The person at Bell Labs that you are thinking about is probably Claude Shannon, famous for his quantitative definition of information and work on the information capacity of communication channels with noise.

Another thing you can look at is the distribution of checksum outputs, for random inputs. For an additive checksum, you can consider your input as N independent 0-255 random values, added together. The result will be a normal distribution of the checksum. If you have, say, a 100 byte data block and a 16-bit checksum, it's clear that you will never get a checksum value greater than 25500, and that you are much more likely to get a value close to 12750. This kind of clustering means that the 16-bit checksum contains a lot less than 16 bits of information.

Real data - program images, data telegrams, etc. - are not fully random, and the result is even more clustering and less information in the checksum. Taking the additive checksum over a larger range, then "folding" the distribution back by wrapping the checksum to 8-bit or 16-bit will greatly reduce the clustering. That will help a lot if you have a program image and use a 16-bit additive checksum, but if you need more than "1 in 65536" integrity, it's hard to get.

A particular weakness of purely additive checksums is that they only consider the values of the bytes, not their order - re-arranging the order of the same data gives the same additive checksum.

CRC's are not as good as more advanced hashes like SHA or MD5. But their distributions are vastly better than additive checksums, and they provide integrity checks for a wider variety of possible errors. Of course, for some uses, an additive checksum might be considered good enough. There's no need to be more complicated than you need to be.
But since CRC's are usually very simple and efficient to calculate, they give an option that is a lot better than an additive checksum for little extra cost, while going beyond them to MD5 or SHA involves significantly more effort. (SHA is your first choice if you are protecting against malicious changes.)
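[The clustering described above is easy to see empirically. This sketch (with an arbitrary seed and block size chosen to match the 100-byte example) sums random blocks and checks how much of the 16-bit output space actually gets used:]

```python
import random

def add_checksum16(data: bytes) -> int:
    """Additive checksum: sum of byte values, modulo 2**16."""
    return sum(data) & 0xFFFF

random.seed(1)  # arbitrary seed, for reproducibility only
sums = [add_checksum16(bytes(random.randrange(256) for _ in range(100)))
        for _ in range(10_000)]

# A 100-byte block can never sum to more than 100 * 255 = 25500,
# so well over half of the 16-bit output space is unreachable.
assert max(sums) <= 25500

# The reachable values cluster around the mean of 100 * 127.5 = 12750:
# the vast majority of checksums land within +/-10% of it.
near_mean = sum(1 for s in sums if abs(s - 12750) <= 1275) / len(sums)
assert near_mean > 0.85
```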