Linear Feedback Shift Registers for the Uninitiated, Part XVII: Reverse-Engineering the CRC

Jason Sachs●July 7, 2018●1 comment

Last time, we continued a discussion about error detection and correction by covering Reed-Solomon encoding. I was going to move on to another topic, but then there was this post on Reddit asking how to determine unknown CRC parameters:

I am seeking to reverse engineer an 8-bit CRC. I don’t know the generator code that’s used, but can lay my hands on any number of output sequences given an input sequence.

This is something I call the “unknown oracle” problem. You have a black-box system (the “oracle”) that implements CRC calculation. Maybe it’s a server on the Internet, maybe it’s a computer program for which you don’t have the source code. In any case, you can submit any number of different messages to the oracle, and it will send back a CRC value.

This article is available in PDF format for easy printing

But you don’t like relying on the oracle. It’s pissing you off. Maybe it charges you a fee every time you use it. Or you have to design a system, and it’s a liability to use some third-party software or service you can’t control. Or your arch-nemesis has programmed the oracle to respond to your CRC requests only through a Fran Drescher “The Nanny” Talking Doll which has been hacked with the appropriate voice samples. (“Oh, all right, ya really wanna know? It’s C9 03. Ah ha ha ha ha ha ha ha!”) So you’d like to figure out how to calculate the CRC yourself, and make sure that the oracle’s services are no longer necessary.

Can we do this? Absolutely!

Known Unknowns: Ice Cream Cone Endianness Again

The first thing we need is to figure out what information is required to compute a CRC that matches the oracle’s output.

We know we need to figure out the generating polynomial. But that’s not sufficient. I talked about CRCs in Part XV as well as in an earlier article, where I quoted Ross Williams’s Painless Guide to CRC Error Detection Algorithms, in which he mentions you need

Width of the poly (polynomial).

Value of the poly.

Initial value for the register.

Whether the bits of each byte are reflected before being processed.

Whether the algorithm feeds input bytes through the register or xors them with a byte from one end and then straight into the table.

Whether the final register value should be reversed (as in reflected versions).

Value to XOR with the final register value.

We’re going to take care of most of these. Here’s my list of important parameters:

generating polynomial $ g(x) $
initial value $ p_i(x) $
final value $ p_f(x) $
whether bytes are interpreted as least-significant-bit first or not.
whether the LFSR is shifted forwards or backwards (in finite field math, this amounts to multiplying by $ x $ or $ x^{-1} $, respectively)

The ones we won’t be using are

whether or not the generating polynomial’s coefficients are reversed — this is an internal implementation issue; if I use the polynomial $ g(x) = x^8 + x^4 + x^3 + x^2 + 1 $ (0x11D in hex, 100011101 in binary) in a bit-reversed manner, then it’s the same thing as if I used the polynomial $ g(x) = x^8 + x^6 + x^5 + x^4 + 1 $ (0x171 in hex, 101110001 in binary) in a non-bit-reversed manner. It doesn’t matter whether you order a chocolate strawberry peach vanilla banana pistachio peppermint lemon orange butterscotch ice cream cone, or a butterscotch orange lemon peppermint pistachio banana vanilla peach strawberry chocolate ice cream cone, as long as you and the ice cream vendor have the same idea of which scoop goes on top.

independent choices for whether the input is sent in LSB first, and whether the output is sent in LSB first. This is possible but seems silly. Work with consistent endian-ness.

As I mentioned in part XV, there are two equivalent methods of handling CRCs. One is to feed the input into an LFSR, and the other is a mathematical computation using finite fields.

The LFSR approach is as follows:

Create an LFSR with taps corresponding to the generating polynomial’s coefficients
Initialize the LFSR with the coefficients of $ p_i(x) $, the initial value
For each byte of the message $ M $, clock in the bits into the polynomial — LSB first or MSB first, depending on how you define the CRC. So the message Hi! is either 01001000 01101001 00100001 (MSB first) or 00010010 10010110 10000100 (LSB first).
Clock in $ N $ more zeros (where $ N $ is the number of bits in the LFSR and the degree of the generating polynomial)
Take the LFSR state and XOR it with the coefficients of $ p_f(x) $, the final value, then convert to bytes, in the appropriate bit and byte order.

The finite field approach is to compute $ c(x) = f_m(M(x)) = \left(x^{N+8m}p_i(x) + x^N M(x) + p_f(x)\right) \bmod g(x) $ where $ m $ is the number of bytes in the message $ M(x) $. (Oh, and remember that we’re working in $ GF(2)[x] $, so all the polynomial coefficient math is modulo 2, and both addition and subtraction translate into an XOR.) The bit and byte order polarities just determine how bytes translate into polynomial coefficients.

These approaches are equivalent, and if you don’t understand finite fields you can always stick to plain old LFSRs. But the finite field description of a CRC is going to help us reverse-engineer the CRC parameters.

!Kcab Nrut !Kcab Nrut

The forward vs. backwards option of CRCs is somewhat annoying, and we’re forced to deal with it, because some people decided they were going to right-shift LFSRs rather than left shift. As I mentioned, in finite-field algebra this corresponds to multiplying by $ x^{-1} $ rather than $ x $; in the LFSR, what we do is:

if the least significant bit of the LFSR is 1, we XOR the LFSR contents with the polynomial and right-shift, bringing a 1 bit into the most significant bit
otherwise we right-shift, bringing a 0 bit into the most significant bit

This approach is used with the Ethernet/PNG/PKZIP CRC; if you look at the PNG specification’s sample code, you’ll see this snippet for preparing a table:

/* Make the table for a fast CRC. */
void make_crc_table(void)
{
  unsigned long c;
  int n, k;

  for (n = 0; n < 256; n++) {
    c = (unsigned long) n;
    for (k = 0; k < 8; k++) {
      if (c & 1)
        c = 0xedb88320L ^ (c >> 1);
      else
        c = c >> 1;
    }
    crc_table[n] = c;
  }
  crc_table_computed = 1;
}

Here the polynomial is 0x1db710641 (which is 0xedb88320 left-shifted by 1 with a 1 in the LSB) and the table represents each of the possible 8-bit values multiplied by $ x^{-8} $ in the finite field.

It turns out that there’s an equivalent method of doing this, and that is to use the complementary polynomial (with coefficients reversed) and bit-reverse the entire LFSR contents:

from libgf2.gf2 import GF2QuotientRing

qrfwd = GF2QuotientRing(0x1db710641)
qrrev = GF2QuotientRing(0x104c11db7)

def bit_reverse(x,n):
    y = 0
    for _ in xrange(n):
        y <<= 1
        y |= (x&1)
        x >>= 1
    return y

print('GF2QuotientRing(0x%x), 2^31 shifted right by 32: %08x'
      % (qrfwd.coeffs, qrfwd.rshiftraw(0x80000000,32)))
bits_shifted_left = qrrev.lshiftraw(0x1,32)
print(('GF2QuotientRing(0x%x),     1 shifted left by 32: %08x\n'
      +'                                   ...and bit-reversed: %08x')
      % (qrrev.coeffs, bits_shifted_left, 
         bit_reverse(bits_shifted_left, 32)))

GF2QuotientRing(0x1db710641), 2^31 shifted right by 32: edb88320
GF2QuotientRing(0x104c11db7),     1 shifted left by 32: 04c11db7
                                   ...and bit-reversed: edb88320

So we can dispense with the need to deal with this $ x^{-1} $ business, and instead just calculate the CRC with a reversed polynomial and then reverse the output bits. This does have consequences, though, and we’ll deal with them later.

Affine vs. Linear

One important thing to note is that the general definition of a CRC is an affine function rather than a linear one — a truly linear function would produce zero output for zero input. But instead we have these pesky initial and final values. The function $ f(x) = mx $ is linear; if $ x = a_1x_1 + a_2x_2 $ is a linear superposition of two inputs, then $ f(x) $ is the same linear superposition of the outputs: $ f(x) = a_1f(x_1) + a_2f(x_2) $. Affine functions have an offset, like $ f(x) = mx+b $, and they mess up this linear superposition property a little bit. In the case of the CRC, we add an offset $ x^{N+8m}p_i(x) + p_f(x) $, making it an affine function for any fixed message length $ m $ in bytes. This is interesting because if we have any two messages $ M_1(x) $ and $ M_2(x) $ of length $ m $, and get CRCs $ c_1(x) = f_m(M_1(x)) $ and $ c_2(x) = f_m(M_2(x)) $, then the XOR of the CRCs is related to the XOR of the messages in a much simpler manner, and the initial and final values disappear:

$$c_1(x) + c_2(x) = x^N\left(M_1(x) + M_2(x)\right) \bmod g(x)$$

This also means that if $ z_m(x) = f_m(0) $ is the CRC result of feeding in an all-zero message of length $ m $, then $ \bar{c}(x) = c(x) + z_m(x) = x^N M(x) \bmod g(x) $ — we can get the initial and final values to go away if we compare with an all-zero message of the same length. Let’s see this in action:

from libgf2.util import parity

def crc_bitwise(message, polynomial, init=0, final=0, 
                lsbit_first=True, 
                lsbyte_first=True, verbose=False):
    field = GF2QuotientRing(polynomial)
    n = field.degree
    polytail = polynomial ^ (1<<n)
    # "tail" = all bits of the polynomial except the leading one
    crc = field.lshiftraw(init,n)
    # This method uses the tail to avoid those extra n shifts
    # (we then need to pre-shift the initial value by n)
    for byte in message:
        bytecode = ord(byte)
        if verbose:
            print "%s (0x%02x)" % (repr(byte), bytecode)
        for bitnum in xrange(8):
            i = bitnum if lsbit_first else (7-bitnum)
            bit = (bytecode >> i) & 1
            crc_prev = crc
            crc_prev_double = field.mulraw(2,crc)
            crc = crc_prev_double ^ (polytail if bit else 0)
            if verbose:
                print "%d %04x -> %04x -> %04x" % (bit, crc_prev, crc_prev_double, crc)
    if final:
        if verbose:
            print "%04x ^ %04x = %04x" % (crc, final, crc^final)
        crc ^= final
    # bit reverse within bytes
    if lsbit_first:
        out = 0
        for k in xrange(n):
            out <<= 1
            out ^= (crc >> ((n-1-k)^7)) & 1
        if verbose:
            print "bit-reversed(%04x) = %04x" % (crc, out)
        crc = out
    # byte reverse
    if lsbyte_first:
        out = 0
        for k in xrange((n+7)//8):
            out <<= 8
            out ^= (crc >> (k*8)) & 0xff
        if verbose:
            print "byte-reversed(%04x) = %04x" % (crc, out)   
        crc = out
    return crc

def hexlify(s, separator=' '):
    return separator.join('%02x' % ord(c) for c in s)

def bit_reverse(x,n):
    y = 0
    for _ in xrange(n):
        y <<= 1
        y |= (x&1)
        x >>= 1
    return y

def bit_reverse_in_bytes(x,n):
    """ bit-reverse each byte in x"""
    y = 0
    for k in xrange(n):
        y |= bit_reverse(x>>(8*k), 8) << (8*k)
    return y

def byte_reverse(x,n):
    y = 0
    for _ in xrange(n):
        y <<= 8
        y |= (x&0xff)
        x >>= 8
    return y

def optional_reverser(x,nbits,
                      reverse_bits_in_bytes,
                      reverse_bytes):
    if reverse_bytes:
        if reverse_bits_in_bytes:
            return bit_reverse(x,nbits)
        else:
            return byte_reverse(x,(nbits+7)//8)
    else:
        # bytes are in the same order
        if reverse_bits_in_bytes:
            return bit_reverse_in_bytes(x,(nbits+7)//8)
        else:
            return x

def bkm(bits, sink=None):
    """ 
    Berlekamp-Massey algorithm
    
    sink: used to store intermediate results for debugging or analysis
    """
    m = -1
    c = 1
    b = 1
    L = 0
    fullbits = 0
    nbits = len(bits)
    reversed_bits = 0
    for n in xrange(nbits):
        reversed_bits = (reversed_bits << 1) | (1 if bits[n] else 0)
        d = parity(reversed_bits&c)
        if sink is not None:
            sink.append(dict(n=n, m=m, L=L, b=b, c=c, d=d,
                             reversed_bits=reversed_bits))
        if d == 1:
            # Discrepancy!
            cprev = c
            c ^= b << (n-m)
            if 2*L <= n:
                L = n+1 - L
                b = cprev
                m = n
    if sink is not None:
        sink.append(dict(n=None, m=m, L=L, b=b, c=c))
    return c,L

class MysteryCrc(object):
    def __init__(self, poly, init, final, lsbit_first, lsbyte_first, name=None):
        self.poly = poly
        self.init = init
        self.final = final
        self.lsbit_first = lsbit_first
        self.lsbyte_first = lsbyte_first
        self.name = name
    def __call__(self, message):
        return crc_bitwise(message, self.poly, 
                           init=self.init,
                           final=self.final,
                           lsbit_first=self.lsbit_first,
                           lsbyte_first=self.lsbyte_first)
    def __repr__(self):
        return ('MysteryCrc(poly=0x%x,init=0x%x,final=0x%x,lsbit_first=%s,lsbyte_first=%s%s)' %
                (self.poly, self.init, self.final, self.lsbit_first, self.lsbyte_first,
                 '' if self.name is None else ',name=%s' % repr(self.name)))
    
mystery_crc_1 = MysteryCrc(0x11d, init=0x56, final=0x78, lsbit_first=False, lsbyte_first=False)
mystery_crc_2 = MysteryCrc(0x11d, init=0x56, final=0x78, lsbit_first=True, lsbyte_first=False)
mystery_crc_3 = MysteryCrc(0x169, init=0x89, final=0xff, lsbit_first=True, lsbyte_first=False)
mystery_crc_4 = MysteryCrc(0x1100b, init=0xffff, final=0xffff, lsbit_first=False, lsbyte_first=False)
mystery_crc_5 = MysteryCrc(0x1100b, init=0xffff, final=0xffff, lsbit_first=True, lsbyte_first=False)
mystery_crc_6 = MysteryCrc(0x103dd, init=0xffff, final=0xffff, lsbit_first=False, lsbyte_first=False)
mystery_crc_7 = MysteryCrc(0x103dd, init=0xffff, final=0xffff, lsbit_first=True, lsbyte_first=False)
mystery_crc_8 = MysteryCrc(0x103dd, init=0xffff, final=0xffff, lsbit_first=False, lsbyte_first=True)
mystery_crc_9 = MysteryCrc(0x103dd, init=0xffff, final=0xffff, lsbit_first=True, lsbyte_first=True)

messages = ['\x00\x00',
            '\x00\x01',
            '\x00\x02',
            '\x00\x03',
            '\x55\xaa',
            '\xaa\x01',
            '\xff\xab']
crcs = [mystery_crc_1(msg) for msg in messages]
c0 = crcs[0]
print ("M(x)       c(x) =  z2(x) + cbar(x)")
for message,c in zip(messages,crcs):
    print("%s -> 0x%04x = 0x%04x ^ 0x%04x" % 
          (hexlify(message), c, c0, c ^ c0))
    
print ("")
for i in [1,4]:
    print("0x%04x ^ 0x%04x = 0x%04x" %
          (crcs[i]^c0, crcs[i+1]^c0,
           crcs[i]^crcs[i+1]))

M(x)       c(x) =  z2(x) + cbar(x)
00 00 -> 0x0005 = 0x0005 ^ 0x0000
00 01 -> 0x0018 = 0x0005 ^ 0x001d
00 02 -> 0x003f = 0x0005 ^ 0x003a
00 03 -> 0x0022 = 0x0005 ^ 0x0027
55 aa -> 0x0049 = 0x0005 ^ 0x004c
aa 01 -> 0x0066 = 0x0005 ^ 0x0063
ff ab -> 0x002a = 0x0005 ^ 0x002f

0x001d ^ 0x003a = 0x0027
0x004c ^ 0x0063 = 0x002f

This simplified CRC $ \bar{c}(x) = c(x) + z_m(x) = x^N M(x) \bmod g(x) $ is shown in the code output above as the cbar(x) term on the right. It is linear: we can add messages 00 01 + 00 02 = 00 03 that yield simplified CRCs that add: 001d + 003a = 0027, and add messages 55 aa + aa 01 = ff ab that yield simplified CRCs 004c + 0063 = 002f.

Determining the Polynomial, Part I: Basis Functions

This will give us one way to determine the CRC’s generator polynomial. We can probe the CRC by first feeding in the all-zero message, and then using it to compute $ \bar{c}(x) $ for messages $ M(x) $ which have a single one bit. This ends up computing $ x^{N+a} \bmod g(x) $ for various values of $ a $, which is the position of the one bit in the message. For example, with a four-byte message ($ m=4 $) and $ a=2 $, then the test message is (in hex) 00 00 00 04; with $ a=3 $ the test message is 00 00 00 08. And so on.

import struct

def calc_shift_count(k, lsbit_first=True):
    return (k&0xfff8) + 7 - (k&7) if lsbit_first else k

def probe0(crcfunc, verbose=False):
    zerosamples = ['',
                   '\x00',
                   '\x00\x00',
                   '\x00\x00\x00',
                   '\x00\x00\x00\x00']
    zero_crcs = [crcfunc(msg) for msg in zerosamples]
    if verbose:
        for msg,c in zip(zerosamples,zero_crcs):
            print("[%s] -> %04x" % (hexlify(msg), c))
    return zero_crcs

def probe1(crcfunc, lsbit_first=True, verbose=False, nprobe=32):
    fmt, nbytes = ('>I',4) if nprobe <= 32 else ('>Q',8) 
    onesamples = [struct.pack(fmt,1<<calc_shift_count(k, lsbit_first))
                  for k in xrange(nprobe)]
    c0 = crcfunc('\x00'*nbytes)
    one_crcs = [crcfunc(msg) for msg in onesamples]
    for one, one_crc in zip(onesamples, one_crcs):
        if verbose:
            fmt = "[%s] -> %04x = crc(%s) + %04x lsb=%d"
            fmtargs = (hexlify(one), one_crc,
                       ' '.join('00' for _ in xrange(nbytes)),
                       one_crc ^ c0,
                       (one_crc ^ c0)&1)
            if lsbit_first:
                fmt += ' (rev=%04x)'
                fmtargs += (bit_reverse_in_bytes(fmtargs[-2], 2),)
            print(fmt % fmtargs)
    return c0, one_crcs

zero_crcs = probe0(mystery_crc_1, verbose=True)
c0, one_crcs = probe1(mystery_crc_1, lsbit_first=False, verbose=True)

[] -> 00e8
[00] -> 0093
[00 00] -> 0005
[00 00 00] -> 00a0
[00 00 00 00] -> 0068
[00 00 00 01] -> 0075 = crc(00 00 00 00) + 001d lsb=1
[00 00 00 02] -> 0052 = crc(00 00 00 00) + 003a lsb=0
[00 00 00 04] -> 001c = crc(00 00 00 00) + 0074 lsb=0
[00 00 00 08] -> 0080 = crc(00 00 00 00) + 00e8 lsb=0
[00 00 00 10] -> 00a5 = crc(00 00 00 00) + 00cd lsb=1
[00 00 00 20] -> 00ef = crc(00 00 00 00) + 0087 lsb=1
[00 00 00 40] -> 007b = crc(00 00 00 00) + 0013 lsb=1
[00 00 00 80] -> 004e = crc(00 00 00 00) + 0026 lsb=0
[00 00 01 00] -> 0024 = crc(00 00 00 00) + 004c lsb=0
[00 00 02 00] -> 00f0 = crc(00 00 00 00) + 0098 lsb=0
[00 00 04 00] -> 0045 = crc(00 00 00 00) + 002d lsb=1
[00 00 08 00] -> 0032 = crc(00 00 00 00) + 005a lsb=0
[00 00 10 00] -> 00dc = crc(00 00 00 00) + 00b4 lsb=0
[00 00 20 00] -> 001d = crc(00 00 00 00) + 0075 lsb=1
[00 00 40 00] -> 0082 = crc(00 00 00 00) + 00ea lsb=0
[00 00 80 00] -> 00a1 = crc(00 00 00 00) + 00c9 lsb=1
[00 01 00 00] -> 00e7 = crc(00 00 00 00) + 008f lsb=1
[00 02 00 00] -> 006b = crc(00 00 00 00) + 0003 lsb=1
[00 04 00 00] -> 006e = crc(00 00 00 00) + 0006 lsb=0
[00 08 00 00] -> 0064 = crc(00 00 00 00) + 000c lsb=0
[00 10 00 00] -> 0070 = crc(00 00 00 00) + 0018 lsb=0
[00 20 00 00] -> 0058 = crc(00 00 00 00) + 0030 lsb=0
[00 40 00 00] -> 0008 = crc(00 00 00 00) + 0060 lsb=0
[00 80 00 00] -> 00a8 = crc(00 00 00 00) + 00c0 lsb=0
[01 00 00 00] -> 00f5 = crc(00 00 00 00) + 009d lsb=1
[02 00 00 00] -> 004f = crc(00 00 00 00) + 0027 lsb=1
[04 00 00 00] -> 0026 = crc(00 00 00 00) + 004e lsb=0
[08 00 00 00] -> 00f4 = crc(00 00 00 00) + 009c lsb=0
[10 00 00 00] -> 004d = crc(00 00 00 00) + 0025 lsb=1
[20 00 00 00] -> 0022 = crc(00 00 00 00) + 004a lsb=0
[40 00 00 00] -> 00fc = crc(00 00 00 00) + 0094 lsb=0
[80 00 00 00] -> 005d = crc(00 00 00 00) + 0035 lsb=1

qr = GF2QuotientRing(0x11d)
for k in xrange(16):
    print '%04x' % qr.lshiftraw(1,k+8)

Now, the interesting thing here is that we can take one particular bit of the simplified CRC — say, the least significant bit — and we get an output bit sequence from an LFSR. This means we can use the Berlekamp-Massey algorithm to figure out the polynomial:

lsbs = [(c0^c)&1 for c in one_crcs]
poly1,n = bkm(list(reversed(lsbs)))
print '%x' % poly1

11d

Easy-peasy! The only problem is that we don’t know whether the CRC is LSB-first or not. If it is, then, for example, the probe message for $ a=0 $ is 00 00 00 80 rather than 00 00 00 01, and the probe message for $ a=1 $ is 00 00 00 40 rather than 00 00 00 02. If we don’t know whether the CRC is LSB-first or not, we have two different sequences to plug into the Berlekamp-Massey algorithm. One of them will compute the polynomial, and the other will compute a different polynomial which is of higher degree.

_ = probe0(mystery_crc_2, verbose=True)
_ = probe1(mystery_crc_2, lsbit_first=True, verbose=True)

[] -> 0017
[00] -> 00c9
[00 00] -> 00a0
[00 00 00] -> 0005
[00 00 00 00] -> 0016
[00 00 00 80] -> 00ae = crc(00 00 00 00) + 00b8 lsb=0 (rev=001d)
[00 00 00 40] -> 004a = crc(00 00 00 00) + 005c lsb=0 (rev=003a)
[00 00 00 20] -> 0038 = crc(00 00 00 00) + 002e lsb=0 (rev=0074)
[00 00 00 10] -> 0001 = crc(00 00 00 00) + 0017 lsb=1 (rev=00e8)
[00 00 00 08] -> 00a5 = crc(00 00 00 00) + 00b3 lsb=1 (rev=00cd)
[00 00 00 04] -> 00f7 = crc(00 00 00 00) + 00e1 lsb=1 (rev=0087)
[00 00 00 02] -> 00de = crc(00 00 00 00) + 00c8 lsb=0 (rev=0013)
[00 00 00 01] -> 0072 = crc(00 00 00 00) + 0064 lsb=0 (rev=0026)
[00 00 80 00] -> 0024 = crc(00 00 00 00) + 0032 lsb=0 (rev=004c)
[00 00 40 00] -> 000f = crc(00 00 00 00) + 0019 lsb=1 (rev=0098)
[00 00 20 00] -> 00a2 = crc(00 00 00 00) + 00b4 lsb=0 (rev=002d)
[00 00 10 00] -> 004c = crc(00 00 00 00) + 005a lsb=0 (rev=005a)
[00 00 08 00] -> 003b = crc(00 00 00 00) + 002d lsb=1 (rev=00b4)
[00 00 04 00] -> 00b8 = crc(00 00 00 00) + 00ae lsb=0 (rev=0075)
[00 00 02 00] -> 0041 = crc(00 00 00 00) + 0057 lsb=1 (rev=00ea)
[00 00 01 00] -> 0085 = crc(00 00 00 00) + 0093 lsb=1 (rev=00c9)
[00 80 00 00] -> 00e7 = crc(00 00 00 00) + 00f1 lsb=1 (rev=008f)
[00 40 00 00] -> 00d6 = crc(00 00 00 00) + 00c0 lsb=0 (rev=0003)
[00 20 00 00] -> 0076 = crc(00 00 00 00) + 0060 lsb=0 (rev=0006)
[00 10 00 00] -> 0026 = crc(00 00 00 00) + 0030 lsb=0 (rev=000c)
[00 08 00 00] -> 000e = crc(00 00 00 00) + 0018 lsb=0 (rev=0018)
[00 04 00 00] -> 001a = crc(00 00 00 00) + 000c lsb=0 (rev=0030)
[00 02 00 00] -> 0010 = crc(00 00 00 00) + 0006 lsb=0 (rev=0060)
[00 01 00 00] -> 0015 = crc(00 00 00 00) + 0003 lsb=1 (rev=00c0)
[80 00 00 00] -> 00af = crc(00 00 00 00) + 00b9 lsb=1 (rev=009d)
[40 00 00 00] -> 00f2 = crc(00 00 00 00) + 00e4 lsb=0 (rev=0027)
[20 00 00 00] -> 0064 = crc(00 00 00 00) + 0072 lsb=0 (rev=004e)
[10 00 00 00] -> 002f = crc(00 00 00 00) + 0039 lsb=1 (rev=009c)
[08 00 00 00] -> 00b2 = crc(00 00 00 00) + 00a4 lsb=0 (rev=0025)
[04 00 00 00] -> 0044 = crc(00 00 00 00) + 0052 lsb=0 (rev=004a)
[02 00 00 00] -> 003f = crc(00 00 00 00) + 0029 lsb=1 (rev=0094)
[01 00 00 00] -> 00ba = crc(00 00 00 00) + 00ac lsb=0 (rev=0035)

for lsbit_first in [False, True]:
    c0, one_crcs = probe1(mystery_crc_2, lsbit_first=lsbit_first)
    poly2,n = bkm([(c0^c)&1 for c in reversed(one_crcs)])
    print ('lsbit_first=%5s: computed generator polynomial g(x) = %x'
           % (lsbit_first, poly2))

lsbit_first=False: computed generator polynomial g(x) = c33
lsbit_first= True: computed generator polynomial g(x) = 11d

Berlekamp-Massey requires that for a polynomial of degree $ N $, you need at least $ 2N $ bits of a sequence to ensure that the result is correct. For an 8-bit CRC that means we need to try 17 different messages (all zeros, and 16 one-bit messages); for a 16-bit CRC we need to try 33 different messages.

We’ll show how to reduce the number of messages later, but for now let’s move on, because we have a way to determine the generator polynomial and whether the CRC takes in bits LSB-first or not. Next is the initial value $ p_i(x) $.

Determining the Initial and Final Values

Since we have the generator polynomial, we can use some finite field arithmetic to help figure out the initial value. Let’s feed in a couple of all-zero messages of various lengths, and compute the XOR between each message and the “empty” message of length zero:

zero_crcs = probe0(mystery_crc_1)
for m, z in enumerate(zero_crcs):
    if m == 0:
        print('m=0:            [] -> %04x' % z)
    else:
        print('m=%d: %13s -> %04x = %04x ^ %04x' % 
          (m, '['+' '.join(['00']*m)+']',
           z, zero_crcs[0], z^zero_crcs[0]))

m=0:            [] -> 00e8
m=1:          [00] -> 0093 = 00e8 ^ 007b
m=2:       [00 00] -> 0005 = 00e8 ^ 00ed
m=3:    [00 00 00] -> 00a0 = 00e8 ^ 0048
m=4: [00 00 00 00] -> 0068 = 00e8 ^ 0080

If we go back to the CRC equation $ c(x) = \left(x^{N+8m}p_i(x) + x^N M(x) + p_f(x)\right) \bmod g(x) $, it turns out that $ z_m(x) $, which is the CRC of the all-zero message $ M(x)=0 $ of length $ m $, has the value $ z_m(x) = \left(x^{N+8m}p_i(x) + p_f(x)\right) \bmod g(x) $.

This means that $ z_m(x) + z_0(x) = x^N\left(x^{8m} + 1\right)p_i(x) \bmod g(x) $, and we should be able to determine $ p_i(x) $ from the empty message, which has CRC $ z_0(x), $ and the single-byte message 00, which has CRC $ z_1(x) $:

$$p_i(x) = \frac{z_1(x) + z_0(x)}{x^N(x^8 + 1)} \bmod g(x)$$

qr = GF2QuotientRing(0x11d)
# let's predict zm(x) + z0(x)
init = 0x56
init_shifted = qr.lshiftraw(init,8)
for m in xrange(5):
    print '%x' % (qr.lshiftraw(init_shifted,m*8) ^ init_shifted)

0
7b
ed
48
80

init = qr.divraw(zero_crcs[1]^zero_crcs[0], 0x10100)
print 'initial value = 0x%x' % init

initial value = 0x56

There! Now for the final value we can just solve $ z_0(x) = x^N p_i(x) + p_f(x) $ to get $ p_f(x) = x^N p_i(x) + z_0(x) $ :

final = qr.lshiftraw(init,8) ^ zero_crcs[0]
print 'final value = 0x%x' % final

final value = 0x78

Now let’s tie this all together and automate it:

def reverse_engineer_crc_bkm(crcfunc, nprobe=32):
    """
    Reverse-engineer CRC using Berlekamp-Massey
    """
    zero_crcs = probe0(crcfunc)
    c0_t, one_crcs_t = probe1(crcfunc, lsbit_first=True, nprobe=nprobe)
    poly_t,n_t = bkm([(c0_t^c)&1 for c in reversed(one_crcs_t)])
    c0_f, one_crcs_f = probe1(crcfunc, lsbit_first=False, nprobe=nprobe)
    poly_f,n_f = bkm([(c0_f^c)&1 for c in reversed(one_crcs_f)])
    # We have two choices. See which one is consistent.
    qr = GF2QuotientRing(poly_t)
    polytail_t1 = one_crcs_t[0] ^ c0_t
    polytail_t0 = poly_t ^ (1<<n_t)
    if bit_reverse_in_bytes(polytail_t1, n_t//8) == polytail_t0:
        lsbit_first = True
        lsbyte_first = False
        poly,n = poly_t,n_t
    elif bit_reverse(polytail_t1, n_t) == polytail_t0:
        lsbit_first = True
        lsbyte_first = True
        poly,n = poly_t,n_t
    else:
        polytail_f1 = one_crcs_f[0] ^ c0_f
        polytail_f0 = poly_f ^ (1<<n_f)
        qr = GF2QuotientRing(poly_f)
        poly,n = poly_f,n_f
        lsbit_first = False
        if polytail_f1 == polytail_f0:
            lsbyte_first = False
        elif byte_reverse(polytail_f1, n_f//8) == polytail_f0:
            lsbyte_first = True
        else:
            raise ValueError('Could not find a matching polynomial')
    # Undo the bit/byte reversing, so we can compute the polynomials
    zero_crcs = [optional_reverser(z,n,reverse_bits_in_bytes=lsbit_first,
                                   reverse_bytes=lsbyte_first)
                 for z in zero_crcs]
    qr = GF2QuotientRing(poly)
    
    # Now compute the initial and final values
    u = 0x101 
    init = qr.divraw(zero_crcs[1]^zero_crcs[0],qr.lshiftraw(u,n))
    final = qr.lshiftraw(init,n) ^ zero_crcs[0]
    return poly, lsbit_first, lsbyte_first, init, final

for crcfunc in [mystery_crc_1, mystery_crc_2, mystery_crc_3, 
                mystery_crc_4, mystery_crc_5, mystery_crc_6,
                mystery_crc_7, mystery_crc_8, mystery_crc_9]:
    crcparams = reverse_engineer_crc_bkm(crcfunc)
    print("poly=0x%05x, lsbit_first=%5s, lsbyte_first=%5s, init=0x%04x, final=0x%04x" %
          crcparams)
    print("  detected from %s" % crcfunc)
    if crcparams == (crcfunc.poly, crcfunc.lsbit_first, crcfunc.lsbyte_first,
                     crcfunc.init, crcfunc.final):
        print("  Match!")
    else:
        print("  Mismatch... hmmm")

poly=0x0011d, lsbit_first=False, lsbyte_first=False, init=0x0056, final=0x0078
  detected from MysteryCrc(poly=0x11d,init=0x56,final=0x78,lsbit_first=False,lsbyte_first=False)
  Match!
poly=0x0011d, lsbit_first= True, lsbyte_first=False, init=0x0056, final=0x0078
  detected from MysteryCrc(poly=0x11d,init=0x56,final=0x78,lsbit_first=True,lsbyte_first=False)
  Match!
poly=0x00169, lsbit_first= True, lsbyte_first=False, init=0x0089, final=0x00ff
  detected from MysteryCrc(poly=0x169,init=0x89,final=0xff,lsbit_first=True,lsbyte_first=False)
  Match!
poly=0x1100b, lsbit_first=False, lsbyte_first=False, init=0xffff, final=0xffff
  detected from MysteryCrc(poly=0x1100b,init=0xffff,final=0xffff,lsbit_first=False,lsbyte_first=False)
  Match!
poly=0x1100b, lsbit_first= True, lsbyte_first=False, init=0xffff, final=0xffff
  detected from MysteryCrc(poly=0x1100b,init=0xffff,final=0xffff,lsbit_first=True,lsbyte_first=False)
  Match!
poly=0x103dd, lsbit_first=False, lsbyte_first=False, init=0xffff, final=0xffff
  detected from MysteryCrc(poly=0x103dd,init=0xffff,final=0xffff,lsbit_first=False,lsbyte_first=False)
  Match!
poly=0x103dd, lsbit_first= True, lsbyte_first=False, init=0xffff, final=0xffff
  detected from MysteryCrc(poly=0x103dd,init=0xffff,final=0xffff,lsbit_first=True,lsbyte_first=False)
  Match!
poly=0x103dd, lsbit_first=False, lsbyte_first= True, init=0xffff, final=0xffff
  detected from MysteryCrc(poly=0x103dd,init=0xffff,final=0xffff,lsbit_first=False,lsbyte_first=True)
  Match!
poly=0x103dd, lsbit_first= True, lsbyte_first= True, init=0xffff, final=0xffff
  detected from MysteryCrc(poly=0x103dd,init=0xffff,final=0xffff,lsbit_first=True,lsbyte_first=True)
  Match!

Whee! We did it!

But We Don’t Really Need Berlekamp-Massey After All!

Okay, there’s a shorter approach to getting the generator polynomial. I mentioned in passing in Part XVI that there’s an easy way to get the generator polynomial for Reed-Solomon, which is to pass in a message $ M(x) = 1 $, and then what you get for a remainder is the generator polynomial without the leading monomial term. We can do the same thing with CRCs, computing $ \bar{c}(x) = c(M_0) + c(M_0 + 1) $:

_ = probe0(mystery_crc_1, verbose=True)
print ("M(x)             c(x) = z4(x)            + cbar(x)")
_ = probe1(mystery_crc_1, lsbit_first=False, verbose=True, nprobe=1)

[] -> 00e8
[00] -> 0093
[00 00] -> 0005
[00 00 00] -> 00a0
[00 00 00 00] -> 0068
M(x)             c(x) = z4(x)            + cbar(x)
[00 00 00 01] -> 0075 = crc(00 00 00 00) + 001d lsb=1

We get 1d, which is the “tail” of the polynomial 0x11d, without its leading monomial term. No need to run Berlekamp-Massey!

Technically we can use any message $ M_0 $; it doesn’t have to be the all-zero polynomial, we just need to compute two CRCs, one with the least significant bit flipped.

Now, we also need a way to distinguish the LSB-first and MSB-first cases, and for this, we need a consistency check. It looks like the minimum number of messages to determine the generator polynomial and whether it is LSB-first or MSB-first, is 3:

[00] – zero polynomial
[01] – the difference between this and [00] should be the tail of the generator polynomial, if MSB-first
[80] – the difference between this and [00] should be the tail of the generator polynomial, if LSB-first

When I first looked at this approach, it seemed to me like the magic number was 5, not 3, because we need a way to confirm consistency of MSB-first or LSB-first, but the [01] and [80] messages can each be used for consistency with each other, because one of them produces a value of $ \bar{c}(x) $ that is $ x^7 $ times the other.

Add in the empty message [] for determining the initial and final values, and we should be able to get all the parameters after getting the CRCs for 4 messages. (That is, with one small caveat, which we’ll show in a minute.)

Let’s do it!

from libgf2.gf2 import _gf2exteuc, _gf2divmod

class NoSolutionsError(ValueError):
    pass
class MultipleSolutionsError(ValueError):
    pass

def reverse_engineer_crc(crcfunc, n=None, trace=None, nmax=64):
    """
    Reverse-engineer CRC using only four test polynomials
    (except in rare cases)
    """
    z0 = crcfunc('')
    z1 = crcfunc('\x00')
    zM = crcfunc('\x01')
    zL = crcfunc('\x80')
    
    def get_consistent_qr(n, allow_multiple_solutions=False):
        n8 = (n+7)//8

        # Check for a consistent solution.
        solutions = []
        for lsbit_first, lsbyte_first in [(False,False),
                                          (False,True),
                                          (True,False),
                                          (True,True)]:
            if n8 == 1 and lsbyte_first:
                continue
                # don't reverse bytes for a 1-byte case
            polytail = bit_reverse_in_bytes(zL ^ z1, n8) if lsbit_first else (zM ^ z1)
            othertail = bit_reverse_in_bytes(zM ^ z1, n8) if lsbit_first else (zL ^ z1)
            if lsbyte_first:
                polytail = byte_reverse(polytail, n8)
                othertail = byte_reverse(othertail, n8)
            g = (1 << n) ^ polytail
            qr = GF2QuotientRing(g)
            if othertail == qr.lshiftraw(1, n+7):
                solutions.append((qr, lsbit_first, lsbyte_first))
                
        if not solutions:
            raise NoSolutionsError("Could not find a matching polynomial")
        if allow_multiple_solutions:
            return solutions

        if len(solutions) > 1:
            raise MultipleSolutionsError("Found multiple consistent solutions: %s" % solutions)
            
        return solutions[0]

    autodetect = n is None
    if autodetect:
        n = max(z.bit_length() for z in [z0, z1, zM, zL])
        # minimum size of CRC, we need to double-check this
        while True:
            if trace is not None:
                trace.append(n)
            n8 = (n+7)//8
            try:
                solutions = get_consistent_qr(n, allow_multiple_solutions=True)
            except ValueError:
                n += 1
                if n >= nmax:
                    raise
                else:
                    continue
                    
            if len(solutions) > 1:
                # eliminate polynomials with no unity coefficient
                solutions = [solution for solution in solutions if (solution[0].coeffs & 1)]
                if len(solutions) > 1:
                    raise MultipleSolutionsError("Still have multiple solutions: %s" % solutions)
            qr, lsbit_first, lsbyte_first = solutions[0]
            
            
            # We have a possible solution. Now we just have to run some extra checks
            # on some new patterns:
            # - za = CRC(all-zero message of length n8)
            # - zb = CRC(x^-1)
            #
            # Checks are:
            # - lengths of the CRCs of these two messages
            #   are <= n8 (otherwise we increase candidate n and repeat)
            # - za ^ zb == 1 << (N-1)
            za = crcfunc('\x00' * (n8))
            n_next = za.bit_length()
            if n_next > n:
                n = n_next
                continue

            # Use x^-1 -- its CRC should be za ^ (1 << (N-1))
            onehalf = qr.rshiftraw1(1)
            bytecodes = [(onehalf >> (i-1)*8) & 0xff for i in xrange(n8,0,-1)]
            if lsbit_first:
                bytecodes = [bit_reverse(b,8) for b in bytecodes]
            Mb = ''.join(chr(b) for b in bytecodes)
            zb = crcfunc(Mb)
            n_next = zb.bit_length()
            if n_next > n:
                n = n_next
                continue
                
            cbar = za ^ zb
            
            if lsbyte_first:
                if lsbit_first:
                    if cbar == 1:
                        break
                else:
                    if cbar == 0x80:
                        break
            else:
                if lsbit_first:
                    if cbar == 1 << (16*n8 - 8 - n):
                        break
                else:
                    if cbar == 1 << (n-1):
                        break
                    
            n += 1
            if n > nmax:
                raise ValueError("Can't find solution")
            
    # no autodetect
    else:
        n8 = (n+7)//8
        qr, lsbit_first, lsbyte_first = get_consistent_qr(n)

    # Undo the bit/byte reversing, so we can compute the polynomials
    zero_crcs = [optional_reverser(z,n,reverse_bits_in_bytes=lsbit_first,
                                   reverse_bytes=lsbyte_first)
                 for z in [z0,z1]]

    poly = qr.coeffs

    # Now compute the initial and final values
    a = 0x101 << n
    d,u,v = _gf2exteuc(a,poly)
    cbar = zero_crcs[1] ^ zero_crcs[0]
    e,r = _gf2divmod(cbar,d)
    assert r == 0
    init = qr.mulraw(u,e)
    final = qr.lshiftraw(init,n) ^ zero_crcs[0]
    return poly, lsbit_first, lsbyte_first, init, final

for crcfunc in [mystery_crc_1, mystery_crc_2, mystery_crc_3, 
                mystery_crc_4, mystery_crc_5, mystery_crc_6, 
                mystery_crc_7, mystery_crc_8, mystery_crc_9]:
    crcparams = reverse_engineer_crc(crcfunc)
    crcparams = reverse_engineer_crc_bkm(crcfunc)
    print("poly=0x%05x, lsbit_first=%5s, lsbyte_first=%5s, init=0x%04x, final=0x%04x" %
          crcparams)
    print("  detected from %s" % crcfunc)
    if crcparams == (crcfunc.poly, crcfunc.lsbit_first, crcfunc.lsbyte_first,
                     crcfunc.init, crcfunc.final):
        print("  Match!")
    else:
        print("  Mismatch... hmmm")

poly=0x0011d, lsbit_first=False, lsbyte_first=False, init=0x0056, final=0x0078
  detected from MysteryCrc(poly=0x11d,init=0x56,final=0x78,lsbit_first=False,lsbyte_first=False)
  Match!
poly=0x0011d, lsbit_first= True, lsbyte_first=False, init=0x0056, final=0x0078
  detected from MysteryCrc(poly=0x11d,init=0x56,final=0x78,lsbit_first=True,lsbyte_first=False)
  Match!
poly=0x00169, lsbit_first= True, lsbyte_first=False, init=0x0089, final=0x00ff
  detected from MysteryCrc(poly=0x169,init=0x89,final=0xff,lsbit_first=True,lsbyte_first=False)
  Match!
poly=0x1100b, lsbit_first=False, lsbyte_first=False, init=0xffff, final=0xffff
  detected from MysteryCrc(poly=0x1100b,init=0xffff,final=0xffff,lsbit_first=False,lsbyte_first=False)
  Match!
poly=0x1100b, lsbit_first= True, lsbyte_first=False, init=0xffff, final=0xffff
  detected from MysteryCrc(poly=0x1100b,init=0xffff,final=0xffff,lsbit_first=True,lsbyte_first=False)
  Match!
poly=0x103dd, lsbit_first=False, lsbyte_first=False, init=0xffff, final=0xffff
  detected from MysteryCrc(poly=0x103dd,init=0xffff,final=0xffff,lsbit_first=False,lsbyte_first=False)
  Match!
poly=0x103dd, lsbit_first= True, lsbyte_first=False, init=0xffff, final=0xffff
  detected from MysteryCrc(poly=0x103dd,init=0xffff,final=0xffff,lsbit_first=True,lsbyte_first=False)
  Match!
poly=0x103dd, lsbit_first=False, lsbyte_first= True, init=0xffff, final=0xffff
  detected from MysteryCrc(poly=0x103dd,init=0xffff,final=0xffff,lsbit_first=False,lsbyte_first=True)
  Match!
poly=0x103dd, lsbit_first= True, lsbyte_first= True, init=0xffff, final=0xffff
  detected from MysteryCrc(poly=0x103dd,init=0xffff,final=0xffff,lsbit_first=True,lsbyte_first=True)
  Match!

Now, I’ve handwaved over a few important things.

One is that we need to know $ N $, the size of the CRC in bits. If you know that, we’re done. Otherwise it’s not necessarily obvious from getting CRC values of these four cases. We can get a lower bound for $ N $ by looking at the maximum bit size of the CRC values returned by each of these four test polynomials. Unfortunately, in some cases it doesn’t necessarily tell you what $ N $ is.

A reliable way to verify the generating polynomial $ g(x) $ is correct along with its degree $ N $, is to compute $ \bar{c}(x) = f_m(M_0(x) + H(x)) + f_m(M_0(x)) $ for some arbitrary $ M_0(x) $ (the message of $ m = \lfloor N/8 \rfloor $ zeros will do), and the particular polynomial $ H(x) = x^{-1} \bmod g(x) $, because then $ \bar{c}(x) = x^Nx^-1 = x^{N-1} $ which is a CRC consisting of a 1 bit and $ N-1 $ zero bits.

My reverse_engineer_crc() function does this, increasing the value of $ N $ until we get a verification of this type.

mystery_crc_10 = MysteryCrc(0x1000000af, init=0xde, final=0xad, lsbit_first=False, lsbyte_first=False)

trace_n = []
a,b,c,d,e=reverse_engineer_crc(mystery_crc_10, trace=trace_n)
print('poly=%x lsbit_first=%s lsbyte_first=%s init=%x final=%x' % (a,b,c,d,e))
print("Sequence of N values tried during autodetection: %s" % trace_n)

poly=1000000af lsbit_first=False lsbyte_first=False init=de final=ad
Sequence of N values tried during autodetection: [23, 30, 32]

The second handwavy thing is that we’re assuming that we can tell whether the CRC is LSB-first or not and whether the bytes are reversed or not, because if we just try all four possibilities, only one will work and the rest will not. But I don’t have any proof that this is true, that there aren’t any cases where we can get consistent results for both LSB-first and MSB-first interpretations of the CRC. I’m not going to worry about that. If you find such a counterexample, let me know.

There’s a third handwavy thing; we’ll get to it in a minute.

CRCs in the Wild

So far we’ve been analyzing CRC oracles that we created. That’s a good start, but let’s try using these techniques on real systems that are out there in the wild.

We’ll start with zlib.crc32 in the Python standard library.

import zlib

def crc32_zlib(x):
    return zlib.crc32(x) & 0xffffffff

a,b,c,d,e = reverse_engineer_crc(crc32_zlib)
print('poly=%x lsbit_first=%s lsbyte_first=%s init=%x final=%x' % (a,b,c,d,e))

poly=104c11db7 lsbit_first=True lsbyte_first=True init=46af6449 final=ffffffff

The zlib module’s CRC32 algorithm is the same as the one used in Ethernet, PNGs, and PKZIP. If you look at the PNG code for CRC, you’ll see that it is initialized with 0xffffffff and has a final XOR value of 0xffffffff. Why doesn’t the initial value match?

Part of the reason is a difference in definition. Our definition is that we calculate $ c(x) = \left(x^{N+8m}p_i(x) + x^N M(x) + p_f(x)\right) \bmod g(x) $ — effectively, the initial value $ p_i(x) $ is a phantom input concatenated with the message bytes, and it gets shifted $ N $ times before the message bytes get shifted in. We can avoid this shift (which the PNG code does) by pre-shifting the initial value. If we do this, it can no longer be considered a concatenation with the message bytes, but in the case of the PNG code it does lead to a simple value:

qr = GF2QuotientRing(0x104c11db7)
'%8x' % qr.lshiftraw(0x46af6449,32)

'ffffffff'

The other reason is that we are using a CRC convention where we always shift left, and in transforming the CRC32 algorithm — which is constructed using shift-right operations — into this form, we had to reverse the polynomial, as I mentioned earlier, from 0x1db710641 to 0x104c11db7, and the initial value, depending on what form it is expressed in, can change as a result. I suspect that in the form we use, where we precede the message by phantom content, the initial value does not change if we swap implementations to use a right-shift with a reversed polynomial and reversed output bits, because both implementations produce the same output from the same input — but I’m not completely sure.

Now let’s have some more fun and see what happens when we treat Lammert Bies’s website as the oracle. You can type in stuff by hand if you like, but I’m going to automate it.

import urllib
import urllib2
from IPython.core.display import HTML

def fetch_crcs_from_lammerts_website(msg, use_ascii=False):
    """
    Get CRCs from this URL:
    https://www.lammertbies.nl/comm/info/crc-calculation.php?crc=%22HEXDIGITS%22&method=hex
    where HEXDIGITS is replaced by the message, encoded as hex.
    This returns an HTML table.
    """
    url = 'https://www.lammertbies.nl/comm/info/crc-calculation.php'
    response = urllib2.urlopen(url + '?' + urllib.urlencode(dict(
                                  crc=msg if use_ascii
                                          else ('"' + msg.encode('hex') + '"'),
                                  method='ascii' if use_ascii else 'hex')))
    try:
        return response.read()
    finally:
        response.close()

HTML(fetch_crcs_from_lammerts_website('\x00\x01'))

"0001" (hex)
1 byte checksum	1
CRC-16	0xC0C1
CRC-16 (Modbus)	0x70C0
CRC-16 (Sick)	0x0100
CRC-CCITT (XModem)	0x1021
CRC-CCITT (0xFFFF)	0x0D2E
CRC-CCITT (0x1D0F)	0x94E1
CRC-CCITT (Kermit)	0x8911
CRC-DNP	0xA1C9
CRC-32	0x36DE2269

from bs4 import BeautifulSoup

def parse_rows_from_lammerts_website(html):
    """
    Return the CRC and name from HTML
    in the form used in Lammert's website.
    The first row is the header and the second is a useless checksum.
    """
    soup = BeautifulSoup(html, 'lxml')
    rows = soup.find_all('tr')
    return [(int(row.find_next('td').find_next('td').b.text, 16),
             row.find_next('td').text)
            for row in rows[2:]]

for crc, name in parse_rows_from_lammerts_website(
    fetch_crcs_from_lammerts_website('\x00\x01')):
    print '%x %s' % (crc, name)

c0c1 CRC-16
70c0 CRC-16 (Modbus)
100 CRC-16 (Sick)
1021 CRC-CCITT (XModem)
d2e CRC-CCITT (0xFFFF)
94e1 CRC-CCITT (0x1D0F)
8911 CRC-CCITT (Kermit)
a1c9 CRC-DNP
36de2269 CRC-32

def create_oracle_from_lammerts_website(crcnum, apply_name=False):
    def oracle(msg, include_name=False):
        row = parse_rows_from_lammerts_website(
            fetch_crcs_from_lammerts_website(msg))[crcnum]
        if apply_name:
            oracle.__name__ = str(row[1])
        return row if include_name else row[0]
    return oracle

crc32_lammert = create_oracle_from_lammerts_website(8)

print '%08x' % crc32_lammert('\x00\x01')

36de2269

a,b,c,d,e=reverse_engineer_crc(crc32_lammert)
print("poly=%x, lsbit_first=%s, lsbyte_first=%s, init=%x, final=%x" %
      (a,b,c,d,e))

poly=104c11db7, lsbit_first=True, lsbyte_first=True, init=46af6449, final=ffffffff

Like Robinson Crusoe, It’s Primitive as Can Be

Now back to the third handwavy thing that we’ve glossed over as we reverse-engineered the CRC, and it’s something we will often encounter when working with CRC16 variants.

I’m going to use some code for the 16-bit X.25 CRC that I covered in an earlier article.

class CRC_X25(object):
    def __init__(self):
        poly = 0x1021
        self.poly = poly
        self.brevtable, self.table = self.create_tables()
    def create_tables(self):
        brevtable = []
        # Bit-reverse table for bytes
        table = []
        # Each element j of the table has a value
        # of x^16*p{j} mod poly, where p{j} is
        # the GF(2) polynomial interpretation
        # of the bits of integer j
        for j in xrange(256):
            bv0 = j
            brev = 0
            bval = j << 8
            for k in xrange(8):
                bval = self._left_shift(bval)
                brev <<= 1
                if bv0 & 1:
                    brev |= 1
                bv0 >>= 1
            table.append(bval)
            brevtable.append(brev)
        return tuple(brevtable), tuple(table)
    def _left_shift(self, bval):
        bval <<= 1
        if bval & 0x10000:
            bval ^= 0x10000
            bval ^= self.poly
        return bval
    def calc_bitwise(self, data):
        crc = 0xFFFF
        for b in data:
            bval = ord(b)
            for k in xrange(8):
                crc = self._left_shift(crc)
                if bval & (1 << k):
                    crc ^= self.poly
        crc ^= 0xFFFF
        return chr(self.brevtable[crc >> 8]) + chr(self.brevtable[crc & 0xff])
    def calc_table(self, data):
        crc = 0xFFFF
        for b in data:
            bval = self.brevtable[ord(b)]
            crc = ((crc << 8) & 0xFFFF) ^ self.table[((crc >> 8) ^ bval) & 0x00FF];
        crc ^= 0xFFFF        
        return chr(self.brevtable[crc >> 8]) + chr(self.brevtable[crc & 0xff])
    def __call__(self, data):
        return self.calc_table(data)
    
crc_x25_obj = CRC_X25()

def crc_x25(message):
    crc_binary = crc_x25_obj(message)
    y = 0
    for c in crc_binary:
        y <<= 8
        y ^= ord(c)
    return y

z0 = crc_x25('')
z1 = crc_x25('\x00')
zM = crc_x25('\x01')
zL = crc_x25('\x80')

for z,name in [(z0,'z0'),
               (z1,'z1'),
               (zM,'zM'),
               (zL,'zL')]:
    print '%s=%04x' % (name,z)
    
print("zL ^ z1 = %04x" % (z1^zL))
print("bit-reversed within bytes: %04x" % bit_reverse_in_bytes(z1^zL,2))

z0=0000
z1=78f0
zM=f1e1
zL=7074
zL ^ z1 = 0884
bit-reversed within bytes: 1021

We can get the polynomial with no problem; the tail 0x1021 (entire polynomial 0x11021) is consistent with the generator polynomial $ g(x) = x^{16} + x^{12} + x^5 + 1 $ mentioned in X.25 section 2.2.7.4.

The problem comes when we try to tease out the initial and final values.

g = 0x11021
qr = GF2QuotientRing(g)
n = qr.degree
a = 0x101 
cbar = bit_reverse_in_bytes(z1^z0,2)
init = qr.divraw(cbar,qr.lshiftraw(a,n))
init

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
 in ()
      4 a = 0x101
      5 cbar = bit_reverse_in_bytes(z1^z0,2)
----> 6 init = qr.divraw(cbar,qr.lshiftraw(a,n))
      7 init

/Users/jmsachs/storage/jms/hobbies/blogs/embedded/repo/scripts/libgf2/libgf2/gf2.pyc in divraw(self, a, b)
    774             If `b` and the characteristic polynomial have a common factor.
    775         """
--> 776         return self.mulraw(a,self.invraw(b))
    777     def div(self, a, b):
    778         """

/Users/jmsachs/storage/jms/hobbies/blogs/embedded/repo/scripts/libgf2/libgf2/gf2.pyc in invraw(self, a)
    730         (r,y,_) = _GF2.exteuc(a,self.coeffs)
    731         if r != 1:
--> 732             raise ValueError('%x and %x are not relatively prime but have a common factor of %x' % (a,self.coeffs,r))
    733         return y
    734     def inv(self, a):

ValueError: 2310 and 11021 are not relatively prime but have a common factor of 3

It turns out that $ g(x) = x^{16} + x^{12} + x^5 + 1 $ is not a primitive polynomial: we can factor $ g(x) = (x+1)(x^{15}+x^{14}+x^{13}+x^{12}+x^4+x^3+x^2+x+1) $ and this $ (x+1) $ factor is a mess. The best we can do to isolate the initial and final values is to XOR the CRC of the empty message with the CRC of the zero message of $ k $ bytes, and this produces values of the form $ \bar{c}(x) = x^N(x^{8k} + 1)p_i(x) $. So we’d like to divide out the $ x^{8k} + 1 $ factor, but no matter what value of $ k $ we choose, it’s divisible by $ x+1 $, so it has no multiplicative inverse modulo $ g(x) $, and we get the error above if we try to calculate it.

(In technical terms: when $ g(x) $ is not a primitive polynomial, then $ GF(2)[x]/g(x) $ is a quotient ring but not a finite field, and not all nonzero elements have multiplicative inverses. In my libgf2 package, libgf2.gf2.GF2QuotientRing can be used to compute in these quotient rings whether or not they are fields, although for methods that involve multiplicative inverses, some of the assumptions break down.)

To get past this, we have to pull out a little bit of grungy algebra again.

We’d like to solve $ \bar{c} = ap_i \bmod g $ for $ p_i $, where $ a = (x^{8k} + 1)x^N $. (I’m dropping the $ (x) $ part for a moment — these are all polynomials) This may or may not have a solution, and if it does, it’s equivalent to finding $ p_i $ and $ z $ such that $ \bar{c} = ap_i + gz $. In Part III we looked at the extended Euclidean algorithm, which, given $ a $ and $ b $, calculates $ u $ and $ v $ such that $ au+bv = \gcd(a,b) $. This looks very similar, and in fact, if our equation does have a solution, we can calculate it as follows:

$$\begin{aligned} d &= \gcd(a,g) = au+gv \cr e &= \bar{c}/d \cr p_i &= ue \end{aligned}$$

with $ d, u, $ and $ v $ coming from the extended Euclidean algorithm, because then we have $ ap_i + gve = aue + gve = de = \bar{c} $.

If $ \bar{c}(x) $ is divisible by $ d(x) $, we have solutions. That’s multiple solutions: we can use $ e(x) = \bar{c}(x)/d(x) + wg(x)/d(x) $ for any $ w $, because in the calculation of $ \bar{c}(x) = d(x)e(x) $ we will find that the $ w $ term is zero $ \bmod g(x) $.

If $ \bar{c}(x) $ is not divisible by $ d(x) $, there are no solutions — and we probably made an error in calculating the CRC.

Enough with the algebra, let’s use it for computation.

from libgf2.gf2 import _gf2mul, _gf2divmod, _gf2exteuc

a = 0x101 << n
d,u,v = _gf2exteuc(a,g)
e,r = _gf2divmod(cbar,d)
assert r == 0
init1 = qr.mulraw(u,e)

# Let's find another possible solution with w=1
gcofactor, r = _gf2divmod(g,d)
assert r == 0
init2 = init1 ^ gcofactor

final1 = z0 ^ qr.lshiftraw(init1, n)
final2 = z0 ^ qr.lshiftraw(init2, n)

crc_x25_dup1 = MysteryCrc(g, init=init1, final=final1, lsbit_first=True, lsbyte_first=False)
crc_x25_dup2 = MysteryCrc(g, init=init2, final=final2, lsbit_first=True, lsbyte_first=False)

for crcfunc in [crc_x25, crc_x25_dup1, crc_x25_dup2]:
    print crcfunc
    for message in ['','Hi', 'Hello', 'Ernie, you have a banana in your ear!']:
        print "  message='%s'" % message
        print "  crc=%04x" % crcfunc(message)

<function crc_x25 at 0x10ab60d70>
  message=''
  crc=0000
  message='Hi'
  crc=2679
  message='Hello'
  crc=2c54
  message='Ernie, you have a banana in your ear!'
  crc=3fc0
MysteryCrc(poly=0x11021,init=0x84cf,final=0xffff,lsbit_first=True,lsbyte_first=False)
  message=''
  crc=0000
  message='Hi'
  crc=2679
  message='Hello'
  crc=2c54
  message='Ernie, you have a banana in your ear!'
  crc=3fc0
MysteryCrc(poly=0x11021,init=0x74d0,final=0xfe0,lsbit_first=True,lsbyte_first=False)
  message=''
  crc=0000
  message='Hi'
  crc=2679
  message='Hello'
  crc=2c54
  message='Ernie, you have a banana in your ear!'
  crc=3fc0

Perfect!

The reverse_engineer_crc() function can do this for us (although it doesn’t calculate alternate solutions):

a,b,c,d,e=reverse_engineer_crc(crc_x25)
print("poly=%x, lsbit_first=%s, lsbyte_first=%s, init=%x, final=%x" %
      (a,b,c,d,e))

poly=11021, lsbit_first=True, lsbyte_first=False, init=84cf, final=ffff

Now let’s try some more from lammertbies.nl:

def reverse_engineer_crc_from_lammerts_website(crcnum, name=None):
    oracle = create_oracle_from_lammerts_website(crcnum, apply_name=name is None)
    a,b,c,d,e=reverse_engineer_crc(oracle)
    oracle_mimic = MysteryCrc(a,lsbit_first=b,lsbyte_first=c,
                              init=d,final=e,
                              name=oracle.__name__ if name is None else name)
    return oracle_mimic

message = "Ernie, you have a banana in your ear!"
for row in xrange(9):
    if row == 2:
        continue   # Row 2 'CRC-16 (Sick)' didn't work for some reason
    crc_mimic = reverse_engineer_crc_from_lammerts_website(row)
    print "row=%d: %s" % (row, crc_mimic)
    print "crc(%s) = %x" % (message, crc_mimic(message))

row=0: MysteryCrc(poly=0x18005,init=0x0,final=0x0,lsbit_first=True,lsbyte_first=True,name='CRC-16')
crc(Ernie, you have a banana in your ear!) = 4d55
row=1: MysteryCrc(poly=0x18005,init=0xeaa8,final=0x0,lsbit_first=True,lsbyte_first=True,name='CRC-16 (Modbus)')
crc(Ernie, you have a banana in your ear!) = bd44
row=3: MysteryCrc(poly=0x11021,init=0x0,final=0x0,lsbit_first=False,lsbyte_first=False,name='CRC-CCITT (XModem)')
crc(Ernie, you have a banana in your ear!) = e2e8
row=4: MysteryCrc(poly=0x11021,init=0x84cf,final=0x0,lsbit_first=False,lsbyte_first=False,name='CRC-CCITT (0xFFFF)')
crc(Ernie, you have a banana in your ear!) = 5641
row=5: MysteryCrc(poly=0x11021,init=0xffff,final=0x0,lsbit_first=False,lsbyte_first=False,name='CRC-CCITT (0x1D0F)')
crc(Ernie, you have a banana in your ear!) = 14e2
row=6: MysteryCrc(poly=0x11021,init=0x0,final=0x0,lsbit_first=True,lsbyte_first=False,name='CRC-CCITT (Kermit)')
crc(Ernie, you have a banana in your ear!) = edaa
row=7: MysteryCrc(poly=0x13d65,init=0x0,final=0xffff,lsbit_first=True,lsbyte_first=False,name='CRC-DNP')
crc(Ernie, you have a banana in your ear!) = 448d
row=8: MysteryCrc(poly=0x104c11db7,init=0x46af6449,final=0xffffffff,lsbit_first=True,lsbyte_first=True,name='CRC-32')
crc(Ernie, you have a banana in your ear!) = c3c2cd3f

There, we figured out all the different varieties. (Except row 2, CRC-16 (Sick), which seemed to find the polynomial $ x^{16} + 1 $ when I tried it, and caused further errors. I’m going to have to let that one go.)

HTML(fetch_crcs_from_lammerts_website(message, use_ascii=True))

"Ernie, you have a banana in your ear!"
1 byte checksum	193
CRC-16	0x4D55
CRC-16 (Modbus)	0xBD44
CRC-16 (Sick)	0x8DB1
CRC-CCITT (XModem)	0xE2E8
CRC-CCITT (0xFFFF)	0x5641
CRC-CCITT (0x1D0F)	0x14E2
CRC-CCITT (Kermit)	0xEDAA
CRC-DNP	0x448D
CRC-32	0xC3C2CD3F

There’s not much else to say here. We’ve shown how to compute the parameters of a CRC just from looking at its output if we control the input. There are also ways to compute the CRC parameters in some cases where the input is constrained — for example, if you can only send messages of the form CA FE BA BE DE AD BE EF xx yy zz ww 60 0D F0 0D but you can choose any values for the xx yy zz ww bytes — although I won’t be covering them, and it’s a good exercise to try after reading this article.

That’s all the time we have for today!

Wrapup

We looked at reverse-engineering CRCs. In gory detail. With finite field algebra. I would like to say this wasn’t that complicated to understand, but there were some subtleties along the way that made the subject… interesting — and I hope it wasn’t too far out of reach to follow.

In a nutshell, here are the areas this article covered:

The parameters needed to characterize a CRC, in particular:
- the generating polynomial $ g(x) $
- whether the bits in each byte are interpreted (and output) least-significant-bit first
- whether the bytes are output least-significant-byte first
- the initial value $ p_i(x) $ prepended to the input data
- the final value $ p_f(x) $ XORed after the input data has been processed
The finite field equivalent of the CRC calculation, namely $ c(x) = f_m(M(x)) = \left(x^{N+8m}p_i(x) + x^N M(x) + p_f(x)\right) \bmod g(x) $ where $ M(x) $ is the input data, expressed as a polynomial with one coefficient for each bit
A simple Python class (MysteryCrc) that can accommodate all these variants — although I should note that I only tested it for polynomials that have a degree of a multiple of 8, so if you are working with a 17-bit CRC don’t expect it to work for you
A CRC is effectively an affine function (which is a pure linear function with a fixed offset) for fixed-length input data
Information can be probed from a CRC “oracle” by using basis functions consisting of all-zero messages (to help eliminate the affine offsets in subsequent computations) along with messages of a fixed length that are all zeros except for a single 1 bit. These single-one-bit messages have a finite field representation of monomials — for example the message that has a 1 bit in bit 14 can be represented as $ x^{14} $. (How you count bits depend on whether the CRC parameters are least-significant-bit first.)
The generating polynomial can be discovered by using a sequence of messages of successive powers of $ x $, and using a single bit of the resulting CRC to form a bit sequence, which can be fed into the Berlekamp-Massey algorithm. This requires $ 2N $ messages, in addition to an all-zero message, to determine a polynomial of degree $ N $.
Once the generating polynomial is known, initial and final values can be determined by submitting the empty message (no bytes) and the single-byte 00 message, and using some fairly simple finite field arithmetic
The two polarity parameters (least-significant-bit first and least-significant-byte first) don’t have any shortcut aside from trying all possibilities and checking consistency.
Berlekamp-Massey isn’t necessary to determine the polynomial; any two messages that differ in the least significant bit are sufficient to find the generating polynomial, by XORing the CRC of each message. Full parameter determination in the case of unknown bit polarity requires at least four messages, the empty message, and the three single-byte messages 00, 01, and 80. There are cases if the CRC length is not known where additional CRCs must be computed to confirm CRC length.
We used some real-world test cases: CRC-32 from zlib.crc32, the X.25 CRC discussed in another article, and the CRCs calculated on Lammert Bies’s website.
We discussed how to deal with generating polynomials that are non-primitive — these may have multiple solutions for initial/final value parameters, and require use of the extended Euclidean algorithm to get around the fact that quotient rings (unlike finite fields) do not have inverses for all nonzero elements.

Next time we’ll show how to find lots of primitive polynomials.

You might also like... (promoted content)

Check out Memfault's New Sandbox!

New Research Report: The State of IoT Software Development

Comments

Comments
Write a Comment

Select to add a comment

[ - ]

Comment by jms_nh●July 8, 2018

For what it's worth, the Linux documentation also has an interesting description of CRC computation (although not on reverse-engineering):

https://github.com/torvalds/linux/blob/master/Docu...

To post reply to a comment, click on the 'reply' button attached to each comment. To post a new comment (not a reply to a comment) check out the 'Write a Comment' tab at the top of the comments.

Please login (on the right) if you already have an account on this platform.

Otherwise, please use this form to register (free) an join one of the largest online community for Electrical/Embedded/DSP/FPGA/ML engineers:

Choose a Username

E-Mail (Work, School or ieee)

First Name

Last Name

Employer

Job Title

Country

State

Password

Confirm Password

By checking this box, I agree with the terms of use and privacy policy By checking this box, I consent to receive occasional emails from the *Related sites and their partners. I understand that these emails will only contain relevant information and that I can unsubscribe at any time.

Linear Feedback Shift Registers for the Uninitiated, Part XVII: Reverse-Engineering the CRC

Known Unknowns: Ice Cream Cone Endianness Again

!Kcab Nrut !Kcab Nrut

Affine vs. Linear

Determining the Polynomial, Part I: Basis Functions

Determining the Initial and Final Values

But We Don’t Really Need Berlekamp-Massey After All!

CRCs in the Wild

Like Robinson Crusoe, It’s Primitive as Can Be

Wrapup

Sign in

You might also like...

About Jason Sachs

Popular Posts by Jason Sachs

Popular Blogs Series

Free PDF Downloads

Quick Links

About EmbeddedRelated.com

Social Networks

The Related Media Group