Monday, March 26, 2012

A Word on File Size

For those of you who wish to use LMDx for purposes other than detecting accidental errors on fixed-block-size storage devices...

As explained previously, LMD2 and LMD3, both of which I published in the previous post, act on 32-bit values ("u32" in the code). This means that a file whose size is a multiple of 4 bytes will have the same LMD2 (and LMD3) as another file which is 1, 2, or 3 bytes longer, provided that those additional bytes are all 0s.
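To make the padding effect concrete, here is a minimal Python sketch. It does not implement LMD2 or LMD3; toy_word_digest is a hypothetical stand-in that merely shares the one property that matters here, namely that it consumes little-endian 32-bit words and (by assumption) zero-pads a final partial word. How the real LMDx code handles trailing bytes is not shown in this post, so treat this only as an illustration of the mechanism: inputs that pad out to the same word sequence cannot be told apart.

```python
import struct

def pad_to_words(data: bytes) -> bytes:
    """Zero-pad data up to the next multiple of 4 bytes (assumed behavior)."""
    return data + b"\x00" * (-len(data) % 4)

def toy_word_digest(data: bytes, seed: int = 0x12345678) -> int:
    """Hypothetical 32-bit digest over u32 words. This is NOT LMD2 or LMD3."""
    acc = seed
    for (word,) in struct.iter_unpack("<I", pad_to_words(data)):
        # Mix each word in; the multiplier is odd, so each step is invertible.
        acc = ((acc ^ word) * 0x9E3779B1 + 1) & 0xFFFFFFFF
    return acc

short = b"HELLO"                  # 5 bytes, padded internally to 8
longer = b"HELLO\x00\x00\x00"     # 8 bytes, the zeros are real file content

same = toy_word_digest(short) == toy_word_digest(longer)
differs = toy_word_digest(short) != toy_word_digest(b"HELLO\x01")
print(same, differs)
```

The two inputs differ in length, but the digest only ever sees identical words, so the length difference is invisible unless it is mixed in some other way.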

While this seems bad, it's not actually abnormal. After all, most other digest algorithms operate on bytes, even though a file may end on a bit boundary. If you want the digest to reflect the exact size (to the nearest bit, or byte, or whatever), then simply ensure that you integrate the size into the LMDx that you're using. For example, you could xor the low bits of the generator seeds (X0 being lower in bit position than C0) with the file's size. Or, you could simply ensure that the size is embedded in a header, which itself falls under LMDx protection. [EDIT: A temptingly simple approach is to set the first bit of the padding and clear every bit after it. However, I think the aforementioned seed xor method is superior. Appending a 1 creates a case in which it's easy to generate the same hash maliciously, or even accidentally.]
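The seed xor idea can be sketched in a few lines of Python. The function name mix_size_into_seeds and the choice to treat (C0:X0) as a single 64-bit value, with X0 in the low 32 bits and C0 in the high 32 bits, are my assumptions for illustration, as is measuring the size in bytes. Any constraints that the real generator places on its seeds are not checked here.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def mix_size_into_seeds(x0: int, c0: int, size: int) -> tuple[int, int]:
    """Xor a file size into the 64-bit value (C0:X0); X0 is the low half.

    Hypothetical helper: the unit of `size` (bits or bytes) is up to you,
    as long as the producer and the verifier agree on it.
    """
    combined = ((c0 & MASK32) << 32) | (x0 & MASK32)
    combined ^= size & MASK64
    return combined & MASK32, combined >> 32

# Same starting seeds, two files whose sizes differ by one byte.
seeds_8 = mix_size_into_seeds(0x11111111, 0x22222222, 8)
seeds_9 = mix_size_into_seeds(0x11111111, 0x22222222, 9)
print(seeds_8 != seeds_9)
```

Because the two files now start from different seeds, they are no longer fed through the same generator state, even if their zero-padded contents are identical. Sizes below 2^32 only touch X0; larger sizes spill into C0.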

Of course, if the file size is implicit based on other content, then you already have a way to know whether it has been accidentally truncated or expanded, independent of LMDx.

UPDATE: Likewise, LMD4, LMD5, and LMD6 operate on blocks of 2^12 bytes, so files whose sizes are not a multiple of that block size may have the same hash due to similar 0 padding, with the same solutions as for LMD2 above. (The hash of a null file is the hash of a block of 0s.)

Lastly, why does this font suck? 0 (zero) comes out like O (the letter "oh"). Ah, the mysteries of Blogger...
