Protecting your Digital Ass(ets) Part 1: Checksums & Digital Errors Debunked

The switch from film and analog video to digital video technologies has had a great freeing effect on every area of motion content creation. It’s created industries that didn’t exist a decade ago. With the quality of video that you can get from today’s cameras, the cost of acquisition and the skill needed for operation have, in many cases, dropped significantly. This accessibility has opened up many opportunities and challenges for professionals new and old as they work to get established and reestablished in the world of digital content creation.

We’re not here to talk about all of the issues. Instead, we want to focus on a single issue that many independent production companies face: how do you protect your digital assets?

A drawback to digital video production is its technological complexity. It’s often hard to understand what each of the pieces is actually doing, and that can lead to industry-wide misunderstandings. One of these misunderstandings is the practice of checksumming digital data after acquisition as the primary means of data verification and data protection.

In theory, it sounds smart: use an MD5 or SHA-1 checksum to make sure that what you’ve copied is the same as what was on the digital magazine.  But in actual practice, checksums are at best a waste of time and resources, and at worst a red herring that gives a false sense of security, pointing away from the checks and protections that should be happening to our footage instead. 

This is the first part of a three-part series on protecting your digital assets. In this part we’re going to explore the whys and the hows of checksumming: How did checksumming become common practice? What does checksumming actually do? And why isn’t it needed?

The Legacy of Film

A short time ago we had a crew give us a call and ask for help figuring out what was wrong with the footage they’d copied off their camera cards the previous day. The first clip would open, but none of the other files did. “We don’t understand,” they said, “we ran the checksums and they told us everything was okay.” Something had gone terribly wrong, and they couldn’t figure out how or why.

Shooting on film was a relatively simple process and easy to understand: the camera closes its shutter, pulls an unexposed frame into place behind the gate, and opens the shutter to expose it. Repeat that 24 times per second and you’re recording a movie.

With a simple piece of technology, and a hundred years of practice, the risks associated with storing images on film were pretty well known. Things to worry about include:

  • Accidental exposure of film to light (flashing)
  • Particles of dust or celluloid caught in the camera’s gate
  • Tearing or breakage of the film
  • Exposure to high intensity X-rays in transit
  • Loss of the original negative
  • Chemical damage to the original negative during processing

And so on. These risks were considered acceptable, so long as the proper procedures were followed to mitigate them, with production insurance secured to ensure that when non-recoverable problems came up, there was a way to get back what was lost, or at the very least recreate it.

As the industry shifted towards digital production, it brought with it a new set of risks, risks that are far more intangible than those of film. In some cases, there are perceptions of risk where there is none because of the technological complexity.

How Rare are Copy Errors?

One of these false risks (emphasis on false) that comes from a misunderstanding of the technology is the fear of copy errors when the DIT copies footage from the magazine to the hard drive, RAID, or tape. On its face, it’s actually a pretty reasonable fear: that something can go wrong with this ‘digital thing’ in a way I can’t understand. And so DITs and desktop applications designed for the transfer process started using checksums to ‘verify’ the footage, i.e., ‘detect’ errors in the footage so that the operator knows when things go wrong and has the chance to fix them.

Spoiler alert: in practice, they don’t.  Why not?  To understand that, let’s look at what a checksum is.

Original
  MD5     f5e0b2f2edc471d1bf32fb3e5581bded
  SHA1    ce471ede829dcbbd834b31dcc22488c611f683a6
  SHA256  1bcdb7790ead799558ad564c219c6727b579ce864db61df11078a879e4380869

One Pixel Changed
  MD5     2a4362499052e249db8ac86e43de559e
  SHA1    e726d8c82f5456b34f50e1f30453d996bc0ac7a5
  SHA256  1a688eda0be90042afde65b91d4257e671a8cfb2011d39c944a4b428cb252771

A Different Pixel Changed
  MD5     b588f80c0056c875852d0259a98e9988
  SHA1    bd6099eae3372a105840ca5dc1a1784bdc17265c
  SHA256  cdce3c29afd079c51c644133c987d6aebd936eac154acff41ec503fe7e6dabd0

Checksums / Hashes of the Error Correction Coding & CRC image below, with single digital value changes (255,255,255 -> 255,255,254 on one pixel only).

A checksum is a semi-unique value derived from a set of data. It’s a way of reducing the gigabytes of ones and zeros that represent digital information (images, sound files, programs, etc.) into a small number that’s easy to store and compare: 32 hexadecimal characters (digits 0-9 and letters A-F) for MD5, 40 hex characters for SHA-1, and 64 hex characters for SHA256. The important idea here is that changing a single bit within the digital data (replacing a 1 with a 0, or a 0 with a 1) changes the entire resulting hash. The same data always produces the same hash, but the way the hash changes when the data changes looks random and is, for practical purposes, unpredictable.
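To see that behavior for yourself, here is a minimal sketch using Python’s standard hashlib module. The 1 MiB block of 0xFF bytes is a hypothetical stand-in for a frame of footage, and the helper names digests and flip_bit are ours, made up for illustration.

```python
import hashlib


def digests(data: bytes) -> dict:
    """Return the MD5, SHA-1, and SHA-256 hex digests of a byte string."""
    return {
        "MD5": hashlib.md5(data).hexdigest(),
        "SHA1": hashlib.sha1(data).hexdigest(),
        "SHA256": hashlib.sha256(data).hexdigest(),
    }


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    changed = bytearray(data)
    changed[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(changed)


# Hypothetical stand-in for image data: 1 MiB of white (0xFF) bytes.
original = bytes([255]) * (1024 * 1024)
damaged = flip_bit(original, 12345)  # one bit out of roughly 8.4 million

for name, value in digests(original).items():
    print(f"{name:7} original  {value}")
for name, value in digests(damaged).items():
    print(f"{name:7} one bit   {value}")
```

Run it and the two sets of digests should have nothing visibly in common, even though the inputs differ by a single bit.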

And so the conventional wisdom is that if you want to be sure that what you’ve copied is bit-for-bit accurate to what is on the digital magazine, you can verify the data using checksums by:

  • Reading the file from the media to generate a checksum (Source Checksum)
  • Copying the file from the media to the storage device
  • Reading the file from the storage device to generate a checksum (Destination Checksum)
  • Comparing the source and destination checksums to make sure that they’re the same.
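For reference, the four steps above can be sketched in a few lines of Python. This is only an illustration of the workflow, not how any particular offload tool does it; the function names, the choice of MD5, and the temporary files used in the demo are all assumptions on our part.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_checksum(path: Path, algorithm: str = "md5", chunk_size: int = 1024 * 1024) -> str:
    """Read a file in chunks and return its hex digest."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def copy_with_checksum(source: Path, destination: Path) -> bool:
    """Copy a file and report whether source and destination checksums match."""
    source_sum = file_checksum(source)             # step 1: read the source
    shutil.copyfile(source, destination)           # step 2: copy (reads the source again)
    destination_sum = file_checksum(destination)   # step 3: read the copy
    return source_sum == destination_sum           # step 4: compare


# Demo with throwaway files standing in for a camera card and a backup drive.
with tempfile.TemporaryDirectory() as folder:
    clip = Path(folder) / "clip.mov"
    backup = Path(folder) / "clip_backup.mov"
    clip.write_bytes(b"not really video data" * 1000)
    print("checksums match:", copy_with_checksum(clip, backup))
```

Notice that the data gets read three times and written once: once to checksum the source, once to copy it, and once to checksum the destination.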

But there’s a faster way to be 100% sure that what you’ve created is a bit-for-bit accurate copy of what the computer can see on the media:

  • Copy the file from the media to the storage device

Surprise! All you have to do is copy the file! While going through the copy process, the file passes through a multitude of bit-for-bit checks designed to make sure that when Windows or macOS tells you the file’s copied, it’s actually copied. No additional hashing or checksums needed.

Let me repeat that: If your computer tells you things copied right, they copied right!  If things go wrong and it can’t get a bit for bit copy, it’ll stop the transfer and throw up an error.  True copy errors are so exceptionally rare that you can trust your new copy is bit for bit accurate.  Every. Time.

If that makes your anxiety go haywire, like it did for me for many years, and you don’t just want to take my (and your computer’s) word for it, let me explain a bit about how the technology involved actually works so you can understand why adding an external checksum doesn’t actually do anything more to protect your data than a simple copy.

Single Bit Errors in Data Transmission & Storage

First, a twist: solid state media, tapes, and hard drives have minor single bit errors all the time. In fact, the SSD or HDD that’s sitting in your computer probably had parts of it that ‘failed’ at the time of manufacturing! And if it’s more than a year old, it may have had any number of ‘failures’ since. And you don’t know it. In fact, you shouldn’t know it, because it’s normal. Really, really normal.

And it’s completely okay, because SSDs and HDDs are designed in such a way that they can have failures, fix them, and continue to operate without issue.

Single bit errors, where a “1” becomes a “0” or a “0” becomes a “1”, are incredibly common.  Small imperfections in the magnetic surface or in the NAND based flash memory construction cause changes in single bits of data on a regular basis.  Stray magnetic fields, electrical interference, and random high energy particles streaming from outer space can all change the value of single bits on a drive, or when the data’s being copied.  And they would be a major problem, if all your hard drive or solid state drive did was store and transfer your data as a continuous string of ones and zeros, identical to the actual data you’re trying to store.  But it doesn’t.

Instead, your data is stored as a special type of self-correcting code, with a few more bits stored above and beyond that to help flag when changes are made. Speaking more specifically, data is stored as Reed-Solomon codes with Cyclic Redundancy Checks (CRCs) on bytes and sectors of data. These CRCs and Reed-Solomon codes allow the hard drive controller or solid state drive controller to detect and to correct single bit and single byte errors, as well as larger multi-byte errors within a sector (a sector is typically 4096 bytes, or 4 kilobytes).
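To make ‘detect’ and ‘correct’ a little more concrete, here is a toy sketch in Python. Note the substitution: Reed-Solomon is too involved for a short example, so the correction half uses a Hamming(7,4) code, a much simpler error correcting code that fixes one flipped bit per 7-bit codeword. The detection half uses the CRC-32 from Python’s zlib module, which is not necessarily the CRC any given drive uses. The function names are ours.

```python
import zlib


def hamming_encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming(7,4) codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming_decode(codeword):
    """Fix at most one flipped bit; return (data_bits, error_position)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # 0 means clean, otherwise 1-based bit position
    if position:
        c[position - 1] ^= 1  # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], position


# Correction: store four bits, damage one bit of the stored codeword, read it back.
stored = hamming_encode([1, 0, 1, 1])
stored[4] ^= 1  # simulate a single bit error on the media
recovered, position = hamming_decode(stored)
print("recovered data:", recovered, "error found at position:", position)

# Detection: a CRC over a 4096-byte sector changes when any single bit changes.
sector = bytes(range(256)) * 16
crc_at_write = zlib.crc32(sector)
corrupted = bytearray(sector)
corrupted[100] ^= 0x08
print("CRC mismatch detected:", zlib.crc32(bytes(corrupted)) != crc_at_write)
```

The principle is the same as in a real drive, just at a much smaller scale: a few extra bits stored alongside the data are enough to notice a flipped bit and put it back.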

So how do these protect your data?  Let’s handle writing data to a storage device first.

Error Correction Codes & CRCs are generated from the data when you write it to the media by the media controller.

Whether we’re talking about your camera media or the destination media you’re copying to, first and foremost, the CRCs and Reed-Solomon codes are generated by the storage device when you write to it, and these are the actual 1’s and 0’s stored on the storage device.  Storage devices never store data directly without these codes.  Without getting into a whole bunch of technical details, when you actually tell a hard drive to write a chunk of data, it goes through the following steps:

  • Find a place on the hard drive marked as ‘free space’ it can write to
  • Assign the data to that available sector
  • Move the read-write head arm to the target sector
  • Convert the data to be written into Reed-Solomon code and calculate the CRC
  • Write the data as Reed-Solomon code and add the CRC to the sector using the write head
  • WHILE WRITING THE DATA, read back the data from the disk using the read head that follows the write head: The read and the write heads are independently wired on the end of the arm, and the read head always follows the write head over the hard disk platter so that it can immediately read what’s been written and make sure it’s actually there.
  • Check that the data read and decoded on write matches the target data and recalculate the CRCs using the written data
  • Tell the operating system or whatever device is controlling it that the data was successfully written and it’s ready for more data.

The read head follows the write head as the disk moves beneath it so that it can verify what's been written is correct.

Or in other words, the hard drive performs the data verification pass before it says that the data’s been written. So do tapes. Solid state drives and flash media do something similar, except that the read step is executed in sequence after the write step, since the circuitry for reading and writing is the same. But they perform the same calculations of CRC and Reed-Solomon codes, and the same verification of the data after writing, before allowing the process to move on.

But what happens when things go wrong?  Simple - if it can’t write to a sector, it marks the sector as bad and automatically reassigns where the data can go.  It does this until it finds a sector where it can write successfully.  If no such sectors are available, it returns an error to the operating system or device, stops the copy, and alerts the user to the problem.
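The write, verify, and reassign behavior described above can be modeled with a toy simulation. To be clear, this is not real drive firmware, which is proprietary and far more sophisticated; the ToyDrive class, its method names, and the way damage is simulated (flipping one bit in anything written to a damaged sector) are all made up for illustration.

```python
import zlib


class DriveError(Exception):
    """Raised when the toy drive cannot complete a request."""


class ToyDrive:
    """Toy model of a drive controller: write, read back, reassign on failure."""

    def __init__(self, sector_count, spare_count, damaged=()):
        self.media = {}                 # physical sector -> (stored bytes, CRC)
        self.damaged = set(damaged)     # sectors that physically corrupt data
        self.marked_bad = set()         # sectors the controller has given up on
        self.spares = list(range(sector_count, sector_count + spare_count))
        self.remap = {}                 # logical sector -> physical sector in use

    def _store(self, physical, payload):
        crc = zlib.crc32(payload)       # CRC is computed from the intended data
        if physical in self.damaged and payload:
            payload = bytes([payload[0] ^ 0x01]) + payload[1:]  # simulated damage
        self.media[physical] = (payload, crc)

    def _read_back_ok(self, physical, expected):
        payload, crc = self.media[physical]
        return payload == expected and zlib.crc32(payload) == crc

    def write(self, logical, payload):
        physical = self.remap.get(logical, logical)
        while True:
            self._store(physical, payload)
            if self._read_back_ok(physical, payload):
                self.remap[logical] = physical
                return physical         # only now is the write reported as done
            self.marked_bad.add(physical)
            if not self.spares:
                raise DriveError("no usable sectors left; write failed")
            physical = self.spares.pop(0)  # reassign to a spare and try again

    def read(self, logical):
        payload, crc = self.media[self.remap[logical]]
        if zlib.crc32(payload) != crc:
            raise DriveError("unrecoverable read error")
        return payload


drive = ToyDrive(sector_count=4, spare_count=2, damaged={1, 4})
print("clip A stored in sector", drive.write(0, b"clip A"))
print("clip B stored in sector", drive.write(1, b"clip B"))
print("sectors marked bad:", sorted(drive.marked_bad))
print("clip B reads back as", drive.read(1))
```

In this model the caller only ever sees one of two outcomes: a write that has already been read back and verified, or an error. There is no third outcome where bad data is silently accepted.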

So that’s writing, what about reading?

Unsurprisingly, a very similar data integrity check process happens when you go to read the data on a hard or solid state drive. It finds the data, reads it with its CRCs, and recalculates the CRC values to make sure it’s the same as when it was written. Any single bit problems with the data (‘soft errors’) are immediately fixed using the Reed-Solomon code as it’s being decoded, and the drive tries to fix the data in place too. If the hard drive can’t fix the data in place because of actual damage to the drive instead of other minor single bit errors, it marks the sector as bad and moves the data to a new sector (a hard error). And it does this all before passing the data back to the operating system or device that requested it.

All solid state media and hard drives manufactured by reputable manufacturers include extra sectors to allow this process to happen without reducing the storage capacity: they’re designed for small parts to fail over time without the whole drive or media failing.  When they accumulate too many hard errors (they keep count!), they flag themselves as failed and alert the operating system or device.

So to summarize: when you write to a hard drive, tape, or solid state drive / media, the drive creates and stores error detection and correction information, and automatically verifies the write.  When you read from a drive, it automatically detects and corrects any errors before passing on any information to the operating system or device, double checking data integrity on read.

But what about in between? Don’t worry - every single connection and transmission standard has its own form of error detection designed to ensure that it’s communicating properly. This error checking is one of the biggest causes of the overhead mentioned in our last post about bottlenecks. Once again, the error detection and correction information is created by the transmitting device and verified by the receiving device, and each chunk of data must be verified before moving to the next part of the data.
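Here is a rough sketch of that idea in Python: the sender attaches a CRC to each chunk, the receiver checks it, and a chunk that fails the check is sent again before the transfer moves on. Real interfaces each have their own framing and retry rules, so treat everything here (the 64-byte chunk size, the simulated noisy_link, the 30% error rate, the function names) as assumptions chosen for illustration.

```python
import random
import zlib


def make_frame(payload: bytes) -> bytes:
    """Sender side: append a 4-byte CRC-32 to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes):
    """Receiver side: return the payload if the CRC matches, otherwise None."""
    payload, received_crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == received_crc else None


def noisy_link(frame: bytes, rng: random.Random, error_rate: float) -> bytes:
    """Simulated cable: flip one random bit with probability error_rate."""
    if rng.random() < error_rate:
        damaged = bytearray(frame)
        bit = rng.randrange(len(damaged) * 8)
        damaged[bit // 8] ^= 1 << (bit % 8)
        return bytes(damaged)
    return frame


def send(data: bytes, chunk_size=64, error_rate=0.3, seed=1, max_attempts=50):
    """Send data chunk by chunk, resending any chunk that fails its CRC."""
    rng = random.Random(seed)
    received = bytearray()
    retransmissions = 0
    for start in range(0, len(data), chunk_size):
        frame = make_frame(data[start:start + chunk_size])
        for _ in range(max_attempts):
            payload = check_frame(noisy_link(frame, rng, error_rate))
            if payload is not None:
                received += payload  # verified, move on to the next chunk
                break
            retransmissions += 1     # CRC mismatch, send the same chunk again
        else:
            raise OSError("link failed: too many corrupted frames")
    return bytes(received), retransmissions


data = bytes(range(256)) * 4
received, retransmissions = send(data)
print("received intact:", received == data)
print("chunks resent along the way:", retransmissions)
```

Even with an absurdly noisy simulated cable, the data arrives intact; the cost of the errors shows up as resent chunks, which is to say, as lost time rather than lost data.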

What About Checksums?

As long as things are going right, checksumming camera data doesn’t add any more protection to your data than a simple copy does. And even when things do go wrong, checksums can’t do anything more to detect or fix problems than you could without them, though they may have the intangible benefit of calming an anxious producer or cinematographer.

But while they technically aren’t harmful, they’re a waste of the time spent rereading the original source and copied data and calculating the checksums. In my experience, they can quickly become a red herring that gives DITs and producers the confidence that nothing is wrong when things have silently gone awry.

That's what trapped this crew who called us: the checksums said things were okay when they really weren't. Luckily, their editor started working the very next day and they discovered they had a problem sooner rather than later. However, had they followed a better set of asset management recommendations, they could have found it as soon as it happened.  And in case you're wondering about what did go wrong and how we fixed it, stay tuned for Part 3.

But next, in Part 2, we’re going to talk about the real risks in protecting your digital assets, and some recommended ways to protect yourself against them.