Mystery Box
CREATORS + INNOVATORS

PROTECTING YOUR DIGITAL ASS(ETS) PART 2: Best Practices

Raid 5 Array - Mystery Box Storage Solutions

In Part 1 we looked at the strategies your computer and other hardware use to prevent data corruption when the data’s being copied or stored.  Under normal circumstances, you’re extremely unlikely to get errors that mean anything to your footage because of the error correction codes and reliability of modern systems.  And if you do get errors, they’re far more likely to be the kind of errors checksums can’t detect.

But that doesn’t mean that your data is invincible.  There are a number of real threats to your digital assets, and protecting against these can be a little bit more challenging.  We’re going to group the many types of problems into three broad categories.  First, data loss or asset damage can show up on magnetic media because of a rare class of digital errors called “Unrecoverable Read Errors” (UREs); second, all kinds of media and devices can suffer from physical damage or failure; and lastly, nothing is safe from human error.  Let’s look at each of these and then go over the best practices that will keep your assets safe.


Unrecoverable Read Errors

In Part 1 we looked at the strategies hard drives and solid state media use to detect and correct single bit errors.  If enough of these single bit errors occur, or parts of a magnetic drive depolarize more than usual, on a very, very rare occasion the drive may encounter an unrecoverable read error, or URE, which is exactly what the name implies - an error it can’t correct.

But these are pretty rare, especially early in any form of media’s life. For a consumer hard drive, over the lifetime of the drive you can expect on average the unrecoverable loss of less than one bit in every 12.5 terabytes read (10^14 bits). Usually this means you’ll see few if any errors in the early and middle parts of the drive’s life, but a lot more close to when it’s worn out (after a few years of daily use). For enterprise class hard drives you can expect less than one lost bit for every 125 terabytes (10^15 bits) of data.

Linear tape has even lower unrecoverable error rates, with LTO-6 only losing on average less than a single bit every 12.5 petabytes, and LTO-7 seeing a single bit error at over 125 petabytes, which is a very, very large amount of data!
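As a quick sanity check on those figures, converting a rated error interval in bits to a data volume is just a division by eight.  The sketch below assumes decimal units (1 TB = 10^12 bytes), and it assumes rated intervals of 10^17 and 10^18 bits for LTO-6 and LTO-7, which is simply what the 12.5 and 125 petabyte figures above imply.  The function name is ours, for illustration only.

```python
# Convert a rated URE interval (bits read per unrecoverable bit) to terabytes.
# Assumption: decimal units, 1 TB = 10**12 bytes.

def bits_to_terabytes(bits: int) -> float:
    return bits / 8 / 10**12


# The tape values are inferred from the petabyte figures in the text,
# not taken from a spec sheet.
rated_intervals = {
    "consumer hard drive": 10**14,
    "enterprise hard drive": 10**15,
    "LTO-6 (implied by 12.5 PB)": 10**17,
    "LTO-7 (implied by 125 PB)": 10**18,
}

for media, bits in rated_intervals.items():
    print(f"{media}: one lost bit per {bits_to_terabytes(bits):,.1f} TB read")
```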

High quality SSDs and other solid state media for all intents and purposes see almost no unrecoverable read errors for most of their life, as long as they’re being used regularly: when they do start to hit the write cycle limits typically large portions of the media fail at the same time, making it unusable very suddenly.  The thing is, it’s actually really difficult to hit write cycle limits: more often than not you’ll replace an SSD long before hitting it, though it’s still possible to hit that limit with solid state camera cards after a couple years of daily use.

Other forms of solid state media (older, cheap, or low quality flash drives or camera media and consumer SSDs) can suffer from data rot when they’re left powered off for quite a long time (six months or more).  This is because of issues with charge slowly leaking out of the flash memory cells that can essentially erase large portions of the data simply with time.  While newer and high quality flash media suffers from slower leaks, and powering on the media periodically can help a little, the quantity and unpredictable nature of these kinds of UREs make SSDs and other solid state media unsuitable for long term storage, even though they generate fewer traditional UREs than magnetic media.

Let’s focus on hard drive UREs, since they’re typically the most economical choice for data storage.  While we said they were rare in most cases, one error in 12.5 terabytes on consumer hard drives is arguably ‘fairly’ often for an unrecoverable error, potentially making them ‘relatively common’ when you’re dealing with terabytes of video data at a time. So how do we protect against them?  Hashes and checksums aren’t the answer, because they can’t do any more detection than the media already does.  But there are a few really, really easy things that you can do to keep your media (and data) safe from UREs:

  1. Use manufacturer recommended media and card readers with your cameras, and replace the media after the recommended number of write cycles.

  2. Offload to a RAID 5 or 6 that uses enterprise class drives as soon as possible from the camera media.

  3. Make at least 2 archival copies of the digital assets as soon as possible and always have at least one offline (don’t edit with them).

  4. Use new hard drive or tape storage as your primary long-term (archival) backup solution.

We’re going to break down those key pieces of advice in a moment, but following those four key items is the best way of preventing UREs and errors caused by the write cycle limits of solid state media.


Equipment Damage

Equipment damage happens in all kinds of different ways, and nothing is entirely safe, though the most vulnerable components when we’re talking about storage tend to be hard drives.  It sounds silly, but you’re more likely to lose footage from dropping a hard drive than from any other kind of failure.

Most electronics will break if you drop them from high enough places, but hard drives are the only common computer component with moving parts.  Their read-write arms move over spinning disks inside the physical device. Their read-write heads are fragile, resting only a few micrometers above the disk.  Sudden shock that comes from a drop of even a short distance can permanently damage the heads when they strike the platter, especially if the drive is on or the disks still spinning.  This can cause large portions of the drive to become permanently unreadable, or rather, unreadable without a data recovery service that removes the damaged heads and replaces them with new ones.

But other electronics are susceptible to drop shock too, just not as much.  The general rule of thumb is that the bigger and heavier a device is, the more likely it is to break when dropped.  Laptops and computers have many large flat boards that can bend and break when they’re dropped from small heights because of how the device twists as it absorbs the impact.  But small SSDs and camera cards usually are really well supported in their cases and don’t twist or compress very much (unless they fall from a really big height), making them much harder to break with drop damage.

Solid state media DOES tend to be more susceptible to heat damage and electric shock though.  Leaving cards in the sun or having them too close to heaters (or leaving them in a pocket and having them go through the dryer) can be catastrophic, especially on more generic media types like SD Cards, CF Cards or CFAST cards.

Tapes are almost impervious to dropping, and even if you shatter the plastic housing by dropping the tape cartridge from large heights, the tape itself can usually be cleaned and respooled if the right care is taken.  These also tend to be the least affected by heat and electric shock.

While we can argue that most of that constitutes human error, there are other kinds of damage, not related to human error, that we need to protect against.  The most common problem is simply wear and tear.  Over time and with use, hard drives and solid state drives wear out.  The more random seeks you do, such as the ones that come with video editing, the faster a hard drive will wear out. Something we’ve seen a lot of too is inexpensive control boards and connectivity circuits (USB, Thunderbolt, etc) slowly increasing their resistance over time and drawing more power until they break entirely.  This is just a normal part of wear and tear on integrated circuits and there’s nothing you can do to stop it.

Lastly, and hopefully less likely, you may find hard drives to be DOA, or “Dead on Arrival”.  In order to keep production and unit costs low, all manufacturers tolerate a certain number of drives simply not working, or failing very quickly after purchase.  The percentage of drives that fail out of the box is low, but it’s common enough that if you go through enough hard drives on a regular basis, you’ll probably experience it at some point.

So how do we protect our digital assets against equipment damage?  Let’s list them out:

  1. Make at least 2 archival copies of the digital assets as soon as possible. Keep at least one copy offline except when copying to it, and store it in a safe place.

  2. Use new hard drive or tape storage as your primary long-term (archival) backup solution and keep it offline (don’t edit with it). Test them when you get them, then treat these as a write-once, never reuse situation.

  3. Use a RAID 5 or 6 that uses enterprise class drives as your primary live RAW storage drive.

You may notice some overlap between this advice and the advice for protecting against UREs.  That’s not an accident: the same best practices that protect against read errors protect against hardware failure.


Human Error

A mantra that we live by these days when looking at systems, backup solutions, and other aspects of data protection is: “How do we protect this from people, including ourselves”.  It’s impossible to eliminate all human error.  No, seriously.  If your backup solution depends on no one ever making mistakes, your solution needs to be fixed.  Even the best people make mistakes, and that’s okay.

Beyond clumsiness, by far the most common human error is accidental deletion or overwriting of something that shouldn’t have been deleted.  This can happen when we’re trying to free up space on a common use drive, create a new set of proxies or intermediates, or simply just by a folder being somewhere it shouldn’t and getting deleted with other files that were slated to be purged.

Fortunately there are three best practices that can reduce the amount of human error and reduce the likelihood of human error causing a real problem to almost nothing.  They are:

  1. Make at least 2 archival copies of digital assets as soon as possible. Keep at least one copy offline except when copying to it, and store it in a safe place. Never delete anything from an archive copy.

  2. Foster a culture where there isn’t a fear of making the occasional mistake. Focus on learning from the mistakes that do happen.

  3. Create a checklist of ‘things to do before deleting’ and limit the ability of the average user to delete files.

Once again, we have overlap with the other best practices!  So now that we’ve looked at what can go wrong, let’s take some time and dive into all of these recommendations and see how they protect our footage in the short and long terms.


Recommendation #1: Use Manufacturer Recommended Media

This should be a no-brainer, but unfortunately it isn’t. Camera media is one of those accessories that it’s almost impossible to have ‘enough’ of in today’s data-hungry camera market, and it can be really tempting to cut costs here.  So why are some manufacturers’ proprietary media many times more costly than third party media?  And why should we care about the brand and model of SD, CF or CFast card we’re using?

There are a few answers to that.

One is card read-write speeds.  Not all SD, CF, or CFast cards read or write at the same rate.  Every camera will have a minimum class of card that can reliably and continually write at a rate faster than the camera records the video.  Slower cards will often limit which quality settings are available, or simply not work, and many off label or ‘non name brand’ cards may not hit the write speeds they’re listed at.

A second reason is card reliability.  Not only do all cards have different read-write speeds, but cards made by different manufacturers and at different quality levels have different numbers of write cycles before they stop working.  That’s a major problem with SD and CF cards particularly, since there’s such a large market for them that it’s easy to end up with unreliable cards when you buy a less familiar brand.  More expensive cards usually have many more write cycles available to them than less expensive cards, all things being equal.

Here the problem is less about ‘will it work’ than it is about how long will it work.  Under a typical load, high quality media should rarely hit its maximum write cycles before being upgraded with the camera or replaced, often lasting 3 years of daily use or more.  On the other hand, under normal loads, cheap media will often fail in a year or year and a half (sometimes even much sooner than that!), and when it does you get the pleasant experience of reshooting whatever was on that card.

This is the number one reason that major camera brands like Sony, Red, Arri, and Vision Research all use proprietary solid state media that you can only buy from them.  Yes, it’s more expensive to you as the owner / user, but the media is guaranteed to be of the highest quality, with more write cycles available than you should ever need and the fastest available solid state cells (typically fast and highly reliable NAND based single level cells, rather than the less expensive, slower, and less reliable multi level cells found in most consumer SSDs, SD and CF cards).  That, along with better controllers or RAID schemes, actually does push the manufacturing costs up.  But for reliably protecting your data it’s completely worth it.

As a side note, some inexpensive card readers can actually cause data corruption in a way that can’t be detected with hashes or checksumming and so it’s important to use the more expensive, manufacturer provided or approved readers.

The last thing you want is a card to fail in your camera after you’ve spent the morning recording.  Because of the way that solid state media works, it can be quite difficult, if not impossible, to recover footage from a broken or damaged card.  The lesson: make it a non-issue; use manufacturer recommended media and card readers.

Mac Pro - Mystery Box Computer

Recommendation #2: Offload to a RAID 5/6 ASAP

When we’re running DIT workflows, either for ourselves or for clients, the first place the footage ends up once the card is connected to a computer is a RAID 5 for staging before being copied to whatever hard drives, tapes or other RAID systems that will actually be using the footage or storing it permanently.

The short version of the reason why is that RAIDs allow you to a) create a second copy quickly, and that second copy is b) more resilient to data loss from single bit UREs or single hard drive failures.

Using black magic (read: math), a RAID 5 or a RAID 6 ties multiple hard drives together with fault tolerance called parity data.  If I have eight hard drives, say 4TB each, I can combine their capacity and overall write speeds by simply striping data (write a small amount to each sequentially) across all eight of the drives to give myself 32TB of useable data.  This is called a RAID 0.  They’re great for speed and capacity, but not for digital asset protection: if any single drive generates an error or fails, I lose all of my data.

If I use a RAID 5 or RAID 6, I still combine all of their capacities and speed, but I add a little bit of data overhead for parity data.  In a RAID 5 configuration, the eight 4TB disks would give me 28TB of useable space (you lose the storage capacity of one disk), with a fault tolerance of one disk: the parity data spread across all of the disks allows me to lose any one disk, or to have part of the data in one block within a stripe corrupted, and the RAID can still fix it without any data loss.  A RAID 6 uses two different kinds of parity to offer a fault tolerance of two disks, while reducing the useable capacity by two disks instead of one (24TB available, in our example).
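To make the ‘black magic’ a little more concrete, here is a minimal sketch of the capacity arithmetic and of single-parity recovery using XOR, which is the classic way RAID 5 style parity is explained.  It’s an illustration under simplifying assumptions, not how any particular RAID controller is implemented: real arrays rotate parity across the disks, and the second parity in a RAID 6 uses different math that isn’t shown here.  The function names and the tiny four-byte blocks are ours.

```python
from functools import reduce


def usable_capacity_tb(disks: int, disk_tb: int, parity_disks: int) -> int:
    """Usable space when `parity_disks` worth of capacity holds parity."""
    return (disks - parity_disks) * disk_tb


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Capacity for the eight 4TB disks in the example above.
print("RAID 0:", usable_capacity_tb(8, 4, 0), "TB")
print("RAID 5:", usable_capacity_tb(8, 4, 1), "TB")
print("RAID 6:", usable_capacity_tb(8, 4, 2), "TB")

# One stripe: seven data blocks plus one XOR parity block.
data_blocks = [bytes([i] * 4) for i in range(1, 8)]
parity = xor_blocks(data_blocks)

# Simulate losing one block, then rebuild it from the survivors plus parity.
lost_index = 3
survivors = [b for i, b in enumerate(data_blocks) if i != lost_index]
rebuilt = xor_blocks(survivors + [parity])
print("rebuilt block matches original:", rebuilt == data_blocks[lost_index])
```

The recovery works because XOR-ing every surviving block with the parity block cancels out everything except the missing block.  With only one parity block per stripe, a second simultaneous loss in the same stripe can’t be rebuilt this way.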

The absolute fastest way of reducing the risk of data loss from unrecoverable read errors is to duplicate the footage as soon as possible, and by putting it on a RAID 5 or 6, you do two things: greatly reduce the risk of data loss from hardware failure or damage, and for all intents and purposes eliminate the possibility of read errors.

How do they eliminate read errors? Let’s look briefly at what the actual odds of data loss from unrecoverable read errors are for a couple of different storage scenarios, using the error rates we looked at before: 1 URE for every 10^14 bits on average for consumer drives, and 1 URE for every 10^15 bits read on average for enterprise drives. We’ll assume we’re reading 1 TB of data from each type of media:

Data Stored as | Odds of Data Loss with 1TB Read
One Copy on Single Hard Drive (Consumer) | 1 in 12
One Copy on Single Hard Drive (Enterprise) | 1 in 114
Two Copies on Two Hard Drives (or 2 Disk RAID 1 Mirror, Consumer) | 1 in 3.06 x 10^9
Two Copies on Two Hard Drives (or 2 Disk RAID 1 Mirror, Enterprise) | 1 in 3.09 x 10^10
One Copy on 8 Disk RAID 5 (Consumer) | 1 in 2.53 x 10^8
One Copy on 8 Disk RAID 5 (Enterprise) | 1 in 1.96 x 10^9
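For the two single-drive rows, the odds can be reproduced with a few lines of arithmetic.  This is a sketch under our own assumptions: every bit read fails independently with probability 1 / (rated interval), and ‘1TB’ means a binary terabyte (2^40 bytes).  Under those assumptions the chance of at least one URE is 1 - (1 - p)^n, which comes out close to the first two rows above.  The multi-copy and RAID rows depend on further modelling choices that aren’t spelled out here, so they aren’t attempted.

```python
import math


def odds_of_ure(bits_read: int, bits_per_ure: float) -> float:
    """Return N such that the chance of at least one URE is '1 in N'.

    Assumes independent per-bit errors with probability 1 / bits_per_ure.
    """
    p_bit = 1.0 / bits_per_ure
    # 1 - (1 - p)^n, computed via log1p/expm1 so the tiny p isn't rounded away.
    p_any = -math.expm1(bits_read * math.log1p(-p_bit))
    return 1.0 / p_any


ONE_TIB_BITS = 2**40 * 8  # assumption: "1TB" is one binary terabyte

print(f"consumer:   1 in {odds_of_ure(ONE_TIB_BITS, 1e14):.1f}")
print(f"enterprise: 1 in {odds_of_ure(ONE_TIB_BITS, 1e15):.1f}")
```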

As you add more copies on individual hard drives (or member disks in a RAID 1 mirror), the odds of data loss drop exponentially; the same happens when you increase parity protection from a RAID 5 to a RAID 6.

How come?  Essentially, to get data loss with multiple copies or when parity data is in play, you need to generate errors when reading both copies of the footage, and the error needs to be in the same PART of the footage.  1TB of data has around 268 million ‘parts’ (blocks) where something can go wrong on single hard drives, or between 8 and 34 million ‘parts’ on a RAID.  And though RAIDs have fewer divisions for the same amount of data, there’s also less ‘load’ on each disk since it only stores part of that 1TB of data, which keeps the odds of data loss incredibly low.
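The ‘parts’ figures are consistent with dividing one binary terabyte into fixed-size pieces.  The sizes in this sketch are our guesses, chosen because they land on the numbers quoted above: 4 KiB blocks on a single drive, and RAID stripe units somewhere between 32 KiB and 128 KiB.  Your drives and RAID may use different sizes.

```python
ONE_TIB = 2**40  # bytes; assumption: "1TB" is one binary terabyte


def parts(total_bytes: int, part_bytes: int) -> int:
    """How many fixed-size pieces the data divides into."""
    return total_bytes // part_bytes


# Assumed sizes, not taken from any specific drive or RAID spec sheet.
print(f"4 KiB drive blocks:    {parts(ONE_TIB, 4 * 1024):,}")
print(f"32 KiB stripe units:   {parts(ONE_TIB, 32 * 1024):,}")
print(f"128 KiB stripe units:  {parts(ONE_TIB, 128 * 1024):,}")
```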

Once again, moving the data to a RAID unit is the fastest way to reduce the odds of data loss.  It’s hard to misplace a RAID, and they’re less prone to damage or failure than single drives, which is why regardless of what we’re putting the footage on for a client later, we always copy to a local RAID first: a RAID 5 Promise Pegasus2 R8 at the office, or a G-Technology G-SPEED Shuttle XL for on location transfers.  For any DIT, this should be the first step in getting your digital assets from the camera card to where they’re going to be finally stored.

But a single RAID is not the only place we put them.  In fact, we don’t delete or reuse camera media until we have a second complete backup copy of our assets, and neither should you.


Recommendation #3: Make & Keep (at least) 2 Archival Copies

The more copies of a file you have, on different physical media, the less likely you are to experience a catastrophic error.

Any time you have only one copy of a set of digital assets you’ve put yourself in a precarious situation.  Whether it’s the original camera media, the master RAID, or the external drive you’ve been copying to, single copies are vulnerable to loss, theft, physical damage from a drop or collision in transit, general failure, electrical shock, extreme heat, and perhaps worst of all: human error.

Hard drives and RAIDs have the worst susceptibilities to physical damage, but solid state camera media are usually small and can get misplaced and lost, overwritten, or in rare cases be damaged by electrical discharge or extreme heat.  Linear tape tends to be rather resilient in all respects, but takes a while to create and access.  But no media offers adequate protection from human error.

Accidental (or intentional) deletion, overwriting, butter fingers, malicious intent, or a multitude of other things that we can do to our footage make human error the absolute number one cause of data loss, far more than any other source.  And while there are recovery techniques available in most cases, there are times that retrieving assets simply isn’t an option.

Fortunately, the best option to protect against all of these is also the easiest advice to follow: always a) make and keep at least two archival copies, on b) two different physical devices, on c) magnetic media (hard drive or tape), and d) keep them physically separated.

For example, a duplicate set could be something like two separate hard drives, a hard drive and a RAID, a RAID and a tape, or a RAID and two tapes, and so on.  If your camera allows for multi card recording without penalties, do it.  If it doesn’t, make a second copy as soon as you can, usually to a RAID for protection, and make further duplicates from there.

But here I’m going to differentiate between two different types of copies: live copies and archival copies.  Live copies are copies of your digital assets that you work with.  Archival copies are copies of your digital assets that you create once and store.

Archival copies should rarely be accessed - never during the initial work phase unless you suffer a failure with the live copies, and rarely in the future when you’ve purged the live copies for newer footage.  And if you do need to retrieve something from the archive, it should never be accessed directly by applications: archives should ONLY be used to create a new live copy when something’s gone wrong, and the archives should only be handled by the archivists or workflow managers to reduce human error.

If you’re following this best practice you’ll have at least 3 copies of all of your original assets while a project’s being worked on: your live working copy, and two archival copies.  It’s arguably acceptable when budgets are tight to only keep one archival copy while you’re working with a second copy on your live files, but we strongly recommend against it: if something happens to your live copy and for whatever reason the single backup is unavailable, you can end up in big trouble really quickly.

Two archival copies also let you protect yourself against less common problems: theft, fire, water damage, and other forms of disaster, but only if you keep the two copies isolated from each other.  Here at Mystery Box, we keep one archive at the office for accessibility, and a second off site at a place we’re not going to tell you.  Every tape gets a duplicate when it’s made, so that no matter what happens, we have protection against natural disaster and malicious intent.

When you’re placing so much trust in archival copies, you want to make sure that they’re as reliable as possible for as long as possible.  So how do you minimize problems with your archival copies?  Always use new magnetic media.


Recommendation #4: Always Use New Magnetic Storage Media for Archives

As soon as your initial copies from camera media to duplicate live storage are done, you need to prepare your archival copies.  Since the odds of errors happening on a hard drive or RAID increase with the device’s use, starting with new archival media, writing to it once, and treating it as read-only from there on out protects against loss and damage throughout the entire post production process.

I’m going to repeat myself here: we always recommend two archival copies, in addition to any live copies you’re actually working with.  It’s the only way to be certain your footage is protected.

While hard drives are adequate for archival purposes, modern high density drives aren’t actually stable over many years - they need to be powered on and checked for the soft errors described in Part 1 (a process called “data scrubbing”) on a regular basis or risk suffering bit rot from magnetic depolarization, where magnetic patterns slowly equalize with their neighbors.  This can cause widespread data corruption, which means the data should be read in its entirety once a year or so.  Hard drive archives are also still susceptible to physical damage like dropping or extremes in temperature, which means they can still be considered risky beyond their simple URE values.  If you have multiple copies on individual drives, though, single drives are still better to use than solid state media or RAIDs.

Solid state media, while very reliable in the short term, is completely horrific for long term storage.  Because of how the data is stored in imperfect electrical cells, SSDs and flash media can start getting bit rot within six months of storage - the cells leak over time and the values change, unless they’re kept powered on and go through data scrubbing on a regular basis.  They’re also relatively expensive, in terms of cost per terabyte, so it’s best not to use them for medium to long term archival.

RAIDs also make for poor archival storage, since they tend to be expensive for archives and their speed here is largely useless.  The speed could be useful if you’re accessing huge amounts of your archives on a regular basis - but that would make them more of a live storage system anyway.  Like individual hard drives they need to be powered on and verified with data scrubbing at least once a year, and are susceptible to storage bit rot and physical damage.

Which leaves us with the absolute best media for archives: linear magnetic tape.

With high data capacities, the lowest cost per terabyte, and shelf lives of 15 to 30 years without suffering unrecoverable bit rot, tapes are the ideal archival solution.  If your company or services depend on the safe storage of your own or others’ digital assets, and you’re not using a tape storage solution, you’re gambling with your economic future.  The odds say disaster will happen to you, even if it hasn’t happened yet, and you will lose data if you aren’t taking the necessary precautions.

Dual tape backups, with each backup set stored in a different location, provide enough data stability and reliability that you can assume zero data loss risk, especially if the archive is spot checked on a regular basis.  They’re easy enough to keep around, and really only need to be upgraded on a ‘leapfrog’ basis to newer technologies - i.e. moving your archives from older to newer media once every two generations (LTO-4 to LTO-6, LTO-5 to LTO-7), or about once every 5-8 years.  Leapfrogging generations and updating to newer media makes sure your archive is still accessible in the future, and lets you use new media to reset your shelf life clock.

But why new media?  Can we repurpose old media?  Sure, you CAN repurpose old media, but you probably shouldn’t.  Reusing LTO tape media is usually fine, but the more often it’s been used, the more likely it is to be suffering physical wear and tear, and the more likely it is to have a shortened life span.  Reusing hard drives without reconditioning them is a little more problematic since the existing magnetic patterns on the disk reduce the stability of the storage in the long term and make it more likely to depolarize.  And any hard drive that’s ever been used for editing should be considered extremely unstable since it’s undergone a lot of wear and tear.  Reused and repurposed RAIDs tend to be more stable because of their redundancy, and are perfectly fine as intermediate devices, but again, they should not be used for permanent archives by most organizations.

If you’re gawking at the price tag of creating proper archives, look at any investment in tape archival storage or any other archival solution as a necessary insurance policy, because that’s what it is.  Just like production insurance was a necessary cost in the days of film (and still is) in case something goes wrong, archival costs are simply the costs of doing business in the digital age.


Recommendation #5: Foster a Culture that Allows but Protects Against Human Error

While the previous recommendations all dealt with aspects of technology and the choices made there, this recommendation is more about paradigm and attitude in approaching asset protection.

There’s a weird quirk about human psychology that the more pressure someone feels to “not fail” on a regular basis, the more likely they are to make a mistake.  Which is especially a problem when they’re the only person or process that can prevent a failure.

Which means that often the best way to reduce failure and minimize its effects is to cultivate an environment where the penalties for failure aren’t catastrophic (job loss, for instance).  In order to be the most effective here, your culture needs to avoid aggressively punishing failure, while instituting policies and procedures to both reduce the risk of failure and minimize its effects when failure does happen.

That’s a somewhat abstract concept, so let me give you an example of little procedures designed to minimize human error.

At Mystery Box, as soon as we remove media from a camera, it’s labeled with tape to identify it as ‘hot’.  The tape may only be removed by the individual copying the footage, who does not return the media to the camera crew until after a) the first copy is made to a specific RAID, b) the footage is spot checked for errors and to ensure it appears as expected, and c) a second copy to another RAID is started.

ACs and DITs both keep track of the mags: mags are never returned directly to the camera cart, but are physically handed off with confirmation that the footage is backed up and the mag is ready to be reused.  If at any point there is confusion as to what media was removed or should be used, the DIT or AC will stop and double check the media in question, even if that means stopping the whole shoot for a few minutes to double check assets.
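If you track mags in software rather than (or as well as) with physical tape, the same procedure can be sketched as a small state machine that refuses to skip steps.  This is purely a hypothetical illustration of the hand-off described above, not a tool we use on set: the class names, state names, and the mag label are all made up for the example.

```python
from enum import Enum, auto


class MagState(Enum):
    HOT = auto()                  # pulled from camera and taped as 'hot'
    FIRST_COPY_DONE = auto()      # copied to the first RAID
    SPOT_CHECKED = auto()         # footage reviewed and looks as expected
    SECOND_COPY_STARTED = auto()  # copy to the second RAID is under way
    CLEARED = auto()              # tape removed, handed back for reuse


class Mag:
    _ORDER = list(MagState)  # states in the order they must happen

    def __init__(self, label: str) -> None:
        self.label = label
        self.state = MagState.HOT

    def advance(self, new_state: MagState) -> None:
        """Move to the next step; raise if a step would be skipped."""
        position = self._ORDER.index(self.state)
        if position + 1 >= len(self._ORDER):
            raise ValueError(f"{self.label} is already cleared")
        expected = self._ORDER[position + 1]
        if new_state is not expected:
            raise ValueError(
                f"{self.label}: expected {expected.name}, got {new_state.name}"
            )
        self.state = new_state

    def can_return_to_camera(self) -> bool:
        return self.state is MagState.CLEARED


mag = Mag("A001")  # hypothetical mag label
for step in list(MagState)[1:]:
    mag.advance(step)
print(mag.label, "ready for reuse:", mag.can_return_to_camera())
```

The point of raising an error on an out-of-order step is the same as the ‘stop and double check’ rule: any confusion halts the process instead of letting a hot mag slip back to the camera cart.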

While this procedure seems a little cumbersome, and can be frustrating at times when the shoot is brought to a halt waiting on media, it protects against a whole host of human error and other technical problems that could otherwise appear.  Here’s a short list of things it helps prevent:

  • Copying the wrong footage

  • Reformatting camera media with ‘hot’ footage

  • Camera setting or camera hardware errors

  • Transfer errors

  • Camera media errors

  • Accidental deletion of media from one or more RAIDs

  • Storage media errors (or failure)

Having a checklist of steps to follow makes sure that with all of the many things that could go wrong, even things with the smallest probabilities of errors, we’re constantly checking to make sure things are going right.

The risk of human error in copying the wrong footage, accidental deletion, or reformatting the media before it’s been copied is kept in check by the focus on the step by step procedure: if you follow the steps it becomes really easy to see exactly what has and has not been copied.  And the culture says that if there’s even the slightest amount of doubt, stop and double check.

With our DIT procedure, coupled with regular communication with the camera crew to ensure the completeness and accuracy of the footage, we can almost completely eliminate human error from the transfer process.  And when it does crop up (and it does!), it’s buffered in a way that keeps our digital assets safe and secure.

By having multiple people work together on ensuring asset safety and security, we create a culture of trust. Each team member is committed to doing their best, but also knows that when they make a mistake it’s not going to be the death of the production, the company, or their career.  We’re encouraged to avoid error, but not terrified of failure.

There are a lot of little things that can be done throughout a full post-production workflow to minimize human error, far more than we can list here, and the solutions that work for us may not work perfectly for you.  But so long as you look at each step critically (i.e. imagine what can go wrong in the worst possible ways) and set up a procedure for storage and archival that follows the other four recommendations, you’ll find ways to hedge your bets against human error for your organization, while putting yourself in a near error-free starting position.


Let’s Recap

In summary, our 5 recommendations for protecting your digital assets, in the transfer process and throughout post, are:

  1. Using manufacturer recommended hardware for speed and reliability.

  2. Offloading your assets immediately to a RAID 5 or 6 to reduce the odds of loss from corruption, hardware failure, or damage.

  3. Creating two copies immediately before wiping your camera cards, on different storage devices, and expanding that, as soon as possible, to two archival copies that are never touched.

  4. Using new magnetic tape for archival backups, or new hard drives if tape is not an option, for the longevity and recoverability of magnetic media.

  5. Creating a culture that allows for, but protects against, failure.

Following these best practice recommendations for protecting your digital assets will bring your odds of data loss to effectively zero.


What About Checksums?

With all this talk of data loss and storing multiple copies of assets, it’s easy to imagine that this, finally, is where hashing and checksums make a difference!  If my archival media has failed, or I want to make sure a data set hasn’t changed, I can use the checksums or hashes I created to verify.

Yes, you can.  Many insurance policies require it as the primary means of protecting your footage.  But checksums probably won’t be the first to tell you if your footage is corrupt.  As we mentioned in Part 1, the device itself will usually report that the data can’t be read before your checksums and hashes ever fail.  Unrecoverable data corruption from bit rot will slow or stop copy operations, while silent data corruption due to controller failure (which is incredibly rare but does happen) presents itself as garbled file names and other metadata attributes along with problems mounting the file system.

On the other hand, just batch rerunning the checksums on a set of footage forces the drive to read all of that data, which will detect and correct soft errors from bit rot.  So while the checksums themselves don’t necessarily have much benefit, running a full checksum or hash operation does because it will clean soft errors.  RAIDs often come equipped with a data scrubbing tool that does the same thing, though - you’ll almost certainly find one on most NAS units - and running it on a monthly schedule is also a good way of verifying data integrity and recovering from any soft errors.

That said, if you’re using new media for your archives and only writing to it once, the odds of soft errors drop by a few orders of magnitude, so it’s not something you usually have to worry about.

Still, there are times when we use hashes or checksums and they’re invaluable.  Checksums are great when we have two files that look identical (i.e. have the same filename) but we aren’t certain if they are identical.  Sometimes the date modified may be different because of a file system error (like a mismatch between a MacOS system and a Linux based RAID).  Sometimes it’s because we’re pretty sure we rendered out multiple copies of a file to different locations but can’t remember if we used unique filenames.  Rarely it’s because we’re pretty sure we trimmed a RAW file and aren’t sure if the one we’re looking at is the full version or not.

In all of these cases, running a hash is a really fast way to actually see if two files are the same file or not.  On MacOS we use terminal and run the command:

    > md5 /Volumes/Filepath/Filename

We usually do this just by typing “md5” and dragging the file from Finder into the terminal window.  On Windows we use an open source program called “Checksum Compare” which will run and compare the hash values of two files, or all files within two folders.
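If you’d rather have one approach that behaves the same on both platforms, the comparison can be sketched in a few lines of Python using the standard `hashlib` module.  This is only a minimal illustration, not the tool we use: the function names `file_md5` and `same_contents` are made up for this example, and the 1 MB chunk size is an arbitrary choice that keeps large clips from being loaded into memory all at once.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read until read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(path_a, path_b):
    """True when both files produce the same MD5 digest."""
    return file_md5(path_a) == file_md5(path_b)
```

Calling `same_contents()` with the two file paths answers the “are these really the same file?” question directly, regardless of filenames or modification dates.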

Or if you’ve generated a media hash list (MHL) from original media, you can rerun the checksums on an individual folder and compare them to the original hashes.  If you want to prepare an MHL for this, we recommend using MD5 over SHA1 or SHA256, simply because it runs significantly more quickly, and all three will produce a different hash when you’re dealing with random changes.  xxHash is even faster on high speed data access devices so it’s a good option too, but it’s not cryptographically secure.
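To make the idea concrete, here’s a rough Python sketch of building a hash list for a folder and checking the folder against it later.  It is not the real MHL file format (which DIT tools write for you); it’s a plain dictionary of relative paths and hashes, and the names `hash_file`, `build_manifest` and `verify_manifest` are hypothetical helpers invented for this illustration.

```python
import hashlib
from pathlib import Path


def hash_file(path, algorithm="md5", chunk_size=1024 * 1024):
    """Hash one file in chunks with the named hashlib algorithm."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder, algorithm="md5"):
    """Map every file's path (relative to folder) to its hash."""
    root = Path(folder)
    return {
        path.relative_to(root).as_posix(): hash_file(path, algorithm)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(folder, manifest, algorithm="md5"):
    """Return (missing, changed) lists of relative paths."""
    root = Path(folder)
    missing, changed = [], []
    for relative, expected in manifest.items():
        target = root / relative
        if not target.is_file():
            missing.append(relative)
        elif hash_file(target, algorithm) != expected:
            changed.append(relative)
    return missing, changed
```

Build the manifest once from the original media, keep it with the archive, and run `verify_manifest()` against any copy you want to check; two empty lists mean every listed file is present and unchanged.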

Checksums and hashes are actually invaluable to cryptography (such as making secure internet connections) and to verifying that a file you download hasn’t been tampered with (called fingerprinting) - this is where they really shine.  SHA256 is the current gold standard for these hashes, since both MD5 and SHA1 are predictable to the point that someone making non-random changes could in theory alter the contents of a file without changing its hash.  But for media footage verification, these are a little overkill.


What Should You Do Instead?

Are there cases where checksums will throw up errors if things aren’t working?  Yes.  But the odds of this happening are really, really low, and there are usually other ways of discovering data problems, such as visual inspection.  On the other hand, the risks of loss, damage, and camera or other hardware errors are much, much higher.  The majority of the real risks are reduced simply by having a second copy as soon as possible, and then visually inspecting or spot checking that copy after it’s made.

Most programs aimed at helping DIT work run hashes as a separate operation from the copy operation, and often copy to destinations sequentially.  Let me be clear: any application that increases copy time increases the odds of data loss.  It’s not always by a big amount, but it’s a non zero value that depends on the age and quality of your equipment.  Adding additional read operations to camera media to generate the checksums shortens the media’s life, as does running the checksums on your live or archival copies - again, not by a huge amount but it can add up if you repeatedly recalculate the checksums to verify your footage.

A great choice for balancing speed with checksums is a DIT program like Hedge.  Hedge reads the data from the camera media, briefly caches it to RAM, and then writes it to multiple output locations simultaneously.  At the same time it generates hashes from the RAM cached data.  RAM caching means that it’s capable of creating multiple copies and a media hash list faster than Finder or Windows Explorer can make a single copy.  That’s a program with real value.
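The general read-once, write-many idea can be sketched in Python as below.  To be clear, this is not how Hedge is implemented (we don’t know its internals); it’s a simplified illustration in which the hypothetical function `copy_with_hash` reads each chunk from the source a single time, hashes it from memory, and writes it to every destination in turn rather than truly in parallel.  The 8 MB chunk size is an arbitrary assumption.

```python
import hashlib
from contextlib import ExitStack


def copy_with_hash(source, destinations, chunk_size=8 * 1024 * 1024):
    """Read source once, write each chunk to all destinations, return the MD5."""
    digest = hashlib.md5()
    with ExitStack() as stack:
        reader = stack.enter_context(open(source, "rb"))
        outputs = [stack.enter_context(open(path, "wb")) for path in destinations]
        for chunk in iter(lambda: reader.read(chunk_size), b""):
            # The hash is computed from the chunk held in memory,
            # so the camera media is only read a single time.
            digest.update(chunk)
            for output in outputs:
                output.write(chunk)
    return digest.hexdigest()
```

Note that the hash returned here describes the data as it sat in memory, not what actually landed on each destination, so confirming a copy still means reading the destination back and hashing it.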

The caveat here is that using a RAM cache actually increases the odds of a copy error, unless you’re running it on workstation class hardware with error corrected RAM (ECC RAM).  Not by a small amount either - the probabilities of error go up by several orders of magnitude to the point where they can’t be ignored.  Which means if your copy application is using a RAM cache, suddenly checksums add real value in verifying the copy went well!

Once things are copied (and maybe checksummed) you NEED to spot check the footage (specifically when you’re doing the initial transfers, not every copy you make from here to eternity).  Look at every few clips, scrub through them in the RAW viewer or QuickTime or DaVinci Resolve or whatever will play them back, and make sure that everything looks right.  
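If it helps to make “every few clips” a habit rather than a judgment call, a short script can hand you the list of clips to open.  The sketch below is only an illustration: `clips_to_spot_check` is a hypothetical helper, and the file extensions and the every-third-clip interval are assumptions you’d adjust for your camera and your comfort level.  It does no checking itself; the actual inspection is still you watching the footage.

```python
from pathlib import Path


def clips_to_spot_check(folder, every=3, extensions=(".mov", ".mxf", ".r3d")):
    """Return every Nth clip (sorted by path) found under folder."""
    root = Path(folder)
    clips = sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in extensions
    )
    # Slicing with a step starts at the first clip and skips evenly from there.
    return clips[::every]
```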

The overwhelming majority of hardware related errors aren’t transient, meaning that when they happen, they happen over and over again.  This includes times when the camera, card, or card reader are malfunctioning.  So by spot checking your footage you should be able to quickly notice any block artifacting, major encoding errors, bad camera settings, or otherwise damaged or faulty footage and take the appropriate action.

If your spot check fails, you’ll need to start troubleshooting asap, which usually follows a few steps (stop once the problem is isolated):

  • Double check that you see the same error reading the file directly from the card (rules out the very slim chance of a data transfer error)

  • Change card readers, or try playing back the footage from the card on the camera. (rules out the card reader)

  • Format a new card and record a few clips to the new card.

    • If this new footage is clean, try fully formatting and recording on the first card again to see if you get errors. (rules out the camera, points to the card)

    • If this footage isn’t clean, start trying to diagnose the camera (rules out the card, points to the camera)

  • Download and install the latest camera firmware and repeat the footage test. (corrupt firmware is a very likely culprit - reinstalling the firmware can fix errors and there may be stability updates)

  • Contact camera technical support. (at this point there’s not much more you can do)

Since the most likely culprits are the card or the camera, checksumming without a visual inspection can lead you to have hours or days of footage captured with errors, while giving the impression that everything is okay!

Following these best practices will keep you safe from all kinds of problems with your footage, and the ugly beast of human error.  But for all of your planning and preparations, on a very rare occasion, catastrophe does actually strike.  What happens then?  That’s what we’ll be talking about in Part 3.


Postscript: Miscellaneous DIT Advice

I tried to find a good place to work this into the rest of our discussion, because since publishing the first blog post in the series it’s come to my attention that a few general practices I’d assumed were common aren’t.  Since the rest of this post is about protecting assets through the whole post pipeline instead of DIT specifically, they didn’t fit.  So without going into too much detail on the reasons why, I wanted to include a few bullet point concepts that are important to DIT in general.

Use equipment you trust - Whether you’re using your own equipment or renting from a rental house, it’s important to trust that the equipment you’re using is in good repair.  Make sure there aren’t superfluous programs running and the equipment is kept clean.  Test it before use to make sure the ports you’re using provide enough power to peripherals, aren’t full of dirt, and make solid connections.  Do a couple of video render tests to make sure there aren’t bugs lurking because of something someone else has done.

Use workstation class equipment with ECC RAM - When possible, use workstation class equipment with error corrected RAM, such as a Mac Pro or a Windows workstation / workstation laptop.  On the Intel side of things look for computers with Xeon processors, or their AMD equivalents - these are designed for ECC RAM and are built to a tighter specification than the consumer class processors and RAM, making them that much more reliable.

Keep things cool - Heat is the number one enemy of computers and storage devices.  Keep your equipment out of the sun, in an air conditioned room if possible, or with sufficient airflow around the devices to keep them cool on location.  Keeping your equipment clean increases its ability to dissipate heat and reduces wear and tear, so if you’re using your computers in dusty or outdoor locations make sure to clean them inside and out on a regular basis.

It’s okay to reuse the RAID you dump to - I mentioned this briefly above, but it’s good to reiterate.  You can trust RAIDs more in reuse than single hard drives.  Don’t reuse hard drives as dump drives, but reusing RAIDs is okay because of the lower individual drive loads, higher speeds, and protection benefits of parity data.

Avoid strong electrical or magnetic fields - This means high voltage transformers and magnetic ballasts and the like.  Keep your computer equipment away from them, and on a separate electrical circuit if possible.  Strong electric fields can induce charges in your computer or solid state media and cause wonky behavior or fry chips; strong magnetic fields can depolarize hard drives causing the data to become corrupt.

If you do have to go near these things, keep your equipment grounded; if for whatever reason your power source isn’t grounded, setting up a ground is as simple as attaching a length of bare copper wire to an unpainted piece of metal, and partially burying the wire in the ground.  Consider adding mesh metal cages connected to the ground around sensitive electronics to act as a Faraday cage.  The metal walls and back of a grounded RAID unit are usually an okay Faraday cage, but external drives often have enclosures that aren’t connected to ground and allow strong electromagnetic fields through them (or use the computer as ground, which limits the energy dissipation).

Everything goes on a UPS - Unless you’re running ultraportable setups with everything bus powered off of a laptop, every piece of equipment, reader, or storage device hooked up to your DIT setup needs to be powered through a UPS.  This will reduce the risk of data loss or corruption from electrical problems, which is especially needed when running on generator power.  I’m not talking only about loss of power events, but power spikes and brownouts that are common as lights fire up and turn off as well - these can really hurt hard drives and solid state cards if they’re not properly isolated, and a UPS acts as an electrical “condom” between you and the rest of the power draw on set.

Don’t use your DIT hardware for other tasks - If you specialize in DIT, don’t use those computers or hardware for any other purpose, especially editing or playing games.  You want to keep the wear and tear on the processor, RAM, and especially the ports to a minimum so that they don’t start misbehaving sooner than they otherwise would.  If you’re running DIT for yourself, this may be unavoidable; just be aware it carries some additional risk particularly when you’re not working on workstation class hardware.

Don’t render and copy at the same time - This is especially true when your copy target and your render target are the same device, or when you’re reading from the device you’re copying to during render, but it also applies generally.  When you’re copying you don’t want any other application competing for resources, especially for the drive or RAID you’re writing to, which will slow the copy down.  Memory leaks or other secondary problems from non-copy related tasks can lead to data corruption on transfer (again, this is less of a problem on modern and workstation class hardware).  But perhaps more importantly, writing to a drive from two sources at once (the camera card and the render) ends up splitting the blocks of data on the drive, which reduces the possibilities of data recovery if things go way wrong (which we’ll talk about in Part 3).

Never fill a hard drive more than 80% full - This applies to standalone drives and RAIDs.  It sounds silly, BUT when you fill a hard drive more than 80% of its storage capacity, the actual odds of failure, data corruption, or loss start jumping up a lot, and keep increasing the closer to 100% it gets.  It’s weird, but it’s true.  Plan your storage to avoid more than an 80% fill to keep your odds of data loss way down, especially if you’re planning on individual hard drives as archival backups.  LTO tapes don’t have this problem, though, so it’s fine to fill them as much as you can for your archive.
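A quick way to plan around the 80% rule is to check a drive before each offload.  The following Python sketch uses the standard `shutil.disk_usage` call; the helper names `fill_fraction` and `fits_under_limit` are invented for this example, and the 0.80 default simply mirrors the rule of thumb above.

```python
import shutil


def fill_fraction(path):
    """Fraction of the volume containing path that is currently used (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def fits_under_limit(path, incoming_bytes, limit=0.80):
    """True if writing incoming_bytes keeps the volume at or below the limit."""
    usage = shutil.disk_usage(path)
    return (usage.used + incoming_bytes) / usage.total <= limit
```

Pass the mount point of the RAID or drive along with the total size of the cards you’re about to offload; if `fits_under_limit()` returns False, it’s time to move to a fresh volume.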
 

Written by Samuel Bilodeau, Head of Technology and Post Production