Resolving Post Production Bottlenecks

Our first challenging 8K project involved color grading and mastering footage from our trip to Peru.  Prior to that trip, we’d done some 8K work and had been building and testing our tools to make sure we were 8K ready.  The Peru project was the first full test of the system, since it required ‘the full treatment’ of in-depth color correction for HDR, plus a number of other post enhancements and manipulations before we could compress and upload the final project.  After the edit was locked it took around 90 man-hours and 120 computer hours to move from proxies to completed video.

As expected, working on the Peru video exposed a number of bottlenecks we hadn’t planned for.  One of them was in DaVinci Resolve.  We’d graded 8K footage before, but it was largely light-controlled studio work that allowed for copy and paste operations.  But the diversity of light conditions and natural lighting of the Peru footage meant that nearly every shot required a many-node correction tree to match our usual quality.  While DaVinci Resolve had performed adequately for a few nodes when operating in 8K, playback slowed to a nearly unresponsive crawl as we added more correction nodes.  This was despite the fact that our computer has 24 physical cores on two processors, 128 GB of RAM, and dual K6000 graphics cards capable of 10 TFLOPS of graphics processing.

After running a few tests we found that the main culprit for the slowdown was the graphics processing power on playback - we were simply asking more than even those beefy, two year old cards could handle.  That latency added probably 10-20 hours worth of waiting during color correction overall when we account for the added time needed to see the effect of a correction over even a few frames or when tracking movement.

Nvidia Quadro P6000s in our Windows-based Render Rig

Knowing it was a graphics power issue, we brought in two of NVIDIA’s latest workstation cards, the Quadro P6000s, to test.  Each P6000 is capable of 12.5 TFLOPS of processing (compared to a K6000’s 5 TFLOPS per card).  On paper, it would give our system a 2.5x graphics boost.

Let me just say, right now, the P6000 is a phenomenal card.  If I were building another workstation right now, that’s what I’d put in it, no question.  In performance testing they regularly beat our K6000s at playback frame rate and the number of nodes during grading in DaVinci Resolve by a factor of 2, almost exactly as expected.

And then we didn’t buy them.

Why not?  Simple.  We ran our tests the week after NAB, where Blackmagic Design had announced a new version of DaVinci Resolve, version 14, with ‘improved playback performance’.  Where version 12.5 always played back at the selected quality level and at the highest color precision in the color pane, in version 14 the redesigned playback engine seamlessly dropped the perceptual quality ever so slightly so that while you couldn’t see a difference (you can see it on the scopes) the program would get much closer to real time playback.

That change made a huge difference in the program: while the real time performance and analysis of the P6000s outpaced the K6000s in version 12.5, in version 14 the performance difference was a meager 10% - the bottleneck had largely been shifted to another point in the system.  And as for full quality output rendering, well, with both the K6000s and the P6000s, render output was limited by the disk write speeds and not by how quickly the computer could process the footage.

Just like that, the upgrade was made pointless for us because a software change moved the bottleneck.  Better to save the $4,000 upgrade cost for something else.

Every system has one or more bottlenecks - the factors that limit all other operations or functions and control the maximum speed at which things can happen.  This is true in every aspect of life, whether we’re talking chemistry, physics, biology, human resources, a film set, or editing and grading footage in post-production.

We’re not going to get into the bottlenecks in film production here since they tend to have a variety of causes and are often unique to the type of production you’re working on or the companies or individuals involved.

Instead we want to look at finding bottlenecks in post-production, understanding how each one can limit the speed at which you can work, and recognizing when simple or inexpensive fixes can increase your productivity.

Broadly speaking, all bottlenecks in post fall into the following categories: storage device speed, storage transfer speeds, peripheral transfer speeds, processing power (CPU and GPU), software architecture, and workflow.


Hands down the most common bottleneck in post production, and the one that’s at times the easiest and at other times the hardest to fix, is the problem of storage access: how quickly we can read or write the files we’re working with.  The jump from 1080 to UHD resolution meant a jump of four times more pixels per frame, and the jump from UHD to FUHD (8K) did it again, something you’ll see very clearly when working with intermediate codecs like ProRes.  These jumps mean that storage devices that were plenty quick with 1080p footage, like say an external hard drive, may suddenly not be fast enough when working with 4K intermediates instead.

The simplest way of increasing your storage speed and eliminating storage device speed bottlenecks is to switch to faster storage devices.  This means moving from one hard drive to two, or to an SSD, or from an SSD to a larger or faster RAID device.  But when you push against the limits of what RAIDs are available or affordable, you can quickly find yourself out of luck.  Let’s break down how each of these types of devices creates bottlenecks.

The slowest storage device you can use for editing or intermediate work is a single hard drive (internal or external).  Of all of the storage types, a single hard drive will take the most amount of time to read or write data from disk, and experience the highest amount of latency.

Latency is the fancy term for the amount of delay a device or system has between when you ask it to do something and when it actually does it.  That delay can come from a variety of places.  Take a hard drive for instance.  Hard drives have moving parts: a spinning set of disks called platters that can only spin so quickly, and a moving arm that has to physically move across those platters to find and read the information you’re asking for.  The movement takes time.  And while it is incredibly fast (faster than your eye can track it), it’s far slower than something like an SSD, which is all electronic and doesn’t have to move anything.

Which means that in a practical sense, a single hard drive usually tops out (right now) at around 230 MB/s of read or write speed, theoretically at least.  Cheaper drives are often a lot slower, and other factors can slow things down too, like how much data is on the disk, the disk’s spindle speed (RPM), cache size, and whether you’re doing a lot of reading or writing from different files.

It’s when you’re jumping around between files that latency with hard drives really starts to pose a problem.  The small amount of time it takes to move the arms back to the file system and cue up the next file makes working with image sequences or just fast paced video editing extremely slow and largely unbearable - much slower than the listed sequential read or write speeds of the drive.

A faster option is a solid state drive.  Solid state drives have no moving parts so the latency is minimized to just the time it takes to send and receive commands.  Consumer SSDs usually offer read speeds around 450-550MB/s while pro grade SSDs can top 1.2 GB/s, and the very low latency to find and cue up files means that they feel even faster when using them, especially when compared to a hard drive.

Various external storage options for video, including single hard drives, single SSDs, dual hard drive RAID 0s, and an 8 hard drive RAID 5.

On the other hand, they cost a lot more than hard drives per terabyte of storage, meaning that they aren’t always the most economical choice when dealing with the large sizes involved with video files.  But for external edit drives they can be relatively inexpensive.

If you’re looking for more capacity or higher speeds than a single disk or SSD can offer, local or network attached RAIDs are the next step up.  A RAID takes two or more hard drives or SSDs (or a mixture of both!) and combines them to speed up data access (or increase data reliability, or both).  A solid rule of thumb is that the more disks a RAID uses and the higher the disk capacities are, the faster it can be.

Locally attached RAID arrays, using current hard drives, can get read or write speeds of 700MB/s - 1.2 GB/s fairly easily when doing simple data stripes.  Adding a RAID 5 or RAID 6 level of data protection reduces that speed, and the controller may run a little slower too, but generally a local RAID offers a single user high enough speeds for most kinds of video work.
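As a rough way to sanity check those numbers, here is a minimal sketch of the idealized math behind a stripe.  It assumes sequential throughput scales with the number of data-carrying drives, that RAID 5 gives up one drive's worth to parity and RAID 6 two, and it ignores controller and file system losses entirely.  The function name and the 230 MB/s per-drive figure (the single-drive ceiling mentioned earlier) are illustrative choices, and the results are upper bounds rather than benchmarks.

```python
def ideal_stripe_mb_per_s(drives, per_drive_mb_per_s, parity_drives=0):
    """Idealized sequential throughput of a striped array, in MB/s.

    parity_drives: 0 for a plain stripe (RAID 0), 1 for RAID 5, 2 for RAID 6.
    This is an upper bound; real arrays lose speed to the controller,
    parity calculation, and non-sequential access.
    """
    data_drives = drives - parity_drives
    if data_drives < 1:
        raise ValueError("array needs at least one data-carrying drive")
    return data_drives * per_drive_mb_per_s


# Assumed per-drive speed: the ~230 MB/s single-drive ceiling from the text.
PER_DRIVE = 230

print("4-drive stripe:", ideal_stripe_mb_per_s(4, PER_DRIVE), "MB/s")
print("5-drive RAID 5:", ideal_stripe_mb_per_s(5, PER_DRIVE, parity_drives=1), "MB/s")
print("8-drive RAID 6:", ideal_stripe_mb_per_s(8, PER_DRIVE, parity_drives=2), "MB/s")
```

Under these assumptions a four drive stripe lands inside the 700 MB/s - 1.2 GB/s range quoted above, and adding a parity drive costs roughly one drive's worth of speed.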

When multiple users need access to the same data though, you’ll probably want to move the RAIDs from being directly attached to a single computer to being attached to the network (NAS) or set up as a storage area network (SAN).  But while increasing accessibility, shared storage introduces the increased latency of seeking and finding the information for multiple users - a good NAS can support 20 or so light users, or only a handful of power users; the upward limit of speed and capacity for SANs is really, really high - so high that often your computer can’t talk to the device(s) as fast as they can read or write, even with many users.

But the trade off with both quality NAS and SAN storage (especially SANs) is that they are quite expensive for small production houses, which is why we say that storage is a bottleneck both easy and near impossible to fix.

Generally a good set of rules for storage access to eliminate bottlenecks (or to keep them under control) is:

Hard drives: Use for proxy files (1080p60 max), audio files, and to transfer files between locations.
Solid State Drives: Use for fast offloads, caches (e.g. After Effects’ render cache), and system drives.
RAIDs: Use for RAWs, uncompressed intermediates, shared storage, and any operation that needs speed.  Upgrade to network attached or SAN storage for higher speeds and capacities.

This is why our 8K renders from DaVinci bottleneck on storage: 10-bit DPX at 8K resolution has a data rate of 7.9 GB/s at 60p.  And while we know of and can design RAID storage systems that can sustain those data rates, it’s not practical.  In part because of how ludicrously expensive it would be, and in part because we wouldn’t have a way to talk to it that quickly.  Because how fast you can talk to things also makes a difference.
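For anyone who wants to check that figure against their own formats, here is a minimal sketch of the arithmetic.  It assumes 10-bit RGB DPX packed as three 10-bit samples in one 32-bit word (4 bytes per pixel), ignores file headers, and uses decimal gigabytes; the function name is just for illustration.

```python
def dpx_data_rate_gb_per_s(width, height, fps, bytes_per_pixel=4):
    """Uncompressed image sequence data rate in GB/s (decimal).

    bytes_per_pixel=4 assumes 10-bit RGB packed into a 32-bit word.
    Per-file header bytes are ignored, so this slightly understates
    the real rate.
    """
    bytes_per_frame = width * height * bytes_per_pixel
    return bytes_per_frame * fps / 1e9


rate_8k60 = dpx_data_rate_gb_per_s(7680, 4320, 60)
rate_uhd60 = dpx_data_rate_gb_per_s(3840, 2160, 60)
rate_hd60 = dpx_data_rate_gb_per_s(1920, 1080, 60)

print(f"1080p60: {rate_hd60:.2f} GB/s")
print(f"UHD 60p: {rate_uhd60:.2f} GB/s")
print(f"8K 60p:  {rate_8k60:.2f} GB/s")
```

Each step up in resolution quadruples the rate, which is why a drive that was comfortable at 1080p falls over at UHD, and why 8K at 60p lands just under 8 GB/s.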


How quickly a device can read or write data is a fairly easy and obvious first place to start looking for bottlenecks.  But sometimes it’s not about how fast a storage, input/output, or peripheral card can actually do its job, but about how fast your computer can actually talk to that device.

The easiest place to see the difference is, once again, with storage.  Between USB, Thunderbolt, or Ethernet among others there are lots of different ways to talk to your storage devices, and they too can create bottlenecks.  Let’s look at how those choices affect performance.
 

The universal serial bus, or USB, connectivity standard is the most ubiquitous connector and set of standards on the market today.  It’s designed to be a simple, universal standard, that will work with everything from keyboards to phones, wifi cards, video cameras, and hard drives.  With each generation it’s gotten faster, and that’s helped with data storage access:

Standard    Max Data Rate (bits)    Max Data Rate (Bytes)
USB 2.0     480 Mb/s                60 MB/s
USB 3.0     5 Gb/s                  625 MB/s
USB 3.1     10 Gb/s                 1.25 GB/s

There are a couple of things to notice about those values.  First, USB 2.0 is horrifically slow for video work, so double check what kind of port you’re plugged into or what the storage device is using.  Second, USB 3.0 provides faster speeds than all single hard drives, many small RAIDs, and most SSDs, meaning that you might not always need faster connections (like a Thunderbolt connector on a single hard drive).

On the other hand, those values are still only theoretical maximums.  There’s some data overhead because of ways the computer talks to the device through the controller and over the cable.  Add in latency between the host issuing a command and the device responding and you end up with a pretty hefty speed penalty: maximum real-world speeds usually perform at 80-85% of the “line speed” (15-20% overhead).

Latency delays over USB become problematic when dealing with small data transfers, like moving or accessing many photographs or image sequences, or when video editing; in our experience operations like that can push the overhead as high as 50-70%.  With USB 2.0 that’s a major problem, and while it’s less of an issue with USB 3.0 and 3.1, it’s still a factor to consider when looking at what could be creating a bottleneck.
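To put those percentages into numbers, here is a small sketch that applies the overhead figures quoted above to each USB line rate.  The 15-20% range is the usual case and the 70% figure is the worst case we mentioned for lots of small transfers; the function and dictionary names are illustrative only.

```python
def effective_mb_per_s(line_rate_mbps, overhead_fraction):
    """Usable MB/s after losing overhead_fraction of the line rate.

    line_rate_mbps is the line speed in megabits per second.
    """
    return line_rate_mbps / 8 * (1 - overhead_fraction)


USB_LINE_RATES_MBPS = {"USB 2.0": 480, "USB 3.0": 5000, "USB 3.1": 10000}

for name, rate in USB_LINE_RATES_MBPS.items():
    light = effective_mb_per_s(rate, 0.15)   # best typical case
    usual = effective_mb_per_s(rate, 0.20)   # ordinary large transfers
    heavy = effective_mb_per_s(rate, 0.70)   # many small files
    print(f"{name}: {usual:.0f}-{light:.0f} MB/s typical, "
          f"about {heavy:.0f} MB/s with many small files")
```

On USB 2.0 that worst case leaves well under 20 MB/s, which is why image sequences over an old port feel so painful.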

USB speeds are cut even further by a process called bandwidth sharing.  On most computers the USB ports share a single controller, and so the available speed is shared among all of the different ports and any other hubs attached to the system.  The more USB devices plugged in and in use, the slower each device will run.  When you have a system with multiple controllers you can often improve USB speeds by splitting devices between the controllers, but this is usually not an option.

Overhead and bandwidth sharing are two really important things to remember when dealing with any kind of connection - we’re going to see them show up again and again.

There are ways to reduce their effects though.  One older competitor to the USB standard for data access was FireWire, also known by the designations IEEE 1394, i.Link and Lynx.  Its major advantage over USB was that it wasn’t a host-device protocol, but a peer-to-peer protocol.  Essentially, the FireWire controllers on both the host and the device are able to talk to each other without much interaction from the computer’s CPU, reducing latency, and the device’s controller operates more independently to further reduce the command overhead.  That’s why old DV cameras and decks used FireWire instead of USB - less overhead, no bandwidth sharing, and low latency meant that you could get much closer to line rates.

Today FireWire is effectively a dead standard, replaced by Thunderbolt for reasons we’ll get into in a minute.  I bring it up though for two reasons: first, you’ll still find it on older archival drives, and second, the comparison between FireWire 400 and USB 2.0 is the perfect example for understanding bottlenecks within protocols.  While all intuition would tell you FireWire would have been slower, since its raw line rate speed was lower (400 Mb/s vs USB’s 480 Mb/s), it ended up outperforming USB for video because of its higher efficiency and lower penalties when under load.
 

Today’s fastest common direct attached standard is Thunderbolt, and it works alongside the USB standards.  Apple and Intel developed Thunderbolt to be a sort of ‘best of both worlds’ between USB and FireWire - plug and play connectivity, low connection overhead, daisy chaining, device variety support, and so on.  But it’s expensive to implement for lightweight devices because it requires active cabling, and doesn’t allow for hubs or port expansion so USB still serves an extremely important role.

NB: With Thunderbolt 3 and USB 3.1 standards, the two standards share a single port, so it’s important to know which protocols are actually available over a USB 3.1 / Thunderbolt 3 connector.  Typically USB 3.1 is available when the host port supports Thunderbolt 3, but a USB 3.1 host will not support Thunderbolt 3 devices.

Thunderbolt is at its core a PCIe bus extension, which means that it gives the CPU direct access to the attached peripherals.  And since PCIe commands are low level commands, you can get really high transfer speeds with very little overhead.  This is a huge advantage.

Version          Standard Max Data Rate (bits)    Standard Max Data Rate (Bytes)
Thunderbolt 1    10 Gb/s                          1.25 GB/s
Thunderbolt 2    20 Gb/s                          2.5 GB/s
Thunderbolt 3    40 Gb/s                          5 GB/s

Except those numbers from the standard aren’t true.  How do we know?  Because it’s a PCIe bus extension with only 2 or 4 lanes of access to the processor, and PCIe has a fixed overhead that’s accounted for in the PCIe specification speeds.  Which means the maximum data rates that you can ever get for Thunderbolt are:

Version          Actual Max Data Rate (bits)    Actual Max Data Rate (Bytes)
Thunderbolt 1    8 Gb/s                         1 GB/s
Thunderbolt 2    16 Gb/s                        2 GB/s
Thunderbolt 3    31.5 Gb/s                      3.94 GB/s
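Those adjusted figures fall straight out of the PCIe lane math, which can be sketched as below.  The per-lane transfer rates and line encodings are the standard ones for PCIe 2.0 (5 GT/s, 8b/10b) and PCIe 3.0 (8 GT/s, 128b/130b).  The pairing of generation and lane count with each Thunderbolt version is an assumption inferred from the table's numbers, not something stated by the standard's marketing material.

```python
# PCIe generation -> (transfer rate per lane in GT/s, line encoding efficiency)
PCIE_GENERATIONS = {
    2: (5.0, 8 / 10),      # 8b/10b encoding
    3: (8.0, 128 / 130),   # 128b/130b encoding
}


def pcie_payload_gbps(generation, lanes):
    """Usable Gb/s across `lanes` PCIe lanes after line encoding."""
    transfer_rate, efficiency = PCIE_GENERATIONS[generation]
    return transfer_rate * efficiency * lanes


# Assumed (generation, lanes) per Thunderbolt version, chosen to match
# the adjusted figures in the table above.
THUNDERBOLT_LINKS = {
    "Thunderbolt 1": (2, 2),
    "Thunderbolt 2": (2, 4),
    "Thunderbolt 3": (3, 4),
}

for name, (generation, lanes) in THUNDERBOLT_LINKS.items():
    gbps = pcie_payload_gbps(generation, lanes)
    print(f"{name}: {gbps:.1f} Gb/s = {gbps / 8:.2f} GB/s")
```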

Even with the adjusted speeds, both Thunderbolt 1 and Thunderbolt 2 typically operate faster than direct attached storage device RAIDs can read and write, which makes it a great way to reduce transfer rate bottlenecks over USB.  BUT, there are other ways Thunderbolt creates bottlenecks too.

Like USB, Thunderbolt suffers from bottlenecks caused by bandwidth sharing.  With the exception of the Mac Pro, any computer with multiple Thunderbolt ports uses a back end bandwidth sharing scheme on the PCIe connection, meaning that the available maximum speed (10 or 20 Gbps for Thunderbolt 1 & 2) is split between the attached devices on a time sharing system.  It’s important to understand a little about how that works.  In essence, if only one device is requesting time on the connection, it’ll get full bandwidth; if more than one is requesting time to communicate simultaneously, they’ll each get their turn and the bandwidth is split.

In this case, because Thunderbolt specifications are so much higher than RAID transfer speeds, when just using Thunderbolt for storage devices bandwidth sharing isn't necessarily a problem - overall you'll usually make better use of the available bandwidth if you execute two operations simultaneously, so that more of the total available bandwidth is in use than if only one device was using it.  Where you can run into major problems is that your per-device bandwidth drops, and if you need to sustain a specific bandwidth to or from a device for performance, playback or capture, bandwidth sharing can effectively cripple a high performance device.

Or to put it simply, when you’re using more than one Thunderbolt device at a time, the overall connection speed stays the same but the amount of speed available to each device drops.  This isn’t necessarily a terrible thing, since their connection speeds tend to be higher than single storage devices can use; but the speed penalties can add up as you use more and more Thunderbolt devices.  To regain speed, use fewer Thunderbolt devices at a time.
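A quick way to check whether sharing will hurt a particular device is to divide the controller's bandwidth by the number of devices talking at once.  The sketch below assumes an equal split between simultaneously active devices, which is a simplification of the time sharing described above, and it uses the 16 Gb/s usable figure for Thunderbolt 2 from the adjusted table.  The 900 MB/s requirement is a made-up example of a RAID that has to sustain playback.

```python
def per_device_mb_per_s(controller_gbps, active_devices):
    """Equal share of one controller's bandwidth, in MB/s, for devices
    that are all transferring at the same moment (simplifying assumption)."""
    if active_devices < 1:
        raise ValueError("need at least one active device")
    return controller_gbps * 1000 / 8 / active_devices


def can_sustain(required_mb_per_s, controller_gbps, active_devices):
    """True if an equal share still covers the rate a device must hold."""
    return per_device_mb_per_s(controller_gbps, active_devices) >= required_mb_per_s


THUNDERBOLT_2_GBPS = 16   # usable rate from the adjusted table
PLAYBACK_NEEDS = 900      # MB/s, hypothetical sustained requirement

for active in (1, 2, 3):
    share = per_device_mb_per_s(THUNDERBOLT_2_GBPS, active)
    ok = can_sustain(PLAYBACK_NEEDS, THUNDERBOLT_2_GBPS, active)
    print(f"{active} active device(s): {share:.0f} MB/s each, playback OK: {ok}")
```

In this example the RAID keeps up with one or two busy devices on the controller and starts dropping frames with a third.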

Distribution of Thunderbolt buses on a Mac Pro.  Each bus supports 20 Gbps of shared bandwidth, for up to 60 Gbps of peripheral access.  Image courtesy of Apple Support.

Even the Mac Pro, with its 6 x Thunderbolt 2 ports, only allocates three controllers, and bandwidth is shared in pairs of ports.  Here, changing which physical port a device is attached to can cut down the bandwidth sharing and improve performance when using multiple external devices.

Daisy chaining may be one of our favorite Thunderbolt features since you can attach lots of devices to a single port.  However, it also introduces some performance penalties that can create bottlenecks.  The most obvious one is the additional bandwidth sharing load.  But in addition to that, the length of the daisy chain increases latency and reduces the speeds of the devices further down the line.

Devices closer to the host have to read each chunk of data that’s flowing through them, to see if the data is meant for them or not.  If it is, the device takes the appropriate action; if not, it retransmits the data to the next device down the line in the daisy chain by recycling (regenerating) the signal it received.

Each device in the chain goes through this same process whenever data is flowing through it in either direction (downstream host to device, or upstream device to host).  As a result, devices further down the line have measurable drops in speed from the increased latency, even when they’re the only device communicating.  Measurements we’ve done put that drop at about 5-7% of line speed per device in the chain.
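Here is a rough sketch of what that penalty looks like along a chain.  It assumes the drop is linear, about 6% of line speed (the middle of our measured 5-7%) for every device sitting between the host and the one you’re talking to, and that the first device in the chain pays no penalty.  Both the linear model and the function name are simplifications for illustration.

```python
def chain_speed_gbps(line_rate_gbps, devices_upstream, loss_per_device=0.06):
    """Estimated rate for a device with `devices_upstream` other devices
    between it and the host.

    loss_per_device is a fraction of line speed lost per upstream device;
    0.06 is the midpoint of the 5-7% range quoted in the text.
    """
    remaining = 1 - loss_per_device * devices_upstream
    return max(0.0, line_rate_gbps * remaining)


LINE_RATE = 16  # Gb/s, usable Thunderbolt 2 rate from the adjusted table

for position in range(4):
    speed = chain_speed_gbps(LINE_RATE, position)
    print(f"{position} device(s) upstream: about {speed:.1f} Gb/s")
```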

To improve daisy chain performance, move more frequently accessed or higher speed devices to be closer to the host, and strategically copy files from devices on separate ports or controllers to minimize the latency issues.  We should note here that Thunderbolt speeds are full duplex, meaning you get the same bandwidth in both directions (10 or 20 or 40 Gbps), at the same time, so you actually don’t get performance penalties from two devices on the daisy chain competing for bandwidth when you copy from one to the other.  But keeping them higher in whichever chain they’re part of does give better overall performance.

So bandwidth sharing and daisy chains affect Thunderbolt performance, but not quite as much as Thunderbolt’s dirty little secret of including DisplayPort video data.  Technically speaking, DisplayPort video data for connecting monitors is required to be available to any Thunderbolt connected device, anywhere in the chain.

A lot of marketing documents (read: almost all) imply that using Thunderbolt you can have both a video signal and a high speed data link at 10 or 20 Gbps.  You don’t.  They share the cable bandwidth, and DisplayPort takes bandwidth priority, which can drop your data speed by up to half.  Meaning that the resolution and framerate of your attached display affects the speed you can talk to peripherals over Thunderbolt on the same daisy chain, and depending on how it's attached internally, on the same controller.  Which means that you could be bottlenecking on a chain of devices if a display is on the same line.  If you think this is a problem in your setup, try attaching the display over HDMI or a dedicated DisplayPort connection when available to see if you gain performance on your Thunderbolt peripherals.
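To get a feel for how much a display can take, here is a rough sketch that subtracts the raw pixel data rate of a monitor from the link rate.  It assumes 24 bits per pixel and ignores blanking intervals and DisplayPort's own encoding, so real usage is somewhat higher than this estimate.  Treating the display's share as a straight subtraction from the marketed link rate is also an assumption.

```python
def display_gbps(width, height, fps, bits_per_pixel=24):
    """Raw pixel data rate in Gb/s; blanking and link encoding ignored."""
    return width * height * fps * bits_per_pixel / 1e9


def data_left_gbps(link_gbps, *displays):
    """Link bandwidth left for data once each (width, height, fps)
    display has taken its share."""
    used = sum(display_gbps(*display) for display in displays)
    return max(0.0, link_gbps - used)


# Example: one 2560x1440 display at 60 Hz on a 10 Gb/s Thunderbolt 1 link.
monitor = (2560, 1440, 60)
print(f"Display uses about {display_gbps(*monitor):.1f} Gb/s")
print(f"Left for data: about {data_left_gbps(10, monitor):.1f} Gb/s")
```

In this example the monitor alone eats a little over half of the link, which lines up with the "up to half" penalty described above.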

All the same cautions and potential bottlenecks are present with Thunderbolt 3 too, but its higher bandwidth means that at the moment it’s not likely you’ll run into them for data transfer to or from hard drives and RAIDs; but as data rates continue to climb it may become a problem in the future.

On the other hand, Thunderbolt 3 still presents a potential bottleneck for high bandwidth external PCIe devices, like a graphics card. We’ll discuss those kinds of bottlenecks in more detail later, but the fact that it only offers four PCIe Gen 3.0 lanes makes it a potential limiting factor when adding external GPUs to a computer.
 

We’re going to move away from talking about storage devices and peripherals directly attached to the host computer and talk about ‘remote’ storage devices, specifically network attached storage (NAS), connected over ethernet.  As soon as you need larger capacities or shareability than direct attached storage offers, network attached storage is usually the first solution, and ethernet is the easiest way to connect to it.

Network Attached Storage in-use at Mystery Box.  NAS #1 is a Synology DS3612xs with DX1211 expansion unit, supporting up to 24 hard drives or SSDs (currently configured with 20 x 8TB HDD + 1 SSD with two disk redundancy for 144TB of Network Storage).  NAS #2 is a QNAP TVS-871T populated with 8 x 6TB HDD with one disk redundancy.  The Synology is used for archival storage while the QNAP is used for active RAW and Intermediate storage.

Ethernet refers to general networking over physical cables and switches as opposed to over wifi.  We’re not going to talk about wifi much here because even the fastest wifi connection speeds available today (802.11ac) are only about 20% faster than USB 2.0, and should not be used for transferring anything other than small files and internet connectivity.

There are a few different factors, some obvious and some hidden, that can affect the speeds of accessing files over a network.  The main factors are network connectivity standard speed, the protocols being used to transfer data, how the storage devices are attached to the network, number of users sharing the device, and how much other traffic is on the network.

Ethernet comes in different connectivity speeds, the most ubiquitous on the market today being Gigabit Ethernet, with a line speed of 1 Gb/s, or about 125 MB/s. Gigabit ethernet is fast enough to handle compressed video files, though it’s easy to saturate shared connections to a storage device attached to the network as a NAS.  For higher bandwidth video use, 10 gigabit ethernet (10GbE) is more of the necessary minimum, allowing for transfer rates comparable to direct attached devices.  There are even higher speed standards like 25GbE, 40GbE and even 100GbE, but since these are rare to find and expensive to implement, we’re going to ignore them for now.

Like Thunderbolt and USB, Ethernet connectivity has connection overhead that reduces the maximum achievable speed from the line speed.  However, because of the way that network connections work, they have a lot more overhead and suffer from a lot more latency than direct attached connections.  Typically overhead and latency can consume as much as 20-40% of your available bandwidth, though in the very best lab conditions we’ve gotten 92% of line speed (but that’s really not typical!).

There are a lot of networking tweaks that you can implement to increase ethernet speed, too many to get into in fact, and most are difficult to implement because there are no simple option sets to turn on or off. But one thing that can really make a difference and that everyone who uses network attached storage should use is called the Jumbo Frame.

Without getting into technical details, ethernet cuts data transmissions up into small groups called frames.  Typically an ethernet frame is limited to 1500 bytes, which is fine for really small files and small data transfers.  But large data files, like video files, can run into latency issues with a standard frame size, since each frame requires processing time by the client, the host, and every switch in between, in addition to the delay waiting for acknowledgement by the other computer that the frame was received correctly.

Enable Jumbo Frame under MacOS by opening your network settings, selecting the adaptor you wish to change, and selecting the advanced settings.  Change the configure mode to manual and switch the MTU to Jumbo.

Most modern network cards and network switches are capable of using what are called Jumbo Frames, frames with up to 9000 bytes instead of 1500.  The configuration of jumbo frames is a little different on each device so you’ll have to look it up for your specific operating systems, NAS devices, and switches.  But making the change on all devices that are accessing or sharing video over a network can often increase transfer speeds by 10 - 20%.  And it’s pretty easy to do - just make sure to do any configuration for it at the switch first before changing the frame size on the computers or you may suddenly lose network connectivity.
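To see why frame size matters, here is a small sketch comparing standard and jumbo frames for a single large transfer.  It assumes plain TCP over IPv4 with no header options (40 bytes of headers inside each frame) and the usual 38 bytes of per-frame Ethernet cost on the wire (preamble, header, checksum, and inter-frame gap).  The function names are illustrative.

```python
ETHERNET_OVERHEAD = 38   # bytes per frame on the wire: 8 preamble/SFD,
                         # 14 header, 4 checksum, 12 inter-frame gap
IP_TCP_HEADERS = 40      # bytes inside each frame: IPv4 (20) + TCP (20)


def frames_needed(file_bytes, mtu):
    """Number of frames needed to carry file_bytes of payload."""
    payload = mtu - IP_TCP_HEADERS
    return -(-file_bytes // payload)  # ceiling division


def wire_efficiency(mtu):
    """Fraction of bytes on the wire that are actual file data."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETHERNET_OVERHEAD)


ONE_GB = 1_000_000_000

for mtu in (1500, 9000):
    print(f"MTU {mtu}: {frames_needed(ONE_GB, mtu):,} frames per GB, "
          f"{wire_efficiency(mtu):.1%} of wire bytes are data")
```

The wire efficiency only improves by a few percent.  The bigger win is that every client, server, and switch handles roughly six times fewer frames for the same file, which is where the rest of the speedup would come from.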

There are a couple of other tweaks that you can do to increase network speeds.  First, reduce the number of hops (switches) between workstations and servers, since each switch adds a small amount of time to forward the data and increases latency.  Second, keep network encryption and authentication to a minimum on your video network since these can really slow things down.  Lastly, if you’re feeling adventurous and still want more speed than a single 10GbE connection can support, you may want to look into bonding or link aggregation, a method that allows a computer or server to use multiple physical ethernet connections to increase overall connection bandwidth.  We’re not going to get into how to do that though because it’s complicated to set up.

Once you’ve set up your network cards and switches to run in the fastest possible configuration, the next place to look for network bottlenecks is at the protocols you’re using to connect to a storage device.  There are four main network protocols you’ll find for accessing video over a network: Server Message Block (SMB, also called CIFS), Apple Filing Protocol (AFP), Network File System (NFS), and iSCSI.

All of the latest iterations of SMB, AFP, and NFS offer comparable speeds under ideal conditions, but they end up having a few “gotchas” in real-world applications.  SMB, for instance, is the connection method of choice for Windows clients to connect to a file server (they can’t use AFP); Macs are also happy to talk to each other over SMB at full speed rates; but trying to get Windows and Macs to talk to each other using SMB usually introduces major speed limitations (that often get worse with each release of one operating system or the other).

Macs also don’t like talking to other file servers over SMB and end up with weird data rate limits.  AFP is the fastest method for a Mac to connect to a file server, and is usually the best choice when it’s available.  Linux prefers to use NFS when talking to other Linux computers, but you have to tell both MacOS and Windows specifically to connect to a server over NFS - it doesn’t show up over Bonjour or in your network shares as an available or default connection.

Because of these weird limitations, we usually configure our network shares to run as both AFP and SMB shares so that both our Windows and our Mac computers can connect as fast as they can with as little headache as possible.  We also have found that because of the direct communication speed problem it’s usually faster to push files from the Mac or Windows machine to the file server first and then pull them down from the other computer instead of moving them directly from one computer to another.

The fourth network protocol is called iSCSI, which can be thought of as making a direct connection over a network.  It’s usually fast, getting closest to the fastest available speeds with minimal overhead.  However, its biggest drawbacks are that it’s difficult to configure, and doesn’t allow for shared write access, since the client computer assumes it and only it has write access - two computers writing to an iSCSI target simultaneously could (probably will) overwrite each other’s data and lead to corruption.  iSCSI is much more useful in deploying virtual servers than for use with video asset sharing.

Once we’ve considered connection types and protocols used, the last main places to look at bottlenecks over a network are the network topology and network traffic.  The network topology refers to how the network is arranged (what’s connected where), while network traffic refers to how many computers are communicating over the portion of the network that the video data is flowing through.

Where possible, arrange your network connections so that the computers accessing a NAS for video data are ‘close’ and somewhat isolated from the rest of the network.  Keep them on the same switch if possible, with a single uplink to the rest of the network.  This helps minimize hops and minimize the amount of network traffic on the switch to primarily just the video data.  Since the video data won’t be competing with other kinds of data, you’re likely to achieve better speeds.

Recommended method for attaching video storage and their clients to a network.  Keep all storage devices and clients on the same switch and isolated from the rest of the network, when possible.

NOTE TO IT/NETWORK PROFESSIONALS:
What I’m recommending is different than the standard Cisco Hierarchical Model of network design.  Network isolation here should not be interpreted to mean virtual LANning but to mean a physical subset of the distribution layer with an isolated access layer that’s adjacent to the main distribution layer so that no data apart from video data and attached client upstream requests flow through this segment.

For all intents and purposes this can be accomplished with a single switch with a single uplink to the main distribution layer, or multiple cross connected switches with multiple uplinks to the main distribution layer, so long as the network path between the NAS and client never requires passing through the main distribution layer.  STP isolation from the segment outwards is recommended to reduce the broadcast of available network services to the main network.  Although this may increase cabling costs and reduce redundancy and availability, the reliability and speed of the video network stub can effectively double.

When considering network design, remember that all users on the network compete for bandwidth to access a NAS.  When a NAS has only one ethernet connection to the network, this can quickly slow down connectivity for all users, even when you’re using 10GbE.

Usually you want the limiting speed factor of any network attached storage device to be the disk array itself, or the controller’s ability to field requests.  For small arrays that’s pretty easy to do with a single 10GbE connection, but for larger arrays, you may benefit from multiple ethernet connections attached to the same or separate switches.  Most NAS appliances have at least two ethernet ports, which allows for multiple connections to the network to divide up disk array bandwidth.  Even something as simple as having your graphic designers connect over 1GbE while your video editors connect over 10GbE, or arranging the network discovery so that half of the video machines connect over one of the 10GbE ports while the other half connect using the other, can make a huge difference in video teams of 5-20 people.  Again, bonding here can help, but configuration can be an issue.
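As a rough illustration of why dividing clients between ports helps, the sketch below splits a link's raw line rate evenly among the clients using it at the same time.  The function name and the even-split model are assumptions made for illustration; real throughput will be lower once protocol overhead and the disk array's own limits come into play.

```python
def per_client_mb_per_s(link_gbps, active_clients):
    """Raw line rate of one ethernet link, split evenly between clients.

    Assumption: every client pulls data at the same time and the link,
    not the disk array, is the limiting factor.
    """
    link_mb_per_s = link_gbps * 1000 / 8  # gigabits/s -> megabytes/s
    return link_mb_per_s / active_clients


# Ten editors sharing a single 10GbE port.
one_port = per_client_mb_per_s(10, 10)

# The same ten editors split evenly across two 10GbE ports.
two_ports = per_client_mb_per_s(10, 5)

print(f"one port:  {one_port:.0f} MB/s per editor")
print(f"two ports: {two_ports:.0f} MB/s per editor")
```

Comparing the per-editor figure against the sustained data rate of your footage gives a quick first check on whether a single port is enough.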

Some NAS devices will even sport many ethernet ports to allow clients to directly attach over ethernet to the device, which is a fantastic option when available.  No switches in between means less latency, which directly translates into higher speeds.  But these solutions often cost a lot more money so there are additional tradeoffs when considering this as an option.

With ethernet connections it’s easy to see how various bottlenecks can add up: the speed of a disk array with multiple users talking to it and competing over the same single connection in and out, coupled with protocol issues, can slow things down a lot.  Or you may be connected to your 10GbE network over Thunderbolt, with its own overhead and latency issues added to ethernet’s latency and speed issues; couple that with multiple hops or a noisy network and suddenly your speed drops out.  Troubleshooting network bottlenecks can be quite tricky.
 

There are a lot of other storage connection technologies that we could talk about, things like the difference between connecting over SATA and SAS, or how Fibre Channel is a networking protocol different than ethernet that is useful for SANs; but the reality is if you’re really concerned about these things, you need to hire a professional to take a look at your infrastructure and plan things out, or talk to your IT department and walk them through the sustained bandwidth requirements on a per-client basis for video work and find the best solution for your company.
 

Up until this point we’ve been focusing on storage device and connectivity speeds as bottlenecks.  There’s a reason for this: those are usually the main bottlenecks most people encounter on a regular basis, and should be the first concern for where to eliminate bottlenecks when you’re worried about them.  But they’re not the only places where bottlenecks happen.

We’re going to go into even less detail on the root causes of these bottlenecks than we have up to this point, because they get really technical really quickly.  Generally, however, there are three kinds of bottlenecks we want to look at related to your equipment: peripheral transfer rate bottlenecks, processing power bottlenecks, and software bottlenecks.


When talking about peripherals causing bottlenecks, we’re going to focus on two main parts of the computer: the PCIe Bus, and the RAM.
 

How various peripherals are attached to the CPU.  This is based on a System on a Chip (SoC) design, largely the one found in the Intel Skylake / Kaby Lake architecture.

The Peripheral Component Interconnect Express (PCIe) bus is the central hub system that the computer processor uses to talk to everything else attached to it, with the exception of the RAM.  Devices connected to the PCIe bus are either directly wired into the motherboard of the computer, or attached as expansion cards into card slots.

When dealing with laptops, all in ones like the iMac, and compact towers like the MacPro, essentially all PCIe components, including the graphics cards and all connection ports like USB and Thunderbolt, are directly connected as part of the motherboard rather than as replaceable components.  This makes upgrading or fixing bottlenecks related to peripherals difficult.

For all other workstations, the most important factor you can look at is the number of PCIe lanes allocated to each expansion slot or peripheral.  A lane is a connection to a processor, and each lane runs at a maximum speed dictated by the PCIe generation used by the processor, motherboard, and the expansion card.  Higher generations are faster, and the card and processor will negotiate their speed based on the highest generation of PCIe they both can manage.  This means that, for a given generation, the overall speed at which a processor can talk to a card is usually determined by how many lanes are available.
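To put rough numbers on that, here is a minimal sketch that multiplies a per-lane rate by the lane count after picking the highest generation both ends support.  The per-lane figures are the commonly quoted approximate one-direction rates for PCIe Gen 1 to 3 and are not taken from this article; the function name is hypothetical, and the motherboard's role in the negotiation is left out for simplicity.

```python
# Approximate usable bandwidth per lane, in one direction, in MB/s.
PCIE_MB_PER_S_PER_LANE = {1: 250, 2: 500, 3: 985}


def link_bandwidth_mb_per_s(cpu_gen, card_gen, lanes):
    """Bandwidth of a PCIe link after the two ends negotiate a generation."""
    negotiated_gen = min(cpu_gen, card_gen)  # highest generation both support
    return PCIE_MB_PER_S_PER_LANE[negotiated_gen] * lanes


# A Gen 3 card in a Gen 2 system falls back to Gen 2 speeds.
print(link_bandwidth_mb_per_s(cpu_gen=2, card_gen=3, lanes=16))  # 8000
print(link_bandwidth_mb_per_s(cpu_gen=2, card_gen=3, lanes=8))   # 4000
print(link_bandwidth_mb_per_s(cpu_gen=3, card_gen=3, lanes=4))   # 3940
```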

The number of lanes available to or used by a peripheral is typically written as x##, as in four lanes would be “x4”, so that’s the convention we’ll use here too.  The number of PCIe lanes available for all expansion cards on today’s consumer processors is typically x16, or x40 for each processor on workstation class.

A lane used for one peripheral is unavailable for another peripheral, so typically a motherboard with multiple expansion slots will split the available lanes between them as you plug more cards in.  A motherboard with three card slots, for instance, may allow: one card with x16 connections, two cards each with x8 connections, or three cards with x8, x4, and x4 connections respectively.

For instance, a single graphics card may get all 16 lanes (x16), but adding a second graphics card will usually split the bandwidth into two x8 connections, each card getting half of what’s available.  In PCIe Gen 2, this usually equates to a 25% performance drop on each card.  So adding a second card won’t double the graphics power; it usually only gives a 50% graphics processing improvement.

How PCIe lanes are divided between attached peripheral devices.  While this particular diagram shows the peripherals as attached PCIe cards, integrated PCIe cards that are not part of the native chipset are attached the same way internally.

But if you then add a 10GbE card to the third slot, the speed available to the second graphics card is cut in half again, often making the performance drop on that card 50-75%, while still reducing the speed available to the network card.  That means that in a worst case scenario, adding a second graphics card won’t improve graphics performance at all, and will reduce your network speed as well.
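The arithmetic behind those two scenarios can be sketched as follows.  The table simply encodes the rough PCIe Gen 2 figures quoted above, using the worst case (a 75% drop) for an x4 connection; the names are hypothetical and the numbers are estimates rather than measurements.

```python
# Fraction of full (x16) performance a card keeps at each lane width.
# x4 uses the worst case quoted above (a 75% drop).
RELATIVE_PERFORMANCE = {16: 1.00, 8: 0.75, 4: 0.25}


def combined_gpu_power(lane_widths):
    """Total graphics power in 'single x16 card' units."""
    return sum(RELATIVE_PERFORMANCE[width] for width in lane_widths)


print(combined_gpu_power([16]))    # 1.0  one card alone
print(combined_gpu_power([8, 8]))  # 1.5  two cards sharing the lanes
print(combined_gpu_power([8, 4]))  # 1.0  second card squeezed by a network card
```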

Moving to workstation class hardware, while expensive, really becomes beneficial when we consider PCIe lanes.  The current generation of Xeon processors offer x40 lanes per processor, meaning that you can have two graphics cards running at x16 and a 10GbE card running at x4 with room to spare.  And if you add a second processor the amount of bandwidth available to the peripherals is even higher.
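A quick lane budget makes the difference concrete.  This is only a sketch with a hypothetical helper name; it ignores lanes reserved for onboard devices, which vary by motherboard.

```python
def lanes_left(available, requested):
    """Lanes remaining after every card gets the width it asks for.

    A negative result means the cards cannot all run at full width.
    """
    return available - sum(requested)


cards = [16, 16, 4]  # two graphics cards plus a 10GbE card

print(lanes_left(40, cards))  # 4    workstation class: room to spare
print(lanes_left(16, cards))  # -20  consumer class: lanes must be split
```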

Thunderbolt at its core is a PCIe extension, with x4 bandwidth, which means that any cards connected over Thunderbolt share only x4 bandwidth of connectivity, which can be a problem when adding many types of performance cards, and especially when trying to add an external graphics card.

PCIe expanders ‘create’ additional PCIe lanes for expansion cards while sharing a single uplink to the CPU, and they can be a useful option in some cases, but create bottlenecks in others.  PCIe devices can talk to each other, so there are configurations in performance computing and graphics processing where you’ll see benefits.  On the other hand, they can create collisions when trying to use many peripherals that need more direct communication with the processor, such as if you’re using a network card and a graphics card in the same expansion box.  This, by the way, is how Thunderbolt expansion boxes with two or more PCIe expansion slots work, so it’s something important to be aware of.
 

When we use the word “memory” here, we’re talking exclusively about random access memory, or RAM, which the computer uses to store programs and files that it’s working with.  Generally speaking, the more memory a computer has, the less it has to read and write to disk when doing processing operations. RAM is really, really fast to read and write to so having more of it usually improves performance, especially with video applications.

We’re going to skip the specifics of how RAM works, but there are a couple of important things to know about RAM that can affect your performance.

The first is that the more RAM you have, the more the computer’s operating system can store there temporarily.  It sounds obvious, but both MacOS and Windows use RAM caching when accessing files, meaning that they can temporarily store files you’re working with in RAM.  More RAM means more (or bigger) files stored, and faster access in programs that are using the same file over and over, like After Effects or other VFX applications.  More RAM is also useful when doing some transformation operations, like when using large files in Photoshop.

RAM also has a speed factor that can come into play.  And while you may not consider the RAM speed very often, higher RAM speeds allow for faster reads and writes to and from memory, which can quickly add up in video applications.  Use the fastest RAM you can afford.


When your workstation computer chokes on a video file or takes a while to render, it’s easy to wish for a more powerful computer.  While under most general usage, the bottleneck is rarely your actual computer, when you do performance work like grading and mastering in 8K resolution on today’s hardware, intensive animation, or 3D graphics, you may find yourself fighting against performance bottlenecks related to the power of your central or graphics processing capabilities.

Upgrading your CPU or graphics cards may do the trick.  Or, if you’re already working on high quality workstation hardware, you may find that it makes no (or very little) difference whatsoever.  That’s what happened to us with the P6000 tests and DaVinci Resolve 12.5 vs 14.  The primary slowdown during grading wasn’t an issue of having enough power available, but an issue of how the application used the power available to it.

What makes things more complicated is that between your CPU and your graphics cards there are a couple of different types of power a computer has, and they’re not always easy to quantify.  There are three main types:

  • Single Thread CPU Power
  • Multithreaded / Parallel CPU Power
  • Hyperparallel Graphics Power (GPU Power)

Single thread or single core power is the power offered by one CPU core on the computer, and is usually measured as how much work a core can do in one cycle, multiplied by the number of cycles per second (Hertz) the CPU runs at.  For CPUs of the same manufacturer and generation (what year it was made), higher clock speeds mean more power, and generally newer processors of the same manufacturer mean more power per cycle at the same clock speeds.  The power per cycle between manufacturers, however, is not usually directly comparable, and often requires comparison testing.

Multithreaded or parallel CPU power is the total amount of power a CPU has, the sum of all of its cores.  For instance, a 4 core computer would have 4 times the total processing power of one of its cores. A 4 core / 8 thread computer would have around 6 times the performance, since adding 2 threads per core usually gives a performance bump of about 50% when dealing with real world numbers.
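That rule of thumb can be written as a small sketch.  The function name and the default of 0.5 are assumptions taken from the rough 50% estimate above; actual gains from a second thread per core vary a lot by workload.

```python
def parallel_power(cores, threads_per_core=1, extra_thread_gain=0.5):
    """Total CPU power in units of 'one single-threaded core'.

    Assumption: running two threads on a core adds about 50% on top of
    that core's single-thread performance.
    """
    if threads_per_core > 1:
        return cores * (1 + extra_thread_gain)
    return float(cores)


print(parallel_power(4))                      # 4.0
print(parallel_power(4, threads_per_core=2))  # 6.0
```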

Both single threaded and multithreaded power are important in today’s video world.  Some very common applications and plugins use only one processor core when running specific tasks, either because the steps have to run sequentially and can’t be spread across multiple cores, or because rewriting the software to do so isn’t cost effective.  Here, how much power a single core has ends up being the limiting factor on how quickly the computer can complete a task.

More and more programs and plugins these days operate with a multithreaded approach - any time the program can do things in parallel it splits up the operation between all (or a specific number) of CPU threads.  This makes better use of the overall power a CPU has to offer.

The difference between single threaded applications and multithreaded applications ends up meaning weird things happen.  Our 8 core MacPro with a processor speed of 3.0 GHz sometimes does things faster than our 12 core, 2.7 GHz MacPro or our 24 core, 2.2 GHz Windows machine.  Not because it has more overall processing power, but because it ends up with marginally more on a single core than the others.
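The comparison can be sketched with nothing more than core counts and clock speeds.  This assumes, purely for illustration, that every core does the same amount of work per cycle, which is not strictly true across different processor generations.

```python
# (name, cores, clock in GHz) for the three machines mentioned above.
machines = [
    ("8 core MacPro", 8, 3.0),
    ("12 core MacPro", 12, 2.7),
    ("24 core Windows machine", 24, 2.2),
]

# Assumption: every core does the same work per cycle, so GHz stands in
# for single core power and cores * GHz for total power.
fastest_single = max(machines, key=lambda m: m[2])
fastest_total = max(machines, key=lambda m: m[1] * m[2])

print("single threaded task:", fastest_single[0])
print("fully parallel task: ", fastest_total[0])
```

Under that assumption the 8 core machine wins any task stuck on one core, while the 24 core machine wins anything that spreads across all of them.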

And with that we come to graphics processing power.  GPUs have a much smaller amount of raw power per cycle than a CPU, have lower clock speeds, and will often take a lot longer to accomplish a single task on a single core.  But in exchange they offer a few thousand cores.  For hyperparallel tasks - tasks where the same thing is done over and over on many, many different parts that aren’t dependent on each other (like on, say, pixels in an image…) - GPUs can absolutely crush the processing power of CPUs, while being completely unable to do anything else.

But once again, the software has to take advantage of the GPU for the GPU to make a difference.  DaVinci Resolve for instance, will take advantage of all of the GPU power you can give it to process color transformations, but the CPU is required for running ProRes compression so adding more graphics power may not solve an encoding bottleneck there.  Adobe Media Encoder will leverage the GPUs for H.265 / HEVC compression, but offers much less control over what parts of the specification you’re using than if you use x265, which runs in parallel on the CPU.

This is why power and performance bottlenecks aren’t usually just as simple as “give it more power”.  Often changing programs to another one that does the same thing, but more efficiently, lets you reduce bottlenecks more quickly and at lower expense than changing hardware.

But what about when you have a program or plugin you NEED to use because there are no alternatives, or the alternatives are inferior in other ways?  How can you decide what kind of power you need to improve?

Being able to figure out when software is to blame for a bottleneck can be a little tricky.  Often the question is - is this slow because it’s just something that takes time, or is it slow because the software isn’t optimized properly?  Fortunately, both Windows and MacOS have some tools available that make diagnosing software problems possible.

Regardless of what kind of computer you’re using, the hardware and operating system are continually generating a huge amount of information about its own state that, with the right set of programs, you can access.  Both MacOS and Windows have a process monitor built in (Task Manager under Windows, Activity Monitor under MacOS) that will list all of the current applications and how much of the CPU they’re using, how much memory they have allocated, how much they’re accessing disk storage, and how much network activity they’re creating.

CPU usage is a great way of seeing whether a program is parallelized or not.  Under Task Manager, CPU usage is shown as a percentage of the number of cores / threads available.  For instance a 4 core / 8 thread computer will show a CPU usage of 12.5% (100% / 8) if a process is just using one thread on a core, and up to 100% as it uses more of the available threads and cores.  In Activity Monitor on the MacOS side, each thread is considered 100% of a CPU, so using one thread on a 4 core / 8 thread computer will show a CPU percentage around 100% and up to 800% if it’s using more of the available cores and threads.
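The two conventions are easy to mix up, so here is a small sketch that converts a count of fully busy threads into each tool's reading.  The function names are hypothetical, and real readings fluctuate because threads are rarely pinned at exactly 100%.

```python
def task_manager_percent(busy_threads, total_threads):
    """Windows Task Manager: share of all hardware threads in use."""
    return 100 * busy_threads / total_threads


def activity_monitor_percent(busy_threads):
    """MacOS Activity Monitor: each fully busy thread counts as 100%."""
    return 100 * busy_threads


total_threads = 8  # a 4 core / 8 thread computer

for busy in (1, 8):
    print(busy, "busy thread(s):",
          task_manager_percent(busy, total_threads), "% in Task Manager,",
          activity_monitor_percent(busy), "% in Activity Monitor")
```

A program that sits near 12.5% in Task Manager or near 100% in Activity Monitor on this machine while a task runs is most likely doing that work on a single thread.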

Task Manager showing CPU and Network use under load.

Activity Monitor showing CPU use under Load.

Activity Monitor showing Network Access Speed

GPU-Z Showing single GPU use under Load

iStat Monitor showing GPU

Seeing how much the GPU is being used can be a little bit more difficult.  On Windows, there’s a free application called GPU-Z you can download from techpowerup.com.  Similar in spirit to the CPU-Z utility (also a useful Windows tool), GPU-Z will show you the load on each of your GPUs, allowing you to see exactly how much of your available graphics processing power is being used.  On the Mac side, a non-free but inexpensive program called iStat Menus will give you access to similar kinds of information including how much of your GPU memory and processor are being used by a program over time.

These sensor type applications are great ways of figuring out if you have places or applications where adding more power or RAM or changing the software used can help, though they don’t always provide an immediate set of actions to follow for a solution.  Changes to the programs or methods you use to accomplish something are workflow changes, and that’s the last place we’re going to talk about bottlenecks.


Let’s be upfront: workflow bottlenecks can be far less clear with respect to cause and the solutions available to fix them.  Partially this is because every point in a process takes time, and while you can reduce some of it, you can’t reduce all of it.  Editing takes time.  Color Correction takes time.  Compression takes time.  Finding story takes time.  Quality takes time.  ...Dealing with poorly shot footage takes time... But even accounting for that there may be specific places where a few small changes to your workflow can reduce times, increase speed or both.

A good example of this is editing with a proxy workflow instead of with RAWs.  Yes, it takes additional time to make edit proxies and many small facilities may choose to skip the proxy creation and edit with the RAWs for the faster turnaround time.  And when there’s only one person editing on a very tight deadline this can work really well.  However, when you’re dealing with more than one editor, or working from multiple locations you may end up creating more of a time problem passing drives with footage back and forth, or find yourself having to centralize the process where the RAWs are available.

RAWs may also create local bottlenecks on individual editor computers, either because of the transfer rates of the media they’re on (or how they’re connected), or because of the amount of processing power it takes to continually decode them on the editor’s laptop or workstation, especially when effects are applied.  When all of the small delays that editing with RAWs can create are accounted for, it may be far more efficient to take the time to create and edit with proxies and online to the RAWs later for mastering than to use the RAWs directly.

Other workflow elements that can slow things down include things like multiple render steps that could be consolidated into just one or two by doing things concurrently within an application (when that’s available) or by switching to a program that may trade off a small amount of quality for consolidation.  Fewer steps, especially render steps, can have major effects on reducing time costs and bottlenecks.

Lastly, automating steps can be both a boon and a curse to workflow bottlenecks.  If the computer can run things without human interaction, it usually lets the computer be more efficient (by not waiting for instruction, for instance).  But many tasks can’t, or shouldn’t necessarily, be automated, either because the time it would take to automate the task is orders of magnitude bigger than the time to actually do the task, or because the task requires a level of creative input the computer can’t manage with simple scripts or decision trees.  In those cases, semi autonomous processes can speed things up while still giving the creative operator control over the results, while batch preparation and later execution can save the operator from losing precious time on the computer waiting for it to finish a previous task.
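As a minimal sketch of the batch preparation idea, the class below lets an operator queue up prepared jobs during the day and run them all later in one unattended pass.  Everything here is hypothetical: the class, the shot names, and the stand-in tasks are for illustration only, and a real setup would call into whatever render or export tool you actually use.

```python
class BatchQueue:
    """Collect prepared jobs now, run them all later in one unattended pass."""

    def __init__(self):
        self.jobs = []

    def prepare(self, name, task):
        """Queue a job; `task` is any zero-argument callable."""
        self.jobs.append((name, task))

    def run_all(self):
        """Run every queued job in order, recording failures instead of stopping."""
        results = {}
        for name, task in self.jobs:
            try:
                results[name] = ("done", task())
            except Exception as error:
                results[name] = ("failed", str(error))
        self.jobs = []
        return results


def missing_source():
    # Stand-in for a render that fails partway through the batch.
    raise RuntimeError("source file missing")


queue = BatchQueue()
queue.prepare("shot_010", lambda: "rendered shot_010")
queue.prepare("shot_020", missing_source)
queue.prepare("shot_030", lambda: "rendered shot_030")

overnight = queue.run_all()
for name, (status, detail) in overnight.items():
    print(name, status, detail)
```

Recording a failure and moving on, rather than stopping, means one bad shot doesn’t cost you the rest of the overnight run.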
 

But we’ll end with a caution and caveat about automation.  On a rare occasion, automation can actually create risks to your digital assets.  But that’s not a bottleneck necessarily, so it deserves its own discussion, which will be part of an upcoming series here on Protecting your Digital Ass(ets). Stay tuned.