Beyond Thunderbolt – Mac Pro GPU Slot Comparisons

I’m going a little more techie than usual with this post. You have been warned.

There’s been much wringing of hands over the future of the Mac Pro, which is long overdue for an upgrade. New chips from Intel are imminent, but Apple’s been characteristically quiet about what’s in store, or whether there will even be another Mac Pro.

The current 27″ iMac is no performance slouch, and with Thunderbolt technology built in there’s been speculation about the true need for internal slots. Put multiple Thunderbolt ports/buses on an iMac and you’ve got a pretty capable machine for video editing and graphics, with plenty of room for external expansion thanks to Thunderbolt. At NAB this year there were plenty of new Thunderbolt accessories.

The sticky wicket is GPU performance. Thunderbolt is fast, but its current iteration isn’t nearly as fast as an internal 16-lane PCIe connector, which is where high-performance graphics cards sit.
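To put rough numbers on that gap, here is a back-of-the-envelope sketch in Python. It uses published theoretical per-direction link rates (250 MB/s per PCIe 1.x lane, 500 MB/s per PCIe 2.0 lane, 10 Gbit/s per first-generation Thunderbolt channel), not anything measured on this machine. Which PCIe generation each Mac Pro slot actually runs at isn’t stated here, so both are shown. The function and constant names are just for illustration.

```python
# Theoretical per-direction bandwidth in MB/s, after 8b/10b line encoding.
# These are spec figures, not measurements; real-world throughput is lower.
PCIE_LANE_MB_S = {
    "1.x": 250,  # 2.5 GT/s per lane
    "2.0": 500,  # 5.0 GT/s per lane
}

# First-generation Thunderbolt: 10 Gbit/s per channel, converted to MB/s.
THUNDERBOLT_1_CHANNEL_MB_S = 10_000 / 8


def pcie_bandwidth_mb_s(generation: str, lanes: int) -> float:
    """Theoretical one-way bandwidth of a PCIe link with the given lane count."""
    return PCIE_LANE_MB_S[generation] * lanes


if __name__ == "__main__":
    links = {
        "PCIe 2.0 x16": pcie_bandwidth_mb_s("2.0", 16),
        "PCIe 2.0 x4": pcie_bandwidth_mb_s("2.0", 4),
        "PCIe 1.x x4 (assumed possibility)": pcie_bandwidth_mb_s("1.x", 4),
        "Thunderbolt, one 10 Gbit/s channel": THUNDERBOLT_1_CHANNEL_MB_S,
    }
    for name, mb_s in links.items():
        print(f"{name:<38} {mb_s / 1000:5.2f} GB/s")
```

On paper, a single Thunderbolt channel lands in the same neighborhood as a 4-lane slot and well short of a 16-lane PCIe 2.0 slot.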

This video recently popped up on YouTube, demonstrating an NVidia Quadro 4000 accelerating the effects in Adobe Premiere CS6 over Thunderbolt.

As I am fond of saying, “It’s amazing that it works at all!”

This got me wondering how important the 16-lane PCIe speed really is for GPU performance. How much performance is a result of data moving across the bus, and how much is simply on-card number crunching? I decided to do some testing.

I recently replaced the GPU in one of our Mac Pro 3,1 towers, pulling the ATI 3870 and putting in an NVidia GTX-470 that came from eBay, its ROMs flashed for Mac compatibility. I wanted a CUDA-capable card to take advantage of the accelerated features in Adobe’s CS6 tools, and this card offers strong performance for the price. It’s a lot of CUDA cores for the buck, and it’s been running flawlessly for a couple of months.

Here are some test results from the GTX-470, installed in the 16-lane slot in the Mac Pro. First up, some stats from the CUDA-Z utility, which is designed to benchmark CUDA cards on the Mac.

So that’s our baseline, running in the 16-lane slot.

I also ran the Cinebench test from Maxon, which is designed to test OpenGL performance. Here are the results from that test:

That’s the GTX-470 in orange in the results list, 26.16 fps.

Now, interestingly, the GTX-470 doesn’t set the world on fire when it comes to OpenGL performance. If you look at the list of results you’ll see that the ATI 3870, a much older card with less memory, outperforms it. I did some research, and the conclusion seems to be that NVidia did some internal bandwidth bottlenecking on this card to keep it from competing with the much more expensive Quadro series. As a gaming card, the GTX-470 is designed to get images rendered to the screen quickly, but for OpenGL it is limited in getting those rendered images back to the CPU. The CUDA pipes are wide open, I’m told.

So what happens when we benchmark the GPUs in the 4-lane slot, which is closer to the bandwidth we’d expect to have over Thunderbolt?

First up, the GTX-470 CUDA test in the 4-lane slot:

Here’s the 16-lane slot test again for comparison:

These results are pretty much what you would expect. The speed of getting data to and from the card drops significantly in the 4-lane slot, yet the internal functions of the card stay close enough to call even.
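One way to picture why a slower slot can matter a lot or hardly at all is a simple model in which each unit of work costs a bus transfer plus some on-card time. The sketch below is only an illustration. The payload sizes, the 30 ms of on-card time, and the 8000 and 2000 MB/s bus figures are all made-up assumptions rather than CUDA-Z measurements, and it assumes transfers and computation never overlap.

```python
# Hypothetical bus speeds in MB/s (illustrative, not measured).
FAST_BUS_MB_S = 8000.0  # stand-in for a 16-lane slot
SLOW_BUS_MB_S = 2000.0  # stand-in for a 4-lane slot


def frame_time_ms(payload_bytes: float, bus_mb_s: float, on_card_ms: float) -> float:
    """Time for one unit of work: bus transfer followed by on-card compute."""
    transfer_ms = payload_bytes / (bus_mb_s * 1e6) * 1e3
    return transfer_ms + on_card_ms


def slowdown(payload_bytes: float, on_card_ms: float,
             fast: float = FAST_BUS_MB_S, slow: float = SLOW_BUS_MB_S) -> float:
    """How many times longer the same work takes on the slower bus."""
    return (frame_time_ms(payload_bytes, slow, on_card_ms)
            / frame_time_ms(payload_bytes, fast, on_card_ms))


if __name__ == "__main__":
    # Small payload per frame: on-card time dominates.
    print(f"1 MB per frame:   {slowdown(1_000_000, 30.0):.3f}x")
    # Large payload per frame: the bus dominates.
    print(f"100 MB per frame: {slowdown(100_000_000, 30.0):.3f}x")
```

In this toy model, the workload that moves little data per frame barely notices the narrower bus, while the one that moves a lot takes nearly twice as long.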

Here’s the OpenGL Cinebench test in the 4-lane slot:

It’s nearly the same performance that we got in the 16-lane slot, close enough to call even.

Based on this test, for OpenGL we would likely see no real-world performance hit from running via an external Thunderbolt chassis.

I also tested the older ATI 3870 card, and it went from 31.2 fps in the 16-lane slot to 30.9 in the 4-lane slot, not what I would consider a significant difference.
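For a sense of scale, the ATI 3870 figures above can be turned into a relative change with a couple of lines of Python. The helper name is just for illustration.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a percentage."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    # ATI 3870 Cinebench OpenGL results quoted in the text (fps).
    ati_16_lane = 31.2
    ati_4_lane = 30.9
    print(f"ATI 3870: {percent_change(ati_16_lane, ati_4_lane):+.2f}%")
```

That works out to a drop of a little under 1%.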

So while the raw numbers from the CUDA-Z benchmark show a significant performance gap in getting data to and from the GPU, the more real-world Cinebench tests seem to indicate that, in this particular combination of hardware, it’s not as big a performance hit as you might expect.

This is far from the last word on this. There are, of course, flaws in this method of testing. It’s a bit apples vs. oranges comparing the raw CUDA numbers to the OpenGL numbers. I didn’t set out to create the be-all and end-all statement on GPU performance in Mac Pro slots. Rather, I was simply curious to know how much of a difference slot placement actually makes, and whether the notion of using a high-performance GPU in an external Thunderbolt expansion chassis makes any sense. So I took a couple of hours and played around with it. I’d love to see some real-world comparisons using some of Adobe’s apps, and to see what the threshold is for when the number of lanes starts to make a real-world difference. It may be that for many users that threshold is rarely reached. There will, of course, be those who lust after every possible shred of performance and spare no expense getting it, and there are probably apps for which the threshold is, indeed, reached pretty quickly. (DaVinci Resolve comes to mind, for example.)

Based on the prototype video posted above and these test results, I’m much more optimistic about this being a practical solution than I had been previously. A 27″ iMac with an external CUDA card could prove to be an extremely viable, cost-effective workstation. Don’t get me wrong, I’d love to see another round of Mac Pros, but even if Apple decides to drop or significantly rework them, the situation doesn’t seem nearly as grim. We have options.