NVENC update and its impact on streaming

On February 11, 2019, NVIDIA updated their Video Codec SDK to version 9.0 and introduced several enhancements to NVENC (http://us.download.nvidia.com/Windows/419.67/419.67-win10-crd-release-notes.pdf -- page 5).

Video Codec SDK 9.0 (Released Feb 11, 2019)
Included Features
• Supports NVENC/NVDEC on NVIDIA Turing GPUs
• NVENC API has been updated to support HEVC B-frames on Turing GPUs.
• NVENC API adds the capability to output the encoded bitstream, and motion vectors from Motion-Estimation-only mode to video memory. This avoids the overhead of copying the output from system to video memory for processing pipelines operating directly on video memory.
• NVENC API now accepts CUArray as an input buffer. The SDK contains a sample application to demonstrate how to feed Vulkan surface to NVENC using Vulkan-CUDA interop.
More information: https://developer.nvidia.com/nvidia-video-codec-sdk
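
As context for the HEVC B-frame bullet above: applications can ask the driver what the local encoder actually supports before enabling such features. The sketch below is my own illustration against the public nvEncodeAPI.h header, not something from NVIDIA's notes or OBS; it assumes an already-opened encode session and omits error handling.

[CODE]
/* Illustrative sketch (not from NVIDIA's notes or OBS): query how many
 * B-frames the local NVENC reports for HEVC, using the public
 * nvEncodeAPI.h header. Assumes an encode session has already been opened
 * (hEncoder) and the function list (nvenc) was filled in by
 * NvEncodeAPICreateInstance(). Error handling omitted for brevity. */
#include <stdio.h>
#include "nvEncodeAPI.h"

int max_hevc_bframes(NV_ENCODE_API_FUNCTION_LIST *nvenc, void *hEncoder)
{
    NV_ENC_CAPS_PARAM caps = { 0 };
    int value = 0;

    caps.version     = NV_ENC_CAPS_PARAM_VER;
    caps.capsToQuery = NV_ENC_CAPS_NUM_MAX_BFRAMES;

    /* Per the SDK 9.0 notes, this should be non-zero for HEVC on Turing
     * parts and zero on older GPUs. */
    if (nvenc->nvEncGetEncodeCaps(hEncoder, NV_ENC_CODEC_HEVC_GUID,
                                  &caps, &value) != NV_ENC_SUCCESS)
        return -1;

    printf("Max HEVC B-frames reported by the driver: %d\n", value);
    return value;
}
[/CODE]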

Shortly thereafter, OBS Studio 23 was released with a new NVENC encoder that takes advantage of these features and is more streamlined in general (https://obsproject.com/blog/progress-report-february-2019).

NVENC Improvements

Originally, we used FFmpeg's implementation of NVENC to save time. It was only a few hundred lines to implement, and like x264, it only required the raw frames in system RAM. However, I knew that if I implemented it myself and revamped the backend so we could give encoders textures directly, it would improve performance. The reason we didn't was the complexity of also supporting Windows 7. NVIDIA had contacted me asking about it, and we talked back and forth on the matter. After those discussions, I came up with a pretty simple plan: just forget Windows 7. If the user is on Windows 7, simply fall back to the older version! It saved a lot of time, though not as much as I'd hoped.
Multi-threading is very difficult to do right

I started off simple to get an initial implementation going: running the encoder on the graphics thread (which is normally reserved for rendering). But if either rendering or encoding lagged, it would cause a cascade of subsequent lag. My hope was that the encode call wouldn't stall, but unfortunately it turned out that it can, so the only solution was to separate encoding out to another thread, as we already did with software encoders. To do that, I had to implement texture sharing in commit b64d7d7. This made it possible not only to share textures (like we did for game capture), but also to lock textures between multiple threads and graphics contexts to ensure frame synchronization.

After a lot of trial and error, I finally came up with a good threaded implementation in the libobs backend, which I implemented in commit 93ba6e7. It operates on a circular queue of a few textures, and I was able to make a specific optimization: if no encoder that uses RAM data (e.g. x264) is simultaneously active, I can swap the NV12 texture directly into the queue instead of having to do an extra texture copy. Finally, after painfully laying all that groundwork for texture-based encoding support in the backend, it was time to finalize my new custom implementation of NVENC, which was accomplished in commit ed0c7bc.
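
As a rough illustration of the kind of structure being described, here's a generic sketch of a small circular texture queue shared between a render thread and an encode thread. The texture type, queue size, and function names are placeholders of my own, not libobs code; the point is only the pattern of publishing frames into slots under a lock and draining them in order.

[CODE]
/* Illustrative sketch only -- not libobs code. A small circular queue of
 * GPU texture handles shared between a render thread (producer) and an
 * encode thread (consumer). gpu_tex_t stands in for whatever texture
 * handle the graphics API provides. */
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_TEXTURES 3 /* "a few textures", as described above */

typedef struct gpu_tex gpu_tex_t; /* opaque placeholder texture type */

struct tex_queue {
    gpu_tex_t      *tex[NUM_TEXTURES];
    uint64_t        timestamp[NUM_TEXTURES];
    size_t          write_idx, read_idx, count;
    pthread_mutex_t mutex;
    pthread_cond_t  avail;
};

void queue_init(struct tex_queue *q)
{
    memset(q, 0, sizeof(*q));
    pthread_mutex_init(&q->mutex, NULL);
    pthread_cond_init(&q->avail, NULL);
}

/* Render thread: publish the frame it just rendered into the next slot.
 * If the queue is full the oldest frame is dropped rather than stalling
 * the renderer (one reasonable policy; OBS's real policy may differ). */
void queue_push(struct tex_queue *q, gpu_tex_t *frame, uint64_t ts)
{
    pthread_mutex_lock(&q->mutex);
    if (q->count == NUM_TEXTURES) {
        q->read_idx = (q->read_idx + 1) % NUM_TEXTURES; /* drop oldest */
        q->count--;
    }
    q->tex[q->write_idx]       = frame;
    q->timestamp[q->write_idx] = ts;
    q->write_idx = (q->write_idx + 1) % NUM_TEXTURES;
    q->count++;
    pthread_cond_signal(&q->avail);
    pthread_mutex_unlock(&q->mutex);
}

/* Encode thread: block until a frame is available, then take it. */
gpu_tex_t *queue_pop(struct tex_queue *q, uint64_t *ts)
{
    pthread_mutex_lock(&q->mutex);
    while (q->count == 0)
        pthread_cond_wait(&q->avail, &q->mutex);
    gpu_tex_t *frame = q->tex[q->read_idx];
    *ts = q->timestamp[q->read_idx];
    q->read_idx = (q->read_idx + 1) % NUM_TEXTURES;
    q->count--;
    pthread_mutex_unlock(&q->mutex);
    return frame;
}
[/CODE]

The "swap" optimization mentioned above would correspond to handing the freshly rendered NV12 texture straight into a slot instead of copying it into a pre-allocated queue texture first.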

So needless to say, I am very happy with how I was able to implement it as well as the optimizations I was able to come up with. It was pretty fun.
Performance Benefits

The performance benefits of the new NVENC are pretty significant. Before, the process looked like this:

OBS renders a frame
OBS transfers that texture from GPU to RAM like it would for any other encoder
FFmpeg NVENC uploads it to the GPU
FFmpeg NVENC encodes it

Now, it looks like this:

OBS renders a frame
NVENC encodes it

This is not just a performance improvement for OBS; it also reduces the impact of OBS on any game you're playing while using NVENC. It's a must-have for anyone streaming or recording games with a single-PC setup.
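
For the curious, the general shape of that direct path against the public NVENC API headers looks roughly like the sketch below. This is heavily simplified, uses my own variable names, and is not OBS's actual code; it just shows why no GPU-to-RAM copy is needed.

[CODE]
/* Illustrative sketch of the zero-copy path (simplified; not OBS's actual
 * code). A rendered NV12 texture is registered with NVENC, mapped, and
 * encoded directly -- no GPU-to-RAM round trip. Assumes an open encode
 * session (hEncoder), a populated NV_ENCODE_API_FUNCTION_LIST (nvenc),
 * and an output bitstream buffer created earlier. Error handling omitted. */
#include <stdint.h>
#include <d3d11.h>
#include "nvEncodeAPI.h"

void encode_texture_directly(NV_ENCODE_API_FUNCTION_LIST *nvenc,
                             void *hEncoder, ID3D11Texture2D *tex,
                             uint32_t width, uint32_t height,
                             NV_ENC_OUTPUT_PTR bitstream)
{
    /* 1. Register the texture with the encoder. (In a real program this
     *    is done once per texture and the handle is reused each frame.) */
    NV_ENC_REGISTER_RESOURCE reg = { 0 };
    reg.version            = NV_ENC_REGISTER_RESOURCE_VER;
    reg.resourceType       = NV_ENC_INPUT_RESOURCE_TYPE_DIRECTX;
    reg.resourceToRegister = tex;
    reg.width              = width;
    reg.height             = height;
    reg.bufferFormat       = NV_ENC_BUFFER_FORMAT_NV12;
    nvenc->nvEncRegisterResource(hEncoder, &reg);

    /* 2. Map the registered texture so the encoder can read it. */
    NV_ENC_MAP_INPUT_RESOURCE map = { 0 };
    map.version            = NV_ENC_MAP_INPUT_RESOURCE_VER;
    map.registeredResource = reg.registeredResource;
    nvenc->nvEncMapInputResource(hEncoder, &map);

    /* 3. Encode straight from the mapped texture. */
    NV_ENC_PIC_PARAMS pic = { 0 };
    pic.version         = NV_ENC_PIC_PARAMS_VER;
    pic.inputBuffer     = map.mappedResource;
    pic.bufferFmt       = map.mappedBufferFmt;
    pic.inputWidth      = width;
    pic.inputHeight     = height;
    pic.outputBitstream = bitstream;
    pic.pictureStruct   = NV_ENC_PIC_STRUCT_FRAME;
    nvenc->nvEncEncodePicture(hEncoder, &pic);

    /* 4. Unmap once the encoder is done with this frame. */
    nvenc->nvEncUnmapInputResource(hEncoder, map.mappedResource);
}
[/CODE]

The old FFmpeg-based path had to download the rendered frame to system RAM and re-upload it to the GPU before the equivalent of step 3; that round trip is exactly what the new encoder skips.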

Seeing as I had just bricked my four-year-old Windows 7 install (which does not support these enhancements), found Server 2019 to still be a buggy mess, and was unwilling to put Windows 10 on my primary system, I threw Server 2016 on it and decided to test the improvements.

Traditionally, hardware H.264 encoders have done relatively poorly in quality at a given bitrate compared to mature software encoders like x264. In my case, this relegated hardware encoders (like NVENC) to recording for archival and/or later uploading, as the bandwidth requirements of acceptable-quality live streams were prohibitive (a few years back my upload dropped from 90Mbps to 10Mbps when I moved to an area with no ISP competition and older infrastructure). Any live streaming I did had to be crammed into 8-9Mbps max, which meant 1080p60 content would only look acceptable with software x264, using a custom preset that carefully maxed out my 4.3GHz i7-5820K without overloading it to the point of game or encoder stalls.

Recently, changes to Elite: Dangerous and the firmware/software Spectre/Meltdown mitigations have forced me to slightly reduce my CPU's clock speed and have added some overhead to the video capture process. Now limited to 8000kbps with the equivalent of the "fast" preset, quality was getting pretty borderline and the in-game performance hit was becoming noticeable in some situations.

I haven't finished configuring my new OS install yet, but I have run some tests with the new versions of OBS and NVENC, and my initial results are very promising.

Test setup, relevant specs only:
i7-5820K (6 core, 12 thread, Haswell-E) @ 4.2GHz core, 4.1GHz uncore
X99 motherboard, firmware patched with latest microcode
4x4GiB DDR4 @ 2667MT/s 12-11-12-26-T1
GTX 1080 Ti (Pascal GP102) @ 2000MHz core, 1485MHz (5940 QDR) memory
Windows Server 2016 Standard, stripped and fully patched/mitigated
NVIDIA 419.35 WHQL Game Ready driver with CUDA P2 state disabled
OBS Studio x64 23.0.2
Unigine Valley 1.0 benchmark

I chose to record a loop of Unigine Valley (so I could get a repeatable run of a dynamic scene that would be vaguely reflective of many gaming scenarios) running at 2560×1440 (extreme quality + 4x MSAA), with OBS scaling it to 1080p and capturing at 60 fps. I used both x264, with my custom 'fast' preset (tuned for optimal YouTube compatibility) that is at the limit of what my CPU can reliably encode at 8Mbps, and NVENC at the same bitrate, with the highest quality settings that still fell within YouTube's streaming guidelines as closely as possible.

I've uploaded the raw captures to Sendspace for comparison. Links below:

x264, 1080p60, 8000kbps CBR, two-second keyframe interval, 'fast' preset + scenecut=0 bframes=2 ref=1 threads=12 -- https://www.sendspace.com/file/y7ni0

NVENC, 1080p60, 8000kbps CBR, two-second keyframe interval, 'max quality' preset + look-ahead, PVT, and 2 bframes -- https://www.sendspace.com/file/d3e786

As you can see, the encode quality is nearly indistinguishable (there are a few minor differences, but I have real difficulty picking out which is better, or even which is which), while the hit to video frame rate, despite this being a largely GPU-limited test, is noticeably lower with NVENC. Valley scored 4677 while being recorded with x264 and 5072 with NVENC. That's an improvement of over 8% from encoding on the GPU instead of the CPU, with vastly lower CPU overhead.

Overall, I'm pretty damn impressed. I was considering a major upgrade, partially to be able to record with x264 at the 'medium' or 'slow' presets, but I may be able to hold off a while longer now.

Also, note that this was on a Pascal GTX part. Newer Turing parts should have even better encoding efficiency, as NVIDIA's material mentions. I have not personally tested this though.
 
I wouldn't imagine a big impact on streaming, really. Most "professional" streamers use a dedicated computer for composition and encoding and may as well push things through the CPU, or at least have a system that's not otherwise loaded down; unless one is going for (virtually) lossless quality, it doesn't matter. The biggest improvement in the implementation itself appears to be a zero-copy approach that nominally reduces latency, but we're talking microseconds versus the (milli)seconds of latency involved in network transport, so an improvement several orders of magnitude below the real "problem" won't change much (no, Google, it really won't).

This may be interesting in some other areas, though. Ultra-high-resolution (think 8K) or ultra-high-framerate live encoding with HEVC would profit from lower memory bandwidth use and hardware support, but neither is relevant for the general consumer populace, nor will be in the foreseeable future. In professional pipelines you only want any of that in the very last delivery stage anyway (working with HEVC or even AVC streams is, to put it in professional terms, <User has been banned from the forums for excessive use of profanity.>
 
Most streamers aren't professionals; they are hobbyists/dabblers, or at least started that way. Having essentially free x264 fast/medium-quality encoding will make things much more accessible.
 
Hobby streaming and dabbling were perfectly feasible before, though, and especially with the recent "core race" in CPUs pushing a lot of unused compute power around, chances are anyone motivated enough to stream their stuff is on an acceptable hardware platform anyway. Not UHD-perfect, but good enough.

I simply don't see it as any kind of game changer. Nice, but not worth throwing my underwear onto the stage :p
 
Hobby streaming and dabbling were perfectly feasible before, though, and especially with the recent "core race" in CPUs pushing a lot of unused compute power around, chances are anyone motivated enough to stream their stuff is on an acceptable hardware platform anyway. Not UHD-perfect, but good enough.

I was on 'good enough' before, but it's that much better now.

For someone with an older or lower-end setup, who doesn't have a huge number of surplus CPU cycles, the quality-vs-bitrate jump of GPU hardware encoding could well be a game changer. If one is playing semi-modern games at decent settings, the only thing that's certain is that one has a moderately powerful GPU...and now that's all one needs to get an HD stream that doesn't look like mud.

Nice, but not worth throwing my underwear onto the stage :p

I wasn't wearing any underwear.
 
Short clip of my archival/upload settings with the new encoder: https://www.sendspace.com/file/7a3p93

That's 4K recorded at 1440p60 with CQP 16, at about 130Mbps (which is why the video is only 16 seconds long). The performance hit vs. not recording was only about 5-7%, and I wouldn't have noticed it at all without the frame rate meter.
 