On February 11, 2019, NVIDIA updated their Video Codec SDK to version 9.0 and introduced several new enhancements to NVENC (http://us.download.nvidia.com/Windows/419.67/419.67-win10-crd-release-notes.pdf -- page 5):
Video Codec SDK 9.0 (Released Feb 11, 2019)
Included Features
• Supports NVENC/NVDEC on NVIDIA Turing GPUs
• NVENC API has been updated to support HEVC B-frames on Turing GPUs.
• NVENC API adds the capability to output the encoded bitstream, and motion vectors from Motion-Estimation-only mode to video memory. This avoids the overhead of copying the output from system to video memory for processing pipelines operating directly on video memory.
• NVENC API now accepts CUArray as an input buffer. The SDK contains a sample application to demonstrate how to feed Vulkan surface to NVENC using Vulkan-CUDA interop.
More information: https://developer.nvidia.com/nvidia-video-codec-sdk
Shortly thereafter, OBS Studio 23 was released with a new NVENC encoder that takes advantage of these features and is more streamlined in general (https://obsproject.com/blog/progress-report-february-2019):
NVENC Improvements
Originally, we used FFmpeg's implementation of NVENC to save time. It was less than a few hundred lines to implement, and like x264, it only required the raw frames in system RAM. However, I knew that if I implemented it myself and revamped the backend so that we could give encoders textures directly, it would improve performance. The reason we didn't was the complexity of also supporting Windows 7. NVIDIA had contacted me asking about it, and we talked back and forth on the matter. After those discussions, I came up with a pretty simple plan: just forget Windows 7. If the user is on Windows 7, simply fall back to the older version! It saved a lot of time, though not as much time as I'd hoped.
Multi-threading is very difficult to do right
I started off simple to get an initial implementation going: running the encoder on the graphics thread (which is normally used for rendering), but if either rendering or encoding lagged, it caused a cascade of subsequent lag. My hope was that the encode call wouldn't stall, but unfortunately it turned out that it can stall, so the only solution was to separate the encoding onto another thread, like we already did with software encoders. I had to implement texture sharing in commit b64d7d7. This made it possible not only to share textures (like we did for game capture), but also to lock textures between multiple threads and graphics contexts to ensure frame synchronization.
After a lot of trial and error, I finally came up with a good threaded implementation in the libobs backend, which I implemented in commit 93ba6e7. It operates on a circular queue of a few textures, and I was able to make a specific optimization: if no encoder that uses RAM data (e.g. x264) is simultaneously active, I can swap the NV12 texture directly into the queue instead of having to do an extra texture copy. Finally, after painfully laying all that groundwork for texture-based encoding support in the backend, it was time to finalize my new custom implementation of NVENC, which was accomplished in commit ed0c7bc.
So needless to say, I am very happy with how I was able to implement it as well as the optimizations I was able to come up with. It was pretty fun.
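To illustrate the idea (this is only my own rough sketch of the concept, not OBS's actual code, which is C built on the libobs graphics API), a circular queue of texture slots shared between a render thread and an encoder thread could look something like this:

```python
# Rough illustrative sketch only -- not OBS's actual implementation. A small ring
# of texture slots shared between the render thread (producer) and the encoder
# thread (consumer); the condition variable keeps the two threads in sync without
# either one blocking inside the other's graphics context.
import threading
from collections import deque

class TextureRing:
    def __init__(self, num_slots=3):
        self.free = deque(range(num_slots))  # slots the renderer may draw into
        self.ready = deque()                 # slots holding frames waiting to be encoded
        self.cond = threading.Condition()

    def acquire_for_render(self):
        # Called from the render thread; waits if every slot is queued for encoding.
        with self.cond:
            while not self.free:
                self.cond.wait()
            return self.free.popleft()

    def submit_for_encode(self, slot):
        # Render thread hands a finished frame to the encoder thread.
        with self.cond:
            self.ready.append(slot)
            self.cond.notify_all()

    def acquire_for_encode(self):
        # Called from the encoder thread; waits until a frame is available.
        with self.cond:
            while not self.ready:
                self.cond.wait()
            return self.ready.popleft()

    def release(self, slot):
        # Encoder thread returns a slot once the hardware encoder has consumed it.
        with self.cond:
            self.free.append(slot)
            self.cond.notify_all()
```

The "swap instead of copy" optimization mentioned above would then amount to handing the renderer's NV12 texture handle straight into the ready queue whenever no RAM-based encoder needs its own copy.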
Performance Benefits
The performance benefits of the new NVENC are pretty significant. Before, the process looked like this:
OBS renders a frame
OBS transfers that texture from GPU to RAM like it would for any other encoder
FFmpeg NVENC uploads it to the GPU
FFmpeg NVENC encodes it
Now, it looks like this:
OBS renders a frame
NVENC encodes it
This is not just a performance improvement for OBS; it also reduces the impact of OBS on any game you're playing while using NVENC. It's a must-have for anyone streaming or recording games on a single-PC setup.
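To put rough numbers on why skipping that round trip matters (my own back-of-the-envelope math, not figures from the OBS post, and assuming the 1080p NV12 output used in my test below), the old path was moving on the order of 180 MiB every second from GPU to system RAM and then the same amount back up:

```python
# Back-of-the-envelope estimate (my numbers, not OBS's): data moved per second by
# the old GPU -> system RAM -> GPU round trip for a 1080p60 NV12 output.
width, height, fps = 1920, 1080, 60
bytes_per_frame = width * height * 3 // 2      # NV12 is 12 bits per pixel
per_direction = bytes_per_frame * fps
print(f"{bytes_per_frame / 2**20:.2f} MiB per frame")       # ~2.97 MiB
print(f"{per_direction / 2**20:.0f} MiB/s each direction")  # ~178 MiB/s
```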
Seeing as I had just bricked my four-year-old Windows 7 install (which does not support these enhancements), found Server 2019 to still be a buggy mess, and was unwilling to put Windows 10 on my primary system, I threw Server 2016 on it and decided to test the improvements.
Traditionally, hardware H.264 encoders have done relatively poorly in quality for a given bitrate compared to mature software encoders like x264. In my case, this relegated hardware encoders (like NVENC) to recording for archival and/or later uploading, as the bandwidth requirements of acceptable-quality live streams were prohibitive (a few years back my upload dropped from 90Mbps to 10Mbps when I moved to an area with no ISP competition and older infrastructure). Any live streaming I did had to be crammed into 8-9Mbps at most, which meant 1080p60 content would only look acceptable with software x264 using a custom preset that carefully maxed out my 4.3GHz i7-5820K without overloading things to the point of game or encoder stalls.
Recently, changes to Elite: Dangerous and the firmware/software Spectre/Meltdown mitigations have forced me to slightly reduce my CPU's clock speed and have added more overhead to the video capture process. Now, limited to 8000kbps with the equivalent of the 'fast' preset, quality was getting pretty borderline and the in-game performance hit was becoming noticeable in some situations.
I haven't finished configuring my new OS install yet, but I have run some tests with the new versions of OBS and NVENC, and my initial results are very promising.
Test setup, relevant specs only:
i7-5820K (6 core, 12 thread, Haswell-E) @ 4.2GHz core, 4.1GHz uncore
X99 motherboard, firmware patched with latest microcode
4x4GiB DDR4 @ 2667MT/s 12-11-12-26-T1
GTX 1080 Ti (Pascal GP102) @ 2000MHz core, 1485MHz (5940 QDR) memory
Windows Server 2016 Standard, stripped and fully patched/mitigated
NVIDIA 419.35 WHQL Game Ready driver with CUDA P2 state disabled
OBS Studio x64 23.0.2
Unigine Valley 1.0 benchmark
I chose to record a loop of Unigine Valley (so I could get a repeatable run of a dynamic scene that would be vaguely reflective of many gaming scenarios) running at 2560×1440 (extreme quality + 4x MSAA), with OBS scaling it to 1080p and capturing at 60fps. I used both x264, with my custom 'fast' preset (tuned for optimal YouTube compatibility) that is at the limit of what my CPU can reliably encode at 8Mbps, and NVENC at the same bitrate, with the highest quality settings that also fell within YouTube's streaming guidelines as closely as possible.
I've uploaded the raw captures to Sendspace for comparison. Links below:
x264, 1080p60, 8000kbps CBR, two-second keyframe interval, 'fast' preset + scenecut=0 bframes=2 ref=1 threads=12 -- https://www.sendspace.com/file/y7ni0
NVENC, 1080p60, 8000kbps CBR, two-second keyframe interval, 'max quality' preset + look-ahead, PVT, and 2 bframes -- https://www.sendspace.com/file/d3e786
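For anyone who wants to produce roughly comparable encodes outside OBS (for example from a lossless master of the same run), something like the ffmpeg invocations below should approximate the settings above. The flag mapping is my best guess rather than an exact equivalent; in particular, treating OBS's 'max quality' preset and psycho-visual tuning as h264_nvenc's 'slow' preset plus temporal AQ is an assumption on my part, and the file names are placeholders.

```python
# Approximate re-creation of the two test encodes with ffmpeg. The source and
# output names are placeholders; the OBS -> ffmpeg flag mapping is my best guess,
# particularly on the NVENC side.
import subprocess

SOURCE = "valley_lossless.mkv"  # assumed lossless capture of the same Valley run

x264_cmd = [
    "ffmpeg", "-y", "-i", SOURCE, "-an",
    "-c:v", "libx264", "-preset", "fast",
    "-b:v", "8000k", "-minrate", "8000k", "-maxrate", "8000k", "-bufsize", "8000k",
    "-g", "120",  # two-second keyframe interval at 60 fps
    "-x264-params", "scenecut=0:bframes=2:ref=1:threads=12",
    "valley_x264_fast_8M.mkv",
]

nvenc_cmd = [
    "ffmpeg", "-y", "-i", SOURCE, "-an",
    "-c:v", "h264_nvenc", "-preset", "slow",  # stand-in for OBS's 'max quality'
    "-rc", "cbr",
    "-b:v", "8000k", "-maxrate", "8000k", "-bufsize", "8000k",
    "-g", "120", "-bf", "2",
    "-rc-lookahead", "20",                    # look-ahead
    "-temporal-aq", "1",                      # rough stand-in for psycho-visual tuning
    "valley_nvenc_maxq_8M.mkv",
]

for cmd in (x264_cmd, nvenc_cmd):
    subprocess.run(cmd, check=True)
```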
Comparing the two captures, the encode quality is nearly indistinguishable (there are a few minor differences, but I have a hard time picking out which is better, or even which is which), while the hit to video frame rate, despite this being a largely GPU-limited test, is noticeably lower. Valley scored 4677 while being recorded with x264 and 5072 with NVENC, an improvement of roughly 8.5% from encoding on the GPU instead of the CPU, with vastly lower CPU overhead.
Overall, I'm pretty damn impressed. I was considering a major upgrade, partially to be able to record with x264 at the 'medium' or 'slow' presets, but I may be able to hold off a while longer now.
Also, note that this was on a Pascal GTX part. Newer Turing parts should have even better encoding efficiency, as NVIDIA's material mentions. I have not personally tested this though.