Beware of Huffyuv's "Predict median" Mode
Ben Rudiak-Gould's Huffyuv is a fairly common lossless video codec on Windows. Its main advantages are fast compression speeds, lossless support for both YCbCr and RGB video, and of course, it's free. Because it's lossless, the compression ratios that can be obtained are limited, but generally you can get around 1.6:1 to 2.2:1 on average, and that's a big help when you can use it during real-time video capture.
There is, however, a bear trap in its settings.
If you bring up the configuration dialog for the video codec (Video > Compression > Huffyuv, Configure in VirtualDub), there are two combo boxes at the top of the dialog which set the prediction modes for YUY2 and RGB video. The prediction mode changes the way that Huffyuv attempts to detect patterns in the video and thus the speed and effectiveness of the compression. Most of the modes are fast for both compression and decompression, but you should be aware of the "Predict median" mode on the YUY2 side. I've made the mistake of picking this mode before, and while it's fast for compression, it's unexpectedly slow for decompression. The reason is that the median predictor goes through a non-vectorized scalar code path on the decompression side. This won't affect the contents of your video, of course, but if decoding turns out to be too slow for your post-processing work, you can always recompress with another Huffyuv predictor thanks to the lossless nature of the codec.
I should note that I only have experience with the last official version of Huffyuv, 2.1.1. There have been some unofficial updates to Huffyuv to add workarounds for compatibility issues with applications and to add YV12 support, but I don't know if those include any speed improvements to the predictor. I looked into speeding up the median predictor at one point but was unable to do so, and I'm guessing that it's a hard problem because it has a lot in common with the Paeth predictor in PNG, which is also inherently serial and hard to parallelize. I also looked into rewriting the bitstream parser in C++ using compiler intrinsics, and of course ended up hanging both the VC8 and VC9 compilers.
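To make the serial dependency concrete, here is a minimal Python sketch of a median predictor of this general kind for a single 8-bit channel. It is an illustration under assumptions, not Huffyuv's actual code: the function names are made up, samples outside the image are treated as 0, the gradient term is wrapped to 8 bits, and the Huffman coding stage is left out entirely. The prediction is the median of the left sample, the top sample, and left + top - topleft.

```python
def median3(a, b, c):
    """Return the middle value of three integers."""
    return max(min(a, b), min(max(a, b), c))

def predict(rows, x, y):
    """Median prediction for sample (x, y) from its left/top/top-left neighbors.

    Assumption for this sketch: neighbors outside the image count as 0.
    """
    left = rows[y][x - 1] if x > 0 else 0
    top = rows[y - 1][x] if y > 0 else 0
    topleft = rows[y - 1][x - 1] if x > 0 and y > 0 else 0
    gradient = (left + top - topleft) & 0xFF  # wrapped to 8 bits (assumption)
    return median3(left, top, gradient)

def encode(image):
    """Residual = sample - prediction (mod 256).

    Every neighbor is already known on the encode side, so the samples
    in a row can be processed independently of each other.
    """
    return [[(image[y][x] - predict(image, x, y)) & 0xFF
             for x in range(len(image[y]))]
            for y in range(len(image))]

def decode(residuals):
    """Rebuild samples in raster order.

    Each sample needs the fully decoded sample to its left before its own
    prediction can be formed; that chain is the serial dependency.
    """
    out = []
    for y, row in enumerate(residuals):
        out.append([0] * len(row))
        for x, r in enumerate(row):
            out[y][x] = (predict(out, x, y) + r) & 0xFF
    return out

image = [[10, 20, 30], [15, 25, 200], [0, 255, 128]]
assert decode(encode(image)) == image  # lossless round trip
```

The asymmetry is visible in the two loops: `encode` reads only original samples, while `decode` has to finish each sample before it can start the next one in the row.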
The MultimediaWiki has a description of the Huffyuv compression algorithm: http://wiki.multimedia.cx/index.php?title=HuffYUV. Unfortunately, it seems to be missing a few critical details, most notably the exact nature of the VLC bitstream: 32-bit words in little endian format, codewords placed starting from the MSB, Huffman codes allocated longest codeword first, and with a maximum codeword length of 31 bits.
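As a rough illustration of that word and bit ordering, here is a small Python sketch of a bit reader that treats the buffer as little-endian 32-bit words and hands out bits starting from the MSB of each word. The class name and interface are hypothetical, and the Huffman table lookup that would sit on top of it is omitted; only the ordering comes from the description above.

```python
import struct

class BitReader:
    """Reads bits MSB-first from a buffer of little-endian 32-bit words."""

    def __init__(self, data):
        count = len(data) // 4
        # "<I" = little-endian unsigned 32-bit; trailing partial words are ignored.
        self.words = struct.unpack("<%dI" % count, data[:count * 4])
        self.pos = 0  # absolute bit position in the stream

    def read(self, nbits):
        """Return the next nbits bits as an integer, most significant bit first."""
        value = 0
        for _ in range(nbits):
            word = self.words[self.pos >> 5]
            bit = (word >> (31 - (self.pos & 31))) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value

# The word 0xA0000001 is stored in memory as the bytes 01 00 00 A0.
reader = BitReader(bytes([0x01, 0x00, 0x00, 0xA0]))
assert reader.read(3) == 0b101   # top three bits of the word come out first
assert reader.read(29) == 1      # the remaining 29 bits
```

Note that the first bits of the stream live in the fourth byte of the buffer, which is why a byte-oriented JPEG/MPEG style bit reader cannot consume this format without swapping.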
Incidentally, just like with Avisynth, I've noticed that people like to butcher the name of this codec for some reason. It's Huffyuv, first letter capitalized, rest in lowercase. It's been that way since 1.0.0 and appears that way in the documentation, dialogs, and source code. Yet, for some reason, people keep misnaming this codec HuffYUV.
I don't find "HuffYUV" objectionable, but I do see people type "HuffyUV" a lot. People also tend to butcher my handle, typing "stickyboy" instead of "stickboy". Go figure.
As for AviSynth, although that's not the capitalization BenRG originally used, it is the capitalization used by the people who picked up the project after BenRG.
James - 26 05 08 - 21:43
Did you try something like this: http://forum.doom9.org/showthread.php?p=..
I gave it a shot but speed ended up being the same.
squid - 28 05 08 - 01:42
Ew, yuck. You definitely DON'T want to do that since you'll get killed on the transpose due to cache misses. Huffyuv has a low ALU-to-memory count, so it's pretty sensitive to issues like that (read: it's fast because it doesn't do much). Also, as the guy noted, you can't do that with the stock Huffyuv format due to the cylindrical dependency.
It is possible to decode U and V in parallel, but I don't know how useful that is with Y being the bottleneck. I guess if the operation is bottlenecked just by pure ALU op count then it could give a ~30% boost or so.
I'm currently working on implementing support for some of the extensions that have been introduced by others post 2.1.1, and I think I want to beat the person who introduced them. (Was it you?) The annoying addition is dynamic Huffman tables, which wouldn't be a problem except that they're ENDIAN SWAPPED. Basically, someone wrote their own decoder/encoder that worked by swizzling the entire frame from Huffyuv ordering (32-bit LE words) to big endian so standard JPEG/MPEG style bitstream routines would work, and then they added the Huffman table on the front and reswizzled the whole buffer. This means that not only is the Huffman table swizzled, but it also misaligns the rest of the frame so that it can't be decoded with the original bitstream decoder! Endian swizzling the entire buffer is not a good idea for speed. I think I can work around this by priming the decoder, but what a pain.
(In case you haven't guessed, I'm adding a Huffyuv decoder to VirtualDub.)
Phaeron - 28 05 08 - 04:24
Avery, for a fast endian swap on Core 2 Duo and Penryn you can use the PSHUFB instruction to swap 16 bytes at a time (best if unrolled to cache line size of course). I agree that HuffYUV is a mess.
As for the capitalization, it is simple -- Huffman + YUV, only tech savvy people will capitalize it like that. I suppose that those who write Huffyuv follow standard English capitalization rules.
Igor Levicki (link) - 28 05 08 - 12:06
The version I’ve been using for a long time is “HuffYUV revisited” 2.2.0
BugsBunny - 28 05 08 - 16:08
I did add YV12, dynamic tables, and a DirectShow encoder interface, but I didn't release it into the wild. Any encodes using advanced features use the FOURCC ADHF to avoid conflicts. For dynamic tables I stored the tables (in the same format as what goes in extradata) before the compressed frame data, and they were extracted using the existing Huffyuv functions. The frequency of table recalculation could be set in the codec config, so there were delta frames with virtually no speed loss when seeking as long as the app used the ICDECOMPRESS_PREROLL flag properly.
AFAIK the only other dynamic tables implementation comes from ffdshow/libavcodec...
squid - 28 05 08 - 16:26
Also, isn't the maximum code length limited to 27 bits, with the remaining 5 bits used to store the length for faster encoding? I'm not sure about this but seem to recall something like it somewhere...
squid - 28 05 08 - 18:42
Yeah, but I doubt that endian swap would be ALU bound even if you just did BSWAP, which is single clock. SSE2 isn't so bad either (PSHUFLW + PSHUFHW + PSRLW + PSLLW + POR).
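For anyone wondering how that SSE2 sequence adds up to a 32-bit byte swap, here is a scalar Python emulation of the two steps on a single dword. The mapping of instructions to steps is my reading of the sequence (the shuffles exchange the 16-bit halves of each dword, the shifts and OR exchange the bytes inside each 16-bit lane), and the function names are just for illustration.

```python
def swap_halves(dword):
    """Swap the two 16-bit halves of a 32-bit value (the PSHUFLW/PSHUFHW step)."""
    return ((dword >> 16) | (dword << 16)) & 0xFFFFFFFF

def swap_bytes_in_words(dword):
    """Swap the bytes inside each 16-bit lane (the PSRLW/PSLLW/POR step)."""
    right = (dword >> 8) & 0x00FF00FF  # like PSRLW by 8 on each 16-bit lane
    left = (dword << 8) & 0xFF00FF00   # like PSLLW by 8 on each 16-bit lane
    return left | right

def bswap32(dword):
    """Full 32-bit byte swap built from the two steps above."""
    return swap_bytes_in_words(swap_halves(dword))

assert bswap32(0x11223344) == 0x44332211
```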
The main objection I have is that the second group of people decided to rename the frigging codec for no good reason.
Might be a restriction of the 2.1.1 encoder, but the decoder should be able to decode anything up to 31 bits. (It ORs in the LSB in order to avoid having to test the flags from the BSR instruction.) I'm not sure what ffmpeg does, although when I tested ffvfw it just put the same Huffman tables in all frames for all channels, which isn't very dynamic.
This certainly has been more interesting than I had expected -- got pretty quick and decent responses from the VC++ team about the 64-bit codegen bugs I hit.
Phaeron - 28 05 08 - 23:55
They type HuffYUV to avoid confusion with ♥Huffyluv♥
user - 29 05 08 - 02:42
Avery, AFAIK HuffYUV actually works on R-G, G, and B-G for prediction so the name is a bit off anyway.
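A minimal Python sketch of that R-G, G, B-G transform and its inverse, assuming 8-bit channels with wraparound arithmetic (the function names are made up for the example):

```python
def decorrelate(r, g, b):
    """Map (R, G, B) to (R-G, G, B-G), wrapping each difference to 8 bits."""
    return ((r - g) & 0xFF, g, (b - g) & 0xFF)

def recorrelate(rg, g, bg):
    """Inverse transform: add G back to both difference channels."""
    return ((rg + g) & 0xFF, g, (bg + g) & 0xFF)

pixel = (200, 180, 10)
assert recorrelate(*decorrelate(*pixel)) == pixel  # lossless despite wraparound
```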
As for BSWAP, it is 1 clock but it can swap only 4 bytes at a time while PSHUFB shuffles 16 bytes per clock, and on Core i7 (Nehalem) CPUs PSHUFB has two 128-bit shuffler units so it executes at a rate of 0.5 clocks.
Igor Levicki (link) - 02 11 08 - 03:49
If you're willing to change the HuffYUV bitstream format, it's possible to make Median SIMDable by reordering the pixel scan in a diagonal manner, so that you can decode 16 independent pixels at the same time, thus allowing SIMD median. This would also speed up the other prediction modes.
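A quick Python sketch of why the diagonal order helps, under the assumption that each pixel's prediction reads only its left, top, and top-left neighbors (no wraparound between rows, which the stock format has). Pixels on the same anti-diagonal (x + y constant) then never depend on each other, so a whole diagonal could be predicted at once. The helper names are hypothetical, and this only checks the scheduling, not an actual bitstream.

```python
def diagonal_order(width, height):
    """Group pixel coordinates by anti-diagonal, i.e. x + y == d."""
    return [[(x, d - x) for x in range(width) if 0 <= d - x < height]
            for d in range(width + height - 1)]

def dependencies(x, y):
    """Neighbors read by the predictor: left, top, top-left."""
    return [(x - 1, y), (x, y - 1), (x - 1, y - 1)]

def is_valid_schedule(groups):
    """True if every pixel's in-image dependencies are finished in an
    earlier group, so each group can be processed in parallel."""
    done = set()
    for group in groups:
        for x, y in group:
            for dx, dy in dependencies(x, y):
                if dx >= 0 and dy >= 0 and (dx, dy) not in done:
                    return False
        done |= set(group)
    return True

diagonals = diagonal_order(4, 3)
rows = [[(x, y) for x in range(4)] for y in range(3)]
assert is_valid_schedule(diagonals)   # diagonals are independent internally
assert not is_valid_schedule(rows)    # a raster row depends on itself (left)
```

Getting 16 pixels per step this way needs diagonals at least 16 pixels long, so the short diagonals at the corners of the frame would run at reduced width.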
Dark Shikari - 06 01 09 - 12:39