H.264 Decoder Overview
H.264 Advanced Video Coding is an ITU standard for encoding and decoding video with a target coding efficiency twice that of H.262 (MPEG-2). For example, it enables PAL resolution video to be transmitted at 1 Mbit/s. Like other video coding standards, H.264 specifies how to reconstruct, or decode, video from coded bits but does not specify how to encode video. H.264 shares many of the techniques used in other video codecs and adds new variations to improve coding efficiency. Coding efficiency is defined in terms of the log of the ratio of the number of bits required to encode a video to the number of bits in the original video.
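As a rough worked example of the figures above, the following sketch computes the raw bit rate of a PAL sequence and the log ratio for a 1 Mbit/s coded stream. The frame size (720x576), frame rate (25 frames/s), 4:2:0 chroma subsampling, 8-bit samples, and the base-10 logarithm are all assumptions made for illustration; the text above does not fix any of them.

```python
import math

def raw_bits_per_second(width, height, fps, bits_per_sample=8, samples_per_pixel=1.5):
    """Bit rate of uncompressed video.

    samples_per_pixel=1.5 corresponds to 4:2:0 sampling (assumed here):
    one luma sample per pixel plus two chroma planes at quarter resolution.
    """
    return width * height * samples_per_pixel * bits_per_sample * fps

def log_bit_ratio(coded_bps, raw_bps):
    """Log (base 10, an assumption) of coded bits over original bits."""
    return math.log10(coded_bps / raw_bps)

pal_raw = raw_bits_per_second(720, 576, 25)   # assumed PAL parameters
ratio = log_bit_ratio(1_000_000, pal_raw)     # 1 Mbit/s coded stream
print(f"raw: {pal_raw / 1e6:.1f} Mbit/s, log10 ratio: {ratio:.2f}")
```

Under these assumptions the uncompressed stream is on the order of a hundred megabits per second, so a 1 Mbit/s stream corresponds to a reduction of roughly two orders of magnitude.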
H.264 uses a variety of techniques to reduce the number of bits necessary to encode video. It uses intra-prediction to predict a video block from other video blocks within the same frame. It uses inter-prediction to predict video blocks from blocks in previous frames.
H.264 operates on 4x4 as well as 8x8 pixel blocks, unlike previous standards, which operated only on 8x8 blocks.
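To make the idea of prediction concrete, here is a minimal sketch of one intra-prediction mode, DC prediction for a 4x4 block, in which every pixel is predicted as the rounded mean of the four reconstructed pixels above and the four to the left. It assumes both neighbor sets are available and ignores the other intra modes; the function names are hypothetical. The bit saving comes from coding the small residual instead of the pixels themselves.

```python
def dc_predict_4x4(above, left):
    """Predict a 4x4 block as the rounded mean of its 8 neighboring pixels."""
    dc = (sum(above) + sum(left) + 4) >> 3   # +4 rounds before dividing by 8
    return [[dc] * 4 for _ in range(4)]

def residual(block, prediction):
    """Difference between the actual block and its prediction (what gets coded)."""
    return [[b - p for b, p in zip(brow, prow)]
            for brow, prow in zip(block, prediction)]

def reconstruct(prediction, resid):
    """Decoder side: add the residual back and clip to the 8-bit range."""
    return [[min(255, max(0, p + r)) for p, r in zip(prow, rrow)]
            for prow, rrow in zip(prediction, resid)]

above = [100, 102, 104, 106]          # reconstructed pixels above the block
left = [98, 100, 102, 104]            # reconstructed pixels left of the block
block = [[101, 102, 103, 104],
         [100, 101, 103, 105],
         [101, 103, 104, 104],
         [102, 103, 105, 106]]

pred = dc_predict_4x4(above, left)
resid = residual(block, pred)
assert reconstruct(pred, resid) == block   # lossless when the residual is not quantized
```

In this example the residual values are all within a few units of zero, which is what makes them cheaper to code than the original 8-bit samples.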
Network Adaptation Layer
Variable Length Coding
Discrete Cosine Transformation and Quantization
Intra Prediction: Spatial Correlation
Inter Prediction: Motion estimation and motion compensation
H.264, like many other video compression standards, relies on
To reduce computational complexity and to ensure that all implementations produce identical results, H.264 uses an integer approximation of the Discrete Cosine Transformation.
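The integer transform can be illustrated with the commonly documented 4x4 forward core transform matrix, applied as Y = C X C^T using only integer additions and multiplications by small constants. The sketch below is the forward (encoder-side) direction and omits the scaling that is folded into quantization; a decoder applies the corresponding inverse. The helper names are hypothetical.

```python
# 4x4 forward core transform matrix (integer approximation of the DCT basis)
CF = [[1,  1,  1,  1],
      [2,  1, -1, -2],
      [1, -1, -1,  1],
      [1, -2,  2, -1]]

def matmul(a, b):
    """Multiply two 4x4 integer matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transpose(m):
    return [list(row) for row in zip(*m)]

def forward_core_transform(block):
    """Y = CF * X * CF^T, computed entirely in integer arithmetic."""
    return matmul(matmul(CF, block), transpose(CF))

# A flat block has all of its energy in the DC coefficient.
flat = [[10] * 4 for _ in range(4)]
coeffs = forward_core_transform(flat)

# The rows of CF are mutually orthogonal, but not unit length;
# the differing row norms are what the quantizer scaling compensates for.
gram = matmul(CF, transpose(CF))
```

Because every operation is an exact integer operation, two implementations cannot drift apart through rounding differences, which is the uniformity property described above.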
Deblocking Filter
H.264 incorporates a "deblocking" filter to smooth the artifacts caused by operating on square blocks of pixels. The filter is applied inside the coding loop, so the filtered frames are the ones used as references for inter-prediction.
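A simplified sketch of the edge filter follows. It processes one line of samples p1 p0 | q0 q1 across a block boundary and adjusts only p0 and q0. In the standard, the thresholds alpha and beta and the clipping bound tc are derived from the quantization parameter and a boundary strength, and the filter may also modify p1 and q1; here they are plain parameters with made-up values, and those refinements are omitted.

```python
def clip3(lo, hi, x):
    """Clamp x to the range [lo, hi]."""
    return max(lo, min(hi, x))

def filter_edge_sample(p1, p0, q0, q1, alpha, beta, tc):
    """Filter one line of samples across a block edge; returns (p0', q0').

    The edge is left untouched when the step across it is large (likely a
    real image edge) or when either side is not smooth.
    """
    smooth = (abs(p0 - q0) < alpha and
              abs(p1 - p0) < beta and
              abs(q1 - q0) < beta)
    if not smooth:
        return p0, q0
    # Correction term, limited to +/- tc so the filter cannot over-smooth.
    delta = clip3(-tc, tc, (((q0 - p0) << 2) + (p1 - q1) + 4) >> 3)
    return clip3(0, 255, p0 + delta), clip3(0, 255, q0 - delta)

# Small step across the edge: treated as a blocking artifact and softened.
softened = filter_edge_sample(100, 100, 120, 120, alpha=30, beta=10, tc=4)

# Large step across the edge: treated as real picture content and preserved.
preserved = filter_edge_sample(50, 50, 200, 200, alpha=30, beta=10, tc=4)
```

The threshold test is the design point worth noting: the filter must distinguish discontinuities introduced by block-based coding from edges that belong to the picture.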
H.264 in Software
We started this work by examining the ITU reference source code for H.264 and the ffmpeg open source implementation.
In the course of the project, we learned that the standard reference implementations are of no use for studying computational complexity. The reference implementations perform poorly on general-purpose processors and are unusable as a starting point for the design of hardware accelerators. Standard software implementations obscure data dependences and concurrency. We switched to ffmpeg, which has been optimized for a variety of CPUs, including x86 and ARM.
We found that the software implementations of H.264 were unsuitable as a starting point for hardware acceleration because everything is put into global memory and control flow is explicitly sequential. Because a small number of large structures are used and passed throughout the source code, it is very difficult to find and expose any concurrency or parallelism.
- - write a section that analyzes the source code as a starting point
- - high-level point is everything gets put into global memory and gets munged - need to analyze data and control flow - explain it so that we can optimize the data dependences
H.264 in Bluespec
Our approach was to re-implement the H.264 decoder in Bluespec, keeping the modularity of the algorithm and exposing its parallelism. Each of the components in the H.264 block diagram is implemented as a transactor. We define a transactor as ... The transactors are connected by FIFOs, decoupling the execution of each component.
// Instantiate the modules
INalUnwrap nalunwrap <- mkNalUnwrap();
IEntropyDec entropydec <- mkEntropyDec();
IInverseTrans inversetrans <- mkInverseTrans();
IPrediction prediction <- mkPrediction();
IDeblockFilter deblockfilter <- mkDeblockFilter();

// Internal connections
mkConnection( nalunwrap.ioout, entropydec.ioin );
mkConnection( entropydec.ioout, inversetrans.ioin );
mkConnection( inversetrans.ioout, prediction.ioin );
mkConnection( prediction.ioout, deblockfilter.ioin );
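For readers more familiar with software, the FIFO-decoupled structure can be mimicked with threads and bounded queues. This is only an analogy written in Python, not Bluespec semantics and not the decoder itself: the stage names are borrowed from the modules above, while the stage bodies are placeholders that merely record which stages a unit has passed through.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream

def stage(fn, inbox, outbox):
    """Start a thread that applies fn to each item from inbox and forwards it."""
    def run():
        while True:
            item = inbox.get()
            if item is DONE:
                outbox.put(DONE)   # propagate end-of-stream downstream
                return
            outbox.put(fn(item))
    t = threading.Thread(target=run)
    t.start()
    return t

def run_pipeline(fns, items, depth=2):
    """Chain the stage functions with bounded FIFOs and collect the outputs."""
    fifos = [queue.Queue(maxsize=depth) for _ in range(len(fns) + 1)]
    threads = [stage(fn, fifos[i], fifos[i + 1]) for i, fn in enumerate(fns)]

    def feed():
        for item in items:
            fifos[0].put(item)     # blocks when the first FIFO is full
        fifos[0].put(DONE)
    feeder = threading.Thread(target=feed)
    feeder.start()

    results = []
    while True:
        out = fifos[-1].get()
        if out is DONE:
            break
        results.append(out)
    feeder.join()
    for t in threads:
        t.join()
    return results

def tag(name):
    """Placeholder stage: appends its own name to the unit's trace."""
    return lambda trace: trace + [name]

names = ["nalunwrap", "entropydec", "inversetrans", "prediction", "deblockfilter"]
results = run_pipeline([tag(n) for n in names], [[f"unit{i}"] for i in range(10)])
```

Each stage touches only its own input and output queue, so no state is shared between stages, and the bounded queue depth provides back-pressure when a downstream stage is slower than its producer.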
- - section on how to approach it in Bluespec/transactor style
- - straightforward dataflow network - make the blocks explicit in the code - if we just code it up, takes X lines bluespec, verilog, area, time - talk about memory -- by pipelining it, it turns into separate local memories
- - big energy and cost savings
Architectural Exploration
- - section on exploration (perhaps)
- - so now what? - we want to improve it for different designs - enable architectural exploration
Algorithm Exploration
- - FPGA work
- - to enable algorithmic work - coding complexity, coding efficiency, power - to enable hardware design for different targets (cost vs performance)
- check Wen-mei Hwu for references
- - everything is 5% if you profile it - not easy to improve
