# H.264/AVC Hardware Encoders and Low-Power Features

Ngoc-Mai Nguyen, Edith Beigne, Suzanne Lesecq

CEA, LETI, MINATEC Campus F-38054, Grenoble, France {firstName.LastName}@cea.fr

*Abstract*: Because of its significant bit-rate reduction compared with previous video compression standards, H.264/AVC has been successfully used in a wide range of applications. In hardware design for H.264/AVC video encoders, power reduction is currently a major challenge. This paper surveys H.264/AVC hardware encoders, focusing on their power features and on the power reduction techniques that can be applied. A new H.264/AVC hardware encoder, named VENGME, is proposed. This low-power encoder has a four-stage architecture with reduced memory access, in which each module has been optimized. The total power consumption, estimated at RTL level, is 19.1 mW.

*Keywords*: H.264 encoder, HW architecture, power features

## I. INTRODUCTION

H.264 Advanced Video Coding (H.264/AVC) [1] is one of the most widely used video compression standards. Recommended by the Joint Video Team (JVT), formed by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG), it contains a wide set of video coding tools to support a variety of applications, including mobile services, video conferencing, digital broadcast for IPTV, HDTV and digital storage media. Compared with the previous standards MPEG-4, H.263 and MPEG-2, H.264/AVC achieves bit-rate reductions of 39%, 49% and 64%, respectively [2]. Its successor, H.265/HEVC, was formally published in 2013 [3]. It promises further bandwidth savings, with a reported bit-rate reduction of 40.3% [4]. This enables Ultra High Definition Television, which is currently rarely in use. However, the new standard also requires much more intensive computation than H.264, which leads to higher power consumption and shortened battery life. Given this higher cost, the switch to the new standard has to be carefully considered.

The main challenges that drive H.264 hardware (HW) implementation are area cost, coding speed for real-time high-definition video, and power consumption. With the improvement of semiconductor technology, area cost now draws little attention, while researchers still focus on improving coding speed, especially for complex encoders that implement the high profile. Lastly, the power consumption of video encoders is becoming a major concern because video applications for mobile devices are now popular.

After presenting the basic concepts of H.264 video encoding in Section II, this paper gives an overview of H.264 hardware encoders, focusing on low-power features (Section III). Then, in Section IV, the VENGME H.264 video encoder architecture is proposed. Finally, power simulations at RTL level illustrate the capabilities of the VENGME platform.

Duy-Hieu Bui, Nam-Khanh Dang, Xuan-Tu Tran

VNU University of Engineering and Technology Hanoi, Vietnam tutx@vnu.edu.vn

## II. H.264 VIDEO ENCODING BASICS

The encoding path consists of Intra (IntraP) and Inter (InterP) prediction, the latter comprising Motion Estimation (ME) and Motion Compensation (MC), followed by Forward Transform and Quantization (FTQ), Re-ordering, and the Entropy Coder (EC), see Fig. 1. IntraP predicts the current macroblock (MB, 16×16 pixels) from previously encoded pixels in the current frame, to remove the spatial redundancy of the video data. To remove temporal redundancy, InterP estimates the motion of the current MB from previously encoded pixels in other frames. The residual data (i.e. the difference between the original current MB and the predicted one) is transformed and quantized. The post-quantization coefficients are then reordered and entropy encoded to remove the remaining statistical redundancy. The encoded video is encapsulated into Network Abstraction Layer (NAL) units. A decoding path containing Inverse Transform and de-Quantization (ITQ) and a De-blocking Filter (DF) is also built to generate the reference data for prediction. IntraP directly uses the data from ITQ, while InterP refers to reconstructed frames from the DF.
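The prediction and residual steps described above can be illustrated with a minimal Python sketch. It is not taken from any of the surveyed encoders: it covers only three of the 4×4 intra modes (vertical, horizontal, DC), assumes all neighbouring pixels are available, and uses the sum of absolute differences (SAD) as an assumed cost function. The function names are hypothetical.

```python
import numpy as np

def predict_4x4(top, left):
    """Build three candidate 4x4 predictions from reconstructed neighbours.

    top  : the 4 pixels directly above the block
    left : the 4 pixels directly to the left of the block
    """
    top = np.asarray(top, dtype=np.int32)
    left = np.asarray(left, dtype=np.int32)
    dc = int(top.sum() + left.sum() + 4) >> 3  # rounded mean of the 8 neighbours
    return {
        "vertical": np.tile(top, (4, 1)),               # each column copies the pixel above
        "horizontal": np.tile(left[:, None], (1, 4)),   # each row copies the pixel on the left
        "dc": np.full((4, 4), dc, dtype=np.int32),      # flat block at the neighbour mean
    }

def best_intra_mode(block, top, left):
    """Return (mode, cost, residual) for the candidate with the lowest SAD."""
    block = np.asarray(block, dtype=np.int32)
    best = None
    for mode, pred in predict_4x4(top, left).items():
        residual = block - pred                # what FTQ would receive
        cost = int(np.abs(residual).sum())     # SAD, an assumed cost function
        if best is None or cost < best[1]:
            best = (mode, cost, residual)
    return best

# A block made of vertical stripes is predicted exactly by the vertical mode.
top = [10, 20, 30, 40]
left = [10, 10, 10, 10]
block = np.tile(np.array(top), (4, 1))
mode, cost, residual = best_intra_mode(block, top, left)
```

Only the residual of the selected mode would be passed on to the transform and quantization stage.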

The H.264/AVC standard has adopted several advances in coding technology to achieve a higher compression ratio:

- The tradeoff between compression performance and video quality needed to meet application requirements is obtained with a new way of handling the post-quantization coefficients. For example, Context-Adaptive Variable Length Coding (CAVLC) is used to encode residual data. In CAVLC, VLC tables are switched according to already transmitted syntax elements. Since these VLC tables are specifically designed to match the corresponding image statistics, the entropy coding performance is greatly improved in comparison to schemes using only a single VLC table [5];
- Variable block size prediction provides more flexibility. IntraP can be applied either to 4×4 blocks individually or to entire 16×16 MBs. Nine (resp. four) different prediction modes exist for a 4×4 (resp. 16×16) block. The cost functions of all possible modes are compared, and the mode with the lowest cost is selected. InterP is based on a tree structure in which the motion vector and prediction can adopt various block sizes and partitions, ranging from 16×16 down to 4×4 blocks. To identify these prediction modes, motion vectors and partitions, the H.264/AVC standard specifies a complex algorithm to derive them from their neighbors;
- The forward/inverse transform also operates on blocks of 4×4 pixels to match the smallest block size. The transform is still a Discrete Cosine Transform (DCT), but with fundamental differences compared to the ones in previous standards [6]. In [7], the transform unit is composed of both DCT and Walsh-Hadamard transforms for all prediction processes;

- In-loop DF in the H.264/AVC depends on the so-called Boundary Strength (BS) parameters to determine whether the current block edge should be filtered. The derivation of the BS is highly adaptive because it relies on the modes and coding conditions of the adjacent blocks.



Fig. 1. Functional diagram of the H.264/AVC encoder.

## III. SURVEY OF H.264 HARDWARE ENCODERS

This section analyses H.264 Hardware (HW) encoders found in the literature from the basic idea of pipelining architecture to several creative improvements. Power features of these architectures are discussed.

### A. Basic pipelining design

Due to the long encoding path of the H.264 standard, pipelined architectures at MB level are usually implemented. Fig. 2 shows the major modules of a four-stage H.264 encoder [8]. The ME block, which operates with the MC block to perform InterP, is a powerful coding tool but has a huge computational complexity. An ME module with full search can account for more than 90% of the overall computation [9]. In pipelined architectures, the ME task is therefore split into two sub-tasks, integer ME (IME) and fractional ME (FME), which occupy the first two stages. To achieve a balanced schedule, IntraP is placed in the third stage. The intra mode decision requires FTQ/ITQ and reconstruction (Rec.) in the same stage as IntraP. The last stage contains two independent modules, EC and DF. The four-stage pipeline cuts the coding path in a balanced manner, which eases task scheduling but increases the overall encoder latency.

Sometimes only three stages are implemented, for several reasons. First, FME and IntraP can be grouped into one stage to share the current block and pipeline buffers [10]. Second, the latency of the entire pipeline is minimized [11]. Lastly, reducing the number of stages also decreases the power consumed by data pipelining [9]. However, this scheme obviously leads to an unbalanced schedule. When IntraP and FME operate in parallel, too many tasks are put into the second stage. To avoid this throughput bottleneck, [9] retimes IntraP and Rec. and distributes them over the last two stages. The luminance data is processed in the second stage, and the chrominance data in the third one. The FME engine is also shared by the first two stages [9].
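The trade-off between the number of stages, latency and throughput can be made concrete with a small Python sketch. It assumes a simple lock-step model in which every stage advances at the pace of the slowest stage; the cycle counts are hypothetical and do not come from any of the cited designs. The count of 396 MBs corresponds to one CIF frame (352×288 pixels, i.e. 22×18 MBs).

```python
def pipeline_cycles(stage_cycles, num_mbs):
    """Model an MB-level pipeline that advances in lock-step.

    stage_cycles : cycles each stage needs for one macroblock
    num_mbs      : number of macroblocks pushed through the pipeline
    Returns (slot, latency, total):
      slot    - cycles per pipeline slot, set by the slowest stage
      latency - cycles before the first MB leaves the pipeline
      total   - cycles until the last MB leaves the pipeline
    """
    slot = max(stage_cycles)
    stages = len(stage_cycles)
    latency = slot * stages
    total = slot * (stages + num_mbs - 1)
    return slot, latency, total

MBS_PER_CIF_FRAME = (352 // 16) * (288 // 16)  # 22 * 18 = 396

# Hypothetical cycle counts, for illustration only.
four_balanced = pipeline_cycles([1000, 1000, 1000, 1000], MBS_PER_CIF_FRAME)
three_balanced = pipeline_cycles([1000, 1000, 1000], MBS_PER_CIF_FRAME)
three_unbalanced = pipeline_cycles([1000, 1800, 1000], MBS_PER_CIF_FRAME)
```

With the same slot length, removing a stage shortens the latency by one slot. If merging modules lengthens one stage, however, the slot grows and the per-frame cycle count rises with it, which is the throughput bottleneck that retiming aims to avoid.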



Fig. 2. A 4-level pipelining architecture for H.264 HW encoder.

### B. Improvements for specific oriented design

H.264 encoders can be split into three sets, namely scalability-oriented, speed-oriented and low-power-oriented ones. Due to space restrictions, the reader is referred to [12] and [13] for the first two sets, respectively. Since the VENGME design targets low power consumption, only the last set is discussed hereafter.

Low-power H.264 encoders also implement the pipelined architecture, with additional low-power techniques such as DCSS [14] or clock gating [9], both of which exploit the inactive state of sub-modules to reduce power consumption. Fig. 3 shows the time slots in which power can be saved. DCSS cuts off the clock signal of a stage when none of its modules is operating, reducing power consumption by 16% [14]. Fine-grained clock gating pauses the clock signal of each unused module, leading to a power reduction of up to 20% [9]. The latter thus seems to provide a larger power reduction, but its control is more costly.



Fig. 3. DCSS and fine-grained clock gating in H.264 encoding schedule.
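The difference between stage-level clock cut-off and fine-grained, per-module clock gating can be sketched with a simple counting model in Python. The model is an assumption made for illustration: it counts clocked module-cycles only, ignores the cost of the gating control, and uses hypothetical activity figures, so its output is unrelated to the 16% and 20% reported in [14] and [9].

```python
def gated_fraction(stages, slot):
    """Fraction of module-cycles whose clock can be stopped in one pipeline slot.

    stages : list of stages, each a list of active cycles per module
    slot   : length of the pipeline slot in cycles
    Returns (coarse, fine):
      coarse - stage-level cut-off: a stage is gated only while all of its
               modules are idle
      fine   - module-level gating: every module is gated whenever it is idle
    """
    module_count = sum(len(modules) for modules in stages)
    total = module_count * slot
    coarse = sum(len(modules) * (slot - max(modules)) for modules in stages)
    fine = sum(slot - active for modules in stages for active in modules)
    return coarse / total, fine / total

# Hypothetical activity (cycles per slot) for a four-stage schedule.
schedule = [[800], [1000], [600, 900], [500, 700]]
coarse, fine = gated_fraction(schedule, slot=1000)
```

In this model the fine-grained figure can never be lower than the stage-level one, since a module is idle at least as long as its whole stage is.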

Memory access also consumes power. A cache memory can be implemented [15] to pre-fetch reference images for the MC module in HD applications, reducing the external bandwidth of MC by 75%-86%. The encoder in [16] applies three algorithms (frame memory compression, early skip mode decision and a reduced FME search range) to save up to 49.9% of the bus and external memory power consumption. In [10], IntraP and FME are placed in the second stage to use common current block and pipeline buffers. Moreover, that design implements an 8-pixel parallel IntraP to reduce area cost, together with a dedicated ME block that can sustain a high throughput. Its high-throughput IME, based on a Parallel Multi-Resolution ME algorithm, reduces memory access by 46%. A low-power ME module that implements data reuse techniques to save memory access is proposed in [9]. Its IME data access solution consumes 78% less than a standard IME engine, and its FME engine halves the memory access, thus saving a large amount of data access power. Reducing memory access is therefore an efficient way to obtain both high throughput and low power. However, it requires specific sub-modules, which makes the design task complex and difficult.

Other designs offer not only low-power features but also quality scalability. As an example, [16] defines 10 power levels to adapt the power consumption to the remaining energy. Lastly, some H.264 encoders are specifically dedicated to mobile applications [9]. For instance, [17] focuses on portable video applications for a wide range of resolutions, up to HD720@30fps. For each resolution, four different quality levels are defined, each of which is associated with a given clock frequency (power level). The quality-scalability feature is implemented with parameterized modules, e.g. InterP and IntraP.
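To see why data reuse in ME saves so much memory traffic, consider a generic search-window reuse scheme, sketched below in Python. It is not the specific technique of [9] or [10]: it simply assumes a square search range of ±`search_range` pixels and that, when moving to the horizontally adjacent MB, only the newly exposed columns of the search window are fetched. The function name and parameters are hypothetical.

```python
def search_window_loads(search_range, mb_size=16):
    """Reference pixels fetched per macroblock, with and without window reuse.

    search_range : search range in pixels on each side of the MB (assumed square)
    mb_size      : macroblock width and height in pixels
    Returns (full, reuse, saving):
      full   - pixels fetched when the whole window is reloaded for each MB
      reuse  - pixels fetched when only the new columns are loaded
      saving - fraction of memory accesses avoided by the reuse
    """
    side = mb_size + 2 * search_range
    full = side * side
    reuse = mb_size * side      # mb_size new columns, each `side` pixels tall
    return full, reuse, 1 - reuse / full

full, reuse, saving = search_window_loads(search_range=16)
```

For a ±16 search range the window is 48×48 pixels, so reloading it for every MB costs 2304 pixel reads against 768 with reuse, i.e. two thirds of the accesses are avoided in this idealized model.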
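A power-aware level selection of the kind described above can be sketched as follows. This is a hypothetical policy, not the one used in [16] or [17]: it picks the highest quality level whose energy over the requested encoding time fits within the remaining energy budget. The level names and power values are placeholders chosen for illustration.

```python
def pick_level(levels, remaining_mj, duration_s):
    """Pick the best quality level that fits the energy budget.

    levels       : (name, power_mw) pairs ordered from highest to lowest quality
    remaining_mj : remaining energy budget in millijoules
    duration_s   : encoding time to sustain, in seconds
    Returns the name of the selected level, or None if no level fits.
    """
    for name, power_mw in levels:
        energy_mj = power_mw * duration_s   # mW * s = mJ
        if energy_mj <= remaining_mj:
            return name
    return None

# Placeholder levels (hypothetical names and power values).
LEVELS = [("Q3", 25.0), ("Q2", 18.0), ("Q1", 12.0), ("Q0", 7.0)]
choice = pick_level(LEVELS, remaining_mj=1000.0, duration_s=60.0)
```

In a real encoder each level would map to a clock frequency and a set of module parameters rather than to a bare power figure.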

TABLE I. SURVEY OF H.264 ENCODER ARCHITECTURES

|  | 2007 [11] | 2008 [12] | 2009 [13] | 2006 [18] | 2009 [9] | 2007 [14] | 2008 [10] | 2009 [17] | 2011 [16] |
|---|---|---|---|---|---|---|---|---|---|
| Target | Real-time | Scalable extension (SVC); high profile | Performance, low power, video size scalable | HW design for H.264 codec | Low-power, power-aware, portable devices | Low-power, real-time, high picture quality | High profile, low area cost, high throughput | Dynamic quality-scalable, power-aware video applications | Low-power; power-aware |
| Profile | Baseline, level 4 | High profile; SVC | High, level 4.1 | Baseline, level up to 3.1 | Baseline | Baseline, level 3.2 | Baseline/High, level 4 | Baseline | N/A |
| Resolution | 1080p30 | HDTV 1080p | 1080p30 | 720p SD/HD | QCIF, 720SDTV | 720p SD/HD | CIF to 1080p | CIF to HD720 | CIF, HD 1280×720 |
| Technology (nm) | UMC 180, 1P6M CMOS | UMC 90, 1P9M | 65 | UMC 180, 1P6M CMOS | TSMC 180, 1P6M CMOS | Renesas 90, 1POLY-7Cu-ALP | UMC 130 | 130 | N/A |
| Frequency (MHz) | 200 | 120 for high profile<br>166 for SVC | 162 | 81 for SD<br>180 for HD | N/A | 54 for SD<br>144 for HD | 7.2 for CIF<br>145 for 1080p | 10-12-18-28 for CIF<br>72-108 for HD720 | N/A |
| Gate count (Kgate) | 1140 | 2079 | 3745 | 922.8 | 452.8 | 1300 | 593 | 470 | N/A |
| Memory (Kbyte) | 108.3 | 81.7 | 230 | 34.72 | 16.95 | 56 | 22 | 13.3 | N/A |
| Power consumption (mW) | 1410 | 306 for high profile<br>411 for SVC | 256 | 581 for SD<br>785 for HD | <b>40.3</b> for CIF, 2 ref.<br><b>9.8-15.9</b> for CIF, 1 ref.<br>64.2 for 720SDTV | 64 for 720p HD | 6.74 for CIF baseline profile<br>242 for 1080p high profile | 7-25 for CIF<br>122-183 for HD720 | 238.38 to 359.89, depending on power level |

### C. Discussion

Table I compares several state-of-the-art solutions. Various features are presented, but the focus is on power.

Profile and resolution obviously influence the operating frequency, and hence the power consumption. Indeed, encoders that support multiple profiles [10] or multiple resolutions [9][10][14][17][18] operate at different frequencies and therefore consume different amounts of power. When comparing power figures, the resolution and profile that each encoder supports have to be taken into account. Specific low-power techniques [9][14] and strategies to reduce memory access [10][16] yield promising power consumption figures [9][10][14]. Recent encoders with low-power features [9][17], which also have a smaller area cost, seem more suitable for mobile applications.
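One simple way to take resolution into account when comparing power figures is to normalize power by the pixel rate, which gives an energy per encoded pixel. The Python sketch below applies this to the two 1080p30 entries of Table I. The metric itself is an assumption made here for illustration and is not used by the surveyed papers; it also ignores differences in profile and technology node, so it remains a rough indicator.

```python
def energy_per_pixel_nj(power_mw, width, height, fps):
    """Energy spent per encoded pixel, in nanojoules.

    power_mw      : encoder power in milliwatts
    width, height : frame size in pixels
    fps           : frames encoded per second
    """
    pixels_per_second = width * height * fps
    joules_per_pixel = (power_mw * 1e-3) / pixels_per_second
    return joules_per_pixel * 1e9

# Power figures from Table I, both at 1080p30.
e_ref11 = energy_per_pixel_nj(1410, 1920, 1080, 30)  # [11], 180 nm
e_ref13 = energy_per_pixel_nj(256, 1920, 1080, 30)   # [13], 65 nm
```

Under this metric the two designs differ by the same factor as their raw power figures, since they target the same resolution and frame rate. The normalization becomes informative when the resolutions differ, provided the frame rate of each measurement is known.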

## IV. VENGME H.264 ENCODER

The "Video Encoder for the Next Generation Multimedia Equipment" (VENGME) circuit is proposed to implement an H.264/AVC encoder targeting mobile platforms. The current design is optimized for CIF video. However, it can be extended to larger resolutions by enlarging the reference memory and the search window.

### A. Architecture

An overview of the VENGME design is provided in Fig. 4. It differs from previous solutions in several respects. Besides a first stage that loads data, similar to [14], it cuts the coding path into three main stages, namely prediction, TQ-Rec. and EC-DF. Because IME and FME are placed in the same stage, so that they can share the IME information and the data in the search window SRAM, this pipeline is even more unbalanced than the three 3-stage pipelines found in the literature. However, extra external memory access bandwidth can be saved, while the performance for the targeted applications remains unchanged. This design is also well suited to power management techniques: the speed of the last two stages can be adapted with respect to the heavy prediction stage.



Fig. 4. VENGME H.264 encoder architecture.

Moreover, InterP and IntraP, which sit in the same stage, can be executed either in parallel or separately, depending on the decision of the system controller. In the separate mode, power can be saved by switching off IntraP or InterP while the other one is active. In the parallel mode, IntraP finishes first and its results are stored in the TQIF memory. It can then be switched off to save power while InterP and MC continue to search for the "best" predicted pixels. Once the InterP results are available, the TQIF memory can be invalidated to store the new transformed results for InterP. The separate mode, currently in use, can be seen as a low-power mode for the system.

For pipelining, the double memory scheme increases the area cost but it maintains the memory access. Some dedicated techniques to reduce the area cost and increase the throughput are applied to each module.

### B. Power simulation results

The proposed H.264 encoder has been modeled in VHDL at RTL level. Its power consumption is estimated at RTL level while encoding video of QCIF resolution, in a 32 nm technology, using the SpyGlass<sup>TM</sup> Power tool, see Table II. The estimated total power consumption is 19.1 mW. Note that the leakage power is not accurately estimated at RTL level, as it highly depends on gate choices; it is in fact over-estimated. The actual power consumption can therefore be expected to be smaller.

A power consumption of 19.1mW makes the VENGME encoder suitable for mobile applications. However, there is still room for power consumption improvement via the implementation of adaptive power management techniques as activity is highly unbalanced between modules. Of course, this will complicate the design but it is a worthwhile effort with respect to the power gain. The analysis of the relative power results between the modules drives the power management strategy to be implemented.

| Power<br>(mW) | InterP | IntraP | TQ   | EC   | DF   | Sw_dma | H.264<br>encoder |
|---------------|--------|--------|------|------|------|--------|------------------|
| Total         | 7.86   | 0.32   | 0.82 | 0.9  | 0.71 | 6.39   | 19.1             |
| Leakage       | 3.43   | 0.1    | 0.52 | 0.31 | 0.46 | 2.88   | 9.21             |
| Internal      | 2.76   | 0.13   | 0.16 | 0.26 | 0.11 | 0.94   | 4.56             |
| Switching     | 1.68   | 0.09   | 0.14 | 0.32 | 0.15 | 2.57   | 5.37             |

TABLE II. POWER COMPOSITION OF VENGME H.264 ENCODER
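The relative contributions in Table II can be computed directly, as in the Python sketch below. The numbers are copied from the table; the dictionary layout and function name are choices made here for illustration.

```python
# Power figures from Table II, in mW.
TABLE_II = {
    "InterP": {"total": 7.86, "leakage": 3.43, "internal": 2.76, "switching": 1.68},
    "IntraP": {"total": 0.32, "leakage": 0.10, "internal": 0.13, "switching": 0.09},
    "TQ":     {"total": 0.82, "leakage": 0.52, "internal": 0.16, "switching": 0.14},
    "EC":     {"total": 0.90, "leakage": 0.31, "internal": 0.26, "switching": 0.32},
    "DF":     {"total": 0.71, "leakage": 0.46, "internal": 0.11, "switching": 0.15},
    "Sw_dma": {"total": 6.39, "leakage": 2.88, "internal": 0.94, "switching": 2.57},
}
ENCODER = {"total": 19.1, "leakage": 9.21, "internal": 4.56, "switching": 5.37}

def module_shares():
    """Share of the encoder's total power attributed to each listed module."""
    return {name: row["total"] / ENCODER["total"] for name, row in TABLE_II.items()}

shares = module_shares()
leakage_share = ENCODER["leakage"] / ENCODER["total"]
listed_total = sum(row["total"] for row in TABLE_II.values())
```

InterP and Sw_dma together account for roughly three quarters of the total (about 41% and 33%, respectively), and leakage represents about 48% of the estimate. The six listed modules sum to 17.0 mW, so about 2.1 mW of the 19.1 mW total is not attributed to any of them in the table.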

## V. CONCLUSION

In this paper, a survey of H.264 HW video encoder implementations is presented. From the analysis of power figures, it can be deduced that both specific low-power techniques and memory access reduction provide power efficiency.

The VENGME platform, a new H.264 HW video encoder architecture, has been presented. It targets CIF video for mobile applications, but the design can be extended to higher resolutions. High throughput, small silicon area and low memory bandwidth were achieved in each module. The total power consumption, estimated at RTL level, is 19.1 mW, which is comparable to that of other HW encoders targeting mobile applications. Note that the actual power should be smaller, as the leakage component is over-estimated. Power management techniques are currently being implemented on the VENGME architecture in order to save even more power.

## ACKNOWLEDGMENT

This work is partly funded by Vietnam National University, Hanoi (VNU) through research project No. QGDA.10.02 (VENGME), and by the projects Catrene HARP No. CA112 and Catrene BENEFIC No. CA505. The authors would like to thank Nafosted for the travel grant.

## REFERENCES

- [1] "ITU-T recommendation and international standard of joint video specification", ITU-T Rec. H.264/ISO/IEC14496-10 AVC, January 2012.
- [2] A. Joch, F. Kossentini, H. Schwarz, T. Wiegand, and G. J. Sullivan, "Performance comparison of video coding standards using Lagrangian coder control", *IEEE Int. Conf. on Image Processing - ICIP*, 2002.
- [3] "ITU-T recommendation and international standard of joint video specification", ITU-T Rec. H.265/HEVC, April 2013.
- [4] J. Ohm, G. J. Sullivan, H. Schwarz, T. K. Tan, and T. Wiegand, "Comparison of the Coding Efficiency of Video Coding Standards Including High Efficiency Video Coding (HEVC)", *IEEE Trans. on Circuits and Systems for Video Technology*, Vol. 22, no. 12, Dec. 2012.
- [5] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC Video Coding Standard", *IEEE Trans. on Circuits and Systems for Video Technology*, Vol. 13, no. 7, pp. 560-576, 2003.
- [6] I. E. G. Richardson, The H.264 Advanced Video Compression Standard, 2<sup>nd</sup> ed. John Wiley & Sons, New York, NY, USA, 2010.
- [7] X.-T. Tran, V.-H. Tran, "An Efficient Architecture of Forward Transforms and Quantization for H.264/AVC Codecs", *REV Journal on Electronics and Communications JEC*, Vol. 1, no. 2, pp. 122-129, 2011.
- [8] Y.-W. Huang, T.-C. Chen, C.-H. Tsai, C.-Y. Chen, T.-W. Chen, C.-S. Chen, C.-F. Shen, S.-Y. Ma, T.-C. Wang, B.-Y. Hsieh, H.-C. Fang, and L.-G. Chen "A 1.3TOPS H.264/AVC single-chip encoder for HDTV applications", *IEEE Int. Solid-State Circuits Conf.*, 2005.
- [9] Y.-H. Chen, T.-C. Chen, C.-Y. Tsai, S.-F. Tsai, and L.-G. Chen, "Algorithm and Architecture Design of Power-Oriented H.264/AVC Baseline Profile Encoder for Portable Devices", *IEEE Trans. on Circuits and Systems for Video Technology*, Vol. 19, no. 8, pp. 1118-1128, 2009.
- [10] Y.-K. Lin, D.-W. Li, C.-C. Lin, T.-Y. Kuo, S.-J. Wu, W.-C. Tai, W.-C. Chang, T.-S. Chang, "A 242mW 10mm<sup>2</sup> 1080p H.264/AVC High-Profile Encoder Chip," *IEEE Int. Solid-State Circuits Conf.*, 2008.
- [11] Z. Liu, Y. Song, M. Shao, and S. Li, "A 1.41W H.264/AVC real-time encoder SoC for HDTV1080p", *IEEE Symp. on VLSI Circuits*, 2007.
- [12] Y.-H. Chen, et al., "An H.264/AVC scalable extension and high profile HDTV 1080p encoder chip", IEEE Symp. on VLSI Circuits, 2008.
- [13] K. Iwata, et al., "A 256 mW 40 Mbps Full-HD H.264 High-Profile Codec Featuring a Dual-Macroblock Pipeline Architecture in 65 nm CMOS", IEEE J. of Solid State Circuits, 44 (4), pp. 1184-1191, 2009.
- [14] S. Mochizuki, T. Shibayama, M. Hase, F. Izuhara, K. Akie, M. Nobori, R. Imaoka, and H. Ueda, "A low power and high picture quality H.264/MPEG-4 video codec IP for HD mobile applications", *IEEE Asian Solid-State Circuits Conf.*, 2007.
- [15] S. Zuo, M. Wang and L. Xiao, "A Cache Hardware design for H.264 encoder", Int. Conf. Instrumentation, Measurement, Computer, Communication and Control (IMCCC), 2012.
- [16] H. Kim, C. E. Rhee, J.-S. Kim, S. Kim, H.-J. Lee, "Power-Aware Design with Various Low-Power Algorithms for an H.264/AVC Encoder", *IEEE Int. Symp. on Circuits and Systems (ISCAS)*, pp. 571-574, 2011.
- [17] H.-C. Chang, J. Chen, B. Wu, C. Su, J. Wang, and J. Guo, "A Dynamic Quality-Adjustable H.264 Video Encoder for Power-Aware Video Applications", *IEEE Trans. on Circuits and Systems for Video Technology*, Vol. 19, no. 12, pp. 1739-1754, Dec. 2009.
- [18] T.-C. Chen, et al., "Analysis and Architecture Design of an HDTV720p 30 frames/s H.264/AVC encoder", *IEEE Trans. on Circuits and Systems for Video Technology*, Vol. 16, no. 6, pp. 673-688, 2006.
- [19] N.-M. Nguyen, E. Beigne, S. Lesecq, P. Vivet, D.-H. Bui, X.-T. Tran, "Hardware implementation for entropy coding and byte stream packing engine in H.264/AVC", *Int. Conf. on Advanced Technologies for Communications (ATC)*, pp. 360-365, 2013.