# Analysis and Evaluation of Traffic-Performance in a Backtracked Routing Network-on-Chip

P.T. Hong<sup>‡</sup>, Phi-Hung Pham<sup>†</sup>, Xuan-Tu Tran<sup>‡</sup>, and Chulwoo Kim<sup>†</sup>

<sup>†</sup>Department of Electronics and Electrical Engineering, Korea University, Seoul, Korea

<sup>‡</sup>College of Technology, Vietnam National University, Hanoi, Vietnam

Email: {hongpt\_cn, tutx}@vnu.edu.vn; {hungpp, ckim}@korea.ac.kr

*Abstract* --- VLSI designers have recently adopted the micro network-on-chip (NoC) as an emerging solution for designing complex SoC systems under stringent constraints on cost, size, power consumption, and time-to-market. Characterizing on-chip traffic and evaluating traffic performance are necessary steps toward a comprehensive and effective NoC design. This paper presents an analysis and performance evaluation framework for a backtracked routing Network-on-Chip that provides guaranteed and energy-efficient data transfer. Experimental results, under common and application-oriented synthetic traffic, characterize the performance in terms of latency and throughput and suggest a trade-off for developers mapping applications onto the proposed NoC platform.

Keywords --- Network-on-Chip; on-chip traffics; on-chip communication; network architecture; performance evaluation.

# I. INTRODUCTION AND RELATED-WORKS

The challenges of modern complex SoC design have led VLSI designers to adopt the micro network-on-chip (NoC) approach as an emerging solution for on-chip interconnection and communication [1]-[4]. Latency, throughput, Quality of Service (QoS), and testing [5] are important targets for any NoC design under constraints of power, timing, and silicon area. To achieve these goals, various NoC architectures, with routers/switches designed for different NoC topologies, have been reported in the literature.

J. Kim et al. [6] introduced a low-latency two-stage router architecture based on wormhole switching, virtual-channel communication, and adaptive routing, applicable to 2-D mesh and torus on-chip networks. T. Bjerregaard et al. [7] introduced a NoC router architecture that supports virtual channels to provide connection-oriented as well as connectionless best-effort (BE) routing in a clockless network. S. Vangal et al. [8]-[9] presented the Intel 80-tile NoC architecture, arranged as an 8x10 2-D mesh network and operating at 4 GHz to enable a bisection bandwidth of 256 GB/s. Each tile consists of a processing element (PE) connected to a packet-switched 5-port router with mesochronous interfaces. To date, most NoC designs have focused on wormhole-switching and packet-switching approaches, i.e. packet-switched NoCs. Packet-switched NoCs suffer from latency problems, even with wormhole switching, when header flits are blocked. To deal with this situation, virtual channels (with or without priorities) and carefully designed adaptive routing are combined in packet-switching router architectures [6]-[9]. However, implementing full-blown adaptive routing introduces complexity because of large buffers, lookup tables, and complex shortest-path algorithms, and can even increase node delay. Buffers in packet-switching routers are usually implemented with memory elements, e.g. RAMs or latches. These memory components increase router cost considerably in terms of energy and area overhead. As a result, a packet-switching router typically leads to a more complex and less energy- and area-efficient design than a circuit-switching router.

In contrast to packet switching, the circuit-switching approach guarantees that data arrive in the same order in which they were sent, with predictable latency. Circuit switching requires minimal control overhead when setting up a new connection and a simple communication protocol, which helps in designing a low-complexity and energy-efficient router. Because a circuit-switching router requires no buffers, the number of wires in a data lane can easily be increased to improve bandwidth without a significant increase in area overhead. For these reasons, circuit-switched NoC designs are advocated by several researchers [10]-[14] for some SoC applications. Wiklund and Liu [10] proposed a circuit-switched 2-D mesh NoC for hard real-time systems with traffic scheduling, but it faced a high latency when setting up a new circuit due to blocking in the routers. They suggested a static timing schedule for new data streams to avoid being blocked by reserved links. Wolkotte et al. [11] proposed an energy-efficient reconfigurable 2-D mesh NoC platform for a common wireless multi-tile SoC architecture. Their work employed a circuit-switching router with data converters that split data into small lanes to avoid setup latency, and used one Central Coordination Node to perform centralized run-time mapping of applications. Chang et al. [12] used circuit switching as a trade-off solution for area- and energy-efficient application-specific NoC design with high-locality communication. Jerger et al. [13] showed that circuit-switched networks, compared with packet-switched networks, can significantly reduce communication latency between processor cores in Chip Multi-Processors (CMPs) once circuits are set up. They proposed a hybrid router design that supports both circuit and packet switching, with very fast circuit reconfiguration (setup/teardown), working under a prediction-based coherence protocol.

Recently, the consumer market has seen a tremendous increase in portable and mobile devices running streaming applications. Typical streaming DSP applications are found in wireless baseband processing (for HiperLAN/2, WiMax, DAB, DRM, DVB, UMTS), multimedia processing (encoding/decoding), and MPEG/TV portable devices [16].

The platform design of such portable embedded systems demands an efficient Network-on-Chip architecture for the following reasons. Most portable devices rely on batteries; therefore, compactness and energy efficiency are critical criteria for platform design. Moreover, the traffic in these SoCs has specific temporal and spatial characteristics. Pipelined data flows in streaming applications follow predictable temporal and spatial patterns through successive processing nodes. The size of the transferred blocks (or packets/messages) and the inter-arrival time are fixed or application dependent. Stream lifetimes are relatively long, and most of the throughput needs to be guaranteed. To design an efficient platform for portable and embedded devices, we propose a methodology that uses the circuit-switching approach, with the switch design and NoC architecture proposed by Pham et al. [14]-[15], to guarantee throughput in a torus interconnection topology. The work in [14] reported the design, implementation, and silicon-level evaluation (timing, power, and area) of a compact, high-performance switch. The work in [15] introduced the NoC architecture and some analysis results on blocking probability. In this paper, we discuss the backtracking torus NoC and its communication protocol in detail. Then, we introduce a framework for analyzing and evaluating NoC traffic performance under common and application-oriented synthetic traffic.

The rest of the paper is organized as follows. Section II discusses the interconnection architecture and the backtracked routing communication protocol. Section III describes an on-chip network simulator and communication pattern modeling, and then presents traffic-performance results and trade-offs under the applied traffic patterns. Finally, conclusions and further research are outlined in Section IV.

# II. INTERCONNECTION ARCHITECTURE AND COMMUNICATION PROTOCOL

#### A. Interconnection Architecture Description

Torus and mesh networks, or k-ary n-cubes, are popular topologies in traditional interconnection networks [18]. They arrange nodes in a regular n-dimensional grid with k nodes in each dimension and links between adjacent nodes. In two dimensions, their physical arrangement fits 2-D chip implementation with uniformly short wires, and they allow communication locality between communicating nodes to be exploited. A disadvantage of these topologies is a larger hop count than that of logarithmic networks. However, the larger hop count is beneficial for path diversity, which helps to discover more alternative paths when setting up a circuit between a source and a destination.

The torus topology has some attractive properties: it has good path diversity, can achieve good load balance, and has a lower diameter than a mesh network of the same degree. A physical layout variant of the 2-D torus, the folded torus, can also be easily arranged to fit a tile-based implementation on a conventional 2-D chip. In the folded torus, the maximum channel length is reduced to twice that of the mesh network (Figure 1). A single router/switch design can be applied throughout the whole torus network by simply changing the static address of each switching node. In this work, we choose the 2-D torus topology as the interconnection scheme for the NoC architecture with a backtracked routing algorithm.
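As a quick check of the diameter property, the following minimal sketch compares minimal hop counts on a k x k torus and a k x k mesh. The function names and the choice k = 4 (the network size used later in the experiments) are illustrative assumptions, not part of the original design.

```python
def ring_distance(a: int, b: int, k: int) -> int:
    """Shortest distance between positions a and b on a ring of k nodes."""
    d = abs(a - b) % k
    return min(d, k - d)


def torus_hops(src, dst, k):
    """Minimal hop count between (x, y) nodes on a k x k torus."""
    return ring_distance(src[0], dst[0], k) + ring_distance(src[1], dst[1], k)


def mesh_hops(src, dst):
    """Minimal hop count between (x, y) nodes on a mesh (no wrap-around links)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def diameter(k, hops):
    """Largest minimal hop count over all node pairs of a k x k network."""
    nodes = [(x, y) for x in range(k) for y in range(k)]
    return max(hops(s, d) for s in nodes for d in nodes)


k = 4  # assumed network size for illustration
torus_diameter = diameter(k, lambda s, d: torus_hops(s, d, k))
mesh_diameter = diameter(k, mesh_hops)
print(torus_diameter, mesh_diameter)  # prints: 4 6
```

For the 4 x 4 case the wrap-around links reduce the worst-case distance from 6 hops (mesh) to 4 hops (torus).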



Figure 1. Interconnection architecture.

The chosen interconnection topologies are shown in Figure 1. Based on this architecture, we customize and apply the switch designed by Pham et al. [14], with a 32-bit data width and 5 bidirectional ports. Each switch has 4 ports connected to its neighbors in the North, East, South, and West directions, and one local port that interfaces to a resource (or IP) through a wrapper. Two clocks are fed into the switch architecture. The data-pipelined clock is used in the transmission phase, when data is pipelined from a source to a destination, and the probing clock is used to drive backtracked routing in the NoC.

# B. Backtracked Routing Communication Protocol and Latency Analysis

For the circuit-switched NoC, we propose a lightweight end-to-end communication protocol comprising three phases: setup (or probing), data-pipelined transmission, and release.



Figure 2. Timing diagram of backtracked routing communication.

In the setup phase, a header flit (or probe header) containing the destination address is sent from the source IP to discover a path and set up a circuit. We apply a backtracked routing protocol that searches for alternative paths instead of waiting for a channel to become idle. Backtracking protocols are suitable for circuit switching, can be fault-tolerant, and improve network performance [19]. As it moves towards its destination, the probe header occupies channels. If the probe header cannot continue onward, it backtracks over the last occupied channel, releases it, and continues searching from the previous node. Deadlock is prevented in circuit switching because channels are reserved before transmission starts, and during the probing phase the probe header backtracks when its desired channel is not available rather than blocking other connections. Livelock is not a problem either, because backtracking protocols can use history information to avoid probing the same path repeatedly.

Each switch performs its probing activity within one clock cycle, synchronized by the global probing clock signal. The probe header uses the same physical channels as the data to save wiring cost, and it is routed in a distributed fashion without the need for a central controller or a supplemental network. In this paper, we apply exhaustive profitable backtracking (EPB) [19] in the switches, which searches only profitable links (those on shortest paths) for energy-efficient data transfer.

When the probe header reaches the destination, an ACK signal is sent back to the source. Upon receiving the ACK signal, the source IP starts transmitting pipelined data to the destination through the direct end-to-end connection in the transmission phase. Several pipelined transmission techniques that enhance link bandwidth and communication reliability are discussed in depth in [20]. In the release phase, the circuit is freed hop by hop from the source to the destination.

The timing diagram of the communication protocol is shown in Figure 2.
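The setup phase described above can be sketched as a depth-first search restricted to profitable links. The following is a simplified, centralized model under stated assumptions, not the distributed switch logic of [14]: channels are modeled as directed `(node, direction)` pairs, `reserved` is a hypothetical set of channels held by existing circuits, and the one-cycle-per-hop probing timing is ignored.

```python
DIRS = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def profitable_dirs(cur, dst, k):
    """Directions that reduce the torus distance from cur to dst."""
    dirs = []
    for axis, (pos, neg) in enumerate((("E", "W"), ("N", "S"))):
        offset = (dst[axis] - cur[axis]) % k
        if offset == 0:
            continue
        if offset <= k - offset:   # going in the positive direction is minimal
            dirs.append(pos)
        if offset >= k - offset:   # going in the negative direction is minimal
            dirs.append(neg)
    return dirs


def step(node, d, k):
    """Neighbor of node in direction d on a k x k torus."""
    dx, dy = DIRS[d]
    return ((node[0] + dx) % k, (node[1] + dy) % k)


def probe(src, dst, k, reserved):
    """Return the list of channels forming a circuit, or None if blocked.

    `reserved` is updated in place: channels of a successful circuit stay
    reserved, channels released while backtracking are removed again.
    """
    path = []         # channels currently occupied by this probe
    tried = [set()]   # history: directions already tried at each depth
    cur = src
    while cur != dst:
        options = [d for d in profitable_dirs(cur, dst, k)
                   if d not in tried[-1] and (cur, d) not in reserved]
        if options:
            d = options[0]
            tried[-1].add(d)
            reserved.add((cur, d))       # occupy the channel
            path.append((cur, d))
            cur = step(cur, d, k)
            tried.append(set())
        elif path:
            tried.pop()                  # backtrack over the last channel
            prev, d = path.pop()
            reserved.discard((prev, d))  # and release it
            cur = prev
        else:
            return None                  # back at the source: network blocked
    return path


reserved = {((1, 0), "N")}  # hypothetical channel held by another circuit
path = probe((0, 0), (1, 1), 4, reserved)
print(path)  # [((0, 0), 'N'), ((0, 1), 'E')]
```

In this example the probe first moves east, finds the north channel at node (1, 0) occupied, backtracks and releases the east channel, and then completes the circuit through node (0, 1). Because every profitable link strictly reduces the distance to the destination, the search always terminates.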

The inter-switch handshake signals that implement the end-to-end communication protocol are defined as follows:

| Signal  | Code | Meaning                   |
|---------|------|---------------------------|
| Request | 1    | Circuit Setup             |
| Request | 0    | Idle                      |
| Answer  | 00   | Idle                      |
| Answer  | 01   | Circuit Setup Acknowledge |
| Answer  | 10   | Network Blocked           |
| Answer  | 11   | Busy Destination          |
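
For use in a software model, the signal codes listed above could be captured as enumerations. This is only an illustrative sketch; the class and member names are assumptions and do not come from the switch design.

```python
from enum import IntEnum


class Request(IntEnum):
    """1-bit request signal."""
    IDLE = 0b0
    CIRCUIT_SETUP = 0b1


class Answer(IntEnum):
    """2-bit answer signal."""
    IDLE = 0b00
    SETUP_ACK = 0b01
    NETWORK_BLOCKED = 0b10
    BUSY_DESTINATION = 0b11


def decode_answer(bits: int) -> Answer:
    """Decode the two least significant bits into an Answer code."""
    return Answer(bits & 0b11)


print(decode_answer(0b10).name)  # NETWORK_BLOCKED
```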

As discussed in Part A, two clocks are applied to the switch architecture: the data-pipelined clock and the probing clock. The data-pipelined clock is used to send data through the direct connection, and it determines the link bandwidth and the fall-through latency through the switches. The probing clock, used for backtracked routing, determines the circuit setup latency. The communication latency (or network delay) is expressed by the following equation:

$$T_{network} = T_{setup} + h \cdot T_{switch} + \frac{L}{b}$$

where,

T<sub>network</sub> : Network delay

T<sub>setup</sub> : Setup delay

h: Number of switches in the shortest path between source IP and destination IP

T<sub>switch</sub> : Fall-through latency through switch

L : Length of packet

b : Link bandwidth.
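
To make the terms concrete, the sketch below evaluates the network delay equation for one set of values. The 32 Gbps link bandwidth and the 10 ns probing cycle follow the experimental setup in Section III; the setup time of 3 probing cycles, the 2-switch path, and the 1 ns fall-through latency are purely hypothetical values chosen for illustration.

```python
def network_delay_ns(t_setup_ns, hops, t_switch_ns, packet_bits, bandwidth_gbps):
    """T_network = T_setup + h * T_switch + L / b, in nanoseconds."""
    serialization_ns = packet_bits / bandwidth_gbps  # bits / (Gbit/s) = ns
    return t_setup_ns + hops * t_switch_ns + serialization_ns


delay = network_delay_ns(
    t_setup_ns=30.0,       # assumed: 3 probing cycles of 10 ns
    hops=2,                # assumed: 2 switches on the shortest path
    t_switch_ns=1.0,       # assumed fall-through latency per switch
    packet_bits=128 * 8,   # a 128-byte packet
    bandwidth_gbps=32.0,   # 32-bit link at 1 GHz
)
print(delay)  # 64.0
```

With these numbers the serialization term L/b (32 ns) and the setup term (30 ns) are of similar size, which illustrates why the setup latency matters most for short packets.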

## III. ON-CHIP TRAFFIC-PERFORMANCE EVALUATION

This section presents the OMNeT++-based NoC simulator and the communication pattern configuration used to simulate network behavior.

# A. Network-on-Chip Simulator and Traffic Pattern Configuration

OMNeT++ is a public-source, generic, and flexible C++-based simulation environment with strong GUI support that allows fast, easily debugged, high-level simulation [21]. The OMNeT++-based NoC simulator is outlined in Figure 3. The on-chip communication network comprises switch and link models. Each IP is modeled by a traffic generator (TG) that sends packets and a sink that receives packets from the network.



Figure 3. Configuration of NoC simulation framework.

Performance evaluation of interconnection networks usually employs application-driven or synthetic traffic [18]. The application-driven approach models the network and the application simultaneously, based on full-system simulation and communication traces. However, application-driven traffic is costly to develop and control. In this work, we use two kinds of synthetic traffic for evaluation: purely synthetic traffic and application-oriented traffic. Packets generated by the TGs are characterized by three configurable parameters: spatial distribution, inter-arrival time distribution, and packet size.

**Spatial distribution**:

- *Uniform*: destinations are uniformly distributed. This is usually used for preliminary evaluation [18].
- *Locality*: destinations are uniformly distributed in a local region. We define the local region as a set of nodes placed at shortest Manhattan distance (i.e. 4-nearest-neighbor in this case of 2-D torus NoC), and a localization factor as the ratio of local traffic to total traffic [22]. Two values of localization factor are assumed, 0.8 and 1 (nearest-neighbors traffic).
- *Transpose*: All traffic from each source is directed to one destination through a transpose mapping function [18]. This traffic pattern is exhibited by applications that perform matrix-transpose or corner-turn operations.
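
The three spatial distributions above can be sketched as destination-selection functions for a k x k torus. This is an illustrative model only: the function names are hypothetical, the non-local share of locality traffic is assumed to be uniform over the remaining nodes, and the transpose mapping is assumed to swap the x and y coordinates (so diagonal nodes map to themselves; how the simulator treats them is not stated in the paper).

```python
import random


def uniform_dest(src, k, rng):
    """Uniformly pick any node other than the source."""
    nodes = [(x, y) for x in range(k) for y in range(k) if (x, y) != src]
    return rng.choice(nodes)


def neighbors(src, k):
    """The 4 nearest neighbors of src on a k x k torus."""
    x, y = src
    return [((x + 1) % k, y), ((x - 1) % k, y),
            (x, (y + 1) % k), (x, (y - 1) % k)]


def locality_dest(src, k, factor, rng):
    """With probability `factor` pick a nearest neighbor, else a remote node."""
    local = neighbors(src, k)
    if rng.random() < factor:
        return rng.choice(local)
    remote = [(x, y) for x in range(k) for y in range(k)
              if (x, y) != src and (x, y) not in local]
    return rng.choice(remote)


def transpose_dest(src):
    """Assumed transpose mapping: swap the two coordinates."""
    x, y = src
    return (y, x)


rng = random.Random(0)  # fixed seed for reproducibility
print(transpose_dest((1, 3)))  # (3, 1)
print(locality_dest((0, 0), 4, 1.0, rng) in neighbors((0, 0), 4))  # True
```

With a localization factor of 1, every destination is one of the four nearest neighbors, which corresponds to the nearest-neighbor traffic case.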

**Inter-arrival distribution**: We assume *Poisson* distribution and *fixed-rate* distribution [11], [17]. The fixed-rate inter-arrival distribution is found in some real-time or hard real-time systems with block-transfer.

**Packet size**: We assume three cases of packet length, *short*, *medium*, and *long* packets to investigate the impact of packet size on network behaviors.
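
The inter-arrival and packet-size parameters can be combined in a small traffic-generator sketch. It rests on assumptions not stated in the paper: an offered load of ρ is taken to mean that a source injects ρ times the link bandwidth on average, and Poisson arrivals are generated by drawing exponentially distributed inter-arrival times. The packet lengths and the 32 Gbps link bandwidth are those given in the experimental setup.

```python
import random

PACKET_BYTES = {"short": 128, "medium": 512, "long": 2048}


def mean_interarrival_ns(packet_bytes, load, bandwidth_gbps=32.0):
    """Mean time between packets so that a source offers `load` of capacity."""
    return packet_bytes * 8 / (load * bandwidth_gbps)


def interarrival_ns(packet_bytes, load, kind, rng):
    """Draw one inter-arrival time for 'fixed' or 'poisson' traffic."""
    mean = mean_interarrival_ns(packet_bytes, load)
    if kind == "fixed":
        return mean                         # periodic, fixed-rate injection
    if kind == "poisson":
        return rng.expovariate(1.0 / mean)  # exponential gaps, Poisson arrivals
    raise ValueError(f"unknown inter-arrival kind: {kind}")


rng = random.Random(0)
print(interarrival_ns(PACKET_BYTES["short"], 0.5, "fixed", rng))  # 64.0
```

For example, a short packet at an offered load of 0.5 occupies the link for 32 ns, so under this assumption one packet is injected every 64 ns on average.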

## B. Experimental Setup and Performance Results

Based on timing information imported from the switch design in [14], implemented in a typical 0.13 $\mu$m CMOS technology, we set the probing frequency to 100 MHz (i.e. a probe clock cycle of 10 ns) and the data-pipelined clock to 1 GHz (i.e. a 32-bit link bandwidth of 32 Gbps). We choose a 4x4 torus as the NoC topology. We define three packet lengths: short (128 bytes), medium (512 bytes), and long (2048 bytes). From the simulator, we measure the traditional performance metrics of interconnection networks, average network delay and throughput. Delay values are given in probing clock cycles for readability. In simulation, each IP sends 10000 packets (or blocks) into the network, and the first 1000 packets are discarded as a warm-up phase. The performance results are presented in Figures 4-7.
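
The setup parameters above imply the following simple arithmetic, shown here as a hedged sketch. The constant and function names are illustrative, and `average_delay` is only one plausible way of discarding warm-up packets; the simulator's actual measurement code is not described in the paper.

```python
PROBE_CLOCK_HZ = 100e6   # probing clock
DATA_CLOCK_HZ = 1e9      # data-pipelined clock
LINK_WIDTH_BITS = 32

probe_cycle_ns = 1e9 / PROBE_CLOCK_HZ                        # 10 ns
link_bandwidth_gbps = LINK_WIDTH_BITS * DATA_CLOCK_HZ / 1e9  # 32 Gbps


def transfer_probe_cycles(packet_bytes):
    """Pure transmission time L / b of a packet, in probing clock cycles."""
    transfer_ns = packet_bytes * 8 / link_bandwidth_gbps
    return transfer_ns / probe_cycle_ns


def average_delay(delays, warmup=1000):
    """Mean delay after discarding the first `warmup` packets."""
    measured = delays[warmup:]
    return sum(measured) / len(measured)


for name, size in (("short", 128), ("medium", 512), ("long", 2048)):
    # roughly 3.2, 12.8 and 51.2 probing cycles respectively
    print(name, transfer_probe_cycles(size))
```

These values give a sense of scale for the delay plots: transmitting a short packet takes only a few probing cycles, so the setup phase accounts for a large share of its delay, whereas a long packet spends about 51 probing cycles in transmission alone.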



Figure 4. Average network delay versus offered load, Poisson inter-arrival.





Figure 5. Average network delay versus offered load, fixed-rate inter-arrival.











Figure panel (caption not recovered). Panel title: Packet Size 2048 Bytes; x-axis: Offered Load (Fraction of capacity).







First, we examine network performance when short packets are injected. Under uniform and locality traffic (with both fixed-rate and Poisson inter-arrival distributions), the average network delay grows sharply at an offered load of 0.6-0.8, depending on the localization factor. The average network delay of transpose traffic grows without bound at a lower offered load (0.2). The saturation throughput of transpose traffic is higher than that of uniform and locality traffic. For locality traffic, the saturation throughput increases with the localization factor. This suggests that the more locally an application is mapped, the greater the gain in delay and throughput the NoC can provide. Uniform traffic yields the lowest saturation throughput compared with transpose and locality traffic.

Second, we analyze network performance with medium and long packets. As the packet size increases, the offered load at which the average network delay grows sharply also increases. The average network delay curve flattens, indicating a predictable average communication delay over a given range of offered loads. Transpose traffic outperforms uniform and locality traffic in terms of throughput at higher offered loads. In particular, for long packets, the throughput of transpose traffic increases linearly with the offered load (Figures 6-7, packet size 2048 bytes).

### IV. CONCLUSION

The tremendous growth of portable and mobile devices running streaming applications poses challenges in the design of complex SoC platforms under very stringent constraints on cost, size, power consumption, and time-to-market. Such device platforms require an efficient Network-on-Chip to support dynamic on-chip communication scenarios. Characterizing on-chip traffic and evaluating traffic performance are necessary steps toward a comprehensive and effective NoC platform design. This paper has presented an analysis and performance evaluation framework for a backtracked routing Network-on-Chip that provides guaranteed and energy-efficient data transfer. The experimental results, under common and application-oriented synthetic traffic, characterized the performance in terms of latency and throughput and suggested a trade-off for mapping applications onto the proposed NoC platform. Our further research will consider NoC performance based on full-system simulation and communication traces of application-specific traffic.

## ACKNOWLEDGMENT

This work was co-supported by the Korea Research Foundation (KRF), and by Vietnam National University, Hanoi through research projects QC.07.18 and QC.08.18.

#### REFERENCES

- [1] Luca Benini and Giovanni De Micheli. "Networks on Chips: A New SoC Paradigm", IEEE Computer, January 2002, pp. 70-78.
- [2] W. J. Dally and B. Towles. Route packets, not wires: on-chip interconnection networks. In Proc. of DAC, June 2001. pp. 684-689.
- [3] Jari Nurmi, "Network-on-Chip : A New Paradigm for System on Chip Design", Int. Symp. on System on chip, Nov. 2005, pp. 2-6.
- [4] Giovanni De Micheli and Luca Benini, Networks on Chips: Technology and Tools, Morgan Kaufmann Publishers, Elsevier Inc. (USA), 2006
- [5] Xuan-Tu Tran, Yvain Thonnart, Jean Durupt, Vincent Beroulle, Chantal Robach: A Design-for-Test Implementation of an Asynchronous Network-on-Chip Architecture and its Associated Test Pattern Generation and Application, in Proc. of NOCS 2008, pp. 149-158.
- [6] J. Kim, D. Park, T. Theocharides, N. Vijaykrishnan, and Chita R. Das, "A low latency router supporting adaptivity for on-chip interconnects ", in Proc. of the DAC 2005. ACM Press, pp. 559-564
- [7] Tobias Bjerregaard, and Jens Sparso, "A Router Architecture for Connection-Oriented Service Guarantees in the MANGO Clockless Network-on-Chip", in Proc. of DATE, 2005, pp. 1226-1231
- [8] S. Vangal et al., "An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS", in Proc. of IEEE ISSCC 2007, pp. 98-99.
- [9] S. Vangal et al., "A 5.1GHz 0.34mm2 Router for Network-on-Chip Application", in Proc. of IEEE Symposium on VLSI Circuits (SOVC), 2007, pp. 42-43.
- [10] D. Wiklund and D. Liu, "SoCBUS: Switched Network on Chip for Hard Real Time Embedded Systems", in Proc. of IEEE IPDPS, 2003, pp. 78a.
- [11] P. T. Wolkotte, Gerard J. M. Smit, Gerard K. Rauwerda, and Lodewijk T. Smit, "An Energy-Efficient Reconfigurable Circuit-Switched Network-on-Chip", in Proc. of IEEE IPDPS, 2005, pp 155a
- [12] K.-C. Chang, J.-S. Shen, T.-F. Chen, "Evaluation and Design Trade-Offs Between Circuit-Switched and Packet-Switched NOCs for Application-Specific SOCs", in Proc. of DAC, 2006, pp. 143-148.
- [13] Natalie Enright Jerger, Mikko Lipasti, and Li-Shiuan Peh, "Circuit-Switched Coherence", IEEE Computer Architecture Letters, Vol. 6, 2007.
- [14] Phi-Hung Pham, Yogendera Kumar and Chulwoo Kim, "A Compact and High-Performance Switch for Circuit-Switched Network-on-Chip", IEEE International System on Chip Conference (SOCC), Texas, USA, Sep. 2006, pp. 53-56
- [15] Phi-Hung Pham, Yogendera Kumar and Chulwoo Kim, "High Performance and Area-Efficient Circuit-Switched Network on Chip Design", 6th IEEE International Conference on Computer and Information Technology (CIT), Sep 2006, pp. 243-243
- [16] Gerard J. M. Smit, Andre B. J. Kokkeler, Pascal T. Wolkotte, Philip K. F. Holzenspies, Marcel D. van de Burgwal, and Paul M. Heysters, "The Chameleon Architecture for Streaming DSP Applications", EURASIP Journal on Embedded Systems, Volume 2007, Article ID 78082.
- [17] L. Tedesco et al., "Application Driven Traffic Modeling for NoCs", in Proc. of SBCCI, 2006, pp. 62-67.
- [18] W. J. Dally and B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann Publishers, Elsevier Science (USA), 2004.
- [19] J. Duato, S. Yalamanchili and L. Ni, Interconnection Networks: An Engineering Approach, Morgan Kaufmann Publishers, Elsevier Science (USA), 2003.
- [20] Jan M. Rabaey, Anantha Chandrakasan, and Borivoje Nikolic, Digital Integrated Circuits: A Design Perspective, Prentice Hall (USA), 2003, Chapter 10, Timing Issues in Digital Circuits.
- [21] OMNeT++ simulation environment, http://www.omnetpp.org/
- [22] P.P. Pande, C. Grecu, M. Jones, A. Ivanov, and R. Saleh "Performance Evaluation and Design Trade-Offs for Network-on-Chip Interconnect Architectures". IEEE Transactions on Computers, Vol. 54(8), 2005, pp. 1025-1040.