In addition to network-level design choices like topology, routing function and flow control, router microarchitecture is a major factor in determining NoC performance. Allocators are a particularly important aspect of router design, as they affect overall network performance in two ways. First, allocation quality, measured by the cardinality of the matching between requests and available resources, determines the utilization of the router's crossbar and of the network channels, and thus has a direct impact on the network's performance under load. Second, in typical packet-switched router design points, allocation tends to be on the critical path; consequently, delay-optimized allocator implementations are required to enable high-frequency operation.
Individual allocator implementations represent different tradeoffs between allocation quality and delay. Compared to traditional long-haul and system-level networks, performance in NoCs is typically much more sensitive to packet latency; this mandates the use of relatively shallow router pipelines and single-cycle allocation schemes. At the same time, NoCs are usually subject to tight cycle time, area and power constraints. Consequently, designers must select allocator implementations that maximize allocation quality subject to these constraints.
In order to reduce both the delay and the area of the VC allocator, we have developed sparse VC allocation, an approach that significantly reduces the VC allocator's logic complexity by exploiting the fact that the VC on which a packet enters a router restricts the set of VCs on which it can leave, as shown in Figure 1. Detailed RTL-level evaluations suggest that this can reduce the VC allocator's delay by up to 31% and yield power and area savings of up to 72% and 79%, respectively.
Figure 1: Sparse VC Allocation
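To give a feel for why restricting the legal VC transitions shrinks the allocator, the following minimal sketch counts the (input VC, output VC) pairs that the allocation logic has to consider. The function name `candidate_pairs` and the even split of VCs into `num_classes` classes within which a packet must stay are assumptions made purely for illustration; the text above only states that the input VC limits the range of output VCs.

```python
def candidate_pairs(ports, vcs, num_classes=1):
    """Count (input VC, output VC) pairs a VC allocator must consider.

    Assumption (hypothetical): the VCs at each port are split evenly into
    `num_classes` classes, and a packet may only be assigned an output VC
    of the class it arrived on. num_classes=1 models a dense allocator
    in which every input VC may request every output VC.
    """
    if vcs % num_classes != 0:
        raise ValueError("vcs must be divisible by num_classes")
    vcs_per_class = vcs // num_classes
    input_vcs = ports * vcs
    # Each input VC can reach the VCs of its own class at every output port.
    reachable_output_vcs = ports * vcs_per_class
    return input_vcs * reachable_output_vcs


dense = candidate_pairs(ports=5, vcs=8)
sparse = candidate_pairs(ports=5, vcs=8, num_classes=4)
print(dense, sparse, sparse / dense)
```

The pair count is only a rough proxy for logic complexity. The delay, power and area figures quoted above come from RTL-level evaluation, not from a count like this one.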
We have also developed an improvement to the speculative switch allocation mechanism introduced in prior work that shortens the switch allocator's critical path by up to 36% without compromising zero-load latency; this is achieved by taking a slightly pessimistic approach to speculation, which takes advantage of the fact that speculation yields the most benefit when the network is only lightly loaded.
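One plausible reading of the pessimistic approach can be sketched as follows. It is an assumption here, not something the text above spells out, that a conventional scheme discards speculative grants that conflict with non-speculative grants, while the pessimistic variant discards those that conflict with non-speculative requests, so that the mask does not have to wait for the non-speculative allocation result. The function `mask_speculative` and the set-based encoding are hypothetical.

```python
def mask_speculative(spec_grants, nonspec_requests, nonspec_grants, pessimistic):
    """Drop speculative grants that conflict with non-speculative traffic.

    All three collections are sets of (input_port, output_port) pairs.
    Assumed behavior: the conventional scheme masks against non-speculative
    grants; the pessimistic scheme masks against non-speculative requests,
    which are available before allocation has finished.
    """
    blockers = nonspec_requests if pessimistic else nonspec_grants
    busy_inputs = {i for i, _ in blockers}
    busy_outputs = {o for _, o in blockers}
    return {(i, o) for i, o in spec_grants
            if i not in busy_inputs and o not in busy_outputs}


# Inputs 0 and 2 both request output 1 non-speculatively; input 0 wins.
nonspec_requests = {(0, 1), (2, 1)}
nonspec_grants = {(0, 1)}
spec_grants = {(2, 3), (1, 0)}

conventional = mask_speculative(spec_grants, nonspec_requests, nonspec_grants, False)
pessimistic = mask_speculative(spec_grants, nonspec_requests, nonspec_grants, True)
print(sorted(conventional), sorted(pessimistic))

# With no non-speculative traffic, both schemes keep every speculative grant.
idle = mask_speculative(spec_grants, set(), set(), True)
```

In this example the pessimistic scheme gives up the speculative grant for input 2, which lost its non-speculative request. When there are no non-speculative requests, as is common at light load, both variants behave identically, which is consistent with the observation that speculation helps most when the network is lightly loaded.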
Prior works in the NoC domain have predominantly implemented separable input-first allocators. Our results indicate that at least for synthesis-based implementations, wavefront allocators actually offer both lower delay and better matching quality for low-radix networks with limited numbers of VCs, including commonly encountered mesh configurations, when taking advantage of the two previously mentioned optimizations. Likewise, we find that matrix arbiters, commonly used as a basic building block in prior works, offer at most a minor reduction in delay compared to round-robin arbiters, but have significantly higher area and power overhead.
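The difference in matching quality between the two allocator types can be illustrated with a small behavioral model. This is a sketch, not the RTL discussed above: the function names, the set-of-pairs request encoding, and the fixed priority pointers are illustrative assumptions, and a hardware wavefront allocator evaluates the diagonals combinationally rather than in a loop.

```python
def round_robin(candidates, pointer, n):
    """Return the first candidate at or after `pointer`, wrapping around."""
    for k in range(n):
        c = (pointer + k) % n
        if c in candidates:
            return c
    return None


def separable_input_first(requests, n, in_ptr=None, out_ptr=None):
    """Each input picks one requested output, then each output picks one input."""
    in_ptr = in_ptr or [0] * n
    out_ptr = out_ptr or [0] * n
    chosen = {}
    for i in range(n):
        outs = {o for (ii, o) in requests if ii == i}
        o = round_robin(outs, in_ptr[i], n)
        if o is not None:
            chosen[i] = o
    grants = set()
    for o in range(n):
        ins = {i for i, oo in chosen.items() if oo == o}
        i = round_robin(ins, out_ptr[o], n)
        if i is not None:
            grants.add((i, o))
    return grants


def wavefront(requests, n, start_diag=0):
    """Sweep the request matrix diagonal by diagonal, starting at start_diag.

    A request is granted if its row (input) and column (output) are still
    free. Cells on one diagonal never share a row or a column.
    """
    row_free = [True] * n
    col_free = [True] * n
    grants = set()
    for d in range(n):
        diag = (start_diag + d) % n
        for i in range(n):
            o = (diag - i) % n  # cells with (i + o) % n == diag
            if (i, o) in requests and row_free[i] and col_free[o]:
                grants.add((i, o))
                row_free[i] = False
                col_free[o] = False
    return grants


# (input, output) request pairs for a 3x3 switch.
requests = {(0, 0), (0, 1), (1, 0), (2, 0), (2, 2)}
print(len(separable_input_first(requests, 3)), len(wavefront(requests, 3)))
```

With these requests and all pointers at zero, every input arbiter in the separable allocator independently picks output 0, so only one grant survives the output stage, whereas the wavefront sweep also grants input 2 its request for output 2. The wavefront result is a maximal matching, though not necessarily a maximum one.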
In order to quantify how allocator design choices affect overall network-level performance, we have conducted detailed simulations for two exemplary 64-node network topologies across a wide variety of different network parameters, traffic patterns and load conditions.
Our results indicate that despite significant differences in matching quality between different VC allocator implementations, overall network performance is largely insensitive to the quality of VC allocation: both zero-load latency and saturation throughput are essentially identical for all configurations. This suggests that architects can largely disregard matching quality when selecting a VC allocator implementation, and instead select the one that best meets delay and area constraints.
On the other hand, as shown in Figure 2, we find that the sensitivity of network performance to the quality of switch allocation increases with the router radix and the number of VCs: while a wavefront allocator offers only a marginally better saturation rate in a mesh with 2 VCs, it achieves up to 25% more throughput than the separable allocators in a flattened butterfly network with 8 VCs.
Figure 2: Packet Latencies for Different Switch Allocators
From an architect's point of view, switch allocators that improve matching efficiency at the cost of increased delay are particularly suitable for primarily throughput-oriented networks, where large quantities of data are transferred concurrently using DMA-like semantics. Examples of such networks include I/O interfaces connecting different functional blocks on a system-on-chip, or the data supply networks for highly parallel graphics accelerators. For primarily latency-sensitive applications like cache coherence traffic, by contrast, network load is expected to be relatively low during normal operation; such applications favor separable allocators due to their comparatively smaller delay. Speculation is likewise most useful in latency-sensitive applications, which directly benefit from the resulting decrease in zero-load latency. For throughput-oriented networks, the slight increase in saturation rate afforded by speculation is unlikely to justify the associated increases in delay and complexity. Furthermore, speculation is less attractive for topologies with low network diameter, where pipeline delay represents only a small fraction of the overall packet latency.
Input Buffer Organization and Management
Input buffers are major contributors to router power and area. Efficient router design requires that the number and depth of the VCs into which the available buffer space is divided are chosen appropriately. Commonly used implementations that statically assign a fixed portion of the available buffer space to each VC are prone to utilization imbalance, leading to inefficient use of expensive buffer resources. We investigate dynamic buffer management schemes that avoid such imbalances. This leads to improved buffer utilization and allows us to significantly reduce the buffer size required to achieve a given level of performance. However, buffer sharing also introduces additional performance coupling between VCs, which can severely degrade the network's QoS and workload isolation properties. We are developing novel approaches for avoiding these adverse side effects without sacrificing the beneficial properties of dynamic buffer management.
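Both effects described above, better utilization and coupling between VCs, show up even in a toy model. The sketch below is illustrative only and is not one of the schemes under investigation: `accepted_flits` is a hypothetical helper, the shared case is a plain first-come-first-served pool, and no flits ever leave the buffer.

```python
def accepted_flits(arrivals, num_vcs, total_slots, shared):
    """Return the per-VC buffer occupancy after a burst of arrivals.

    arrivals: sequence of VC indices, one entry per arriving flit.
    shared=False: each VC owns total_slots // num_vcs slots (static split).
    shared=True: all VCs draw from a single pool (a simple dynamic scheme).
    Flits that find no free slot are dropped from the count.
    """
    occupancy = [0] * num_vcs
    per_vc_limit = total_slots // num_vcs
    for vc in arrivals:
        if shared:
            has_room = sum(occupancy) < total_slots
        else:
            has_room = occupancy[vc] < per_vc_limit
        if has_room:
            occupancy[vc] += 1
    return occupancy


# Imbalanced load: eight flits on VC 0, followed by two flits on VC 1.
arrivals = [0] * 8 + [1] * 2
static = accepted_flits(arrivals, num_vcs=4, total_slots=8, shared=False)
dynamic = accepted_flits(arrivals, num_vcs=4, total_slots=8, shared=True)
print(static, dynamic)
```

Under the static split only half of the eight slots are used, because VC 0 cannot borrow the idle space of VCs 2 and 3. The shared pool fills completely, but VC 0 then occupies every slot and the later flits on VC 1 find no room, which is the kind of coupling between VCs that can undermine QoS and workload isolation.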
Open Source Network-on-Chip Router RTL
In order to facilitate detailed evaluations of the delay, power and area tradeoffs associated with different microarchitectural design choices, we have developed a parameterized RTL implementation of a state-of-the-art VC router. Further details regarding the router RTL can be found here.
For further information, please contact Daniel Becker.