Uber: Utilizing Buffers to Simplify NoCs for Hundreds-Cores

Published 26 Jul 2016 in cs.AR | (1607.07766v2)

Abstract: Approaching ideal wire latency using a network-on-chip (NoC) is an important practical problem for many-core systems, particularly hundreds-cores. While other researchers have focused on optimizing large meshes, bypassing or speculating router pipelines, or creating more intricate logarithmic topologies, this paper proposes a balanced combination that trades queue buffers for simplicity. Preliminary analysis of nine benchmarks from PARSEC and SPLASH using execution-driven simulation shows that utilization rises from 2%, when a single core is connected per mesh port, to at least 50%, because the slack available for delay in concentrator and router queues is around 6x higher than the ideal latency of just 20 cycles. That is, a 16-port mesh suffices because queueing is the uncommon case for system performance. In this way, the mesh hop count is bounded to three, load becomes uniform via extended concentration, and ideal latency is approached using conventional four-stage pipelines for the mesh routers together with minor logarithmic edges. A realistic Uber is also detailed; it matches the performance of a 64-port mesh that employs optimized router pipelines and improves on the baseline by 12%. Ongoing work develops techniques to better balance load by tuning the placement of cache blocks, and compares Uber with bufferless routing.
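
To make the hop-count argument concrete, the following minimal sketch compares the 16-port and 64-port meshes named in the abstract. It assumes square 2D meshes (4x4 and 8x8), minimal Manhattan routing, and uniform traffic between distinct routers; none of these details are given in the abstract. The function name `mesh_hop_stats` is hypothetical, and the sketch does not model the "minor logarithmic edges", which the abstract mentions but does not specify.

```python
from itertools import product
from math import isqrt


def mesh_hop_stats(k):
    """Return (average, maximum) hop count between distinct routers
    in a k x k mesh, assuming minimal (Manhattan-distance) routing.

    Assumption: plain square mesh with uniform traffic; no extra
    (logarithmic) edges are modeled.
    """
    nodes = list(product(range(k), repeat=2))
    hops = [
        abs(ax - bx) + abs(ay - by)
        for (ax, ay) in nodes
        for (bx, by) in nodes
        if (ax, ay) != (bx, by)
    ]
    return sum(hops) / len(hops), max(hops)


# Port counts taken from the abstract; the square layout is an assumption.
for ports in (16, 64):
    k = isqrt(ports)
    avg, worst = mesh_hop_stats(k)
    print(f"{ports}-port mesh ({k}x{k}): "
          f"average {avg:.2f} hops, worst case {worst} hops")
```

Under these assumptions the smaller mesh roughly halves the average distance (about 2.67 hops versus about 5.33). The worst case in a plain 4x4 mesh is still 6 hops, so the bound of three stated in the abstract presumably depends on the additional logarithmic edges or on a different way of counting hops.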