Article

A Tensor Processing Framework for CPU-Manycore Heterogeneous Systems

Lin Cheng (School of Electrical and Computer Engineering, Cornell University, Ithaca, NY, USA)
Peitian Pan (School of Electrical and Computer Engineering, Cornell University, Ithaca, NY, USA)
Zhongyuan Zhao (School of Electrical and Computer Engineering, Cornell University, Ithaca, NY, USA)
Krithik Ranjan (School of Electrical and Computer Engineering, Cornell University, Ithaca, NY, USA)
Jack Weber (School of Electrical and Computer Engineering, Cornell University, Ithaca, NY, USA)
Bandhav Veluri (Paul Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA)
Seyed Borna Ehsani (Paul Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA)
Max Ruttenberg (Paul Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA)
Dai Cheol Jung (Department of Electrical and Computer Engineering, University of Washington, Seattle, WA, USA)
Preslav Ivanov (School of Electrical and Computer Engineering, Cornell University, Ithaca, NY, USA)
Dustin Richmond (Paul Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA)
Michael Taylor (Paul Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA)
Zhiru Zhang (School of Electrical and Computer Engineering, Cornell University, Ithaca, NY, USA)
Christopher Batten (School of Electrical and Computer Engineering, Cornell University, Ithaca, NY, USA)

2021, en

ABI

Abstract

Future CPU-manycore heterogeneous systems can provide high peak throughput by integrating thousands of simple, independent, energy-efficient cores in a single die. However, there are two key challenges to translating this high peak throughput into improved end-to-end workload performance: 1) manycore co-processors rely on simple hardware putting significant demands on the software programmer and 2) manycore co-processors use in-order cores that struggle to tolerate long memory latencies. To address the manycore programmability challenge, this article presents a dense and sparse tensor processing framework based on PyTorch that enables domain experts to easily accelerate off-the-shelf workloads on CPU-manycore heterogeneous systems. To address the manycore memory latency challenge, we use our extended PyTorch framework to explore the potential for decoupled access/execute (DAE) software and hardware mechanisms. More specifically, we propose two software-only techniques, naïve-software DAE and systolic-software DAE, along with a lightweight hardware access accelerator to further improve area-normalized throughput. We evaluate our techniques using a combination of PyTorch operator microbenchmarking and real-world PyTorch workloads running on a detailed register-transfer-level model of a 128-core manycore architecture. Our evaluation on three real-world dense and sparse tensor workloads suggests these workloads can achieve approximately 2–6× performance improvement when scaled to a future 2000-core CPU-manycore heterogeneous system compared to an 18-core out-of-order CPU baseline, while potentially achieving higher area-normalized throughput and improved energy efficiency compared to general-purpose graphics processing units.

Identifiers

DOI: 10.1109/tcad.2021.3103825

Citations and references

Citations: 2
References used: 0
