Hardware-Software Co-Design for Efficient Deep Learning Systems II
Date: 30th Dec 2021
Time: 11:30 AM
Venue: Google Meet
Details
Deep Neural Networks (DNNs) have defined new frontiers in many domains and enabled billions of users to benefit from them. However, sustaining this rapid growth is challenging because the compute demands of DNNs are increasing many-fold faster than Moore's law, so it is critical to address this compute challenge at scale. In the previous seminar, we introduced three principal approaches, viz. (i) improve the resource efficiency of DL systems, (ii) design efficient networks, and (iii) improve the design methodology of efficient networks and devices, along with three solutions in these directions. In this seminar, we take this a step further and propose two more solutions spanning these principal approaches.
First, we present a hardware-software co-design approach to improve the resource efficiency of efficient deep neural networks on popular systolic-array accelerators. Depthwise separable convolutions are the de-facto building blocks of efficient DNNs, but their computational patterns lack sufficient data reuse to map well onto a systolic array, making them inefficient. To bridge this efficiency gap, we present an efficient operator called FuSeConv, an efficient hardware dataflow called Spatially Tiled Output Stationary (ST-OS), and a training methodology called Neural Operator Scaffolding (NOS). Together, the proposed solutions achieve a significant speedup of 4.1-9.25x with state-of-the-art networks on the ImageNet dataset while preserving accuracy through NOS. Additionally, we combine this with Neural Architecture Search (NAS) to automatically design DNNs that are both accurate and efficient on systolic arrays.
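To make the data-reuse argument concrete, the sketch below shows a standard depthwise separable convolution in PyTorch. This is our own minimal illustration of the building block the talk targets, not the FuSeConv operator or the authors' code: the depthwise stage applies one small filter per channel, which offers little of the weight reuse that a systolic array relies on.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the authors' code): a depthwise separable convolution,
# which factorizes a standard convolution into a per-channel (depthwise) 3x3
# convolution followed by a 1x1 pointwise convolution that mixes channels.
class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        # Depthwise: groups == in_channels, so each filter sees only one channel.
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=in_channels, bias=False)
        # Pointwise: 1x1 convolution that combines information across channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

# Example: a MobileNet-style block on a 56x56 feature map.
x = torch.randn(1, 64, 56, 56)
y = DepthwiseSeparableConv(64, 128)(x)
print(y.shape)  # torch.Size([1, 128, 56, 56])
```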
Second, we present a design methodology to automatically discover efficient language models. Task-agnostic pre-training followed by task-specific fine-tuning is the default approach to training Natural Language Understanding (NLU) models. Given the enormous diversity of tasks and devices, training a unique NLU model for each of tens of tasks and devices is prohibitively expensive. We present SuperShaper, a task- and device-agnostic pre-training approach that simultaneously trains millions of transformer models by varying their shapes, i.e., their hidden dimensions across layers. This pre-training is enabled by sandwiching every transformer layer with linear bottleneck matrices that can be sliced to generate different transformers. Despite its simple design, SuperShaper discovers networks that are more accurate than a range of hand-crafted and automatically searched networks on the GLUE benchmark. Further, we propose a heuristic-based method to quickly derive networks of suitable shapes, radically amortizing the cost of Neural Architecture Search (NAS).
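As a rough illustration of the shape-slicing idea, the snippet below shows how a shared bottleneck matrix can be sliced to different hidden widths so that many sub-networks share one set of weights during super-network pre-training. This is a sketch under our own assumptions (class and parameter names are hypothetical), not the SuperShaper implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: a layer whose hidden width can be "sliced" at run
# time by taking the first `width` columns/rows of shared bottleneck matrices,
# so one set of weights serves many candidate shapes.
class SliceableBottleneck(nn.Module):
    def __init__(self, model_dim: int, max_width: int):
        super().__init__()
        self.down = nn.Parameter(torch.randn(model_dim, max_width) * 0.02)
        self.up = nn.Parameter(torch.randn(max_width, model_dim) * 0.02)

    def forward(self, x: torch.Tensor, width: int) -> torch.Tensor:
        h = x @ self.down[:, :width]   # project into the sliced bottleneck
        return h @ self.up[:width, :]  # project back to the model dimension

# Example: one weight set, two different sampled hidden widths.
layer = SliceableBottleneck(model_dim=768, max_width=768)
x = torch.randn(4, 128, 768)
print(layer(x, width=256).shape)  # torch.Size([4, 128, 768])
print(layer(x, width=640).shape)  # torch.Size([4, 128, 768])
```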
In essence, we believe our proposed solutions improve the efficiency of DNNs on systolic arrays and enable the automatic design of efficient language models, taking a concrete step towards rapidly defining new frontiers that serve the next billion users.
Speakers
Vinod Ganesan (CS16D200)
Computer Science & Engineering