







### Optimizing Deep Learning Inference via Global Analysis and Tensor Expressions

Chunwei Xia<sup>1,2</sup>, Jiacheng Zhao<sup>2</sup>, Qianqi Sun<sup>2</sup>, Zheng Wang<sup>1</sup>, Yuan Wen<sup>3</sup>, Teng Yu<sup>4</sup>, Xiaobing Feng<sup>2</sup>, Huimin Cui<sup>2</sup>

<sup>1</sup>School of Computing, University of Leeds
<sup>2</sup>SKLP, Institute of Computing Technology, CAS
<sup>3</sup>University of Aberdeen
<sup>4</sup>THeWake Research


# **Deep Learning compilers**



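<pre>
// SAXPY: y = a*x + y, one element per thread
__global__ void saxpy(int n, float a, float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
  if (i &lt; n) y[i] = a * x[i] + y[i];              // guard against out-of-range threads
}
</pre>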








## **Operator fusion**



Memory hierarchy with access latency

DL compilers can reduce inference latency through operator fusion.
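A minimal sketch of why fusion helps, written in plain Python rather than GPU code. The function names and the Add followed by Relu pairing are illustrative assumptions, not taken from any compiler: the unfused version stores an intermediate tensor between the two operators, while the fused version keeps each value in a local variable.

```python
def add_relu_unfused(x, y):
    """Two separate 'kernels': Add writes an intermediate, Relu reads it back."""
    tmp = [a + b for a, b in zip(x, y)]       # kernel 1: Add, result stored in memory
    out = [t if t > 0 else 0 for t in tmp]    # kernel 2: Relu, re-reads the intermediate
    return out, len(tmp)                      # second value: intermediate elements stored


def add_relu_fused(x, y):
    """One fused 'kernel': each element is added and clamped in a single pass."""
    out = []
    for a, b in zip(x, y):
        s = a + b                             # stays in a local, register-like variable
        out.append(s if s > 0 else 0)
    return out, 0                             # no intermediate tensor is materialized


# Illustrative inputs only.
x = [1.0, -2.0, 3.0]
y = [0.5, 0.5, -4.0]
unfused_out, unfused_tmp = add_relu_unfused(x, y)
fused_out, fused_tmp = add_relu_fused(x, y)
```

Both versions return the same output; only the fused one avoids the round trip through the slower levels of the memory hierarchy.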

## State-of-the-art

- [SystemML, VLDB'18]
- [DNNFusion, PLDI'21]
- [DLFusion, ISPA'20]
- [Apollo, MLSys'22]

- [Rammer, OSDI'20]
- [TASO, SOSP'19]
- [HFUSE, CGO'21]

| Representative operator | First op (rows) / second op (columns) | One-to-One   | One-to-Many  | Many-to-Many | Reorganize   | Shuffle      |
|-------------------------|-----------------------|--------------|--------------|--------------|--------------|--------------|
| Add, Relu               | One-to-One            | One-to-One   | One-to-Many  | Many-to-Many | Reorganize   | Shuffle      |
| Expand                  | One-to-Many           | One-to-Many  | One-to-Many  | ×            | One-to-Many  | One-to-Many  |
| Conv, GEMM              | Many-to-Many          | Many-to-Many | Many-to-Many | ×            | Many-to-Many | Many-to-Many |
| Reshape                 | Reorganize            | Reorganize   | One-to-Many  | Many-to-Many | Reorganize   | Reorganize   |
| Transpose               | Shuffle               | Shuffle      | One-to-Many  | Many-to-Many | Reorganize   | Shuffle      |

DNNFusion: rule-based operator fusion
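The table above can be encoded as a lookup, which makes the rule-based nature of the approach concrete. This is a sketch, not DNNFusion's implementation: the names `FUSED`, `OP_TYPE`, and `fused_type` are hypothetical, and reading "×" as "do not fuse" is an assumption.

```python
TYPES = ["One-to-One", "One-to-Many", "Many-to-Many", "Reorganize", "Shuffle"]

# One row per first-op mapping type; entries follow the column order in TYPES.
# None stands for the "×" cells (assumed here to mean: do not fuse).
_ROWS = {
    "One-to-One":   ["One-to-One", "One-to-Many", "Many-to-Many", "Reorganize", "Shuffle"],
    "One-to-Many":  ["One-to-Many", "One-to-Many", None, "One-to-Many", "One-to-Many"],
    "Many-to-Many": ["Many-to-Many", "Many-to-Many", None, "Many-to-Many", "Many-to-Many"],
    "Reorganize":   ["Reorganize", "One-to-Many", "Many-to-Many", "Reorganize", "Reorganize"],
    "Shuffle":      ["Shuffle", "One-to-Many", "Many-to-Many", "Reorganize", "Shuffle"],
}
FUSED = {first: dict(zip(TYPES, row)) for first, row in _ROWS.items()}

# Representative operators from the first column of the table.
OP_TYPE = {
    "Add": "One-to-One", "Relu": "One-to-One",
    "Expand": "One-to-Many",
    "Conv": "Many-to-Many", "GEMM": "Many-to-Many",
    "Reshape": "Reorganize",
    "Transpose": "Shuffle",
}


def fused_type(first_op, second_op):
    """Mapping type of the fused operator, or None if the rule forbids fusion."""
    return FUSED[OP_TYPE[first_op]][OP_TYPE[second_op]]
```

Any operator missing from `OP_TYPE` raises a `KeyError`, so each new operator needs a manual table entry before it can be considered for fusion.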


**Bottom-up operator fusion**

Figure: example in which bottom-up fusion produces 3 kernels.



1. Rule- or heuristic-based $\rightarrow$ poor extensibility

2. Local optimization $\rightarrow$ poor data reuse

Figure: fusion viewed as a minimization problem, where local optimization can stop at a local minimum.



### What we need

- Maximize data reuse
- Fit onto hardware
- Fully automated



Ultimate goal: compile one model into a single kernel



Try to generate the whole model as a single kernel: top-down global optimization.





















#### GPU has limited resources

**Separate kernels, each with its own launch configuration**

<pre>
&lt;&lt;&lt;blockDim(2), threadDim(128)&gt;&gt;&gt;
kernel1(float* a, …) {
  __shared__ char shared_pool[4*1024*1024];
  // gemm1 code
}

&lt;&lt;&lt;blockDim(8), threadDim(128)&gt;&gt;&gt;
kernel2(float* a, …) {
  __shared__ char shared_pool[1*1024*1024];
  // gemm2 code
}

// … kernelN
</pre>

**One model as a single kernel** (constrained by the block count limit and the shared memory limit)

<pre>
kernel(float* a, …) {
  __shared__ char shared_pool[4*1024*1024];
  // kernel1 code
  grid.sync();  // global synchronization
  // kernel2 code
  grid.sync();  // global synchronization
  // … kernelN code
}
</pre>
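A rough sketch of the resource check implied above, in plain Python. It assumes that the merged kernel reuses one set of thread blocks and one shared memory pool across all stages, so it needs the maximum of each resource; grid-wide synchronization additionally requires every block to be resident on the GPU at once. `KernelConfig`, `merged_requirements`, `fits`, and the limit values are hypothetical, while the block counts and pool sizes come from the two kernels shown above.

```python
from dataclasses import dataclass


@dataclass
class KernelConfig:
    name: str
    blocks: int         # thread blocks the kernel launches
    shared_bytes: int   # shared memory pool requested by the kernel


def merged_requirements(kernels):
    """Resources of one kernel that runs all stages back to back.

    Assumption: stages share the same blocks and the same pool, so the
    merged kernel needs the maximum of each resource, not the sum.
    """
    return (max(k.blocks for k in kernels),
            max(k.shared_bytes for k in kernels))


def fits(kernels, max_resident_blocks, max_shared_bytes):
    """True if the merged kernel respects both hardware limits."""
    blocks, shared = merged_requirements(kernels)
    return blocks <= max_resident_blocks and shared <= max_shared_bytes


MB = 1024 * 1024
stages = [
    KernelConfig("kernel1", blocks=2, shared_bytes=4 * MB),
    KernelConfig("kernel2", blocks=8, shared_bytes=1 * MB),
]

# Hypothetical limits, chosen only to show both outcomes.
ok = fits(stages, max_resident_blocks=108, max_shared_bytes=8 * MB)
too_small = fits(stages, max_resident_blocks=108, max_shared_bytes=2 * MB)
```

When the check fails, the model cannot be emitted as one kernel under these limits and has to be split.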







Figure: graph partitioning, followed by global analysis and automatic transformation. In the example, shared memory is 8MB; a subgraph that cannot fit into the hardware is split.
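The split-when-it-cannot-fit idea can be sketched with a greedy pass over a chain of operators. This is not the partitioning algorithm from the talk, only an illustration: `partition` is a hypothetical helper, the operator names and per-operator sizes are made up, and only the 8MB limit comes from the slide.

```python
def partition(op_bytes, limit_bytes):
    """Greedily split a chain of operators into subgraphs that fit the limit.

    op_bytes: list of (operator name, bytes needed) in execution order.
    Returns a list of subgraphs, each a list of operator names.
    """
    parts, current, used = [], [], 0
    for name, size in op_bytes:
        if size > limit_bytes:
            raise ValueError(f"{name} alone exceeds the limit")
        if used + size > limit_bytes:
            parts.append(current)       # close the current subgraph: split here
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        parts.append(current)
    return parts


MB = 1024 * 1024
# Illustrative sizes only, not measurements.
ops = [("gemm1", 4 * MB), ("add", 2 * MB), ("gemm2", 4 * MB), ("relu", 1 * MB)]
parts = partition(ops, 8 * MB)
```

With these sizes, `gemm1` and `add` fit together, adding `gemm2` would exceed 8MB, so a second subgraph starts there.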









### Automatic transformation



### Post optimization



## **Experimental Setup**

- Software: implementation based on TVM 0.8
- Hardware: NVIDIA A100 GPU, CUDA 11.7

### Strong Baselines

- Ansor: the code generator our implementation builds on
- TensorRT: vendor-optimized compiler
- Rammer: Microsoft's optimizing compiler [OSDI'20]
- Apollo: MindSpore compiler [MLSys'22]
- XLA: JIT compiler in TensorFlow
- IREE: MLIR-based DNN compiler


#### Models

| Model            | Dataset   | Parameters                                      |
|------------------|-----------|-------------------------------------------------|
| ResNeXt          | ImageNet  | #layers: 101, bottleneck width: 64d             |
| EfficientNet     | ImageNet  | EfficientNet-b0 from the source publication     |
| Swin-Transformer | ImageNet  | Base version, patch size: 4, window size: 7     |
| BERT             | SQuAD     | Base version with 12 layers from TensorRT       |
| LSTM             | Synthetic | Input length: 100, hidden size: 256, layers: 10 |
| MMoE             | Synthetic | Base model                                      |

## **Experimental Results**

### End-to-end latency

- $3.94\times$ speedup on average (maximum $8.5\times$) over Ansor
- $4.0\times$ geometric-mean speedup (maximum $7.9\times$) over XLA
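For reference, a geometric-mean speedup and a maximum speedup can be computed from per-model latencies as sketched below. The helper name `speedup_summary` is hypothetical and the latencies are placeholders chosen for easy arithmetic, not measurements from the evaluation.

```python
import math


def speedup_summary(baseline_ms, ours_ms):
    """Return (geometric-mean speedup, maximum speedup) across models."""
    ratios = [b / o for b, o in zip(baseline_ms, ours_ms)]
    geomean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
    return geomean, max(ratios)


# Placeholder latencies in milliseconds (NOT measured data).
baseline = [8.0, 2.0, 4.0]
ours = [2.0, 1.0, 1.0]
g, m = speedup_summary(baseline, ours)   # per-model ratios are 4, 2 and 4
```

The geometric mean is the usual choice for averaging ratios because a single large speedup inflates it less than it would an arithmetic mean.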




### Performance breakdown

Optimizations are enabled one at a time.

Each optimization reduces the latency.


### Case study on LSTM

- 10 layers, 100 timesteps
- Rammer: 220 kernels vs. ours: 1 kernel
- Rammer: 1.72ms vs. ours: 0.80ms

| Metrics                    | Rammer   | Souffle |
|----------------------------|----------|---------|
| DRAM bytes from global     | 1911.0MB | 21.11MB |
| Pipeline Utilization (LSU) | 20.2%    | 35.4%   |
| Pipeline Utilization (FMA) | 8.0%     | 19.0%   |
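The case-study numbers can be turned into ratios with a few lines of Python. The values are copied from the bullets and table above; the dictionary keys are hypothetical labels, and treating the 1.72ms and 0.80ms figures as execution times is an assumption.

```python
# Values copied from the LSTM case study above.
rammer = {"kernels": 220, "time_ms": 1.72, "dram_mb": 1911.0}
souffle = {"kernels": 1, "time_ms": 0.80, "dram_mb": 21.11}

# Ratio of Rammer to Souffle for each metric (higher means a larger reduction).
ratios = {key: rammer[key] / souffle[key] for key in rammer}

for key, ratio in ratios.items():
    print(f"{key}: {ratio:.2f}x")
```

This gives roughly 220x fewer kernels, 2.15x lower time, and about 90x less DRAM traffic.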





- Tensor expressions as the intermediate representation
- Try to generate the whole model as a single kernel: top-down global optimization
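To illustrate the first point, here is a toy model of a tensor expression in plain Python: a tensor is a shape plus a function from indices to a value. The names `placeholder` and `compute` echo TVM's `te.placeholder` and `te.compute`, but this is not the TVM API; the `Tensor` class and every helper here are hypothetical and exist only for this sketch.

```python
class Tensor:
    """A 2-D tensor defined by its shape and an index-to-value function."""

    def __init__(self, shape, fn):
        self.shape, self.fn = shape, fn

    def __getitem__(self, idx):
        return self.fn(*idx)              # t[i, j] evaluates the expression lazily


def placeholder(data):
    """Wrap concrete nested-list data as an input tensor."""
    return Tensor((len(data), len(data[0])), lambda i, j: data[i][j])


def compute(shape, fn):
    """Define a new tensor by an expression over indices."""
    return Tensor(shape, fn)


def matmul(a, b):
    k = a.shape[1]
    return compute((a.shape[0], b.shape[1]),
                   lambda i, j: sum(a[i, r] * b[r, j] for r in range(k)))


def relu(t):
    return compute(t.shape, lambda i, j: max(t[i, j], 0))


def materialize(t):
    """Evaluate every element; the only place values are actually stored."""
    return [[t[i, j] for j in range(t.shape[1])] for i in range(t.shape[0])]


A = placeholder([[1, -2], [3, 4]])
B = placeholder([[1, 0], [0, 1]])          # identity, so A @ B equals A
result = materialize(relu(matmul(A, B)))
```

Because `relu(matmul(A, B))` is a composition of index expressions, the matmul output is never stored as a separate tensor. Exposing element-level dependences in this way is what lets a compiler reason about a whole model at once.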

