
ROLLER - Fast and Efficient Tensor Compilation for Deep Learning

USENIX via YouTube

Overview

This course aims to teach learners how to efficiently compile tensor operators for deep learning, specifically how to reduce the time it takes to generate kernels for deep-neural-network operators on various accelerators. The course introduces ROLLER, a construction-based approach that uses a new tile abstraction called rTile to limit the choice of tile shapes to those that execute efficiently on the hardware. By adopting a recursive rTile-based construction algorithm, ROLLER can generate efficient kernels quickly, offering performance comparable to existing solutions on popular accelerators such as GPUs and superior performance on less mature accelerators such as IPUs.

The course covers topics such as excessive compilation time, the black-box compiler approach, improving pipeline throughput, handling small and irregular shapes, and performance evaluations on GPUs. The intended audience includes developers, researchers, and engineers working on deep learning projects who are interested in optimizing tensor compilation for improved performance across hardware accelerators.

Syllabus

Intro
Excessive Compilation Time
Black-Box Compiler
Motivating Example: 8k^3 matmul
Roller: A White-Box Solution
Improving Pipeline Throughput
Abstracted GPU (V100 Example)
Small & Irregular Shapes
Evaluations - V100 Performance
Evaluation Compilation Time
Summary

Taught by

USENIX
