GPU Acceleration for PyTorch on M1 Macs: A Comprehensive Guide
Chapter 1: Introduction to GPU Acceleration on M1 Macs
The launch of the M1 Macs in November 2020 brought a remarkable leap in the processing capabilities of Apple computers. Support for those capabilities in PyTorch, however, has arrived only recently.
Much of deep learning's recent progress has come from ever-larger models, and these larger models demand enormous amounts of computation to train and run. Traditional CPU hardware, which processes operations largely sequentially, struggles to keep up; these models instead need the parallel processing capabilities offered by GPUs.
GPUs, originally designed for rendering complex graphics, have become vital for deep learning due to their ability to perform numerous calculations simultaneously. This architecture is essential for handling the scale of contemporary models. Relying on CPUs can render many deep learning processes impractically slow, leading to frustration for users on M1 machines.
While TensorFlow gained GPU acceleration on M1 Macs soon after their release, PyTorch lagged behind in its support. Fortunately, that gap has now been bridged.
Section 1.1: PyTorch v1.12 and Its New Capabilities
PyTorch version 1.12 introduces GPU-accelerated training specifically for Apple silicon, a result of collaboration between PyTorch and Apple's Metal engineering team. Utilizing Apple’s Metal Performance Shaders (MPS) as the backend for PyTorch operations, this integration is optimized for each M1 chip family, resulting in substantial speed improvements.
For instance, benchmarks using the M1 Ultra chip reveal a remarkable ~7x increase in training speed and ~14x in inference speed for the widely used BERT model. In contrast, my first-generation M1 MacBook Pro doesn't exhibit the same level of speedup, especially with a batch size of 64. Nevertheless, a 200% improvement is still a significant gain.
Now, instead of examining data and graphs, let’s explore how to implement the new MPS-enabled PyTorch effectively.
Subsection 1.1.1: Prerequisites for MPS-Enabled PyTorch
Before diving into the installation, it's vital to meet certain prerequisites for MPS-enabled PyTorch: macOS 12.3 or later and an ARM-based (not x86/Rosetta) Python installation. You can check both from Python:

import platform
platform.platform()  # e.g. 'macOS-12.3.1-arm64-arm-64bit'
If the output indicates an outdated macOS version or an x86 architecture, updating to macOS 12.3 or creating a new ARM environment for Python is necessary.
If using Anaconda, the following command will help set up an ARM environment:
CONDA_SUBDIR=osx-arm64 conda create -n ml python=3.9 -c conda-forge
Once the environment is created, activate it with:
conda activate ml
Next, set the CONDA_SUBDIR variable on the environment itself, so that future package installations into it also default to ARM builds:
conda env config vars set CONDA_SUBDIR=osx-arm64
You may need to reactivate the environment for these changes to take effect.
Section 1.2: Installing PyTorch
To proceed, install PyTorch v1.12, which is currently only available as a nightly (pre-release) build.
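The command below matches the nightly install command published on pytorch.org at the time of writing; check the site for the current version:

pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu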
During installation, check that the downloaded wheel is the arm64 build. Additionally, install the transformers and datasets libraries:
pip install transformers datasets
If you encounter errors related to the tokenizers library, you may need a Rust toolchain in the same environment; the standard rustup installer is one way to get it:
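curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh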
Then, retry the installation of the transformers and datasets libraries.
To verify that MPS is available, run:

import torch
torch.has_mps  # True when this PyTorch build supports MPS
Chapter 2: Testing MPS with BERT
To evaluate the performance of MPS-enabled PyTorch, we'll use the first 1,000 rows of the TREC dataset with a BERT model, tokenizing the data with a BERT tokenizer and starting with a smaller subset of 64 rows.
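A minimal sketch of that setup might look like this (the max_length, truncation, and padding settings are my assumptions, not values from the original benchmarks):

from datasets import load_dataset
from transformers import BertTokenizer

# load the first 1,000 TREC rows and tokenize a 64-row subset
trec = load_dataset('trec', split='train[:1000]')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer(
    trec['text'][:64],
    max_length=512, truncation=True,
    padding='max_length', return_tensors='pt'
)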
For inference, processing the tokens on the CPU takes an average of 547ms. Switching to MPS, by moving both the tokens tensor and the model to the MPS device, brings a significant performance improvement, particularly at larger batch sizes.
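Continuing the sketch above (and assuming the tokens object from the previous step), the move to MPS looks roughly like this:

import torch
from transformers import BertModel

device = torch.device('mps')
model = BertModel.from_pretrained('bert-base-uncased').to(device)
tokens = tokens.to(device)  # BatchEncoding objects support .to()

with torch.no_grad():
    outputs = model(**tokens)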
For training, lower-end M1 chips aren't ideal for large models, but it is still feasible. Fine-tuning BERT on the TREC dataset uses the text feature as input and the label-coarse feature as the target labels. Since label-coarse contains six unique classes, BERT must be initialized with six output classes.
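That initialization might look like the following (assuming the standard bert-base-uncased checkpoint):

from transformers import BertForSequenceClassification

# six coarse TREC classes -> six output logits
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=6
)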
In training tests, the CPU takes approximately 25 minutes, while MPS reduces this to about 18 minutes. Note that the two runs aren't directly comparable: on this hardware MPS could only fit a batch size of one for the full BERT model, whereas the CPU run used a batch size of 32.
To lighten the training workload, we can also freeze the pretrained core of the model and fine-tune only the final layers tailored to the task, such as the classification head.
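With the BertForSequenceClassification model above, freezing the core is a short loop (model.bert is the pretrained encoder attribute on that class):

# freeze the pretrained encoder; only the classification head stays trainable
for param in model.bert.parameters():
    param.requires_grad = False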
In conclusion, this guide provides insights into the new MPS-enabled PyTorch and its application for inference and training with models like BERT. Stay updated on my latest projects by following my weekly YouTube posts or connecting with me on Discord. Hope to see you there!