June 2023
tl;dr: LoRA (Low-Rank Adaptation) is an efficient fine-tuning technique that speeds up training of Large Language Models (LLMs) by updating only a small set of low-rank weight matrices.
LoRA adds a trainable low-rank update to a frozen pretrained weight matrix:

$W' = W + \alpha \, (W_A W_B)$

where $W_A \in \mathbb{R}^{d \times r}$ and $W_B \in \mathbb{R}^{r \times k}$ form the low-rank representation and $\alpha$ is the scale factor. We keep the original weight $W$ frozen and only train the new matrices $W_A$ and $W_B$.
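To make the efficiency concrete, here is a back-of-the-envelope count using the toy dimensions from the code below ($d = 128$, $k = 512$, $r = 4$): full fine-tuning of $W$ would update $128 \times 512 = 65{,}536$ parameters, whereas LoRA trains only $128 \times 4 + 4 \times 512 = 2{,}560$ parameters in $W_A$ and $W_B$, roughly a 25× reduction in trainable weights.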
import torch
import torch.nn as nn

embed_dim = 128    # input (embedding) dimension
output_dim = 512   # output dimension
rank = 4           # the rank 'r' of the low-rank matrices
scale_factor = 5   # alpha, scales the contribution of the LoRA update

W = ...  # pretrained weight from the network, shape (embed_dim, output_dim); kept frozen

W_A = nn.Parameter(torch.randn(embed_dim, rank))   # LoRA weight A, random init
W_B = nn.Parameter(torch.zeros(rank, output_dim))  # LoRA weight B, zero init so W_A @ W_B starts as a no-op

def lora_forward(x):
    hidden_layer = x @ W                              # the normal forward pass
    hidden_layer += scale_factor * (x @ (W_A @ W_B))  # add the scaled low-rank update
    return hidden_layer
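Below is a minimal training-step sketch showing that only W_A and W_B receive gradient updates. It assumes the definitions above; the random stand-in for W, the dummy batch and target, the MSE loss, and the AdamW optimizer are illustrative assumptions, not part of the original snippet.

W = torch.randn(embed_dim, output_dim)  # stand-in for the real pretrained weight; requires_grad stays False, so it is frozen

optimizer = torch.optim.AdamW([W_A, W_B], lr=1e-3)  # optimize only the LoRA matrices

x = torch.randn(8, embed_dim)        # dummy batch of 8 inputs
target = torch.randn(8, output_dim)  # dummy regression target

output = lora_forward(x)             # frozen W contribution plus the scaled low-rank update
loss = torch.nn.functional.mse_loss(output, target)
loss.backward()                      # gradients flow only into W_A and W_B
optimizer.step()
optimizer.zero_grad()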