m-LoRA

Efficient LLM Fine-Tuning via Multi-LoRA Optimization

m-LoRA (a.k.a. Multi-LoRA Fine-Tuning) is an open-source framework for efficiently fine-tuning Large Language Models (LLMs) with multiple LoRA/QLoRA adapters. Key features of m-LoRA include:

  • Efficient LoRA/QLoRA: Optimizes the fine-tuning process, significantly reducing GPU memory usage by sharing a single frozen base model across adapters.

  • Multiple LoRA Adapters: Supports concurrent fine-tuning of multiple LoRA/QLoRA adapters (sketched below).
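
A minimal PyTorch sketch of the shared-base idea (illustrative only, not m-LoRA's actual implementation; the class and all names here are hypothetical):

```python
import torch
import torch.nn as nn

class MultiLoRALinear(nn.Module):
    """One frozen base weight shared by several independent LoRA adapters."""

    def __init__(self, base: nn.Linear, n_adapters: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the shared base weights
            p.requires_grad = False
        self.scaling = alpha / r
        in_f, out_f = base.in_features, base.out_features
        # Each adapter owns only its small trainable A/B matrices.
        self.lora_A = nn.ParameterList(
            nn.Parameter(torch.randn(r, in_f) * 0.01) for _ in range(n_adapters)
        )
        self.lora_B = nn.ParameterList(
            nn.Parameter(torch.zeros(out_f, r)) for _ in range(n_adapters)
        )

    def forward(self, x: torch.Tensor, adapter: int) -> torch.Tensor:
        # The base projection uses the shared frozen weight; only the
        # low-rank update differs between adapters.
        delta = (x @ self.lora_A[adapter].T) @ self.lora_B[adapter].T
        return self.base(x) + self.scaling * delta

layer = MultiLoRALinear(nn.Linear(512, 512), n_adapters=4)
x = torch.randn(2, 512)
y0 = layer(x, adapter=0)  # same frozen base, adapter 0's update
y1 = layer(x, adapter=1)  # same frozen base, adapter 1's update
```

Because the base weights appear only once, adding another adapter costs only the small A/B matrices rather than another copy of the model.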

Overview

m-LoRA is a high-throughput LLM fine-tuning framework based on LoRA and QLoRA, compatible with HuggingFace Transformers LLaMA models and ChatGLM models.

Figure: the basic principle of LoRA and Multi-LoRA.

Figure: the system overview of m-LoRA.

m-LoRA requires PyTorch and NVIDIA CUDA-compatible GPUs.

Main Contributions

  • Introduces the Multi-LoRA method, which enables multiple fine-tuning tasks to share the pre-trained weights of a large language model;
  • Proposes a task scheduling algorithm that improves overall training throughput and reduces total training latency (see the sketch after this list);
  • Builds on the above to implement m-LoRA, a high-throughput large language model fine-tuning framework based on LoRA and QLoRA;
  • Evaluates m-LoRA against existing systems, confirming that it uses system computing resources more effectively, improving training throughput and reducing training latency.
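
The paper's actual scheduling algorithm is not reproduced here; the following is a simplified greedy sketch of the batch-packing idea, with all names, sizes, and the token budget hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Task:
    """A pending fine-tuning job for one LoRA adapter (illustrative only)."""
    name: str
    pending_tokens: int  # tokens this adapter still needs to train on
    batch_tokens: int    # tokens it contributes to one fused step

def pack_step(tasks: list[Task], token_budget: int) -> list[Task]:
    """Greedily fill one fused training step up to a GPU token budget.

    A simplified stand-in for m-LoRA's scheduler: larger jobs first, so the
    shared forward pass over the frozen base model stays well utilized.
    """
    step, used = [], 0
    for task in sorted(tasks, key=lambda t: t.pending_tokens, reverse=True):
        if task.pending_tokens > 0 and used + task.batch_tokens <= token_budget:
            step.append(task)
            used += task.batch_tokens
    return step

jobs = [
    Task("adapter-finance", pending_tokens=80_000, batch_tokens=4_096),
    Task("adapter-medical", pending_tokens=40_000, batch_tokens=2_048),
    Task("adapter-legal",   pending_tokens=60_000, batch_tokens=4_096),
]
while any(t.pending_tokens > 0 for t in jobs):
    for t in pack_step(jobs, token_budget=8_192):
        t.pending_tokens -= t.batch_tokens  # one fused step trains every packed adapter
```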

Experiment Results

Environment: NVIDIA RTX A6000 with Intel Xeon Silver 4314 on Ubuntu 22.04.3

Baseline: We used the widely adopted Alpaca-LoRA as a foundation. On a single GPU, we independently ran multiple Alpaca-LoRA processes in parallel (marked as Baseline@Alpaca-Parallel) and sequentially (marked as Baseline@Alpaca-Seq), forming the two baseline methods for the experiments. All results below are based on the same GPU configuration.

Training Latency and Throughput

| Method | Latency | Throughput |
| --- | --- | --- |
| Baseline@Alpaca-Seq | 10.51h | 608.41 token/s |
| Baseline@Alpaca-Parallel | 9.85h | 649.30 token/s |
| m-LoRA | 9.46h | 674.58 token/s |

We conducted four identical fine-tuning jobs with the same dataset and hyper-parameters, using the two baselines and m-LoRA. During the experiments, we recorded the completion time of each task; for the baseline methods, the time taken by the slowest task is reported as the Training Latency. As the table shows, m-LoRA exhibits lower Training Latency than both baselines: it is 9.99% faster than Baseline@Alpaca-Seq and 3.92% faster than Baseline@Alpaca-Parallel.
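
These percentages are relative latency reductions; a quick check against the table (the small gap to the reported 3.92% presumably comes from the table rounding hours to two decimals):

```python
# Relative latency reduction of m-LoRA, using the rounded hours from the table.
seq, par, mlora = 10.51, 9.85, 9.46
print(f"vs Baseline@Alpaca-Seq:      {(seq - mlora) / seq:.2%}")  # 9.99%
print(f"vs Baseline@Alpaca-Parallel: {(par - mlora) / par:.2%}")  # 3.96% (reported: 3.92%)
```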

Video Memory Usage

We conducted several fine-tuning jobs with the same dataset and batch_size ∈ {2, 4, 6, 8}, comparing Baseline@Alpaca-Parallel and m-LoRA.

Baseline@Alpaca-Parallel triggered an OOM error beyond 3 parallel tasks at batch size 8, while m-LoRA handled twice as many tasks.
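
Why sharing helps: with one process per job, every job pays for its own copy of the base model, while in m-LoRA only the small adapters and per-job activations scale with the number of jobs. A back-of-envelope sketch with hypothetical sizes (none of these numbers are measured):

```python
# Hypothetical sizes in GB, chosen only to illustrate the scaling behavior.
base, adapter, activations_per_job = 14.0, 0.1, 1.5

def parallel_processes(n: int) -> float:
    # Each process loads its own full copy of the base model.
    return n * (base + adapter + activations_per_job)

def mlora_shared(n: int) -> float:
    # One shared frozen base; only adapters and activations grow with n.
    return base + n * (adapter + activations_per_job)

for n in (1, 3, 6):
    print(f"{n} jobs: parallel {parallel_processes(n):.1f} GB, "
          f"m-LoRA {mlora_shared(n):.1f} GB")
```

Under these assumed sizes, six shared-base jobs fit in less memory than three separate processes, matching the qualitative result above.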

Batching Strategies

| Method | Training Latency | Peak Memory Usage | Average GPU Utilization | Training Throughput |
| --- | --- | --- | --- | --- |
| Baseline@Alpaca-Seq | 27.73h | 10.68GB | 79.39% | 653.35 token/s |
| m-LoRA@M1 | 36.82h | 23.82GB | 96.52% | 672.54 token/s |
| m-LoRA@M2 | 39.14h | 23.86GB | 96.41% | 671.28 token/s |
| m-LoRA@M3 | 22.97h | 23.85GB | 95.22% | 674.41 token/s |

We conducted four fine-tuning jobs with different datasets but the same hyper-parameters, comparing Baseline@Alpaca-Seq and m-LoRA.

During the experiments, we collected the following metrics (a computation sketch follows the list):

  • Training Latency = job completion time
  • Throughput = number of tokens passed through the model's forward pass / training latency
  • Memory Usage = peak video memory usage
  • GPU Utilization = average GPU utilization
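
A sketch of the throughput definition above, with a hypothetical forward-token count chosen to be consistent with the m-LoRA@M3 row:

```python
# Throughput = tokens passed through the forward pass / training latency.
forward_tokens = 5.577e7    # hypothetical total forward tokens (assumed)
job_seconds = 22.97 * 3600  # m-LoRA@M3 training latency from the table
print(f"{forward_tokens / job_seconds:.2f} token/s")  # ~674.4 token/s
```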

All metrics are computed per job. M1, M2, and M3 denote three batching strategies of m-LoRA: Optimal-Fit, Trivial, and Fast-Fit. Baseline denotes Baseline@Alpaca-Seq.

The Optimal-Fit strategy performs best overall, while the other two strategies also outperform the baseline on every metric except training latency.

Use Cases

  • Domain-Specific Fine-Tuning: adapting a single base model to one domain by training several adapter variants with different parameters.
  • Cross-Domain Fine-Tuning: fine-tuning multiple adapters on a shared base model, each targeting a different domain (see the sketch below).
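
Both patterns reduce to a list of adapters trained over one shared base model. A hypothetical job description (illustrative only; not m-LoRA's actual configuration format, and all names and values are assumptions):

```python
# Hypothetical multi-adapter job description; see the m-LoRA repository
# for its real configuration format.
jobs = {
    "base_model": "meta-llama/Llama-2-7b-hf",
    "adapters": [
        # Cross-domain: one shared base, one adapter per domain.
        {"name": "medical-qa", "data": "medical.json", "r": 8,  "lr": 3e-4},
        {"name": "legal-qa",   "data": "legal.json",   "r": 8,  "lr": 3e-4},
        # Domain-specific: same domain, different hyper-parameters.
        {"name": "finance-a",  "data": "finance.json", "r": 8,  "lr": 3e-4},
        {"name": "finance-b",  "data": "finance.json", "r": 16, "lr": 1e-4},
    ],
}
```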