
SPipe: Hybrid GPU and CPU Pipeline for Training LLMs under Memory Pressure

Abstract

Training large language models (LLMs) under limited compute resources often results in GPU memory exhaustion due to their massive memory requirements. While CPU memory can be used to compensate, the large volume of data transfers between GPU and CPU significantly limits performance. To overcome these limitations, we propose SPipe, a training framework that enables efficient utilization of multiple GPUs and CPUs for LLM training. SPipe consists of a GPU pipeline to reduce inter-GPU pipeline bubbles, and a GPU-CPU pipeline to mitigate CPU bottlenecks caused by data transfer overheads and compute imbalance. In our evaluation across various LLM sizes, SPipe achieves an average 1.26× speedup over the state-of-the-art Mobius framework.

SPipe introduces two core techniques: (1) Decoupled Pass Assignment and (2) Asynchronous Optimizer. The former stores model parameters in CPU memory rather than GPU memory and assigns the forward and backward passes to two separate GPUs that access the shared CPU-resident parameters. This design effectively reduces the inter-GPU pipeline bubbles that arise in conventional settings where a single GPU must perform both passes. The latter runs GPU backward passes in parallel with CPU optimizer steps for parameters whose gradients have already been computed, alleviating the GPU-CPU pipeline bubbles that would otherwise arise when GPU and CPU processing cannot overlap. Additionally, SPipe provides a set of supporting techniques: Fine-grained Stage Partitioning to eliminate inter-GPU pipeline bubbles, Asynchronous Checkpoint Communication to optimize inter-GPU communication, and Bypassing and Rollback mechanisms to ensure the efficiency and correctness of the asynchronous optimizer.
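The overlap idea behind the Asynchronous Optimizer can be illustrated with a minimal sketch. This is not SPipe's actual API: the function names, the layer representation, and the use of a thread pool are all illustrative stand-ins. The point is only the scheduling pattern — as soon as one layer's gradient is ready, its CPU-side optimizer step is launched so it overlaps with the backward work of earlier layers.

```python
# Hypothetical sketch (not SPipe's real implementation): overlapping CPU
# optimizer steps with the remaining backward pass. Plain Python functions
# stand in for GPU backward computation and CPU parameter updates.
from concurrent.futures import ThreadPoolExecutor

def backward(layer):
    # Stand-in for the GPU backward pass of one layer; returns its "gradient".
    return layer * 2

def optimizer_step(layer, grad):
    # Stand-in for the CPU optimizer step on one layer's parameters.
    return grad + 1

def train_step(layers):
    updates = {}
    with ThreadPoolExecutor(max_workers=1) as cpu:
        pending = []
        # Backward proceeds layer by layer, last to first. As soon as a
        # layer's gradient is ready, its optimizer step is submitted to the
        # CPU pool and overlaps with the backward pass of earlier layers.
        for layer in reversed(layers):
            grad = backward(layer)
            pending.append((layer, cpu.submit(optimizer_step, layer, grad)))
        # Collect the updated parameters once all optimizer steps finish.
        for layer, fut in pending:
            updates[layer] = fut.result()
    return updates

print(train_step([0, 1, 2, 3]))  # → {3: 7, 2: 5, 1: 3, 0: 1}
```

In the real system the overlap is between GPU streams and CPU cores rather than Python threads, and the Bypassing and Rollback mechanisms described above guard the correctness of these concurrent updates.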

Acknowledgements

This work was partially supported by the National Research Foundation of Korea (NRF) under Grant No. RS-2023-00222663 (Center for Optimizing Hyperscale AI Models and Platforms), and by the Institute for Information and Communications Technology Promotion (IITP) under Grant No. 2018-0-00581 (CUDA Programming Environment for FPGA Clusters) and No. RS-2025-02304554 (Efficient and Scalable Framework for AI Heterogeneous Cluster Systems), all funded by the Ministry of Science and ICT (MSIT) of Korea. Additional support was provided by the BK21 Plus Program for Innovative Data Science Talent Education (Department of Data Science, SNU, No. 5199990914569) and the BK21 FOUR Program for Intelligent Computing (Department of Computer Science and Engineering, SNU, No. 4199990214639), both funded by the Ministry of Education (MOE) of Korea. This work was also partially supported by the Artificial Intelligence Industrial Convergence Cluster Development Project, funded by the MSIT and Gwangju Metropolitan City. ICT at Seoul National University provided research facilities for this study.

Contributors

Junyeol Ryu, Yujin Jeong, Daeyoung Park§, Jinpyo Kim§, Heehoon Kim, Jaejin Lee§

University of Wisconsin-Madison
Samsung Electronics
§ Seoul National University
Moreh Inc.