TY - JOUR
T1 - An Analysis on Inference Time, Accuracy, Communication, and GPU Memory Usage for Inference Batch of Large Language Models
AU - Shin, Changyong
AU - Go, Younghun
AU - Yoo, Yeonho
AU - Yang, Gyeongsik
AU - Yoo, Chuck
JO - The Journal of Korean Institute of Communications and Information Sciences
PY - 2024
DA - 2024/1/1
DO - 10.7840/kics.2024.49.10.1377
KW - Large language model
KW - GPU utilization
KW - Communication overhead
KW - Model parallelism
KW - Tensor parallelism
KW - Kernel fusion
KW - Batch size
AB - Recently, large language models such as GPT, LLaMA, and PaLM have been actively applied in various fields, including medicine, education, finance, law, and marketing. These models have a vast number of parameters and require multiple GPUs to perform inference. For system administrators of inference services in clusters or clouds, it is critical to utilize the given GPU and network resources as efficiently as possible to respond quickly to numerous user requests. To achieve this, existing inference systems employ various parallelization and optimization strategies. This paper profiles and analyzes inference time, prediction accuracy, GPU communication volume, and GPU memory usage across different parallelization strategies, optimization techniques, and batch sizes. Notably, we develop a new profiler for precise measurement of GPU resources. Our profiling results reveal that increasing the batch size can lead to inefficiencies due to increased GPU communication. In terms of GPU memory, larger batch sizes result in more aggressive memory utilization, but beyond a specific threshold, out-of-memory issues arise because of the limited GPU memory. These observations are expected to serve as a baseline for designing efficient inference systems for large language models.
ER -