TY - JOUR
T1 - An Analysis on Inference Time, Accuracy, Communication, and GPU Memory Usage for Inference Batch of Large Language Models
AU - Shin, Changyong
AU - Go, Younghun
AU - Yoo, Yeonho
AU - Yang, Gyeongsik
AU - Yoo, Chuck
JO - The Journal of Korean Institute of Communications and Information Sciences
PY - 2024
DA - 2024/1/1
DO - 10.7840/kics.2024.49.10.1377
KW - Large language model
KW - GPU utilization
KW - Communication overhead
KW - Model parallelism
KW - Tensor parallelism
KW - Kernel fusion
KW - Batch size
AB - Recently, large language models such as GPT, LLaMA, and PaLM have been actively applied in various fields, including medicine, education, finance, law, and marketing. These models have a vast number of parameters and require multiple GPUs to perform inference. For system administrators of inference services in clusters or clouds, it is critical to utilize the given GPU and network resources as efficiently as possible to respond quickly to numerous user requests. To achieve this, existing inference systems employ various parallelization and optimization strategies. This paper profiles and analyzes inference time, prediction accuracy, GPU communication volume, and GPU memory usage across different parallelization strategies, optimization techniques, and batch sizes. Notably, we develop a new profiler for precise measurement of GPU resources. Our profiling results reveal that increasing the batch size can lead to inefficiencies due to increased GPU communication. In terms of GPU memory, larger batch sizes result in more aggressive memory utilization, but beyond a specific threshold, out-of-memory issues arise because of the limited GPU memory. These observations are expected to serve as a baseline for designing efficient inference systems for large language models.
ER -