TY - JOUR
T1 - ViT-Based Future Road Image Prediction: Evaluation via VLM
AU - Kim, Donghyun
AU - Kwon, Jaerock
AU - Nam, Haewoon
JO - The Journal of Korean Institute of Communications and Information Sciences
PY - 2025
DA - 2025/1/1
DO - 10.7840/kics.2025.50.10.1532
KW - Autonomous Driving
KW - Vision-Language Model
KW - Semantic Evaluation
KW - Vision Transformer
AB - This paper proposes a Vision Transformer (ViT)-based model for predicting future driving scenes. The proposed ViT architecture processes input images as patches and leverages the attention mechanism to efficiently learn global visual information, while also integrating control inputs to effectively capture correlations between visual context and driving actions. Experimental results show that the ViT-based model generates sharper images than the baseline and achieves higher semantic similarity in explanation evaluations using a Vision-Language Model (VLM). These results suggest that the ViT architecture is effective not only for future prediction but also for explainable autonomous driving control.
ER - 