14th International Conference on Image Processing, Theory, Tools and Applications, IPTA 2025, İstanbul, Türkiye, 13-16 October 2025 (Full-Text Paper)
Urban mobility demands intelligent systems that can perceive and interpret complex street environments. This study presents a deep learning-based, vision-only framework for traffic scene understanding that relies on 2D projections of 360-degree panoramic imagery. Our model combines YOLOv8 and YOLOv10 to detect roads, vehicles, and pedestrians in diverse urban settings. A custom dataset of 3,052 manually labeled Google Street View images from two cities, Paris and Marseille, was collected for training and evaluation. Results confirm that vision-only models can accurately recognize urban elements, achieving Precision, Recall, and mAP scores competitive with sensor-rich methods. While some limitations remain, such as detecting small or occluded pedestrians, the findings point to a future in which scalable, low-cost vision systems power autonomous navigation. This work contributes both a novel dataset and a benchmark for panoramic urban scene analysis.
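The Precision and Recall scores mentioned above are standard object-detection metrics: predicted boxes are matched one-to-one to ground-truth boxes at an IoU threshold, and the counts of matched and unmatched boxes yield the two scores. A minimal sketch of this computation is shown below; the greedy matching strategy, the 0.5 threshold, and the box values are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch of IoU-based Precision/Recall for box detections.
# Boxes are axis-aligned tuples (x1, y1, x2, y2); the 0.5 IoU threshold
# and greedy matching are common conventions, assumed here for clarity.

def iou(a, b):
    """Intersection-over-Union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def precision_recall(preds, gts, thr=0.5):
    """Greedily match each prediction to its best unmatched ground truth."""
    matched = set()
    tp = 0
    for p in preds:
        best, best_j = 0.0, -1
        for j, g in enumerate(gts):
            if j in matched:
                continue
            o = iou(p, g)
            if o > best:
                best, best_j = o, j
        if best >= thr:          # count a true positive only above threshold
            tp += 1
            matched.add(best_j)
    fp = len(preds) - tp         # predictions with no matching ground truth
    fn = len(gts) - tp           # ground truths missed by every prediction
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    return precision, recall
```

mAP extends this idea by sweeping the confidence threshold to trace a precision-recall curve per class, then averaging the area under those curves.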