Large Language Models (LLMs) are increasingly deployed in applications that interact with the physical world, such as navigation, robotics, and mapping, making robust geospatial reasoning a critical capability. Yet LLMs' ability to reason about GPS coordinates and real-world geography remains underexplored. We introduce GPSBench, a dataset of 57,800 samples across 17 tasks for evaluating geospatial reasoning in LLMs, spanning geometric coordinate operations (e.g., distance and bearing computation) and reasoning that integrates coordinates with world knowledge. Focusing on intrinsic model capabilities rather than tool use, we evaluate 14 state-of-the-art LLMs and find that GPS reasoning remains challenging, with substantial variation across tasks: models are generally more reliable at real-world geographic reasoning than at geometric computations. Geographic knowledge degrades hierarchically, with strong country-level performance but weak city-level localization, while robustness to coordinate noise suggests genuine coordinate understanding rather than memorization. We further show that GPS-coordinate augmentation can improve performance on downstream geospatial tasks, and that finetuning induces trade-offs between gains in geometric computation and degradation in world knowledge.
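To make the geometric tasks concrete, here is a minimal sketch of how ground-truth distance and bearing between two GPS coordinates can be computed with the haversine formula and forward azimuth. This is an illustrative reference implementation, not GPSBench's actual code; function names and the spherical-Earth radius are our own assumptions.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points, assuming a spherical Earth."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing (forward azimuth) from point 1 to point 2, in degrees in [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    x = math.sin(dlmb) * math.cos(phi2)
    y = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb)
    return (math.degrees(math.atan2(x, y)) + 360.0) % 360.0
```

A model that genuinely understands coordinates should approximate both quantities directly from the latitude/longitude pairs, without external tools.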
Overall accuracy of 14 models across all 17 tasks (18,362 test samples)
GPSBench evaluates GPS reasoning across two complementary tracks and six capability categories
GPSBench is organized into two complementary evaluation tracks:

- **Pure GPS** — geometric coordinate operations (e.g., distance and bearing computation)
- **Applied** — reasoning that integrates coordinates with real-world geographic knowledge
```bash
git clone https://github.com/joey234/gpsbench.git
cd gpsbench
pip install -r requirements.txt
cp .env.example .env  # add your API keys
```

Run the benchmark:
```bash
# Evaluate all tasks
python run_benchmark.py --model gpt-4o --provider openai

# Specific track or task
python run_benchmark.py --model gpt-4o --track pure_gps
python run_benchmark.py --model gpt-4o --task distance_calculation

# Quick test (100 samples)
python run_benchmark.py --model gpt-4o --max-samples 100
```

Supported providers: `openai`, `anthropic`, `google`, `openrouter`
If you use GPSBench in your research, please cite:
```bibtex
@article{gpsbench2025,
  title   = {GPSBench: Do Large Language Models Understand GPS Coordinates?},
  author  = {Truong, Thinh Hung and Lau, Jey Han and Qi, Jianzhong},
  journal = {arXiv preprint arXiv:2602.16105},
  year    = {2025}
}
```
| # | Model | Provider | Overall % | Pure GPS % | Applied % |
|---|-------|----------|-----------|------------|-----------|
Performance separated by Pure GPS (coordinate manipulation) vs Applied (geographic reasoning)
Accuracy (%) per task per model. Darker green = higher accuracy.