GPSBench: Do Large Language Models
Understand GPS Coordinates?

Thinh Hung Truong, Jey Han Lau, Jianzhong Qi
The University of Melbourne

Abstract

Large Language Models (LLMs) are increasingly deployed in applications that interact with the physical world, such as navigation, robotics, and mapping, making robust geospatial reasoning a critical capability. Despite this, LLMs' ability to reason about GPS coordinates and real-world geography remains underexplored. We introduce GPSBench, a dataset of 57,800 samples across 17 tasks for evaluating geospatial reasoning in LLMs, spanning geometric coordinate operations (e.g., distance and bearing computation) and reasoning that integrates coordinates with world knowledge. Focusing on intrinsic model capabilities rather than tool use, we evaluate 14 state-of-the-art LLMs and find that GPS reasoning remains challenging, with substantial variation across tasks: models are generally more reliable at real-world geographic reasoning than at geometric computations. Geographic knowledge degrades hierarchically, with strong country-level performance but weak city-level localization, while robustness to coordinate noise suggests genuine coordinate understanding rather than memorization. We further show that GPS-coordinate augmentation can improve performance on downstream geospatial tasks, and that finetuning induces a trade-off: gains in geometric computation come with degradation in world knowledge.

Results Overview

Overall accuracy of 14 models across all 17 tasks (18,362 test samples)

Benchmark Taxonomy

GPSBench evaluates GPS reasoning across two complementary tracks and six capability categories

Benchmark Overview

GPSBench is organized into two complementary evaluation tracks:

Pure GPS Track
  • Format Conversion (DD, DMS, UTM, MGRS, Plus Code)
  • Coordinate System Transformation
  • Distance Calculation (Haversine)
  • Bearing Computation
  • Coordinate Interpolation
  • Area & Perimeter Calculation
  • Bounding Box Computation
  • Route Geometry Analysis
  • Relative Position (coordinate-based)
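The geometric operations in this track have closed-form solutions that models must approximate without tools. As an illustration (not the benchmark's reference implementation), a minimal Python sketch of the haversine distance and initial-bearing computations named above:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius used by the haversine formula

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing in degrees (0 = north, clockwise) from point 1 to point 2."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlmb)
    return (math.degrees(math.atan2(y, x)) + 360) % 360
```

For example, Melbourne (-37.8136, 144.9631) to Sydney (-33.8688, 151.2093) is roughly 714 km at an initial bearing of about 54° (northeast).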
Applied Track
  • Place Association (reverse geocoding)
  • Name Disambiguation
  • Relative Position (knowledge-based)
  • Proximity & Nearest Neighbor
  • Route Analysis
  • Boundary Analysis
  • Spatial Patterns
  • Terrain Classification
  • Missing Data Inference
57,800 total samples · 17 tasks · 18,362 test samples · 14 models evaluated
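To make the Format Conversion task concrete: decimal degrees (DD) and degrees-minutes-seconds (DMS) are related by a simple base-60 conversion. A minimal sketch (the benchmark's exact output formats and rounding conventions may differ):

```python
def dd_to_dms(dd, is_lat):
    """Convert a decimal-degree value to a (deg, min, sec, hemisphere) tuple."""
    hemi = ("N" if dd >= 0 else "S") if is_lat else ("E" if dd >= 0 else "W")
    dd = abs(dd)
    deg = int(dd)
    minutes_full = (dd - deg) * 60
    minutes = int(minutes_full)
    seconds = (minutes_full - minutes) * 60
    return deg, minutes, round(seconds, 2), hemi

def dms_to_dd(deg, minutes, seconds, hemi):
    """Convert degrees-minutes-seconds back to signed decimal degrees."""
    dd = deg + minutes / 60 + seconds / 3600
    return -dd if hemi in ("S", "W") else dd
```

For example, latitude -37.8136 corresponds to 37°48'48.96"S, and converting back recovers the original value.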

Getting Started

git clone https://github.com/joey234/gpsbench.git
cd gpsbench
pip install -r requirements.txt
cp .env.example .env   # add your API keys

Run the benchmark:

# Evaluate all tasks
python run_benchmark.py --model gpt-4o --provider openai

# Specific track or task
python run_benchmark.py --model gpt-4o --track pure_gps
python run_benchmark.py --model gpt-4o --task distance_calculation

# Quick test (100 samples)
python run_benchmark.py --model gpt-4o --max-samples 100

Supported providers: openai, anthropic, google, openrouter

Citation

If you use GPSBench in your research, please cite:

@article{gpsbench2025,
  title   = {GPSBench: Do Large Language Models
             Understand GPS Coordinates?},
  author  = {Truong, Thinh Hung and Lau, Jey Han and Qi, Jianzhong},
  journal = {arXiv preprint arXiv:2602.16105},
  year    = {2025}
}

Leaderboard

Columns: # · Model · Provider · Overall % · Pure GPS % · Applied %

Track Results

Performance separated by Pure GPS (coordinate manipulation) vs Applied (geographic reasoning)

Pure GPS Track 9 tasks · 9,182 test samples

Applied Track 9 tasks · 9,180 test samples

Task-Level Results

Accuracy (%) per task per model. Darker green = higher accuracy.

Model Comparison

Per-task comparison of selected models across all 17 tasks