Large Language Models (LLMs) are increasingly deployed in applications that interact with the physical world, such as navigation, robotics, and mapping, making robust geospatial reasoning a critical capability. Yet LLMs' ability to reason about GPS coordinates and real-world geography remains underexplored. We introduce GPSBench, a dataset of 57,800 samples across 17 tasks for evaluating geospatial reasoning in LLMs, spanning geometric coordinate operations (e.g., distance and bearing computation) and reasoning that integrates coordinates with world knowledge. Focusing on intrinsic model capabilities rather than tool use, we evaluate 14 state-of-the-art LLMs and find that GPS reasoning remains challenging, with substantial variation across tasks: models are generally more reliable at real-world geographic reasoning than at geometric computations. Geographic knowledge degrades hierarchically, with strong country-level performance but weak city-level localization, while robustness to coordinate noise suggests genuine coordinate understanding rather than memorization. We further show that GPS-coordinate augmentation can improve performance on downstream geospatial tasks, and that finetuning induces trade-offs between gains in geometric computation and degradation in world knowledge.
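To make the geometric tasks concrete, here is a minimal sketch of how ground-truth distance and bearing between two GPS coordinates can be computed with the haversine formula and forward azimuth. This is an illustrative reference implementation, not GPSBench's actual code; function names and the spherical-Earth radius are our own assumptions.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points, assuming a spherical Earth."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing (forward azimuth) from point 1 to point 2, in degrees in [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    x = math.sin(dlmb) * math.cos(phi2)
    y = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb)
    return (math.degrees(math.atan2(x, y)) + 360.0) % 360.0
```

A model that genuinely understands coordinates should approximate both quantities directly from the latitude/longitude pairs, without external tools.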
Overall accuracy of 14 models across all 17 tasks (18,362 test samples)
GPSBench evaluates GPS reasoning across two complementary tracks and six capability categories
GPSBench is organized into two complementary evaluation tracks:

- **Pure GPS** — geometric coordinate operations (e.g., distance and bearing computation)
- **Applied** — reasoning that integrates coordinates with real-world geographic knowledge
```bash
git clone https://github.com/joey234/gpsbench.git
cd gpsbench
pip install -r requirements.txt
cp .env.example .env  # add your API keys
```

Run the benchmark:
```bash
# Evaluate all tasks
python run_benchmark.py --model gpt-4o --provider openai

# Specific track or task
python run_benchmark.py --model gpt-4o --track pure_gps
python run_benchmark.py --model gpt-4o --task distance_calculation

# Quick test (100 samples)
python run_benchmark.py --model gpt-4o --max-samples 100
```

Supported providers: `openai`, `anthropic`, `google`, `openrouter`
If you use GPSBench in your research, please cite:
```bibtex
@article{gpsbench2025,
  title   = {GPSBench: Do Large Language Models Understand GPS Coordinates?},
  author  = {Truong, Thinh Hung and Lau, Jey Han and Qi, Jianzhong},
  journal = {arXiv preprint arXiv:2602.16105},
  year    = {2025}
}
```
| # | Model | Provider | Overall % | Pure GPS % | Applied % |
|---|-------|----------|-----------|------------|-----------|
Performance separated by Pure GPS (coordinate manipulation) vs Applied (geographic reasoning)
Accuracy (%) per task per model. Darker green = higher accuracy.