
GPS-enabled Taxi Trajectory Data

Updated 22 February 2026
  • GPS-enabled taxi trajectory data are detailed spatio-temporal records capturing taxi IDs, timestamps, and travel status for comprehensive urban mobility analysis.
  • Preprocessing methods like noise filtering, trip segmentation, and map matching align raw GPS data with digital road networks to ensure robust analytical results.
  • These data support advanced applications including traffic state estimation, congestion forecasting, demand-supply modeling, and emission analysis while addressing modality biases.

GPS-enabled taxi trajectory data are comprehensive spatio-temporal records collected by in-vehicle GPS devices on urban taxi fleets. Each dataset comprises fine-grained vehicle locations, timestamps, operational status, and sometimes occupancy and associated metadata, enabling detailed analysis of urban mobility, transportation network performance, and associated phenomena such as congestion, demand estimation, and emission modeling. Such data, amassed at citywide scales over months or years, have become foundational in empirical human mobility research, urban informatics, and transportation system design.

1. Data Sources, Coverage, and Raw Schema

Recent datasets encompass entire regulated metropolitan taxi fleets, such as New York City's TLC yellow/green taxis and ride-hail vehicles (≈50,000 vehicles across 262 zones, 536 million trips, 2019–2021 (Jiang et al., 2024); 868 million records, 2009–2013 (Zhang et al., 2019)) or major Asian fleets (Beijing: ≈39,000 vehicles, 3 billion points in one month (Li et al., 2017); Shanghai: 2,600–7,000 taxis, 5–300 million points monthly). Logging intervals range from one point every 5–60 s (high-rate floating car data) to event-based OD records that capture only the pick-up and drop-off of each trip. Each record typically contains:

  • taxi or vehicle ID
  • timestamp (ISO 8601 string or Unix epoch seconds)
  • latitude, longitude (WGS84, positional accuracy 5–10 m open sky)
  • speed/heading (where available)
  • occupancy or service status flag
  • (sometimes) cumulative mileage, fare, battery state (for EV studies)
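
As an illustration of the record layout listed above, the following is a minimal sketch of a typed point record and a parser for one row of string fields. The class name, field names, column names (`unix_ts`, `speed_kmh`, and so on), and the "1" encoding of the occupancy flag are all hypothetical; real feeds differ by city and vendor.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class TaxiPoint:
    """One raw GPS fix. Field names are illustrative, not a published schema."""
    taxi_id: str
    timestamp: datetime                  # timezone-aware, UTC
    lat: float                           # WGS84 degrees
    lon: float                           # WGS84 degrees
    occupied: bool                       # occupancy / service status flag
    speed_kmh: Optional[float] = None    # absent in some feeds
    heading_deg: Optional[float] = None  # absent in some feeds


def parse_row(row: dict) -> TaxiPoint:
    """Build a TaxiPoint from a dict of strings (e.g. a csv.DictReader row)."""
    lat, lon = float(row["lat"]), float(row["lon"])
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError(f"invalid coordinates: {lat}, {lon}")
    # Empty strings and missing keys are both treated as "not available".
    speed = row.get("speed_kmh") or None
    heading = row.get("heading_deg") or None
    return TaxiPoint(
        taxi_id=row["taxi_id"],
        timestamp=datetime.fromtimestamp(int(row["unix_ts"]), tz=timezone.utc),
        lat=lat,
        lon=lon,
        occupied=row["occupied"] == "1",  # assumed encoding: "1" means occupied
        speed_kmh=float(speed) if speed is not None else None,
        heading_deg=float(heading) if heading is not None else None,
    )
```

Rejecting out-of-range coordinates at parse time is one possible design choice; a pipeline could equally keep such rows and drop them in a later cleaning stage.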

City regulatory agencies frequently aggregate records spatially, publishing only zone-level OD flows. Raw point-level datasets, when available, offer the basis for higher-level mobility reconstructions, traffic state inference, and detailed map-matching.

2. Preprocessing: Cleaning, Trip Segmentation, and Map Matching

GPS trajectory preprocessing pipelines typically apply the following steps before any analysis:

  • Noise Filtering: Removal of outlier points with implausible speeds (e.g., $v_i > 120$ km/h), missing fields, or invalid coordinates. Trips with zero length/duration or micro-trips below duration thresholds (e.g., $< 2$ min) are discarded (Li et al., 2017, Jiang et al., 2024).
  • Trip Segmentation: Continuous sequences of movement are delimited by dwell events ($v_i < v_{thresh}$ for $t > t_{thresh}$, e.g., $v_{thresh} = 3$ km/h, $t_{thresh} = 120$ s), identifying logical trips (Jiang et al., 2024).
  • Map Matching: Trajectory points are aligned to underlying digital road networks (e.g., OpenStreetMap) via Hidden Markov Model (HMM) algorithms (Newson & Krumm, 2009), exploiting emission and transition costs to robustly "snap" noisy GPS to a plausible path (Rathore et al., 2018, Jiang et al., 2024, Liu et al., 2018, Yang et al., 2014). Feature-rich CRF variants can further improve low-sampling-rate alignment (Yang et al., 2014).
  • Spatial and Temporal Aggregation: Data can be aggregated to administrative zones, grid cells (e.g., 100 m, 1 km), or street network edges for analysis at desired spatial/temporal resolutions (Liu et al., 2013, Jiang et al., 2024, Liu et al., 2018, Hu et al., 2016).
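
The noise-filtering and trip-segmentation steps above can be sketched as follows. This is a minimal illustration, not the pipeline of any cited study: it assumes each trace is a time-sorted list of `(unix_seconds, lat, lon)` tuples for one vehicle, computes speed from consecutive fixes with the haversine distance, and uses the example thresholds quoted above (120 km/h, 3 km/h, 120 s, 2 min) as defaults. The function names are hypothetical, and the filter trusts the first fix of each trace.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def filter_speed_outliers(points, v_max_kmh=120.0):
    """Drop fixes whose implied speed from the last kept fix exceeds v_max_kmh."""
    kept = []
    for t, lat, lon in points:
        if kept:
            t0, lat0, lon0 = kept[-1]
            dt = t - t0
            if dt <= 0:
                continue  # duplicate or out-of-order timestamp
            if haversine_m(lat0, lon0, lat, lon) / dt * 3.6 > v_max_kmh:
                continue  # implausible jump
        kept.append((t, lat, lon))
    return kept


def segment_trips(points, v_thresh_kmh=3.0, t_thresh_s=120.0, min_trip_s=120.0):
    """Split a cleaned trace (strictly increasing timestamps) at dwell events.

    A dwell is a run of consecutive steps slower than v_thresh_kmh that lasts
    longer than t_thresh_s. Trips shorter than min_trip_s are discarded.
    """
    trips, current = [], list(points[:1])
    slow_since = None  # index in `current` of the fix where the slow run began
    for prev, pt in zip(points, points[1:]):
        dt = pt[0] - prev[0]
        v_kmh = haversine_m(prev[1], prev[2], pt[1], pt[2]) / dt * 3.6
        if v_kmh < v_thresh_kmh:
            if slow_since is None:
                slow_since = len(current) - 1
        else:
            if slow_since is not None and prev[0] - current[slow_since][0] > t_thresh_s:
                # Dwell confirmed: close the trip where the vehicle stopped and
                # start the next one at the last fix of the dwell.
                trips.append(current[: slow_since + 1])
                current = [prev]
            slow_since = None
        current.append(pt)
    # A trace that ends inside a dwell is trimmed back to the dwell start.
    if slow_since is not None and current and current[-1][0] - current[slow_since][0] > t_thresh_s:
        current = current[: slow_since + 1]
    trips.append(current)
    return [t for t in trips if len(t) >= 2 and t[-1][0] - t[0][0] >= min_trip_s]
```

In practice the occupancy flag, where available, is often a more reliable trip delimiter than speed alone; the dwell rule shown here is the fallback for feeds that lack it.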

3. Mobility Metrics and Analytical Frameworks

GPS-enabled taxi traces support a comprehensive suite of human mobility, traffic, and urban structure analyses:

  • Individual Mobility: Radius of gyration ($r_g$), describing typical range of movement, and trip distance distributions (often log-normal) are estimated from segmented, map-matched trajectories (Jiang et al., 2024).
  • OD Flow Matrices: Counts $F_{ij}$ of trips from zone $i$ to zone $j$, normalized as relative frequencies $RF_{ij}$, are the basis for network flow and community detection (Jiang et al., 2024, Liu et al., 2013).
  • Traffic State Estimation: Urban grids or street segments are assigned instantaneous occupancy, speed, and flux via binning of map-matched trajectories. Multiscale approaches include grid-based coarse-grained aggregation (100–200 m cells (Hu et al., 2016, Liu et al., 2018, Liu et al., 2018)) and network-based assignment to street segments (Zhang et al., 2019).
  • Advanced Metrics: Further derived quantities include trip waiting time, inter-trip intervals, stay-point clusters (e.g., via DBSCAN), and congestion measures based on vector field projections or queueing models (Liu et al., 2018, Jiang et al., 2024, Zhang et al., 2019).
  • Environmental Impact: Estimation of grid-level emissions employs segmentwise emission factor models applied to map-matched trajectories, with extrapolation to the entire vehicle fleet via LPR fusion and Gaussian process regression (Liu et al., 2018).
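
Two of the metrics above, the radius of gyration and the OD relative frequencies, are simple enough to sketch directly. The code below is illustrative only and rests on two assumptions not stated in the text: positions are given in a projected coordinate system in metres, so plain Euclidean distance applies, and $RF_{ij}$ is obtained by dividing each count $F_{ij}$ by the total number of trips. The function names are hypothetical.

```python
import math
from collections import Counter


def radius_of_gyration(xy):
    """Root-mean-square distance of positions from their centroid.

    xy: non-empty list of (x, y) positions in metres (projected coordinates).
    """
    n = len(xy)
    cx = sum(x for x, _ in xy) / n
    cy = sum(y for _, y in xy) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in xy) / n)


def od_relative_frequencies(trips):
    """Map each (origin_zone, dest_zone) pair to its share of all trips.

    trips: iterable of (origin_zone, dest_zone) tuples, one per trip.
    """
    counts = Counter(trips)        # F_ij
    total = sum(counts.values())   # total number of trips
    return {od: c / total for od, c in counts.items()}  # RF_ij
```

For example, four visits at the corners of a 2 m square give a radius of gyration of $\sqrt{2}$ m, since every corner lies $\sqrt{2}$ m from the centroid.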

4. Clustering, Community Detection, and Urban Structure Recovery

Taxi OD flow networks enable algorithmic identification of urban structure:

  • Zone and Neighborhood Clustering: K-means on zone features (demographic, socioeconomic, commuting variables) or unsupervised community detection (Infomap, modularity maximization) on OD networks uncovers natural subregions reflecting mobility cohesion and underlying land use (Jiang et al., 2024, Liu et al., 2013).
  • Cluster Validation: The elbow method guides selection of the cluster count $K$; the silhouette score quantifies clustering quality (Jiang et al., 2024).
  • Hierarchical Polycentric Structure: Short-trip-dominated OD networks exhibit multiple stable mobility regions (e.g., “Level-One Zones”), with hubs corresponding to downtown business districts, suburban residential areas, and major transport nodes (Liu et al., 2013). Entropy, degree, and strength metrics further characterize diversity and centrality within clusters.
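
The degree, strength, and entropy metrics mentioned above can be computed per zone from an OD flow table. The sketch below assumes one common set of definitions, which may differ from those used in the cited work: out-degree is the number of distinct destinations, out-strength is the total outgoing trip count, and entropy is the Shannon entropy (in bits) of the outgoing flow shares. The function name is hypothetical.

```python
import math
from collections import defaultdict


def node_metrics(flows):
    """Per-origin (out_degree, out_strength, entropy_bits) from OD counts.

    flows: dict mapping (origin_zone, dest_zone) to a trip count.
    """
    out = defaultdict(dict)
    for (i, j), f in flows.items():
        if f > 0:  # zero flows carry no information and would break log2
            out[i][j] = f
    metrics = {}
    for i, dests in out.items():
        strength = sum(dests.values())
        entropy = -sum(
            (f / strength) * math.log2(f / strength) for f in dests.values()
        )
        metrics[i] = (len(dests), strength, entropy)
    return metrics
```

A zone whose trips split evenly between two destinations has an entropy of 1 bit, while a zone that sends all trips to a single destination has an entropy of 0, so higher values indicate more diverse destination choice.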

5. Comparative Datasets and Modal Bias: Taxi GPS vs. Mobile and Other Data

Taxi trajectory data afford high spatio-temporal granularity but are mobility-mode specific. Comparative studies with mobile phone-sourced products such as SafeGraph reveal key representativeness gaps (Jiang et al., 2024):

  • Overrepresentation: Taxi data over-sample flows among high-demand, centrally located, and airport-connected zones (e.g., Manhattan–Manhattan, Manhattan–airport) (Jiang et al., 2024).
  • Underrepresentation: Suburban, car-dependent, and blue-collar areas are comparatively under-sampled—SafeGraph data fill this gap but understate flows in pedestrian- and transit-dense neighborhoods (since these populations rarely enable GPS-tracking apps) (Jiang et al., 2024).
  • Relative Frequency Ratio (LRFR): $LRFR_{ij} = \log_2 \frac{RF_{ij}(\text{taxi})}{RF_{ij}(\text{SG})}$, where SG denotes SafeGraph, quantifies modal over/under-sampling for each OD pair, guiding cross-dataset validation (Jiang et al., 2024).
  • Sampling Rate Variation: Mobile datasets’ device sampling rates vary by cluster (~4–12% of residential population), with trip volume per device highest in car-owning suburbs (Jiang et al., 2024).
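
The LRFR comparison can be sketched in a few lines. The version below assumes both datasets have already been reduced to relative-frequency tables keyed by OD pair, and it simply skips pairs that are missing or zero in either table; how such pairs are treated in the cited study is not stated here, so that handling is an assumption. The function name is hypothetical.

```python
import math


def lrfr(rf_taxi, rf_sg):
    """Return {(i, j): log2(RF_taxi / RF_sg)} for OD pairs present in both.

    rf_taxi, rf_sg: dicts mapping (origin_zone, dest_zone) to a relative
    frequency. Pairs with a zero or missing value on either side are skipped
    because the log ratio is undefined for them.
    """
    result = {}
    for od, p_taxi in rf_taxi.items():
        p_sg = rf_sg.get(od, 0.0)
        if p_taxi > 0 and p_sg > 0:
            result[od] = math.log2(p_taxi / p_sg)
    return result
```

With base-2 logarithms, a value of +1 means the OD pair carries twice the share of trips in the taxi data as in the mobile data, 0 means equal shares, and −1 means half.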

6. Algorithmic Applications and Modeling: Forecasting, Congestion, Demand, and Supply

GPS-enabled taxi trajectories are the raw substrate for diverse computational models:

  • Travel Time and Traffic Forecasting: Neighbor-based baselines (matching historical OD pairs in grid cells with temporal scaling) outperform online routing APIs and segment-based estimators for trip time and speed (Wang et al., 2015). Coarse-grained cellular automata (CA) models fitted to grid-aggregated historical taxi data produce citywide predictive speed/flux fields (Hu et al., 2016).
  • Congestion and Queueing Analysis: Preprocessing GPS vectors into 3D kernel-smoothed vector fields enables near-real-time projection of “travel momentum” on POIs for congestion and net-influx diagnosis (VectorKD package) (Liu et al., 2018).
  • Dynamic Supply/Demand Inference: Pickup/dropoff patterns on street segments, paired with imputed vacant-taxi search trajectories, underpin nonstationary Poisson field models for demand, supply, and matching at fine spatial/temporal resolutions (Zhang et al., 2019). Game-theoretic models based on driver behavior and queueing theory quantify pickup rates and optimal search strategies.
  • Trajectory Prediction: Large-scale clustering (e.g., Traj-clusiVAT), coupled with first-order Markov models, enables both short- and long-term route prediction at scale (Rathore et al., 2018).
  • EV Simulation and Dispatching: Integration of GPS taximeter logs, vehicle SOC, and geographic deployment models supports electric fleet dispatch and charger demand forecasting (Li et al., 2017).
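
To make the first-order Markov idea behind the trajectory-prediction item concrete, the sketch below counts segment-to-segment transitions in map-matched routes and predicts by always following the most frequent successor. It deliberately omits the clustering stage of the cited approach, so it illustrates only the Markov component; the function names and the representation of a route as a list of road-segment IDs are assumptions.

```python
from collections import Counter, defaultdict


def fit_transitions(routes):
    """Count observed transitions between consecutive road segments.

    routes: iterable of sequences of road-segment IDs (map-matched trips).
    Returns a dict mapping each segment to a Counter of its successors.
    """
    trans = defaultdict(Counter)
    for route in routes:
        for a, b in zip(route, route[1:]):
            trans[a][b] += 1
    return trans


def predict_next(trans, segment):
    """Most frequent successor of `segment`, or None if none was observed."""
    successors = trans.get(segment)
    if not successors:
        return None
    return successors.most_common(1)[0][0]


def predict_route(trans, start, max_steps=10):
    """Greedy multi-step prediction: follow the most likely successor."""
    route = [start]
    for _ in range(max_steps):
        nxt = predict_next(trans, route[-1])
        if nxt is None or nxt in route:  # stop at dead ends and loops
            break
        route.append(nxt)
    return route
```

Short-term prediction corresponds to a single `predict_next` call, while `predict_route` chains those calls for a longer horizon; greedy chaining is the simplest option, and it compounds errors as the horizon grows.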

7. Significance, Limitations, and Integration

GPS-enabled taxi data are integral for high-resolution, mode-specific urban mobility analysis but exhibit significant modality bias—high fidelity in taxi-dominated or high-demand corridors but poor coverage in suburban, private-car, and pedestrian regions (Jiang et al., 2024). Cross-validation with mobile phone, LPR, and transit-ticket datasets is therefore essential. The choice of dataset should align with dominant local travel modes to avoid misrepresentation (Jiang et al., 2024). Data privacy, spatial sampling heterogeneity, and regulatory aggregation further constrain utility for certain analytic resolutions. Despite these caveats, the methodological frameworks developed for trajectory segmentation, map-matching, clustering, and statistical inference underpin a large fraction of contemporary urban mobility, emissions, and network flow studies using empirical big data.
