Applying Embedding-Based Retrieval to Airbnb Search

Published 11 Jan 2026 in cs.IR and cs.LG | (2601.06873v1)

Abstract: The goal of Airbnb search is to match guests with the ideal accommodation that fits their travel needs. This is a challenging problem, as popular search locations can have around a hundred thousand available homes, and guests themselves have a wide variety of preferences. Furthermore, the launch of new product features, such as flexible date search, significantly increased the number of eligible homes per search query. As such, there is a need for a sophisticated retrieval system which can provide high-quality candidates with low latency in a way that integrates with the overall ranking stack. This paper details our journey to build an efficient and high-quality retrieval system for Airbnb search. We describe the key unique challenges we encountered when implementing an Embedding-Based Retrieval (EBR) system for a two sided marketplace like Airbnb -- such as the dynamic nature of the inventory, a lengthy user funnel with multiple stages, and a variety of product surfaces. We cover unique insights when modeling the retrieval problem, how to build robust evaluation systems, and design choices for online serving. The EBR system was launched to production and powers several use-cases such as regular search, flexible date and promotional emails for marketing campaigns. The system demonstrated statistically-significant improvements in key metrics, such as booking conversion, via A/B testing.

Summary

The paper introduces a two-tower embedding-based retrieval model that significantly improves recall and reduces latency at scale.
The methodology uses trip-based sampling and hard negative mining to generate unbiased training data reflective of the complete user journey.
The system serves retrieval through an IVF-based ANN index; the best model reaches 87.0% replayed recall@100, and IVF adoption cut compute costs by 16%.

Embedding-Based Retrieval for Large-Scale Marketplace Search: The Airbnb Search Use Case

Introduction and Problem Context

Airbnb's search system operates under stringent scalability and relevance requirements due to massive candidate set sizes—even individual queries in high-traffic markets such as Paris or London may produce up to 100,000 eligible listings. The introduction of new user-facing features, including flexible date search and split stays, further inflates the candidate pool and intensifies demands on both compute and system architecture. Traditional architectures with heavyweight first-stage DNN rankers have become a bottleneck for candidate selection. Consequently, Airbnb shifted to Embedding-Based Retrieval (EBR) to reduce latency and infrastructure costs while maintaining or exceeding ranking quality (2601.06873).

Search System Architecture and Ranking Stack

Airbnb uses a sharded scatter-gather system in which ranking comprises three main phases: an initial candidate retrieval step, a first-stage DNN ranker, and a setwise re-ranker (Figure 1). The EBR model is responsible for generating a tractable candidate list for downstream high-capacity scoring.

Figure 1: Core ranking models in the Airbnb search ranking stack, each phase scoring an order of magnitude fewer results.

In this architecture, design trade-offs at each stage are guided by the exponential reduction in results—first-stage retrieval must operate at low latency on 10–100k listings, necessitating the use of ANN-based approaches.

Training Data Construction and Modeling

Search-Based vs. Trip-Based Sampling

A central technical challenge is the construction of unbiased, high-signal training data reflecting the protracted, multi-stage nature of the Airbnb user journey. Classic search-based sampling, which relies on impressions and the final booked listing for labeled training data, systematically omits early-stage interactions and induces temporal bias (Figure 2).

Figure 2: Example of search-based sampling—early exploratory user searches lacking the booked listing are discarded.

This is rectified with trip-based sampling, which aggregates all searches in a trip-consistent session based on shared parameters, incorporates negatives from all stages, and introduces auxiliary labels for diverse user actions (views, wishlists). This enables the creation of harder negatives, enhancing model robustness to semantically similar, plausible distractors (Figure 3 and Figure 4).

Figure 3: Trip-based sampling retains early searches, providing a more representative set of negatives for retrieval model contrastive learning.

Figure 4: Retrieval model training data pipeline leveraging trip-based sampling and hard negative selection.
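
As a rough illustration of trip-based sampling, the Python sketch below groups logged searches into trips and pairs the eventually booked listing with every other listing seen during the trip, keeping the user action as an auxiliary label. The log schema, the fields used in `trip_key`, and the function names are assumptions made for this example; they are not the paper's actual pipeline.

```python
from collections import defaultdict

# Hypothetical log schema: each search carries a user id, the trip parameters
# used for grouping, and the listings shown together with the user's action.
def trip_key(search):
    # Assumption: searches belong to one trip when user, location,
    # guest count, and length of stay all match.
    return (search["user"], search["location"], search["guests"], search["nights"])

def build_examples(searches, bookings):
    """Return (query_id, positive, negative, negative_action) tuples.

    `bookings` maps a trip key to the listing that was eventually booked.
    """
    trips = defaultdict(list)
    for s in searches:
        trips[trip_key(s)].append(s)

    examples = []
    for key, trip_searches in trips.items():
        booked = bookings.get(key)
        if booked is None:
            continue  # trips without a booking yield no positive label
        for s in trip_searches:
            # Every search in the trip is kept, even when the booked listing
            # never appeared in it (search-based sampling would drop it).
            for listing, action in s["results"].items():
                if listing != booked:
                    examples.append((s["query_id"], booked, listing, action))
    return examples

searches = [
    {"query_id": "q1", "user": "u1", "location": "Paris", "guests": 2, "nights": 3,
     "results": {"A": "impression", "B": "wishlist"}},
    {"query_id": "q2", "user": "u1", "location": "Paris", "guests": 2, "nights": 3,
     "results": {"C": "booking", "B": "view"}},
    {"query_id": "q3", "user": "u2", "location": "London", "guests": 1, "nights": 2,
     "results": {"D": "impression"}},
]
bookings = {("u1", "Paris", 2, 3): "C"}
examples = build_examples(searches, bookings)
```

Here the early search `q1` contributes two negatives even though the booked listing `C` was not shown in it. The recorded action (`wishlist`, `view`, `impression`) could then be used to upweight harder negatives during training.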

Two-Tower Model Architecture

The deployed EBR model is a symmetric two-tower architecture, with independent towers for query and listing features, unified via a similarity metric (dot product or Euclidean distance). Listing tower outputs are cached and recomputed offline daily, while query tower inference is performed online per request (Figure 5).

Figure 5: Two-tower EBR architecture; listing embeddings are precomputed offline.

Feature selection combines historical engagement signals with stable, engagement-independent features (amenities, location); the latter support cold-start listing retrieval and robust generalization.
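
The split between offline listing embeddings and online query inference can be sketched in a few lines of Python. This is a minimal toy, not the production model: each tower is a single random linear layer with a `tanh`, and all names, dimensions, and feature values are assumptions chosen for illustration.

```python
import math
import random

def make_tower(in_dim, out_dim, seed):
    # A single random linear layer stands in for a trained tower (assumption).
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

def embed(tower, features):
    # Linear projection followed by tanh.
    return [math.tanh(sum(w * x for w, x in zip(row, features))) for row in tower]

def neg_euclidean(a, b):
    # Higher is better, so the distance is negated to act as a similarity score.
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query_tower = make_tower(in_dim=3, out_dim=4, seed=0)
listing_tower = make_tower(in_dim=5, out_dim=4, seed=1)

# Offline: listing embeddings are computed in batch and cached.
listing_features = {
    "listing_a": [1.0, 0.0, 0.5, 0.2, 0.0],
    "listing_b": [0.0, 1.0, 0.1, 0.9, 1.0],
    "listing_c": [0.3, 0.3, 0.3, 0.3, 0.3],
}
listing_cache = {lid: embed(listing_tower, f) for lid, f in listing_features.items()}

# Online: only the query tower runs per request.
def retrieve(query_features, k):
    q = embed(query_tower, query_features)
    ranked = sorted(listing_cache,
                    key=lambda lid: neg_euclidean(q, listing_cache[lid]),
                    reverse=True)
    return ranked[:k]

top2 = retrieve([0.2, 0.7, 0.1], k=2)
```

Because the two towers only meet at the similarity function, the listing side can be refreshed on its own schedule without touching the online query path.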

Offline Evaluation and Metrics

Challenges in Offline Evaluation

Offline evaluation is confounded by the non-stationary nature of inventory. Standard historical logs are sparse and do not capture the full candidate universe at serving time, which impairs the generality of conventional recall-based metrics.

Traffic Replay Framework

To address these limitations, Airbnb implemented a traffic replay framework, wherein production search traffic is mirrored to an offline cluster. Here, full candidate sets are rescored using both the retrieval and first-stage models, enabling recall metrics (recall@K) with the first-stage DNN scores as pseudo-ground truth (Figure 6). This framework demonstrates high empirical correlation with online A/B conversion lifts.

Figure 6: Replay system for offline recall metric computation using the first-stage model as ground truth.
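
One plausible way to compute a replay recall metric is sketched below in Python: the first-stage model's top results over the full candidate set are treated as pseudo-ground truth, and recall is the fraction of them that the retrieval model keeps in its own top K. The exact definition used in the paper may differ; the function names and scores here are hypothetical.

```python
def top_k(scores, k):
    # Ids of the k highest-scoring listings; ties broken by id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [lid for lid, _ in ranked[:k]]

def replay_recall_at_k(retrieval_scores, first_stage_scores, k_retrieve, k_truth):
    """Share of the first-stage top `k_truth` found in the retrieval top `k_retrieve`."""
    truth = set(top_k(first_stage_scores, k_truth))
    retrieved = set(top_k(retrieval_scores, k_retrieve))
    return len(truth & retrieved) / len(truth)

# Both models score the same full candidate set for one replayed query.
first_stage = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.2, "e": 0.1}
retrieval = {"a": 0.5, "b": 0.1, "c": 0.6, "d": 0.4, "e": 0.0}

recall = replay_recall_at_k(retrieval, first_stage, k_retrieve=3, k_truth=3)
```

In this toy query the retrieval model keeps `a` and `c` but misses `b`, so two of the three pseudo-positive listings are recalled. Averaging this value over many replayed queries would give a single offline number to compare against A/B results.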

ANN System Design and Serving

Inverted File Index (IVF) vs. HNSW

ANN candidate generation is critical for low-latency retrieval over million-scale datasets. HNSW exhibited high memory overhead and suboptimal compatibility with filter-intensive (e.g., geospatial) queries and high-velocity real-time listing updates, leading to challenges in maintaining index consistency under production workloads. In contrast, IVF supported efficient filtering and daily cluster recomputation by design, albeit at a modest recall penalty.

IVF Clustering and Similarity Function

IVF partitions the listing embedding space into $k$-means clusters, whose centroids and assignment IDs are persisted in the search index (Figure 7). Offline simulation tuned the nprobes hyperparameter, balancing recall and latency. A key finding was that Euclidean distance delivered significantly more balanced clusters than dot product, reducing the nprobes required for high-recall retrieval (Figure 8).

Figure 7: Full serving stack with daily listing clustering and IVF-based indexing.

Figure 8: Cluster size distribution—Euclidean distance outperforms dot product by avoiding outlier “mega-clusters.”
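
To make the role of nprobes concrete, the following pure-Python sketch builds a toy IVF index with Euclidean k-means and compares a search that probes every cluster against one that probes only two. It is an illustrative stand-in under stated assumptions (random embeddings, 10 clusters, hypothetical function names), not the FAISS- or Lucene-backed system the page mentions elsewhere.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=10, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda c: dist(p, centroids[c]))].append(p)
        # Recompute each centroid as the mean of its members; keep it if empty.
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

def build_ivf(embeddings, centroids):
    # Inverted file: cluster id -> listing ids assigned to that centroid.
    index = {c: [] for c in range(len(centroids))}
    for lid, vec in embeddings.items():
        index[min(index, key=lambda c: dist(vec, centroids[c]))].append(lid)
    return index

def ivf_search(query, embeddings, centroids, index, nprobes, k):
    # Probe only the nprobes clusters whose centroids are closest to the query.
    probed = sorted(index, key=lambda c: dist(query, centroids[c]))[:nprobes]
    candidates = [lid for c in probed for lid in index[c]]
    return sorted(candidates, key=lambda lid: dist(query, embeddings[lid]))[:k]

rng = random.Random(42)
embeddings = {f"listing_{i}": [rng.gauss(0, 1) for _ in range(8)] for i in range(200)}
centroids = kmeans(list(embeddings.values()), k=10)
index = build_ivf(embeddings, centroids)
query = [rng.gauss(0, 1) for _ in range(8)]

exact = ivf_search(query, embeddings, centroids, index, nprobes=10, k=10)  # all clusters
approx = ivf_search(query, embeddings, centroids, index, nprobes=2, k=10)
recall = len(set(exact) & set(approx)) / len(exact)
cluster_sizes = sorted(len(v) for v in index.values())
```

Sweeping `nprobes` from 1 up to the number of clusters and recording `recall` against the work done per query gives the kind of recall versus latency curve that an offline simulation would tune against. `cluster_sizes` is the quantity to watch for imbalance: a few very large clusters force either more probes or more scanning per probe.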

Application to Multiple Product Surfaces

EBR’s generalizability was demonstrated with two novel product cases:

Flexible Date Search: Scaling candidate retrieval to enumerate all feasible date ranges required a compact, efficiently refreshable availability/pricing data structure, which leveraged EBR in the retrieval phase and performed post-filtering for up-to-date constraints.
Promotional Email Campaigns: Batch queries for email recommendations used EBR as an efficient, context-aware retrieval backbone across arbitrary filters (price, geography).
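
The summary does not describe the internal layout of the availability/pricing structure, so the sketch below is only one possible design: a per-listing bitmask in which bit `d` marks day offset `d` as free. It shows how a flexible-date check ("any stay of N nights inside this window") can be answered without issuing one query per date range. All names are hypothetical, and pricing is left out for brevity.

```python
def availability_mask(available_days):
    # Bit d is set when the listing is free on day offset d from today.
    mask = 0
    for d in available_days:
        mask |= 1 << d
    return mask

def has_stay(mask, window_start, window_end, nights):
    """True if `nights` consecutive free days fit inside [window_start, window_end]."""
    run = (1 << nights) - 1
    for start in range(window_start, window_end - nights + 2):
        if ((mask >> start) & run) == run:
            return True
    return False

def block_day(mask, day):
    # Refresh path: a new booking clears a single bit.
    return mask & ~(1 << day)

# Listing is free on days 0-2 and 5-8.
mask = availability_mask([0, 1, 2, 5, 6, 7, 8])

fits_three = has_stay(mask, 0, 9, 3)
fits_four_late = has_stay(mask, 3, 9, 4)
fits_five = has_stay(mask, 0, 9, 5)

# After day 6 is booked, the four-night stay no longer fits.
updated = block_day(mask, 6)
fits_four_after_update = has_stay(updated, 3, 9, 4)
```

A structure like this is cheap to update in place, which matches the "efficiently refreshable" requirement, while a final post-filter against live data can still catch changes that landed after the last refresh.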

Experimental Results

Offline

Trip-based sampling and hard negative mining (EBR V3) sharply increased recall (from 53.3% to 93.4% on offline logs and from 40.5% to 87.0% on replayed recall@100), compared to the legacy and early EBR baselines.

Online

Cumulative conversion gain across EBR iterations was +0.31%, a highly statistically significant lift in a mature, high-base-rate production search. Compute costs decreased by 16% with IVF adoption. The system also increased bookings of new listings and those from wishlists, indicating improved surface coverage in early-stage user journeys and a reduction in historical engagement bias.

System Generalization and Future Directions

A modularized retrieval stack was implemented to support multiple candidate sources, with a strategy model under development for dynamic retrieval allocation per source type (Figure 9).

Figure 9: Generalized retrieval system with strategy layer overseeing multiple retrieval sources.
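
Since the strategy model is described as still under development, the Python sketch below only illustrates the surrounding mechanics: splitting a total retrieval budget across sources and merging their candidates without duplicates. Fixed integer weights stand in for whatever the strategy model would predict, and the source names and functions are hypothetical.

```python
def allocate_budget(weights, total_k):
    # Proportional split of total_k using integer arithmetic; leftover slots
    # go to the highest-weight sources first.
    total = sum(weights.values())
    budget = {s: total_k * w // total for s, w in weights.items()}
    leftover = total_k - sum(budget.values())
    for s in sorted(weights, key=weights.get, reverse=True)[:leftover]:
        budget[s] += 1
    return budget

def merge_sources(candidates_by_source, budget):
    # Take up to budget[source] new candidates from each source, in order.
    merged, seen = [], set()
    for source, candidates in candidates_by_source.items():
        taken = 0
        for lid in candidates:
            if taken == budget.get(source, 0):
                break
            if lid not in seen:
                seen.add(lid)
                merged.append(lid)
                taken += 1
    return merged

# Hypothetical sources and weights; a learned strategy model would set these per query.
weights = {"ebr": 7, "geo_intent": 2, "history": 1}
budget = allocate_budget(weights, total_k=12)

candidates_by_source = {
    "ebr": ["a", "b", "c"],
    "geo_intent": ["b", "d", "e"],
    "history": ["a", "f"],
}
merged = merge_sources(candidates_by_source, budget)
```

Replacing the static `weights` with a per-query prediction is the part a strategy model would own; the allocation and merge steps can stay the same.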

Planned model innovations include user context-aware retrieval (e.g., leveraging prior clicks or wishlist additions) and geographically intent-aware models for broad queries.

Conclusion

Airbnb's deployment of EBR in search addresses both engineering and modeling requirements for large-scale, dynamic, highly filtered retrieval settings. The technical advances include unbiased, journey-reflective training data, robust offline evaluation scalable to online conditions, and an ANN search stack adaptable to marketplace inventory volatility and context-specific filters. The approach yields strong lifts in recall, conversion, and system efficiency, and provides a scalable blueprint for other two-sided marketplaces confronting similar challenges. The research also surfaces open avenues in retrieval architecture—particularly strategy models for source mixing and deeper integration of structured user interaction data (2601.06873).

Explain it Like I'm 14

What this paper is about (in simple terms)

Airbnb needs to quickly show each guest a short list of great homes from a huge pool of options. That got much harder when new features (like flexible dates) made many more homes “eligible” for each search. This paper explains how Airbnb built a faster “retrieval” system using embeddings (think: smart numerical fingerprints) to find the best candidates quickly, so the heavier ranking models can focus on a smaller set. The new system made search faster and more accurate, and it also helped with features like flexible date search and marketing emails.

What questions the researchers wanted to answer

How can we quickly pick high‑quality homes from tens of thousands of choices for each search?
How do we train the system to understand a person’s whole trip‑planning journey (which can take days or weeks), not just one search?
How can we evaluate new ideas offline (without slowing the live site), even though listings and their availability change constantly?
Which fast search technology should we use to balance speed, accuracy, filters (like map area), and real‑time updates?

How they approached the problem (methods, with plain‑language explanations)

The team redesigned the “first step” of search—called retrieval—so it quickly fetches the most promising homes.

Embeddings: Imagine turning every home and every search into a short list of numbers (a “fingerprint”). If a search and a home have fingerprints that are close together, that home is likely a good match. This is called Embedding‑Based Retrieval (EBR).
Two‑tower model: Think of two small neural networks (little brains). One learns to make fingerprints for searches (the “query tower”). The other learns to make fingerprints for homes (the “listing tower”). During search, the system compares the two fingerprints using a simple similarity score (e.g., how close they are).
Smart training data: People don’t book after one search. They explore, click, save to wishlists, and then book later. Earlier training methods only used searches where the booked home actually appeared, which favored the end of the journey. The new “trip‑based” method groups all the user’s related searches for a trip, includes earlier searches, and builds tougher comparisons:
- Positives: the home they eventually booked.
- Negatives: homes they saw, clicked, or wishlisted but didn’t book. Including these “hard negatives” teaches the model what “almost right” looks like, so it learns more.
Evaluating models offline (without hurting the live site):
- Log sampling: They added small random samples of “unshown” homes to the logs to simulate a bigger choice set.
- Traffic replay: They copied a tiny slice of real searches to a separate offline cluster and scored all eligible homes with both the fast retrieval model and the slow, accurate ranking model. Then they checked “recall” (how many of the good homes the retrieval step successfully surfaced). This predicted real A/B test results well.
Making it fast in production:
- ANN (Approximate Nearest Neighbor) search: A family of methods that quickly find the closest fingerprints without checking every single home.
- IVF vs. HNSW: The team tried two popular ANN approaches.
  - HNSW struggled with Airbnb’s needs: lots of real‑time updates (prices and availability change constantly) and heavy use of filters (like map regions).
  - IVF (Inverted File Index) fit better with their system, used less memory, and played nicely with filters.
- Clustering and “nprobes”: They cluster home fingerprints daily (with k‑means). At search time, they only look inside a few nearby clusters (“nprobes”). More probes = better accuracy but slower. They tuned this to balance speed and quality.
- Similarity choice matters: Using Euclidean distance (measuring straight‑line closeness between fingerprints) made clusters more balanced than dot product, so they could search fewer clusters and still get good results.
Extra features supported:
- Flexible Date Search: Instead of issuing lots of separate date queries, they built a compact, updatable data structure of availability and price so the retrieval step stays fast.
- Promotional emails: Other teams can ask the retrieval system for “similar homes near X, in this price range, over a wide date window,” improving email recommendations.

What they found and why it matters

Here are the main results:

Better retrieval quality:
- Moving from the old system to EBR greatly improved “recall” (finding more of the homes the heavy ranking model or the guest would consider good).
- Their strongest version (V3: bigger model, pairwise training, Euclidean distance) had the best recall in both offline tests and replayed traffic.
Real impact in A/B tests:
- Together, the EBR launches increased booking conversion by about +0.31%.
- Using IVF reduced compute costs by ~16% versus the old system.
- The system increased bookings of new listings (good for fairness and freshness) and increased bookings from wishlisted homes (suggesting it surfaces useful options earlier in a user’s journey).
- Promotional emails saw +2.3% bookings.

Why this matters:

Guests get better matches faster, especially in crowded places or with flexible dates.
Hosts with newer listings get a fairer shot at being seen.
Airbnb runs more efficiently (lower compute costs) and can power more features.

What this could mean going forward (implications)

Smarter, earlier suggestions: Because the model understands the full trip journey (not just the last search), guests can discover good choices sooner.
Broader platform use: The same retrieval layer can power search, flexible dates, and marketing—one solid foundation for many features.
Room to grow: The team plans new retrieval models (like using your past interactions, or understanding broad geo-intent) and a strategy layer that dynamically decides how many results to pull from each model. That could make search even faster, fairer, and more personalized.

Key idea to remember

By teaching a fast “finder” to use smart fingerprints and realistic training data that mirrors how people actually plan trips, Airbnb can quickly grab a high‑quality set of homes for deeper ranking—making search both faster and better for everyone.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, focused list of what remains missing, uncertain, or unexplored in the paper, written to be concrete and actionable for future research.

Freshness of listing embeddings: Quantify the impact of daily offline embedding computation (stale price/availability) on retrieval quality, and evaluate streaming or incremental update strategies for listing embeddings with real-time features.
Reliance on first-stage model as offline “ground truth”: Assess bias introduced by treating the first-stage model as the proxy for positives in replay evaluation; develop offline metrics that better correlate with booking conversion, user satisfaction, and marketplace health.
Negative sampling bias and label noise: Measure false-negative rates when sampling from impressed/viewed/wishlisted pools, and design de-biasing or causal labeling (e.g., inverse propensity weighting, counterfactual inference) to reduce exposure and intent confounds.
Static nprobes tuning: Explore adaptive, per-query nprobes selection that responds to filter selectivity, eligible set size, and query difficulty to balance recall and latency/SLOs.
ANN choice and hybrid designs: Investigate whether filter-aware HNSW, IVF+HNSW hybrids, PQ-enhanced IVF, or GPU-based ANN can recover recall while meeting memory, update throughput (~10k/s), and filter constraints.
Embedding geometry alignment: Provide formal/empirical analysis of Euclidean vs dot-product performance with IVF; test L2-normalization, cosine similarity, whitening, or training-time regularization to produce balanced clusters without sacrificing retrieval quality.
Trip-based grouping sufficiency: Evaluate whether grouping by location/guests/length-of-stay misses key context (dates, seasonality, price bands, device/surface), and quantify effects on early-journey retrieval generalization.
Limited query features: Systematically study the trade-offs of richer query signals (user/session history, personalization, natural-language intent) versus inference cost; prototype lightweight personalization in the query tower.
Model class exploration: Compare two-tower to late-interaction (e.g., ColBERT), cross-encoders for reranking within retrieval budgets, and multi-modal embeddings (photos, text, reviews) to capture semantic relevance beyond tabular features.
End-to-end optimization: Test joint or cascaded training where retrieval embeddings are optimized with downstream ranking objectives (conversion/diversity) or via differentiable retrieval, and measure end-to-end gains.
Offline logging approximations: Develop scalable approximations (reservoir sampling, sketching) to better represent full eligible sets offline and reduce discrepancies between logs and true retrieval distributions.
Real-time update robustness: Quantify indexing delay tolerance, consistency guarantees, and rollback strategies under high QPS updates; compare daily k-means reclustering to incremental/online clustering for IVF stability.
Filter-aware retrieval behavior: Characterize recall degradation under highly discriminative multi-filter conjunctions; design ANN search that is intrinsically filter-aware (e.g., constrained ANN, multi-index fusion).
Cold-start evaluation: Provide dedicated metrics and benchmarks for new listings; test feature augmentation or meta-learning approaches that ensure exposure equity without engagement history.
Fairness and marketplace impacts: Measure and mitigate disparities across regions, price segments, property types, and host cohorts; define and optimize multi-objective retrieval that balances guest utility with supply fairness.
Diversity at retrieval: Determine whether retrieval over-concentrates near-duplicates; introduce diversity-aware objectives (e.g., MMR-style constraints) at retrieval to reduce downstream burden on setwise reranking.
Latency and tail performance: Report P50/P95/P99 retrieval latencies per shard and under peak QPS, and design SLO-aware throttling, batching, and caching strategies to control tail behavior.
Caching strategies: Investigate caching/reuse of query embeddings and candidate sets for repeated queries, map pans, and session continuity; quantify compute savings and staleness trade-offs.
Flexible Date Search precompute: Optimize lookahead horizon, storage/update cadence, and accuracy; extend retrieval-time availability checks to split stays and joint constraints across date combinations.
Adversarial robustness: Evaluate susceptibility to host feature manipulation (gaming) and design adversarial defenses, anomaly detection, and integrity checks in embedding generation and retrieval.
Internationalization and query understanding: Integrate multilingual/natural-language query embeddings, synonym/translation handling, and place disambiguation beyond placeId for text-heavy or ambiguous queries.
Cross-surface generalization: Measure retrieval quality and conversion impacts across web/mobile, map vs list views, and marketing surfaces; develop surface-specific retrieval strategies and knobs.
Ablation transparency in V3: Disentangle the contributions of added features, network size (+20%), pairwise loss, and Euclidean metric via controlled ablations to guide future model design.
Metric alignment and causality: Build causal evaluation frameworks linking retrieval changes to bookings, retention, and long-term user value; verify that recall@K gains consistently translate to business outcomes.
Operational resilience: Document and automate monitoring for cluster imbalance, embedding drift, pipeline failures, and index inconsistencies; implement rollback/playbook and alerting for serving stability.

Practical Applications

Immediate Applications

Below are actionable applications that can be deployed now based on the paper’s methods, findings, and systems, along with sectors, likely tools/workflows, and key assumptions/dependencies for feasibility.

Two-tower EBR candidate retrieval for large-catalog search
- Sectors: travel/hospitality, e-commerce/marketplaces, real estate, job boards, classifieds, media/content
- Tools/workflows: two-tower model training with trip-based negative sampling; FAISS IVF integrated with Apache Lucene filters; daily offline embedding generation and k-means clustering; recall@K proxy evaluation against a first-stage model
- Assumptions/dependencies: multi-stage user logs with action signals (impressions, clicks, wishlists/saves); feature store and daily batch pipelines; online inference infra; privacy-compliant logging; sharded scatter–gather serving
Trip-based training data construction with action-weighted hard negatives
- Sectors: recommender systems across software, retail, media, and job matching
- Tools/workflows: session/trip attribution pipeline; negative sampler that upsamples engaged items (clicked/saved) versus non-engaged impressions; contrastive/pairwise loss training
- Assumptions/dependencies: reliable user-journey reconstruction across weeks; action taxonomy and logging quality; de-identified, privacy-compliant data governance
Offline “traffic replay” evaluation to predict A/B performance
- Sectors: search and recommendation platforms; ad tech; fintech offers; content feeds
- Tools/workflows: fork a slice of production traffic to an offline cluster; re-score full eligible sets with first-stage model as ground truth; compute retrieval recall@K; establish correlation to online wins
- Assumptions/dependencies: offline compute pool; deterministic feature materialization; robust replay fidelity; metrics governance; limited exposure to PII
IVF-based ANN retrieval with Euclidean distance and nprobes tuning
- Sectors: search infra vendors, cloud information retrieval, large marketplaces
- Tools/workflows: FAISS IVF indices; Euclidean similarity to produce balanced clusters; offline nprobes sweep to balance latency/recall; Lucene filter integration; monitoring for cluster size imbalance
- Assumptions/dependencies: compatibility with existing filter stack; stable daily clustering jobs; index size/memory budgets; alerting/telemetry on index quality
Real-time filter integration at retrieval time (geo and attribute filters)
- Sectors: travel/hospitality, real estate, retail product search
- Tools/workflows: IVF cluster IDs stored in the main search index; Lucene/Elasticsearch-style filters for geography, price, amenities; parallel retrieval + filtering
- Assumptions/dependencies: high-QPS index updates; fast filter execution; consistent embedding-to-listing join keys
Flexible date retrieval with precomputed availability/price cache
- Sectors: travel/hospitality; healthcare appointment booking (analogue); event ticketing
- Tools/workflows: compact date-window availability/price data structure with near-real-time updates; retrieval-aware prefiltering on availability; final post-filter refresh for accuracy
- Assumptions/dependencies: predictable lookahead window; fast incremental updates; cache coherence; edge cases for last-minute changes
Batch and triggered marketing retrieval (promotional emails and alerts)
- Sectors: marketing tech across travel, commerce, fintech
- Tools/workflows: batch retrieval API that combines EBR similarity and filters (e.g., “similar to X and within 10 miles”); campaign orchestration; lift measurement
- Assumptions/dependencies: user consent and compliance for marketing; stable batch compute; deduplication, frequency capping, and relevance thresholds
Cold-start-friendly retrieval with engagement-independent features
- Sectors: any marketplace onboarding new inventory (e.g., new sellers/providers)
- Tools/workflows: robust listing tower using amenities, location embeddings (e.g., S2Cells), capacity/attributes; dampened reliance on past engagement; offline daily embedding refresh
- Assumptions/dependencies: quality metadata coverage; geospatial encoding; schema governance to avoid sparsity
Compute cost reduction via retrieval-stage pruning
- Sectors: cloud-native search platforms; cost-sensitive infra teams
- Tools/workflows: EBR retrieval to shrink candidate set K before heavyweight ranking; IVF to reduce inference and I/O; per-segment nprobes tuning
- Assumptions/dependencies: end-to-end profiling; service-level error budgets; stable K/N/T limits coordination across stack
Monitoring and diagnostics for ANN indices and retrieval quality
- Sectors: platform/SRE teams running search at scale
- Tools/workflows: dashboards for cluster size distribution, recall vs. true KNN, latency vs. nprobes, index growth vs. update QPS; alarms for skew/imbalance
- Assumptions/dependencies: observability stack (metrics, logs, traces); synthetic probes; retained historical baselines
Practitioner education and curriculum modules in IR and RecSys
- Sectors: academia, professional training
- Tools/workflows: lab assignments on two-tower EBR, traffic replay, IVF/HNSW tradeoffs, trip-based negatives; student-friendly datasets simulating multi-stage journeys
- Assumptions/dependencies: sharable synthetic datasets; institutional review for data ethics; compute credits or local alternatives
Fairness/exposure reporting for new items
- Sectors: marketplaces, policy/compliance within platforms
- Tools/workflows: reporting that tracks exposure/clicks/bookings for new inventory pre/post EBR; guardrails to prevent popularity bias; audit-ready dashboards
- Assumptions/dependencies: standardized definitions (“new” listings/providers); variance-aware measurement; cross-team alignment on fairness KPIs

Long-Term Applications

These use cases likely require additional research, scaling, or system changes before broad deployment.

Strategy layer to allocate retrieval budgets across multiple sources
- Sectors: large-scale rankers across commerce, travel, media
- Tools/workflows: meta-model that dynamically assigns K per retrieval source (e.g., geo-intent, history-aware, diversity sources) based on query/user context; online learning from downstream metrics
- Assumptions/dependencies: multi-source retrieval infrastructure; low-latency arbitration; online bandit/ML governance
Retrieval with user’s past context and sequence modeling
- Sectors: personalization across marketplaces, media, fintech offers
- Tools/workflows: history-aware query tower that encodes recent interactions (views/saves); session-aware embeddings; contrastive training with temporal negatives
- Assumptions/dependencies: privacy-safe user histories; recency decay design; on-policy/off-policy evaluation methods
Geo-intent retrieval for broad-area queries
- Sectors: travel planning, real estate search, local commerce
- Tools/workflows: sub-area (city/region) intent prediction; multi-granularity geo embeddings; hierarchical retrieval over regions then listings
- Assumptions/dependencies: high-quality geo taxonomies; map interaction signals; feedback loops for region-level recall/precision
Incorporating real-time listing features directly into the listing tower
- Sectors: any domain with rapidly changing inventory attributes (price, availability)
- Tools/workflows: nearline feature store + streaming transformations; partial embedding refresh or hybrid towers (static + delta features); fast index updates
- Assumptions/dependencies: feature freshness SLAs; versioning and drift control; cost of frequent re-embeddings
Hybrid ANN designs for high-update, high-filter workloads
- Sectors: search infra providers and in-house platform teams
- Tools/workflows: IVF+PQ/HNSW hybrids, filter-aware ANN; shard-local caching of hot clusters; adaptive index maintenance
- Assumptions/dependencies: rigorous benchmarks on update QPS; complexity of ops; fallback mechanisms under skew
End-to-end optimization across retrieval, first-stage, and reranker
- Sectors: advanced ranking stacks in large platforms
- Tools/workflows: joint training/objectives across stages; distillation from reranker to retrieval; counterfactual evaluation and constraints (latency, diversity, fairness)
- Assumptions/dependencies: common feature/label schema; stable training pipelines; careful online rollout to manage risk
Dynamic nprobes and cost-aware serving
- Sectors: cost/perf-optimized search platforms
- Tools/workflows: per-query adaptive nprobes control using predicted headroom/latency/recall; multi-armed bandit for cost–quality tradeoffs; SLO-aware controllers
- Assumptions/dependencies: real-time latency predictors; guardrails for tail latency; robust telemetry
Privacy-preserving training and evaluation for multi-stage journeys
- Sectors: healthcare scheduling, finance, HR tech (sensitive logs)
- Tools/workflows: differential privacy for action logs; secure enclaves and federated learning for user-journey features; PII-minimized replay frameworks
- Assumptions/dependencies: legal/compliance alignment; performance tradeoffs under privacy noise; consent management
Cross-domain adoption patterns (healthcare, education, public sector)
- Healthcare: provider/appointment matching with flexible date windows (analogue to precomputed availability)
- Education: course/program matching with session-aware retrieval and schedule flexibility
- Public sector/policy portals: citizen–service matching across regions with filter-heavy queries
- Tools/workflows: domain-specific listing metadata schemas; availability caches; compliance-grade logging and auditability
- Assumptions/dependencies: domain ontologies; regulated data handling; stakeholder governance
Market and policy frameworks for equitable exposure
- Sectors: marketplace governance, regulators, standards bodies
- Tools/workflows: standardized exposure/quality metrics; audit protocols for retrieval-stage bias; disclosure guidelines for algorithmic changes that affect supply-side outcomes
- Assumptions/dependencies: access to platform telemetry; cross-industry consensus on fairness definitions; impact assessments
“EBR-as-a-Service” platform offerings
- Sectors: cloud providers, ML platform vendors
- Tools/workflows: managed two-tower training, trip-based negative mining templates, traffic replay SDKs, IVF/Lucene integration, monitoring packs, recall@K against client-provided rankers
- Assumptions/dependencies: clean APIs to client indices and filters; secure data exchange; SLOs for updates and latency
Sustainability-driven search optimization
- Sectors: platform sustainability initiatives, green computing
- Tools/workflows: compute-aware retrieval pruning; carbon budget telemetry for indexing and serving; cost–carbon co-optimization of nprobes/K/N/T
- Assumptions/dependencies: carbon accounting at service level; organizational buy-in; accurate perf–carbon models
Standardized educational benchmarks for multi-stage IR
- Sectors: academia, research consortia
- Tools/workflows: public, synthetic datasets emulating multi-week journeys with action labels; evaluation kits for replay-style metrics (recall vs. stage-1 ground truth)
- Assumptions/dependencies: dataset curation and maintenance; community adoption; licensing and ethics review

Notes:

Many long-term items become immediate in organizations that already have offline replay clusters, privacy-safe logging, and mature feature stores.
Feasibility hinges on data quality (multi-stage journeys), infra maturity (batch + nearline), and governance (privacy, marketing consent, fairness KPIs).
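
The "Dynamic nprobes and cost-aware serving" item above can be made concrete with a minimal sketch. It assumes a linear latency model (a fixed overhead plus a constant cost per probed cluster) and picks the largest nprobes that fits a per-query latency budget, clamped to guardrail bounds. The names `ProbeBudget` and `choose_nprobes`, the linear model, and all numeric values are hypothetical illustrations, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeBudget:
    """Guardrails for nprobes (hypothetical values, not from the paper)."""
    min_probes: int = 4    # floor that protects recall
    max_probes: int = 64   # ceiling that protects tail latency


def choose_nprobes(latency_budget_ms, fixed_overhead_ms, cost_per_probe_ms,
                   budget=ProbeBudget()):
    """Return the largest nprobes whose predicted latency fits the budget.

    Assumes predicted latency = fixed_overhead_ms + nprobes * cost_per_probe_ms.
    """
    if cost_per_probe_ms <= 0:
        raise ValueError("cost_per_probe_ms must be positive")
    headroom_ms = latency_budget_ms - fixed_overhead_ms
    affordable = int(headroom_ms // cost_per_probe_ms)
    # Clamp so that a tight budget never drops below the recall floor
    # and a loose budget never exceeds the latency ceiling.
    return max(budget.min_probes, min(budget.max_probes, affordable))


# Example: a 20 ms budget with 4 ms overhead and 0.5 ms per probed cluster.
nprobes = choose_nprobes(20.0, 4.0, 0.5)
```

In practice the per-probe cost would come from a real-time latency predictor, and a bandit or SLO-aware controller could replace this fixed rule; the clamp plays the role of the guardrails listed under assumptions.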


Glossary

A/B testing: Controlled online experiments comparing alternative systems or features. "The system demonstrated statistically-significant improvements in key metrics, such as booking conversion, via A/B testing."
Apache Lucene: An open-source search library used for indexing and filtering in search systems. "allowing much lower memory usage and the ability to treat them as a normal filter in Apache Lucene"
Approximate Nearest Neighbor (ANN): Algorithms and data structures that quickly find near neighbors in high-dimensional spaces with acceptable approximation. "specialized ANN solutions such as Hierarchical Navigable Small Worlds (HNSW) and Inverted File Index (IVF)"
canonical placeId: A standardized identifier representing a geographic location for queries. "Key information such as the location (encoded via a canonical placeId and represented by a learned dense embedding)"
centroids: Representative points of clusters used in vector indices like IVF. "IVF only requires cluster IDs and centroids to be stored in the index at runtime"
cold start problem: The challenge of ranking or recommending new items with limited historical engagement data. "to mitigate the cold start problem and generalize across the diverse population of listings."
contrastive learning: A training approach that brings similar pairs closer and dissimilar pairs farther apart in embedding space. "construct hard negatives for contrastive learning along with practical trade-offs for model architectures and loss functions."
discriminative filters: Strong filtering criteria that sharply reduce the candidate set (e.g., geographic constraints). "which meant that discriminative filters were difficult to apply in parallel with retrieval without major system changes."
Embedding-Based Retrieval (EBR): Retrieval that represents queries and items as learned vectors in the same space and ranks by similarity. "leveraged Embedding-Based Retrieval (EBR) to represent both listings and search queries as numerical vectors in the same space."
Euclidean distance: A metric for measuring straight-line distance between embeddings, often used for similarity. "Specifically, even though dot product and Euclidean distance performed about the same in offline evaluation"
FAISS: A library for efficient similarity search and clustering of dense vectors. "We benchmarked several approaches using the FAISS library"
First-Stage Model: A heavyweight deep neural network producing relevance scores for listings after retrieval. "First-Stage Model (Leaf Service): A heavyweight DNN model which produces a single score per listing."
Flexible Date Search: A product feature that expands eligible results by allowing date flexibility. "A further challenge for our EBR system was scaling to support use cases such as Flexible Date Search"
ground truth: The reference labels or results treated as correct for evaluation. "compute recall metrics for a retrieval model using the first-stage model as the ground truth (runs in offline cluster)"
Hierarchical Navigable Small Worlds (HNSW): A graph-based ANN index enabling fast nearest-neighbor search. "specialized ANN solutions such as Hierarchical Navigable Small Worlds (HNSW) and Inverted File Index (IVF)"
inner product similarity: A similarity measure computed as the dot product between embeddings. "with a pointwise loss function and inner product similarity."
Inverted File Index (IVF): A clustering-based ANN method that searches a subset of clusters to retrieve candidates. "specialized ANN solutions such as Hierarchical Navigable Small Worlds (HNSW) and Inverted File Index (IVF)"
k-means clustering: An algorithm that partitions data into k clusters by minimizing within-cluster distances. "we introduced an offline k-means clustering stage in our offline data pipeline that ran daily"
KNN: Exact k-nearest neighbors retrieval, used as a benchmark against ANN. "Recall with IVF vs. True KNN for different similarity measures"
Leaf Service: A shard-level service that performs retrieval and ranking on its own shard; the Root Service calls all Leaf Services in parallel. "Each Leaf Service node performs retrieval and ranking and returns the top results to the Root Service"
NDCG: Normalized Discounted Cumulative Gain, a ranking metric emphasizing top positions. "We compute the recall (as opposed to NDCG which is better suited to evaluate the first-stage model)"
nprobes: An IVF parameter controlling how many clusters are probed during search. "One challenge was how to determine the value of an important parameter known as nprobes which defines the number of closest clusters (to the query embedding) to retrieve listings from."
pairwise loss function: A training loss that optimizes relative ordering of pairs of items. "with an updated EBR model which now includes more features, larger network size (20% more parameters) and a pairwise loss function with Euclidean distance."
pointwise loss function: A training loss that evaluates items independently rather than in pairs or lists. "with a pointwise loss function and inner product similarity."
recall@K: The proportion of relevant items retrieved within the top K results. "metrics obtained by processing such logs (such as recall@K) tend not to generalize well to a true online setting."
Root Service: The entry-point service that fans out requests to shards and aggregates results. "Each incoming search request first goes to the Root Service, which fans out to make parallel calls to each Leaf Service."
Setwise Re-Ranker Model: A model that jointly optimizes the ordering of a set of top results. "Setwise Re-Ranker Model (Root Service): A setwise ranking model to optimize the top $T$ results jointly"
sharded scatter-gather architecture: A distributed design that splits work across shards and then aggregates results. "Airbnb search uses a sharded scatter-gather architecture to scale horizontally."
S2Cell IDs: Hierarchical geospatial identifiers from the S2 geometry library used for location encoding. "S2Cell IDs of various resolutions"
Split Stays: A product feature that splits a trip across multiple listings to expand options. "Split Stays \cite{airbnb_split_stays_press_release} led to a significant increase in the number of eligible listings for a typical search query."
traffic replay framework: An offline system that replays production queries to evaluate models at scale. "we used a traffic replay framework to fork a small percentage of production search traffic to an offline cluster."
two-tower model: An architecture with separate networks for query and item embeddings combined by a similarity function. "we used two-tower model commonly seen in industry applications"
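
Several of the glossary terms (IVF, centroids, nprobes, KNN, recall@K) fit together in one small example. The sketch below builds inverted lists by assigning each vector to its closest centroid, searches only the nprobes closest clusters, and measures recall@K against exact KNN. It is an illustration in pure Python on synthetic data, not the production system: centroids are sampled from the data as a stand-in for the daily k-means stage, squared Euclidean distance is used for ranking, and every function name is hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance; ranks neighbors the same as Euclidean."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, centroids):
    """Map each cluster ID to the IDs of the vectors closest to its centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, v in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda c: sq_dist(v, centroids[c]))
        lists[nearest].append(vid)
    return lists


def ivf_search(query, vectors, centroids, lists, k, nprobes):
    """Search only the nprobes clusters whose centroids are closest to the query."""
    probed = sorted(range(len(centroids)),
                    key=lambda c: sq_dist(query, centroids[c]))[:nprobes]
    candidates = [vid for c in probed for vid in lists[c]]
    return sorted(candidates, key=lambda vid: sq_dist(query, vectors[vid]))[:k]


def exact_knn(query, vectors, k):
    """Brute-force k nearest neighbors, used as the reference result."""
    return sorted(range(len(vectors)),
                  key=lambda vid: sq_dist(query, vectors[vid]))[:k]


def recall_at_k(retrieved, ground_truth):
    """Fraction of ground-truth items that appear in the retrieved list."""
    return len(set(retrieved) & set(ground_truth)) / len(ground_truth)


# Synthetic demo: 500 random 8-dimensional vectors, 16 clusters.
rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(500)]
centroids = rng.sample(vectors, 16)  # stand-in for k-means centroids
lists = build_ivf(vectors, centroids)
query = [rng.gauss(0, 1) for _ in range(8)]
truth = exact_knn(query, vectors, k=10)

for nprobes in (1, 4, 16):
    approx = ivf_search(query, vectors, centroids, lists, k=10, nprobes=nprobes)
    print(f"nprobes={nprobes:2d}  recall@10={recall_at_k(approx, truth):.2f}")
```

Probing more clusters can only add candidates, so recall against exact KNN never decreases as nprobes grows, and probing all clusters reproduces the exact result. The cost is that each extra probe scans more vectors, which is the latency and recall trade-off that the nprobes entry describes.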
