- The paper introduces SLaM, a novel approach that compresses search queries into semantic embeddings using pre-trained language models.
- It presents CoSMo, a constrained probabilistic model that improves prediction accuracy by 30% for U.S. automobile sales and flu rates.
- Ablation studies underline the impact of incorporating search volume and regional adjustments for robust, zero-shot nowcasting of real-world events.
Compressing Search with LLMs: An Expert Overview
Introduction
The paper "Compressing Search with LLMs" by Thomas Mulc and Jennifer L. Steele addresses the pressing issue of effectively leveraging the vast amount of data generated through Google Search queries. Traditional approaches to making sense of this data often involve summarizing search terms into categories or filtering individual terms—both approaches have significant limitations. This paper introduces two novel contributions: SLaM Compression and CoSMo, a Constrained Search Model, to enhance the modeling and estimation of real-world events using search data.
Framework
SLaM Compression
SLaM (Search LLM Compression) offers an innovative method for quantifying and compressing search terms using pre-trained language models. The technique reduces the dimensionality of search data while preserving essential semantic information from the individual terms. SLaM generates "search embeddings" by mapping each search term to a fixed-length vector with an LLM and then aggregating these vectors. The aggregation can be a simple summation or a richer statistic, such as the marginal distributions of the embedding dimensions. This approach sidesteps the need for user-defined filters, yielding memory-efficient, high-information representations that are well-suited to downstream machine learning tasks.
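As a rough illustration of the idea, the pipeline can be sketched as follows. The `embed` function here is a hypothetical stand-in for a real pre-trained LLM embedding model (it derives a deterministic pseudo-embedding from the term's hash so the sketch runs without a model), and the volume-weighted mean is just one of the aggregation choices; these names and details are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def embed(term: str, dim: int = 8) -> np.ndarray:
    """Hypothetical stand-in for a pre-trained LLM's embedding of a search term.

    SLaM would use a real language model here; this sketch seeds a random
    generator from the term's hash purely so the example is self-contained.
    """
    rng = np.random.default_rng(abs(hash(term)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)  # unit-normalize

def slam_compress(terms: list[str], volumes: list[float]) -> np.ndarray:
    """Aggregate per-term embeddings into one fixed-length 'search embedding'.

    A volume-weighted mean is one simple aggregation; the paper also considers
    richer statistics such as marginal distributions of the embeddings.
    """
    E = np.stack([embed(t) for t in terms])   # (n_terms, dim)
    w = np.asarray(volumes, dtype=float)
    w = w / w.sum()                           # normalize search volumes to weights
    return w @ E                              # weighted average embedding, shape (dim,)
```

However many distinct terms the query log contains, the output stays a single fixed-length vector, which is the memory saving the compression delivers.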
CoSMo: Constrained Search Model
The second contribution, CoSMo, is a specialized model designed to consume the search embeddings produced by SLaM. CoSMo outputs a scalar between zero and one, interpreted as the probability of a target event occurring (e.g., a case of flu or an automobile sale). Notably, CoSMo incorporates inductive biases and constraints tailored to search data, allowing it to account for variations in search volume and regional differences while limiting overfitting.
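A minimal sketch of what such a constrained model might look like (an illustrative guess at the structure, not the paper's exact formulation): a linear score over the search embedding, scaled by log search volume and a per-region multiplier, squashed into (0, 1) by a sigmoid. The class name, parameterization, and default multiplier of 1 for unseen regions are all assumptions made for this sketch.

```python
import numpy as np

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))

class CoSMoSketch:
    """Illustrative constrained model over search embeddings.

    Inductive biases mimicked here: the output is bounded in (0, 1), search
    volume scales the score, and each region gets its own multiplier (learned
    in practice; initialized to 1 here, so unseen regions fall back to the
    unadjusted national model).
    """
    def __init__(self, dim: int, regions: list[str], seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w = rng.standard_normal(dim) * 0.1       # embedding weights
        self.b = 0.0                                  # bias term
        self.region_mult = {r: 1.0 for r in regions}  # regional adjustments

    def predict(self, embedding: np.ndarray, volume: float, region: str) -> float:
        score = self.w @ embedding + self.b
        score *= np.log1p(volume)                      # search-volume scaling
        score *= self.region_mult.get(region, 1.0)     # regional multiplier (default 1)
        return float(sigmoid(score))                   # probability-like output in (0, 1)
```

The fallback multiplier of 1 for regions absent from training is one plausible way a model of this shape could support the zero-shot regional inference described later.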
The overall efficacy of SLaM and CoSMo is evaluated through case studies in nowcasting U.S. automobile sales and flu rates. These applications highlight improvements in predictive power and provide a means to extract significant insights from search data.
Results and Implications
U.S. Automobile Sales
Using Google's vast search query logs, the authors demonstrate a 30% improvement in prediction accuracy for U.S. automobile sales over traditional classification-based methods. Specifically, the model achieves an R² of 0.91 with 3.03% MAPE at a monthly granularity—significantly higher than prior models that utilized coarser Google Trends data. This leap in accuracy is achieved despite the exclusion of other economic indicators or historical sales data. The model's efficacy in capturing consumer interest through detailed search embeddings illustrates a substantial advancement in nowcasting capabilities for economic activities.
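For reference, the two reported metrics follow their standard definitions, which can be computed as below (this is generic metric code, not code from the paper):

```python
import numpy as np

def r2(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mape(y_true, y_pred) -> float:
    """Mean absolute percentage error, in percent."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))
```

An R² of 0.91 means the model explains 91% of the variance in monthly sales, while a 3.03% MAPE means predictions are off by about 3% of the true value on average.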
U.S. Flu Rates
For flu rate prediction, the introduced models achieve strong correlations and low error rates. When benchmarked against existing models that incorporate lagged flu rate data (e.g., autoregressive models), CoSMo performs on par with or better than them, reaching a test MAPE as low as 3.9% and near-perfect correlation (r ≈ 1). This robustness, despite relying solely on search data, underscores the potential of LLM embeddings for improving public health surveillance through timely and accurate flu trend estimation.
Ablation Studies and Zero-Shot Inference
Ablation studies underscore the importance of search volume inclusion and regional multipliers in enhancing model fidelity. Moreover, the model exhibits surprising efficacy in zero-shot inference, accurately predicting state-level flu rates despite being trained only on national data. This adaptability points to the flexibility and generalizability of the proposed modeling approach, validating its application across various geolocated datasets without necessitating extensive retraining or region-specific customization.
Future Directions
The findings point to several promising avenues for future research. Integrating SLaM and CoSMo across different languages and regions can enable better global understanding of trends and events. Additionally, merging these methods with more sophisticated neural architectures or incorporating temporally adaptive models could further enhance predictive power. Beyond search engines, applying SLaM and CoSMo in different text-rich environments, such as social media or customer feedback, could unlock new insights and applications in both commercial and public sectors.
Conclusion
"Compressing Search with LLMs" presents a significant methodological shift in how search data is leveraged. By combining SLaM's efficient data compression with CoSMo's tailored probabilistic modeling, Mulc and Steele provide a robust framework for predicting real-world events with improved accuracy and interpretability. This dual framework enhances the utility of search data, enabling more meaningful integration into forecasting models while maintaining computational tractability. These contributions hold substantial promise for future applications in economic forecasting and public health, among other domains.