Optimal size, freshness and time-frame for voice search vocabulary

Published 31 Oct 2012 in cs.CL and cs.IR | (1210.8436v1)

Abstract: In this paper, we investigate how to optimize the vocabulary for a voice search language model. The metric we optimize over is the out-of-vocabulary (OoV) rate since it is a strong indicator of user experience. In a departure from the usual way of measuring OoV rates, web search logs allow us to compute the per-session OoV rate and thus estimate the percentage of users that experience a given OoV rate. Under very conservative text normalization, we find that a voice search vocabulary consisting of 2 to 2.5 million words extracted from 1 week of search query data will result in an aggregate OoV rate of 1%; at that size, the same OoV rate will also be experienced by 90% of users. The number of words included in the vocabulary is a stable indicator of the OoV rate. Altering the freshness of the vocabulary or the duration of the time window over which the training data is gathered does not significantly change the OoV rate. Surprisingly, a significantly larger vocabulary (approximately 10 million words) is required to guarantee OoV rates below 1% for 95% of the users.
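
The distinction the abstract draws between the aggregate OoV rate and the per-session OoV rate can be illustrated with a small Python sketch. This is not the authors' implementation: the function names, the whitespace tokenization, the frequency-based vocabulary cut-off, and the toy data are all assumptions made for illustration only.

```python
from collections import Counter

def build_vocabulary(training_queries, size):
    """Keep the `size` most frequent words (assumed selection rule)."""
    counts = Counter(w for query in training_queries for w in query.split())
    return {word for word, _ in counts.most_common(size)}

def aggregate_oov_rate(sessions, vocab):
    """Fraction of all test words, pooled over sessions, missing from vocab."""
    words = [w for session in sessions for query in session for w in query.split()]
    if not words:
        return 0.0
    return sum(w not in vocab for w in words) / len(words)

def per_session_oov_rates(sessions, vocab):
    """One OoV rate per non-empty session."""
    rates = []
    for session in sessions:
        words = [w for query in session for w in query.split()]
        if words:
            rates.append(sum(w not in vocab for w in words) / len(words))
    return rates

def fraction_of_sessions_at_or_below(rates, threshold):
    """Share of sessions whose OoV rate does not exceed `threshold`."""
    if not rates:
        return 0.0
    return sum(r <= threshold for r in rates) / len(rates)

# Toy data (hypothetical): each session is a list of query strings.
training = ["weather in paris", "weather in london", "paris hotels"]
sessions = [
    ["weather in paris", "paris hotels"],
    ["london weather"],
    ["weather in paris"],
]

vocab = build_vocabulary(training, size=3)
aggregate = aggregate_oov_rate(sessions, vocab)
rates = per_session_oov_rates(sessions, vocab)
share = fraction_of_sessions_at_or_below(rates, aggregate)
print(aggregate, rates, share)
```

In this toy example the pooled rate and the share of sessions at or below it are different numbers, which is why a vocabulary tuned only to the aggregate rate can still leave a sizeable fraction of sessions with a worse experience.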

Citations (1)

Authors (2)