Human-powered Sorts and Joins

Published 30 Sep 2011 in cs.DB | (1109.6881v1)

Abstract: Crowdsourcing markets like Amazon's Mechanical Turk (MTurk) make it possible to task people with small jobs, such as labeling images or looking up phone numbers, via a programmatic interface. MTurk tasks for processing datasets with humans are currently designed with significant reimplementation of common workflows and ad-hoc selection of parameters such as price to pay per task. We describe how we have integrated crowds into a declarative workflow engine called Qurk to reduce the burden on workflow designers. In this paper, we focus on how to use humans to compare items for sorting and joining data, two of the most common operations in DBMSs. We describe our basic query interface and the user interface of the tasks we post to MTurk. We also propose a number of optimizations, including task batching, replacing pairwise comparisons with numerical ratings, and pre-filtering tables before joining them, which dramatically reduce the overall cost of running sorts and joins on the crowd. In an experiment joining two sets of images, we reduce the overall cost from $67 in a naive implementation to about $3, without substantially affecting accuracy or latency. In an end-to-end experiment, we reduced cost by a factor of 14.5.




Summary

The paper develops Qurk, a workflow engine that integrates human computation with traditional databases for more efficient sorting and joining operations.
It employs innovative strategies like batching HITs and hybrid sorting to drive down cost, demonstrated by reducing join costs from $67 to about $3.
The research sets the stage for future human-in-the-loop systems, balancing subjective human input with deterministic machine processing for enhanced data handling.

Human-powered Sorts and Joins in Qurk

This paper explores the integration of human computation into traditional database management systems (DBMSs) through an innovative platform known as Qurk. Specifically, it investigates the implementation of human-assisted sorting and joining operations using crowdsourcing services like Amazon's Mechanical Turk (MTurk). The central aim is to address the challenges and inefficiencies inherent in designing workflows that involve human intelligence tasks (HITs).

In contrast to the deterministic nature of digital computing, humans introduce variability and error into database operations. The authors note that using humans for tasks like filtering, sorting, and joining brings subjectivity, error, and bias into the results, and that the time workers take to complete a task varies. Despite these challenges, the paper demonstrates that human-powered operations can be optimized to achieve accurate results cost-effectively.

Key Contributions

Declarative Workflow Engine (Qurk): The authors introduce Qurk, a system engineered to blend human computation and traditional database queries. This system abstracts the complexities of manually handling crowdsourced tasks and promises improvements in task parameterization and cost efficiency.
Join and Sort Implementations: A significant portion of the paper focuses on practical implementations for sorting and joining data using human intelligence via Qurk. The paper systematically explores a variety of ways to run these operations as HITs. For sorting, it discusses interfaces that ask users to either rank items visually or to provide numerical scores. For joins, methods range from simple pairwise comparisons to more sophisticated batching approaches that bundle tasks together.
Optimization Techniques: Several strategies reduce cost: batching multiple tasks into a single HIT to decrease the number of HITs required, pre-filtering tables before joining them to avoid unnecessary comparisons, and a hybrid sorting technique that refines a rough order through selective comparisons. The paper reports that these strategies markedly reduce execution cost and, in some cases, latency, without significant loss of accuracy.
Cost Reduction: In an experiment joining two sets of images, the optimizations reduce the overall cost from $67 for a naive implementation to about $3, without substantially affecting accuracy or latency. The savings come from decreasing the number of HITs that must be posted. In an end-to-end experiment, cost fell by a factor of 14.5.
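To make the effect of batching and pre-filtering on join cost concrete, the following is a minimal sketch of the HIT arithmetic. It is not Qurk's implementation: the table sizes, batch size, price, number of assignments, and the two-valued feature used for pre-filtering are all assumptions chosen for illustration, and the function names are hypothetical.

```python
from math import ceil

# Illustrative assumptions only; none of these values come from the summary above.
PRICE_TENTHS_OF_CENT = 15   # assumed price per assignment: 1.5 cents
ASSIGNMENTS_PER_HIT = 5     # assumed number of workers answering each HIT


def comparison_hits(num_pairs, pairs_per_hit=1):
    """HITs needed to show `num_pairs` candidate pairs, `pairs_per_hit` at a time."""
    return ceil(num_pairs / pairs_per_hit)


def matching_pairs(left_features, right_features):
    """Count pairs whose pre-extracted feature values agree (the pre-filter)."""
    return sum(1 for l in left_features for r in right_features if l == r)


def cost_dollars(hits):
    """Total cost of `hits` HITs under the assumed price and redundancy."""
    return hits * ASSIGNMENTS_PER_HIT * PRICE_TENTHS_OF_CENT / 1000


# Two hypothetical tables of 30 items each, with one coarse feature per item.
left = ["a"] * 15 + ["b"] * 15
right = ["a"] * 15 + ["b"] * 15

# Naive join: one HIT per pair in the full cross product.
naive_hits = comparison_hits(len(left) * len(right))

# Batched join: ten pairs per HIT, still over the full cross product.
batched_hits = comparison_hits(len(left) * len(right), pairs_per_hit=10)

# Pre-filtered join: pay for feature-extraction HITs first (ten items per HIT),
# then compare only the pairs whose features match, again ten per HIT.
feature_hits = ceil((len(left) + len(right)) / 10)
filtered_hits = feature_hits + comparison_hits(
    matching_pairs(left, right), pairs_per_hit=10
)

for name, hits in [("naive", naive_hits), ("batched", batched_hits),
                   ("pre-filtered + batched", filtered_hits)]:
    print(f"{name}: {hits} HITs, ${cost_dollars(hits):.2f}")
```

Note that pre-filtering is not free: extracting the feature costs HITs of its own, so it pays off only when the filter removes enough candidate pairs to offset that overhead.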

Implications and Speculative Outlook

The implications of integrating human computation in databases extend beyond simple cost savings. They mark a new frontier for query execution research and point to the need for optimization strategies that combine human and automated processing efficiently. The proposed methods open up a rich field for exploring human-in-the-loop systems, particularly in data processing tasks where algorithmic solutions fall short.

Theoretically, the paper sparks an intriguing discussion on how the crowdsourced database paradigm could evolve. One plausible future development could include integrating machine learning models to predict worker performance, thereby enhancing the accuracy of answers and further reducing unnecessary human tasks.

In practice, Qurk sets a precedent for how future systems may manage subjective and ambiguous queries, using human-assisted computation as a tool for nuanced decision-making. It promises utility in diverse domains such as content moderation, data annotation, and other settings where task ambiguity calls for crowd judgment.

In conclusion, this research advances the dialogue on human-centered data processing, showcasing that a hybrid approach, leveraging the strengths of both human and machine-provided computation, offers a viable pathway toward more efficient database management systems. As these systems evolve, they will likely require continuous refinement of the balance between crowdsourcing's cost dynamics and the inherent quality of human judgment.
