- The paper introduces Sherlock, a deep neural network that accurately detects semantic types in large-scale tabular data with a 0.89 F1 score.
- It leverages 1,588 diverse feature descriptors, including statistical measures and text embeddings, to overcome traditional matching limitations.
- Sherlock outperforms baseline machine learning models and crowdsourced annotations, offering significant improvements for automated data cleaning.
Overview of "Sherlock: A Deep Learning Approach to Semantic Data Type Detection"
The paper "Sherlock: A Deep Learning Approach to Semantic Data Type Detection" presents a novel methodology for detecting semantic data types within tabular data columns, advancing current practices in automated data cleaning, schema matching, and data discovery. The authors propose a multi-input deep neural network, named Sherlock, trained to identify semantic types from a large corpus of real-world data columns: 686,765 columns spanning 78 semantic types drawn from the VizNet corpus.
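The multi-input design can be pictured as one encoder per feature group whose outputs are concatenated and fed to a shared classifier. The following sketch is a forward pass only, with hypothetical layer sizes; the per-branch widths loosely reflect the paper's feature groups summing to 1,588, but the actual Sherlock submodules and dimensions differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-branch input widths (assumed split of the 1,588 features;
# Sherlock's real submodule architecture is more elaborate).
branch_dims = {"stats": 27, "chars": 960, "words": 201, "paragraphs": 400}
hidden, n_types = 300, 78

# One weight matrix per input branch, plus a shared output layer.
branch_w = {name: rng.normal(0, 0.01, (dim, hidden))
            for name, dim in branch_dims.items()}
out_w = rng.normal(0, 0.01, (hidden * len(branch_dims), n_types))

def forward(inputs):
    """Encode each feature group, concatenate, and classify into 78 types."""
    encoded = [np.maximum(inputs[name] @ branch_w[name], 0.0)  # ReLU branch
               for name in branch_dims]
    logits = np.concatenate(encoded) @ out_w
    exp = np.exp(logits - logits.max())       # numerically stable softmax
    return exp / exp.sum()

# One synthetic column's feature vectors, one per branch.
x = {name: rng.normal(size=dim) for name, dim in branch_dims.items()}
probs = forward(x)  # probability distribution over the 78 semantic types
```

Concatenating per-group encodings lets each feature family (statistics, characters, words, paragraphs) be compressed separately before the shared classifier combines them.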
Methodology
Sherlock addresses the limitations of traditional matching-based approaches that rely on dictionary lookups or regular expression matching. These conventional systems often struggle with dirty or malformed data and are typically constrained by the breadth of identifiable semantic types. Instead, Sherlock relies on a rich feature extraction process that derives 1,588 feature descriptors per column. These descriptors encompass global statistical characteristics, character-level distributions, word embeddings, and paragraph vectors, capturing both the structural properties and the semantic content of a column's values.
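To make the idea concrete, here is a minimal sketch of per-column feature extraction. The `column_features` helper is hypothetical and computes only a handful of simple statistical descriptors; Sherlock's actual pipeline produces 1,588 features, including embedding-based ones not shown here.

```python
import statistics
from collections import Counter

def column_features(values):
    """Compute a few illustrative statistical descriptors for one column.

    Hypothetical helper: Sherlock's real extractor also derives character
    distributions, word embeddings, and paragraph vectors.
    """
    lengths = [len(v) for v in values]
    chars = Counter(c for v in values for c in v)
    total_chars = sum(chars.values()) or 1
    return {
        "n_values": len(values),
        "frac_unique": len(set(values)) / len(values),
        "mean_length": statistics.mean(lengths),
        "std_length": statistics.pstdev(lengths),
        "frac_numeric_chars":
            sum(n for c, n in chars.items() if c.isdigit()) / total_chars,
    }

feats = column_features(["Boston", "New York", "Chicago", "Boston"])
# A "city" column tends to have few digits and moderate uniqueness,
# which is the kind of signal a classifier can pick up on.
```

Descriptors like these are cheap to compute and robust to malformed values, which is precisely where dictionary and regex matching tend to fail.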
Sherlock frames semantic type detection as a multiclass classification problem. The authors report that their model achieves a support-weighted F1 score of 0.89, which surpasses that of machine learning baseline models, matching-based approaches, and even crowdsourced annotations.
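The support-weighted F1 score averages per-class F1 scores, weighting each class by how often it appears in the ground truth, so frequent semantic types dominate the metric. A minimal implementation (equivalent in spirit to scikit-learn's `f1_score(..., average="weighted")`):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted F1: per-class F1 averaged by class frequency in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += (n / total) * f1  # weight by class support
    return score

# Toy example with two semantic types and one misclassification.
score = weighted_f1(["city", "city", "name", "name"],
                    ["city", "name", "name", "name"])
```

Weighting by support matters here because the 78 types in the training corpus are far from uniformly distributed.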
Results
This performance demonstrates the effectiveness of deep learning architectures in handling the inherent complexity and variability of column semantics. The paper shows that Sherlock outperforms other machine learning classifiers, such as decision trees and random forests, as well as traditional dictionary-based and learned regular expression methods.
One notable aspect of Sherlock's evaluation is its rigorous benchmarking against crowdsourced annotations. Human annotators, despite receiving focused training, had significant difficulty selecting the correct semantic type from the pool of 78 distinct types. These difficulties highlight how hard it is for people to distinguish between similar types, a barrier that is less prominent for machine learning models trained on large-scale data.
Implications and Future Directions
Theoretically, this research underscores the utility of deep neural networks for semantic type detection, particularly how feature-rich inputs improve model accuracy. Practically, integrating Sherlock could improve automated data preparation systems by streamlining processes that depend on accurate semantic typing, influencing data quality, consistency, and integration methods.
Future directions might explore expanding the set of semantic types beyond the initial 78, leveraging more data sources, and enhancing feature extraction methodologies to encompass inter-column relationships and dataset-wide properties. Moreover, the development of standardized benchmarks for comparative studies in semantic type detection would further advance research and application in this domain.
Conclusion
Sherlock presents a compelling advancement in semantic data type detection using deep learning, with broader implications for data science applications. This approach demonstrates significant promise in overcoming existing barriers in data preparation systems and sets a foundational framework for future explorations into feature-enriched, ontology-driven type detection models.