
An unsupervised and customizable misspelling generator for mining noisy health-related text sources

Published 4 Jun 2018 in cs.CL | (1806.00910v1)

Abstract: In this paper, we present a customizable data-centric system that automatically generates common misspellings for complex health-related terms. The spelling variant generator relies on a dense vector model learned from large unlabeled text, which is used to find terms semantically close to the original (seed) keyword; terms that are lexically dissimilar beyond a given threshold are then filtered out. The process is executed recursively and converges when no new terms that are both lexically and semantically similar to the seed keyword are found. Weighting of intra-word character sequence similarities allows further problem-specific customization of the system. On a dataset prepared for this study, our system outperforms the current state of the art for medication name variant generation, with a best F1-score of 0.69 and F1/4-score of 0.78. Extrinsic evaluation of the system on a set of cancer-related terms showed an increase of over 67% in retrieval rate from Twitter posts when the generated variants are included. Our proposed spelling variant generator has several advantages over the current state of the art and other types of variant generators: (i) it is capable of filtering out lexically similar but semantically dissimilar terms, (ii) the number of variants generated is low, as many low-frequency and ambiguous misspellings are filtered out, and (iii) the system is fully automatic, customizable, and easily executable. While the base system is fully unsupervised, we show how supervision may be employed to adjust weights for task-specific customization. The performance and relative simplicity of our proposed approach make it a much-needed misspelling generation resource for health-related text mining from noisy sources. The source code for the system has been made publicly available for research purposes.
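
The recursive procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' released code. It rests on several assumptions: the dense vector model is replaced by a hypothetical hand-written lookup table (`TOY_NEIGHBORS`), lexical similarity is approximated with `difflib.SequenceMatcher` from the standard library rather than the paper's weighted character-sequence measure, candidates are compared against the seed keyword, and the threshold of 0.8 is arbitrary. All function and variable names are illustrative.

```python
from collections import deque
from difflib import SequenceMatcher

# Hypothetical stand-in for nearest-neighbour lookups in a dense vector model.
TOY_NEIGHBORS = {
    "metformin": ["metformine", "insulin", "metforman"],
    "metformine": ["metformin", "metfomine", "glipizide"],
    "metforman": ["metformin"],
    "metfomine": ["metformine"],
}


def toy_neighbors(term):
    """Return the semantically close terms for `term` (empty if unknown)."""
    return TOY_NEIGHBORS.get(term, [])


def lexical_similarity(a, b):
    """Similarity in [0, 1]; an assumed substitute for the paper's measure."""
    return SequenceMatcher(None, a, b).ratio()


def generate_variants(seed, neighbors, threshold=0.8):
    """Expand from `seed` until no new lexically similar neighbours appear."""
    variants = set()
    seen = {seed}
    queue = deque([seed])
    while queue:
        term = queue.popleft()
        for candidate in neighbors(term):
            if candidate in seen:
                continue
            seen.add(candidate)
            # Keep only semantic neighbours that are also lexically close
            # to the seed; accepted variants are expanded in turn.
            if lexical_similarity(seed, candidate) >= threshold:
                variants.add(candidate)
                queue.append(candidate)
    return variants


if __name__ == "__main__":
    print(sorted(generate_variants("metformin", toy_neighbors)))
```

In this toy run, semantically related but lexically distant terms such as "insulin" and "glipizide" are discarded, while "metfomine" is reached only through the accepted variant "metformine", which is the effect the recursion is meant to have. The loop terminates because every candidate is examined at most once.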
