Identifying Banking Transaction Descriptions via Support Vector Machine Short-Text Classification Based on a Specialized Labelled Corpus
Abstract: Short texts are omnipresent in real-time news, social network commentaries, etc. Traditional text representation methods have been successfully applied to self-contained documents of medium size. However, short texts often carry insufficient information, due, for example, to the use of mnemonics, which makes them hard to classify. Therefore, the particularities of specific domains must be exploited. In this article we describe a novel system that combines Natural Language Processing techniques with Machine Learning algorithms to classify banking transaction descriptions for personal finance management, a problem that has not previously been considered in the literature. We trained and tested the system on a labelled dataset of real customer transactions, which will be available to other researchers on request. Motivated by existing solutions in spam detection, we also propose a short-text similarity detector based on the Jaccard distance to reduce the training set size. Experimental results with a two-stage classifier that combines this detector with an SVM indicate high accuracy compared with alternative approaches, taking into account complexity and computing time. Finally, we present a use case with a personal finance application, CoinScrap, which is available on Google Play and the App Store.
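The idea of using the Jaccard distance to reduce the training set can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the 0.5 threshold, the rule of comparing only samples that share a label, and the sample descriptions are all assumptions made for illustration. The SVM stage is not shown.

```python
def tokens(text):
    """Lowercase and split on whitespace (an assumed, deliberately simple tokenizer)."""
    return set(text.lower().split())


def jaccard_distance(a, b):
    """Return 1 - |A & B| / |A | B| over the token sets of two short texts."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0  # two empty texts are treated as identical
    return 1.0 - len(ta & tb) / len(ta | tb)


def reduce_training_set(samples, threshold=0.5):
    """Drop a (text, label) pair when an already-kept pair with the same
    label lies within `threshold` Jaccard distance of it.

    The threshold and the same-label rule are hypothetical choices.
    """
    kept = []
    for text, label in samples:
        is_near_duplicate = any(
            kept_label == label and jaccard_distance(text, kept_text) <= threshold
            for kept_text, kept_label in kept
        )
        if not is_near_duplicate:
            kept.append((text, label))
    return kept


# Hypothetical transaction descriptions, not taken from the paper's dataset.
samples = [
    ("card payment supermarket 1234", "groceries"),
    ("card payment supermarket 5678", "groceries"),
    ("transfer received payroll acme", "income"),
]
reduced = reduce_training_set(samples)
```

Here the two supermarket descriptions share three of five distinct tokens, so their distance falls under the assumed threshold and only the first is kept. The reduced set would then be passed to the second-stage classifier.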