
Performance of Distributed File Systems on Cloud Computing Environment: An Evaluation for Small-File Problem

Published 29 Dec 2023 in cs.DC (arXiv:2312.17524v1)

Abstract: Various performance characteristics of distributed file systems have been well studied. However, the performance efficiency of distributed file systems on the small-file problem under complex machine learning workloads has not been well addressed. In addition, demand for unified storage serving both big data processing and high-performance computing has become pressing, so developing a solution that combines high-performance computing and big data over shared storage is important. This paper focuses on the performance efficiency of distributed file systems with small-file datasets. We propose an architecture combining high-performance computing and big data over shared storage and perform a series of experiments to investigate the performance of these distributed file systems. The results of the experiments confirm the applicability of the proposed architecture to complex machine learning algorithms.

References (43)
  1. G. C. Fox, J. Qiu, S. Kamburugamuve, S. Jha, and A. Luckow, “HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack,” in 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid).   IEEE, pp. 1057–1066.
  2. D. Moise, “Experiences with Performing MapReduce Analysis of Scientific Data on HPC Platforms,” in DIDC ’16: Proceedings of the ACM International Workshop on Data-Intensive Distributed Computing. New York, New York, USA: ACM, Jun. 2016, pp. 11–18.
  3. D. Zhao, Z. Zhang, X. Zhou, T. Li, K. Wang, D. Kimpe, P. Carns, R. Ross, and I. Raicu, “FusionFS: Toward supporting data-intensive scientific applications on extreme-scale high-performance computing systems,” in 2014 IEEE International Conference on Big Data (Big Data).   IEEE, pp. 61–70.
  4. A. Bhat, N. S. Islam, X. Lu, M. Wasi-ur Rahman, D. Shankar, and D. K. Panda, “A Plugin-Based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS,” in Big Data Benchmarks, Performance Optimization, and Emerging Hardware. Cham: Springer International Publishing, Jan. 2016, pp. 119–132.
  5. P. Xuan, J. Denton, P. K. Srimani, R. Ge, and F. Luo, “Big data analytics on traditional HPC infrastructure using two-level storage,” in DISCS ’15: Proceedings of the 2015 International Workshop on Data-Intensive Scalable Computing Systems. New York, New York, USA: ACM Press, 2015, pp. 1–8.
  6. B. T. Rao and L. S. S. Reddy, “Survey on Improved Scheduling in Hadoop MapReduce in Cloud Environments,” arXiv.org, Jul. 2012.
  7. D. Borthakur, “The hadoop distributed file system: Architecture and design,” Hadoop Project Website, 2007.
  8. M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: cluster computing with working sets,” in HotCloud ’10: Proceedings of the 2nd USENIX conference on Hot topics in cloud computing. USENIX Association, Jun. 2010, pp. 10–10.
  9. N. Chaimov, A. Malony, S. Canon, C. Iancu, K. Z. Ibrahim, and J. Srinivasan, “Scaling Spark on HPC Systems,” in HPDC ’16: Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing. New York, New York, USA: ACM, May 2016, pp. 97–110.
  10. M. Díaz, C. Martín, and B. Rubio, “State-of-the-art, challenges, and open issues in the integration of Internet of things and cloud computing,” Journal of Network and Computer Applications, vol. 67, no. C, pp. 99–117, May 2016.
  11. Lustre to DAOS: Machine Learning on Intel’s Platform.
  12. Spider – the Center-Wide Lustre File System.
  13. W. Yu, R. Noronha, S. Liang, and D. K. Panda, “Benefits of high speed interconnects to cluster file systems: a case study with lustre,” in IPDPS ’06: Proceedings of the 20th international conference on Parallel and distributed processing. IEEE Computer Society, Apr. 2006, pp. 273–273.
  14. J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008.
  15. M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, “Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling,” in EuroSys ’10: Proceedings of the 5th European conference on Computer systems. New York, New York, USA: ACM, Apr. 2010, pp. 265–278.
  16. M. Wasi-ur Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, “High-Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA,” in IPDPS ’15: Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium.   IEEE Computer Society, May 2015, pp. 291–300.
  17. T. Zhao, Z. Zhang, and X. Ao, “Application Performance Analysis of Distributed File Systems under Cloud Computing Environment,” Information Science and Control, pp. 152–155, 2015.
  18. X. Lu, M. W. U. Rahman, N. Islam, D. Shankar, and D. K. Panda, “Accelerating Spark with RDMA for Big Data Processing: Early Experiences,” in HOTI ’14: Proceedings of the 2014 IEEE 22nd Annual Symposium on High-Performance Interconnects.   IEEE Computer Society, Aug. 2014, pp. 9–16.
  19. D. Shankar, X. Lu, M. Wasi-ur Rahman, N. Islam, and D. K. Panda, “Characterizing and benchmarking stand-alone Hadoop MapReduce on modern HPC clusters,” The Journal of Supercomputing, Jun. 2016.
  20. Intel Corporation. (2015) Intel Rolls Out Enhanced Lustre File System.
  21. ——. (2015) Lustre at the Core of HPC and Big Data Convergence.
  22. (2015) Seagate Apache Hadoop on Lustre Connector.
  23. C. McDonald. (2015) Parallel and Iterative Processing for Machine Learning Recommendations with Spark.
  24. H. Li, A. Ghodsi, M. Zaharia, and E. Baldeschwieler, “Tachyon: Memory throughput I/O for cluster computing frameworks,” 2013.
  25. J. Sparks, H. Pritchard, and M. Dumler, “The Cray Framework for Hadoop for the Cray XC30,” Cray User Group Conference (CUG’14), 2014.
  26. S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google file system,” in SOSP ’03: Proceedings of the 19th ACM Symposium on Operating Systems Principles, 2003.
  27. K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop Distributed File System,” in MSST ’10: Proceedings of the 26th IEEE Symposium on Mass Storage Systems and Technologies, 2010.
  28. F. Wang, S. Oral, G. Shipman, and O. Drokin, “Understanding lustre filesystem internals,” 2009.
  29. D. Moise, G. Antoniu, and L. Bougé, “Improving the Hadoop map/reduce framework to support concurrent appends through the BlobSeer BLOB management system,” 2010.
  30. V. K. Vavilapalli, S. Seth, B. Saha, C. Curino, O. O’Malley, S. Radia, B. Reed, E. Baldeschwieler, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, and H. Shah, “Apache Hadoop YARN: Yet Another Resource Negotiator,” in SOCC ’13: Proceedings of the 4th Annual Symposium on Cloud Computing. New York, New York, USA: ACM Press, 2013, pp. 1–16.
  31. W. Xu, W. Luo, and N. Woodward, “Analysis and optimization of data import with hadoop,” Parallel and Distributed, 2012.
  32. S. Kipp. (2012) Exponential bandwidth growth and cost declines.
  33. D. J. Law, W. W. Diab, A. Healey, and S. B. Carlson, “IEEE 802.3 industry connections Ethernet bandwidth assessment,” 2012.
  34. T. P. Morgan, “InfiniBand Too Quick For Ethernet To Kill,” Apr. 2015.
  35. V. Meshram, X. Ouyang, and D. K. Panda, “Minimizing Lookup RPCs in Lustre File System using Metadata Delegation at Client Side,” Tech Rep OSU-CISRC-7/11, 2011.
  36. D. M. Stearman, “ZFS on RBODs Leveraging RAID Controllers for Metrics and Enclosure Management,” 2015.
  37. R. Brueckner, “Building a CIFS/NFS Gateway to Lustre - insideHPC,” Oct. 2014.
  38. K. V. Shvachko, “HDFS scalability: the limits to growth,” ;login: The USENIX Magazine, vol. 35, no. 2, pp. 6–16, 2010.
  39. T. White. (2009) The Small Files Problem. [Online]. Available: http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
  40. X. Liu, J. Han, Y. Zhong, C. Han, and X. He, “Implementing WebGIS on Hadoop: A case study of improving small file I/O performance on HDFS,” in 2009 IEEE International Conference on Cluster Computing and Workshops.   IEEE, 2009, pp. 1–8.
  41. M. Pershin. Intel Lustre Data on MDT/Small File I/O.
  42. S. Ihara. Lustre Metadata Fundamental Benchmark and Performance.
  43. Criteo. (2014) Kaggle display advertising challenge. [Online]. Available: http://labs.criteo.com/downloads/2014-kaggle-display-advertising-challenge-dataset


Explain it Like I'm 14

Overview

This paper looks at how two big “shared storage” systems—HDFS (used in Hadoop) and Lustre (used in supercomputers)—handle lots of tiny files when running machine-learning jobs in the cloud. The authors also suggest a way to combine high‑performance computing (HPC) and big data tools using one shared storage system, so scientists and businesses don’t have to move huge amounts of data back and forth.

What questions did the researchers ask?

They wanted to answer simple, practical questions:

  • When we have lots of small files, which storage system (HDFS or Lustre) lets Spark (a popular data tool) run machine‑learning jobs faster?
  • Can we use one shared storage system for both supercomputer-style work (HPC) and big data processing, instead of keeping them separate?
  • Do we really need to “keep computation next to the data” (data locality), or can modern networks and smart caching make that less important?

How did they do the study?

Think of a distributed file system as a giant, shared hard drive spread across many computers. The two systems they compared were:

  • HDFS: common in big data (Hadoop). It’s built to store huge files, and it usually keeps three copies of each file for safety. It’s “write once, read many,” meaning you don’t constantly change files after you write them.
  • Lustre: common in supercomputers. It’s very fast, lets many users read and write at the same time, and works like a regular mounted folder on Linux.

They ran Spark (a tool that processes data fast, often by keeping it in memory) on two small test clusters:

  • One cluster stored data on HDFS.
  • One cluster stored data on Lustre (and the Hadoop cluster could also mount Lustre).

They used a real advertising dataset (from Criteo) and trained a simple machine-learning model (logistic regression) using Spark’s ML library. To focus on the “small-file problem,” they split the dataset into many tiny files (each about 460 lines; a sketch of this splitting step follows the list below) and tested three sizes:

  • 10,000 files (~1.1 GB)
  • 20,000 files (~2.2 GB)
  • 30,000 files (~3.3 GB)
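
For readers who want to reproduce the data preparation, here is a minimal sketch of the splitting step in Python. The paper only states that each small file held about 460 lines, so the source file name, output directory, and part-file naming scheme below are illustrative assumptions, not the authors’ actual script.

```python
import os

LINES_PER_FILE = 460  # per-file size reported in the paper


def split_into_small_files(src_path: str, out_dir: str) -> None:
    """Split one large text file into many ~460-line files."""
    os.makedirs(out_dir, exist_ok=True)
    chunk, idx = [], 0

    def flush():
        nonlocal chunk, idx
        with open(os.path.join(out_dir, f"part-{idx:05d}.txt"), "w") as out:
            out.writelines(chunk)
        chunk, idx = [], idx + 1

    with open(src_path) as src:
        for line in src:
            chunk.append(line)
            if len(chunk) == LINES_PER_FILE:
                flush()
    if chunk:  # write any trailing partial chunk
        flush()


# Hypothetical usage: carve the Criteo training file into 10k/20k/30k pieces.
# split_into_small_files("criteo_train.txt", "criteo_small_files/")
```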

Then they measured how long Spark took to run the job on each storage system as the number of files increased.
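
The paper does not publish its driver program, but the run it describes looks roughly like the PySpark sketch below: read a directory of tiny files, train logistic regression with Spark ML, and time the whole job. The paths, URIs, and LogisticRegression parameters here are assumptions for illustration. Note that switching between the HDFS cluster and the mounted Lustre directory only changes the input URI.

```python
import time
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("small-file-lr").getOrCreate()

# Same job, two storage backends; only the URI changes (paths illustrative):
# input_path = "hdfs:///criteo/small_files/"            # HDFS-backed cluster
input_path = "file:///mnt/lustre/criteo/small_files/"   # Lustre, POSIX-mounted

start = time.time()

# Criteo rows are tab-separated: a click label, 13 integer features, then
# 26 categorical features. For brevity this sketch uses only the integers.
df = spark.read.csv(input_path, sep="\t").selectExpr(
    "cast(_c0 as double) as label",
    *[f"cast(_c{i} as double) as f{i}" for i in range(1, 14)],
).na.fill(0.0)

features = VectorAssembler(
    inputCols=[f"f{i}" for i in range(1, 14)], outputCol="features"
).transform(df)

model = LogisticRegression(maxIter=10).fit(features)

print(f"elapsed: {time.time() - start:.1f} s")  # the quantity being compared
```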

To make the technical ideas clearer, here are two kinds of “data movement” they discuss:

  • Vertical movement: like using an elevator in a building—data moves between long‑term storage (disk), fast memory (RAM), and the CPU. Spark helps by caching (saving) data in memory to avoid reloading it from disk over and over (a one-line code example follows this list).
  • Horizontal movement: like passing papers between classmates—data is shuffled across computers during certain steps. This can be sped up with faster networks (e.g., InfiniBand/RDMA), but often vertical movement (disk ↔ memory) dominates the cost.
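
To make the vertical-movement point concrete: in Spark, keeping a dataset in memory is a one-line request. A hedged sketch, reusing the `features` DataFrame from the earlier example:

```python
# Mark the parsed dataset for in-memory storage (lazy) and materialize it.
features.cache()
features.count()  # first action fills the cache

# From here on, each iteration of an algorithm like LogisticRegression.fit()
# rereads `features` from executor RAM rather than from HDFS/Lustre; only
# shuffle steps (horizontal movement) still push data across the network.
```

This is the mechanism behind the authors’ argument that data locality matters less for iterative ML: after the first pass, the file system is largely out of the loop.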

What did they find?

  • As the number of small files increased (from 10k to 30k), both HDFS and Lustre took much longer. Small files are hard on distributed systems because tracking each file’s “metadata” (like a file’s name and location) adds a lot of overhead, and the system ends up doing too many tiny read operations (the sketch after this list shows one way this overhead surfaces in Spark).
  • Lustre was consistently faster than HDFS for these small‑file Spark jobs, and the gap grew as the number of files increased.
  • This suggests Lustre can be a good shared storage choice for both HPC and big data workloads, even for complex machine‑learning tasks with lots of small files.
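
One way the small-file overhead surfaces inside Spark itself: each tiny, unsplittable file typically becomes its own input partition, so 30,000 files means on the order of 30,000 scheduled read tasks and metadata lookups. A hedged way to observe this, reusing the `spark` session from the earlier sketch (the path is again illustrative):

```python
# With many tiny files, Spark usually creates roughly one input partition
# per file, multiplying task-scheduling and metadata cost.
rdd = spark.sparkContext.textFile("file:///mnt/lustre/criteo/small_files/")
print(rdd.getNumPartitions())  # tracks the file count, not the data size

# wholeTextFiles yields (filename, contents) pairs and is a common tool
# for handling small-file datasets explicitly.
pairs = spark.sparkContext.wholeTextFiles(
    "file:///mnt/lustre/criteo/small_files/"
)
print(pairs.count())  # equals the number of files read
```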

Why this matters:

  • Many real projects (scientific and business) end up with tons of small files.
  • If Lustre handles these better than HDFS, it could simplify how teams store and process data across different types of computing systems.

Why does this matter? Implications and impact

  • A unified storage approach: Using Lustre for both HPC and big data could reduce the need to copy massive outputs from one cluster to another, saving time and simplifying workflows.
  • Data locality may matter less: Spark’s in‑memory caching and faster modern networks can reduce the need to keep computation exactly where the data lives, especially for iterative ML algorithms.
  • Small-file challenge remains: Even with Lustre doing better than HDFS, small files are still tough on distributed systems. Better tools (like memory‑centric systems such as Tachyon/Alluxio) and improved metadata handling could help further.

Limitations and future plans

The authors note:

  • Their test was small scale (virtual machines, 1 Gb Ethernet), not a giant production setup.
  • They tested reads for small files, not writes (writing can be very important in scientific workloads).
  • They didn’t include Tachyon/Alluxio, which might improve performance by keeping hot data in memory across the cluster.

They plan to:

  • Test on larger clusters with more files.
  • Add Tachyon/Alluxio to see the gains.
  • Benchmark small‑file writes on both Lustre and HDFS.

In short, this paper suggests that Lustre can be a strong shared storage option for both supercomputing and big data tasks, especially when dealing with many small files, and it points the way to building simpler, faster data systems that don’t waste time moving data around.
