
Performance of Distributed File Systems on Cloud Computing Environment: An Evaluation for Small-File Problem

Published 29 Dec 2023 in cs.DC (arXiv:2312.17524v1)

Abstract: Various performance characteristics of distributed file systems have been well studied. However, the performance efficiency of distributed file systems on the small-file problem under complex machine learning workloads has not been well addressed. In addition, demand for unified storage serving both big data processing and high-performance computing has become pressing, so developing a solution that combines high-performance computing and big data over shared storage is important. This paper focuses on the performance efficiency of distributed file systems with small-file datasets. We propose an architecture combining high-performance computing and big data over shared storage and perform a series of experiments to investigate the performance of these distributed file systems. The results of the experiments confirm the applicability of the proposed architecture to complex machine learning algorithms.

References (43)
  1. G. C. Fox, J. Qiu, S. Kamburugamuve, S. Jha, and A. Luckow, “HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack,” in 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid).   IEEE, pp. 1057–1066.
  2. D. Moise, “Experiences with Performing MapReduce Analysis of Scientific Data on HPC Platforms,” in DIDC ’16: Proceedings of the ACM International Workshop on Data-Intensive Distributed Computing. New York, New York, USA: ACM, Jun. 2016, pp. 11–18.
  3. D. Zhao, Z. Zhang, X. Zhou, T. Li, K. Wang, D. Kimpe, P. Carns, R. Ross, and I. Raicu, “FusionFS: Toward supporting data-intensive scientific applications on extreme-scale high-performance computing systems,” in 2014 IEEE International Conference on Big Data (Big Data).   IEEE, pp. 61–70.
  4. A. Bhat, N. S. Islam, X. Lu, M. Wasi-ur Rahman, D. Shankar, and D. K. Panda, “A Plugin-Based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS,” in Big Data Benchmarks, Performance Optimization, and Emerging Hardware. Cham: Springer International Publishing, Jan. 2016, pp. 119–132.
  5. P. Xuan, J. Denton, P. K. Srimani, R. Ge, and F. Luo, “Big data analytics on traditional HPC infrastructure using two-level storage,” in DISCS ’15: Proceedings of the 2015 International Workshop on Data-Intensive Scalable Computing Systems. New York, New York, USA: ACM Press, 2015, pp. 1–8.
  6. B. T. Rao and L. S. S. Reddy, “Survey on Improved Scheduling in Hadoop MapReduce in Cloud Environments,” arXiv.org, Jul. 2012.
  7. D. Borthakur, “The hadoop distributed file system: Architecture and design,” Hadoop Project Website, 2007.
  8. M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: cluster computing with working sets,” in HotCloud ’10: Proceedings of the 2nd USENIX conference on Hot topics in cloud computing. USENIX Association, Jun. 2010, pp. 10–10.
  9. N. Chaimov, A. Malony, S. Canon, C. Iancu, K. Z. Ibrahim, and J. Srinivasan, “Scaling Spark on HPC Systems,” in HPDC ’16: Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing. New York, New York, USA: ACM, May 2016, pp. 97–110.
  10. M. Díaz, C. Martín, and B. Rubio, “State-of-the-art, challenges, and open issues in the integration of Internet of things and cloud computing,” Journal of Network and Computer Applications, vol. 67, no. C, pp. 99–117, May 2016.
  11. Lustre to DAOS: Machine Learning on Intel’s Platform.
  12. Spider – the Center-Wide Lustre File System.
  13. W. Yu, R. Noronha, S. Liang, and D. K. Panda, “Benefits of high speed interconnects to cluster file systems: a case study with lustre,” in IPDPS ’06: Proceedings of the 20th international conference on Parallel and distributed processing. IEEE Computer Society, Apr. 2006, pp. 273–273.
  14. J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008.
  15. M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, “Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling,” in EuroSys ’10: Proceedings of the 5th European conference on Computer systems. New York, New York, USA: ACM, Apr. 2010, pp. 265–278.
  16. M. Wasi-ur Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, “High-Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA,” in IPDPS ’15: Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium.   IEEE Computer Society, May 2015, pp. 291–300.
  17. T. Zhao, Z. Zhang, and X. Ao, “Application Performance Analysis of Distributed File Systems under Cloud Computing Environment,” Information Science and Control, pp. 152–155, 2015.
  18. X. Lu, M. W. U. Rahman, N. Islam, D. Shankar, and D. K. Panda, “Accelerating Spark with RDMA for Big Data Processing: Early Experiences,” in HOTI ’14: Proceedings of the 2014 IEEE 22nd Annual Symposium on High-Performance Interconnects.   IEEE Computer Society, Aug. 2014, pp. 9–16.
  19. D. Shankar, X. Lu, M. Wasi-ur Rahman, N. Islam, and D. K. Panda, “Characterizing and benchmarking stand-alone Hadoop MapReduce on modern HPC clusters,” The Journal of Supercomputing, Jun. 2016.
  20. Intel Corporation. (2015) Intel Rolls Out Enhanced Lustre File System.
  21. ——. (2015) Lustre at the Core of HPC and Big Data Convergence.
  22. (2015) Seagate Apache Hadoop on Lustre Connector.
  23. C. McDonald. (2015) Parallel and Iterative Processing for Machine Learning Recommendations with Spark.
  24. H. Li, A. Ghodsi, M. Zaharia, and E. Baldeschwieler, “Tachyon: Memory throughput I/O for cluster computing frameworks,” 2013.
  25. J. Sparks, H. Pritchard, and M. Dumler, “The Cray Framework for Hadoop for the Cray XC30,” Cray User Group Conference (CUG’14), 2014.
  26. S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google file system,” in SOSP ’03: Proceedings of the 19th ACM Symposium on Operating Systems Principles, 2003.
  27. K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop Distributed File System,” in MSST ’10: Proceedings of the 26th IEEE Symposium on Mass Storage Systems and Technologies, 2010.
  28. F. Wang, S. Oral, G. Shipman, and O. Drokin, “Understanding lustre filesystem internals,” 2009.
  29. D. Moise, G. Antoniu, and L. Bougé, “Improving the Hadoop map/reduce framework to support concurrent appends through the BlobSeer BLOB management system,” 2010.
  30. V. K. Vavilapalli, S. Seth, B. Saha, C. Curino, O. O’Malley, S. Radia, B. Reed, E. Baldeschwieler, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, and H. Shah, “Apache Hadoop YARN: Yet Another Resource Negotiator,” in SOCC ’13: Proceedings of the 4th Annual Symposium on Cloud Computing. New York, New York, USA: ACM Press, 2013, pp. 1–16.
  31. W. Xu, W. Luo, and N. Woodward, “Analysis and optimization of data import with hadoop,” Parallel and Distributed, 2012.
  32. S. Kipp. (2012) Exponential bandwidth growth and cost declines.
  33. D. J. Law, W. W. Diab, A. Healey, and S. B. Carlson, “IEEE 802.3 industry connections Ethernet bandwidth assessment,” 2012.
  34. T. P. Morgan, “InfiniBand Too Quick For Ethernet To Kill,” Apr. 2015.
  35. V. Meshram, X. Ouyang, and D. K. Panda, “Minimizing Lookup RPCs in Lustre File System using Metadata Delegation at Client Side,” Tech Rep OSU-CISRC-7/11, 2011.
  36. D. M. Stearman, “ZFS on RBODs Leveraging RAID Controllers for Metrics and Enclosure Management,” 2015.
  37. R. Brueckner, “Building a CIFS/NFS Gateway to Lustre - insideHPC,” Oct. 2014.
  38. K. V. Shvachko, “HDFS scalability: the limits to growth,” ;login: The USENIX Magazine, vol. 35, no. 2, pp. 6–16, 2010.
  39. T. White. (2009) The Small Files Problem. [Online]. Available: http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
  40. X. Liu, J. Han, Y. Zhong, C. Han, and X. He, “Implementing WebGIS on Hadoop: A case study of improving small file I/O performance on HDFS,” in 2009 IEEE International Conference on Cluster Computing and Workshops.   IEEE, 2009, pp. 1–8.
  41. M. Pershin. Intel Lustre Data on MDT/Small File I/O.
  42. S. Ihara. Lustre Metadata Fundamental Benchmark and Performance.
  43. Criteo. (2014) Kaggle display advertising challenge. [Online]. Available: http://labs.criteo.com/downloads/2014-kaggle-display-advertising-challenge-dataset


Explain it Like I'm 14

Overview

This paper looks at how two big “shared storage” systems—HDFS (used in Hadoop) and Lustre (used in supercomputers)—handle lots of tiny files when running machine-learning jobs in the cloud. The authors also suggest a way to combine high‑performance computing (HPC) and big data tools using one shared storage system, so scientists and businesses don’t have to move huge amounts of data back and forth.

What questions did the researchers ask?

They wanted to answer simple, practical questions:

  • When we have lots of small files, which storage system (HDFS or Lustre) lets Spark (a popular data tool) run machine‑learning jobs faster?
  • Can we use one shared storage system for both supercomputer-style work (HPC) and big data processing, instead of keeping them separate?
  • Do we really need to “keep computation next to the data” (data locality), or can modern networks and smart caching make that less important?

How did they do the study?

Think of a distributed file system as a giant, shared hard drive spread across many computers. The two systems they compared were:

  • HDFS: common in big data (Hadoop). It’s built to store huge files, and it usually keeps three copies of each file for safety. It’s “write once, read many,” meaning you don’t constantly change files after you write them.
  • Lustre: common in supercomputers. It’s very fast, lets many users read and write at the same time, and works like a regular mounted folder on Linux.

They ran Spark (a tool that processes data fast, often by keeping it in memory) on two small test clusters:

  • One cluster stored data on HDFS.
  • One cluster stored data on Lustre (and the Hadoop cluster could also mount Lustre).

They used a real advertising dataset (from Criteo) and trained a simple machine-learning model (logistic regression) using Spark’s ML library. To focus on the “small-file problem,” they split the dataset into many tiny files (each about 460 lines; a sketch of this splitting step follows the list below) and tested three sizes:

  • 10,000 files (~1.1 GB)
  • 20,000 files (~2.2 GB)
  • 30,000 files (~3.3 GB)
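
For readers who want to reproduce the data preparation, here is a minimal sketch of the splitting step in Python. The paper only states that each small file held about 460 lines, so the source file name, output directory, and part-file naming scheme below are illustrative assumptions, not the authors’ actual script.

```python
import os

LINES_PER_FILE = 460  # per-file size reported in the paper


def split_into_small_files(src_path: str, out_dir: str) -> None:
    """Split one large text file into many ~460-line files."""
    os.makedirs(out_dir, exist_ok=True)
    chunk, idx = [], 0

    def flush():
        nonlocal chunk, idx
        with open(os.path.join(out_dir, f"part-{idx:05d}.txt"), "w") as out:
            out.writelines(chunk)
        chunk, idx = [], idx + 1

    with open(src_path) as src:
        for line in src:
            chunk.append(line)
            if len(chunk) == LINES_PER_FILE:
                flush()
    if chunk:  # write any trailing partial chunk
        flush()


# Hypothetical usage: carve the Criteo training file into 10k/20k/30k pieces.
# split_into_small_files("criteo_train.txt", "criteo_small_files/")
```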

Then they measured how long Spark took to run the job on each storage system as the number of files increased.
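
The paper does not publish its driver program, but the run it describes looks roughly like the PySpark sketch below: read a directory of tiny files, train logistic regression with Spark ML, and time the whole job. The paths, URIs, and LogisticRegression parameters here are assumptions for illustration. Note that switching between the HDFS cluster and the mounted Lustre directory only changes the input URI.

```python
import time
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("small-file-lr").getOrCreate()

# Same job, two storage backends; only the URI changes (paths illustrative):
# input_path = "hdfs:///criteo/small_files/"            # HDFS-backed cluster
input_path = "file:///mnt/lustre/criteo/small_files/"   # Lustre, POSIX-mounted

start = time.time()

# Criteo rows are tab-separated: a click label, 13 integer features, then
# 26 categorical features. For brevity this sketch uses only the integers.
df = spark.read.csv(input_path, sep="\t").selectExpr(
    "cast(_c0 as double) as label",
    *[f"cast(_c{i} as double) as f{i}" for i in range(1, 14)],
).na.fill(0.0)

features = VectorAssembler(
    inputCols=[f"f{i}" for i in range(1, 14)], outputCol="features"
).transform(df)

model = LogisticRegression(maxIter=10).fit(features)

print(f"elapsed: {time.time() - start:.1f} s")  # the quantity being compared
```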

To make the technical ideas clearer, here are two kinds of “data movement” they discuss:

  • Vertical movement: like using an elevator in a building—data moves between long‑term storage (disk), fast memory (RAM), and the CPU. Spark helps by caching (saving) data in memory to avoid reloading it from disk over and over (a one-line code example follows this list).
  • Horizontal movement: like passing papers between classmates—data is shuffled across computers during certain steps. This can be sped up with faster networks (e.g., InfiniBand/RDMA), but often vertical movement (disk ↔ memory) dominates the cost.
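
To make the vertical-movement point concrete: in Spark, keeping a dataset in memory is a one-line request. A hedged sketch, reusing the `features` DataFrame from the earlier example:

```python
# Mark the parsed dataset for in-memory storage (lazy) and materialize it.
features.cache()
features.count()  # first action fills the cache

# From here on, each iteration of an algorithm like LogisticRegression.fit()
# rereads `features` from executor RAM rather than from HDFS/Lustre; only
# shuffle steps (horizontal movement) still push data across the network.
```

This is the mechanism behind the authors’ argument that data locality matters less for iterative ML: after the first pass, the file system is largely out of the loop.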

What did they find?

  • As the number of small files increased (from 10k to 30k), both HDFS and Lustre took much longer. Small files are hard on distributed systems because tracking each file’s “metadata” (like a file’s name and location) adds a lot of overhead, and the system ends up doing too many tiny read operations (the sketch after this list shows one way this overhead surfaces in Spark).
  • Lustre was consistently faster than HDFS for these small‑file Spark jobs, and the gap grew as the number of files increased.
  • This suggests Lustre can be a good shared storage choice for both HPC and big data workloads, even for complex machine‑learning tasks with lots of small files.
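
One way the small-file overhead surfaces inside Spark itself: each tiny, unsplittable file typically becomes its own input partition, so 30,000 files means on the order of 30,000 scheduled read tasks and metadata lookups. A hedged way to observe this, reusing the `spark` session from the earlier sketch (the path is again illustrative):

```python
# With many tiny files, Spark usually creates roughly one input partition
# per file, multiplying task-scheduling and metadata cost.
rdd = spark.sparkContext.textFile("file:///mnt/lustre/criteo/small_files/")
print(rdd.getNumPartitions())  # tracks the file count, not the data size

# wholeTextFiles yields (filename, contents) pairs and is a common tool
# for handling small-file datasets explicitly.
pairs = spark.sparkContext.wholeTextFiles(
    "file:///mnt/lustre/criteo/small_files/"
)
print(pairs.count())  # equals the number of files read
```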

Why this matters:

  • Many real projects (scientific and business) end up with tons of small files.
  • If Lustre handles these better than HDFS, it could simplify how teams store and process data across different types of computing systems.

Why does this matter? Implications and impact

  • A unified storage approach: Using Lustre for both HPC and big data could reduce the need to copy massive outputs from one cluster to another, saving time and simplifying workflows.
  • Data locality may matter less: Spark’s in‑memory caching and faster modern networks can reduce the need to keep computation exactly where the data lives, especially for iterative ML algorithms.
  • Small-file challenge remains: Even with Lustre doing better than HDFS, small files are still tough on distributed systems. Better tools (like memory‑centric systems such as Tachyon/Alluxio) and improved metadata handling could help further.

Limitations and future plans

The authors note:

  • Their test was small scale (virtual machines, 1 Gb Ethernet), not a giant production setup.
  • They tested reads for small files, not writes (writing can be very important in scientific workloads).
  • They didn’t include Tachyon/Alluxio, which might improve performance by keeping hot data in memory across the cluster.

They plan to:

  • Test on larger clusters with more files.
  • Add Tachyon/Alluxio to see the gains.
  • Benchmark small‑file writes on both Lustre and HDFS.

In short, this paper suggests that Lustre can be a strong shared storage option for both supercomputing and big data tasks, especially when dealing with many small files, and it points the way to building simpler, faster data systems that don’t waste time moving data around.
