Intermediate Data Caching Optimization for Multi-Stage and Parallel Big Data Frameworks

Published 27 Apr 2018 in cs.PF | (1804.10563v2)

Abstract: In the era of big data and cloud computing, large amounts of data are generated from user applications and need to be processed in the datacenter. Data-parallel computing frameworks, such as Apache Spark, are widely used to perform such data processing at scale. Specifically, Spark leverages distributed memory to cache the intermediate results, represented as Resilient Distributed Datasets (RDDs). This gives Spark an advantage over other parallel frameworks for implementations of iterative machine learning and data mining algorithms, by avoiding repeated computation or hard disk accesses to retrieve RDDs. By default, caching decisions are left at the programmer's discretion, and the LRU policy is used for evicting RDDs when the cache is full. However, when the objective is to minimize total work, LRU is woefully inadequate, leading to arbitrarily suboptimal caching decisions. In this paper, we design an algorithm for multi-stage big data processing platforms to adaptively determine and cache the most valuable intermediate datasets that can be reused in the future. Our solution automates the decision of which RDDs to cache: this amounts to identifying nodes in a directed acyclic graph (DAG) representing computations whose outputs should persist in the memory. Our experimental results show that our proposed cache optimization solution can improve the performance of machine learning applications on Spark, decreasing the total work to recompute RDDs by 12%.



Summary

The paper presents an adaptive caching algorithm that leverages Spark's DAG structure to intelligently optimize intermediate data storage.
It formulates caching as a submodular optimization problem using greedy approximations to achieve near-optimal performance.
The strategy demonstrates up to a 70% improvement in cache hit ratios and reduced recomputation time through simulations and real-world tests.


Introduction

The paper addresses how to cache intermediate data in multi-stage, parallel big data frameworks such as Apache Spark. The aim is to automate caching decisions so that machine learning applications spend less total work recomputing Resilient Distributed Datasets (RDDs) than they do under traditional policies such as LRU and FIFO. The proposed adaptive caching algorithm decides which intermediate datasets to cache by exploiting the directed acyclic graph (DAG) structure of job execution plans (Figure 1).

Figure 1: Job arrivals with computational overlaps demonstrated using DAGs in Spark.

Background and Motivation

Apache Spark uses RDDs to enable distributed in-memory caching, which speeds up iterative data processing by avoiding repeated computation and disk accesses. By default, however, the programmer chooses which RDDs to cache, and eviction follows LRU when the cache is full. LRU looks only at access recency, so it ignores the dependencies between operations that the job DAG encodes (Figure 2).

Figure 2: Job DAG example illustrating precedence and parallel execution of operations.
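The gap between recency-based eviction and a total-work objective can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the RDD names, the recomputation costs, the access trace, and the simple "pinned" policy used as a comparison point (it is not the paper's algorithm). The sketch counts total recomputation cost when one expensive RDD is interleaved with cheap ones.

```python
from collections import OrderedDict

def lru_total_work(trace, cost, capacity):
    """Total recomputation cost of `trace` under LRU with unit-size entries."""
    cache = OrderedDict()
    total = 0.0
    for rdd in trace:
        if rdd in cache:
            cache.move_to_end(rdd)      # hit: mark as most recently used
            continue
        total += cost[rdd]              # miss: pay the recomputation cost
        cache[rdd] = True
        if len(cache) > capacity:
            cache.popitem(last=False)   # evict the least recently used entry
    return total

def pinned_total_work(trace, cost, pinned):
    """Total cost when a fixed set stays cached after its first computation."""
    total = 0.0
    computed = set()
    for rdd in trace:
        if rdd in pinned and rdd in computed:
            continue                    # hit on a pinned entry
        total += cost[rdd]
        computed.add(rdd)
    return total

# Hypothetical workload: one expensive RDD interleaved with two cheap ones.
COST = {"big": 100.0, "s1": 1.0, "s2": 1.0}
TRACE = ["big", "s1", "s2", "big", "s1", "s2", "big"]

lru_work = lru_total_work(TRACE, COST, capacity=2)
pinned_work = pinned_total_work(TRACE, COST, pinned={"big", "s1"})
print(lru_work, pinned_work)
```

On this made-up trace, LRU evicts `big` just before each reuse and pays 304 units of work, while keeping `big` and `s1` in the same two slots costs 103.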

Algorithm Design

The proposed caching strategy uses complete knowledge of Spark's job DAGs to make caching decisions. RDDs are selectively cached based on their recomputation cost, size, and reuse frequency. This differs from traditional cache algorithms, which operate without insight into computational dependencies or future access patterns. The key idea is to treat the caching decision as a submodular optimization problem, which greedy algorithms can approximate with probabilistic performance guarantees, coming within a $1 - \frac{1}{e}$ factor of the optimal solution.
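To make the greedy idea concrete, the following is a minimal Python sketch, not the paper's algorithm. It assumes a hypothetical four-node DAG (`PARENTS`), invented per-node costs and sizes (`COST`, `SIZE`), and two jobs with invented request rates (`JOBS`). It repeatedly adds the node with the largest reduction in expected work per unit of memory until nothing else fits. This density-based rule is a common heuristic for budgeted selection; whether it attains the $1 - \frac{1}{e}$ bound depends on the exact constraint and algorithm variant, which the sketch does not try to reproduce.

```python
# Hypothetical toy DAG: node -> parents it is computed from.
PARENTS = {"a": [], "b": ["a"], "c": ["b"], "d": ["b"]}
COST = {"a": 10.0, "b": 5.0, "c": 1.0, "d": 2.0}   # assumed compute costs
SIZE = {"a": 4, "b": 2, "c": 3, "d": 3}            # assumed memory footprints
JOBS = [("c", 3.0), ("d", 1.0)]                    # (sink node, request rate)

def work(sink, cached):
    """Cost to produce `sink` when the nodes in `cached` are already in memory."""
    total, stack, seen = 0.0, [sink], set()
    while stack:
        node = stack.pop()
        if node in seen or node in cached:
            continue                    # a cached node cuts off its ancestors
        seen.add(node)
        total += COST[node]
        stack.extend(PARENTS[node])
    return total

def expected_work(cached):
    """Rate-weighted recomputation work over all jobs."""
    return sum(rate * work(sink, cached) for sink, rate in JOBS)

def greedy_cache(capacity):
    """Add the node with the best work reduction per unit size until none fits."""
    cached, used = set(), 0
    while True:
        base = expected_work(cached)
        best, best_ratio = None, 0.0
        for node in PARENTS:
            if node in cached or used + SIZE[node] > capacity:
                continue
            gain = base - expected_work(cached | {node})
            ratio = gain / SIZE[node]
            if ratio > best_ratio:
                best, best_ratio = node, ratio
        if best is None:
            return cached
        cached.add(best)
        used += SIZE[best]

print(greedy_cache(2), expected_work(greedy_cache(2)))
```

With these made-up numbers, a budget of 2 memory units selects only `b`, the shared parent of both jobs, which cuts expected work from 65 to 5.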

Performance Evaluation

Through numerical analysis, simulation, and an implementation in Apache Spark, the proposed algorithm is shown to outperform conventional policies on cache hit ratio and recomputation time (Figure 3). The algorithm adjusts to varying workload patterns and uses the available memory efficiently, achieving up to a 70% improvement in cache hit ratio. For machine learning applications on Spark, the abstract reports a 12% decrease in the total work needed to recompute RDDs.

Figure 3: Hit ratio, access number, and total work makespan results of large-scale simulation experiments.

Implementation Considerations

Implementing this caching strategy involves extending Spark’s internal cache management to track the cost implications of caching particular RDDs. This entails modifications to recognize computational overlaps across different jobs and dynamically adjust caching decisions based on real-time execution statistics (Figure 4). The architecture of the Spark Unified Memory Manager is adapted to integrate the proposed RDDCacheManager, which coordinates cache updates across distributed worker nodes in the cluster.

Figure 4: Architecture of Apache Spark Unified Memory Manager with integrated adaptive caching mechanisms.
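One way to picture "dynamically adjusting caching decisions based on real-time execution statistics" is the Python sketch below. It is purely illustrative: the class `CacheScoreTracker`, its methods, the smoothing factor `alpha`, and the score-per-size ranking are all hypothetical and are not Spark APIs or the paper's RDDCacheManager. The sketch keeps an exponentially weighted moving average of the recomputation time observed for each RDD and re-ranks candidates after every job.

```python
class CacheScoreTracker:
    """Hypothetical tracker of per-RDD recomputation cost (not a Spark API)."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha      # weight given to the newest observation
        self.score = {}         # rdd_id -> smoothed recomputation time

    def record_job(self, observed):
        """Blend in one job's observations (rdd_id -> seconds spent recomputing)."""
        for rdd in set(self.score) | set(observed):
            old = self.score.get(rdd, 0.0)
            new = observed.get(rdd, 0.0)    # RDDs not touched decay toward zero
            self.score[rdd] = (1 - self.alpha) * old + self.alpha * new

    def choose(self, sizes, capacity):
        """Pick RDDs by smoothed cost per unit size until the budget is used."""
        ranked = sorted(self.score,
                        key=lambda r: self.score[r] / sizes[r],
                        reverse=True)
        chosen, used = [], 0
        for rdd in ranked:
            if self.score[rdd] > 0 and used + sizes[rdd] <= capacity:
                chosen.append(rdd)
                used += sizes[rdd]
        return chosen

# Hypothetical observations from two consecutive jobs.
tracker = CacheScoreTracker(alpha=0.5)
tracker.record_job({"x": 8.0, "y": 2.0})
tracker.record_job({"y": 2.0})
selection = tracker.choose(sizes={"x": 4, "y": 1}, capacity=4)
print(tracker.score, selection)
```

In this example `x` is not reused by the second job, so its smoothed score decays and the smaller, still-active `y` is preferred when memory is tight. A real integration would also need to coordinate these decisions across worker nodes, which the sketch leaves out.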

Conclusion

The adaptive caching algorithm improves memory usage in big data frameworks by deciding which intermediate data to keep based on the job DAG. By replacing eviction policies such as LRU, which are suboptimal for minimizing total work, with a principled, graph-aware caching strategy, the algorithm improves the performance of data-parallel applications. Future work may extend these ideas to big data platforms beyond Spark and to additional computational models and workloads.