
BPF-oF: Storage Function Pushdown Over the Network

Published 11 Dec 2023 in cs.OS (arXiv:2312.06808v1)

Abstract: Storage disaggregation, wherein storage is accessed over the network, is popular because it allows applications to independently scale storage capacity and bandwidth based on dynamic application demand. However, the added network processing introduced by disaggregation can consume significant CPU resources. In many storage systems, logical storage operations (e.g., lookups, aggregations) involve a series of simple but dependent I/O access patterns. Therefore, one way to reduce the network processing overhead is to execute dependent series of I/O accesses at the remote storage server, reducing the back-and-forth communication between the storage layer and the application. We refer to this approach as \emph{remote-storage pushdown}. We present BPF-oF, a new remote-storage pushdown protocol built on top of NVMe-oF, which enables applications to safely push custom eBPF storage functions to a remote storage server. The main challenge in integrating BPF-oF with storage systems is preserving the benefits of their client-based in-memory caches. We address this challenge by designing novel caching techniques for storage pushdown, including splitting queries into separate in-memory and remote-storage phases and periodically refreshing the client cache with sampled accesses from the remote storage device. We demonstrate the utility of BPF-oF by integrating it with three storage systems, including RocksDB, a popular persistent key-value store that has no existing storage pushdown capability. We show BPF-oF provides significant speedups in all three systems when accessed over the network, for example improving RocksDB's throughput by up to 2.8$\times$ and tail latency by up to 2.6$\times$.


Summary

  • The paper introduces a protocol that uses eBPF to execute custom storage functions on remote servers, mitigating network overhead and reducing CPU load.
  • The methodology splits I/O operations into in-memory cache lookups and disk transactions while ensuring metadata synchronization for safe remote execution.
  • Performance tests demonstrate up to 2.8× throughput improvement, 37% fewer CPU cycles, and a 23% reduction in network traffic across storage systems.


Introduction

"BPF-oF: Storage Function Pushdown Over the Network" introduces a protocol, BPF-oF, that reduces CPU usage in storage-disaggregated environments by using eBPF to execute custom storage functions on remote storage servers. The core idea is to mitigate the network processing overhead of disaggregation by shifting chains of dependent I/O operations closer to the storage layer. This both reduces back-and-forth communication between client and server and aligns with the growing deployment of NVMe-oF (NVMe over Fabrics) in data centers (Figure 1).

Figure 1: NVMe-oF overview.

Motivation and Challenges

Storage disaggregation allows applications to scale compute and storage resources independently, promoting flexibility and efficiency. However, it burdens systems with additional network latency and CPU processing costs; this is especially pronounced in NVMe-oF deployments over TCP, which can halve throughput compared to local disk access.

BPF-oF addresses the problem by enabling safe execution of eBPF functions on remote servers, effectively grouping multiple I/O operations into single transactions executed closer to the data. This requires overcoming several challenges:

  • Versioning and Metadata Synchronization: Ensuring that file-to-block mappings remain consistent between the client and server, particularly when inodes are updated.
  • Integration with In-Memory Caching: Striking a balance between leveraging in-memory data structures and executing storage operations remotely without losing the performance benefits those caches provide.
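The caching challenge above can be illustrated as a two-phase lookup. The following is a minimal, hypothetical Python sketch (the names `FakeRemote` and `lookup` are ours, not the paper's API): phase one consumes whatever index blocks the client's in-memory cache holds, and phase two ships a single pushdown request that resumes the remaining dependent I/O chain on the server side.

```python
class FakeRemote:
    """Stand-in for the remote storage server (illustrative only)."""
    def __init__(self, chain):
        self.chain = chain          # dependent blocks; the last holds values
        self.requests = 0

    def pushdown(self, key, start_level):
        # One network round trip, however many dependent reads remain.
        self.requests += 1
        return self.chain[-1][key]


def lookup(key, cache, remote):
    level = 0
    # Phase 1: in-memory -- follow cached index blocks as far as they go.
    while (level, key) in cache:
        block = cache[(level, key)]
        if block.get("leaf"):
            return block["value"]   # fully served from the client cache
        level += 1
    # Phase 2: remote pushdown -- resume the traversal server-side.
    return remote.pushdown(key, start_level=level)
```

With only the top index level cached, the whole traversal costs one network round trip instead of one per remaining level, which is the point of splitting queries into in-memory and remote-storage phases.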

System Design

BPF-oF operates on three key components:

  1. Metadata Synchronization: An asynchronous, version-based mechanism that keeps inode (file-to-block) mappings up to date, so that pushed functions never read through stale mappings.
  2. Query Splitting: Dividing operations into in-memory cache lookups and disk operations to optimize performance. This effectively reduces the disk I/O operations required on remote execution.
  3. eBPF Execution Framework: A programmable layer for running eBPF functions that interpret and act on received data, filtering and aggregating it before completing the query response (Figure 2).

    Figure 2: BPF-oF architecture.
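The metadata-synchronization component can be sketched as a version check. This is a hedged illustration under assumed names (`MappingServer` and `Client` are not from the paper): the client tags each pushdown request with the inode-mapping version it last synchronized, and the server rejects stale requests so a pushed function never operates on an outdated file-to-block mapping.

```python
class MappingServer:
    """Stand-in for the storage server's view of file metadata."""
    def __init__(self):
        self.version = 1
        self.extents = {"db.sst": [(0, 4096)]}    # file -> block extents

    def update_inode(self, name, extents):
        self.extents[name] = extents
        self.version += 1                          # invalidates cached mappings

    def pushdown(self, name, client_version):
        if client_version != self.version:
            return ("STALE", self.version)         # client must resynchronize
        return ("OK", self.extents[name])


class Client:
    def __init__(self, server):
        self.server = server
        self.version = server.version              # initial metadata sync

    def pushdown(self, name):
        status, payload = self.server.pushdown(name, self.version)
        if status == "STALE":
            self.version = payload                 # resync the version, retry
            status, payload = self.server.pushdown(name, self.version)
        return payload
```

The retry path models the asynchronous refresh: a stale client pays one extra round trip rather than risking an unsafe read through old extents.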

Performance Analysis

BPF-oF’s implementation within RocksDB, WiredTiger, and BPF-KV demonstrates significant performance gains. For instance, the RocksDB integration improves throughput by up to 2.8× and lowers tail latency by up to 2.6× across various read-intensive workloads. These improvements stem largely from eliminating redundant network round trips and reducing CPU processing.

Additionally, comparative analysis with NVMe-oF setups affirms BPF-oF's efficacy across network protocols (TCP/RDMA) and storage types (NAND/Optane SSDs):

  • NAND SSDs: BPF-oF provided 1.4-2.2× improvement over NVMe/TCP.
  • Optane SSDs: Enhanced throughput by 2-2.8×, underscoring BPF-oF's ability to exploit high-IOPS devices more efficiently.

Figure 3: RocksDB under BPF-oF vs. NVMe/TCP on NAND SSD: (a) uniform-read throughput-latency, and throughput (b) without and (c) with data blocks cached.

CPU and Network Considerations

One of BPF-oF's key advantages is reducing CPU load on both the client and the server, thus lowering overall power consumption. Under TCP setups, measurements show up to 37% fewer CPU cycles and 23% less network traffic per request, along with a 36% decrease in energy consumption per operation (Figure 4).

Figure 4: RocksDB throughput with the cache enabled under BPF-oF, for different sampling rates (TCP, NAND SSD).
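The sampling mechanism varied in the figure can be sketched roughly as follows (illustrative names only; the real protocol is more involved): the server samples a fraction of the blocks a pushed function touches and returns them alongside the result, so the client's cache keeps warming even though most reads are now served server-side.

```python
import random

def serve_pushdown(blocks_touched, sample_rate, rng=random):
    """Server side: return the result block plus a sampled subset of the
    blocks the pushed function read, for the client to cache."""
    sampled = [b for b in blocks_touched if rng.random() < sample_rate]
    return blocks_touched[-1], sampled

def refresh_cache(cache, sampled, capacity):
    """Client side: fold sampled blocks into the in-memory cache."""
    for blk in sampled:
        cache[blk["id"]] = blk
        if len(cache) > capacity:       # naive eviction; real caches use LRU
            cache.pop(next(iter(cache)))
```

The sampling rate trades extra network traffic for cache freshness, which is the tension the figure explores.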

Future Work

BPF-oF currently supports only the ext4 file system and relies on predetermined file accesses. Adding support for additional file systems, advanced storage configurations (such as RAID), and encrypted storage are key directions for future work. Further optimizing deployment on SmartNICs also offers an opportunity to harness hardware acceleration, advancing storage disaggregation in complex datacenter environments.

Conclusion

BPF-oF's architecture and implementation realize substantial efficiency improvements in storage-disaggregated environments, specifically through network and CPU optimizations. The protocol's design and results advocate for broader adoption of function pushdown in storage systems, and point to promising avenues for future advances in both software methodologies and hardware support.
