
A Large-scale Fine-grained Analysis of Packages in Open-Source Software Ecosystems

Published 17 Apr 2024 in cs.SE and cs.CR | (2404.11467v1)

Abstract: Package managers such as NPM, Maven, and PyPI play a pivotal role in open-source software (OSS) ecosystems, streamlining the distribution and management of various freely available packages. The fine-grained details within software packages can unveil potential risks within existing OSS ecosystems, offering valuable insights for detecting malicious packages. In this study, we undertake a large-scale empirical analysis focusing on fine-grained information (FGI): the metadata, static, and dynamic functions. Specifically, we investigate the FGI usage across a diverse set of 50,000+ legitimate and 1,000+ malicious packages. Based on this diverse data collection, we conducted a comparative analysis between legitimate and malicious packages. Our findings reveal that (1) malicious packages have less metadata content and utilize fewer static and dynamic functions than legitimate ones; (2) malicious packages demonstrate a higher tendency to invoke HTTP/URL functions as opposed to other application services, such as FTP or SMTP; (3) FGI serves as a distinguishable indicator between legitimate and malicious packages; and (4) one dimension in FGI has sufficient distinguishable capability to detect malicious packages, and combining all dimensions in FGI cannot significantly improve overall performance.


Summary

  • The paper's main contribution is showing that fine-grained information (FGI), namely metadata, static functions, and dynamic functions, distinguishes malicious from legitimate packages in OSS ecosystems.
  • The methodology combines metadata extraction with static and dynamic function analysis over more than 50,000 legitimate and more than 1,000 malicious packages across NPM, RubyGems, and PyPI.
  • Results show that Random Forest on the combined features achieves over 95% classification accuracy, while a single FGI dimension already separates the two classes well and combining all dimensions does not significantly improve performance.

A Comprehensive Examination of Package Fine-Grained Information in OSS Ecosystems

Introduction

The study "A Large-scale Fine-grained Analysis of Packages in Open-Source Software Ecosystems" (2404.11467) is an empirical investigation of the metadata and functional properties of open-source software (OSS) packages. Its aim is to characterize the differences between legitimate and malicious packages across major ecosystems such as NPM, RubyGems, and PyPI. By examining more than 50,000 legitimate and more than 1,000 malicious packages along three dimensions (metadata, static functions, and dynamic functions), the research shows how these dimensions can aid malware detection.

Fine-Grained Information in OSS Packages

The analysis rests on extracting and examining fine-grained information (FGI) from software packages. Metadata covers core details such as the package description, authorship, dependencies, and homepage. Static functions are the functions written into the package's source files, which often reveal typical application behavior. Dynamic functions are the operations observed when the package executes in a runtime environment, which show how the package interacts with other system components.

Figure 1: The OSS package extraction at the FGI level.
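To make the three FGI dimensions concrete, the following is a minimal sketch of a per-package record. The class name `PackageFGI`, its field names, and the sample values are hypothetical and chosen only for illustration; the paper's actual data schema is not given here.

```python
from dataclasses import dataclass, field


@dataclass
class PackageFGI:
    """Hypothetical container for one package's fine-grained information."""

    name: str
    ecosystem: str  # e.g. "npm", "pypi", "rubygems"
    # Metadata: description, authorship, dependencies, homepage, ...
    metadata: dict = field(default_factory=dict)
    # Static functions: call names found in the package's source files.
    static_functions: list = field(default_factory=list)
    # Dynamic functions: operations observed while the package runs.
    dynamic_functions: list = field(default_factory=list)

    def dimension_sizes(self) -> dict:
        """Count the entries recorded in each of the three FGI dimensions."""
        return {
            "metadata": len(self.metadata),
            "static": len(self.static_functions),
            "dynamic": len(self.dynamic_functions),
        }


# Invented sample values, not taken from the paper's dataset.
pkg = PackageFGI(
    name="example-pkg",
    ecosystem="pypi",
    metadata={"description": "A toy package", "author": "someone"},
    static_functions=["urllib.request.urlopen", "open"],
    dynamic_functions=["openat", "connect", "execve"],
)
sizes = pkg.dimension_sizes()
```

Keeping the three dimensions as separate fields makes it straightforward to compare them one at a time or in combination later on.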

Results and Interpretation

Metadata Analysis

The metadata assessment reveals clear differences between legitimate and malicious packages. Malicious packages typically have sparse metadata, with shorter descriptions and fewer listed maintainers or authors than their legitimate counterparts. For instance, in 80% of malicious packages the description contains fewer than 40 words.

Figure 2: The CDF of the package description length.
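A statistic such as "the share of descriptions with fewer than 40 words" is a single point on the empirical CDF of description length. The sketch below shows one way to compute it, assuming whitespace-separated word counting (the paper's exact tokenization is not stated here). The helper names and the sample descriptions are invented for illustration.

```python
def word_count(description: str) -> int:
    """Count whitespace-separated words; an empty description counts as 0."""
    return len(description.split())


def share_below(values, threshold):
    """Empirical CDF evaluated just below `threshold`: fraction of values < threshold."""
    if not values:
        raise ValueError("need at least one value")
    return sum(1 for v in values if v < threshold) / len(values)


# Invented sample descriptions, not taken from the paper's dataset.
descriptions = [
    "",
    "test",
    "A fast HTTP client",
    " ".join(["word"] * 45),
    " ".join(["word"] * 120),
]
lengths = [word_count(d) for d in descriptions]
share_short = share_below(lengths, 40)  # 3 of the 5 samples are under 40 words
```

Running the same computation separately on the legitimate and malicious sets and comparing the two shares is what a CDF plot like Figure 2 expresses visually.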

Homepage and versioning URLs further set legitimate packages apart: 91.4% include such links, most of which point to GitHub repositories. By contrast, malicious packages often lack these references, and some supply imitation or spurious URLs to mislead users.

Figure 3: The distribution of URL from software packages.
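A URL distribution like the one described above can be approximated by bucketing each package's homepage field by host. The following sketch uses only Python's standard `urllib.parse`. The category names and the rule for what counts as a GitHub link are assumptions made for this example, and the lookalike host uses the reserved `.example` domain.

```python
from collections import Counter
from urllib.parse import urlparse


def url_host_category(url):
    """Bucket a homepage value as 'missing', 'invalid', 'github', or 'other'."""
    url = (url or "").strip()
    if not url:
        return "missing"
    try:
        parsed = urlparse(url)
        host = parsed.hostname
    except ValueError:  # raised for some malformed network locations
        return "invalid"
    if parsed.scheme not in ("http", "https") or not host:
        return "invalid"
    # Compare the full host name so lookalike domains are not counted as GitHub.
    if host == "github.com" or host.endswith(".github.com"):
        return "github"
    return "other"


# Invented homepage values for illustration.
homepages = [
    "https://github.com/pallets/flask",
    "https://www.djangoproject.com/",
    None,
    "https://github.com.evil.example/pallets/flask",
    "not a url",
]
distribution = Counter(url_host_category(u) for u in homepages)
```

Matching on the parsed host rather than on a substring matters here, because a spurious URL can embed "github.com" inside an unrelated domain.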

Static Function Characteristics

Static function analysis shows distinct calling patterns in legitimate and malicious packages. Legitimate packages also make extensive use of HTTP and URL functions, but malicious packages show a markedly higher tendency to invoke HTTP/URL functions rather than other application services such as FTP or SMTP, which is a notable behavioral trait.

Figure 4: S_{same}: File-related functions.
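For Python packages, a static view of which application services a package calls can be obtained by parsing the source into a syntax tree without ever executing it. The sketch below uses the standard `ast` module. The mapping `CATEGORY_BY_ROOT` is an assumption made for this example rather than the paper's function taxonomy, and the sketch does not resolve import aliases such as `import smtplib as s`.

```python
import ast
from collections import Counter

# Assumed mapping from a call's top-level module name to a service category.
CATEGORY_BY_ROOT = {
    "requests": "http/url",
    "urllib": "http/url",
    "http": "http/url",
    "ftplib": "ftp",
    "smtplib": "smtp",
}


def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def static_call_categories(source):
    """Count call sites per category; the source is parsed, never executed."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name is not None:
                root = name.split(".")[0]
                counts[CATEGORY_BY_ROOT.get(root, "other")] += 1
    return counts


# Invented source snippet for illustration.
SAMPLE = '''
import requests, smtplib
requests.get("https://example.com/a")
requests.post("https://example.com/b", data={})
server = smtplib.SMTP("localhost")
print("done")
'''
counts = static_call_categories(SAMPLE)
```

Comparing the `http/url` count with the `ftp` and `smtp` counts per package gives a simple numeric form of the HTTP-versus-other-services tendency described above.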

Dynamic Function Insights

Dynamic function extraction, performed in sandboxed execution environments, shows that malicious packages typically execute fewer dynamic operations, which may help them evade detection. They nevertheless show a pronounced correlation between file-related and process-related functions, suggesting a deliberate pattern in how malicious activity is carried out.

Figure 5: The CDF of dynamic functions in the NPM ecosystem.
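One way to turn a sandbox run into dynamic function counts is to bucket traced system calls by type and then correlate the per-package counts. The sketch below assumes strace-style trace lines, which is an assumption about the log format rather than a detail given in this section. The `CATEGORY` table, the trace lines, and the per-package counts are all invented for illustration.

```python
import math
import re
from collections import Counter

# Matches an optional "[pid N] " or "N " prefix followed by "name(".
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z_0-9]*)\(")

# Assumed, simplified grouping of Linux system calls (read/write also serve sockets).
CATEGORY = {
    "open": "file", "openat": "file", "read": "file", "write": "file", "unlink": "file",
    "execve": "process", "clone": "process", "fork": "process", "kill": "process",
    "socket": "network", "connect": "network", "sendto": "network",
}


def count_categories(trace_lines):
    """Count traced calls per category; non-syscall lines are skipped."""
    counts = Counter()
    for line in trace_lines:
        match = SYSCALL_RE.match(line)
        if match:
            counts[CATEGORY.get(match.group(1), "other")] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


# Invented trace lines for illustration.
TRACE = [
    'execve("/usr/bin/node", ["node", "install.js"], 0x7ffd /* 20 vars */) = 0',
    'openat(AT_FDCWD, "/etc/passwd", O_RDONLY) = 3',
    'read(3, "root:x:0:0", 4096) = 10',
    '[pid  4242] connect(4, {sa_family=AF_INET}, 16) = 0',
    '+++ exited with 0 +++',
]
counts = count_categories(TRACE)

# Invented per-package counts: file calls versus process calls.
r = pearson([2, 4, 6, 8], [1, 2, 3, 4])
```

A coefficient near 1 across many packages would correspond to the file and process correlation described above; the toy inputs here are perfectly linear only to keep the arithmetic easy to check.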

Malware Detection Application

The study evaluated how well machine learning models classify packages from the extracted FGI. Metadata proved a highly effective indicator, and adding static and dynamic function data barely improved detection accuracy, which underscores the strength of metadata alone as a discriminator. Classic models such as Random Forest performed best, achieving accuracy above 95% when all FGI dimensions were considered.
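The comparison between single dimensions and the combined feature set can be organized as a small ablation loop. The sketch below uses a nearest-centroid classifier written in plain Python as a stand-in; it is not Random Forest and not the paper's pipeline. The feature values are invented solely to exercise the code and say nothing about the paper's results.

```python
from collections import defaultdict

DIMENSIONS = ("metadata", "static", "dynamic")


def select(fgi, dims):
    """Concatenate the feature values of the chosen FGI dimensions."""
    return [value for dim in dims for value in fgi[dim]]


def fit_centroids(rows, dims):
    """Average the selected feature vectors per class label."""
    grouped = defaultdict(list)
    for fgi, label in rows:
        grouped[label].append(select(fgi, dims))
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(centroids, vector):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(vector, centroids[label])),
    )


def accuracy(train, test, dims):
    """Fit on `train` using only `dims`, then score on `test`."""
    centroids = fit_centroids(train, dims)
    hits = sum(predict(centroids, select(fgi, dims)) == label for fgi, label in test)
    return hits / len(test)


# Invented toy features; label 0 = legitimate, 1 = malicious.
TRAIN = [
    ({"metadata": [0.9], "static": [0.8], "dynamic": [0.7]}, 0),
    ({"metadata": [0.8], "static": [0.6], "dynamic": [0.9]}, 0),
    ({"metadata": [0.1], "static": [0.3], "dynamic": [0.2]}, 1),
    ({"metadata": [0.2], "static": [0.1], "dynamic": [0.3]}, 1),
]
TEST = [
    ({"metadata": [0.85], "static": [0.7], "dynamic": [0.6]}, 0),
    ({"metadata": [0.15], "static": [0.2], "dynamic": [0.1]}, 1),
]

results = {dim: accuracy(TRAIN, TEST, (dim,)) for dim in DIMENSIONS}
results["all"] = accuracy(TRAIN, TEST, DIMENSIONS)
```

The point of the structure is that the same train and test split is scored once per dimension and once for all dimensions together, so any gain from combining them shows up as a difference between `results["all"]` and the single-dimension entries.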

Conclusions and Future Directions

The research supports using FGI to identify malicious intent in OSS packages. The clear differences between legitimate and malicious packages in metadata and functional behavior can inform more targeted security monitoring and package validation in OSS ecosystems.

While the study reinforces FGI's role in malware identification, future research should address the dynamic nature of package behavior and expand the datasets to cover a wider range of ecosystems, improving the predictive power and reliability of these detection methods.
