- The paper empirically evaluates 19 automated tools for Solidity smart contract vulnerability detection using a large, manually annotated line-level dataset.
- Analyzed tools show significant variability in detecting different vulnerability types, and LLM-based tools like ChatGPT-4o struggle with real-world contracts.
- Findings suggest combining specific tools (Conkas, Slither, Smartcheck) can detect up to 76.78% of vulnerabilities, emphasizing the need for multi-tool approaches.
The paper "An Empirical Analysis of Vulnerability Detection Tools for Solidity Smart Contracts Using Line Level Manually Annotated Vulnerabilities" presents an in-depth empirical evaluation of vulnerability detection tools for Solidity smart contracts. The study compares a range of automated analysis tools against a manually curated dataset that identifies vulnerabilities at the level of individual lines of code.
Overview
Smart contracts have become integral to blockchain platforms, where they execute business logic automatically. This automation, however, carries risk: vulnerabilities in deployed contract code can be exploited directly, and deployed contracts are typically immutable. Numerous tools have emerged to help developers and researchers identify such vulnerabilities. The paper evaluates the detection efficacy of 19 tools run through the SmartBugs 2.0 framework, excluding HoneyBadger because it targets honeypots rather than genuine vulnerabilities.
The authors constructed a comprehensive dataset by combining multiple sources, including the manually assessed SmartBugs Curated set and existing labels from the ZEUS dataset. The resulting dataset comprises 2,182 contracts with meticulous manual tagging of vulnerabilities at the line-of-code level, making it the largest dataset of its kind to date.
Methodological Insights
The study employs a careful evaluation procedure. Each smart contract's vulnerabilities are labeled manually by multiple evaluators, who must reach consensus on which code lines are vulnerable. The experimental setup executes the SmartBugs tools in both bytecode and source-code analysis modes, subject to tool-specific execution constraints such as memory allocation and process limits.
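The line-level comparison implied by this setup can be sketched as follows. The data structures and the exact-line matching criterion are illustrative assumptions for this sketch, not the paper's precise protocol:

```python
# Hypothetical sketch: score one tool's findings against manual line-level
# annotations. A finding counts as a true positive if the reported line
# appears in the set of annotated vulnerable lines for that contract
# (an assumed matching criterion).

def score_tool(findings, annotations):
    """findings: list of (contract, line) pairs reported by a tool.
    annotations: dict mapping contract name -> set of vulnerable line numbers."""
    tp = sum(1 for contract, line in findings
             if line in annotations.get(contract, set()))
    fp = len(findings) - tp
    total = sum(len(lines) for lines in annotations.values())
    recall = tp / total if total else 0.0
    return tp, fp, recall

# Invented example data for illustration only.
annotations = {"Wallet.sol": {42, 57}, "Token.sol": {13}}
findings = [("Wallet.sol", 42), ("Wallet.sol", 99), ("Token.sol", 13)]
tp, fp, recall = score_tool(findings, annotations)
```

A real harness would also need to handle range-based annotations and near-miss reporting (a tool flagging the line adjacent to the annotated one), which the paper's manual labeling is designed to disambiguate.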
Vulnerability categories are based on the established DASP TOP 10 taxonomy, with the evaluation highlighting discrepancies in tool effectiveness across the different vulnerability classes. The manual annotations provide ground truth against which the tools' efficacy is measured.
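Comparing tools across a shared taxonomy requires normalizing each tool's own finding labels to DASP TOP 10 categories. A minimal sketch of such a mapping step follows; the label strings and mapping entries here are invented examples, not the paper's actual tables:

```python
# Illustrative sketch: map tool-specific finding labels onto DASP TOP 10
# categories so results from different tools become comparable.
# All entries below are hypothetical examples.

DASP_MAP = {
    "reentrancy-eth": "Reentrancy",
    "integer-overflow": "Arithmetic",
    "tx-origin": "Access Control",
    "unchecked-send": "Unchecked Low Level Calls",
}

def to_dasp(label):
    """Return the DASP category for a tool label, or 'Unknown' if unmapped."""
    return DASP_MAP.get(label, "Unknown")

category = to_dasp("integer-overflow")  # "Arithmetic"
```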
Key Findings
- Tool Efficacy Variability: The evaluated tools differ markedly in detection capability, and no single tool effectively detects all vulnerability classes in the DASP TOP 10 taxonomy. Osiris was strong at identifying arithmetic vulnerabilities, while Smartcheck performed well on Denial of Service vulnerabilities. Slither showed high performance in the era after Solidity 0.8.0 introduced default overflow checks, reflecting advances in its capabilities since earlier studies.
- LLM-Based Detection: Evaluation of ChatGPT-4o revealed a marked gap in detection effectiveness between benchmark contracts from SmartBugs Curated and real-world contracts. While the model located vulnerabilities effectively in the curated benchmark, its performance dropped sharply on real Ethereum contracts, suggesting it may have overfit to widely circulated, publicly annotated contracts present in its training data.
- Optimal Tool Combinations: Using a clustering strategy based on detection characteristics, the authors identify a combination of three tools (Conkas, Slither, and Smartcheck) that detects up to 76.78% of the line-level annotated vulnerabilities, a substantial improvement over combinations suggested by earlier studies.
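The combination finding above can be illustrated with a small union-coverage sketch. The tool names and detection sets are invented stand-ins, and greedy selection is just one plausible way to search for complementary tools (the paper uses a clustering strategy):

```python
# Hypothetical sketch: pick k tools whose combined (union) detections cover
# the most annotated vulnerabilities. Detection sets are invented examples.

def best_combination(tool_detections, k):
    """Greedily select k tools maximizing union coverage.
    tool_detections: dict mapping tool name -> set of detected vulnerability ids."""
    chosen, covered = [], set()
    for _ in range(k):
        # Pick the tool adding the most not-yet-covered detections.
        tool = max(tool_detections,
                   key=lambda t: -1 if t in chosen
                   else len(tool_detections[t] - covered))
        chosen.append(tool)
        covered |= tool_detections[tool]
    return chosen, covered

tools = {
    "A": {1, 2, 3, 4},   # broad coverage
    "B": {3, 4, 5},      # mostly overlaps A
    "C": {6, 7},         # complements A
}
combo, covered = best_combination(tools, 2)
# Greedy picks A first, then C, since C adds more new detections than B.
```

The point of the sketch is that raw per-tool detection counts are misleading: a weaker tool with complementary coverage (here "C") can beat a stronger but overlapping one ("B") when tools are combined.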
Implications
Practically, the findings underscore the need for multiple complementary tools to improve vulnerability detection in smart contracts. The variability among automated tools also reinforces the continued importance of manual code analysis in robust security assessments. The study points to an urgent need to reduce false-positive rates and broaden detection coverage in existing tools, potentially through hybrid approaches that incorporate machine learning.
Theoretically, the work emphasizes the value of fine-grained data in vulnerability detection, as illustrated by a survey indicating that developers prefer line-level vulnerability reports. This insight could shape future research and tool development, suggesting pathways for combining LLM capabilities with traditional methods.
Future Developments
Given the limitations of existing tools and approaches, future research should focus on reducing false-positive rates and improving detection across all vulnerability types. The paper also highlights the potential of AI-based and explicit rule-driven approaches as complementary forces in vulnerability detection. Furthermore, probing how publicly available datasets influence model training and the resulting detection performance offers a rich vein of inquiry for both academia and industry.
In conclusion, this study represents a comprehensive assessment of current methodologies in smart contract analysis and opens avenues for refined detection techniques, leveraging both manual expertise and automated tools in concert.