Meta-Learning Reinforcement Learning for Crypto-Return Prediction

Published 11 Sep 2025 in cs.LG and cs.AI | (2509.09751v1)

Abstract: Predicting cryptocurrency returns is notoriously difficult: price movements are driven by a fast-shifting blend of on-chain activity, news flow, and social sentiment, while labeled training data are scarce and expensive. In this paper, we present Meta-RL-Crypto, a transformer-based architecture that unifies meta-learning and reinforcement learning (RL) to create a fully self-improving trading agent. Starting from a vanilla instruction-tuned LLM, the agent iteratively alternates between three roles (actor, judge, and meta-judge) in a closed-loop architecture. This learning process requires no additional human supervision and can leverage multimodal market inputs and internal preference feedback. The agent continuously refines both the trading policy and the evaluation criteria. Experiments across diverse market regimes demonstrate that Meta-RL-Crypto performs well on the technical indicators of the real market and outperforms other LLM-based baselines.

Summary

The paper presents a closed-loop system where a single LLM acts as Actor, Judge, and Meta-Judge to continuously improve crypto predictions.
It leverages multimodal data inputs and a multi-objective reward design to enhance trading strategies across various market regimes.
Empirical results demonstrate superior performance, particularly in bearish conditions, outperforming state-of-the-art models on key metrics.

The paper "Meta-Learning Reinforcement Learning for Crypto-Return Prediction" presents Meta-RL-Crypto, a framework for predicting cryptocurrency returns that integrates meta-learning with reinforcement learning (RL). It introduces a closed-loop system in which a single LLM operates as Actor, Judge, and Meta-Judge and improves itself continuously without human supervision. The system targets the core difficulty of cryptocurrency prediction, namely that prices are driven by fast-changing on-chain data, news, and social sentiment, and it uses a multi-objective reward design to improve trading strategies.

Meta-RL-Crypto Architecture

The Meta-RL-Crypto system employs a transformer-based architecture that cyclically performs the roles of Actor, Judge, and Meta-Judge. The Actor processes market signals from a multimodal input comprising on-chain metrics, news reports, and sentiment scores to generate cryptocurrency forecasts. Each forecast is evaluated by the Judge using a multi-dimensional reward vector, which includes metrics such as absolute returns, the Sharpe ratio, and sentiment alignment. The Meta-Judge refines these evaluations, ensuring preference consistency and preventing reward drift.

Figure 1: Overall Architecture of Meta-RL-Crypto, demonstrating the cyclical roles of Actor, Judge, and Meta-Judge in a closed-loop system.
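The control flow of one Actor, Judge, and Meta-Judge pass can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `LLM` callable interface, the `run_cycle` function, the scalar scores, and the deterministic `toy_llm` stand-in are all hypothetical, and the real system would use a multi-dimensional reward vector and an actual language model.

```python
from typing import Callable, Dict, List

# Hypothetical interface: one shared model, addressed under different roles.
LLM = Callable[[str, str], str]  # (role, prompt) -> generated text


def run_cycle(llm: LLM, market_context: str, n_candidates: int = 3) -> Dict[str, str]:
    """One Actor -> Judge -> Meta-Judge pass that yields a preference pair."""
    # Actor: propose several candidate forecasts for the same market context.
    forecasts: List[str] = [
        llm("actor", f"{market_context} | candidate {i}") for i in range(n_candidates)
    ]
    # Judge: assign each forecast a score (a single scalar here, for simplicity).
    scores = [float(llm("judge", f)) for f in forecasts]
    # Meta-Judge: review each (forecast, score) judgement and return an audited score.
    audited = [
        float(llm("meta_judge", f"{f} || {s}")) for f, s in zip(forecasts, scores)
    ]
    # Best and worst audited forecasts form the pair a preference optimizer would use.
    ranked = sorted(zip(audited, forecasts), reverse=True)
    return {"chosen": ranked[0][1], "rejected": ranked[-1][1]}


def toy_llm(role: str, prompt: str) -> str:
    """Deterministic stand-in for the LLM so the control flow can be run."""
    if role == "actor":
        idx = int(prompt.rsplit(" ", 1)[1])
        return f"BTC next-day return: +{idx}%"
    if role == "judge":
        return prompt.split("+")[1].rstrip("%")  # toy score = forecast size
    if role == "meta_judge":
        score = float(prompt.rsplit("||", 1)[1])
        return str(0.5 * score)  # toy audit: shrink scores to damp drift
    raise ValueError(f"unknown role: {role}")


pair = run_cycle(toy_llm, "BTC: active wallets up, news sentiment neutral")
print(pair)
```

In a real loop, the resulting chosen/rejected pair would feed a preference-based update of the shared model, after which the cycle repeats with the improved Actor and Judge.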

Data Collection and Reward Framework

Data Collection

The system integrates both on-chain and off-chain data sources to capture cryptocurrency market dynamics. On-chain metrics are sourced from platforms like CoinMarketCap and Dune Analytics, providing transaction data, active wallet counts, and network congestion indicators. Off-chain data is retrieved from the GNews API, focusing on high-credibility financial reports, which are sentiment-scored using a sentiment-aware LLM (Sentilm).
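One way to picture how the two data streams could be combined into a single Actor input is sketched below. The record types, field names (`tx_count`, `active_wallets`, `avg_fee_usd`), the sentiment range, and the prompt layout are assumptions made for illustration, and the values are made-up placeholders; the summary does not specify the actual schema or any API calls.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OnChainSnapshot:
    date: str            # ISO date, e.g. "2025-01-15"
    tx_count: int
    active_wallets: int
    avg_fee_usd: float   # assumed proxy for network congestion


@dataclass
class NewsItem:
    date: str
    headline: str
    sentiment: float     # assumed range: -1 (bearish) to +1 (bullish)


def build_actor_input(asset: str, snap: OnChainSnapshot, news: List[NewsItem]) -> str:
    """Join one day's on-chain metrics with that day's scored news items."""
    same_day = [n for n in news if n.date == snap.date]
    avg_sentiment = (
        sum(n.sentiment for n in same_day) / len(same_day) if same_day else 0.0
    )
    lines = [
        f"Asset: {asset} | Date: {snap.date}",
        f"On-chain: tx_count={snap.tx_count}, active_wallets={snap.active_wallets}, "
        f"avg_fee_usd={snap.avg_fee_usd:.2f}",
        f"News sentiment (mean of {len(same_day)} items): {avg_sentiment:+.2f}",
    ]
    lines += [f"- {n.headline} ({n.sentiment:+.2f})" for n in same_day]
    return "\n".join(lines)


# Placeholder values, not real market data.
snap = OnChainSnapshot("2025-01-15", 350_000, 900_000, 1.8)
news = [
    NewsItem("2025-01-15", "Placeholder headline A", 0.6),
    NewsItem("2025-01-15", "Placeholder headline B", -0.2),
    NewsItem("2025-01-14", "Placeholder headline from the previous day", 0.9),
]
prompt = build_actor_input("BTC", snap, news)
print(prompt)
```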

Reward Construction and Aggregation

The framework constructs multiple reward channels covering financial performance, risk, market impact, and sentiment utilization: Return-Based Reward, Risk-Adjusted Reward, Drawdown Reward, Liquidity Reward, and Sentiment Alignment Reward. The Meta-Judge aggregates these signals and uses Generalized Preference-based Reinforcement Optimization (GPRO), a preference-based learning strategy, to optimize the Actor and Judge roles in a closed feedback loop.

Figure 2: Meta-RL-Crypto Architecture detailing the data processing cycle and role-specific contributions to model performance improvement.
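To make the five reward channels concrete, the sketch below computes one plausible version of each from a short series of daily returns. The formulas are assumptions chosen for illustration (the summary names the channels but not their definitions), and the fixed weighted sum in `aggregate` is a simple stand-in for the Meta-Judge's learned aggregation, not the paper's method. The names `reward_vector`, `order_share_of_volume`, and `predicted_direction` are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import Dict, List


def max_drawdown(returns: List[float]) -> float:
    """Largest peak-to-trough loss of the compounded equity curve (as a fraction)."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst


def reward_vector(
    returns: List[float],
    order_share_of_volume: float,  # assumed: trade size / market volume
    predicted_direction: int,      # assumed: +1 long, -1 short
    sentiment_score: float,        # assumed range: -1 to +1
) -> Dict[str, float]:
    vol = pstdev(returns)
    return {
        "return": math.prod(1.0 + r for r in returns) - 1.0,
        "risk_adjusted": mean(returns) / vol if vol > 0 else 0.0,  # Sharpe-like
        "drawdown": -max_drawdown(returns),         # penalize deep drawdowns
        "liquidity": -order_share_of_volume,        # penalize market impact
        "sentiment_alignment": predicted_direction * sentiment_score,
    }


def aggregate(rewards: Dict[str, float], weights: Dict[str, float]) -> float:
    """Fixed weighted sum; a placeholder for the Meta-Judge's aggregation."""
    return sum(weights[k] * v for k, v in rewards.items())


rewards = reward_vector([0.10, -0.50, 0.20], 0.01, +1, 0.4)
weights = {k: 1.0 for k in rewards}  # equal weights, purely illustrative
print(rewards)
print(aggregate(rewards, weights))
```

Keeping the channels separate until the final step is what allows a judge-like component to reweight them; collapsing them early would hide which objective a forecast failed on.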

Empirical Evaluation

Experiment Settings

Meta-RL-Crypto is evaluated on a real-market dataset covering BTC, ETH, and SOL across three distinct market regimes: bearish, sideways, and bullish. The experimental setup starts from a $1,000,000 portfolio that is dynamically rebalanced according to the Actor's position signals, and it reports total return, Sharpe ratio, and mean daily return.
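The three reported metrics can be computed from a price series and position signals along the following lines. This is a minimal sketch under assumptions the summary does not confirm: positions are treated as the fraction of equity exposed over the next day, transaction costs are ignored, the risk-free rate is zero, and the Sharpe ratio is annualized with 365 trading days. The function name `backtest` and the sample prices are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import Dict, List


def backtest(
    prices: List[float],
    positions: List[float],          # positions[t]: exposure held from day t to t+1
    initial_capital: float = 1_000_000.0,
    periods_per_year: int = 365,     # assumption: crypto trades every day
) -> Dict[str, float]:
    equity = initial_capital
    daily_returns: List[float] = []
    for t in range(len(prices) - 1):
        asset_return = prices[t + 1] / prices[t] - 1.0
        portfolio_return = positions[t] * asset_return  # no fees or slippage
        daily_returns.append(portfolio_return)
        equity *= 1.0 + portfolio_return
    sd = stdev(daily_returns) if len(daily_returns) > 1 else 0.0
    sharpe = mean(daily_returns) / sd * math.sqrt(periods_per_year) if sd > 0 else 0.0
    return {
        "final_equity": equity,
        "total_return": equity / initial_capital - 1.0,
        "daily_return_mean": mean(daily_returns),
        "sharpe_ratio": sharpe,
    }


# Illustrative prices only: fully invested on day 0, flat on day 1.
result = backtest(prices=[100.0, 110.0, 99.0], positions=[1.0, 0.0])
print(result)
```

In this toy run the portfolio captures the first day's gain and sits out the second day's drop, so total return and mean daily return are both positive.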

Performance Metrics and Outcomes

As shown in the comparative analysis, Meta-RL-Crypto outperforms state-of-the-art models, especially in bearish market conditions, achieving a superior Sharpe ratio and total return across all tested regimes. Its interpretability scores on Market Relevance, Risk-Awareness, and Adaptive Rationale also exceed those of other models, including GPT-4, which indicates that it can adapt to and reason about volatile market scenarios.

Conclusion

The Meta-RL-Crypto framework offers a robust self-improving system for cryptocurrency return prediction, integrating on-chain and off-chain data with an innovative reinforcement learning approach within a multi-objective reward design. The model effectively manages market complexity without the need for human-labeled data, demonstrating high adaptability and performance across dynamic market environments. Future research can explore extending this framework to other financial markets or integrating additional types of market signals to further enhance prediction accuracy and model robustness.