Harbor Risk Score (HRS)
- Harbor Risk Score (HRS) is a discrete assessment tool that consolidates multimodal behavioral, physiological, and self-reported data into a seven-point ordinal scale.
- It categorizes clinical states from severe depression (-3) to severe mania (+3), providing clear and actionable insights for psychiatric risk monitoring.
- HRS leverages a 20-billion–parameter transformer with supervised and reinforcement learning to ensure robust temporal prediction and calibration in clinical settings.
The Harbor Risk Score (HRS) is a discrete, clinically interpretable mood- and risk-assessment scale designed for structured behavioral healthcare applications. Developed as the core outcome measure for the HARBOR LLM, HRS consolidates multimodal patient data—spanning behavioral, physiological, and self-reported mental health signals—into a seven-point ordinal scale ranging from –3 (severe depression) to +3 (severe mania). Analogous in its clinical utility to physiological scoring systems such as NEWS2, HRS provides a unified categorical label for psychiatric risk and mood monitoring based on data-driven inference (Siddhant, 21 Dec 2025).
1. Definition and Clinical Interpretation
The HRS uses a Likert-style integer scale in which each class corresponds to a specific categorical mood and functional state:
| Class (c) | Description | Functional Annotation |
|---|---|---|
| –3 | Severe depression | Unable to function or work |
| –2 | Moderate depression | Significant impairment |
| –1 | Mild depressive symptoms | Mild impairment |
| 0 | Neutral/stable mood | No impairment |
| +1 | Mildly elevated mood | Mildly impaired/judgment intact |
| +2 | Moderate mania/hypomania | Impaired judgment or functioning |
| +3 | Severe mania | Unable to function or work |
HRS is intended for longitudinal mood and risk tracking in clinical psychiatry, aggregating heterogeneous signal streams into a low-cardinality score suitable for both pointwise risk evaluation and temporal trend analysis.
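The seven-point scale above can be captured as a simple lookup table. The sketch below is purely illustrative (the names `HRS_SCALE` and `describe_hrs` are hypothetical, not from the source), mapping each HRS class to its mood description and functional annotation:

```python
# Illustrative lookup table for the seven-point HRS scale; entries follow
# the table above, identifiers are hypothetical.
HRS_SCALE = {
    -3: ("Severe depression", "Unable to function or work"),
    -2: ("Moderate depression", "Significant impairment"),
    -1: ("Mild depressive symptoms", "Mild impairment"),
     0: ("Neutral/stable mood", "No impairment"),
    +1: ("Mildly elevated mood", "Mildly impaired/judgment intact"),
    +2: ("Moderate mania/hypomania", "Impaired judgment or functioning"),
    +3: ("Severe mania", "Unable to function or work"),
}

def describe_hrs(score: int) -> str:
    """Return a human-readable label for an HRS class."""
    mood, function = HRS_SCALE[score]
    return f"HRS {score:+d}: {mood} ({function})"
```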
2. Mathematical Formulation and Model Interface
HARBOR operationalizes HRS prediction using a 20-billion–parameter GPT-style transformer, architected as a sequence-model–based multi-class classifier over seven discrete states. The scoring pipeline operates as follows:
- The standardized input vector $x$ (see Section 3) is mapped via the transformer to a hidden representation $h$.
- Class logits are computed: $z_c = w_c^\top h + b_c$ for $c \in \{-3, \dots, +3\}$.
- Probabilities are assigned via softmax: $p(c \mid x) = \exp(z_c) / \sum_{c'} \exp(z_{c'})$.
- The final prediction is either $\hat{c} = \arg\max_c \, p(c \mid x)$ (class) or $\mathbb{E}[c] = \sum_c c \, p(c \mid x)$ (continuous expectation).
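The scoring head described above can be sketched in a few lines. This is an illustrative reimplementation of the softmax-over-seven-classes reduction, not HARBOR's actual code:

```python
import math

# Illustrative sketch of the HRS scoring head: class logits are converted
# to probabilities via softmax, then reduced to either the argmax class or
# the continuous expectation E[c].
CLASSES = list(range(-3, 4))  # c in {-3, ..., +3}

def softmax(logits):
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hrs_predictions(logits):
    probs = softmax(logits)
    hard = CLASSES[max(range(len(probs)), key=probs.__getitem__)]  # argmax class
    soft = sum(c * p for c, p in zip(CLASSES, probs))              # E[c]
    return hard, soft
```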
Training proceeds via supervised fine-tuning (cross-entropy, $\mathcal{L}_{\text{CE}}$) and reinforcement learning (policy gradient on a calibration reward, $\mathcal{L}_{\text{RL}}$). Overall objective: $\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda \, \mathcal{L}_{\text{RL}}$, optimized with AdamW and learning-rate warmup.
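The combined objective can be sketched as follows. The weighting coefficient `lam` and the helper names are assumptions for illustration; the source states only that a cross-entropy term and an RL calibration term are jointly optimized:

```python
import math

# Hedged sketch of a combined objective of the form L_CE + lambda * L_RL;
# lam is an assumed hyperparameter, not a value from the source.
def cross_entropy(probs, true_class, classes=tuple(range(-3, 4))):
    """Negative log-likelihood of the true HRS class."""
    return -math.log(probs[classes.index(true_class)])

def combined_loss(probs, true_class, rl_loss, lam=0.1):
    return cross_entropy(probs, true_class) + lam * rl_loss
```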
3. Input Modalities and Feature Engineering
HRS predictions utilize monthly longitudinal records from the PEARL dataset, each entry comprising 16 strictly defined features:
- Sleep minutes
- Caloric intake (kcal)
- Calories burned (kcal)
- Step count
- Blood-glucose level
- Vitamin D level
- Cholesterol
- Thyroid-stimulating hormone (TSH)
- Weight
- Body-fat percentage
- Number of photos taken
- Location entropy
- Monthly expenses normalized by income
- PHQ-9 score
- GAD-7 score
- Timestamp (month index)
Feature engineering consists of standardization to zero mean and unit variance (from training-split statistics), serialization as "feature_name:value" tokens or embedding via feature-type encoders, and explicit timestamp encoding (token or positional). This facilitates temporal reasoning and chain-of-thought mechanisms within the transformer.
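The standardization and token-serialization steps can be sketched as below. The feature names mirror the PEARL list, but the statistics and helper names are toy assumptions, not values from the source:

```python
# Illustrative feature pipeline: z-score standardization using training-split
# statistics, then serialization into "feature_name:value" tokens.
TRAIN_STATS = {                 # (mean, std) fitted on the training split (toy values)
    "sleep_minutes": (420.0, 60.0),
    "phq9": (9.0, 5.0),
}

def standardize(name, value):
    mean, std = TRAIN_STATS[name]
    return (value - mean) / std

def serialize(record):
    """Render one monthly record as space-separated 'feature_name:value' tokens."""
    return " ".join(f"{k}:{standardize(k, v):.2f}" for k, v in record.items())
```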
4. Modeling Paradigm and Comparative Baselines
HARBOR’s modeling strategy is characterized by domain-adapted language modeling, supervised Q–A fine-tuning on structured clinical guidelines, reinforcement learning with calibration-oriented rewards, and self-taught reasoning (STaR) for tabular data interpretation.
Baseline comparisons used proprietary LLMs (e.g., GPT-5.2, Claude 4.5 Sonnet) in zero- and few-shot regimes, fed identical serialized inputs but lacking domain adaptation, calibration, and STaR. Classical baselines—L1/L2-regularized logistic regression and random forests—operated directly on standardized feature vectors but exhibited limited representational power and no temporal continuity.
Key architectural distinctions include:
- Enhanced domain adaptation (psychiatric corpus mid-training)
- Structured symptom interpretation via Q–A pairs
- Consistent categorical calibration enforced by custom RL rewards
- Temporal continuity via in-context evidence and chain-of-thought ablations
5. Training, Inference, and Temporal Reasoning
Training is staged as follows:
- Mid-training on unlabeled behavioral-health text corpus using language-modeling objectives.
- Supervised fine-tuning: cross-entropy minimization of $\mathcal{L}_{\text{CE}}$ over seven-class HRS labels.
- RL optimization (e.g., PPO): a custom reward penalizes deviations $|\hat{c} - c^*|$ between predicted and true classes and encourages calibration on the full scale.
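A calibration-oriented reward of the kind described can be sketched minimally. The linear penalty below is an assumption; the source does not specify HARBOR's exact reward shaping:

```python
# Hedged sketch of a calibration reward: penalize the absolute deviation
# between the predicted and true HRS class (linear penalty is an assumption).
def calibration_reward(c_pred: int, c_true: int) -> float:
    return -abs(c_pred - c_true)
```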
Inference admits both independent pointwise and temporally-conditioned evaluation. Temporal reasoning is enhanced via STaR and in-context techniques (prior month’s features/labels), yielding improved forward prediction horizons (next month, three months ahead). Evaluation ablations demonstrate robust retention of predictive signal across extended horizons in contrast to baseline degradation.
6. Empirical Validation and Robustness
HARBOR demonstrates superior performance in structured HRS prediction under diverse conditions:
| Split/Evaluation | HARBOR Accuracy | LogReg Accuracy | Proprietary LLM Accuracy |
|---|---|---|---|
| Default random (t₀) | 0.69 | 0.54 | 0.29 |
| One-month ahead (t₋₁) | 0.61 | 0.46 | 0.26 |
| Three months ahead (t₋₃) | 0.52 | 0.38 | 0.23 |
Further, HARBOR achieves macro-F1 of 0.63 (vs. 0.33 for logistic regression/random forest, 0.26 for top LLM), Pearson and Spearman correlations of 0.91 (both), substantiating both discriminative accuracy and rank-order calibration.
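Agreement metrics of this kind can be computed on one's own predictions with a few lines; the sketch below shows accuracy and Pearson correlation on toy data and is not the paper's evaluation code:

```python
import math

# Illustrative computation of accuracy and Pearson correlation for
# ordinal HRS predictions (toy helper functions, not from the source).
def accuracy(pred, true):
    return sum(p == t for p, t in zip(pred, true)) / len(true)

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```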
Robustness is evident across ablations:
- In-context: 0-shot to 72-shot, HARBOR rises from 0.69 to 0.72 (vs. LLMs Δ ∼0.14)
- Inference mode: HARBOR maintains 0.69–0.70
- Aggregation (5 samples, majority-vote): 0.72 for HARBOR (LLMs ∼0.33)
- Split strategy: time-based 0.60 vs. 0.45 (LogReg), patient-based 0.56 vs. 0.41
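The 5-sample majority-vote aggregation in the ablation above can be sketched as follows; the tie-breaking behavior (first-seen class wins) is an assumption:

```python
from collections import Counter

# Illustrative majority-vote aggregation: sample the model several times
# and take the most common HRS class.
def majority_vote(samples):
    return Counter(samples).most_common(1)[0][0]
```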
A plausible implication is that domain-specific adaptation, reinforcement learning for calibration, and chain-of-thought self-taught reasoning confer substantial advantages over both classical machine learning and general-purpose LLMs for psychiatric risk scoring.
7. Significance and Future Directions
HRS operationalizes interpretable psychiatric risk assessment, demonstrating the feasibility of integrating multimodal data streams into a unified ordinal scale for real-world clinical monitoring. The model’s robust empirical performance and flexibility in temporal reasoning suggest strong potential for translational deployment in behavioral health contexts. While initial validation is constrained to the PEARL dataset and a limited patient cohort, future extensions might involve broader population studies, additional modalities, and continuous risk modeling. The paradigm established by HARBOR—linking discrete risk scores to domain-adapted, calibration-aware, and self-explanatory LLM predictions—may inform both the design and evaluation of subsequent behavioral healthcare models (Siddhant, 21 Dec 2025).