Harbor Risk Score (HRS)
- Harbor Risk Score (HRS) is a discrete assessment tool that consolidates multimodal behavioral, physiological, and self-reported data into a seven-point ordinal scale.
- It categorizes clinical states from severe depression (-3) to severe mania (+3), providing clear and actionable insights for psychiatric risk monitoring.
- HRS leverages a 20-billion–parameter transformer with supervised and reinforcement learning to ensure robust temporal prediction and calibration in clinical settings.
The Harbor Risk Score (HRS) is a discrete, clinically interpretable mood- and risk-assessment scale designed for structured behavioral healthcare applications. Developed as the core outcome measure for the HARBOR LLM, HRS consolidates multimodal patient data—spanning behavioral, physiological, and self-reported mental health signals—into a seven-point ordinal scale ranging from –3 (severe depression) to +3 (severe mania). Analogous in its clinical utility to physiological scoring systems such as NEWS2, HRS provides a unified categorical label for psychiatric risk and mood monitoring based on data-driven inference (Siddhant, 21 Dec 2025).
1. Definition and Clinical Interpretation
The HRS uses a Likert-style integer scale in which each class corresponds to a specific categorical mood and functional state:
| Class (c) | Description | Functional Annotation |
|---|---|---|
| –3 | Severe depression | Unable to function or work |
| –2 | Moderate depression | Significant impairment |
| –1 | Mild depressive symptoms | Mild impairment |
| 0 | Neutral/stable mood | No impairment |
| +1 | Mildly elevated mood | Mildly impaired/judgment intact |
| +2 | Moderate mania/hypomania | Impaired judgment or functioning |
| +3 | Severe mania | Unable to function or work |
HRS is intended for longitudinal mood and risk tracking in clinical psychiatry, aggregating heterogeneous signal streams into a low-cardinality score suitable for both pointwise risk evaluation and temporal trend analysis.
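The seven-point scale above can be captured as a simple lookup table. The sketch below is purely illustrative (the names `HRS_SCALE` and `describe_hrs` are hypothetical, not from the source), mapping each HRS class to its mood description and functional annotation:

```python
# Illustrative lookup table for the seven-point HRS scale; entries follow
# the table above, identifiers are hypothetical.
HRS_SCALE = {
    -3: ("Severe depression", "Unable to function or work"),
    -2: ("Moderate depression", "Significant impairment"),
    -1: ("Mild depressive symptoms", "Mild impairment"),
     0: ("Neutral/stable mood", "No impairment"),
    +1: ("Mildly elevated mood", "Mildly impaired/judgment intact"),
    +2: ("Moderate mania/hypomania", "Impaired judgment or functioning"),
    +3: ("Severe mania", "Unable to function or work"),
}

def describe_hrs(score: int) -> str:
    """Return a human-readable label for an HRS class."""
    mood, function = HRS_SCALE[score]
    return f"HRS {score:+d}: {mood} ({function})"
```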
2. Mathematical Formulation and Model Interface
HARBOR operationalizes HRS prediction using a 20-billion–parameter GPT-style transformer, architected as a sequence-model–based multi-class classifier over seven discrete states. The scoring pipeline operates as follows:
- The standardized input vector $x$ (see Section 3) is mapped via the transformer to a hidden representation $h$.
- Class logits are computed: $z_c = w_c^\top h + b_c$ for $c \in \{-3, \dots, +3\}$.
- Probabilities are assigned via softmax: $p(c \mid x) = \exp(z_c) / \sum_{c'} \exp(z_{c'})$.
- The final prediction is either $\hat{c} = \arg\max_c \, p(c \mid x)$ (class) or $\mathbb{E}[c] = \sum_c c \, p(c \mid x)$ (continuous expectation).
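The scoring head described above can be sketched in a few lines. This is an illustrative reimplementation of the softmax-over-seven-classes reduction, not HARBOR's actual code:

```python
import math

# Illustrative sketch of the HRS scoring head: class logits are converted
# to probabilities via softmax, then reduced to either the argmax class or
# the continuous expectation E[c].
CLASSES = list(range(-3, 4))  # c in {-3, ..., +3}

def softmax(logits):
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hrs_predictions(logits):
    probs = softmax(logits)
    hard = CLASSES[max(range(len(probs)), key=probs.__getitem__)]  # argmax class
    soft = sum(c * p for c, p in zip(CLASSES, probs))              # E[c]
    return hard, soft
```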
Training proceeds via supervised fine-tuning (cross-entropy, $\mathcal{L}_{\text{CE}}$) and reinforcement learning (policy gradient on a calibration reward, $\mathcal{L}_{\text{RL}}$). Overall objective: $\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda \, \mathcal{L}_{\text{RL}}$, optimized with AdamW and learning-rate warmup.
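The combined objective can be sketched as follows. The weighting coefficient `lam` and the helper names are assumptions for illustration; the source states only that a cross-entropy term and an RL calibration term are jointly optimized:

```python
import math

# Hedged sketch of a combined objective of the form L_CE + lambda * L_RL;
# lam is an assumed hyperparameter, not a value from the source.
def cross_entropy(probs, true_class, classes=tuple(range(-3, 4))):
    """Negative log-likelihood of the true HRS class."""
    return -math.log(probs[classes.index(true_class)])

def combined_loss(probs, true_class, rl_loss, lam=0.1):
    return cross_entropy(probs, true_class) + lam * rl_loss
```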
3. Input Modalities and Feature Engineering
HRS predictions utilize monthly longitudinal records from the PEARL dataset, each entry comprising 16 strictly defined features:
- Sleep minutes
- Caloric intake (kcal)
- Calories burned (kcal)
- Step count
- Blood-glucose level
- Vitamin D level
- Cholesterol
- Thyroid-stimulating hormone (TSH)
- Weight
- Body-fat percentage
- Number of photos taken
- Location entropy
- Monthly expenses normalized by income
- PHQ-9 score
- GAD-7 score
- Timestamp (month index)
Feature engineering consists of standardization to zero mean and unit variance (from training-split statistics), serialization as "feature_name:value" tokens or embedding via feature-type encoders, and explicit timestamp encoding (token or positional). This facilitates temporal reasoning and chain-of-thought mechanisms within the transformer.
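The standardization and token-serialization steps can be sketched as below. The feature names mirror the PEARL list, but the statistics and helper names are toy assumptions, not values from the source:

```python
# Illustrative feature pipeline: z-score standardization using training-split
# statistics, then serialization into "feature_name:value" tokens.
TRAIN_STATS = {                 # (mean, std) fitted on the training split (toy values)
    "sleep_minutes": (420.0, 60.0),
    "phq9": (9.0, 5.0),
}

def standardize(name, value):
    mean, std = TRAIN_STATS[name]
    return (value - mean) / std

def serialize(record):
    """Render one monthly record as space-separated 'feature_name:value' tokens."""
    return " ".join(f"{k}:{standardize(k, v):.2f}" for k, v in record.items())
```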
4. Modeling Paradigm and Comparative Baselines
HARBOR’s modeling strategy is characterized by domain-adapted language modeling, supervised Q–A fine-tuning on structured clinical guidelines, reinforcement learning with calibration-oriented rewards, and self-taught reasoning (STaR) for tabular data interpretation.
Baseline comparisons used proprietary LLMs (e.g., GPT-5.2, Claude 4.5 Sonnet) in zero- and few-shot regimes, fed identical serialized inputs but lacking domain adaptation, calibration, and STaR. Classical baselines—L1/L2-regularized logistic regression and random forests—operated directly on standardized feature vectors but exhibited limited representational power and no temporal continuity.
Key architectural distinctions include:
- Enhanced domain adaptation (psychiatric corpus mid-training)
- Structured symptom interpretation via Q–A pairs
- Consistent categorical calibration enforced by custom RL rewards
- Temporal continuity via in-context evidence and chain-of-thought ablations
5. Training, Inference, and Temporal Reasoning
Training is staged as follows:
- Mid-training on unlabeled behavioral-health text corpus using language-modeling objectives.
- Supervised fine-tuning: cross-entropy minimization of $\mathcal{L}_{\text{CE}}$ over seven-class HRS labels.
- RL optimization (e.g., PPO): a custom reward penalizes deviations $|\hat{c} - c^*|$ between predicted and true classes and encourages calibration on the full scale.
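A calibration-oriented reward of the kind described can be sketched minimally. The linear penalty below is an assumption; the source does not specify HARBOR's exact reward shaping:

```python
# Hedged sketch of a calibration reward: penalize the absolute deviation
# between the predicted and true HRS class (linear penalty is an assumption).
def calibration_reward(c_pred: int, c_true: int) -> float:
    return -abs(c_pred - c_true)
```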
Inference admits both independent pointwise and temporally-conditioned evaluation. Temporal reasoning is enhanced via STaR and in-context techniques (prior month’s features/labels), yielding improved forward prediction horizons (next month, three months ahead). Evaluation ablations demonstrate robust retention of predictive signal across extended horizons in contrast to baseline degradation.
6. Empirical Validation and Robustness
HARBOR demonstrates superior performance in structured HRS prediction under diverse conditions:
| Split/Evaluation | HARBOR Accuracy | LogReg Accuracy | Proprietary LLM Accuracy |
|---|---|---|---|
| Default random (t₀) | 0.69 | 0.54 | 0.29 |
| One-month ahead (t₋₁) | 0.61 | 0.46 | 0.26 |
| Three months ahead (t₋₃) | 0.52 | 0.38 | 0.23 |
Further, HARBOR achieves macro-F1 of 0.63 (vs. 0.33 for logistic regression/random forest, 0.26 for top LLM), Pearson and Spearman correlations of 0.91 (both), substantiating both discriminative accuracy and rank-order calibration.
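Agreement metrics of this kind can be computed on one's own predictions with a few lines; the sketch below shows accuracy and Pearson correlation on toy data and is not the paper's evaluation code:

```python
import math

# Illustrative computation of accuracy and Pearson correlation for
# ordinal HRS predictions (toy helper functions, not from the source).
def accuracy(pred, true):
    return sum(p == t for p, t in zip(pred, true)) / len(true)

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```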
Robustness is evident across ablations:
- In-context: 0-shot to 72-shot, HARBOR rises from 0.69 to 0.72 (vs. LLMs Δ ∼0.14)
- Inference mode: HARBOR maintains 0.69–0.70
- Aggregation (5 samples, majority-vote): 0.72 for HARBOR (LLMs ∼0.33)
- Split strategy: time-based 0.60 vs. 0.45 (LogReg), patient-based 0.56 vs. 0.41
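The 5-sample majority-vote aggregation in the ablation above can be sketched as follows; the tie-breaking behavior (first-seen class wins) is an assumption:

```python
from collections import Counter

# Illustrative majority-vote aggregation: sample the model several times
# and take the most common HRS class.
def majority_vote(samples):
    return Counter(samples).most_common(1)[0][0]
```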
A plausible implication is that domain-specific adaptation, reinforcement learning for calibration, and chain-of-thought self-taught reasoning confer substantial advantages over both classical machine learning and general-purpose LLMs for psychiatric risk scoring.
7. Significance and Future Directions
HRS operationalizes interpretable psychiatric risk assessment, demonstrating the feasibility of integrating multimodal data streams into a unified ordinal scale for real-world clinical monitoring. The model’s robust empirical performance and flexibility in temporal reasoning suggest strong potential for translational deployment in behavioral health contexts. While initial validation is constrained to the PEARL dataset and a limited patient cohort, future extensions might involve broader population studies, additional modalities, and continuous risk modeling. The paradigm established by HARBOR—linking discrete risk scores to domain-adapted, calibration-aware, and self-explanatory LLM predictions—may inform both the design and evaluation of subsequent behavioral healthcare models (Siddhant, 21 Dec 2025).