HemBLIP: VLM for Hematology Diagnostics
- HemBLIP is a family of vision-language models that integrates a Vision Transformer encoder with a Transformer decoder for accurate cell-morphology captioning and non-invasive haemoglobin estimation.
- It employs both full fine-tuning and LoRA adaptations, reducing trainable parameters by over 95% and improving caption BLEU scores from 0.24 to 0.27.
- The system supports clinical workflows with explainable outputs and mobile point-of-care screening, achieving an MAE of 0.85 g/dL for haemoglobin prediction.
HemBLIP denotes a family of vision-language models (VLMs) and mobile-health systems for interpretable hematological diagnostics, specifically designed to describe and quantify cell morphology for leukemia diagnosis and to enable non-invasive estimation of blood biomarkers such as haemoglobin. HemBLIP employs deep learning, structured expert annotation, and parameter-efficient adaptation techniques to deliver clinically relevant, explainable predictions, supporting both expert workflows and point-of-care applications (Logtestijn et al., 7 Jan 2026, Sarah et al., 2020).
1. Core Model Architecture and Adaptation
HemBLIP builds on the BLIP vision-language paradigm, integrating a Vision Transformer (ViT) as image encoder and a Transformer-based language decoder. The architecture is instantiated and fine-tuned for cell-level morphological captioning:
- Image Encoder: ViT as proposed by Dosovitskiy et al. (2020), mapping high-resolution microscopy images to a sequence of patch embeddings.
- Language Decoder: Transformer decoder generates token sequences corresponding to morphological descriptions.
Adaptation modalities include:
- Full Fine-Tuning: All parameters updated to specialize for cell description generation.
- LoRA Parameter-Efficient Adaptation: Applies Low-Rank Adaptation (LoRA), freezing the pretrained weights $W_0$ and updating only low-rank matrices $A$ and $B$ such that $W' = W_0 + BA$, significantly reducing trainable parameters (∼95–99%) and computational requirements.
Pseudocode:
```
# Full fine-tuning: all parameters updated
for epoch in range(N):
    for image, caption in Train:
        z = ViT(image)
        y_hat = Decoder(z)
        L = CrossEntropy(y_hat, caption)
        L.backward()
        optimizer.step()

# LoRA adaptation: encoder frozen, only low-rank matrices trained
for epoch in range(N):
    for image, caption in Train:
        z = ViT(image)            # encoder frozen
        y_hat = Decoder(z, A, B)  # only LoRA matrices active
        L = CrossEntropy(y_hat, caption)
        L.backward()
        optimizer.step()
```
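The low-rank update can be made concrete with a minimal NumPy sketch of a LoRA-adapted linear layer. This is illustrative, not the paper's implementation; the dimensions and rank below are hypothetical, and the zero initialization of `B` (so that $W' = W_0$ before training) follows common LoRA practice.

```python
import numpy as np

class LoRALinear:
    """Frozen dense layer W plus a trainable low-rank update B @ A (sketch).

    Only A (rank x d_in) and B (d_out x rank) would receive gradients during
    fine-tuning; W stays fixed. All sizes here are hypothetical.
    """
    def __init__(self, d_in, d_out, rank, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen pretrained weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01   # trainable
        self.B = np.zeros((d_out, rank))                    # trainable; zero-init so W' = W at start

    def __call__(self, x):
        # Effective weight W' = W + B @ A, applied without materializing W'
        return x @ self.W.T + (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size

layer = LoRALinear(d_in=768, d_out=768, rank=8)
full = layer.W.size              # 589,824 params if fully fine-tuned
lora = layer.trainable_params()  # 12,288 params with LoRA
print(f"trainable fraction: {lora / full:.2%}")  # ≈ 2% of the full layer
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen model exactly, and the low-rank product adds only `2 * rank * d` trainable parameters per layer.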
The MedGEMMA baseline employs a SigLIP vision tower with a medically aligned decoder for benchmarking (Logtestijn et al., 7 Jan 2026).
2. Dataset Construction and Annotation Protocols
HemBLIP leverages a combined morphology-rich dataset comprising 14,659 annotated peripheral blood cell images. The dataset consists of two main subsets:
- WBCAtt (Healthy): 7,037 expertly annotated white blood cells across five morphologies. Eleven categorical attributes are encoded, encompassing nuclear shape, chromatin texture, nucleoli, cytoplasmic indicators, granular features, cell size, and others.
- LeukemiaAttri (Leukemic): 7,622 cells spanning acute (ALL, AML, APML) and chronic (CLL, CML) leukemia types. Seven attributes, including Auer rods and nucleus-to-cytoplasm ratio, are specified.
Each image receives an expert-derived caption structured around a fixed protocol: “A [size] cell with [chromatin texture] chromatin, [nucleoli description], [cytoplasmic amount], [granularity], [diagnosis if obvious].” The vocabulary consists of ~120 tokens with mean caption length ≈18 words.
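The fixed caption protocol can be sketched as a small template function. This is a hypothetical illustration of the annotation structure; the dataset's exact controlled vocabulary and phrasing rules are not reproduced here.

```python
def build_caption(size, chromatin, nucleoli, cytoplasm, granularity, diagnosis=None):
    """Assemble a protocol-conformant caption from categorical attributes.

    Mirrors the fixed template: "A [size] cell with [chromatin texture] chromatin,
    [nucleoli description], [cytoplasmic amount], [granularity], [diagnosis if obvious]."
    Attribute values below are illustrative, not the dataset's exact vocabulary.
    """
    parts = [f"A {size} cell with {chromatin} chromatin",
             nucleoli, cytoplasm, granularity]
    if diagnosis:  # the diagnosis slot is optional ("if obvious")
        parts.append(diagnosis)
    return ", ".join(parts) + "."

print(build_caption("medium", "open", "two prominent nucleoli",
                    "scant cytoplasm", "agranular",
                    "consistent with acute lymphoblastic leukemia"))
```

A fixed slot structure like this keeps the target vocabulary small (~120 tokens in the paper's protocol), which simplifies both generation and downstream attribute extraction.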
Class distributions (training):
| | Healthy | Leukemic | Cell Size (S/M/L) | Chromatin (Coarse/Fine) | Leukemia Subtypes |
|---|---|---|---|---|---|
| % of train samples | ~50 | ~50 | 23 / 54 / 23 | 49 / 51 | 20 / 25 / 15 / 20 / 20 |
The dataset split is 80% train, 10% validation, 10% internal test.
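A minimal sketch of the 80/10/10 split, assuming a simple seeded shuffle (the paper does not specify its splitting code, and stratification by class is omitted here for brevity):

```python
import random

def split_dataset(items, seed=0, frac=(0.8, 0.1, 0.1)):
    """Shuffle and split into train/val/test by an 80/10/10 protocol."""
    items = list(items)
    random.Random(seed).shuffle(items)      # deterministic shuffle for reproducibility
    n = len(items)
    n_train, n_val = int(frac[0] * n), int(frac[1] * n)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

train, val, test = split_dataset(range(14659))  # 14,659 images in the combined dataset
print(len(train), len(val), len(test))          # 11727 1465 1467
```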
3. Training Objectives and Optimization Strategies
HemBLIP adopts multi-task objectives:
- Caption Generation Loss: autoregressive cross-entropy over caption tokens, $\mathcal{L}_{\text{cap}} = -\sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, z)$.
- Morphological Attribute Classification (Auxiliary): cross-entropy over the categorical attribute labels, $\mathcal{L}_{\text{attr}} = -\sum_{k} \log p_\theta(a_k \mid z)$.
- Combined Objective: $\mathcal{L} = \mathcal{L}_{\text{cap}} + \lambda\,\mathcal{L}_{\text{attr}}$,
with the weight $\lambda$ balancing the auxiliary attribute term against captioning.
Optimization is performed via AdamW, with early stopping on validation loss.
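The multi-task objective can be sketched as follows; this is a generic illustration of the combined cross-entropy loss, with the value of `lam` chosen arbitrarily rather than taken from the paper.

```python
import numpy as np

def cross_entropy(probs, target_idx):
    """Negative log-likelihood of the target class under a probability vector."""
    return -np.log(probs[target_idx])

def combined_loss(caption_probs, caption_targets, attr_probs, attr_targets, lam=0.5):
    """L = L_cap + lambda * L_attr (lam here is an illustrative value).

    caption_probs: per-step distributions over the vocabulary (length-T list)
    attr_probs:    per-attribute distributions over classes (length-K list)
    """
    l_cap = sum(cross_entropy(p, t) for p, t in zip(caption_probs, caption_targets))
    l_attr = sum(cross_entropy(p, t) for p, t in zip(attr_probs, attr_targets))
    return l_cap + lam * l_attr
```

The auxiliary attribute term shares the image embedding with the caption decoder, so a single backward pass through the combined scalar trains both heads.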
4. Evaluation Metrics and Quantitative Results
HemBLIP is evaluated using both natural language and morphological accuracy metrics:
- Caption Quality:
- BLEU-1…4: n-gram overlap measure.
- ROUGE-L: longest common subsequence (LCS).
- BERTScore: cosine similarity of token-level BERT embeddings.
- Morphological Feature Accuracy:
- Regex-driven attribute extraction from generated captions; accuracy computed as fraction of correct matches over ground-truth mentions.
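The regex-driven scoring can be illustrated with a small sketch. The patterns and attribute names below are hypothetical (the paper's exact regexes are not given); the point is the mechanism: extract categorical mentions from both captions and score the fraction that match.

```python
import re

# Hypothetical extraction patterns for two of the scored attributes.
PATTERNS = {
    "cell_size": re.compile(r"\b(small|medium|large)\b"),
    "chromatin": re.compile(r"\b(coarse|fine|open|condensed)\b"),
}

def extract_attributes(caption):
    """Pull categorical attribute values out of a caption via regex search."""
    found = {}
    for name, pat in PATTERNS.items():
        m = pat.search(caption.lower())
        if m:
            found[name] = m.group(1)
    return found

def attribute_accuracy(generated, reference):
    """Fraction of reference-mentioned attributes matched by the generation."""
    gen, ref = extract_attributes(generated), extract_attributes(reference)
    if not ref:
        return 0.0
    return sum(gen.get(k) == v for k, v in ref.items()) / len(ref)

acc = attribute_accuracy(
    "A medium cell with open chromatin and scant cytoplasm.",
    "A medium cell with fine chromatin, scant cytoplasm.")
print(acc)  # cell size matches, chromatin does not -> 0.5
```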
Performance results (internal test):
| Model | BLEU | ROUGE-L | BERTScore | Cell Size (%) | Chromatin Texture (%) | Cytoplasm Amount (%) | Diagnosis Mention (%) |
|---|---|---|---|---|---|---|---|
| HemBLIP Full | 0.24 | 0.42 | 0.83 | 91.8 | 52.7 | 70.4 | 79.1 |
| HemBLIP LoRA | 0.27 | 0.49 | 0.86 | 85.4 | 55.5 | 69.9 | 82.3 |
| MedGEMMA LoRA | 0.31 | 0.52 | 0.87 | 81.2 | 59.3 | 57.4 | 54.1 |
LoRA adaptation trains ≈1.5M parameters (versus 180M full), decreasing GPU memory by ∼4× and training time by ∼3×, with caption BLEU rising from 0.24 to 0.27 while reducing costs by >95%.
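A quick arithmetic check of the reported parameter reduction, using the approximate counts from the text:

```python
full_params = 180e6   # approx. parameter count under full fine-tuning
lora_params = 1.5e6   # approx. trainable parameters under LoRA
reduction = 1 - lora_params / full_params
print(f"{reduction:.1%}")  # ≈ 99.2%, consistent with the claimed >95% reduction
```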
5. Interpretability, Clinical Rationale, and Example Outputs
HemBLIP produces “explainable-by-design” outputs that enumerate key cytological attributes (nuclear shape, chromatin texture, nucleoli, cytoplasm features), mirroring hematologist reasoning and facilitating transparent downstream classification.
Example:
- Input: medium-sized blast cell
- Reference: “A medium cell with open chromatin, two prominent nucleoli, scant agranular cytoplasm; consistent with acute lymphoblastic leukemia.”
- HemBLIP LoRA output: “A medium-sized round cell showing open, fine chromatin, visible prominent nucleoli, minimal clear cytoplasm—highly suggestive of a lymphoblast.”
Clinical implications include support for pathologist review, increased trust in automated systems, and utility as a teaching aid in resource-limited environments.
6. HemBLIP for Non-Invasive Haemoglobin Estimation
An independent mobile-health workflow extends the HemBLIP paradigm to non-invasive Hb measurement via smartphone-based multi-input analysis (Sarah et al., 2020):
- Client App: Guides sequential capture of fingernail beds, palpebral conjunctiva, and tongue. Implements on-device OCR/NLP to extract CBC report values for supervised calibration.
- Image Preprocessing: Illumination correction (RGB→YCbCr; CLAHE), white reference calibration using sclera pixels, ROI segmentation via SLIC and UNet-style architectures (encoder: 16–128 channels, separable conv, ReLU, batch norm).
- Feature Extraction: Computes channel-wise statistics (mean, std, skewness, kurtosis), colour ratios (R/G, G/B), erythema index, CIELab and HSV parameters, and morphometric features.
- Machine Learning Pipeline: Two-stage approach: (1) anaemia-severity classifier (Random Forest, SVM), (2) regression model (XGBoostRegressor, multi-linear). End-to-end CNN+MLP models deployed as alternatives.
- Calibration: Per-subject linear correction of the form $\widehat{\mathrm{Hb}}_{\text{cal}} = a\,\widehat{\mathrm{Hb}} + b$, fit via least squares, improves individualized accuracy.
- Performance: MAE = 0.85 g/dL, RMSE = 1.10 g/dL on a cross-validated cohort (120 subjects); multi-ROI outperforms single-ROI (significant under a paired t-test).
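The per-subject least-squares calibration above can be sketched in a few lines. The haemoglobin values here are invented for illustration, not from the study's cohort:

```python
import numpy as np

def fit_subject_calibration(predicted_hb, reference_hb):
    """Fit per-subject linear correction hb' = a * hb + b by least squares.

    predicted_hb: model estimates (g/dL) across one subject's captures
    reference_hb: paired CBC ground-truth values (g/dL)
    """
    a, b = np.polyfit(predicted_hb, reference_hb, deg=1)
    return a, b

def apply_calibration(predicted_hb, a, b):
    return a * np.asarray(predicted_hb) + b

# Illustrative values only: model underestimates this subject by 0.5 g/dL
pred = np.array([11.0, 12.0, 13.0])
ref  = np.array([11.5, 12.5, 13.5])
a, b = fit_subject_calibration(pred, ref)
print(apply_calibration(12.0, a, b))  # ≈ 12.5 after correction
```

Fitting only two coefficients per subject keeps the calibration data requirement minimal (a handful of paired CBC readings) while absorbing stable subject-specific bias such as skin tone or capture habits.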
Deployment uses quantized (8-bit) models for on-device inference and integrates a cloud backend for active learning and calibration storage.
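The 8-bit deployment step can be illustrated with a generic symmetric quantization sketch; the deployed pipeline's exact scheme (per-channel scales, calibration data, framework) is not specified in the text, so this is only a minimal example of the idea.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor 8-bit quantization (generic sketch)."""
    scale = np.max(np.abs(w)) / 127.0          # map the largest weight to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
print(dequantize(q, s))  # close to the original weights, stored at 1/4 the size
```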
7. Limitations and Prospective Extensions
Current HemBLIP models are constrained by dataset composition (peripheral blood smears only; bone marrow aspirates and multi-stain modalities are excluded), domain-shift effects (external-test ROUGE-L drops to ~0.25), and the absence of integrated demographic baselines.
Potential advancements include:
- Expanding datasets to more tissue types and stains.
- Incorporating patient metadata (age, sex, lab context).
- Refining mobile workflows for diverse skin pigmentation and ambient lighting conditions.
- Exploring federated learning for privacy-preserving continual model refinement.
- Integrating low-cost illumination and higher-resolution lens attachments for enhanced acquisition.
This suggests HemBLIP systems can be generalized for scalable point-of-care hematology, with explainable outputs bridging the gap between automated analysis and expert interpretation in leukemia diagnosis and anaemia screening (Logtestijn et al., 7 Jan 2026, Sarah et al., 2020).