HemBLIP: VLM for Hematology Diagnostics
- HemBLIP is a family of vision-language models that integrates a Vision Transformer encoder with a Transformer decoder for accurate cell-morphology captioning and non-invasive haemoglobin estimation.
- It employs both full fine-tuning and LoRA adaptations, reducing trainable parameters by over 95% and improving caption BLEU scores from 0.24 to 0.27.
- The system supports clinical workflows with explainable outputs and mobile point-of-care screening, achieving an MAE of 0.85 g/dL for haemoglobin prediction.
HemBLIP denotes a family of vision-language models (VLMs) and mobile-health systems for interpretable hematological diagnostics, specifically designed to describe and quantify cell morphology for leukemia diagnosis and to enable non-invasive estimation of blood biomarkers such as haemoglobin. HemBLIP employs deep learning, structured expert annotation, and parameter-efficient adaptation techniques to deliver clinically relevant, explainable predictions, supporting both expert workflows and point-of-care applications (Logtestijn et al., 7 Jan 2026, Sarah et al., 2020).
1. Core Model Architecture and Adaptation
HemBLIP builds on the BLIP vision-language paradigm, integrating a Vision Transformer (ViT) as image encoder and a Transformer-based language decoder. The architecture is instantiated and fine-tuned for cell-level morphological captioning:
- Image Encoder: ViT as proposed by Dosovitskiy et al. (2020), mapping high-resolution microscopy images to a sequence of patch embeddings.
- Language Decoder: Transformer decoder generates token sequences corresponding to morphological descriptions.
Adaptation modalities include:
- Full Fine-Tuning: All parameters updated to specialize for cell description generation.
- LoRA Parameter-Efficient Adaptation: Applies Low-Rank Adaptation (LoRA), freezing the pretrained weights $W_0$ and updating only low-rank matrices $A$ and $B$ such that $W' = W_0 + BA$, significantly reducing trainable parameters (∼95–99%) and computational requirements.
Pseudocode:
```
# Full fine-tuning: all parameters updated
for epoch in range(N):
    for image, caption in Train:
        z = ViT(image)
        y_hat = Decoder(z)
        L = CrossEntropy(y_hat, caption)
        L.backward()
        optimizer.step()

# LoRA adaptation: encoder frozen, only low-rank matrices trained
for epoch in range(N):
    for image, caption in Train:
        z = ViT(image)            # encoder frozen
        y_hat = Decoder(z, A, B)  # only LoRA matrices active
        L = CrossEntropy(y_hat, caption)
        L.backward()
        optimizer.step()
```
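The low-rank update can be made concrete with a minimal NumPy sketch of a LoRA-adapted linear layer. This is illustrative, not the paper's implementation; the dimensions and rank below are hypothetical, and the zero initialization of `B` (so that $W' = W_0$ before training) follows common LoRA practice.

```python
import numpy as np

class LoRALinear:
    """Frozen dense layer W plus a trainable low-rank update B @ A (sketch).

    Only A (rank x d_in) and B (d_out x rank) would receive gradients during
    fine-tuning; W stays fixed. All sizes here are hypothetical.
    """
    def __init__(self, d_in, d_out, rank, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen pretrained weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01   # trainable
        self.B = np.zeros((d_out, rank))                    # trainable; zero-init so W' = W at start

    def __call__(self, x):
        # Effective weight W' = W + B @ A, applied without materializing W'
        return x @ self.W.T + (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size

layer = LoRALinear(d_in=768, d_out=768, rank=8)
full = layer.W.size              # 589,824 params if fully fine-tuned
lora = layer.trainable_params()  # 12,288 params with LoRA
print(f"trainable fraction: {lora / full:.2%}")  # ≈ 2% of the full layer
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen model exactly, and the low-rank product adds only `2 * rank * d` trainable parameters per layer.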
The MedGEMMA baseline employs a SigLIP vision tower with a medically aligned decoder for benchmarking (Logtestijn et al., 7 Jan 2026).
2. Dataset Construction and Annotation Protocols
HemBLIP leverages a combined morphology-rich dataset comprising 14,659 annotated peripheral blood cell images. The dataset consists of two main subsets:
- WBCAtt (Healthy): 7,037 expertly annotated white blood cells across five morphologies. Eleven categorical attributes are encoded, encompassing nuclear shape, chromatin texture, nucleoli, cytoplasmic indicators, granular features, cell size, and others.
- LeukemiaAttri (Leukemic): 7,622 cells spanning acute (ALL, AML, APML) and chronic (CLL, CML) leukemia types. Seven attributes, including Auer rods and nucleus-to-cytoplasm ratio, are specified.
Each image receives an expert-derived caption structured around a fixed protocol: “A [size] cell with [chromatin texture] chromatin, [nucleoli description], [cytoplasmic amount], [granularity], [diagnosis if obvious].” The vocabulary consists of ~120 tokens with mean caption length ≈18 words.
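The fixed caption protocol can be sketched as a small template function. This is a hypothetical illustration of the annotation structure; the dataset's exact controlled vocabulary and phrasing rules are not reproduced here.

```python
def build_caption(size, chromatin, nucleoli, cytoplasm, granularity, diagnosis=None):
    """Assemble a protocol-conformant caption from categorical attributes.

    Mirrors the fixed template: "A [size] cell with [chromatin texture] chromatin,
    [nucleoli description], [cytoplasmic amount], [granularity], [diagnosis if obvious]."
    Attribute values below are illustrative, not the dataset's exact vocabulary.
    """
    parts = [f"A {size} cell with {chromatin} chromatin",
             nucleoli, cytoplasm, granularity]
    if diagnosis:  # the diagnosis slot is optional ("if obvious")
        parts.append(diagnosis)
    return ", ".join(parts) + "."

print(build_caption("medium", "open", "two prominent nucleoli",
                    "scant cytoplasm", "agranular",
                    "consistent with acute lymphoblastic leukemia"))
```

A fixed slot structure like this keeps the target vocabulary small (~120 tokens in the paper's protocol), which simplifies both generation and downstream attribute extraction.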
Class distributions (training):
| | Healthy | Leukemic | Cell Size (S/M/L) | Chromatin (Coarse/Fine) | Leukemia Subtypes |
|---|---|---|---|---|---|
| % of train samples | ~50 | ~50 | 23 / 54 / 23 | 49 / 51 | 20 / 25 / 15 / 20 / 20 |
The dataset split is 80% train, 10% validation, 10% internal test.
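A minimal sketch of the 80/10/10 split, assuming a simple seeded shuffle (the paper does not specify its splitting code, and stratification by class is omitted here for brevity):

```python
import random

def split_dataset(items, seed=0, frac=(0.8, 0.1, 0.1)):
    """Shuffle and split into train/val/test by an 80/10/10 protocol."""
    items = list(items)
    random.Random(seed).shuffle(items)      # deterministic shuffle for reproducibility
    n = len(items)
    n_train, n_val = int(frac[0] * n), int(frac[1] * n)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

train, val, test = split_dataset(range(14659))  # 14,659 images in the combined dataset
print(len(train), len(val), len(test))          # 11727 1465 1467
```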
3. Training Objectives and Optimization Strategies
HemBLIP adopts multi-task objectives:
- Caption Generation Loss: autoregressive cross-entropy over caption tokens, $\mathcal{L}_{\text{cap}} = -\sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, z)$.
- Morphological Attribute Classification (Auxiliary): cross-entropy over the categorical attribute labels, $\mathcal{L}_{\text{attr}} = -\sum_{k} \log p_\theta(a_k \mid z)$.
- Combined Objective: $\mathcal{L} = \mathcal{L}_{\text{cap}} + \lambda\,\mathcal{L}_{\text{attr}}$,
with the weight $\lambda$ balancing the auxiliary attribute term against captioning.
Optimization is performed via AdamW, with early stopping on validation loss.
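The multi-task objective can be sketched as follows; this is a generic illustration of the combined cross-entropy loss, with the value of `lam` chosen arbitrarily rather than taken from the paper.

```python
import numpy as np

def cross_entropy(probs, target_idx):
    """Negative log-likelihood of the target class under a probability vector."""
    return -np.log(probs[target_idx])

def combined_loss(caption_probs, caption_targets, attr_probs, attr_targets, lam=0.5):
    """L = L_cap + lambda * L_attr (lam here is an illustrative value).

    caption_probs: per-step distributions over the vocabulary (length-T list)
    attr_probs:    per-attribute distributions over classes (length-K list)
    """
    l_cap = sum(cross_entropy(p, t) for p, t in zip(caption_probs, caption_targets))
    l_attr = sum(cross_entropy(p, t) for p, t in zip(attr_probs, attr_targets))
    return l_cap + lam * l_attr
```

The auxiliary attribute term shares the image embedding with the caption decoder, so a single backward pass through the combined scalar trains both heads.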
4. Evaluation Metrics and Quantitative Results
HemBLIP is evaluated using both natural language and morphological accuracy metrics:
- Caption Quality:
- BLEU-1…4: n-gram overlap measure.
- ROUGE-L: longest common subsequence (LCS).
- BERTScore: cosine similarity of token-level BERT embeddings.
- Morphological Feature Accuracy:
- Regex-driven attribute extraction from generated captions; accuracy computed as fraction of correct matches over ground-truth mentions.
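The regex-driven scoring can be illustrated with a small sketch. The patterns and attribute names below are hypothetical (the paper's exact regexes are not given); the point is the mechanism: extract categorical mentions from both captions and score the fraction that match.

```python
import re

# Hypothetical extraction patterns for two of the scored attributes.
PATTERNS = {
    "cell_size": re.compile(r"\b(small|medium|large)\b"),
    "chromatin": re.compile(r"\b(coarse|fine|open|condensed)\b"),
}

def extract_attributes(caption):
    """Pull categorical attribute values out of a caption via regex search."""
    found = {}
    for name, pat in PATTERNS.items():
        m = pat.search(caption.lower())
        if m:
            found[name] = m.group(1)
    return found

def attribute_accuracy(generated, reference):
    """Fraction of reference-mentioned attributes matched by the generation."""
    gen, ref = extract_attributes(generated), extract_attributes(reference)
    if not ref:
        return 0.0
    return sum(gen.get(k) == v for k, v in ref.items()) / len(ref)

acc = attribute_accuracy(
    "A medium cell with open chromatin and scant cytoplasm.",
    "A medium cell with fine chromatin, scant cytoplasm.")
print(acc)  # cell size matches, chromatin does not -> 0.5
```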
Performance results (internal test):
| Model | BLEU | ROUGE-L | BERTScore | Cell Size (%) | Chromatin Texture (%) | Cytoplasm Amount (%) | Diagnosis Mention (%) |
|---|---|---|---|---|---|---|---|
| HemBLIP Full | 0.24 | 0.42 | 0.83 | 91.8 | 52.7 | 70.4 | 79.1 |
| HemBLIP LoRA | 0.27 | 0.49 | 0.86 | 85.4 | 55.5 | 69.9 | 82.3 |
| MedGEMMA LoRA | 0.31 | 0.52 | 0.87 | 81.2 | 59.3 | 57.4 | 54.1 |
LoRA adaptation trains ≈1.5M parameters (versus 180M full), decreasing GPU memory by ∼4× and training time by ∼3×, with caption BLEU rising from 0.24 to 0.27 while reducing costs by >95%.
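A quick arithmetic check of the reported parameter reduction, using the approximate counts from the text:

```python
full_params = 180e6   # approx. parameter count under full fine-tuning
lora_params = 1.5e6   # approx. trainable parameters under LoRA
reduction = 1 - lora_params / full_params
print(f"{reduction:.1%}")  # ≈ 99.2%, consistent with the claimed >95% reduction
```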
5. Interpretability, Clinical Rationale, and Example Outputs
HemBLIP produces “explainable-by-design” outputs that enumerate key cytological attributes (nuclear shape, chromatin texture, nucleoli, cytoplasm features), mirroring hematologist reasoning and facilitating transparent downstream classification.
Example:
- Input: medium-sized blast cell
- Reference: “A medium cell with open chromatin, two prominent nucleoli, scant agranular cytoplasm; consistent with acute lymphoblastic leukemia.”
- HemBLIP LoRA output: “A medium-sized round cell showing open, fine chromatin, visible prominent nucleoli, minimal clear cytoplasm—highly suggestive of a lymphoblast.”
Clinical implications include support for pathologist review, increased trust in automated systems, and utility as a teaching aid in resource-limited environments.
6. HemBLIP for Non-Invasive Haemoglobin Estimation
An independent mobile-health workflow extends the HemBLIP paradigm to non-invasive Hb measurement via smartphone-based multi-input analysis (Sarah et al., 2020):
- Client App: Guides sequential capture of fingernail beds, palpebral conjunctiva, and tongue. Implements on-device OCR/NLP to extract CBC report values for supervised calibration.
- Image Preprocessing: Illumination correction (RGB→YCbCr; CLAHE), white reference calibration using sclera pixels, ROI segmentation via SLIC and UNet-style architectures (encoder: 16–128 channels, separable conv, ReLU, batch norm).
- Feature Extraction: Computes channel-wise statistics (mean, std, skewness, kurtosis), colour ratios (R/G, G/B), erythema index, CIELab and HSV parameters, and morphometric features.
- Machine Learning Pipeline: Two-stage approach: (1) anaemia-severity classifier (Random Forest, SVM), (2) regression model (XGBoostRegressor, multi-linear). End-to-end CNN+MLP models deployed as alternatives.
- Calibration: Per-subject linear correction of the form $\widehat{\mathrm{Hb}}_{\text{cal}} = a\,\widehat{\mathrm{Hb}} + b$, fit via least squares, improves individualized accuracy.
- Performance: MAE = 0.85 g/dL, RMSE = 1.10 g/dL on a cross-validated cohort (120 subjects); multi-ROI outperforms single-ROI (significant under a paired t-test).
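The per-subject least-squares calibration above can be sketched in a few lines. The haemoglobin values here are invented for illustration, not from the study's cohort:

```python
import numpy as np

def fit_subject_calibration(predicted_hb, reference_hb):
    """Fit per-subject linear correction hb' = a * hb + b by least squares.

    predicted_hb: model estimates (g/dL) across one subject's captures
    reference_hb: paired CBC ground-truth values (g/dL)
    """
    a, b = np.polyfit(predicted_hb, reference_hb, deg=1)
    return a, b

def apply_calibration(predicted_hb, a, b):
    return a * np.asarray(predicted_hb) + b

# Illustrative values only: model underestimates this subject by 0.5 g/dL
pred = np.array([11.0, 12.0, 13.0])
ref  = np.array([11.5, 12.5, 13.5])
a, b = fit_subject_calibration(pred, ref)
print(apply_calibration(12.0, a, b))  # ≈ 12.5 after correction
```

Fitting only two coefficients per subject keeps the calibration data requirement minimal (a handful of paired CBC readings) while absorbing stable subject-specific bias such as skin tone or capture habits.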
Deployment uses quantized (8-bit) models for on-device inference and integrates a cloud backend for active learning and calibration storage.
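The 8-bit deployment step can be illustrated with a generic symmetric quantization sketch; the deployed pipeline's exact scheme (per-channel scales, calibration data, framework) is not specified in the text, so this is only a minimal example of the idea.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor 8-bit quantization (generic sketch)."""
    scale = np.max(np.abs(w)) / 127.0          # map the largest weight to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
print(dequantize(q, s))  # close to the original weights, stored at 1/4 the size
```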
7. Limitations and Prospective Extensions
Current HemBLIP models are constrained by dataset composition (peripheral blood smears only; bone marrow aspirates and multi-stain modalities are excluded), domain-shift effects (external-test ROUGE-L drops to ~0.25), and the absence of integrated demographic baselines.
Potential advancements include:
- Expanding datasets to more tissue types and stains.
- Incorporating patient metadata (age, sex, lab context).
- Refining mobile workflows for diverse skin pigmentation and ambient lighting conditions.
- Exploring federated learning for privacy-preserving continual model refinement.
- Integrating low-cost illumination and higher-resolution lens attachments for enhanced acquisition.
This suggests HemBLIP systems can be generalized for scalable point-of-care hematology, with explainable outputs bridging the gap between automated analysis and expert interpretation in leukemia diagnosis and anaemia screening (Logtestijn et al., 7 Jan 2026, Sarah et al., 2020).