LabOS: The AI-XR Co-Scientist That Sees and Works With Humans

Published 16 Oct 2025 in cs.AI | (2510.14861v1)

Abstract: Modern science advances fastest when thought meets action. LabOS represents the first AI co-scientist that unites computational reasoning with physical experimentation through multimodal perception, self-evolving agents, and Entended-Reality(XR)-enabled human-AI collaboration. By connecting multi-model AI agents, smart glasses, and human-AI collaboration, LabOS allows AI to see what scientists see, understand experimental context, and assist in real-time execution. Across applications--from cancer immunotherapy target discovery to stem-cell engineering -- LabOS shows that AI can move beyond computational design to participation, turning the laboratory into an intelligent, collaborative environment where human and machine discovery evolve together.

Abstract PDF Upgrade to Chat

Summary

The paper presents LabOS, a fully integrated platform that combines agentic AI reasoning with XR interfaces for dynamic, real-time lab collaboration.
It employs a multi-agent system and specialized vision-language models to achieve over 90% error detection accuracy and superior protocol generation.
LabOS is validated across cancer immunotherapy, cell fusion, and stem cell engineering, demonstrating enhanced reproducibility and skill transfer.

LabOS: An Integrated AI-XR Co-Scientist for Human-Machine Collaboration in Biomedical Research

System Architecture and Design

LabOS introduces a unified platform for human-AI collaboration in scientific laboratories, integrating agentic AI reasoning with extended reality (XR) interfaces for real-time, multimodal interaction. The architecture comprises a multi-agent system for dry-lab tasks—planning, coding, critique, and tool creation—augmented by a Tool Ocean that autonomously expands analytical capabilities through web and literature mining. The wet-lab module leverages XR smart glasses, enabling the AI to perceive laboratory environments via egocentric video streams and provide adaptive, context-aware guidance, error detection, and documentation.

The dry-lab agentic core builds on the STELLA framework, with a Manager Agent decomposing scientific objectives, a Developer Agent executing bioinformatics analyses, and a Critic Agent iteratively refining workflows. The Tool Creation Agent autonomously identifies and integrates new analytical resources, supporting continuous self-evolution. This architecture enables dynamic scaling and adaptation to novel research tasks, with performance improving as the system accumulates experience.

The wet-lab module employs XR glasses running a Unity/Android application, streaming video and audio to a local or cloud GPU server for real-time inference. The server invokes a vision-LLM (VLM) to interpret visual input, align actions with protocols, and provide structured feedback. The system supports 3D/4D spatial modeling using multiview camera feeds and Gaussian splatting for photorealistic, temporally consistent reconstruction, facilitating object-centric tracking and simulation-based training.

Vision-LLM Training and Benchmarking

To enable robust visual reasoning in laboratory settings, LabOS post-trains a VLM on a curated corpus of >200 egocentric lab videos (LabSuperVision, LSV), FineBio, and JoVE datasets. The training pipeline employs supervised fine-tuning (SFT) with LoRA on paired video-text examples, followed by reinforcement learning using Group Relative Policy Optimization (GRPO). The reward function is rule-based, emphasizing procedural accuracy, safety compliance, and expert-consistent reasoning.

Benchmarking on the LSV dataset reveals that leading commercial models (Gemini 2.5 Pro, GPT-4o, Qwen 2.5-VL-7B, Cosmos-Reason-1) underperform in protocol alignment and error detection, with top scores of 2.86/5 and ~2/5, respectively. In contrast, LabOS-VLM (235B) achieves >90% error detection accuracy and superior protocol generation quality, outperforming all baselines. The model demonstrates fine-grained step recognition, context-aware guidance, and real-time error correction in authentic wet-lab scenarios.

Biomedical Applications and Empirical Results

LabOS is validated in three biomedical research domains: cancer immunotherapy, mechanistic gene investigation, and stem cell engineering.

Cancer Immunotherapy Target Discovery:

LabOS autonomously analyzes CRISPRa functional screens in A375 melanoma cells co-cultured with NK cells, identifying and re-ranking candidate regulators of NK-mediated cytotoxic resistance. The system nominates CEACAM6 as a top target, confirmed by wet-lab assays and survival analysis on TCGA datasets. This demonstrates closed-loop integration of computational reasoning and experimental validation.

Mechanistic Investigation of Cell Fusion:

LabOS generates and ranks hypotheses for genes regulating cell-cell fusion, prioritizing ITSN1 via pathway enrichment and interaction priors. Experimental knockdown of ITSN1 in U2OS cells validates its role in fusion, confirming the AI's mechanistic reasoning.

Stem Cell Engineering and Skill Transfer:

LabOS copilots researchers through complex gene-editing workflows in human iPSCs, providing real-time guidance, error detection, and automated documentation. The system records expert practice and coaches junior scientists, enabling rapid skill transfer and reproducible training without extended side-by-side mentorship.

Performance, Scaling, and Deployment Considerations

LabOS establishes new state-of-the-art results on biomedical reasoning benchmarks:

Humanity's Last Exam: Biomedicine—32% accuracy
LAB-Bench: DBQA—61% accuracy
LAB-Bench: LitQA—65% accuracy These results represent up to 8% improvement over next-best models. Performance scales with inference-time compute, reflecting the self-evolving design.

Deployment leverages lightweight AR/XR glasses (<85g, >2h battery life, 1200+ Nits display, 6DoF gesture support), suitable for laboratory environments. Real-time streaming and inference require local GPU servers or cloud infrastructure, with data privacy and security considerations for sensitive experimental records. The system supports modular integration with laboratory information management systems (LIMS) and can be extended to other scientific domains.

Implications and Future Directions

LabOS demonstrates that multimodal, agentic AI systems can transcend digital-only reasoning to participate in physical experimentation, closing the loop from hypothesis generation to wet-lab validation. The integration of XR interfaces and specialized VLMs enables AI to perceive, understand, and act within dynamic laboratory environments, supporting reproducibility, error mitigation, and skill transfer.

Theoretical implications include the emergence of self-evolving, context-aware AI agents capable of adaptive reasoning and tool creation, with potential to generalize across scientific disciplines. Practically, LabOS offers a blueprint for intelligent laboratories, where human intuition and machine rigor co-evolve, accelerating discovery and democratizing expertise.

Future developments may focus on expanding the Tool Ocean with domain-specific modules, enhancing 3D/4D spatial modeling for complex workflows, and integrating with robotic automation for fully autonomous experimentation. Scaling to larger model sizes and broader datasets will further improve reasoning and perception capabilities. Addressing challenges in data privacy, interoperability, and human-AI trust will be critical for widespread adoption.

Conclusion

LabOS represents a comprehensive AI-XR co-scientist platform, unifying agentic reasoning and multimodal human-AI collaboration in laboratory research. Through self-evolving agents, specialized VLMs, and XR interfaces, LabOS achieves state-of-the-art performance in biomedical reasoning, real-time experiment guidance, and skill transfer. The system exemplifies the potential of AI to participate in and enhance scientific discovery, setting the stage for future intelligent laboratories where human and machine collaborate seamlessly.

Markdown Report Issue

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

off on

Knowledge Gaps

off on

Practical Applications

off on

Glossary

off on

Conceptual Simplification

off on

Explain it Like I'm 14

What is this paper about?

This paper introduces LabOS, an “AI co‑scientist” that doesn’t just think on a computer—it also watches and helps people do real experiments in the lab. It links two worlds:

Dry lab (planning and data analysis on a computer)
Wet lab (hands‑on experiments at the bench)

By pairing smart AI with XR smart glasses (think: safety glasses with a tiny screen and camera), LabOS can see what a scientist sees, give step‑by‑step guidance, spot mistakes, and keep a clean record of what happened. The big idea is to make labs more accurate, faster, and easier to learn in.

What questions were the researchers trying to answer?

The team focused on four simple questions:

Can AI plan scientific studies and analyze data like a skilled research assistant?
Can AI “see” what’s happening in a real lab and help a human in the moment?
Can this AI improve itself by learning new tools and strategies over time?
Does the system actually help make real discoveries in biology and medicine?

How does LabOS work?

To explain the system, it helps to imagine a team with different roles and a set of smart goggles:

1) The dry‑lab “brain” (planning and analysis)

LabOS uses several AI “agents” that work together like a small team:
- Manager (plans tasks), Developer (writes and runs code), Critic (checks and improves steps), and a Tool‑Creator (builds new tools when needed).
It keeps a growing “Tool Ocean” (like an app store) of codes, databases, and methods it has found or built from online sources and scientific papers.
Over time, it learns which strategies work best and reuses them, so it gets better at solving new problems.

2) The wet‑lab “eyes and coach” (helping at the bench)

Scientists wear XR smart glasses that stream what they see to the AI and show instructions back to them.
A special vision‑LLM (VLM)—think “eyes + reading brain”—was trained to understand lab videos. It matches what’s seen (pipetting, incubating, mixing) to the written protocol, and:
- Guides the next step
- Warns about mistakes (like not changing a pipette tip)
- Keeps time and records details automatically

3) Training the AI to see in labs

The team built a new dataset called LabSuperVision (LSV): over 200 first‑person lab videos where experts labeled steps, materials, and common errors.
They found that general AI models struggled with these real lab videos.
So they fine‑tuned a VLM specifically for lab scenes using supervised learning (practice with correct answers) and reinforcement learning (rewarding safer, more precise instructions). Result: the “LabOS‑VLM” sees and reasons much better in lab settings.

4) 3D/4D lab “digital twins”

LabOS can reconstruct parts of the workspace in 3D (and over time = 4D), like a time‑lapse, interactive map of the lab bench. This helps with training, replaying what happened, and checking where objects were used.

What did they find?

Here are the main results, explained simply:

Better computer‑side reasoning:
- On tough biomedical question‑answer tests, LabOS outperformed other strong AI models. It also gets better the more it is used and the longer it thinks—showing it can “self‑improve.”
Better vision in real labs:
- General models often missed fine‑grained errors in lab videos. After training, the LabOS‑VLM was much better at writing accurate step lists and detecting mistakes (like contamination risks or wrong timing), achieving very high error‑detection accuracy on held‑out test videos.
Real‑world science wins:
- Cancer immunotherapy: LabOS analyzed CRISPR screening data and identified a gene, CEACAM6, as a possible target that helps tumors resist killing by immune cells (NK cells). Lab tests confirmed activating CEACAM6 made tumor cells harder to kill—matching the AI’s prediction.
- Cell fusion biology: LabOS proposed ITSN1 as a key gene controlling cell‑cell fusion. Scientists tested this by knocking it down and saw fusion drop—confirming the AI’s idea.
- Stem cell engineering: With XR glasses, LabOS gave live guidance during complex gene‑editing steps in human stem cells, flagged mistakes in real time, and recorded expert workflows that can train new students faster and more safely.

Why is this important?

Faster, more reliable science: LabOS helps turn ideas into experiments and results more smoothly, while catching errors that often cause delays or failures.
Better training: Expert techniques are hard to transfer from one person to another. LabOS can “watch and learn” from experts, then coach beginners—speeding up learning from months to days.
Safer, more reproducible labs: By monitoring steps and keeping detailed logs, LabOS reduces contamination risk and makes results easier for others to repeat.
Human + AI teamwork: Instead of replacing scientists, LabOS acts like a co‑pilot—combining human creativity with machine precision.

Simple takeaway

LabOS is like giving the lab a smart co‑worker: one that plans studies, watches your hands while you work, points out problems before they matter, and learns from every run. In early tests, it not only thought well on biomedical questions but also helped make real, validated discoveries. If this approach spreads, labs could become more efficient, safer, and better at turning good ideas into real breakthroughs.

View Paper Prompt View All Prompts

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper.

Dataset generalizability: LSV comprises ~200 sessions from 7 researchers and 36 protocols; it is unclear how well the VLM and XR guidance transfer to different labs, instruments, protocol families, lighting, PPE, and bench setups beyond this narrow distribution.
Release and reproducibility of LSV: The paper does not specify whether LSV (videos, annotations, gold protocols) will be released, how privacy is safeguarded (faces/voices, PHI, proprietary content), or inter-annotator agreement (e.g., Cohen’s κ) for step/error labels.
Train/test separation and leakage: The 80/10/10 split across FineBio/JoVE/LSV is described at a high level; it remains unclear whether the held-out LSV test is free of protocol, operator, or scene leakage from training data (especially if JoVE protocols overlap conceptually).
Quantitative coverage of error types: The taxonomy of errors (sterile breach, step mismatch, timing deviation, etc.) and their prevalence across training vs. test sets are not reported; robustness to rare but critical error classes is unknown.
VLM evaluation metrics and calibration: “>90% error detection accuracy” lacks definition (sensitivity/specificity, false alarms, severity-weighted metrics, per-class PR curves); thresholding/calibration for real-time use is not described.
Human evaluation reliability: Human scoring (0–5) lacks inter-rater reliability statistics, rater training protocol, and adjudication process; reliance on a “GPT-5” comparator further obscures evaluation validity and reproducibility.
Out-of-distribution robustness: Performance under occlusion (e.g., biosafety cabinets), glare, fogging, motion blur, camera tilt, reagent brand/label variation, multi-operator scenes, and background clutter remains untested or unreported.
Latency and real-time constraints: End-to-end latency (capture→server→VLM→XR feedback), jitter, and their impact on time-sensitive steps are not measured; it is unknown how the system handles tight timing windows or rapid micro-steps missed by 4 fps sampling.
Failure modes and recovery: There is no analysis of how the system detects its own uncertainty, defers, or safely fails (e.g., when classification confidence is low or streams are interrupted).
Cognitive load and usability: No user study quantifies scientist cognitive load (e.g., NASA-TLX), interruption cost, trust, situational awareness, or error rates with vs. without XR guidance across experience levels.
Ergonomics and biosafety: Wearing XR glasses in sterile hoods and BSL-2+ settings raises contamination and visibility concerns; procedures for disinfecting hardware, mitigating fogging/glare, and maintaining sterile technique are not validated.
Network and compute resilience: Behavior under network loss, degraded bandwidth, or local GPU outages is not characterized; offline/edge inference feasibility and quality trade-offs remain open.
Energy, cost, and access: Compute/battery requirements for 7B–235B VLM scales, energy usage, and cost per hour of operation are not reported; equity of access for resource-limited labs is unclear.
3D/4D reconstruction accuracy: There is no quantitative evaluation (e.g., pose/trajectory error, depth RMSE, object localization IoU) for the MapAnything/4D-LangSplat pipeline in dynamic lab scenes or its real-time viability on commodity hardware.
Spatially grounded reasoning: How 3D scene understanding quantitatively improves step verification, safety gating, or object tracking versus 2D alone is not ablated.
Instrument integration: The claim of synchronizing with “equipment feeds” lacks concrete integrations (LIMS, microscopes, incubators, sequencers), API coverage, data standards (e.g., HL7/FHIR), and error-handling for device telemetry.
Safety governance: No systematic safety framework is provided for high-risk protocols (chemical hazards, radiation, infectious agents); severity-aware gating, lockouts, or mandatory human confirmation for critical steps are unspecified.
Regulatory compliance: Pathways for GLP/GMP, 21 CFR Part 11, or CLIA compliance, audit trails, tamper-evident logs, and e-signatures are not detailed, limiting adoption in regulated environments.
Privacy and IP protection: Policies for storage/retention of video/audio, encryption, access control, redaction, and handling of proprietary methods are not described; cross-border data transfer and export control risks are unaddressed.
Security and adversarial robustness: The system’s resilience against adversarial visuals (labels, screens), spoofed audio, prompt injection via printed text, and supply-chain risks in XR devices is not evaluated.
Self-evolving “Tool Ocean” governance: How auto-generated tools are vetted (testing, sandboxing), versioned, licensed, and secured is unclear; risks of unsafe or biased tool behavior and dependency drift remain.
Reproducibility of agent decisions: The multi-agent planner/dev/critic loop lacks determinism audits, seed control, and provenance tracking; explainability of critique decisions and traceability from recommendation to evidence remain open.
Continual learning and drift: How the system handles data drift, catastrophic forgetting, and lab-specific finetuning—while preserving privacy and avoiding overfitting—is not specified; no longitudinal performance tracking is provided.
Benchmark inconsistencies: Reported benchmark numbers (HLE, DBQA, LitQA) vary across sections/figures and lack confidence intervals, statistical tests, and budgeted compute comparisons to baselines.
Wet-lab validation scope: NK-cell study validates CEACAM6 in a single cell line context; generalization across tumor types, donors, primary patient-derived models, and in vivo relevance is untested.
Genetic perturbation controls: Off-target effects for CRISPRa/CRISPRi (multiple guides, rescue experiments), batch effects, and replicate counts are not reported; effect sizes and statistical power are unclear.
Mechanistic depth: The ITSN1 finding lacks rescue experiments, orthogonal perturbations, and replication across cell types/fusogens to establish mechanism beyond a single U2OS+FAST assay.
Stem cell copiloting impact: Claims of faster training and improved reproducibility in iPSC workflows are not quantified (e.g., time-to-criterion, pass/fail rates, yield/viability, inter-operator variance) or benchmarked against SOPs/videos.
Human-in-the-loop boundaries: Criteria for when the AI is allowed to instruct vs. merely suggest, and how conflicts with expert judgment are resolved and logged, are not defined.
Ethical/dual-use safeguards: There is no discussion of preventing misuse (e.g., enabling dangerous or restricted experiments), user vetting, or content restrictions within the tool-creation and guidance modules.
Cross-language and accessibility: Support for non-English protocols, accents, speech under masks, and accessibility (e.g., color blindness, hearing protection environments) is not addressed.
Generalization to non-biomedical domains: The system’s applicability to chemistry, materials science, or field biology protocols (with different hazards and tools) is untested.
On-device deployment: The feasibility of running smaller VLMs on-device for privacy/latency, with accuracy/latency trade-offs versus server/cloud inference, remains unexplored.
Comparative baselines: There is no head-to-head comparison against state-of-the-art autonomous lab robots or structured workflow engines to quantify the unique gains from XR+VLM copiloting.
Human factors in step pacing: The system invokes analysis every ~5–10 s; the risk of missing micro-operations, rapid pipetting, or transient sterile breaches is not quantified; adaptive sampling policies are undeveloped.
Provenance of AI-generated protocols: Versioning, authorship, and evidence links for AI-authored protocols (and update notifications when upstream knowledge changes) are not specified.
Environmental footprint: Training/inference carbon costs for 32B–235B models and mitigation strategies (e.g., distillation, sparse routing, on-device caching) are not reported.

View Paper Prompt View All Prompts

Practical Applications

Immediate Applications

The following applications can be deployed with currently available components described in the paper (LabOS agentic software, LabOS-VLM, XR glasses, local/cloud GPU inference, logging), as demonstrated in biomedical lab settings.

XR-guided bench copilot for wet-lab protocols
- Sectors: biotechnology, pharma R&D, academic research cores; education/training labs
- What it does: Delivers stepwise SOPs on smart glasses, verifies actions via egocentric video, detects deviations (e.g., sterile breaches, timing errors), and provides corrective prompts in real time.
- Tools/workflows: LabOS XR app (Unity/Android) on AR/XR glasses; LabOS-VLM (7B–235B) server; JSON-based feedback loop at ~5–10 s cadence.
- Assumptions/dependencies: XR hardware availability (lightweight AR glasses with 6DoF); low-latency networking to on-prem GPU or secure cloud; curated SOPs mapped to steps; lab approval for head-worn cameras; operator consent and biosafety policy adherence.
Automated experiment documentation and audit trails
- Sectors: regulated labs (GLP/GMP), CROs, academic cores
- What it does: Time-stamped capture of steps, parameters, deviations, and environment context mapped to gold-standard protocols for reproducibility and QA.
- Tools/workflows: Egocentric video/audio, instrument logs, structured run records; searchable session replays aligned to SOP step boundaries.
- Assumptions/dependencies: Data integrity controls (ALCOA+), secure storage, role-based access; institutional policies for video retention; potential CSV/validation for regulated use.
Rapid upskilling and onboarding through expert-trajectory capture
- Sectors: academia, biotech R&D, workforce development
- What it does: Expert runs are captured and converted to digital training modules that coach novices in situ to perform complex workflows (e.g., lentiviral transduction in iPSCs).
- Tools/workflows: Expert session library indexed by steps/parameters; XR-guided tutoring with performance tracking and context-aware hints.
- Assumptions/dependencies: Instructor consent and IP ownership; coverage of critical tacit cues (camera viewpoint, lighting); institutional buy-in for XR-based training.
Assistive safety and sterility monitoring
- Sectors: BSL-2/BSL-3 research labs, cell therapy process development
- What it does: Detects sterile technique violations (e.g., pipette tip reuse, improper placement), step mismatches, and timing deviations; alerts operator.
- Tools/workflows: LabOS-VLM error classifiers trained on LSV/FineBio/JoVE; configurable alerting thresholds.
- Assumptions/dependencies: Camera field-of-view must cover critical actions; false-positive/negative tolerance defined by safety leadership; operator override remains available.
Self-improving dry-lab agent for target discovery and mechanistic insight
- Sectors: biotech target discovery, translational research, computational biology cores
- What it does: Multi-agent system (planning, dev, critic, tool-creation) ingests screen data and literature to generate/rank hypotheses (e.g., CEACAM6 for NK resistance; ITSN1 for cell fusion), write analysis code, and iterate with wet-lab feedback.
- Tools/workflows: STELLA-derived agent stack; “Tool Ocean” for auto-generated analysis modules; pathway/network enrichment; survival analyses on public cohorts (e.g., TCGA).
- Assumptions/dependencies: Access to high-quality screen data, databases (KEGG, Reactome, Ensembl), and compute; domain expert oversight for priors and validation.
Instrument-aware protocol guidance and data capture
- Sectors: biotech/pharma labs, core facilities
- What it does: Synchronizes operator actions with device readouts (e.g., incubators, microscopes) to enforce QC checkpoints and auto-fill run logs.
- Tools/workflows: Equipment data ingestion via APIs/CSV; XR prompts gated by device states; unified run timeline.
- Assumptions/dependencies: Vendor APIs or export capability; device-network integration; IT approval for connectivity.
Benchmarking and model selection for lab VLMs
- Sectors: AI/ML teams in industry and academia; instrument and XR vendors
- What it does: Use LSV to evaluate lab-perception models on protocol alignment and issue identification; quantify gains from domain post-training.
- Tools/workflows: LSV dataset and evaluation rubrics; human and AI scoring; A/B testing of base vs. post-trained VLMs.
- Assumptions/dependencies: Access to LSV or similar datasets; adherence to dataset licensing and privacy constraints.
3D/4D replay for skills transfer and post hoc analysis
- Sectors: training centers, process development, lab management
- What it does: Reconstructs spatio-temporal scenes for replay and “what happened” reviews to bolster reproducibility and coaching.
- Tools/workflows: MapAnything-based 3D reconstruction; Gaussian splatting for photorealistic scene replays; step-aligned annotations.
- Assumptions/dependencies: Adequate multi-view capture or robust egocentric tracking; storage and compute for reconstruction; privacy controls for scene content.
Curriculum modules for lab education
- Sectors: universities, community labs, biotech bootcamps
- What it does: Turn standardized videos (e.g., JoVE) plus expert captures into XR lessons with formative assessment via real-time error detection.
- Tools/workflows: SFT+RL-tuned VLM aligned to course SOPs; stepwise rubrics and feedback; instructor dashboards.
- Assumptions/dependencies: Classroom-safe XR deployment; curated course-aligned SOPs; adherence to safety/ethical guidelines (especially for bioscience labs).
Team science facilitation and cross-lab reproducibility checks
- Sectors: consortia, multi-site studies
- What it does: Standardizes protocol execution and documentation across sites; flags site-specific deviations; supports harmonization.
- Tools/workflows: Shared SOP libraries; site-specific model calibration; centralized log aggregation and review.
- Assumptions/dependencies: Agreement on shared SOPs; cross-site data governance and anonymization; model robustness across environments.

Long-Term Applications

These applications require further research, domain adaptation, scaling, regulatory acceptance, or integration with additional systems (e.g., robotics, LIS/LIMS, GMP validation).

Regulatory-grade compliance automation in GMP/GLP labs
- Sectors: pharma manufacturing/QC, cell/gene therapy, diagnostics
- What it could do: Provide validated, AI-assisted execution with audit trails acceptable to FDA/EMA (e.g., automated deviation capture, electronic batch records).
- Dependencies: Formal computer systems validation (CSV), GxP validation of XR and VLM components, ALCOA+ implementations, change control; sustained accuracy across sites.
Closed-loop autonomous experimentation with lab robotics
- Sectors: robotic biolabs, high-throughput screening, materials discovery
- What it could do: AI agent designs experiments, VLM perceives outcomes, and robotic systems execute/recover steps with minimal human intervention.
- Dependencies: Robust robot interfaces (OT-2, liquid handlers), safety interlocks, generalizable perception-to-action policies; expanded training data for varied apparatus and chemistries.
Generalization beyond biomedicine to chemistry, materials, and manufacturing
- Sectors: chemical synthesis, battery/energy R&D, semiconductor cleanrooms, aerospace QA
- What it could do: XR-guided procedures and error detection for glovebox operations, synthesis workflows, assembly steps.
- Dependencies: Domain-specific datasets and annotations; new safety ontologies; re-training VLMs for different object vocabularies and visual cues.
Telepresence and remote supervision of labs
- Sectors: distributed R&D, contract research, field labs
- What it could do: Expert-in-the-loop oversight of experiments across locations with AI summarization and risk alerts.
- Dependencies: Reliable low-latency streaming; security and privacy controls; legal frameworks for remote sign-off and liability.
Knowledge marketplaces for protocol steps and tacit know-how
- Sectors: publishers, CRO platforms, edtech
- What it could do: Curate, license, and exchange step-level XR modules and AI-evaluable SOPs; community benchmarking via LSV-like tasks.
- Dependencies: IP and licensing norms; contributor incentives; quality assurance pipelines; standard metadata schemas.
Context-aware biosafety and biosecurity enforcement
- Sectors: BSL-3/4 labs, select agent programs, policy and oversight bodies
- What it could do: Real-time enforcement of facility-specific biosafety rules, PPE compliance, restricted procedure gating.
- Dependencies: High-precision detection; formal verification of safety policies; fail-safe design to prevent harm from model errors; regulatory approval.
Clinical and hospital laboratory integration
- Sectors: pathology, clinical chemistry, microbiology (CLIA/CAP)
- What it could do: XR guidance for complex prep steps, chain-of-custody verification, and standardized workflows.
- Dependencies: LIS/LIMS integration, HIPAA compliance, device interoperability, clinical validation studies.
Enterprise-scale digital twins for capacity planning and risk analysis
- Sectors: pharma operations, academic core facilities
- What it could do: 4D models of labs to simulate throughput, bottlenecks, and “what-if” scenarios, including ergonomic and hazard analyses.
- Dependencies: Persistent, accurate spatial models; integration with scheduling and inventory systems; calibration/maintenance of scene models.
Policy frameworks for AI-in-the-lab governance
- Sectors: funding agencies, standards bodies, institutional compliance
- What it could do: Establish standards for AI-driven documentation, consent for capture, XR safety, model auditability, and benchmark adoption.
- Dependencies: Multistakeholder consensus; standardized metrics (e.g., LSV-derived); mechanisms for incident reporting and continuous monitoring.
Cross-institution reproducibility scoring and accreditation
- Sectors: journals, funders, accreditation bodies
- What it could do: Require or encourage AI-assisted run logs and step-aligned evidence for published protocols; award reproducibility badges.
- Dependencies: Publisher and funder policies; privacy-preserving sharing; community acceptance of metrics.
Consumer/education-grade XR science kits
- Sectors: K–12/undergraduate education, makerspaces
- What it could do: Safe, simplified versions of XR-guided experiments with AI feedback for learning fundamental techniques.
- Dependencies: Non-hazardous protocols; low-cost hardware; curated curricula; robust content moderation to prevent unsafe use.
Foundation-model extensions for fine-grained lab perception
- Sectors: AI vendors, research institutes
- What it could do: Next-gen VLMs that generalize across labs, instruments, lighting, and occlusions; multilingual SOP understanding.
- Dependencies: Larger and more diverse datasets than LSV; standardized annotation frameworks; compute for SFT+RL at scale.
Cross-modal fusion with omics and real-time sensors
- Sectors: high-throughput biology, bioprocessing
- What it could do: Combine visual cues with inline omics/sensor data to adaptively tune experiments (e.g., adjust incubation based on measured cell state).
- Dependencies: Real-time assay integration, streaming analytics, robust causal policies; validation in diverse biological systems.

Each long-term application assumes sustained human-in-the-loop oversight, ongoing model calibration to new labs, and clear governance for data privacy, biosafety, and accountability.

View Paper Prompt View All Prompts

Glossary

3D/4D reconstruction: Computational recovery of 3D structure (and its evolution over time as 4D) from images or video for spatial reasoning. "state-of-the-art 3D/4D reconstruction algorithms."
4D-LangSplat: A method for time-aware, language-indexable Gaussian splatting enabling semantically rich 4D scene representations. "4DLangSplat [11] can further help produce a time-aware, semantically indexable 3D environment,"
6DoF: Six degrees of freedom tracking (3D position and 3D orientation) for spatial interaction. "6DoF and hand gesture support for 3D- aware human-AI interactions."
A375: A human melanoma cell line commonly used in cancer research and functional screens. "A375 melanoma tumor cells"
AR/XR: Augmented/Extended Reality; head-worn displays overlaying digital information in real-world lab settings. "via AR/XR for human-AI collaboration"
Cas9: An RNA-guided endonuclease used in CRISPR systems to cut DNA at targeted sites. "Cas9 was successfully transferred into a sterile 1.5 mL EP tube as required."
CEACAM6: A cell adhesion molecule implicated as a cancer immunotherapy target. "the AI agent nominated CEACAM6 as a putative target"
Cell confluency: The percentage of the culture surface covered by adherent cells, used to gauge growth stage. "a 10 cm dish of 293T cells (~70-80% confluency)."
Cell fusion assay: An experimental procedure to quantify or visualize cell-cell fusion events. "cell fusion assay (induced by the fusogenic protein FAST [13])"
CRISPR: A genome editing technology enabling targeted DNA modification guided by RNA. "CRISPR gene-editing of human iPSC stem cells"
CRISPR activation (CRISPRa): A CRISPR-based method that transcriptionally upregulates target genes without cutting DNA. "CRISPR activation (CRISPRa) screen"
CRISPR interference (CRISPRi): A CRISPR-based method that represses gene expression via dCas9-mediated transcriptional blockade. "CRISPR interference (CRISPRi) coupled with cell fusion assay"
Critic Agent: An agent module that evaluates intermediate results and proposes refinements in a multi-agent system. "a Critic Agent for evaluation and refinement."
Digital twins: Virtual replicas of physical lab environments/workflows for replay, analysis, and training. "These digital twins capture spatial and temporal relationships between instruments, samples, and human actions,"
Egocentric: First-person viewpoint from a head-mounted camera/glasses capturing the operator’s perspective. "egocentric video sessions"
EP tube: A small Eppendorf microcentrifuge tube used for sample handling in labs. "a sterile 1.5 mL EP tube"
FAST proteins: Fusion-associated small transmembrane proteins that drive membrane fusion in certain viruses. "fusion-associated small transmembrane (FAST) proteins."
Gaussian splatting: A 3D scene representation/rendering technique modeling scenes as many Gaussian primitives. "Gaussian splatting models scenes as sets of millions of Gaussian distributions"
Group Relative Policy Optimization (GRPO): A reinforcement learning algorithm that compares groups of rollouts to shape rewards. "Group Relative Policy Optimization (GRPO) [9]"
Guide RNA (gRNA): The RNA molecule that directs CRISPR nucleases to specific genomic targets. "Add reagent 2 (gRNA) into the EP tube."
Humanity’s Last Exam (HLE): A challenging benchmark evaluating biomedical reasoning capabilities. "Humanity's Last Exam (HLE): Biomedicine"
Induced pluripotent stem cells (iPSCs): Reprogrammed adult cells with embryonic-like potential to differentiate into many cell types. "lentiviral transduction in human iPSCs experiment"
Inference-time scaling: Improving an AI agent’s performance by allocating more computation/tools during inference. "via inference-time scaling."
ITSN1: Intersectin 1, a gene implicated here as a regulator of cell fusion. "identification of a gene regulating cell fusion, ITSN1."
JoVE: Journal of Visualized Experiments; curated procedural videos used as training/evaluation data. "JoVE (standardized procedure videos)"
LabSuperVision (LSV): An expert-annotated dataset of lab videos for evaluating scientific visual reasoning. "LabSuperVision (LSV), an expert-annotated laboratory video dataset"
Lentiviral transduction: Gene delivery into cells using lentiviral vectors for stable expression. "lentiviral transduction in human iPSCs experiment"
Lipofectamine 2000: A lipid-based reagent for transfecting nucleic acids into cells. "27 UL Lipofectamine 2000 in 450 uL Opti-MEM"
LoRA: Low-Rank Adaptation; a parameter-efficient method for fine-tuning large models. "with LoRA on paired video-text examples"
MAGeCK: Model-based Analysis of Genome-wide CRISPR-Cas9 Knockout; software for CRISPR screen analysis. "I will build MAGeCK analysis and visual tools."
MapAnything: A feed-forward approach for universal metric 3D reconstruction from images. "we utilize MapAnything [10]"
Natural killer (NK) cells: Innate immune lymphocytes capable of killing tumor or infected cells. "natural killer (NK) cell killing of tumors."
Opti-MEM: A reduced-serum cell culture medium used in transfection and maintenance. "Add 1 ml Opti-MEM into a sterile 1.5 mL EP tube"
PEI: Polyethylenimine, a polymer used as a transfection reagent to deliver DNA/RNA into cells. "Add PEI (4:1 ratio to DNA)."
Plasmid: A circular DNA vector used to deliver or express genes in cells. "Cas9 plasmid."
Point cloud: A set of 3D points representing the geometry of a scene or object. "point cloud reconstructions"
Protocol alignment: The task of matching observed actions to a reference protocol’s steps and parameters. "Protocol alignment, where the model needs to generate a stepwise protocol describing procedural actions and parameters,"
Qwen-VL: A family of vision-language foundation models used as base models for post-training. "Using Qwen-VL as the base model"
Reinforcement finetuning: Optimizing a model with task-specific rewards to improve reasoning or alignment. "and then reinforcement finetuning to improve visual reasoning."
scRNA-seq: Single-cell RNA sequencing to profile gene expression at single-cell resolution. "scRNA-seq datasets"
STELLA: A self-evolving LLM agent framework for biomedical research and tool creation. "STELLA self-evolving agent framework"
TCGA: The Cancer Genome Atlas, a large-scale cancer genomics resource. "The Cancer Genome Atlas (TCGA)"
Template Library: A repository of previously successful reasoning workflows for reuse and generalization. "a Template Library of successful reasoning workflows is dynamically updated,"
Tool Ocean: A shared, expanding repository of analytical tools, code, and APIs for the agent. "A Tool Ocean maintains and expands a repository of analytical tools/codes, databases, and APIs."
Transfection: The process of introducing nucleic acids into eukaryotic cells. "Correct Transfection Protocol"
Vision-LLM (VLM): A multimodal AI model that jointly processes visual and textual inputs for reasoning. "A specially trained Vision-LLM (VLM) monitors the procedure,"
Volcano plots: Plots displaying statistical significance vs. effect size, often used in differential analyses. "Volcano plots"

View Paper Prompt View All Prompts

Open Problems

We're still in the process of identifying open problems mentioned in this paper. Please check back in a few minutes.

Continue Learning

Authors (20)

First 10 authors:

Collections

Tweets

YouTube

Show All Videos

alphaXiv

LabOS: The AI-XR Co-Scientist That Sees and Works With Humans (34 likes, 0 questions)

LabOS: The AI-XR Co-Scientist That Sees and Works With Humans

Summary

LabOS: An Integrated AI-XR Co-Scientist for Human-Machine Collaboration in Biomedical Research

System Architecture and Design

Vision-LLM Training and Benchmarking

Biomedical Applications and Empirical Results

Performance, Scaling, and Deployment Considerations

Implications and Future Directions

Conclusion

Paper to Video (Beta)

Whiteboard

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

What is this paper about?

What questions were the researchers trying to answer?

How does LabOS work?

1) The dry‑lab “brain” (planning and analysis)

2) The wet‑lab “eyes and coach” (helping at the bench)

3) Training the AI to see in labs

4) 3D/4D lab “digital twins”

What did they find?

Why is this important?

Simple takeaway

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Practical Applications

Immediate Applications

Long-Term Applications

Glossary

Open Problems

Continue Learning

Related Papers

Authors (20)

Collections

Tweets

YouTube

alphaXiv

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research