
Cognitive Genome Dataset Overview

Updated 29 January 2026
  • The Cognitive Genome Dataset is a dual-resource platform that unifies a biomedical knowledge graph (MultiCNKG) with a richly annotated visual corpus (Visual Genome) to map cognitive functions.
  • It employs LLM-driven entity alignment and crowdsourced annotation to maintain high precision, recall, and semantic consistency across molecular and perceptual domains.
  • The dataset supports advanced research in personalized medicine and visual reasoning by enabling fine-grained queries over interconnected genetic, disease, cognitive, and image-based data.

The Cognitive Genome Dataset is an umbrella term that, in current literature, designates two distinct but ambitious resources: (1) MultiCNKG, a large-scale integrative biomedical knowledge graph that unifies genes, diseases, pathways, and higher-order cognitive processes relevant to neuroscience and medicine; and (2) Visual Genome, a densely annotated corpus of natural images designed to ground higher-level visual reasoning in machine learning. Both datasets target the challenge of representing and connecting the multi-faceted “genome” underlying cognition, either at the biomolecular or the perceptual-visual level, by supporting fine-grained queries over rich, structured annotations. Below, both instantiations are detailed with respect to their composition, methodology, semantic representation, evaluation metrics, and applications.

1. Dataset Composition and Scope

1.1 MultiCNKG Cognitive Genome

The MultiCNKG Cognitive Genome Dataset consists of a unified heterogeneous knowledge graph merging three principal KGs: the Cognitive Neuroscience Knowledge Graph (CNKG), Gene Ontology (GO), and Disease Ontology (DO) (Sarabadani et al., 8 Oct 2025). The core merged graph contains:

Statistic         Value
Total nodes (N)   6,900
Total edges (E)   11,300
Node types        5
Edge types        7

Node Types:

  • Genes (~2,400): Protein-coding loci/allelic variants implicated in brain function or pathology (e.g., APOE, BDNF, COMT).
  • Diseases (~1,300): Neurological/psychiatric syndromes (e.g., Alzheimer's disease, Parkinson's disease).
  • Cognitive Processes (~1,100): Higher-order functional constructs (e.g., working memory, selective attention).
  • Biological Pathways (~700): Signal transduction or molecular cascades (e.g., synaptic plasticity, dopaminergic signaling).
  • Therapeutic Targets (~1,400): Biomolecules pertinent to pharmacological intervention (e.g., NMDA receptor, COMT enzyme).

Edge Types:

  • Causes (n≈1,800): Direct mechanistic or mutational causality.
  • Associated_with (n≈2,200): Statistical or mechanistic association.
  • Regulates (n≈1,500): Modulation of pathways/processes.
  • Involved_in (n≈1,200): Participation in a higher-order process/pathway.
  • Treated_by (n≈800): Drug–target or disease–drug relationships.
  • Influences (n≈1,300): Modulatory impact on functional outcomes.
  • Linked_to (n≈1,500): Broad cross-domain semantic linkage.

Derived Quantities:

  • Average node degree: ⟨k⟩ ≈ 3.28
  • Graph density: δ = 2E / (N(N−1)) ≈ 4.75 × 10⁻⁴
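The derived quantities above follow directly from the reported node and edge totals; a quick check:

```python
# Derived graph statistics from the reported counts (N = 6,900 nodes,
# E = 11,300 edges), treating the graph as undirected.
N, E = 6_900, 11_300

avg_degree = 2 * E / N               # <k> = 2E/N
density = 2 * E / (N * (N - 1))      # delta = 2E / (N(N-1))

print(f"<k> ~= {avg_degree:.2f}")    # ~3.28, matching the reported value
print(f"density ~= {density:.2e}")
```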

1.2 Visual Genome (Cognitive Genome for Visual Reasoning)

Visual Genome (VG), also labeled the Cognitive Genome in the visual reasoning community (Krishna et al., 2016), comprises:

Statistic            Value
Images               108,249
Object annotations   2.56M
Attributes           1.76M
Relationships        2.03M
Region descriptions  4.1M
QA pairs             1.7M

Each image is densely annotated, with per-image means of 21.3 objects, 16.2 attributes, 18.7 relationships, 42 region descriptions, and 17 QA pairs.

2. Data Integration and Annotation Methodologies

2.1 MultiCNKG Integration Pipeline

  • Entity Alignment: Nodes from CNKG, GO, and DO are embedded via LLMs (GPT-4 or BioGPT). Pairwise cosine similarity:

\text{sim}(v_i, v_j) = \frac{\langle \mathbf{v}_i, \mathbf{v}_j \rangle}{\|\mathbf{v}_i\| \, \|\mathbf{v}_j\|}

Entities with sim ≥ 0.85 are merged.

  • Edge Canonicalization and Adjacency Update: Edge relations are mapped to canonical forms using LLM-driven entailment, updating the adjacency matrix:

A_f = A_1 + A_2 + A_3 + \Delta A

Additional edges are predicted based on LLM outputs where P(r_\text{new} \mid h, t) ≥ 0.9.

  • Iterative Expansion: The graph is expanded iteratively with new nodes/edges, and conflicts are resolved through expert/external feedback, pruning low-confidence relations.
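The entity-alignment step above can be sketched as follows. The vectors here are toy stand-ins for the LLM embeddings, and `merge_candidates` and the example entities are illustrative names, not part of the published pipeline:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity <u, v> / (||u|| ||v||)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def merge_candidates(embeddings, threshold=0.85):
    """Return name pairs whose cosine similarity meets the merge threshold."""
    names = list(embeddings)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if cosine(embeddings[names[i]], embeddings[names[j]]) >= threshold:
                pairs.append((names[i], names[j]))
    return pairs

# Toy vectors: two near-duplicate entity labels and one distinct entity.
emb = {
    "APOE": np.array([1.0, 0.0, 0.1]),
    "apolipoprotein E": np.array([0.95, 0.05, 0.12]),
    "BDNF": np.array([0.0, 1.0, 0.0]),
}
print(merge_candidates(emb))  # the two APOE labels align; BDNF stays separate
```

In the actual pipeline the vectors come from LLM embeddings of entity names and contexts, and merged pairs collapse into a single node before edge canonicalization.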

2.2 Visual Genome Annotation Pipeline

  • Crowdsourced Multi-stage Annotation:
    • Region Descriptions: Workers annotate regions with bounding boxes and free-form phrases, enforcing lexical diversity (BLEU4 ≤ 0.7).
    • Object and Attribute Extraction: Secondary annotation validates object bounding boxes via overlap (IoU>0.8) and highlights object nouns, attributes, and relationships.
    • Relationship and Scene Graph Construction: Predicate links (actions, prepositions) are tagged and unified into a global scene graph per image.
    • QA Pair Generation and Verification: Free-form and region-based questions/answers, cross-verified by majority vote.
  • Canonicalization: All tokens mapped to WordNet synsets via POS tagging, lemmatization, and manual verification.
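The box co-reference check used during object validation (merging boxes whose overlap exceeds the IoU threshold) can be sketched as below; the (x, y, w, h) box format is an assumption for illustration:

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) bounding boxes."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

box_a = (10, 10, 100, 100)
box_b = (12, 12, 100, 100)      # nearly identical box
print(iou(box_a, box_b) > 0.8)  # True: the two boxes would be merged
```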

3. Semantic Structure and Schema

3.1 MultiCNKG

Edges are typed triples (head, relation, tail), e.g., (APOE4, causes, Alzheimer's disease). Paths encode biological inference chains, e.g.:

  • (BDNF, regulates, Synaptic_plasticity) → (Synaptic_plasticity, involved_in, Memory)
  • (APOE4, causes, Alzheimer's disease) → (Alzheimer's disease, impairs, Episodic_memory)
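The inference chains above can be followed programmatically over the typed triples; this is a minimal sketch (the `two_hop` helper is illustrative, not a dataset API), using the example triples from the text:

```python
# Typed triples (head, relation, tail) from the examples above.
triples = [
    ("BDNF", "regulates", "Synaptic_plasticity"),
    ("Synaptic_plasticity", "involved_in", "Memory"),
    ("APOE4", "causes", "Alzheimer's disease"),
    ("Alzheimer's disease", "impairs", "Episodic_memory"),
]

def two_hop(triples, start):
    """Chain (start, r1, mid) -> (mid, r2, end) paths from a start node."""
    out = {h: (r, t) for h, r, t in triples}
    paths = []
    for h, r, t in triples:
        if h == start and t in out:
            r2, end = out[t]
            paths.append((start, r, t, r2, end))
    return paths

print(two_hop(triples, "BDNF"))
# [('BDNF', 'regulates', 'Synaptic_plasticity', 'involved_in', 'Memory')]
```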

3.2 Visual Genome

Each image is described by a JSON object:

  • Objects: IDs, lexical names, synsets, bounding boxes, and attributes.
  • Relationships: Directed predicates between object pairs, with synset mapping.
  • Region descriptions anchor local subgraphs.
  • QA pairs cover both image-wide and region-specific queries.
  • The scene graph merges all local region graphs for holistic representation.
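A minimal sketch of the per-image JSON layout described above; the field names here are simplified assumptions for illustration, not the official Visual Genome schema:

```python
import json

record = json.loads("""
{
  "image_id": 1,
  "objects": [
    {"object_id": 10, "name": "dog", "synset": "dog.n.01",
     "box": [30, 40, 120, 90], "attributes": ["brown"]}
  ],
  "relationships": [
    {"subject_id": 10, "predicate": "on", "object_id": 11}
  ],
  "regions": [
    {"region_id": 100, "phrase": "a brown dog on the grass",
     "box": [20, 30, 200, 150]}
  ]
}
""")

# Index objects by ID, as a scene-graph consumer typically would.
objects = {o["object_id"]: o["name"] for o in record["objects"]}
print(objects)                                               # {10: 'dog'}
print(len(record["relationships"]), len(record["regions"]))  # 1 1
```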

4. Evaluation and Quality Control

4.1 MultiCNKG Metrics

  • Precision: 85.20%
  • Recall: 87.30%
  • F₁ Score: ≈86.23%
  • Coverage: 92.18% (fraction of merged unique nodes/edges vs. total input)
  • Graph Consistency: 82.50% (OWL-based logical validation)
  • Novelty Detection: 40.28% (LLM-predicted novel links / total edges)
  • Expert Validation: 89.50% plausible LLM-suggested links

High recall and coverage indicate near-complete preservation/enrichment of source KGs. Precision and expert validation imply strong biomedical/clinical plausibility. Substantial novelty points to nontrivial discovery capacity.

4.2 Visual Genome Metrics

  • Annotation Densities (objects/image: 21.3, attributes/image: 16.2)
  • Lexical Diversity: BLEU-based filtering of region phrases.
  • Box Co-reference: Merged when IoU≥0.8.
  • QA mapping: Coverage@100 = 41%.
  • Synset Mapping (held-out precision/recall):
    • Objects: 88.0% / 98.5%
    • Attributes: 85.7% / 95.9%
    • Relationships: 92.9% / 88.5%
  • Benchmarks: Attribute-only classification (19% top-1, 43% top-5), Region description generation (BLEU-4=0.010, METEOR=0.09, human agreement=43%).

5. Use Cases and Supported Tasks

5.1 MultiCNKG

  • Personalized Medicine: Genetic risk stratification for cognitive trajectory prediction.
  • Early Diagnosis: Identification of gene-pathway markers for preclinical detection of cognitive decline.
  • Hypothesis Generation: Discovery of novel gene → pathway → cognition chains for experimental prioritization.

5.2 Visual Genome

  • Scene-Graph-to-Image Retrieval and Synthesis: Grounding structured queries in vision.
  • Region-Granular Visual Question Answering and Dense Captioning: Enabling region-specific natural language understanding and generation.
  • Relationship Prediction, Zero-Shot Learning, and Attribute–Object classification.

A plausible implication is that the richness and multi-level semantic granularity of both resources enable the development and benchmarking of reasoning-capable models for both molecular-to-cognitive and perception-to-language pipelines.

6. Limitations and Prospective Directions

  • Scalability: MultiCNKG’s current 6.9K nodes are modest; future work includes integrating DrugBank, PharmKG, and additional multi-omics sources (epigenetics, transcriptomics), as well as environmental factors.
  • LLM Dependency: Ongoing exploration of open-source LLMs (e.g., BioGPT, LLaMA variants) to address transparency and cost per (Sarabadani et al., 8 Oct 2025).
  • Validation: Prospective use of federated learning architectures and crowdsourced expert feedback for real-time updates and continuous curation.
  • Visual Genome: Continued expansion of annotation coverage and region-specific reasoning, as well as new benchmarks for context-sensitive vision-language modeling.

In summary, both the MultiCNKG-based and Visual Genome instantiations of the Cognitive Genome Dataset provide comprehensive, multi-relational semantic maps spanning molecular, behavioral, and perceptual domains. These resources underpin translational neurogenomics, advanced visual reasoning, and data-driven hypothesis testing, offering foundational assets for the next generation of machine intelligence applications in cognitive and biomedical research (Sarabadani et al., 8 Oct 2025, Krishna et al., 2016).
