Repository Blueprint Extraction
- Repository blueprint extraction is a systematic method that synthesizes a repository’s structure, dependencies, workflows, and semantics into an explicit machine-readable blueprint.
- It integrates static analysis, semantic augmentation, and LLM-driven summarization to capture high-fidelity dependency graphs, build/test artifacts, and hierarchical templates.
- The generated blueprints serve as foundations for tasks like automated code navigation, build orchestration, code generation, and effective AI-powered repository integration.
Repository blueprint extraction refers to the systematic recovery, synthesis, and representation of a software repository’s structure, dependencies, workflows, and semantics into an explicit, machine-readable form termed a blueprint. This blueprint typically manifests as a dependency graph, knowledge graph, or hierarchical template and serves as the foundation for downstream tasks such as repository understanding, navigation, code generation, build orchestration, and integration with AI-powered agents. Recent advances unify static, semantic, dynamic, and LLM-driven techniques to provide high-fidelity, updatable blueprints that bridge repository comprehension and generative automation at scale.
1. Formal Models and Taxonomy of Repository Blueprints
The formal objective of repository blueprint extraction is to compute a structured representation

$$G = (V, E, L_V, L_E)$$

where nodes $V$ correspond to entities such as packages, files, classes, functions, tests, build targets, or abstract functional groups; edges $E$ encode typed relations: syntactic (calls, imports, inheritance), architectural (containment, implements), build/test dependencies, or semantic coverage. Label sets $L_V$ and $L_E$ annotate nodes and edges with entity kinds and dependency types, potentially extended by weights (frequency, coverage) or feature vectors (Haratian et al., 2024).
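The typed, labeled graph described above can be sketched as a small data structure. This is a minimal illustration, not the representation used by any of the cited systems; the class names (`Node`, `Edge`, `Blueprint`), the `weight` field, and the example identifiers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)
class Node:
    id: str
    kind: str  # entity kind label, e.g. "file", "class", "function", "test"

@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    kind: str            # dependency type label, e.g. "call", "import", "contains"
    weight: float = 1.0  # optional weight such as call frequency

@dataclass
class Blueprint:
    nodes: dict = field(default_factory=dict)  # node id -> Node
    edges: list = field(default_factory=list)  # list of Edge

    def add_node(self, id: str, kind: str) -> None:
        self.nodes[id] = Node(id, kind)

    def add_edge(self, src: str, dst: str, kind: str, weight: float = 1.0) -> None:
        # Both endpoints must already be known entities.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError(f"unknown endpoint in edge {src!r} -> {dst!r}")
        self.edges.append(Edge(src, dst, kind, weight))

    def successors(self, src: str, kind: Optional[str] = None) -> list:
        # Targets of edges leaving `src`, optionally filtered by edge type.
        return [e.dst for e in self.edges
                if e.src == src and (kind is None or e.kind == kind)]

# Hypothetical toy repository with one containment and one call relation.
bp = Blueprint()
bp.add_node("pkg/io.py", "file")
bp.add_node("pkg/io.py::load", "function")
bp.add_node("pkg/cli.py::main", "function")
bp.add_edge("pkg/io.py", "pkg/io.py::load", "contains")
bp.add_edge("pkg/cli.py::main", "pkg/io.py::load", "call", weight=3)
print(bp.successors("pkg/cli.py::main", kind="call"))  # ['pkg/io.py::load']
```

Keeping the entity kind and dependency type as plain string labels lets the same structure hold syntactic, architectural, and build/test relations side by side.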
Contemporary frameworks distinguish between:
- Static dependency blueprints: derived from source-level analysis (AST, IR), yielding dependency graphs (DG) over classes, functions, modules (Haratian et al., 2024).
- Semantic-augmented blueprints: where nodes are enriched with embedding-based or LLM-generated summaries and vectors to enable similarity-based retrieval (Bevziuk et al., 10 Oct 2025, Luo et al., 2 Feb 2026).
- Build/test blueprints: graphs detailing build artifacts, aggregators, orchestrators, and explicit test coverage, grounded in build system evidence (Cherny-Shahar et al., 15 Jan 2026).
- High-order, hierarchical blueprints: combining both low-level code entities and high-level abstraction nodes (functional groups/domains) for multi-granular navigation (Luo et al., 2 Feb 2026).
2. Extraction Algorithms and Pipelines
Blueprint extraction pipelines integrate several core processes, which can be instantiated as follows:
- Source/Artifact Collection
  - Project checkout, symbol indexing (via IDE integration or VCS cloning) (Haratian et al., 2024, Bevziuk et al., 10 Oct 2025).
  - Build system introspection (CMake, Make, Lake facets) and test artifact discovery (Cherny-Shahar et al., 15 Jan 2026, Zhu et al., 30 Jan 2026).
- Structural Parsing and Dependency Recovery
  - Language-agnostic AST or PSI traversal to enumerate classes, methods, and modules, recording definition and reference sites.
  - Emission of dependency edges: call, import, instantiation, extends, access (Haratian et al., 2024).
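Example Python dependency extraction (a minimal sketch using the standard `ast` module):

```python
import ast

def dependency_edges(source, module):
    """Yield (src, dst, kind) edges; `module` stands in for the caller/importer."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):  # call edge; callee kept as source text
            yield (module, ast.unparse(node.func), "call")
        elif isinstance(node, ast.Import):  # one import edge per imported name
            for alias in node.names:
                yield (module, alias.name, "import")
```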
- Semantic Feature Lifting
  - LLM agents parse code to extract atomic "verb + object" features, class/function summaries, and high-level workflows (Lin et al., 28 Apr 2025, Luo et al., 2 Feb 2026).
  - Embedding vectors are computed for both code and summaries (e.g., BGE-large), populating node properties for vectorized retrieval (Bevziuk et al., 10 Oct 2025).
- Build/Test Graph Extraction
  - Deterministic extraction using tools like SPADE over build system outputs (CMake File API, CTest JSON), capturing components, aggregators, runners, external packages, and package managers (Cherny-Shahar et al., 15 Jan 2026).
  - Evidence nodes link graph elements to source locations or call stacks, creating an auditable architectural map.
- High-Level Domain Discovery and Hierarchy Construction
  - LLM-driven identification and structuring of functional areas, categories, and subcategories enforced via hierarchical prompts (Luo et al., 2 Feb 2026).
  - Semantic summarization unifies disparate entities under abstracted nodes (e.g., “Model Training,” “Data Preprocessing”).
- Incremental and Scalable Updates
  - Change-detection (via git diffs or AST tree deltas) limits reprocessing to affected local graph regions, employing atomic update protocols for insertion, deletion, and modification (Luo et al., 2 Feb 2026).
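The incremental-update step can be illustrated with a simplified sketch. It assumes content hashes as the change detector, which is a stand-in for the git diffs or AST deltas mentioned above, and it is not the update protocol of any cited system. The function names, the per-file edge layout, and the toy extractor are hypothetical.

```python
import hashlib

def digest(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def incremental_update(graph, hashes, files, extract):
    """Update `graph` in place, reprocessing only files whose content changed.

    graph:   dict mapping each file to the set of edges extracted from it
    hashes:  dict mapping each file to the digest seen at the last update
    files:   dict mapping each current file path to its source text
    extract: callable(path, text) -> iterable of (dst, kind) edges
    Returns the set of paths that were reprocessed or deleted.
    """
    touched = set()
    # Deletion: drop the graph region of files that no longer exist.
    for path in set(hashes) - set(files):
        graph.pop(path, None)
        del hashes[path]
        touched.add(path)
    # Insertion and modification: re-extract only when the hash differs.
    for path, text in files.items():
        h = digest(text)
        if hashes.get(path) != h:
            graph[path] = set(extract(path, text))
            hashes[path] = h
            touched.add(path)
    return touched

def toy_extract(path, text):
    # Hypothetical extractor: one import edge per "import X" line.
    return [(line.split()[1], "import")
            for line in text.splitlines() if line.startswith("import ")]

graph, hashes = {}, {}
v1 = {"a.py": "import os\n", "b.py": "import sys\n", "d.py": "import math\n"}
first = incremental_update(graph, hashes, v1, toy_extract)
v2 = {"a.py": "import os\nimport json\n", "c.py": "import re\n", "d.py": "import math\n"}
second = incremental_update(graph, hashes, v2, toy_extract)
print(sorted(second))  # ['a.py', 'b.py', 'c.py']
```

In the second pass `d.py` is unchanged and is never re-extracted, which is the property that keeps update cost proportional to the size of the change rather than the size of the repository.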
3. Blueprint Representations: Schemas and Serialization
Blueprints are serialized for agent interoperability and visualization through standardized JSON or graph formats:
- Dependency Graph (DG) Schema:
```json
{
  "nodes": [{"id": "...", "type": "..."}],
  "edges": [{"src": "...", "dst": "...", "type": "...", "count": N}]
}
```
- Repository Intelligence Graph (RIG) Schema:
```json
{
  "components": [...], "aggregators": [...], "runners": [...],
  "tests": [...], "external_packages": [...], "package_managers": [...],
  "evidence": [...]
}
```
- RPG-Encoder Nodes:
Each node stores a semantic feature vector and metadata, supporting bidirectional mapping between abstract functionality and concrete code (Luo et al., 2 Feb 2026).
- Hierarchical Blueprint Templates:
Synthesized from LLM aggregations over multiple repositories, combining four structural axes (repository-, folder-, function-, class-level) as reusable templates (Lin et al., 28 Apr 2025).
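To show how a serialized blueprint might be consumed, the sketch below loads a document shaped like the Dependency Graph schema (fields `nodes`, `edges`, `src`, `dst`, `type`, `count`) and answers a reverse-dependency query. The node identifiers and the helper names `load_dg` and `transitive_dependents` are hypothetical, and the validation rule is an assumption rather than part of any published schema.

```python
import json
from collections import defaultdict

# Hypothetical instance of the DG schema with three nodes and three edges.
DG_JSON = """
{
  "nodes": [
    {"id": "app.main", "type": "function"},
    {"id": "app.db", "type": "module"},
    {"id": "app.util", "type": "module"}
  ],
  "edges": [
    {"src": "app.main", "dst": "app.db", "type": "import", "count": 1},
    {"src": "app.main", "dst": "app.util", "type": "call", "count": 4},
    {"src": "app.db", "dst": "app.util", "type": "import", "count": 1}
  ]
}
"""

def load_dg(text):
    """Parse a DG document into forward and reverse adjacency maps."""
    doc = json.loads(text)
    ids = {n["id"] for n in doc["nodes"]}
    forward, reverse = defaultdict(set), defaultdict(set)
    for e in doc["edges"]:
        # Assumed rule: every edge endpoint must be a declared node.
        if e["src"] not in ids or e["dst"] not in ids:
            raise ValueError(f"edge references unknown node: {e}")
        forward[e["src"]].add(e["dst"])
        reverse[e["dst"]].add(e["src"])
    return forward, reverse

def transitive_dependents(reverse, target):
    """All nodes that depend on `target`, directly or indirectly."""
    seen, stack = set(), [target]
    while stack:
        for parent in reverse[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

forward, reverse = load_dg(DG_JSON)
print(sorted(transitive_dependents(reverse, "app.util")))  # ['app.db', 'app.main']
```

Because the blueprint is plain JSON, an agent can answer such impact queries without re-parsing source files.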
4. Downstream Applications and Agent Interfaces
Repository blueprints empower diverse downstream tasks:
- Semantic and Structural Retrieval
  - Hybrid pipelines combine vector similarity (via description/code embeddings) and graph-traversal (one-hop expansion, graph-walk) for issue- or question-driven retrieval (Bevziuk et al., 10 Oct 2025, Luo et al., 2 Feb 2026).
  - LLM-based preprocessing reformulates user intent and discovers explicit or implicit file targets (Bevziuk et al., 10 Oct 2025).
- Build/Test Orchestration
  - Agents can infer build order, reverse dependencies, and coverage by traversing RIG dependency and coverage edges (Cherny-Shahar et al., 15 Jan 2026).
  - LLM assistants interact with JSON blueprints, avoiding costly re-parsing of build scripts or speculative source queries.
- Repository Generation and Reconstruction
  - Blueprint-driven generative agents synthesize repository scaffolding, ensuring structural completeness and pattern fidelity in code-transformation tasks (Lin et al., 28 Apr 2025, Luo et al., 2 Feb 2026).
  - High-fidelity blueprints facilitate zero-shot or closed-loop reconstruction, reaching near-complete category coverage and unit-test coverage (Luo et al., 2 Feb 2026).
- Formalization and Human–AI Collaboration
  - In proof assistant workflows, blueprints (dependency graphs linking informal exposition to formal Lean constants) unify project documentation and progress tracking, supporting AI-aided theorem proving (Zhu et al., 30 Jan 2026).
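The hybrid retrieval pattern listed above (vector similarity followed by one-hop graph expansion) can be sketched in a few lines. This is an illustration under strong simplifications: the three-dimensional vectors are hand-written stand-ins for model embeddings, and `hybrid_retrieve`, the file names, and the neighbor map are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_retrieve(query_vec, node_vecs, neighbors, k=1):
    """Top-k nodes by cosine similarity, followed by their one-hop neighbors."""
    ranked = sorted(node_vecs,
                    key=lambda n: cosine(query_vec, node_vecs[n]),
                    reverse=True)
    seeds = ranked[:k]
    result = list(seeds)
    for seed in seeds:
        # One-hop expansion: pull in graph neighbors the vectors alone missed.
        for nb in neighbors.get(seed, ()):
            if nb not in result:
                result.append(nb)
    return result

# Toy vectors; a real pipeline would compute these with an embedding model.
node_vecs = {
    "auth/login.py": [0.9, 0.1, 0.0],
    "auth/token.py": [0.2, 0.8, 0.1],
    "ui/theme.py": [0.0, 0.1, 0.9],
}
neighbors = {"auth/login.py": ["auth/token.py"]}
print(hybrid_retrieve([1.0, 0.0, 0.0], node_vecs, neighbors, k=1))
# ['auth/login.py', 'auth/token.py']
```

Here `auth/token.py` scores poorly on similarity alone but is still returned because it is structurally adjacent to the best match, which is the motivation for combining the two signals.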
5. Evaluation Metrics and Empirical Results
Blueprint extraction quality is measured through task-grounded evaluation:
- Dependency Recall/Precision (DG Extraction):
RefExpo achieves recall of 0.92 for Python (PyCG) and 1.00 for Java (Judge), and improves macro-level recall by 7% over the best baseline (Haratian et al., 2024).
- Agent-Assisted QA (RIG):
In structured QA benchmarks (30 questions per repo), providing RIG improves mean accuracy by 12.2% and reduces wall-clock completion time by 53.9%, with larger gains in multilingual and complex repositories (Cherny-Shahar et al., 15 Jan 2026).
- Task-Oriented Localization (RPG-Encoder):
On SWE-bench Verified, RPG-Encoder achieves 93.7% Acc@5, exceeding the best baseline by 14.4%; repository reconstruction via ZeroRepo-RPG achieves 98.5% category coverage and 86.0% executable pass rate (Luo et al., 2 Feb 2026).
- Ablation/Completeness in Generation (AutoP2C):
Structural completeness scores (COMP_class, COMP_func) sharply decrease if blueprint extraction is removed, confirming its essential role in repository generation (Lin et al., 28 Apr 2025).
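For reference, the two most common metric families above can be computed as follows. This is a generic sketch: edge-level precision and recall use the standard set-based definitions, while `acc_at_k` assumes one common convention (a query counts as a hit if any gold item appears in the top k), which may differ from the exact definition used in the cited evaluations. All data values are made up for illustration.

```python
def precision_recall(predicted, truth):
    """Set-based precision and recall over dependency edges."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)  # edges that are both predicted and correct
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

def acc_at_k(ranked_lists, gold, k):
    """Fraction of queries whose top-k results contain at least one gold item.

    Assumed convention; some benchmarks instead require all gold items.
    """
    hits = sum(1 for ranked, g in zip(ranked_lists, gold)
               if set(ranked[:k]) & set(g))
    return hits / len(ranked_lists) if ranked_lists else 0.0

# Illustrative edges as (src, dst, kind) triples.
predicted = {("a", "b", "call"), ("a", "c", "import"), ("b", "c", "call")}
truth = {("a", "b", "call"), ("b", "c", "call"),
         ("c", "d", "call"), ("a", "d", "import")}
p, r = precision_recall(predicted, truth)
print(round(p, 3), round(r, 3))  # 0.667 0.5

# Two localization queries: the first gold file is ranked 2nd, the second 3rd.
ranked = [["x", "y", "z"], ["p", "q", "r"]]
gold = [["y"], ["r"]]
print(acc_at_k(ranked, gold, k=2))  # 0.5
```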
A summary table of systems, representations, and best-reported evaluation figures:
| System | Blueprint Representation | Best-Reported Metric(s) |
|---|---|---|
| RefExpo | Dependency Graph (DG) | Recall 0.92 (Python), +31% unique edges |
| RIG/SPADE | Build/Test Intelligence Graph | +12.2% Accuracy, −53.9% Time |
| RPG-Encoder | Hierarchical RPG (semantics+deps) | 93.7% Acc@5, 98.5% Reconstruction |
| AutoP2C | 4-level Hierarchical Template | All 8 generated repos executable |
| LeanArchitect | Lean + LaTeX Dep Graph | Full project migration in <1 day, latent inconsistency detection |
6. Limitations, Challenges, and Extension Prospects
Current blueprint extraction methods exhibit several limitations:
- Static analysis gaps: Reflection, metaprogramming, dynamic imports, and asynchronous or cross-language invocations are incompletely covered by static-only approaches (Haratian et al., 2024).
- Semantic ambiguity: Zero-temperature LLM decoding reduces noise, but models still occasionally misclassify function or folder intent, especially in heterogeneous codebases (Lin et al., 28 Apr 2025, Luo et al., 2 Feb 2026).
- Graph reasoning: Even with a correct blueprint, agent errors may shift to graph traversal and multi-hop reasoning mistakes rather than structural misunderstandings; improved algorithmic and LLM in-graph reasoning remains critical (Cherny-Shahar et al., 15 Jan 2026).
- Incremental maintenance: Scalable update protocols are necessary for large, evolving repositories; localized update patterns are reported to reduce update overhead by 95.7% (Luo et al., 2 Feb 2026).
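The static-analysis gap listed above is easy to reproduce. In the sketch below, an import collector built on Python's standard `ast` module sees the two literal `import` statements but not the module loaded through `importlib.import_module`, whose name is only assembled at run time. The helper name `static_imports` and the sample source are hypothetical.

```python
import ast

SOURCE = '''
import os
import importlib

mod = importlib.import_module("js" + "on")  # name resolved only at run time
'''

def static_imports(source):
    """Collect module names from import statements without executing code."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module)
    return found

print(sorted(static_imports(SOURCE)))  # ['importlib', 'os']
```

The dependency on `json` is absent from the result, so a blueprint built from this pass alone would miss that edge; recovering it requires dynamic tracing or heavier analysis.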
Proposed extensions include integration of dynamic tracing, richer type and alias analysis, support for additional languages, tighter build/test model integration, and advanced progress-tracking for collaborative and AI-augmented development (Haratian et al., 2024, Zhu et al., 30 Jan 2026, Luo et al., 2 Feb 2026).
7. Domain-Specific and Emerging Blueprint Applications
- Formalization (Lean/Mathlib): LeanArchitect demonstrates that embedding blueprint annotations within the proof assistant enables synchronized LaTeX export and authoritative inference of dependencies and progress, and exposes inconsistencies in blueprints that were previously maintained separately (Zhu et al., 30 Jan 2026).
- ML Research Automation: AutoP2C formalizes the translation from academic papers’ multimodal content to full executable repositories by bootstrapping from a high-quality blueprint template, inferring canonical layouts, function signatures, and patterns (Lin et al., 28 Apr 2025).
- Repository-Aware LLM Assistants: RIG and RPG-Encoder enable agents to answer complex compositional queries, recover architectural slices and build/test dependencies, and achieve state-of-the-art results on repository QA and reconstruction tasks. This suggests that blueprints are becoming a core substrate for repository-centric AI capabilities (Cherny-Shahar et al., 15 Jan 2026, Luo et al., 2 Feb 2026).