MemGovern: Governed Experiential Memory for SWE
- MemGovern is a framework that converts fragmented GitHub issue–PR–patch data into structured, agent-friendly experience cards for improved bug resolution.
- It implements a multi-stage pipeline including data collection, governance, standardization, and vector indexing to produce 135,000 high-fidelity experience cards.
- Experiments demonstrate model-agnostic performance gains up to +9.4 percentage points on SWE benchmarks, highlighting its practical impact on autonomous software engineering.
MemGovern is a framework for enhancing autonomous software engineering (SWE) agents by governing and transforming heterogeneous, unstructured human debugging knowledge—extracted from GitHub issue–PR–patch triplets—into actionable, structured experiential memory in the form of experience cards. The system addresses limitations of current SWE agents that operate in a “closed-world,” solving bugs from scratch or reusing within-repository context, by systematically converting social, fragmented GitHub data into a governed, agent-friendly format and introducing agentic primitives for effective search and reasoning over collective human experience (Wang et al., 11 Jan 2026).
1. Closed-World Limitations and Open-World Data Bottlenecks
SWE agents such as SWE-Agent and AutoCodeRover primarily fix bugs either by patch generation from scratch, using only local context, or by intra-repository example retrieval. This “closed-world” paradigm disregards the expansive, cross-repository debugging and development expertise archived on platforms like GitHub. GitHub issue threads and associated pull requests encode valuable information, including root-cause analyses, abstract fix strategies, and validation steps, offering a potent “open-world” knowledge source with the potential to augment agentic reasoning and patch correctness.
However, leveraging this knowledge is nontrivial. Primary challenges include:
- Unstructuredness: Issue threads and PR discussions are rife with procedural noise (e.g., greetings, social chatter, merge messages), verbosity, and terminological inconsistency.
- Fragmentation: Repositories differ in module layouts, naming conventions, and test harnesses, which makes naive transfer of prior fixes brittle.
- Noise density: threads mix sparse diagnostic signal with large volumes of low-value content, making useful records hard to isolate at scale.
Together, these challenges confine agentic systems to within-repository retrieval or to purely model-internal, parametric reasoning.
Without explicit governance, the richness of human debugging and validation encoded in GitHub remains largely inaccessible to SWE agents.
2. Framework Architecture and Experience Governance
MemGovern implements a multi-stage pipeline that transforms raw GitHub data into high-fidelity, agent-friendly memory, structured around experience cards and a two-pronged search/browse interface.
2.1 High-Level Pipeline
- Data Collection: Crawl issue–PR–patch triplets from the top-$M$ GitHub repositories, selected via a scoring function
$$S(r) = \alpha\, N_{\mathrm{star}}(r) + \beta\, N_{\mathrm{issue}}(r) + \gamma\, N_{\mathrm{PR}}(r),$$
where $N_{\mathrm{star}}$, $N_{\mathrm{issue}}$, and $N_{\mathrm{PR}}$ denote the number of stars, issues, and PRs, respectively, and $\alpha$, $\beta$, $\gamma$ are scalar weights.
- Experience Governance: Filter for technically relevant, “closed loop” triplets (linked issue and patch, presence of diagnostic anchors), and discard those whose technical-content ratio falls below a threshold (e.g., 0.2).
- Card Standardization: Encode each filtered example as an experience card with two layers:
  - Index layer: a normalized problem summary and diagnostic signals, with all repository-specific identifiers removed.
  - Resolution layer: root-cause analysis, abstracted fix strategy, and a concise patch digest.
- Quality Control: An LLM evaluator rates experience cards for completeness, clarity, and correctness. Cards failing to meet a threshold are iteratively revised up to three cycles, ensuring only high-fidelity cards are retained.
- Memory Indexing: Cards are embedded using a text-code dual encoder and stored in a vector database (e.g., FAISS).
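The card schema and vector index above can be sketched as follows. This is a minimal illustration using NumPy in place of FAISS; the field names, the hash-based `embed` stand-in, and the `CardIndex` class are illustrative assumptions, not MemGovern's actual implementation:

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ExperienceCard:
    # Index layer: normalized, repository-agnostic problem description
    problem_summary: str
    diagnostic_signals: list = field(default_factory=list)
    # Resolution layer: exposed only when the card is browsed
    root_cause: str = ""
    fix_strategy: str = ""
    patch_digest: str = ""

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in for the text-code dual encoder: deterministic, hash-seeded vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)  # unit-normalized, so dot product = cosine similarity

class CardIndex:
    """Flat cosine-similarity index over card index-layers (FAISS stand-in)."""
    def __init__(self):
        self.cards, self.vectors = [], []

    def add(self, card: ExperienceCard):
        self.cards.append(card)
        self.vectors.append(embed(card.problem_summary))

    def search(self, query: str, top_k: int = 3):
        sims = np.stack(self.vectors) @ embed(query)
        order = np.argsort(-sims)[:top_k]
        return [(self.cards[i], float(sims[i])) for i in order]
```

With real embeddings, a FAISS inner-product index over unit-normalized vectors would play the role of `CardIndex` while preserving the same cosine-similarity semantics.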
2.2 Agentic Experience Search
MemGovern provides two key agentic primitives:
- Searching: Given a query $q$ (derived from failing tests, stack traces, or keywords), compute
$$\mathrm{sim}(q, c) = \cos\big(E(q), E(c)\big)$$
for each card $c$, using the embedding function $E$; the top-$k$ most similar experience cards are retrieved.
- Browsing: For a selected card, the agent gains access to its resolution layer, including the root cause, fix strategy, patch digest, and verification plan.
Agents coordinate this two-step process dynamically: they search with an iteratively refined query, browse small candidate sets, analogically map retrieved fixes onto the current context (e.g., renaming variables, adapting APIs), then test and iterate as needed.
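This search-browse-adapt loop might look like the sketch below. The `experience_search`/`experience_browse` stubs, the toy token-overlap similarity, and the query-refinement rule are all illustrative assumptions standing in for the real primitives:

```python
def overlap(a: str, b: str) -> int:
    """Toy similarity: count of shared lowercase tokens (embedding-cosine stand-in)."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def experience_search(query: str, top_k: int = 3) -> list:
    """Stub for the ExperienceSearch primitive: card ids ranked by similarity."""
    corpus = {
        "card-17": "TypeError when order_by receives an expression",
        "card-42": "HttpResponse cannot serialize memoryview content",
    }
    return sorted(corpus, key=lambda cid: -overlap(query, corpus[cid]))[:top_k]

def experience_browse(card_id: str) -> dict:
    """Stub for the ExperienceBrowse primitive: expose the resolution layer."""
    resolutions = {
        "card-17": {"root_cause": "expression unwrapped incorrectly",
                    "fix_strategy": "extract field.expression.name"},
        "card-42": {"root_cause": "bytes-like types not normalized",
                    "fix_strategy": "convert memoryview/bytearray to bytes"},
    }
    return resolutions[card_id]

def try_patch(strategy: str) -> bool:
    """Stand-in for analogically mapping the strategy onto the repo and running tests."""
    return "memoryview" in strategy  # pretend only this fix validates here

def resolve(initial_query: str, max_rounds: int = 3):
    """Search with a refined query, browse candidates, stop when a fix validates."""
    query = initial_query
    for _ in range(max_rounds):
        for card_id in experience_search(query):
            resolution = experience_browse(card_id)
            if try_patch(resolution["fix_strategy"]):
                return resolution
        query += " traceback"  # toy refinement: append a diagnostic keyword
    return None
```

The key design point is that browsing is lazy: the agent retrieves compact index layers first and pays the token cost of a full resolution layer only for promising candidates.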
3. System Components and Implementation
MemGovern’s end-to-end operation is powered by the following components:
- Governance Backbone: GPT-5.1 (medium reasoning) is employed for content purification, experience card standardization, and quality evaluation.
- Indexing and Retrieval: Utilizes dual-encoder embeddings for both index and resolution layers, with storage and querying handled by FAISS using cosine similarity.
- Agent Integration: Adds two new tools to the SWE-Agent framework, `ExperienceSearch(top_k)` and `ExperienceBrowse(index_id)`, invoked as API primitives within the agent's reasoning loop.
The experience card generation follows a high-level pseudocode:
```
for repo in top_M:
    for (issue, pr, patch) in triplets(repo):
        if passes_selection(purify(issue, pr, patch)):
            card = standardize_with_llm(issue_thread, diff)
            if passes_quality_check(card):
                store(card)
```
4. Experimental Performance and Analysis
4.1 Benchmark Protocol and Quantitative Results
MemGovern is evaluated on SWE-bench Verified, comprising 500 real-world GitHub issues, with success measured by the resolution rate—the percentage of issues for which agents output a patch that passes developer-written unit tests.
Across eight LLM backbones (Claude-4-Sonnet, GPT-5-Medium, DeepSeek-V3.1T, Qwen3-235B, Kimi-K2, Qwen3-Coder-30B, GPT-4o, GPT-4o-Mini), MemGovern provides a model-agnostic improvement, averaging +4.65 percentage points over vanilla SWE-Agent. Selected results:
| Backbone | Baseline (%) | MemGovern (%) | Gain (pp) |
|---|---|---|---|
| Claude-4-Sonnet | 66.6 | 69.8 | 3.2 |
| GPT-5-Medium | 65.0 | 67.4 | 2.4 |
| DeepSeek-V3.1T | 62.8 | 65.8 | 3.0 |
| Qwen3-235B | 47.2 | 55.4 | 8.2 |
| GPT-4o | 23.2 | 32.6 | 9.4 |
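The Gain column is simply the MemGovern rate minus the baseline rate; a quick arithmetic check over the rows above:

```python
# (baseline %, MemGovern %, reported gain in pp) for each listed backbone
rows = {
    "Claude-4-Sonnet": (66.6, 69.8, 3.2),
    "GPT-5-Medium":    (65.0, 67.4, 2.4),
    "DeepSeek-V3.1T":  (62.8, 65.8, 3.0),
    "Qwen3-235B":      (47.2, 55.4, 8.2),
    "GPT-4o":          (23.2, 32.6, 9.4),
}
for name, (base, mem, gain) in rows.items():
    # round to one decimal to avoid binary floating-point artifacts
    assert round(mem - base, 1) == gain, name
```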
Memory coverage ablation demonstrates monotonic gains as the number of governed cards increases (10% to 100% of the full set). Raw PR+patch data, lacking governance, yields unstable and smaller gains; governed experience consistently improves performance.
Regarding retrieval usage, Agentic Experience Search yields improvements of +3.0–9.4 pp, outperforming both Static RAG (+1–2 pp) and Agentic RAG (+0.6–8.0 pp).
4.2 Qualitative Case Studies
- Django `order_by` crash: The baseline approach resulted in a “defensive bypass” violating the API contract, while MemGovern retrieved a card specifying “extract `field.expression.name`,” preserving proper semantics.
- `memoryview` handling in `HttpResponse`: Raw records led to unrelated patch suggestions and incomplete fixes; MemGovern's governed card isolated “handle memoryview + bytearray,” enabling a correct conversion.
5. Limitations and Prospective Extensions
Token overhead inherent in the Search/Browse process is present but offset by the improvement in resolution rate. The current governance mechanism, based on LLMs for content extraction and standardization, could be augmented with human-in-the-loop or algorithmic validators to further enhance fidelity. As a plug-in infrastructure, MemGovern’s card schema can be adapted to other agent scaffolds (e.g., ChatDev, AutoGen) and tasks beyond bug-fixing, such as feature implementation or code review, by refining the experience card model.
A plausible implication is that MemGovern establishes a template for future memory infrastructures, systematically curating and structuring large-scale, heterogeneous engineering experience to augment agentic reasoning across software domains.
6. Summary and Significance
MemGovern constructs a governed, large-scale experiential memory (135,000 cards) from noisy GitHub issue data and equips code agents with a dynamic two-stage retrieval interface. This approach delivers robust, model-agnostic performance gains (+4.65 pp resolution rate on SWE-bench Verified), demonstrating that systematically governed human experience is a powerful complement to LLM-based parametric reasoning in code agents (Wang et al., 11 Jan 2026).