
MemGovern: Governed Experiential Memory for SWE

Updated 16 January 2026
  • MemGovern is a framework that converts fragmented GitHub issue–PR–patch data into structured, agent-friendly experience cards for improved bug resolution.
  • It implements a multi-stage pipeline including data collection, governance, standardization, and vector indexing to produce 135,000 high-fidelity experience cards.
  • Experiments demonstrate model-agnostic performance gains up to +9.4 percentage points on SWE benchmarks, highlighting its practical impact on autonomous software engineering.

MemGovern is a framework for enhancing autonomous software engineering (SWE) agents by governing and transforming heterogeneous, unstructured human debugging knowledge—extracted from GitHub issue–PR–patch triplets—into actionable, structured experiential memory in the form of experience cards. Current SWE agents operate in a “closed world,” solving bugs from scratch or reusing only within-repository context; MemGovern addresses this limitation by systematically converting social, fragmented GitHub data into a governed, agent-friendly format and by introducing agentic primitives for effective search and reasoning over collective human experience (Wang et al., 11 Jan 2026).

1. Closed-World Limitations and Open-World Data Bottlenecks

SWE agents such as SWE-Agent and AutoCodeRover primarily fix bugs either by patch generation from scratch, using only local context, or by intra-repository example retrieval. This “closed-world” paradigm disregards the expansive, cross-repository debugging and development expertise archived on platforms like GitHub. GitHub issue threads and associated pull requests encode valuable information, including root-cause analyses, abstract fix strategies, and validation steps, offering a potent “open-world” knowledge source with the potential to augment agentic reasoning and patch correctness.

However, leveraging this knowledge is nontrivial. Primary challenges include:

  • Unstructuredness: Issue threads and PR discussions are rife with procedural noise (e.g., greetings, social chatter, merge messages), verbosity, and terminological inconsistency.
  • Fragmentation: Repositories differ in module layouts, naming conventions, and test harnesses, which makes naive transfer of prior fixes brittle.
  • Noise-Dense Data: Together, these challenges confine agentic systems to within-repository retrieval or to reliance solely on model-internal, parametric reasoning.

Without explicit governance, the richness of human debugging and validation encoded in GitHub remains largely inaccessible to SWE agents.

2. Framework Architecture and Experience Governance

MemGovern implements a multi-stage pipeline that transforms raw GitHub data into high-fidelity, agent-friendly memory, structured around experience cards and a two-pronged search/browse interface.

2.1 High-Level Pipeline

  1. Data Collection: Crawl issue–PR–patch triplets from the top-M GitHub repositories, selected via a scoring function:

\mathrm{Score}(r) = \lambda_s\log(1 + S_r) + \lambda_i\log(1 + I_r) + \lambda_p\log(1 + P_r),

where S_r, I_r, P_r denote the number of stars, issues, and PRs, respectively, and the λ coefficients are scalar weights.

  2. Experience Governance: Filter for technically relevant, “closed loop” triplets (linked issue and patch, presence of diagnostic anchors), and discard those whose technical-content ratio falls below a threshold τ (e.g., 0.2).
  3. Card Standardization: Encode each filtered example as an experience card

E_i = \langle I_i, R_i \rangle

  • Index layer I_i: a normalized problem summary and diagnostic signals, with all repository-specific identifiers removed.
  • Resolution layer R_i: root-cause analysis, an abstracted fix strategy, and a concise patch digest.

  4. Quality Control: An LLM evaluator rates experience cards for completeness, clarity, and correctness. Cards falling below a threshold γ are iteratively revised for up to three cycles, ensuring only high-fidelity cards are retained.
  5. Memory Indexing: Cards are embedded using a text–code dual encoder and stored in a vector database (e.g., FAISS).
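The repository-selection step can be illustrated with a short Python sketch. The default weights and the example repository counts below are invented for illustration; the paper does not publish its λ values.

```python
import math

def repo_score(stars: int, issues: int, prs: int,
               w_s: float = 1.0, w_i: float = 1.0, w_p: float = 1.0) -> float:
    """Score(r) = λ_s·log(1+S_r) + λ_i·log(1+I_r) + λ_p·log(1+P_r).

    The λ weights default to 1.0 here; they are assumptions, not paper values.
    """
    return (w_s * math.log1p(stars)
            + w_i * math.log1p(issues)
            + w_p * math.log1p(prs))

# Rank a few hypothetical repositories by score (descending) to pick the top-M.
repos = {"repo_a": (50_000, 12_000, 8_000),
         "repo_b": (300, 40, 25),
         "repo_c": (9_000, 2_500, 1_900)}
ranking = sorted(repos, key=lambda r: repo_score(*repos[r]), reverse=True)
```

The log1p form keeps heavily starred repositories from dominating: a 10× difference in stars moves the score by a constant, not a factor.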

MemGovern provides two key agentic primitives:

  • Searching: Given a query q (derived from failing tests, stack traces, or keywords), compute

\mathrm{sim}(q, I_i) = \frac{\phi(q) \cdot \phi(I_i)}{\|\phi(q)\|\,\|\phi(I_i)\|},

using the embedding function ϕ. The top-K most similar experience cards are retrieved.

  • Browsing: For a selected card E^{(k)}, the agent gains access to R^{(k)}, including the root cause, fix strategy, patch digest, and verification plan.

Agents coordinate this two-step process dynamically: they search with an iteratively refined query q, browse small candidate sets, analogically map the retrieved fixes onto the current context (renaming variables, adapting APIs), then test and iterate as needed.
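A minimal sketch of the Searching primitive over cosine similarity follows. The toy card ids and three-dimensional embeddings are invented for illustration; a real deployment would use dual-encoder embeddings and a vector index such as FAISS.

```python
import math

def cosine(u, v):
    """sim(q, I_i) = (ϕ(q)·ϕ(I_i)) / (‖ϕ(q)‖ ‖ϕ(I_i)‖)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def experience_search(query_vec, index, top_k=2):
    """Return ids of the top-K cards whose index layer is most similar to the query."""
    ranked = sorted(index, key=lambda cid: cosine(query_vec, index[cid]), reverse=True)
    return ranked[:top_k]

# Toy index: card id -> embedded index layer I_i (vectors are invented).
index = {
    "card_django_orderby": [0.9, 0.1, 0.0],
    "card_memoryview":     [0.1, 0.8, 0.3],
    "card_unrelated":      [0.0, 0.1, 0.9],
}
hits = experience_search([1.0, 0.2, 0.0], index, top_k=2)
```

The agent would then Browse the resolution layers of `hits` rather than consuming every retrieved card, keeping the context window small.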

3. System Components and Implementation

MemGovern’s end-to-end operation is powered by the following components:

  • Governance Backbone: GPT-5.1 (medium reasoning) is employed for content purification, experience card standardization, and quality evaluation.
  • Indexing and Retrieval: Utilizes dual-encoder embeddings for both index and resolution layers, with storage and querying handled by FAISS using cosine similarity.
  • Agent Integration: Incorporates new tools in the SWE-Agent framework: ExperienceSearch(top-k) and ExperienceBrowse(index_id), invoked as API primitives within the agent’s reasoning loop.
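The two tool primitives might be wrapped for the agent loop roughly as follows. The function names mirror the paper's ExperienceSearch/ExperienceBrowse tools, but the in-memory card store, the keyword-overlap scoring (a stand-in for embedding similarity), and the resolution-layer fields are assumptions for illustration.

```python
# In-memory stand-in for the vector store: card id -> (index layer, resolution layer).
CARD_STORE = {
    "card_001": (
        "TypeError when ordering queryset by expression",   # index layer I_i
        {                                                   # resolution layer R_i
            "root_cause": "field expression not unwrapped before name lookup",
            "fix_strategy": "extract the field expression name before ordering",
            "patch_digest": "unwrap the ordering expression in the compiler",
            "verification": "run the queryset-ordering regression tests",
        },
    ),
}

def experience_search(query: str, top_k: int = 3) -> list:
    """ExperienceSearch: return card ids whose index layer mentions query terms.

    Keyword overlap stands in for the dual-encoder similarity used in MemGovern.
    """
    terms = query.lower().split()
    scored = {cid: sum(t in text.lower() for t in terms)
              for cid, (text, _) in CARD_STORE.items()}
    hits = [cid for cid, s in sorted(scored.items(), key=lambda kv: -kv[1]) if s > 0]
    return hits[:top_k]

def experience_browse(index_id: str) -> dict:
    """ExperienceBrowse: return the resolution layer R^(k) for a selected card."""
    return CARD_STORE[index_id][1]
```

Exposing only these two calls keeps the memory interface narrow: the agent decides when to query and which candidate to open, rather than receiving a fixed context injection.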

The experience card generation follows a high-level pseudocode:

for repo in top_M_repos:
    for (issue, pr, patch) in triplets(repo):
        purified = purify(issue, pr, patch)       # strip social chatter and merge noise
        if passes_selection(purified):            # closed-loop check, technical-content ratio ≥ τ
            card = standardize_with_llm(purified) # build index + resolution layers
            if passes_quality_check(card):        # LLM evaluator, threshold γ
                store(card)

This pipeline yields a corpus of 135,000 governed experience cards.
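The quality-control stage amounts to a bounded revise-and-recheck loop: evaluate, revise on failure, and give up after three cycles. A schematic version, with the evaluator and reviser stubbed out (the threshold value and stub behavior are assumptions):

```python
MAX_CYCLES = 3            # the paper caps revision at three cycles
QUALITY_THRESHOLD = 0.8   # stands in for γ; the actual value is not published

def govern_card(card, evaluate, revise):
    """Return a card meeting the quality threshold, or None if revision never succeeds."""
    for _ in range(MAX_CYCLES):
        if evaluate(card) >= QUALITY_THRESHOLD:
            return card
        card = revise(card)
    return card if evaluate(card) >= QUALITY_THRESHOLD else None

# Toy evaluator/reviser: the score starts at 0.4 and rises by 0.3 per revision.
def make_stubs():
    state = {"score": 0.4}
    def evaluate(card):
        return state["score"]
    def revise(card):
        state["score"] += 0.3
        return card + " (revised)"
    return evaluate, revise

evaluate, revise = make_stubs()
result = govern_card("draft card", evaluate, revise)
```

Returning None for cards that never pass is what keeps the stored corpus high-fidelity: a bad triplet is dropped rather than indexed.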

4. Experimental Performance and Analysis

4.1 Benchmark Protocol and Quantitative Results

MemGovern is evaluated on SWE-bench Verified, comprising 500 real-world GitHub issues, with success measured by the resolution rate—the percentage of issues for which agents output a patch that passes developer-written unit tests.

Across eight LLM backbones (Claude-4-Sonnet, GPT-5-Medium, DeepSeek-V3.1T, Qwen3-235B, Kimi-K2, Qwen3-Coder-30B, GPT-4o, GPT-4o-Mini), MemGovern provides a model-agnostic improvement, averaging +4.65 percentage points over the vanilla SWE-Agent. Selected results:

| Backbone        | Baseline (%) | MemGovern (%) | Gain (pp) |
|-----------------|--------------|---------------|-----------|
| Claude-4-Sonnet | 66.6         | 69.8          | +3.2      |
| GPT-5-Medium    | 65.0         | 67.4          | +2.4      |
| DeepSeek-V3.1T  | 62.8         | 65.8          | +3.0      |
| Qwen3-235B      | 47.2         | 55.4          | +8.2      |
| GPT-4o          | 23.2         | 32.6          | +9.4      |

Memory coverage ablation demonstrates monotonic gains as the number of governed cards increases (10% to 100% of the full set). Raw PR+patch data, lacking governance, yields unstable and smaller gains; governed experience consistently improves performance.

Regarding retrieval strategy, Agentic Experience Search yields improvements of +3.0–9.4 pp, outperforming both Static RAG (which gains only +1–2 pp) and Agentic RAG (+0.6–8.0 pp).

4.2 Qualitative Case Studies

  • Django order_by crash: The baseline approach resulted in a “defensive bypass” violating the API contract, while MemGovern retrieved a card specifying “extract field.expression.name,” preserving proper semantics.
  • memoryview handling in HttpResponse: Raw records led to unrelated patch suggestions and incomplete fixes; MemGovern’s governed card isolated “handle memoryview + bytearray,” enabling a correct conversion.

5. Limitations and Prospective Extensions

The Search/Browse process adds token overhead, but this cost is offset by the improvement in resolution rate. The current governance mechanism, which relies on LLMs for content extraction and standardization, could be augmented with human-in-the-loop or algorithmic validators to further enhance fidelity. As plug-in infrastructure, MemGovern’s card schema can be adapted to other agent scaffolds (e.g., ChatDev, AutoGen) and to tasks beyond bug fixing, such as feature implementation or code review, by refining the experience-card model.

A plausible implication is that MemGovern establishes a template for future memory infrastructures, systematically curating and structuring large-scale, heterogeneous engineering experience to augment agentic reasoning across software domains.

6. Summary and Significance

MemGovern constructs a governed, large-scale experiential memory (135,000 cards) from noisy GitHub issue data and equips code agents with a dynamic two-stage retrieval interface. This approach delivers robust, model-agnostic performance gains (+4.65 pp resolution rate on SWE-bench Verified), demonstrating that systematically governed human experience is a powerful complement to LLM-based parametric reasoning in code agents (Wang et al., 11 Jan 2026).
