
EmeraldData: Asset-Level Environmental Database

Updated 3 January 2026
  • EmeraldData is a regulatory-compliant platform that extracts and validates granular asset-level environmental data from corporate filings.
  • It employs an advanced LLM-driven pipeline using IRZ-CoT prompting, customized chunking, and multi-phase cleaning to enhance extraction precision.
  • The system integrates proprietary validation and retrieval-augmented methods to boost coverage in high-deforestation-risk sectors.

EmeraldData refers to a class of asset-level environmental databases specifically designed to satisfy regulatory requirements such as the European Union Deforestation Regulation (EUDR). These databases provide structured, validated, and highly granular data on physical assets, including details on location, ownership, and commodity use. EmeraldData platforms overcome the limitations of traditional data repositories that rely on aggregated financial metrics and manual entry by adopting automated, LLM-driven pipelines for extraction, cleaning, and validation of data from large-scale corporate filings and other sources, with a particular focus on sectors contributing to deforestation risk (Menon et al., 5 May 2025).

1. Pipeline Architecture for Asset-Level Data Creation

EmeraldData systems employ an automated end-to-end pipeline with the following core components:

  • Data Ingestion: Utilizes tools such as the secEDGAR Python library to bulk-download 10-K financial filings (from 2022–2024) for target companies (e.g., Mining, Oil & Gas, and Utilities sectors). These documents are parsed with BeautifulSoup to remove HTML, scripts, tables, and boilerplate. Parsed text and metadata are persisted in a company-specific SQLite database.
  • Preprocessing and Chunking: To manage filings exceeding 100,000 tokens, text is segmented into 1,024-token chunks with a 20-token overlap aligned on sentence boundaries, enabling parallel and memory-efficient LLM processing.
  • LLM-Driven Extraction: Each chunk is processed using a 4-bit quantized Gemma 2 model, guided by the Instructional, Role-Based, Zero-Shot Chain-of-Thought (IRZ-CoT) prompt. The model outputs structured entities: physical_assets, locations, ownerships, commodities, and their relationships.
  • Database Assembly and Multi-Phase Cleaning:
    • RegEx Standardisation: Cleans extracted data by removing formatting artifacts and standardizing naming conventions.
    • Asset Similarity Consolidation: Vectorizes asset strings with TF-IDF and merges instances with cosine similarity ≥ 0.5.
    • LLM-Assisted Cleaning: Applies domain-specific prompts to Gemma 2 for normalization (e.g., converting chemical symbols to names), punctuation, and verification of country names using Wikipedia.
  • Validation: Applies multi-step validation using both existing data providers (LSEG Workspace) and Retrieval-Augmented Validation for entries that remain unvalidated after the primary step.
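The sentence-aligned chunking step can be sketched in a few lines of Python. Here `count_tokens` is a hypothetical stand-in (a whitespace word count) for the model's actual tokenizer, and the function name mirrors the `sentence_chunk` helper used in the workflow later in this article:

```python
import re

def count_tokens(text: str) -> int:
    # Hypothetical stand-in: word count approximates token count.
    return len(text.split())

def sentence_chunk(text: str, size: int = 1024, overlap: int = 20) -> list[str]:
    """Split text into ~size-token chunks on sentence boundaries,
    carrying ~overlap tokens of trailing context into the next chunk."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = count_tokens(sent)
        if current and current_len + n > size:
            # Flush the current chunk, then seed the next one with
            # the last ~overlap tokens as carried-over context.
            chunks.append(" ".join(current))
            tail = " ".join(" ".join(current).split()[-overlap:])
            current, current_len = [tail], count_tokens(tail)
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The 20-token overlap carries context across chunk boundaries, so an entity mentioned near a cut point remains visible to the LLM in the following chunk.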

This architecture emphasizes modularity, robustness to large-scale and heterogeneous filings, and high degrees of automation, supporting regulatory use cases that require detailed, traceable provenance for each asset entry (Menon et al., 5 May 2025).

2. The IRZ-CoT Prompting Strategy

Central to the extraction process, the IRZ-CoT (Instructional, Role-Based, Zero-Shot Chain-of-Thought) prompt strategy is designed to increase both completeness and accuracy of entity identification. Its core features include:

  • Instructional Component: Directs the LLM to domain-specific concepts and definitions regarding physical assets and commodities.
  • Role-Based Component: Assigns the LLM a consistent expert persona, which has been shown to anchor its outputs and reduce hallucinations.
  • Zero-Shot Chain-of-Thought (CoT): A query format that encourages the model to tackle the extraction task in discrete, logical sub-steps—namely, identifying assets, finding locations, determining ownership, and linking to commodities.

The prompt template explicitly requires stepwise reasoning and outputs in a fixed, structured schema, e.g.:

physical_assets: [ ... ]
locations: [ ... ]
ownerships: [ ... ]
commodities: [ ... ]
relationships: [asset:'', location:'', ownership:'', commodities:'']
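A minimal sketch of how the three IRZ-CoT components might compose into a single prompt string; the wording below is illustrative, not the exact template from the paper, and the helper name matches the `build_IRZ_CoT_prompt` call in the workflow later in this article:

```python
def build_IRZ_CoT_prompt(chunk: str) -> str:
    role = (  # Role-based: fixed expert persona to anchor outputs
        "You are an expert analyst of corporate environmental disclosures."
    )
    instructions = (  # Instructional: domain definitions for target entities
        "A physical asset is a mine, plant, refinery, well, or other facility. "
        "A commodity is the raw material an asset produces or consumes."
    )
    cot_steps = (  # Zero-shot CoT: explicit sub-steps, no worked examples
        "Think step by step: (1) identify physical assets, (2) find their "
        "locations, (3) determine ownership, (4) link assets to commodities."
    )
    schema = (  # Fixed output schema from the template above
        "Answer ONLY in this schema:\n"
        "physical_assets: [...]\nlocations: [...]\nownerships: [...]\n"
        "commodities: [...]\n"
        "relationships: [asset:'', location:'', ownership:'', commodities:'']"
    )
    return f"{role}\n{instructions}\n{cot_steps}\n{schema}\n\nText:\n{chunk}"
```

Keeping the schema inside the prompt lets a downstream parser extract the five entity lists from the model's response deterministically.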

Empirical experiments demonstrate that IRZ-CoT with Gemma 2 achieves a +9 percentage-point gain in precision, +4 in recall, and +6 in F1 score over traditional zero-shot prompting (F1: 0.62 vs. 0.56, p < 0.01 on a 30-chunk ground-truth test set from Alcoa's 2022 filing), indicating superior extraction performance (Menon et al., 5 May 2025).

3. Retrieval-Augmented Validation and Dual-Step Assurance

EmeraldData platforms mitigate gaps in direct database validation coverage through a Retrieval-Augmented Validation (RAV) process. After initial matching with proprietary databases (e.g., LSEG Workspace), unvalidated asset entries undergo the following workflow:

  • Real-Time Web Querying: The system uses the Google Custom Search API to retrieve up to five relevant web snippets for each unvalidated asset.
  • BM25 Ranking: Snippets are ranked for relevance using the BM25 term weighting algorithm; only the top-ranked, high-confidence material is selected.
  • Generation and Classification:
    • Llama 3 generates a concise summary (e.g., clarifying asset location, commodity, or ownership) based on the top snippet.
    • Gemma 2 classifies, via a strict binary (yes/no) prompt, whether the web-derived description matches the database entry; only an unambiguous affirmative response yields validation.
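The BM25 ranking step can be written as a self-contained sketch (here taking the query explicitly, whereas the workflow pseudocode later folds it into `bm25_rank`); this is a plain Okapi BM25 implementation with the standard k1/b defaults, not the paper's code:

```python
import math
import re

def bm25_rank(snippets: list[str], query: str,
              k1: float = 1.5, b: float = 0.75) -> list[str]:
    """Rank snippets by Okapi BM25 relevance to the query, highest first."""
    docs = [re.findall(r"[a-z0-9]+", s.lower()) for s in snippets]
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    q_terms = re.findall(r"[a-z0-9]+", query.lower())
    # Document frequency per query term.
    df = {t: sum(1 for d in docs if t in d) for t in q_terms}

    def score(doc: list[str]) -> float:
        s = 0.0
        for t in q_terms:
            f = doc.count(t)
            if f == 0:
                continue
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        return s

    ranked = sorted(zip(snippets, docs), key=lambda p: score(p[1]), reverse=True)
    return [snippet for snippet, _ in ranked]
```

Taking only the top-ranked snippet, as the pipeline does, discards low-confidence web material before it ever reaches the generation step.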

Validation is assessed at the attribute level (asset, location, ownership, commodity); partial validation is not credited. The process raises cumulative coverage substantially: across Mining, Oil & Gas, and Utilities, coverage rises from 9.5%, 7.7%, and 4.9% respectively to approximately 23% in each sector, roughly a 3× improvement. Integration of advanced table parsing (e.g., LlamaIndex) further elevates coverage in certain settings (e.g., Oil & Gas, from 25% to 94% for MPC) (Menon et al., 5 May 2025).

4. Data Quality Metrics and Mathematical Formulation

EmeraldData platforms operationalize several quality metrics:

  • Entity Extraction Accuracy: Precision, recall, and F1 score are computed as:
    • $\text{Precision} = \frac{TP}{TP + FP}$
    • $\text{Recall} = \frac{TP}{TP + FN}$
    • $\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
  • Asset String Similarity and Consolidation: Cosine similarity on TF-IDF vectors, $\cos\_\text{sim}(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|}$, with a consolidation threshold of 0.5.
  • Validation Coverage Metrics:
    • Hits@5, Jaccard, Cosine, Dice, and Normalized Levenshtein for entity linkage.
    • $\text{LSEG Coverage}\ (\%) = \frac{N_\text{matched}}{N_\text{LSEG}} \times 100$
    • $\text{Total Validation Coverage}\ (\%) = \frac{N_\text{validated}}{N_\text{total}} \times 100$
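The accuracy and coverage formulas above are straightforward to compute; a minimal sketch (function names are illustrative, not from the paper):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Entity-extraction accuracy from true/false positive and negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def coverage_pct(n_matched: int, n_total: int) -> float:
    """Shared form of the LSEG and total-validation coverage metrics."""
    return 100.0 * n_matched / n_total if n_total else 0.0
```

With tp=62, fp=38, fn=38, for example, both precision and recall are 0.62, and F1 collapses to the same value, matching the balanced scores reported for IRZ-CoT above.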

Coverage is computed at both the attribute and overall asset levels, enabling granular identification of validation bottlenecks (Menon et al., 5 May 2025).

5. Sectoral Applicability and Quantitative Performance

EmeraldData methodologies are empirically validated across high-deforestation-risk sectors: Mining, Oil & Gas, and Utilities. The pipeline demonstrates tangible improvements at each operational step. Quantitative benchmarks reveal:

| Sector | Avg LSEG Coverage | Avg LSEG + RAV Coverage |
|---|---|---|
| Mining | 9.5% | 23.5% |
| Oil & Gas | 7.7% | 23.4% |
| Utilities | 4.9% | 22.7% |
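As a quick check on the stated roughly 3× improvement, the per-sector ratios implied by the table can be computed directly:

```python
# Coverage percentages from the table above.
baseline = {"Mining": 9.5, "Oil & Gas": 7.7, "Utilities": 4.9}
with_rav = {"Mining": 23.5, "Oil & Gas": 23.4, "Utilities": 22.7}

improvement = {s: with_rav[s] / baseline[s] for s in baseline}
# Factors range from ~2.5x (Mining) to ~4.6x (Utilities),
# i.e., roughly 3x or better on average.
```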

Results underscore the scalability and reliability of automated extraction and validation, providing robust templates for compliance-driven asset-level data repositories (Menon et al., 5 May 2025).

6. End-to-End Workflow and Implementation

The overall process underlying EmeraldData systems can be formalized as follows:

for company in companies:
    # 1. Ingest
    filings = secEdgar.download(company.CIK, years=(2022, 2023, 2024))
    raw_texts = [BeautifulSoup(f).get_text() for f in filings]
    store_in_sqlite(company, raw_texts)

    # 2. Preprocess → Chunks
    all_text = concatenate(raw_texts)
    chunks = sentence_chunk(all_text, size=1024, overlap=20)

    # 3. Extract with IRZ-CoT
    extractions = []
    for chunk in chunks:
        prompt = build_IRZ_CoT_prompt(chunk)
        response = gemma2.generate(prompt)
        extractions += parse_structured(response)

    # 4. Assemble & Clean
    db_raw = assemble_table(extractions)
    db_clean1 = regex_standardize(db_raw)
    db_clean2 = tfidf_consolidate(db_clean1, threshold=0.5)
    db_clean3 = llm_assisted_clean(db_clean2)

    # 5. Validate with LSEG
    db_valid1 = match_with_LSEG(db_clean3, threshold=0.6)

    # 6. Retrieval-Augmented Validation (RAV)
    unvalidated_assets = find_unvalidated(db_valid1)
    for asset in unvalidated_assets:
        snippets = google_cse_search(asset.name)
        best_snip = bm25_rank(snippets)[0]
        answer = llama3.generate(best_snip)
        is_valid = gemma2.generate_binary(
            f"Compare '{asset.name}' with '{answer}'. Similar? yes/no."
        )
        if is_valid == 'yes':
            mark_valid(asset)

    save_table(company, db_valid1)

The textual workflow diagram is: [SEC EDGAR 10-K] → [HTML cleanup] → [Chunking] → [IRZ-CoT Extraction] → [Raw DB] → [RegEx Standardise] → [TF-IDF Merge] → [LLM Cleaning] → [Clean DB] → [LSEG Validation] → [Unvalidated Assets] → [Google CSE + BM25] → [Llama 3 Generation] → [Gemma 2 Classification] → [Final Validated DB]

This workflow achieves high-throughput, regulated asset-level data construction and validation, supporting ambitions for responsible supply chain operation and environmental, social, and governance (ESG) compliance. The process yields substantial gains in extraction accuracy (increase in F1 score by ≈6 percentage points) and coverage (≈3× improvement), establishing a template for EmeraldData platforms intended for regulatory-driven deployment (Menon et al., 5 May 2025).

References

  • Menon et al., 5 May 2025.
