- The paper introduces ClimateQA, a tool that frames the TCFD-recommended disclosure questions as a QA task, using RoBERTa to automate climate-related information extraction from corporate reports.
- It employs a mixed dataset of hand-labeled and scraped reports to fine-tune RoBERTa models, achieving up to 85.5% F1 score with notable efficiency gains.
- ClimateQA is deployed on Azure, providing sustainability analysts with a web interface to rapidly process and download key report segments.
This paper presents ClimateQA, an NLP tool that automates the analysis of financial and sustainability reports for climate-related information (arXiv:2011.08073). The primary goal is to help sustainability analysts efficiently identify relevant disclosures scattered across lengthy documents, reducing manual effort.
Problem:
Climate change poses significant financial risks, prompting companies to disclose climate-related information in Environmental, Social, and Governance (ESG) reports. However, these reports are often hundreds of pages long, lack standardized structure, and use varied terminology, making manual analysis time-consuming and inefficient. Current methods like keyword searches are often inadequate.
Approach:
The authors framed the task as a question-answering (QA) problem. They utilized the 14 questions recommended by the Task Force on Climate-related Financial Disclosures (TCFD) as prompts. The ClimateQA model takes a TCFD question and a sentence from a report as input and determines if the sentence answers the question.
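This framing makes each (TCFD question, report sentence) pair one binary-classification example. A minimal sketch of the input format, assuming RoBERTa-style sentence-pair concatenation (the question text, sentence, and label below are invented for illustration, not taken from the paper's data):

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    """One input example: does `sentence` answer `question`?"""
    question: str   # one of the 14 TCFD recommended questions
    sentence: str   # a single sentence extracted from a report
    label: int      # 1 = sentence answers the question, 0 = it does not

def to_model_input(pair: QAPair) -> str:
    # RoBERTa encodes sentence pairs by concatenating the two segments
    # with separator tokens before tokenization.
    return f"<s> {pair.question} </s></s> {pair.sentence} </s>"

example = QAPair(
    question="Does the organization describe climate-related risks?",
    sentence="We identify flood risk as a material threat to our coastal assets.",
    label=1,
)
print(to_model_input(example))
```

At inference time the tool runs every sentence of a report against every question, so a single report yields many such pairs.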
Methodology:
- Data Collection:
- Unlabeled: 2,249 financial and sustainability reports were scraped from public sources such as EDGAR and the Global Reporting Initiative database, with raw text extracted using the Tika package. The paper mentions pre-training word embeddings on this corpus to capture financial jargon, although the final model uses standard RoBERTa-Base weights.
- Labeled: A small set of reports previously hand-labeled by sustainability analysts using the TCFD questions was obtained.
- Dataset Creation: Positive examples were created by pairing TCFD questions with their corresponding labeled answer sentences. Negative examples were generated by pairing questions with sentences that did not answer them. This resulted in a highly imbalanced dataset, which was split into training, validation, and test sets based on company names to prevent data leakage. Stratified sampling was used to manage the imbalance, resulting in training/validation/test splits with specific numbers of positive and negative examples (e.g., 1500 positive / 15k negative for training).
- Model Selection & Training:
- The RoBERTa (Robustly Optimized BERT Pretraining Approach) architecture was chosen.
- Both RoBERTa-Base (125M parameters) and RoBERTa-Large (355M parameters) were evaluated.
- Models were fine-tuned on the labeled QA dataset.
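The company-level split described in the dataset-creation step can be sketched as follows, assuming each example carries a company identifier (the company names, counts, and split fractions here are invented for illustration; the paper's actual splits differ):

```python
import random

def split_by_company(examples, train_frac=0.8, val_frac=0.1, seed=13):
    """Split rows so that all rows from one company land in exactly one
    split, preventing leakage of company-specific boilerplate language."""
    companies = sorted({ex["company"] for ex in examples})
    rng = random.Random(seed)
    rng.shuffle(companies)
    n_train = int(len(companies) * train_frac)
    n_val = int(len(companies) * val_frac)
    train_set = set(companies[:n_train])
    val_set = set(companies[n_train:n_train + n_val])
    splits = {"train": [], "val": [], "test": []}
    for ex in examples:
        if ex["company"] in train_set:
            splits["train"].append(ex)
        elif ex["company"] in val_set:
            splits["val"].append(ex)
        else:
            splits["test"].append(ex)
    return splits

rows = [{"company": f"co{i % 10}", "label": i % 2} for i in range(100)]
splits = split_by_company(rows)
# No company appears in more than one split:
seen = [{r["company"] for r in s} for s in splits.values()]
assert seen[0].isdisjoint(seen[1]) and seen[0].isdisjoint(seen[2])
```

Splitting on company rather than on individual examples is what makes the test set a genuine measure of generalization to unseen companies, which the results section probes directly.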
Results:
- Model Performance: RoBERTa-Large achieved slightly higher F1 scores (Test F1: 85.5%) than RoBERTa-Base (Test F1: 82.0%). However, RoBERTa-Base was significantly faster to train (5 hours vs. 12 hours on a 12GB GPU) and required less memory. Due to the minor performance difference and significant efficiency gains, RoBERTa-Base was selected for the final tool.
- Generalization: A performance drop was observed between validation and test sets (average -9.7% F1 for RoBERTa-Base), indicating challenges in generalizing to unseen companies.
- Sector Variation: Performance varied by industry sector. The Energy sector showed the best results (Test F1: 89.8%), possibly due to more standardized reporting or boilerplate language in the training data for that sector. Materials & Buildings showed the largest drop between validation and test (-24.2%).
- Question Variation: Performance also varied significantly depending on the TCFD question. Questions about generic concepts like time frames (Question 4) performed poorly, while highly specific questions about risk management integration (Question 10) showed poor generalization. Questions about GHG emissions (Question 12) had high F1 scores despite being answered infrequently.
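For reference, the F1 scores above combine precision and recall on the positive class; on a dataset as imbalanced as this one (roughly 1:10 positive to negative), F1 is far more informative than accuracy. A small stdlib-only sketch with invented predictions:

```python
def f1_score(y_true, y_pred):
    """Binary F1: harmonic mean of precision and recall on the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# A classifier that predicts "not an answer" everywhere scores 0.0 F1,
# even though it is ~91% accurate on a 1:10 imbalanced set.
y_true = [1] * 10 + [0] * 100
assert f1_score(y_true, [0] * 110) == 0.0
```

This also explains why infrequently answered questions (like Question 12 on GHG emissions) can still show high F1: the metric rewards finding the few positives, not agreeing with the majority class.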
Practical Implementation: The ClimateQA Tool
The research resulted in a deployed tool aimed at end-users (sustainability analysts):
- Deployment: Hosted on Microsoft Azure.
- User Interface: A web application allows users to upload PDF reports.
- Processing Pipeline:
- Text extraction from PDF (using Tika).
- Text parsing and sentence splitting.
- Inference using the fine-tuned RoBERTa-Base model to identify sentences answering TCFD questions.
- Results (identified sentences paired with questions) are stored in Blob Storage as a TSV file for user download.
User -> Web App -> Upload PDF -> Azure Blob Storage
                                        |
                                        V
                            Azure ML Pipeline Trigger
                                        |
+---------------------------+---------------------------+------------------------+
| 1. PDF Text Extraction    | 2. Text Parsing/Splitting | 3. ClimateQA Inference |
| (Tika)                    | (Sentences -> TSV)        | (RoBERTa-Base)         |
+---------------------------+---------------------------+------------------------+
                                        |
                                        V
Results (TSV) -> Azure Blob Storage -> User Download
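The three pipeline stages can be sketched end to end. The sentence splitter and the classifier below are illustrative placeholders only: the deployed tool uses Tika for extraction and the fine-tuned RoBERTa-Base model for inference, and the question texts here are paraphrases, not the paper's exact TCFD wording.

```python
import csv
import io
import re

TCFD_QUESTIONS = [  # illustrative subset standing in for the 14 TCFD questions
    "Does the organization describe climate-related risks and opportunities?",
    "Does the organization disclose Scope 1 and Scope 2 GHG emissions?",
]

def split_sentences(text: str) -> list[str]:
    # Naive regex splitter standing in for the tool's text-parsing stage.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def answers_question(question: str, sentence: str) -> bool:
    # Placeholder for RoBERTa-Base inference: a crude keyword heuristic.
    keywords = {"risk", "emissions", "ghg", "scope"}
    return any(k in sentence.lower() for k in keywords)

def run_pipeline(report_text: str) -> str:
    """Return the TSV of (question, sentence) results the tool stores in Blob Storage."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t")
    writer.writerow(["question", "sentence"])
    for sentence in split_sentences(report_text):
        for question in TCFD_QUESTIONS:
            if answers_question(question, sentence):
                writer.writerow([question, sentence])
    return buf.getvalue()

tsv = run_pipeline("We report Scope 1 GHG emissions annually. Our offices are modern.")
print(tsv)
```

Running every sentence against every question is what makes inference cost matter in deployment, and is part of why the faster RoBERTa-Base was chosen over RoBERTa-Large for the production tool.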
Future Work:
The authors plan to improve PDF text extraction, particularly for tables, potentially exploring commercial tools. They also aim to better integrate domain-specific financial LLMs and enhance the user interface with interactive visualization of results within the original documents.
In summary, the paper details the development and deployment of ClimateQA, an NLP-based tool using RoBERTa fine-tuned for question answering, to automate the extraction of climate-related information from corporate sustainability reports based on TCFD guidelines, addressing a practical need for analysts in the field.