Semantic Clustering of Civic Proposals: A Case Study on Brazil's National Participation Platform

Published 25 Sep 2025 in cs.SE | (2509.21292v1)

Abstract: Promoting participation on digital platforms such as Brasil Participativo has emerged as a top priority for governments worldwide. However, due to the sheer volume of contributions, much of this engagement goes underutilized, as organizing it presents significant challenges: (1) manual classification is unfeasible at scale; (2) expert involvement is required; and (3) alignment with official taxonomies is necessary. In this paper, we introduce an approach that combines BERTopic with seed words and automatic validation by LLMs. Initial results indicate that the generated topics are coherent and institutionally aligned, with minimal human effort. This methodology enables governments to transform large volumes of citizen input into actionable data for public policy.

Abstract PDF Upgrade to Chat

Summary

The paper introduces a robust pipeline using BERTopic, seed words, and LLM validation to automate the classification of civic proposals.
It demonstrates that semi-supervised approaches with VCGE seed words significantly improve taxonomy alignment, as evidenced by ARI and NMI metrics.
The study offers practical insights for real-time policy responses by streamlining civic engagement on Brazil's national participation platform.

Semantic Clustering of Civic Proposals: A Case Study on Brazil's National Participation Platform

Introduction

The paper "Semantic Clustering of Civic Proposals: A Case Study on Brazil's National Participation Platform" addresses the challenge of effectively organizing civic participation data on digital platforms. With the advent of platforms such as "Brasil Participativo" (BP), capturing and organizing civic proposals has become a high-priority task for governments. However, manual classification is impractical at scale, requiring interventions from specialists and alignment with official taxonomies. This study introduces a methodology combining BERTopic, seed words, and automatic validation by LLMs to efficiently transform vast amounts of citizen input into actionable data. The proposed approach seeks to preserve semantic coherence and align institutionally, promising minimal human intervention.

Methodology

The methodological framework is comprehensive, incorporating data extraction and pre-processing, internal validation and hyperparameter optimization, topic modeling, and external validation. Crucial to this pipeline are:

Pre-processing: Involves selection and categorization of civic proposals, using a significant portion of data for training (80%) and the remainder for testing (20%).
Topic Modeling: Utilizes BERTopic with both unsupervised and semi-supervised strategies. The semi-supervised approach integrates seed words derived from the Vocabulário Controlado do Governo Eletrônico (VCGE), ensuring the generated topics align with official government categories.
Figure 1: Pipeline de categorizaÃ§Ã£o temÃ¡tica com BERTopic.
External Validation: Employs LLMs such as Gemma 3 and GPT models to automatically label proposals, utilizing metrics like Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) for alignment verification.

Results

Parameter Optimization

Researchers identified optimal configurations for BERTopic that maximized semantic coherence and thematic diversity. Using BERTimbau-large for embedding generation proved superior across different parameter settings due to its ability to balance internal coherence and thematic diversity.

Figure 2: Weighted Score por Modelo e NÃºmero de TÃ³picos.

The optimal configuration for extracting 56 topics involved a uni-gram model with specific target numbers for topics and topic size. This configuration effectively balanced the thematic diversity and robustness of clusters.

Impact of Semi-supervision

The incorporation of seed words significantly improved semantic alignment with VCGE categories, enhancing both semantic coherence and resultant thematic consistency without substantially compromising topic diversity.

Figure 3: ComparaÃ§Ã£o entre nÃ£o-supervisionado e semi-supervisionado.

Comparative studies between unsupervised and semi-supervised approaches revealed measurable improvements, particularly in ARI and NMI metrics, which reflect the strength of the alignment with official taxonomy at both high-level and more detailed levels.

Figure 4: MÃ©tricas externas de alinhamento (ARI e NMI) nos nÃveis N1 e N2.

Discussion

Strategic and Practical Implications

The paper delineates a practical approach that can be seamlessly integrated into civic engagement platforms to automate the categorization process, significantly reducing manual workload while simultaneously ensuring high-quality information flow—vital for strategic policy formulation. This methodology supports real-time insights into emergent societal trends, such as shifts in public interest towards health or environmental concerns.

Enhancements and Future Directions

This research affirms the utility of integrating linguistic and institutional knowledge represented by specialized embeddings and official vocabularies to capture semantic complexities and alignments at scale. Future developments should focus on periodic expert review to refine seed words, ensuring model adaptability and relevance within dynamic democratic contexts.

Conclusion

This paper successfully presents a robust pipeline for transforming civic proposals into actionable intelligence for policymakers by leveraging BERTopic, specialized LLMs, and official taxonomies. While LLMs effectively automate many processes, ongoing human validation remains necessary for refining and adapting the underlying vocabularies and parameters in varied contexts. This study offers a scalable solution for enabling participatory governance, enriching public sector responsiveness and engagement with citizens.

Markdown Report Issue