
Pakistan National Assembly Debates (2008-2023): Multilingual Corpus, Topic Modeling, and Discourse Analysis

Muhammad Junaid Shah Bukhari (National University of Modern Languages), Syed Taimoor Hussain Shah (Turin Polytechnic University)

Abstract

A Multilingual BERTopic-Based Analysis of Pakistan’s National Assembly Debates: Legislative Themes and Political Discourse

Description

This record contains the data products, computational workflow, and supporting materials associated with the study “A Multilingual BERTopic-Based Analysis of Pakistan’s National Assembly Debates: Legislative Themes and Political Discourse.” The project examines parliamentary debates from the National Assembly of Pakistan over a fifteen-year period (2008-2023) using Natural Language Processing (NLP) methods designed for multilingual, noisy, and code-mixed legislative text.

The archive is based on a corpus of 1,367 official debate documents, comprising more than 116,000 pages and over 32 million words, collected from the official archives of the National Assembly of Pakistan.

The dataset spans three parliamentary terms:

- 2008-2013: Pakistan Peoples Party (PPP)
- 2013-2018: Pakistan Muslim League-Nawaz (PML-N)
- 2018-2023: Pakistan Tehreek-e-Insaf (PTI), followed by the Pakistan Democratic Movement coalition period within the same assembly term

The purpose of this record is to preserve and share the processed corpus, analytical outputs, and reproducible workflow used in the article. It is intended to support transparency, reproducibility, and reuse in research on parliamentary discourse, legislative communication, multilingual political text, and computational social science in low-resource settings.

Purpose of the study

Parliamentary debates are an important institutional record of how elected representatives articulate policy priorities, frame national issues, respond to crises, and negotiate political authority. While large parliamentary corpora have been widely used in computational research for Europe and other high-resource settings, South Asian legislatures remain underrepresented, especially in multilingual and code-mixed contexts.
This study addresses that gap by developing a computational framework for the analysis of Pakistan’s National Assembly debates and by introducing a large-scale corpus suitable for longitudinal research. The archive supports the article’s broader aim of enabling the study of:

- legislative priorities across governments
- rhetorical and discursive change over time
- multilingual parliamentary communication
- computational analysis in under-resourced political text settings

Rather than serving only as a code repository, this Zenodo record is intended as a research companion archive for the article.

Contents of this record

This archive may include:

- processed debate-level datasets
- cleaned text outputs
- metadata tables
- BERTopic assignments
- LDA-based topic outputs
- sentiment analysis outputs
- named entity recognition outputs
- embeddings used for semantic modeling
- BERTopic model files
- trained Word2Vec model
- validation figures and analysis plots
- project documentation
- dependency information

Representative files include:

- extracted_debates.csv
- debates_clean.csv
- debates_final.csv
- debates_with_topics.csv
- debates_final_with_sentiment.csv
- debates_bertopic_v2.csv
- embeddings_v2.npy
- bertopic_model_v2/
- word2vec_model.bin
- figures in .png and .pdf format

Methodological overview

The analysis presented in the article is based on a multi-stage NLP workflow designed for parliamentary records extracted from official PDF documents.

Data extraction and preprocessing

The source documents were obtained from the National Assembly of Pakistan’s official archives. Text extraction was carried out using pdfplumber, with additional steps to distinguish between digital and problematic files. A language-aware preprocessing pipeline was developed to handle extraction noise, procedural boilerplate, punctuation, numbers, and mixed Urdu-English content. Date extraction and rule-based language categorization were also performed as part of corpus preparation.
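The record does not spell out the rules used for language categorization. As an illustration only, one common rule-based approach compares the share of Arabic-script letters (used for Urdu) with the share of Latin-script letters in a document. The sketch below assumes this approach; the function names, the labels, and the 10% threshold are hypothetical and are not taken from the project's code.

```python
# Illustrative sketch only: the labels, threshold, and function names are
# assumptions, not the rules used in the archived workflow.

# Unicode blocks that cover Arabic-script letters, including those used in Urdu.
ARABIC_SCRIPT_RANGES = (
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
)


def script_counts(text: str) -> tuple[int, int]:
    """Return (arabic_script_letters, latin_letters); digits and punctuation are ignored."""
    urdu = latin = 0
    for ch in text:
        if not ch.isalpha():
            continue
        code = ord(ch)
        if any(lo <= code <= hi for lo, hi in ARABIC_SCRIPT_RANGES):
            urdu += 1
        elif ch.isascii():
            latin += 1
    return urdu, latin


def categorize_language(text: str, minor_share: float = 0.10) -> str:
    """Label text as 'urdu', 'english', 'mixed', or 'unknown' (no letters found)."""
    urdu, latin = script_counts(text)
    total = urdu + latin
    if total == 0:
        return "unknown"
    urdu_share = urdu / total
    if urdu_share >= 1 - minor_share:
        return "urdu"
    if urdu_share <= minor_share:
        return "english"
    return "mixed"
```

A script-share rule of this kind is cheap and robust to extraction noise, but it cannot detect Urdu written in Latin script, so a real pipeline would likely need additional rules for romanized text.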
Topic modeling

The project combines LDA and BERTopic approaches. LDA was used as an initial topic-modeling baseline, while BERTopic was used to generate semantically informed themes through transformer embeddings, dimensionality reduction, and clustering. This allowed the study to examine recurring legislative and political themes in a more contextualized way.

Sentiment analysis

Because parliamentary language differs from general-purpose text domains, a custom sentiment lexicon was developed for legislative discourse. This was used to analyze evaluative tone at the document level and across parliamentary terms.

Named entity recognition

Named Entity Recognition was conducted to identify recurring references to political figures, organizations, and locations in the debates. Post-processing rules were applied to reduce noise from procedural and extraction-related artifacts.

Semantic analysis

Word2Vec models were trained on the corpus and term-specific subsets to explore contextual relationships among politically salient terms and to support analysis of discourse framing over time.
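To make the lexicon-based sentiment step more concrete, the following is a minimal sketch of document-level scoring followed by averaging per parliamentary term. The lexicon entries, weights, and function names here are toy placeholders chosen for illustration; the project's custom lexicon and its scoring rule are not reproduced in this description and may differ.

```python
import re
from collections import defaultdict

# Toy lexicon for illustration only; not the project's custom lexicon.
TOY_LEXICON = {
    "welcome": 1.0,
    "commend": 1.0,
    "progress": 0.5,
    "crisis": -0.5,
    "condemn": -1.0,
    "failure": -1.0,
}


def document_sentiment(text: str, lexicon: dict[str, float] = TOY_LEXICON) -> float:
    """Mean weight of lexicon words found in the text; 0.0 if none are found.

    Assumption: simple lowercase Latin-letter tokenization. Urdu-script
    entries would need a different tokenizer.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    matched = [lexicon[token] for token in tokens if token in lexicon]
    if not matched:
        return 0.0
    return sum(matched) / len(matched)


def mean_sentiment_by_term(docs: list[tuple[str, str]]) -> dict[str, float]:
    """Average document scores per parliamentary term.

    `docs` is a list of (term_label, text) pairs, e.g. ("2008-2013", "...").
    """
    scores = defaultdict(list)
    for term, text in docs:
        scores[term].append(document_sentiment(text))
    return {term: sum(values) / len(values) for term, values in scores.items()}
```

Averaging only over matched words keeps long procedural passages from diluting the score, at the cost of making documents with very few matches noisy; a minimum-match threshold would be one way to address that.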
Coverage

Temporal coverage: 2008-2023
Geographic coverage: Pakistan
Institutional coverage: National Assembly of Pakistan
Linguistic coverage:

- Urdu
- English
- code-mixed Urdu-English parliamentary text

Archive structure

Pakistan-National-Assembly-Debates-Analysis/
├── data/
│   ├── raw/
│   │   ├── 2008-2013/
│   │   ├── 2013-2018/
│   │   └── 2018-2023/
│   └── processed/
│       ├── extracted_debates.csv
│       ├── debates_clean.csv
│       ├── debates_final.csv
│       ├── debates_with_topics.csv
│       ├── debates_final_with_sentiment.csv
│       ├── debates_bertopic_v2.csv
│       ├── embeddings_v2.npy
│       ├── bertopic_model_v2/
│       └── word2vec_model.bin
├── paper_final/
│   ├── all_figures.pdf
│   ├── ner_analysis_clean.png
│   ├── word2vec_analysis.png
│   ├── data_quality_report.png
│   ├── validation_metrics.png
│   ├── bertopic_learning_analysis.png
│   └── topic_analysis_*.png
├── README.md
└── requirements.txt

Reproducibility

The materials in this record are intended to support reproduction of the analytical workflow described in the article. The project was developed in Python and executed in a notebook-based environment, with Google Colab used during development.

Core dependencies include:

- pdfplumber
- pymupdf
- pandas
- scikit-learn
- matplotlib
- spacy
- gensim
- bertopic
- sentence-transformers
- umap-learn
- hdbscan
- torch
- Pillow

Where redistribution of raw source PDFs is restricted by size or archival constraints, this record provides the processed derivatives and analytical outputs necessary to inspect and reuse the workflow.

Reuse potential

This archive may be useful for research in:

- parliamentary NLP
- legislative discourse analysis
- multilingual corpus analysis
- computational political science
- South Asian political studies
- topic modeling in institutional text
- sentiment analysis for formal political language
- semantic change in political discourse
- low-resource NLP workflows

It may also serve as a starting point for comparative studies involving other legislative institutions, including provincial assemblies and multilingual parliamentary corpora from other regions.
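Before reusing the outputs, it can help to confirm that a local copy of the archive contains the processed files listed under the archive structure. The sketch below is one possible check using only the standard library. The file and directory names come from the listing in this record; the function name and the assumption that the archive has been unpacked with the same layout are illustrative.

```python
from pathlib import Path

# Names taken from the archive structure listed in this record.
PROCESSED_FILES = [
    "extracted_debates.csv",
    "debates_clean.csv",
    "debates_final.csv",
    "debates_with_topics.csv",
    "debates_final_with_sentiment.csv",
    "debates_bertopic_v2.csv",
    "embeddings_v2.npy",
    "word2vec_model.bin",
]
PROCESSED_DIRS = ["bertopic_model_v2"]


def missing_processed(root: str) -> list[str]:
    """List expected entries that are absent from <root>/data/processed.

    `root` is assumed to be the unpacked archive directory, e.g.
    "Pakistan-National-Assembly-Debates-Analysis". Directories are
    reported with a trailing slash.
    """
    processed = Path(root) / "data" / "processed"
    missing = [name for name in PROCESSED_FILES if not (processed / name).is_file()]
    missing += [f"{name}/" for name in PROCESSED_DIRS if not (processed / name).is_dir()]
    return missing
```

An empty return value means every listed processed output was found. Because the record states that it "may include" these materials, a non-empty result does not necessarily indicate a corrupted download.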
Related publication

A Multilingual BERTopic-Based Analysis of Pakistan’s National Assembly Debates: Legislative Themes and Political Discourse

This Zenodo record is the companion archive for the above article and should be cited alongside the publication where relevant.

Acknowledgements

This project is based on publicly available parliamentary records from the National Assembly of Pakistan. We acknowledge the value of these records as an important public resource for legislative transparency and research.

We also acknowledge the contributions of open-source software communities whose tools supported the corpus construction and computational analysis, including pdfplumber, spaCy, BERTopic, gensim, scikit-learn, and related Python libraries.

The authors further gratefully acknowledge the support of the National University of Modern Languages (NUML), Pakistan, and Politecnico di Torino, Italy, for providing the academic environment and institutional support that contributed to this work.
