Article

Integrating Thesaurus-Based Knowledge into Transformer Models for Semantic Understanding of Domain-Specific Texts

Bayangali Abdygalym, Department of Information Systems, L.N. Gumilyov Eurasian National University, Astana 010000, Kazakhstan
Saule Tazhibayeva, Department of Information Systems, L.N. Gumilyov Eurasian National University, Astana 010000, Kazakhstan
Madina Sambetbayeva, Department of Information Systems, L.N. Gumilyov Eurasian National University, Astana 010000, Kazakhstan
Aigerim Yerimbetova, Institute of Information and Computational Technologies of the Committee Science of the Ministry of Science and Higher Education of the Republic of Kazakhstan, Almaty 050010, Kazakhstan
Roman Taberkhan, Department of Information Systems, L.N. Gumilyov Eurasian National University, Astana 010000, Kazakhstan
Manzura Abjalova, Department of Theoretical and Applied Linguistics, Tashkent State University of Uzbek Language and Literature, Tashkent 100060, Uzbekistan
Aidos Sabdenov, Department of Computer Engineering, International University of Information Technologies, Almaty 050000, Kazakhstan
Elmira Daiyrbayeva, Department of Software Engineering, Satbayev University, Almaty 050010, Kazakhstan

Computers (journal), 2026, English


Abstract

Integrating structured linguistic resources into deep learning architectures is a key challenge in domain-oriented NLP. This study proposes a framework for incorporating knowledge from a military thesaurus of the Ground Forces, structured according to the XML Zthes standard, into pre-trained transformer language models, including KazBERT, multilingual BERT, and XLM-RoBERTa. The approach addresses two interrelated tasks in specialized terminology processing: concept linking and semantic search. Unlike existing knowledge-injection methods, which are designed primarily for general-domain applications, the framework formalizes the mapping of Zthes elements (Term, Broader term, Narrower term, Related term, ScopeNote, Language, and Source) into structured textual representations that transformer architectures can process directly. Fine-tuning is conducted on a dataset of 18,400 training instances generated automatically from the thesaurus, including synonym pairs, hierarchical relations (hyperonymy and hyponymy), associative links, and definitional descriptions. Experimental evaluation shows that thesaurus-enriched models outperform baseline architectures across all major metrics. The XLM-RoBERTa model achieves F1 = 0.84 and Top-5 accuracy = 0.94 in the concept linking task, a five-point improvement over the baseline. The model reaches Macro-F1 = 0.84 across four relation types. Results obtained on a specialized test set derived from terminology databases of Kazakhstan's Armed Forces confirm robust cross-lingual generalization across Kazakh, Russian, and English military discourse.
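The mapping of Zthes elements into textual representations described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the simplified Zthes-style element layout (`term`, `termName`, `termLanguage`, `termNote`, `relation`, `relationType`), the sample record, the `RELATION_LABELS` wording, the ` ; ` separator, and the helper names `serialize_term` and `relation_pairs`. It is not the authors' implementation and does not use data from their thesaurus.

```python
import xml.etree.ElementTree as ET

# Hypothetical Zthes-style record. Both the element layout and the content
# are illustrative assumptions, not data from the study's thesaurus.
SAMPLE = """
<Zthes>
  <term>
    <termId>1</termId>
    <termName>tank</termName>
    <termLanguage>en</termLanguage>
    <termNote>Armoured tracked combat vehicle.</termNote>
    <relation>
      <relationType>BT</relationType>
      <termName>armoured vehicle</termName>
    </relation>
    <relation>
      <relationType>NT</relationType>
      <termName>main battle tank</termName>
    </relation>
    <relation>
      <relationType>RT</relationType>
      <termName>tank crew</termName>
    </relation>
  </term>
</Zthes>
"""

# Assumed wording for relation codes when they are written out as text.
RELATION_LABELS = {
    "BT": "broader term",
    "NT": "narrower term",
    "RT": "related term",
}


def _text(element, tag):
    """Return the stripped text of a child element, or '' if it is absent."""
    return (element.findtext(tag) or "").strip()


def serialize_term(term):
    """Flatten one term record into a single string a tokenizer can consume."""
    parts = [f"term: {_text(term, 'termName')}"]
    language = _text(term, "termLanguage")
    if language:
        parts.append(f"language: {language}")
    for relation in term.findall("relation"):
        label = RELATION_LABELS.get(_text(relation, "relationType"))
        target = _text(relation, "termName")
        if label and target:
            parts.append(f"{label}: {target}")
    note = _text(term, "termNote")
    if note:
        parts.append(f"scope note: {note}")
    return " ; ".join(parts)


def relation_pairs(term):
    """List (source term, relation code, target term) triples for one record."""
    source = _text(term, "termName")
    return [
        (source, _text(relation, "relationType"), _text(relation, "termName"))
        for relation in term.findall("relation")
    ]


root = ET.fromstring(SAMPLE.strip())
for record in root.findall("term"):
    print(serialize_term(record))
    print(relation_pairs(record))
```

Under these assumptions, the serialized string could be passed to a tokenizer as model input, and triples of the kind returned by `relation_pairs` could serve as labelled examples of hierarchical and associative relations for fine-tuning.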

Topics

Biomedical Text Mining and Ontologies; Natural Language Processing Techniques; Linguistics and Terminology Studies

Identifiers

DOI: 10.3390/computers15050297

Citations and references

Citations: 0
References used: 27
