Integrating Thesaurus-Based Knowledge into Transformer Models for Semantic Understanding of Domain-Specific Texts
Abstract
Integrating structured linguistic resources into deep learning architectures is a key challenge in domain-oriented NLP. This study proposes a framework for incorporating knowledge from a military thesaurus of the Ground Forces, structured according to the XML Zthes standard, into pre-trained transformer language models, including KazBERT, multilingual BERT, and XLM-RoBERTa. The approach addresses two interrelated tasks in specialized terminology processing: concept linking and semantic search. Unlike existing knowledge-injection methods, which are designed primarily for general-domain applications, the framework formalizes the mapping of Zthes elements, such as Term, Broader term, Narrower term, Related term, ScopeNote, Language, and Source, into structured textual representations that transformer architectures can process directly. Fine-tuning is conducted on a dataset of 18,400 training instances generated automatically from the thesaurus, including synonym pairs, hierarchical relations (hypernymy and hyponymy), associative links, and definitional descriptions. Experimental evaluation shows that thesaurus-enriched models outperform baseline architectures across all major metrics. The XLM-RoBERTa model achieves F1 = 0.84 and Top-5 accuracy = 0.94 in the concept linking task, a five-point improvement over the baseline. The model also reaches Macro-F1 = 0.84 across four relation types. Results obtained on a specialized test set derived from terminology databases of Kazakhstan's Armed Forces confirm robust cross-lingual generalization across Kazakh, Russian, and English military discourse.
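To make the mapping from Zthes elements to textual input more concrete, the following is a minimal Python sketch of one possible serialization. It is an illustration under stated assumptions, not the pipeline used in this study: the element names (`termName`, `termLanguage`, `termNote`, `relation`, `relationType`) reflect a common reading of the Zthes XML layout, while the sample record, the `RELATION_LABELS` table, the ` | ` separator, and the helper names `serialize_term` and `relation_pairs` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical Zthes-style record; the content is invented for illustration only.
SAMPLE = """
<Zthes>
  <term>
    <termId>1</termId>
    <termName>tank battalion</termName>
    <termLanguage>en</termLanguage>
    <termNote>Illustrative scope note for the example.</termNote>
    <relation>
      <relationType>BT</relationType>
      <termId>2</termId>
      <termName>armoured unit</termName>
    </relation>
    <relation>
      <relationType>RT</relationType>
      <termId>3</termId>
      <termName>tank company</termName>
    </relation>
  </term>
</Zthes>
"""

# Assumed mapping from relation codes to readable labels.
RELATION_LABELS = {
    "BT": "broader term",
    "NT": "narrower term",
    "RT": "related term",
    "UF": "synonym",
}


def relation_pairs(term):
    """Return (head, relation label, tail) triples for one <term> element."""
    head = (term.findtext("termName") or "").strip()
    triples = []
    for rel in term.findall("relation"):
        code = (rel.findtext("relationType") or "").strip()
        tail = (rel.findtext("termName") or "").strip()
        # Unknown codes are kept as-is rather than dropped.
        triples.append((head, RELATION_LABELS.get(code, code), tail))
    return triples


def serialize_term(term, sep=" | "):
    """Flatten one <term> element into a single labelled text string."""
    parts = ["term: " + (term.findtext("termName") or "").strip()]
    lang = term.findtext("termLanguage")
    if lang:
        parts.append("language: " + lang.strip())
    note = term.findtext("termNote")
    if note:
        parts.append("scope note: " + note.strip())
    for _, label, tail in relation_pairs(term):
        parts.append(f"{label}: {tail}")
    return sep.join(parts)


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE.strip())
    for term in root.findall("term"):
        print(serialize_term(term))
        print(relation_pairs(term))
```

Under these assumptions, the flattened string could serve as model input for concept linking or semantic search, and the triples could seed relation-typed training pairs. How the study itself formats and samples its 18,400 instances is not specified in this abstract.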