Article

Development of a Software Model for Classification and Automatic Cataloging of Archive Documents

Adilbek Dauletov, Bahodir Muminov (Department of Artificial Intelligence, Tashkent State University of Economics, Tashkent 100066, Uzbekistan)
Noila S. Matyakubova (Language Teaching Center, Alfraganus University, Tashkent 100190, Uzbekistan)
Uldona Abdurahmonova (Department of Computational Linguistics and Digital Technologies, Tashkent State University of Uzbek Language and Literature Named after Alisher Navoi, Tashkent 100100, Uzbekistan)
Khurshida Bakhriyeva, Makhbubakhon Fayzieva (Department of Information Systems and Technologies, National Pedagogical University of Uzbekistan, Tashkent 100007, Uzbekistan)

Information · journal · 2026 · en

Abstract

This study proposes an integrated software model for automatic document classification and metadata generation based on the Dublin Core standard, addressing the need for rapid and consistent management of archival documents in a digital environment. The approach combines receiving incoming documents, converting them to text with optical character recognition (OCR), image preprocessing (binarization, deskewing, noise reduction), and text cleaning and vectorization (TF-IDF) into a single pipeline. In the classification stage, the Bidirectional Encoder Representations from Transformers (BERT) model, a context-sensitive transformer architecture, is used to increase accuracy by modeling document content at a deep semantic level, alongside classical machine learning models (Logistic Regression, Naive Bayes, Support Vector Machine) and an ensemble approach (LightGBM). Experiments were conducted on the RVL-CDIP dataset. OCR performance was evaluated using the character error rate (CER), and classification results were evaluated using accuracy, precision, recall, and F1-score. The results confirmed the high stability and generalization ability of the BERT (accuracy 95.1%, F1 95.0%) and LightGBM (accuracy 93.2%, F1 93.2%) models. In the final stage, the outputs of OCR, named entity recognition (NER), and classification are automatically organized into Dublin Core metadata elements (Title, Creator, Date, Description, Subject, Type, Format, Language) and exported in JSON and XML formats. This automation significantly reduces manual cataloging effort and improves indexing and retrieval efficiency in digital archival systems.
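The abstract names two steps that are easy to illustrate in isolation: scoring OCR output with CER, and packing pipeline outputs into Dublin Core elements for JSON/XML export. The following is a minimal Python sketch under stated assumptions, not the authors' implementation. The function names (`character_error_rate`, `to_dublin_core`, `dublin_core_to_xml`), the `metadata` wrapper element, and all sample values are hypothetical; CER is taken here in its common form, the Levenshtein edit distance divided by the reference length, and element names use the lowercase form of the Dublin Core element set.

```python
import json
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
# The eight elements listed in the abstract, in lowercase element-set form.
DC_ELEMENTS = ("title", "creator", "date", "description",
               "subject", "type", "format", "language")


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = Levenshtein distance(reference, hypothesis) / len(reference)."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    # Row-by-row dynamic programming over the edit-distance table.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ch in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_ch in enumerate(hypothesis, start=1):
            cost = 0 if ref_ch == hyp_ch else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


def to_dublin_core(fields: dict) -> dict:
    """Keep only non-empty values whose keys are Dublin Core elements."""
    return {name: fields[name] for name in DC_ELEMENTS if fields.get(name)}


def dublin_core_to_xml(record: dict) -> str:
    """Serialize a record as XML with dc:-prefixed child elements."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in record.items():
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical outputs of the OCR, NER, and classification stages.
pipeline_output = {
    "title": "Annual report",
    "creator": "Finance Department",
    "date": "1998-03-14",
    "type": "report",          # assumed to be the predicted class label
    "language": "en",
    "ocr_confidence": 0.93,    # not a Dublin Core element, so it is dropped
}

record = to_dublin_core(pipeline_output)
record_json = json.dumps(record, ensure_ascii=False, indent=2)
record_xml = dublin_core_to_xml(record)
cer = character_error_rate("kitten", "sitting")
```

Keeping the element whitelist in one tuple means that stage-specific extras, such as a confidence score, never leak into the exported record.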

Topics

Handwritten Text Recognition Techniques; Text and Document Classification Technologies; Image Retrieval and Classification Techniques

Identifiers

DOI: 10.3390/info17040341

Citations and references

0 citations · 16 references
