Development and Comparative Analysis of Algorithms for Automatic Classification of Documents in Electronic Archives
Abstract
This article presents a comparative analysis of machine learning and deep learning algorithms for automatic document classification in electronic archival systems. To address the growing difficulty of managing large volumes of digitized historical and administrative documents, the study evaluates four models: the convolutional neural networks VGG-16 and GoogLeNet, the gradient boosting framework LightGBM, and the transformer-based BERT model. Each model is assessed by precision, recall, and F1-score, reported alongside per-class support (the number of documents in each class). The methodology covers the full document processing pipeline: Optical Character Recognition (OCR) extracts text from scanned documents, after which the text is preprocessed through noise removal and data standardization, and the semantic features needed for categorization are extracted. A confusion matrix is used to examine model performance and to identify specific misclassification patterns. The results support integrating modern artificial intelligence models into electronic archive infrastructures and provide a data-driven basis for selecting the most suitable algorithms for more efficient and accurate document management.
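The evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes nothing about the study's actual data or code: the class names (`contract`, `invoice`, `letter`) and the six example predictions are hypothetical, and the helper functions `confusion_matrix` and `per_class_report` are written here for illustration only. The sketch builds a confusion matrix and derives per-class precision, recall, F1-score, and support from it.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count documents per (true class, predicted class) pair.

    Rows are true classes, columns are predicted classes.
    """
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[pred_label]] += 1
    return matrix


def per_class_report(matrix, labels):
    """Derive precision, recall, F1-score, and support for each class."""
    report = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        support = sum(matrix[i])                  # documents truly in this class
        predicted = sum(row[i] for row in matrix)  # documents assigned to this class
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": support,
        }
    return report


# Hypothetical document classes and predictions, for illustration only.
labels = ["contract", "invoice", "letter"]
y_true = ["contract", "contract", "invoice", "invoice", "letter", "letter"]
y_pred = ["contract", "invoice", "invoice", "invoice", "letter", "contract"]

matrix = confusion_matrix(y_true, y_pred, labels)
report = per_class_report(matrix, labels)

for label in labels:
    print(label, report[label])
```

Off-diagonal cells of the matrix show which classes are confused with each other; in this toy data, one contract is predicted as an invoice and one letter as a contract.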