Increasing the Reliability of Information in Electronic Documents Based on the Use of Statistical Redundancy and n-Gram Structure of Texts
Abstract
The problem is formulated, and a methodology is developed for creating a technology that increases the reliability of information in electronic document management systems. The technology relies on mechanisms for extracting statistical, logical, and semantic links, specific characteristics of elements, and relationships between document concepts; these determine how the statistical redundancy of the text is used, based on models of multidimensional probability distributions of mono-, di-, tri-, and n-grams. Tools for increasing the reliability of information are obtained, based on modified rules of statistical Huffman coding, the formation of n-gram text statistics, and the selection of a rational set of hash functions. A methodology is developed for studying the probability function of undetected errors, with adaptable boundaries, variables, and intervals for checking whether document elements belong to subsets of expanded and prohibited values. A two-alternative classifier that separates reliable from unreliable information is studied; its capabilities are extended by a clustering mechanism with a one-sided model that scans a text line from the left or from the right. A software package for increasing the reliability of information has been developed and implemented in C++; it combines the proposed mechanisms and runs in the CUDA parallel computing environment. The package is designed to detect and correct multiple errors in information.
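
As a rough illustration of how n-gram statistics can expose an unreliable document element, the following minimal C++ sketch flags a word when it contains a character n-gram never observed in trusted reference text. It is a simplified stand-in and not the package described above: the names ngrams, NgramModel, train, and is_suspicious are hypothetical, the text is assumed to be single-byte (ASCII), and the probabilistic models, modified Huffman coding, hash functions, error correction, and CUDA execution are all omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: collect every character n-gram of length n from a word.
// Assumes single-byte characters; words shorter than n yield no n-grams.
std::vector<std::string> ngrams(const std::string& word, std::size_t n) {
    assert(n > 0);
    std::vector<std::string> out;
    for (std::size_t i = 0; i + n <= word.size(); ++i) {
        out.push_back(word.substr(i, n));
    }
    return out;
}

// Hypothetical model: the set of n-grams observed in trusted reference text.
class NgramModel {
public:
    explicit NgramModel(std::size_t n) : n_(n) {}

    // Record every n-gram that occurs in the reference words.
    void train(const std::vector<std::string>& words) {
        for (const auto& w : words) {
            for (const auto& g : ngrams(w, n_)) {
                seen_.insert(g);
            }
        }
    }

    // A word is flagged when at least one of its n-grams was never observed.
    // Words shorter than n have no n-grams and are therefore never flagged.
    bool is_suspicious(const std::string& word) const {
        for (const auto& g : ngrams(word, n_)) {
            if (seen_.count(g) == 0) {
                return true;
            }
        }
        return false;
    }

private:
    std::size_t n_;
    std::unordered_set<std::string> seen_;
};
```

With a digram model (n = 2) trained on the words "document", "statement", and "element", the distorted word "docxment" would be flagged because the digrams "cx" and "xm" never occur in the reference set, while "document" would pass. A frequency-based model would replace the set with counts and a threshold, at the cost of choosing that threshold.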