Article

Towards automated content analysis of rhetorical structure of written essays using sequential content-independent features in Portuguese

Rafael Ferreira Mello (Cesar School, Brazil and Federal Rural University of Pernambuco, Brazil)
Giuseppe Fiorentino (Federal Rural University of Pernambuco, Brazil)
Hilário Oliveira (Instituto Federal do Espírito Santo - Campus Serra, Brazil)
Péricles Miranda (Federal Rural University of Pernambuco, Brazil)
Mladen Raković (Monash University, Australia)
Dragan Gašević (Monash University, Australia and University of Edinburgh, United Kingdom)
2022 · en
ABI

Abstract

Brazilian universities have included essay writing assignments in the entrance examination procedure to select prospective students. The essay scorers manually look for the presence of required Rhetorical Structure Theory (RST) categories and evaluate essay coherence. However, identifying RST categories manually is time-consuming. The literature reports several attempts to automate the identification of RST categories in essays with machine learning. Still, previous studies have focused on machine learning algorithms trained on content-dependent features, which can diminish classification performance by leading to over-fitting and hindering model generalisability. Therefore, this paper proposes: (i) an analysis of state-of-the-art classifiers and content-independent features for the task of identifying RST rhetorical moves; (ii) a new approach that considers the sequence of the text when extracting features, i.e. sequential content-independent features; (iii) an empirical study of the generalisability of the machine learning models and sequential content-independent features in this context; (iv) the identification of the most predictive features for automated identification of RST categories in essays written in Portuguese. The best-performing classifier, XGBoost based on sequential content-independent features, outperformed the classifiers used in the literature that are based on traditional content-dependent features. The same classifier also reached promising accuracy when tested for generalisability.

Citations and references

Cited by 20 references