THE EFFECT OF TRAINING SAMPLE SIZE ON THE STABILITY OF CLASSIFICATION MODELS
DOI: https://doi.org/10.30857/2786-5371.2025.6.3

Keywords: training sample size, model stability, classification, machine learning, learning curve, machine learning algorithms, ensemble methods, Python

Abstract
Purpose. The research is aimed at a comprehensive analysis of the impact of training sample size on classification model stability and determining optimal strategies for selecting sample size for different types of machine learning algorithms. The goal of the work is to develop a methodology for assessing model stability depending on the volume of training data and to determine recommendations for selecting optimal sample size to achieve high stability and generalization ability of classification models.
Methodology. The research methodology is based on experimental analysis of the performance and stability of different types of classification models (logistic regression, Random Forest, Gradient Boosting, neural networks) trained on samples of different sizes (from 100 to 10000 examples). Model stability is assessed using the coefficient of variation of accuracy, the variance of accuracy, and the interquartile range obtained by training each model multiple times on different data subsets. Learning curve analysis is applied to determine saturation points and assess model complexity, complemented by progressive cross-validation. The effectiveness of stability improvement methods is investigated, including data augmentation techniques, regularization, and ensemble methods.
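The stability metric described above can be illustrated with a minimal sketch. The function names, dataset, and subset sizes below are illustrative assumptions, not the paper's exact experimental protocol: a model is trained repeatedly on random subsets of a given size, and the coefficient of variation (CV) of its test accuracy across runs is reported.

```python
# Sketch (assumed setup): estimate model stability as the coefficient of
# variation of test accuracy over repeated training on random subsets.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def accuracy_cv(model_factory, X, y, subset_size, n_runs=20, seed=0):
    """CV of test accuracy across n_runs random subsamples of subset_size."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_runs):
        # Draw a random subset of the requested size, then hold out 30% of it.
        idx = rng.choice(len(X), size=subset_size, replace=False)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X[idx], y[idx], test_size=0.3,
            random_state=int(rng.integers(1_000_000)))
        model = model_factory()
        model.fit(X_tr, y_tr)
        accs.append(accuracy_score(y_te, model.predict(X_te)))
    accs = np.asarray(accs)
    return accs.std(ddof=1) / accs.mean()


# Synthetic data stands in for the paper's experimental datasets.
X, y = make_classification(n_samples=10000, n_features=20, random_state=42)
for n in (100, 500, 2000):
    cv = accuracy_cv(lambda: LogisticRegression(max_iter=1000), X, y, n)
    print(f"n={n}: CV of accuracy = {cv:.3f}")
```

On synthetic data the CV typically shrinks as the subset size grows, mirroring the trend the abstract reports for real experiments.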
Findings. Experimental results demonstrate a significant dependence of classification model stability on training sample size. For simple linear models (logistic regression), stable performance is achieved at a sample size of approximately 2000-3000 examples, while complex models (neural networks) require 5000-10000 examples. The coefficient of variation of accuracy decreases with increasing sample size: for logistic regression from 0.25 to 0.08, for Random Forest from 0.18 to 0.05, for Gradient Boosting from 0.15 to 0.04, and for neural networks from 0.22 to 0.06. Data augmentation techniques are found to reduce the coefficient of variation by 52-68% at small sample sizes, and ensemble methods keep the coefficient of variation below 0.05 even for samples of 500 examples. Class imbalance and feature space dimensionality are shown to affect model stability, requiring a corresponding correction of the optimal sample size.
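The saturation points mentioned above can be located on a learning curve. The sketch below is a simplified assumption of the paper's procedure: the saturation point is taken here as the smallest training size whose mean cross-validated score comes within 0.01 of the score at the full size, which is an illustrative criterion, not necessarily the one used in the study.

```python
# Sketch (assumed criterion): find an approximate saturation point on a
# learning curve as the smallest training size within 0.01 of the final score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic data stands in for the paper's experimental datasets.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

sizes, _, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5, n_jobs=-1)
mean_val = val_scores.mean(axis=1)  # mean validation score per training size

final = mean_val[-1]
saturated = sizes[mean_val >= final - 0.01]  # sizes "close enough" to final
print("approximate saturation point:", saturated[0], "training examples")
```

Comparing the saturation point across algorithms (logistic regression vs. Random Forest vs. a neural network) reproduces the qualitative pattern reported in the findings: simpler models saturate at smaller sample sizes.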
Originality. A comprehensive methodology for assessing classification model stability depending on training sample size is developed, including theoretical analysis of the relationship between sample size and the variance component of generalization error, empirical methods for determining saturation points, and comparative analysis of the effectiveness of different stability improvement methods. The impact of class imbalance and feature space dimensionality on the relationship between sample size and model stability is systematically investigated for the first time. A classification of models by their dependence on training sample size is developed, taking into account algorithm type, model complexity, and data nature.
Practical value. The obtained results make it possible to justify the choice of optimal training sample size for a specific classification task depending on algorithm type, model complexity, and data nature. The developed recommendations can be applied in fields where high stability of classification models is required, including medical diagnostics, financial analysis, cybersecurity, and image processing. The methodology for determining optimal sample size helps optimize the use of computational resources and ensures high reliability of classification results under conditions of limited training data.