THE EFFECT OF TRAINING SAMPLE SIZE ON THE STABILITY OF CLASSIFICATION MODELS

Authors

  • Vladyslav PYLYPENKO Kyiv National University of Technologies and Design, Ukraine

DOI:

https://doi.org/10.30857/2786-5371.2025.6.3

Keywords:

training sample size, model stability, classification, machine learning, learning curve, machine learning algorithms, ensemble methods, Python

Abstract

Purpose. The research is aimed at a comprehensive analysis of the impact of training sample size on classification model stability and determining optimal strategies for selecting sample size for different types of machine learning algorithms. The goal of the work is to develop a methodology for assessing model stability depending on the volume of training data and to determine recommendations for selecting optimal sample size to achieve high stability and generalization ability of classification models.

Methodology. The research methodology is based on experimental analysis of the performance and stability of different types of classification models (logistic regression, Random Forest, Gradient Boosting, neural networks) trained on samples of different sizes (from 100 to 10,000 examples). Model stability is assessed using the coefficient of variation of accuracy, the variance of accuracy, and the interquartile range, computed by training models repeatedly on different data subsets. Learning curve analysis is applied to determine saturation points and assess model complexity, together with progressive cross-validation. The effectiveness of stability improvement methods is also investigated, including data augmentation techniques, regularization, and ensemble methods.
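The stability metric described above can be sketched as follows. This is a minimal illustration, not the paper's actual code: the function name, dataset, and model choice are assumptions. A model is trained repeatedly on random subsamples of a fixed size, and the coefficient of variation (standard deviation divided by mean) of its test accuracy is reported.

```python
# Minimal sketch (not the paper's code): stability as the coefficient of
# variation (CV) of accuracy over repeated training runs on random
# subsamples of a fixed size.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def stability_cv(X, y, sample_size, n_runs=20, seed=0):
    """Train a model n_runs times on random subsamples of sample_size
    examples; return (mean accuracy, CV of accuracy across runs)."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_runs):
        # Draw a random subsample of the requested size
        idx = rng.choice(len(X), size=sample_size, replace=False)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X[idx], y[idx], test_size=0.3,
            random_state=int(rng.integers(0, 1_000_000)))
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        accs.append(accuracy_score(y_te, model.predict(X_te)))
    accs = np.asarray(accs)
    return accs.mean(), accs.std() / accs.mean()

# Synthetic data stands in for the experimental datasets
X, y = make_classification(n_samples=10000, n_features=20, random_state=42)
for n in (100, 1000, 5000):
    mean_acc, cv = stability_cv(X, y, n)
    print(f"n={n}: mean accuracy={mean_acc:.3f}, CV={cv:.3f}")
```

On such synthetic data the CV typically shrinks as the subsample grows, mirroring the trend the article reports, though the exact values depend on the dataset and model.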

Findings. Experimental results demonstrate a significant dependence of classification model stability on training sample size. Simple linear models (logistic regression) reach stable performance at approximately 2000-3000 examples, while complex models (neural networks) require 5000-10,000 examples. The coefficient of variation of accuracy decreases with increasing sample size: for logistic regression from 0.25 to 0.08, for Random Forest from 0.18 to 0.05, for Gradient Boosting from 0.15 to 0.04, and for neural networks from 0.22 to 0.06. Data augmentation techniques reduce the coefficient of variation by 52-68% at small sample sizes, and ensemble methods keep the coefficient of variation below 0.05 even for samples of 500 examples. Class imbalance and feature space dimensionality are shown to affect model stability, requiring the optimal sample size to be corrected accordingly.
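The ensemble effect reported above can be illustrated with a small experiment. This is a hedged sketch on synthetic data, not the paper's setup: the helper name, dataset, and hyperparameters are assumptions, and the resulting numbers will not match the article's figures. It compares the run-to-run coefficient of variation of accuracy for a single decision tree versus a bagged ensemble, both trained on subsamples of 500 examples.

```python
# Illustrative sketch: bagging stabilises accuracy at small sample sizes.
# Compares CV of accuracy (across repeated 500-example subsamples) for a
# single decision tree vs. a bagged ensemble of trees.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

def accuracy_variation(model, X, y, sample_size=500, n_runs=15, seed=1):
    """CV of 5-fold cross-validated accuracy across random subsamples."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_runs):
        idx = rng.choice(len(X), size=sample_size, replace=False)
        accs.append(cross_val_score(model, X[idx], y[idx], cv=5).mean())
    accs = np.asarray(accs)
    return accs.std() / accs.mean()

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
tree_cv = accuracy_variation(DecisionTreeClassifier(random_state=0), X, y)
bag_cv = accuracy_variation(
    BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                      random_state=0), X, y)
print(f"single tree CV={tree_cv:.3f}, bagging CV={bag_cv:.3f}")
```

Averaging over many bootstrap-trained trees damps the variance component of the error, which is why the bagged ensemble's accuracy fluctuates less across subsamples than a single tree's.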

Originality. A comprehensive methodology for assessing classification model stability as a function of training sample size is developed, including theoretical analysis of the relationship between sample size and the variance component of generalization error, empirical methods for determining saturation points, and a comparative analysis of the effectiveness of different stability improvement methods. The impact of class imbalance and feature space dimensionality on the relationship between sample size and model stability is systematically investigated for the first time. A classification of models by their dependence on training sample size is developed, taking into account algorithm type, model complexity, and the nature of the data.

Practical value. The results obtained make it possible to justify the choice of optimal training sample size for a specific classification task depending on algorithm type, model complexity, and the nature of the data. The developed recommendations can be applied in fields where high stability of classification models is required, including medical diagnostics, financial analysis, cybersecurity, and image processing. The methodology for determining optimal sample size helps optimize the use of computational resources and ensures high reliability of classification results under conditions of limited training data.

Author Biography

Vladyslav PYLYPENKO, Kyiv National University of Technologies and Design, Ukraine

PhD Student, Department of Information and Computer Technologies

https://orcid.org/0000-0002-2761-4817

Scopus Author ID: 58089336700

Published

2025-12-23

How to Cite

PYLYPENKO, V. (2025). THE EFFECT OF TRAINING SAMPLE SIZE ON THE STABILITY OF CLASSIFICATION MODELS. Technologies and Engineering, 26(6), 32–44. https://doi.org/10.30857/2786-5371.2025.6.3

Section

INFORMATION TECHNOLOGIES, ELECTRONICS, MECHANICAL AND ELECTRICAL ENGINEERING