ISSN : 1598-5504(Print)
ISSN : 2383-8272(Online)
Journal of Agriculture & Life Science Vol.55 No.2 pp.99-107
DOI : https://doi.org/10.14397/jals.2021.55.2.99

# Performance Analysis of Different Optimizers, Batch Sizes, and Epochs on Convolutional Neural Network for Image Classification

Thavisack Sihalath, Jayanta Kumar Basak, Anil Bhujel, Elanchezhian Arulmozhi, Byeong-Eun Moon, Na-Eun Kim, Doeg-Hyun Lee, Hyeon-Tae Kim*
Department of Bio-systems Engineering, Gyeongsang National University (Institute of Smart Farm), Jinju 52828, Republic of Korea.
*Corresponding author: Hyeon-Tae Kim Tel: +82-055-772-1896 Fax: +82-055-772-1899 E-mail: bioani@gnu.ac.kr
December 10, 2020 ; March 23, 2021 ; April 8, 2021


## Introduction

Hyper-parameters significantly influence the performance of machine learning models. Choosing appropriate hyper-parameters plays an essential role in the success of a neural network architecture because they directly control the behavior of the training algorithm. This study applied several gradient descent optimizers to observe their effect on the loss value. It also examined how training accuracy is affected by different batch sizes (BS) and numbers of epochs (EP) in a convolutional neural network (CNN) for an image classification problem. The model was implemented using the deep convolutional neural network algorithm (Hodan, 1954). Computer vision covers various application domains such as localization, segmentation, image classification and object detection. Image classification has recently become a main component of computer vision, and it is a demanding problem because of the increasing volume of content such as images and videos. Image classification and object detection are used extensively in fields such as facial recognition, pedestrian detection for security, intelligent video analysis, pedestrian tracking and plant disease diagnosis (Huo & Yin, 2015; Li et al., 2017; Rangarajan et al., 2018; Mane et al., 2018). Image classification can therefore be considered an essential solution for categorizing and labeling groups of pixels or vectors in images. One of the most common ways to optimize a neural network is to treat the choice of optimizer as a hyper-parameter. Black-box function optimization and Bayesian optimization of hyper-parameters have been introduced before (Hutter, 2014). Some researchers have shown that the benefits of decaying the learning rate can often be achieved by instead increasing the batch size during training.
However, improving the performance of convolutional neural network models remains a relevant and open problem in deep learning. Adjusting the training parameters is one way to carry out model development (Radiuk, 2018).

Therefore, gradient descent methods have been introduced to perform optimization for deep learning (Ruder, 2016; Selvaraju et al., 2016). Batch gradient descent uses the entire training dataset to compute the gradient of the cost function with respect to the parameters. Stochastic gradient descent performs a parameter update for every training example. Mini-batch gradient descent takes the best of both strategies by performing a parameter update for every mini-batch of training samples (Reddy et al., 2018). The batch size affects the training process because it also influences the network (Ioffe & Szegedy, 2015). Moreover, the batch size used when fitting the model affects the time to converge and the amount of overfitting (Radiuk, 2018). Research on optimization methods for machine learning has suggested that the batch size be set to no more than 64 (Takáč et al., 2013). Other studies showed that, with large datasets, large batch sizes caused optimization difficulties, although the trained networks still demonstrated good generalization (Goyal et al., 2017). Daneshmand et al. (2016) found that the loss of a recurrent neural network model with mean squared error was 0.15, 0.25, and 0.14 for the Adagrad, RMSProp and Adam optimizers, respectively.
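The three update strategies described above differ only in how many samples feed each parameter update. The following is a minimal sketch in plain Python, assuming a hypothetical one-parameter least-squares problem (toy data from y = 3x, learning rate 0.1, 200 epochs); none of these values come from the study.

```python
import random

def gradient(w, batch):
    """Mean gradient of 0.5 * (w * x - y)**2 with respect to w over a batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def train(data, batch_size, epochs, lr=0.1, seed=0):
    """Gradient descent where batch_size selects the variant:
    len(data) -> batch, 1 -> stochastic, anything in between -> mini-batch."""
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    updates = 0
    for _ in range(epochs):
        rng.shuffle(data)  # new sample order every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= lr * gradient(w, batch)  # one parameter update per batch
            updates += 1
    return w, updates

# Hypothetical toy data generated from y = 3x.
data = [(x / 10.0, 3.0 * x / 10.0) for x in range(1, 11)]

w_batch, n_batch = train(data, batch_size=len(data), epochs=200)  # batch gradient descent
w_sgd, n_sgd = train(data, batch_size=1, epochs=200)              # stochastic gradient descent
w_mini, n_mini = train(data, batch_size=4, epochs=200)            # mini-batch gradient descent

print(n_batch, n_sgd, n_mini)  # 200 2000 600
```

With the same number of epochs, the smaller batch sizes perform more parameter updates, which is one reason batch size interacts with convergence time.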

In this paper, we studied image classification with deep learning using a convolutional neural network and various gradient descent optimizers. The main contributions are as follows. First, an experiment was conducted on a dataset consisting of 2 classes for a binary classification problem. Second, we applied the VGG16 convolutional neural network architecture proposed by Simonyan & Zisserman (2015) to classify the samples of the dataset. Finally, different hyper-parameters, namely the optimizer, BS, and EP, were applied to the model to determine the loss and accuracy for each configuration. Furthermore, this study creates scope for further experimentation on several datasets under different hyper-parameter conditions to find a suitable range of optimizers for neural networks. It also improves understanding of how different batch sizes and numbers of epochs can be used to improve accuracy in image classification.

## Materials and Methods

### 1. Experiment

The experiment was conducted to determine how manually set hyper-parameters affect the loss and accuracy of a CNN model for an image classification problem. It should be noted that deep learning involves further hyper-parameters, such as the number of hidden layers, the learning rate and the loss function, that can affect the algorithms. However, this study focused on three hyper-parameters: the optimizer, the batch size and the number of epochs. The model used the pre-trained VGG16 CNN architecture with the SGD, Adam, Adagrad and Adamax optimizers, batch sizes of 16, 32, 64 and 120, and 50, 100 and 150 epochs, as listed in Table 1. The experiment was run on a cats-and-dogs dataset of 24000 images with a size of 224×224 pixels. The dataset was split into 19211 images for training and 2000 images for testing. The optimizer algorithms used to investigate model performance are given in equations (1)-(4).
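The hyper-parameter combinations described above can be enumerated as a grid. This is a small sketch, assuming that every optimizer was paired with every batch size and epoch count; the function name `build_grid` and the dictionary keys are illustrative and not taken from the study's code.

```python
from itertools import product

# Values listed in the text for the three hyper-parameters under study.
OPTIMIZERS = ["SGD", "Adam", "Adagrad", "Adamax"]
BATCH_SIZES = [16, 32, 64, 120]
EPOCHS = [50, 100, 150]

def build_grid():
    """Return one configuration per (optimizer, batch size, epochs) combination."""
    return [
        {"optimizer": opt, "batch_size": bs, "epochs": ep}
        for opt, bs, ep in product(OPTIMIZERS, BATCH_SIZES, EPOCHS)
    ]

grid = build_grid()
print(len(grid))  # 48 configurations (4 x 4 x 3)
print(grid[0])
```

Under that assumption the full comparison requires 48 training runs, which is why training time per configuration matters on modest hardware.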

Vanilla stochastic gradient descent (SGD) uses the least memory for a given batch size (Duchi et al., 2012). It computes the gradient of the parameters from a single training example, or a few examples, chosen uniformly at random.

$\theta = \theta - \alpha \nabla_{\theta} J(\theta; x, y)$
(1)

• $\theta$ denotes the parameters, which refer to the weights, biases and activations

• $\alpha$ is the learning rate, $\alpha = 0.01$

• $\nabla_{\theta}$ is the gradient of $J$ taken with respect to $\theta$

• $J$ is the objective function (cost function or loss function)

• $J(\theta; x, y)$ is the objective evaluated at the parameters $\theta$ for a training example $(x, y)$
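The update in equation (1) can be illustrated with a single step on a small parameter vector. This is a sketch only: the parameter and gradient values are made up, and the learning rate 0.01 is the value stated in the list above.

```python
def sgd_step(theta, grad, lr=0.01):
    """One vanilla SGD update: theta <- theta - lr * grad, element by element."""
    return [t - lr * g for t, g in zip(theta, grad)]

# Hypothetical parameters and gradient of the loss for one training example.
theta = [0.5, -1.0]
grad = [2.0, -4.0]

new_theta = sgd_step(theta, grad)
print(new_theta)  # approximately [0.48, -0.96]
```

Each parameter moves against its gradient by the same fixed fraction, which is what distinguishes vanilla SGD from the adaptive methods that follow.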

Adaptive Moment Estimation, or “Adam”, combines SGD with momentum and RMSProp (Hinton, 2012) and is widely accepted in practice for training neural networks. It has been widely used in machine learning frameworks (Kingma & Ba, 2015).

$w_{t+1} = w_t - \alpha \cdot \frac{\hat{v}_t}{\sqrt{\hat{s}_t} + \epsilon}$
(2)

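• $\alpha$ is the learning rate or step size

• $\epsilon$ is a small term preventing division by zero, $\epsilon = 10^{-8}$

• $v_t$ is the updated biased first moment estimate, and $\hat{v}_t$ is its bias-corrected value

• $s_t$ is the updated biased second moment estimate, and $\hat{s}_t$ is its bias-corrected value

• $w_t$ is the parameter being updated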
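A single Adam update following equation (2) can be sketched as below. The decay rates `beta1 = 0.9` and `beta2 = 0.999` and the learning rate 0.001 are the defaults from Kingma & Ba (2015) and are assumptions here, since the text does not state the values used in the experiment; only ϵ = 10⁻⁸ comes from the list above.

```python
import math

def adam_step(w, grad, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a list of parameters.

    v and s are the biased first and second moment estimates carried between
    steps; t is the 1-based step counter used for bias correction.
    """
    v = [beta1 * vi + (1 - beta1) * g for vi, g in zip(v, grad)]
    s = [beta2 * si + (1 - beta2) * g * g for si, g in zip(s, grad)]
    v_hat = [vi / (1 - beta1 ** t) for vi in v]  # bias-corrected first moment
    s_hat = [si / (1 - beta2 ** t) for si in s]  # bias-corrected second moment
    w = [wi - lr * vh / (math.sqrt(sh) + eps)
         for wi, vh, sh in zip(w, v_hat, s_hat)]
    return w, v, s

# Hypothetical single parameter with gradient 0.5 at the first step.
w, v, s = adam_step([1.0], [0.5], [0.0], [0.0], t=1)
print(w)
```

At the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly one learning rate regardless of the gradient's magnitude.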

Adagrad works well with sparse data and with training large-scale neural networks (Duchi et al., 2012). It uses a different learning rate for every parameter at each time step, based on the past gradients computed for that parameter. In effect, the learning rate is tuned separately for different sets of parameters.

$\theta = \theta - \frac{\alpha}{\sqrt{s_t + \epsilon}} \cdot \nabla_{\theta} J(\theta; x, y)$
(3)

• $\theta$ denotes the parameters, which refer to the weights, biases and activations

• $\alpha$ is the learning rate, $\alpha = 0.002$

• $s_t$ is the accumulated sum of the squared past gradients for each parameter

• $\epsilon$ is a small term preventing division by zero, $\epsilon = 10^{-7}$

• $\nabla_{\theta}$ is the gradient of $J$ taken with respect to $\theta$

• $J$ is the objective function (cost function or loss function)
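The per-parameter scaling of Adagrad can be sketched as follows, using α = 0.002 and ϵ = 10⁻⁷ from the list above. Placing ϵ inside the square root follows the form given by Ruder (2016) and is an assumption about equation (3); the gradient values are made up.

```python
import math

def adagrad_step(theta, grad, s, lr=0.002, eps=1e-7):
    """One Adagrad update; s accumulates the squared gradients per parameter."""
    s = [si + g * g for si, g in zip(s, grad)]
    theta = [t - lr * g / math.sqrt(si + eps)
             for t, g, si in zip(theta, grad, s)]
    return theta, s

# Two hypothetical parameters whose gradients differ by a factor of ten.
theta0, s0 = [0.0, 0.0], [0.0, 0.0]
theta1, s1 = adagrad_step(theta0, [1.0, 0.1], s0)
theta2, s2 = adagrad_step(theta1, [1.0, 0.1], s1)

print(theta1)  # both parameters move by about 0.002 despite different gradients
print(theta2)  # the second step is smaller because s has grown
```

Because each gradient is divided by its own accumulated history, parameters with small gradients receive steps comparable to those with large gradients, and all steps shrink as training proceeds.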

Adamax (Kingma & Ba, 2015) is a variant of Adam that replaces the RMS scaling of the past gradients with a scaling based on their infinity norm.

$w_{t+1} = w_t - \frac{\alpha}{s_t} \cdot \hat{v}_t$
(4)

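• $w_t$ is the parameter being updated

• $\alpha$ is the learning rate or step size, $\alpha = 0.02$

• $s_t$ is the exponentially weighted infinity norm of the past gradients, which takes the place of the second moment estimate in Adam

• $\hat{v}_t$ is the bias-corrected first moment estimate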
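One Adamax update following equation (4) can be sketched as below, with α = 0.02 from the list above. The decay rates `beta1` and `beta2` are the defaults from Kingma & Ba (2015), and the small `eps` in the denominator is an added guard against a zero gradient at the first step; both are assumptions, not values reported in the study.

```python
def adamax_step(w, grad, v, s, t, lr=0.02, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adamax update for a single scalar parameter.

    v is the biased first moment estimate, s is the exponentially weighted
    infinity norm of past gradients, and t is the 1-based step counter.
    """
    v = beta1 * v + (1 - beta1) * grad
    s = max(beta2 * s, abs(grad))   # infinity-norm update replaces the RMS term
    v_hat = v / (1 - beta1 ** t)    # bias-corrected first moment
    w = w - lr * v_hat / (s + eps)  # eps is a hypothetical safety term
    return w, v, s

# Hypothetical parameter 1.0 with gradient 0.5 at the first step.
w_new, v_new, s_new = adamax_step(1.0, 0.5, 0.0, 0.0, t=1)
print(w_new, s_new)
```

Taking the maximum instead of averaging squared gradients means a single large gradient bounds the step size until it decays away.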

### 2. Batch Size and Epoch

Batch size is a hyper-parameter that sets the number of data points the model works through in each iteration, that is, the number of samples processed before the model is updated. Batch size is closely related to well-known ideas in parameter optimization (You et al., 2017; Devarakonda et al., 2017) and in training with large batch sizes (Hoffer et al., 2017). Keskar et al. (2019) also suggested that a large BS can hurt the model’s ability to generalize. Choosing a suitable batch size is essential for the convergence of the cost function and for the generalization of the model’s parameter values. Radiuk (2018) applied two sets of batch sizes, (16, 32, 64, 128, 256, 512, 1024) and (50, 100, 150, 200, 250), to the LeNet CNN architecture with the MNIST and CIFAR-10 datasets to observe the impact of batch size on the models. In choosing a batch size, a balance needs to be struck between the available computational hardware and computational speed. Therefore, this study chose mini-batches of 16, 32, 64, and 120 samples, which were used in the CNN with the hyper-parameters shown in Table 1.

In artificial neural networks, the number of epochs is a hyper-parameter that is defined before training a model. It defines the number of times the learning algorithm works through the entire training dataset. One epoch is a single pass of the entire dataset forward and backward through the neural network. Each epoch can be divided into smaller batches, and the weights of the network are updated further as the number of epochs increases. Given the complexity and variability of data in real problems, it may take hundreds to thousands of epochs to reach sensible accuracy on test data. In this study, we used 50, 100, and 150 epochs, as shown in Table 1.
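The relationship between batch size, epochs and the number of weight updates can be worked out for the 19211 training images used here. The sketch below assumes that the final, smaller batch of each epoch is kept rather than dropped; the function names are illustrative.

```python
import math

TRAIN_IMAGES = 19211  # training split reported in the text

def updates_per_epoch(n_samples, batch_size):
    """Number of mini-batches (weight updates) in one pass over the data."""
    return math.ceil(n_samples / batch_size)

def total_updates(n_samples, batch_size, epochs):
    """Total weight updates over the whole training run."""
    return updates_per_epoch(n_samples, batch_size) * epochs

for bs in (16, 32, 64, 120):
    print(bs, updates_per_epoch(TRAIN_IMAGES, bs))
# 16 1201
# 32 601
# 64 301
# 120 161

print(total_updates(TRAIN_IMAGES, 120, 150))  # 24150
print(total_updates(TRAIN_IMAGES, 16, 150))   # 180150
```

At the same number of epochs, a batch size of 16 performs about 7.5 times as many weight updates as a batch size of 120, so comparisons across batch sizes also compare different numbers of updates.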

In this work, the model was built in Python 3.7. All code was run on a desktop computer with Windows 10 Pro (64-bit operating system), an AMD Ryzen 3 2200G processor with Radeon Vega Graphics at 3.5 GHz, and 4.0 GB of RAM.

### 3. Loss function evaluation criteria

A loss function is minimized during training, so smaller values represent a better model than larger values. In this study, the Binary Cross-entropy Loss Function (BCLF) was used to evaluate the performance of the model for each optimizer (Sadowski, 2017; Srivastava et al., 2018). The BCLF measures the difference between the true and predicted probability distributions. In addition, the partial derivatives are computed over the predicted and true distributions, and the error is computed from the difference between the two distributions. The BCLF was applied with the SGD, Adam, Adagrad and Adamax optimizers to compare their loss values. It was set up for binary classification between two classes (cats and dogs). The formulation is given in equation (5).

$CE = -\sum_{i=1}^{C'=2} t_i \log(f(s_i)) = -t_1 \log(f(s_1)) - (1 - t_1) \log(1 - f(s_1))$
(5)

The losses are summed over the different binary problems for backpropagation. We set the number of classes to $C' = 2$ for a binary classification problem, and $f$ is the sigmoid activation function. Here $s_1$ and $t_1$ are the score and the ground truth label for class $C_1$, which is also the class $C_i$ in $C$, while $f(s_2) = 1 - f(s_1)$ and $t_2 = 1 - t_1$ are the predicted probability and the ground truth label for class $C_2$.
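The binary cross-entropy for a single sample can be computed directly from the raw score and the label. This is a minimal sketch that assumes a sigmoid activation on the score; the clipping constant `eps` is an added safeguard against log(0) and is not part of the formulation in the text.

```python
import math

def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def binary_cross_entropy(t1, s1, eps=1e-12):
    """Loss for one sample with label t1 in {0, 1} and raw score s1."""
    p = sigmoid(s1)
    p = min(max(p, eps), 1.0 - eps)  # clip so that log() stays finite
    return -t1 * math.log(p) - (1 - t1) * math.log(1 - p)

# An uninformative score of 0 gives probability 0.5 and a loss of ln 2.
print(binary_cross_entropy(1, 0.0))

# A confident correct prediction has a lower loss than a confident wrong one.
print(binary_cross_entropy(1, 4.0) < binary_cross_entropy(1, -4.0))  # True
```

The loss approaches zero only when the predicted probability for the true class approaches one, which is why lower values in the loss tables indicate better fits.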

## Results and Discussion

### 1. Loss evaluation

To determine the effect of the hyper-parameters on the CNN, which used the pre-trained VGG16 architecture with a sigmoid activation function, the model was configured with a loss function. The loss values for each optimizer with different batch sizes and numbers of epochs, for training and testing, are shown in Table 2. The SGD loss values were stable during training with different BS and EP, as illustrated in Fig. 1 (a), (b) and (c). However, increasing the number of EP improved the performance of the models. The best performance was obtained with 32 BS and 150 EP (training loss = 0.080), and the highest loss value occurred with 120 BS and 50 EP.

The dynamics of Adam during training are illustrated in Fig. 2 (a), (b) and (c). Srivastava et al. (2018) achieved similar loss values (Loss = ‘1.0’) with Adam and Adagrad while training their models. In this paper, the model with 64 BS and 150 EP performed best, with the lowest training loss of 0.0396, while the maximum loss of 0.1294 was found with 16 BS and 50 EP for the Adam optimizer.

As illustrated in Fig. 3 (a), (b) and (c), the initial loss was high for BS 16, 32, 64 and 120. However, the final loss values obtained with different BS and EP did not differ significantly for the Adagrad optimizer. Adagrad reached a minimum loss of 0.0105 with 120 BS and 150 EP. In contrast, the maximum loss (Loss = 0.1913) was found at 32 BS and 100 EP.

As shown in Fig. 4 (a), (b) and (c), the loss of Adamax was stable and the optimizer performed well during training. Adamax worked very well with the higher numbers of BS and EP for both accuracy and loss, using 120 BS and 150 EP (Acc = 99.31%, Loss = 0.0197).

### 2. Accuracy evaluation

The accuracy obtained for each optimizer with different batch sizes and numbers of epochs, for training and testing, is given in Table 3. For the SGD optimizer, the best performance was obtained with 32 BS and 150 EP. Earlier work on CNNs that used Adam and various other optimizers achieved nearly 68% accuracy during training (Bello et al., 2017). Reddy et al. (2018) observed that, in a CNN architecture, the Adam optimizer reached a maximum accuracy similar to that of Adagrad and RMSProp. In this paper, Adam performed well in training accuracy at 64 and 120 BS with 150 EP, reaching 98.39% and 98.28%, respectively. Adagrad and Adamax showed the highest accuracy, 99.65% and 99.31%, respectively, with 120 BS and 150 EP. The time consumed with each optimizer, batch size and number of epochs is shown in Table 3.

### 3. Performance of increasing batch size and epoch

The batch size is closely related to the tuning of parameters. Several authors have studied adaptively increasing the batch size on fixed schedules in the context of accelerating optimization through variance reduction (Babanezhad et al., 2015; Daneshmand et al., 2016). Based on the statistical evaluation metrics (accuracy and loss), the highest accuracy of the various optimizers across the batch sizes and epochs in this study is shown in Table 3. The highest

The empirical results of this study were satisfactory for the selected hyper-parameters. The best model achieved a classification accuracy of 99.87% with the Adamax optimizer at a batch size of 120 and 150 epochs. On the other hand, the Adagrad optimizer had the best performance in the loss evaluation.

## Acknowledgment

This work was supported by Korea Institute of Planning and Evaluation for Technology in Food, Agriculture, Forestry and Fisheries (IPET) through Agriculture, Food and Rural Affairs Research Center Support Program, funded by Ministry of Agriculture, Food and Rural Affairs (MAFRA) (717001-7).

## Figures

Fig. 1. Dynamics of the loss values of the SGD optimizer during model training with the binary cross-entropy loss function and different batch sizes (BS). (a) 50 epochs; (b) 100 epochs; (c) 150 epochs.

Fig. 2. Dynamics of the loss values of the Adam optimizer during model training with the binary cross-entropy loss function and different batch sizes (BS). (a) 50 epochs; (b) 100 epochs; (c) 150 epochs.

Fig. 3. Dynamics of the loss values of the Adagrad optimizer during model training with the binary cross-entropy loss function and different batch sizes (BS). (a) 50 epochs; (b) 100 epochs; (c) 150 epochs.

Fig. 4. Dynamics of the loss values of the Adamax optimizer during model training with the binary cross-entropy loss function and different batch sizes (BS). (a) 50 epochs; (b) 100 epochs; (c) 150 epochs.

## Tables

Table 1. Hyper-parameter setup of the CNN model with increasing batch sizes and numbers of epochs for the selected optimizers

Table 2. Performance of each optimizer with different batch sizes and numbers of epochs, measured by the loss metric

Table 3. Performance of each optimizer with different batch sizes and numbers of epochs, measured by the accuracy metric

## References

1. Babanezhad R, Ahmed MO, Virani A, Schmidt M, Konečný J and Sallinen S. 2015. Stop wasting my gradients: Practical SVRG. J. Adv Neu Info Proc Sys. 2251-2259.
2. Bello I, Zoph B, Vasudevan V and Le QV. 2017. Neural optimizer search with reinforcement learning. Int Conf on Machine Learning 1: 712-721.
3. Daneshmand H, Lucchi A and Hofmann T. 2016. Starting small-learning with adaptive sample sizes. Int Conf on Machine Learning 3: 2167-2186.
4. Devarakonda A, Naumov M and Garland M. 2017. AdaBatch: Adaptive batch sizes for training deep neural networks. http://arxiv.org/abs/1712.02029
5. Duchi JC, Bartlett PL and Wainwright MJ. 2012. Randomized smoothing for (parallel) stochastic optimization. IEEE Int Conf on Decision and Control 12: 5442-5444.
6. Goyal P, Dollár P, Girshick R, Noordhuis P, Wesolowski L and Kyrola A. 2017. Accurate large minibatch http://arxiv.org/abs/1706.02677
7. Hinton G, Srivastava N and Swersky K. 2012. Neural networks for machine learning. Coursera, Video Lectures 1: 264.
8. Hoffer E, Hubara I and Soudry D. 2017. Closing the generalization gap in large batch training of neural networks. J. Adv Neu Info Proc Sys. 1732-1742.
9. Huo B and Yin F. 2015. Research on novel image classification algorithm based on multi-feature extraction and modified SVM. Classifier 9: 103-112.
10. Hutter F. 2014. Meta-learning. Studies in Computational Intelligence 498: 233-317.
11. Ioffe S and Szegedy C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Int. Machine Learning. 448-456.
12. Keskar NS, Nocedal J, Tang PTP, Mudigere D and Smelyanskiy M. 2019. On large-batch training for deep learning: Generalization gap and sharp minima. Int. Conference on Learning Representations. 1-16.
13. Kingma DP and Ba JL. 2015. Adam: A method for stochastic optimization. Int. Conference on Learning Representations. 5-15.
14. Li J, Singh R and Singh R. 2017. A novel large-scale multimedia image data classification algorithm based on mapping assisted deep neural network. Multi Tools and App. 76: 678-710.
15. Mane H, Gopala V and Matcha R. 2018. Image classification using deep learning. Int. J. Eng. Technol. 7: 614-617.
16. Radiuk PM. 2018. Impact of training set batch size on the performance of convolutional neural networks for diverse datasets. Info. Techno. Sci. 20: 20-24.
17. Rangarajan AK, Purushothaman R and Ramesh A. 2018. Tomato crop disease classification using pre-trained deep learning algorithm. Proc Com. Sci. 33: 1040-1047.
18. Reddy SV, Reddy KT and ValliKumari V. 2018. Optimization of deep learning using various optimizers, loss functions and dropout. Int. J. Recent Technol. Eng. 7: 448-455.
19. Ruder S. 2016. An overview of gradient descent optimization. arXiv:1609.04747. pp.1-14.
20. Sadowski P. 2017. Notes on backpropagation. Dept. Com. Sci. University of California Irvine 1: 1-4.
21. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D and Batra D. 2016. Visual explanations from deep networks via gradient-based localization. Proceedings of the IEEE International Conference on Computer Vision 17: 331-336.
22. Simonyan K and Zisserman A. 2015. Very deep convolutional networks for large-scale image recognition. Int. Conference on Learning Representations 1-14.
23. Srivastava Y, Murali V and Dubey SR. 2018. A performance comparison of loss functions for deep face recognition. Int. Conference on Computer Vision, Pattern Recog. Image Proc. Graph. 322-332.
24. Takáč M, Bijral A, Richtárik P and Srebro N. 2013. Mini-batch primal and dual methods for SVMs. Int. Conference on Machine Learning 28: 2059-2067.
25. You Y, Gitman I and Ginsburg B. 2017. Large batch training of convolutional networks, 1-8. http://arxiv.org/abs/1708.03888