Introduction
Hyperparameters significantly influence the performance of machine learning models. Choosing appropriate hyperparameters plays an essential role in the success of a neural network architecture, since they directly control the behavior of the training algorithm. This study applied several hyperparameters to observe loss performance under different optimizers from the family of gradient descent optimization algorithms. It also examined the impact on training accuracy of different batch sizes (BS) and numbers of epochs (EP) in a convolutional neural network (CNN) applied to an image classification problem. In this study, the model was implemented using a deep convolutional neural network algorithm (Hodan, 1954). Computer vision comprises various application domains such as localization, segmentation, image classification, and object detection. Image classification has recently become a central component of computer vision and remains a demanding problem due to the increasing volume of content such as images and videos. Image classification and object detection are used extensively in fields such as facial recognition, pedestrian detection for security, intelligent video analysis, pedestrian tracking, and plant disease diagnosis (Huo & Yin, 2015; Li et al., 2017; Rangarajan et al., 2018; Mane et al., 2018). Image classification can therefore be considered an essential solution for categorizing and labeling groups of pixels or vectors within an image. One of the most common ways to optimize a neural network is to choose a suitable optimizer as a hyperparameter. Black-box function optimization and Bayesian optimization over hyperparameters have been introduced previously (Hutter, 2014). Some researchers have obtained the benefits of decaying the learning rate by instead increasing the batch size during training.
However, improving the performance of convolutional neural network models remains a relevant and open problem in deep learning. Adjusting the training parameters is one way to carry out model development (Radiuk, 2018).
Therefore, gradient descent methods have been introduced to perform optimization for deep learning (Ruder, 2016; Selvaraju et al., 2016). Batch gradient descent operates on the entire training dataset and computes the gradient of the cost function with respect to the parameters. Stochastic gradient descent performs a parameter update for every training example. Mini-batch gradient descent takes the best of both the batch and stochastic strategies by performing a parameter update for every mini-batch of training samples (Reddy et al., 2018). The process of training the model can be affected by the batch size, because the batch size also influences the network (Ioffe & Szegedy, 2015). Moreover, the batch size used when fitting the model impacts training in terms of the time to converge and the amount of overfitting (Radiuk, 2018). Research into optimization methods for machine learning has suggested that the batch size be set to no more than 64 (Takáč et al., 2013). Some studies showed that with large datasets, large batch sizes caused optimization difficulties, although the trained networks demonstrated good generalization (Goyal et al., 2017). Daneshmand et al. (2016) found that the loss of a recurrent neural network model with mean squared error under the Adagrad, RMSProp, and Adam optimizers was 0.15, 0.25, and 0.14, respectively.
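The three gradient descent strategies can be contrasted in a few lines of code. The following NumPy sketch (an illustration, not the training code used in this study) implements mini-batch gradient descent for least-squares regression; setting the batch size equal to the dataset size recovers full-batch gradient descent, and setting it to 1 recovers pure stochastic gradient descent.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.01, batch_size=32, epochs=50, seed=0):
    """Mini-batch gradient descent for least-squares linear regression.

    batch_size=len(X) recovers full-batch gradient descent;
    batch_size=1 recovers pure stochastic gradient descent.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)           # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            # Gradient of the mean squared error on this mini-batch
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(batch)
            w -= lr * grad                 # one parameter update per mini-batch
    return w

# Toy data: y = 3*x0 - 2*x1 exactly, so w should approach [3, -2]
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0])
w = minibatch_sgd(X, y)
```

The trade-off discussed above is visible here: a smaller `batch_size` yields more (noisier) updates per epoch, while a larger one yields fewer, smoother updates.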
In this paper, we explored image classification using deep learning with a convolutional neural network under various gradient descent optimizers. The main contributions are as follows. First, an experiment was conducted on a dataset consisting of two classes, forming a binary classification problem. Second, we applied a convolutional neural network with the VGG16 architecture proposed by Simonyan & Zisserman (2015) to classify the samples of the dataset. Finally, different hyperparameters, namely the optimizer, BS, and EP, were applied to the model to determine the loss and accuracy performance of each configuration. Furthermore, this study creates scope for further experimentation on several datasets under different hyperparameter conditions to find a suitable range of optimizers for a neural network. It also enhances knowledge and understanding of using different batch sizes and epochs to improve accuracy for image classification.
Materials and Methods
1. Experiment
The experiment was conducted to determine the effect of influential hyperparameters on loss and accuracy performance, using manually configured hyperparameters in a CNN model for an image classification problem. It should be noted that deep learning involves further hyperparameters, such as the number of hidden layers, the learning rate, and the loss function, that affect the algorithms; however, this study focused on three: the optimizer, the batch size, and the number of epochs. The architecture was a VGG16 pretrained CNN model run with various hyperparameters: the SGD, Adam, Adagrad, and Adamax optimizers, BS values of 16, 32, 64, and 120, and EP values of 50, 100, and 150, as listed in Table 1. The experiment used a dataset of cats and dogs containing 24000 images at 224×224 pixels, split into 19211 images for training and 2000 images for testing. The optimizers' algorithms, given in equations (1)–(4), were used to investigate the performance of the models.
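The hyperparameter grid of Table 1 (four optimizers × four batch sizes × three epoch counts) amounts to 48 training configurations, which can be enumerated directly; the short sketch below only illustrates the bookkeeping of the grid, not the training itself.

```python
from itertools import product

# The hyperparameter grid of this study (Table 1)
optimizers = ["SGD", "Adam", "Adagrad", "Adamax"]
batch_sizes = [16, 32, 64, 120]
epochs = [50, 100, 150]

# Each configuration is one (optimizer, BS, EP) triple to train and evaluate
configurations = list(product(optimizers, batch_sizes, epochs))
print(len(configurations))  # 4 * 4 * 3 = 48 runs
```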
Vanilla stochastic gradient descent (SGD) uses the least memory for a given batch size (Duchi et al., 2012). It simply computes the gradient of the parameters by uniformly and randomly choosing a single training example, or a few of them.
θ = θ − α∇J(θ; x, y) (1)

where:
θ is a parameter, referring to the weights, biases, and activations
α is the learning rate, α = 0.01
∇ is the gradient, taken of J
J is formally the objective function (cost function or loss function)
J(θ; x, y) is the objective evaluated at parameters θ for a training example (x, y)
Adaptive Moment Estimation, or "Adam", combines SGD with momentum and RMSProp (Hinton, 2012) and is widely adopted in practice for training neural networks. It has been used extensively across machine learning frameworks (Kingma & Ba, 2015).
v_{t} = β_{1}v_{t−1} + (1 − β_{1})∇J(w_{t})
s_{t} = β_{2}s_{t−1} + (1 − β_{2})(∇J(w_{t}))^{2}
w_{t+1} = w_{t} − α (v_{t}/(1 − β_{1}^{t})) / (√(s_{t}/(1 − β_{2}^{t})) + ϵ) (2)

where:
α is the learning rate or step size
ϵ is a small term preventing division by zero, ϵ = 10^{−8}
v_{t} is the updated biased first moment estimate
s_{t} is the updated biased second moment estimate
w_{t} is the updated parameter
Adagrad works well with sparse data and with large-scale neural networks (Duchi et al., 2012). It applies a different learning rate to every parameter at each time step, based on the past gradient computations. This is done by tuning the learning rate differently for different sets of parameters.
s_{t} = s_{t−1} + (∇J(θ))^{2}
θ = θ − (α/(√s_{t} + ϵ))∇J(θ) (3)

where:
θ is a parameter, referring to the weights, biases, and activations
α is the learning rate, α = 0.002
s_{t} is the updated biased second moment estimate
ϵ is a small term preventing division by zero, ϵ = 10^{−7}
∇ is the gradient, taken of J
J is formally the objective function (cost function or loss function)
Adamax (Kingma & Ba, 2015) is a variant of Adam that replaces the RMS term with the infinity norm of the past gradients.
v_{t} = β_{1}v_{t−1} + (1 − β_{1})∇J(w_{t})
s_{t} = max(β_{2}s_{t−1}, |∇J(w_{t})|)
w_{t+1} = w_{t} − (α/(1 − β_{1}^{t})) v_{t}/s_{t} (4)

where:
w_{t} is the updated parameter
α is the learning rate or step size, α = 0.02
s_{t} is the updated biased second moment estimate, based on the infinity norm
v_{t} is the updated biased first moment estimate
2. Batch Size and Epoch
Batch size is a hyperparameter giving the number of data points a model operates on in each sequential iteration; it is the number of samples processed before the model is updated. Batch size is intimately connected with well-known ideas in parameter optimization (You et al., 2017; Devarakonda et al., 2017) and in training with large batches (Hoffer et al., 2017). Keskar et al. (2019) also suggested that a large BS can hurt the model's ability to generalize. Choosing a suitable batch size is essential for converging the cost function and for the generalization of the model's parameter values. Radiuk (2018) used two sets of batch sizes, (16, 32, 64, 128, 256, 512, 1024) and (50, 100, 150, 200, 250), applied to the LeNet CNN architecture on the MNIST and CIFAR-10 datasets to observe the impact of batch size on the models. In choosing a batch size, a balance must be struck between the available computational hardware and a significantly lower computational speed. Therefore, this study chose mini-batches of 16, 32, 64, and 120 samples, demonstrated in the CNN with the hyperparameters shown in Table 1.
In artificial neural networks, the epoch count is a hyperparameter defined before training a model: the number of times the learning algorithm will work through the entire training dataset. One epoch means the entire dataset has passed forward and backward through the neural network exactly once. An epoch can be divided into smaller batches, and as the number of epochs increases, the weights of the neural network are updated more times. Given the complexity and variability of data in real problems, it may take hundreds to thousands of epochs to reach a sensible accuracy on test data. In this study, we used 50, 100, and 150 epochs, as shown in Table 1.
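Together, batch size and epochs determine how many weight updates the network receives: one update per mini-batch, repeated for every epoch. A small sketch using the training-set size of this study (19211 images) illustrates how strongly the batch size changes the update count at a fixed number of epochs; the figures are arithmetic only, not measured results.

```python
import math

def num_weight_updates(n_samples, batch_size, epochs):
    """One parameter update per mini-batch, repeated each epoch."""
    return epochs * math.ceil(n_samples / batch_size)

# 19211 training images and 150 epochs, values from this study:
small_batch = num_weight_updates(19211, 16, 150)   # many small, noisy updates
large_batch = num_weight_updates(19211, 120, 150)  # fewer, smoother updates
```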
In this work, the model was built in Python 3.7. All code was run on a desktop computer with Windows 10 Pro (64-bit operating system), an AMD Ryzen 3 2200G processor with Radeon Vega graphics at 3.5 GHz, and 4.0 GB of RAM.
3. Loss function evaluation criteria
A loss function is minimized, so smaller values represent a better model than larger values. In this study, the binary cross-entropy loss function (BCLF) was used to evaluate the performance of the model under each optimizer (Sadowski, 2017; Srivastava et al., 2018). The BCLF measures the deviation between the true and predicted probability distributions: the partial derivatives are computed over the predicted and true distributions, and the error is the difference between the two distributions. The BCLF was applied with the SGD, Adam, Adagrad, and Adamax optimizers to observe the loss value for each. It was set up for binary classification between two classes (cats and dogs). The formulation is expressed in equation (5):

CE = −∑_{i=1}^{C′=2} t_{i} log(f(s_{i})) = −t_{1} log(f(s_{1})) − (1 − t_{1}) log(1 − f(s_{1})) (5)

where f denotes the sigmoid activation applied to the class score.
The losses are summed over the different binary problems for backpropagation. We treat C as independent binary classifications (C′ = 2); s_{1} and t_{1} are the score and the ground-truth label for the class C_{1}, which is also the class C_{i} in C; s_{2} = 1 − s_{1} and t_{2} = 1 − t_{1} are the score and the ground-truth label of class C_{2}.
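The loss of equation (5) can be checked numerically. The sketch below is an illustration, assuming the score s₁ has already been passed through the sigmoid so that it is a probability in (0, 1); it computes the binary cross-entropy for a single example.

```python
import numpy as np

def binary_cross_entropy(t1, s1):
    """Binary cross-entropy as in equation (5): t1 is the ground-truth label
    (1 or 0), s1 the sigmoid output (probability) for class C1."""
    s1 = np.clip(s1, 1e-12, 1 - 1e-12)  # guard the logarithm against 0 and 1
    return -(t1 * np.log(s1) + (1 - t1) * np.log(1 - s1))

# A confident correct prediction gives a small loss;
# a confident wrong prediction gives a large loss.
low = binary_cross_entropy(1, 0.99)
high = binary_cross_entropy(1, 0.01)
```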
Results and Discussion
1. Loss evaluation
To determine the performance of the hyperparameters in the CNN, using the VGG16 pretrained model architecture with a sigmoid activation function, a loss function was used to configure the model. The loss values for each optimizer under different batch sizes and epochs, for training and testing, are shown in Table 2. The SGD loss dynamics were stable during training under different BS and EP, as illustrated in Fig. 1 (a), (b), and (c); however, an increased number of EP improved the performance of the models. The best performance was obtained with 32 BS and 150 EP (training loss = 0.080), and the highest loss value was given by 120 BS and 50 EP.
The dynamics of Adam while training the algorithm are illustrated in Fig. 2 (a), (b), and (c). Srivastava et al. (2018) achieved similar loss values (loss = 1.0) using Adam and Adagrad while training their models. In this paper, the model with 64 BS and 150 EP performed best, with the lowest training loss of 0.0396; the maximum loss value of 0.1294 occurred with 16 BS and 50 EP for the Adam optimizer.
As illustrated in Fig. 3 (a), (b), and (c), the initial loss values were high for BS of 16, 32, 64, and 120, respectively. However, the final loss values from training the models with different BS and EP showed no significant difference under the Adagrad optimizer. Adagrad achieved a minimum loss of 0.0105 with 120 BS and 150 EP; in contrast, the maximum loss (0.1913) occurred at 32 BS and 100 EP.
As shown in Fig. 4 (a), (b), and (c), the loss dynamics of Adamax were stable and the optimizer performed well during training. Adamax worked very well with the higher numbers of BS and EP for both accuracy and loss, using 120 BS and 150 EP (Acc = 99.31%, loss = 0.0197).
2. Accuracy evaluation
For accuracy, the results for each optimizer under different batch sizes and epochs, for training and testing, are given in Table 3. The best performance was given by 32 BS and 150 EP with the SGD optimizer. Earlier work on CNNs using Adam and various other optimizers achieved nearly 68% accuracy during training (Bello et al., 2017). Reddy et al. (2018) observed that, in a CNN architecture, the Adam optimizer reached a maximum accuracy similar to Adagrad and RMSProp. In this paper, Adam performed well in training accuracy at 64 and 120 BS with 150 EP, reaching 98.39% and 98.28%, respectively. Adagrad and Adamax showed the highest accuracies of 99.65% and 99.31%, respectively, with 120 BS and 150 EP. The time consumed using the different optimizers, batch sizes, and epochs is also shown in Table 3.
3. Performance of increasing batch size and epoch
The batch size is intimately related to the tuning parameters. Several authors have studied adaptively increasing the batch size on fixed schedules in the context of accelerating optimization methods through variance reduction (Babanezhad et al., 2015; Daneshmand et al., 2016). Based on the statistical evaluation metrics (accuracy and loss), the highest accuracies of the various optimizers among these batch sizes and epochs are shown in Table 3. The highest accuracy for each number of BS and EP for SGD, Adam, Adagrad, and Adamax is summarized in Fig. 5 (a), (b), (c), and (d), respectively. The Adagrad optimizer performed best in final loss values for the binary classification problem, and Adamax performed similarly to Adagrad in terms of loss. However, a high training loss was observed with the SGD optimizer. One explanation stems from previous experiments showing that increasing the number of epochs can improve the accuracy of the Adam, Adagrad, and Adamax optimizers. Radiuk (2018), by contrast, found the lowest test accuracies at batch sizes of 16, 32, 50, and 64, while the best results were obtained at batch sizes of 512 and 1024, the latter two on the MNIST and CIFAR-10 datasets. Interestingly, the improvement of the results with an increasing number of epochs showed that a higher number of epochs provides better accuracy. In addition, the most stable optimizer in this study was Adam, as shown in Fig. 5 (a); its accuracy improved as the number of epochs increased through 50, 100, and 150. However, the best-performing model for accuracy was Adamax, obtained with a batch size of 120 and 150 epochs.
The empirical results of this study demonstrated satisfactory performance among the selected hyperparameters: the best model achieved a classification accuracy of 99.87% with the Adamax optimizer at a batch size of 120 and 150 epochs. The Adagrad optimizer, on the other hand, performed best in the loss evaluation.