Written by Swetha Tanamala
Summary of VGGNet Paper
In this blog post, I have written a short summary of the paper, both as notes for myself and as a reference for others.
About Paper
- VGGNet paper title: Very Deep Convolutional Networks for Large-Scale Image Recognition
- Paper submission date: 4th September 2014
Achievements of the paper
In the ImageNet competition (ILSVRC) 2014, the authors secured
- First place in the localisation task
- Second place in the classification task
Key Contributions
- Experimented with different depths of convolutional neural networks (ConvNets)
- Showed that classification error decreases as depth increases
- More non-linear rectification layers (ReLUs) make the decision function more discriminative
- Though depth was increased, the number of parameters in the ConvNets was not greater than in previously proposed shallower networks
- Released the two best-performing models to facilitate further research
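The parameter argument behind small filters can be made concrete. The sketch below is my own illustration (not code from the paper): a stack of three \(3 \times 3\) convolutions has the same effective receptive field as a single \(7 \times 7\) convolution, yet uses fewer weights and interleaves three ReLUs instead of one. The channel width of 256 is an arbitrary choice for illustration.

```python
# My own illustration: compare weight counts for a stack of small
# convolutions versus one large convolution over the same receptive field.

def conv_params(kernel, channels):
    """Weights in a conv layer with `channels` input and output channels."""
    return kernel * kernel * channels * channels

channels = 256  # arbitrary channel width, chosen only for illustration

stack_of_three_3x3 = 3 * conv_params(3, channels)  # 27 * C^2 weights
single_7x7 = conv_params(7, channels)              # 49 * C^2 weights

print(stack_of_three_3x3)  # 1769472
print(single_7x7)          # 3211264
```

So the deeper stack is cheaper in parameters (27C² vs 49C²) while adding extra non-linearities, which is one intuition for why depth did not blow up the parameter count.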
Model details
Architecture
- The input to the model is a fixed-size \(224 \times 224\) RGB image
- The only preprocessing is subtracting the mean RGB value, computed on the training set, from each pixel
- Convolutional layers
- Stride is fixed to 1 pixel
- Padding is 1 pixel for the \(3 \times 3\) filters, so the spatial resolution is preserved
- Spatial pooling layers
- By convention, these layers do not count towards the depth of the network
- Spatial pooling is done using max-pooling layers
- Window size is \(2 \times 2\)
- Stride is fixed to 2
- The ConvNets use 5 max-pooling layers
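The architecture bullets above determine how the feature-map size evolves. This is a minimal sketch of my own (not the authors' code): \(3 \times 3\) convolutions with stride 1 and padding 1 leave the spatial size unchanged, and each \(2 \times 2\) max pool with stride 2 halves it, so five pooling stages take 224 down to 7.

```python
# My own sketch of spatial sizes in a VGG-style stack, using the
# standard output-size formula for convolution and pooling layers.

def conv_out(size, kernel=3, stride=1, padding=1):
    # Output size of a convolution: floor((W + 2P - K) / S) + 1
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    # Output size of a pooling window with no padding
    return (size - kernel) // stride + 1

size = 224  # fixed input resolution
for stage in range(5):      # the ConvNets use 5 max-pooling stages
    size = conv_out(size)   # 3x3, stride 1, pad 1: size unchanged
    size = pool_out(size)   # 2x2, stride 2: size halved

print(size)  # 7 -- the final feature map before the FC layers is 7x7
```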
Training
- The loss function is the multinomial logistic regression objective
- The learning algorithm is mini-batch stochastic gradient descent (SGD) based on back-propagation with momentum
- Batch size was 256
- Momentum was 0.9
- Regularisation
- L2 weight decay (penalty multiplier was 0.0005)
- Dropout for the first two FC layers (rate set to 0.5)
- Learning rate
- Initial: 0.01
- Decreased by a factor of 10 when the validation set accuracy stopped improving
- Despite the greater number of parameters and depth compared to AlexNet, the ConvNets required fewer epochs for the loss function to converge, due to
- implicit regularisation imposed by greater depth and smaller convolutional filters
- pre-initialisation of certain layers
- Training image size
- S is the smallest side of the isotropically-rescaled training image
- Two approaches for setting S
- Fix S, known as single-scale training
- Here S = 256 and S = 384
- Vary S, known as multi-scale training
- S randomly sampled from [Smin, Smax], where Smin = 256 and Smax = 512
- Then a \(224 \times 224\) image was randomly cropped from the rescaled image in each SGD iteration
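The learning-rate rule above can be sketched in a few lines. This is my own minimal sketch, not the authors' training code (the class name and the exact plateau test are my assumptions; the paper decays whenever validation accuracy stops improving): start at 0.01 and divide by 10 at a plateau.

```python
# My own sketch of a decay-on-plateau learning-rate schedule: the rate
# starts at 0.01 and is divided by 10 when validation accuracy stalls.
# The class and its plateau criterion are illustrative assumptions.

class StepOnPlateau:
    def __init__(self, lr=0.01, factor=10.0):
        self.lr = lr
        self.factor = factor
        self.best = float("-inf")  # best validation accuracy seen so far

    def step(self, val_accuracy):
        """Record the accuracy; decay the rate if it did not improve."""
        if val_accuracy > self.best:
            self.best = val_accuracy
        else:
            self.lr /= self.factor
        return self.lr

sched = StepOnPlateau()
for acc in [0.40, 0.55, 0.60, 0.60, 0.61]:  # toy validation accuracies
    lr = sched.step(acc)
print(lr)  # decayed once, at the plateau step: 0.01 -> ~0.001
```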
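The multi-scale pipeline above can be sketched as follows. This is my own illustration (function names and the example image size are assumptions, not the paper's code): sample S uniformly from [256, 512], isotropically rescale so the smaller side equals S, then take a random \(224 \times 224\) crop for the SGD iteration.

```python
# My own sketch of multi-scale training ("scale jittering"): sample S,
# rescale isotropically so the smaller side equals S, then random-crop.
import random

S_MIN, S_MAX = 256, 512
CROP = 224

def sample_crop(width, height, rng=random):
    # Sample the training scale S uniformly from [Smin, Smax].
    s = rng.randint(S_MIN, S_MAX)
    # Isotropic rescale: the smaller image side becomes S.
    scale = s / min(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    # Random 224x224 crop from the rescaled image.
    x = rng.randint(0, new_w - CROP)
    y = rng.randint(0, new_h - CROP)
    return (x, y, x + CROP, y + CROP), (new_w, new_h)

box, rescaled = sample_crop(640, 480)  # a hypothetical 640x480 image
print(min(rescaled) >= S_MIN)  # True: the smaller rescaled side is >= Smin
```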