Chris_zhangrx's blog

**Abstract**

On the ILSVRC-2010 dataset, we trained a large neural network on 1.2 million high-resolution images spanning 1,000 categories. The model achieved top-1 and top-5 error rates of 37.5% and 17.0% on the test set. The architecture has 60 million parameters and 650,000 neurons, organized into five convolutional layers and three fully connected layers. To reduce overfitting, we used the Dropout technique. A variant of this model was also entered in the ILSVRC-2012 competition, where it won with a top-5 error rate of 15.3%, far ahead of the second-place result of 26.2%.

**1. Prologue**

In the early days of neural networks, Yann LeCun and his colleagues had their work rejected from top conferences. At the time, many researchers believed that hand-engineered features were necessary for effective image classification. In the 1980s, neuroscientists and physicists proposed that hierarchies of learned feature detectors could be more robust, but it was unclear what features such structures would actually learn. Some researchers then showed that multi-layer feature detectors could be trained effectively with backpropagation (BP). Even with BP, however, the performance of deep networks at the time fell short of expectations, which was discouraging. Only later did it become clear that the problem was not the algorithm itself, but the lack of sufficient data and computational power.

**2. Introduction**

The main contributions of this paper include:

1. Training a convolutional neural network on ImageNet and achieving state-of-the-art accuracy.
2. Developing a GPU-optimized implementation of 2D convolution and making it publicly available.
3. Introducing ReLU activations, multi-GPU training, and local response normalization to improve performance and reduce training time.
4. Applying Dropout and data augmentation to prevent overfitting.
5.
Designing a network with five convolutional layers and three fully connected layers, since deeper networks showed better performance.

Because of computational constraints, we trained on two 3GB GTX 580 GPUs for roughly five to six days.

**3. The Dataset**

ImageNet is a large-scale dataset of over 15 million images in about 22,000 classes, mostly collected from the internet and labeled by hand. The ILSVRC competition uses a subset of ImageNet with roughly 1,000 categories and about 1,000 images per category: 1.2 million training images, 50,000 validation images, and 150,000 test images. ILSVRC-2010 is the only version with a labeled test set, so we used it for most of our experiments; we also entered the ILSVRC-2012 competition.

ImageNet evaluates models by top-1 and top-5 error rates; a top-5 error means the correct label is not among the model's five most probable predictions. Since ImageNet images vary in resolution, we resized every image to 256x256 pixels: for rectangular images, we first rescale the shorter side to 256 and then crop the central 256x256 region. Aside from subtracting the mean pixel value from each image, we applied no other preprocessing.

**4. The Architecture**

The network comprises eight layers: five convolutional and three fully connected. The key innovations in our design are described below.

**4.1. Rectified Linear Unit (ReLU) Nonlinearity**

Using ReLU instead of the tanh function dramatically accelerated training on a simple four-layer convolutional network. This faster learning makes experimenting with large networks practical.

**4.2. Training on Multiple GPUs**

A single GTX 580 GPU has only 3GB of memory, which limits the size of the network we can train. To overcome this, we used two GPUs in parallel, splitting the training parameters between them.
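The ReLU nonlinearity from section 4.1 is simply f(x) = max(0, x). A minimal numpy sketch (not the paper's GPU implementation) illustrates why it avoids the saturation that slows tanh-based training:

```python
import numpy as np

def relu(x):
    """Rectified linear unit: f(x) = max(0, x)."""
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))  # negative inputs become 0, positive inputs pass through

# For any positive input, ReLU's gradient is exactly 1, so large
# activations keep learning. tanh, by contrast, saturates:
print(np.tanh(3.0))  # ≈ 0.995, deep in the flat region where gradients vanish
```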
Communication occurred only at certain layers: for example, the third convolutional layer receives input from all kernel maps of the second layer, while the fourth layer receives input only from the maps residing on its own GPU. This scheme reduced the top-1 and top-5 error rates by 1.7% and 1.2%, respectively, and slightly reduced training time.

**4.3. Local Response Normalization**

Although ReLU does not require input normalization to avoid saturation, we found that local response normalization still improves generalization. With parameters k=2, n=5, α=10⁻⁴, and β=0.75, this technique reduced the top-1 and top-5 error rates by 1.4% and 1.2%, respectively.

**4.4. Overlapping Pooling**

Using overlapping pooling regions reduced the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, and made the network slightly harder to overfit.

**4.5. Overall Architecture**

Because of the dual-GPU training, the second, fourth, and fifth convolutional layers connect only to the kernel maps on the same GPU, while the third convolutional layer is fully connected to the second. The LRN layer follows the first and second convolutional layers, and max pooling is applied after layers 1, 2, and 5. ReLU is used in every convolutional and fully connected layer.

**5. Reducing Overfitting**

**5.1. Data Augmentation**

We augmented the data by randomly cropping 224x224 patches from the 256x256 images and reflecting them horizontally. At test time, we extract ten crops per image (the four corners and the center, plus their horizontal reflections) and average their softmax outputs. A second method perturbs the RGB channel intensities, which reduced the top-1 error rate by over 1%.

**5.2. Dropout**

Dropout was applied to the first two fully connected layers with a probability of 0.5. This gives the network a different architecture for each training sample, and roughly doubles the number of iterations needed to converge.

**6. Learning Details**

We used stochastic gradient descent (SGD) with a batch size of 128, momentum of 0.9, and weight decay of 0.0005. The small amount of weight decay proved important for reducing training error.
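The update rule implied by these hyperparameters (momentum 0.9, weight decay 0.0005) can be sketched in numpy; the weight and gradient values below are made-up illustrations, not numbers from the paper:

```python
import numpy as np

# Hyperparameters from section 6.
momentum, weight_decay, lr = 0.9, 0.0005, 0.01

def sgd_step(w, grad, v):
    """One SGD update with momentum and weight decay:
    v <- 0.9*v - 0.0005*lr*w - lr*grad;  w <- w + v."""
    v = momentum * v - weight_decay * lr * w - lr * grad
    return w + v, v

# Illustrative values only: a 2-parameter "model" and a fake gradient.
w = np.array([0.5, -0.3])
v = np.zeros_like(w)          # momentum buffer starts at zero
w, v = sgd_step(w, np.array([0.2, -0.1]), v)
print(w)  # weights nudged against the gradient, shrunk slightly by decay
```

Note that the decay term acts on the weights themselves, not the gradient, which is why even a tiny coefficient of 0.0005 steadily pulls parameters toward zero over many updates.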
We initialized the weights from a zero-mean Gaussian with standard deviation 0.01, and set the biases of certain layers to 1 and the rest to 0. The learning rate started at 0.01 and was divided by 10 whenever the validation error stopped improving, over a total of roughly 90 epochs.

**7. Results**

The results show that the kernels on the two GPUs specialized differently: one focused on color information, the other on edge-like features. These complementary features contributed to the overall performance.

**8. Discussion**

Notably, removing any of the middle convolutional layers degraded top-1 performance by about 2%. This highlights the importance of network depth in achieving good performance.
