CNN — Architecture

Aksshar Ramesh
Aug 20, 2023

CNN stands for Convolutional Neural Network. It’s a type of artificial intelligence model designed to understand and process images or other grid-like data. Imagine it as a tool that helps a computer recognize patterns and features in pictures.

Here’s a simple description of how a CNN works:

  1. Filters and Feature Maps: Think of the CNN as a set of filters, like different-colored sunglasses. These filters are used to look at small portions of the image at a time. When you put on these filters and look at the picture, they highlight certain features, like edges or corners.
  2. Convolution: The CNN takes these filters and slides them across the entire image, just like sliding a window. As the filter slides, it checks each small area it covers to see how well the filter's pattern matches that area.
  3. Feature Extraction: Each time the filter checks an area, it calculates a number that represents how well that filter matches the features in that area. These numbers create a new picture called a “feature map” that highlights where certain features are found.
  4. Pooling: After creating these feature maps, the CNN might shrink them by “pooling.” This is like zooming out a bit and keeping only the most important information. Pooling helps reduce the amount of data the CNN needs to process.
  5. Flattening: Eventually, the CNN takes all these feature maps and flattens them into a single long list. This list is like a summary of all the important information the CNN found in the image.
  6. Fully Connected Layers: Finally, this list is given to a traditional neural network. This part helps the CNN understand what the overall image represents, like identifying whether it’s a cat or a dog based on the patterns it learned from the filters.

CNN is like a smart filter that scans pictures to find specific shapes and patterns. It breaks down the image step by step, learns what’s important, and then uses that knowledge to figure out what the whole picture is about. This makes CNNs really useful for tasks like image recognition, where you want a computer to understand what’s in a picture.

The CNN is a combination of two basic building blocks:

  1. The Convolution Block — Consists of the Convolution Layer and the Pooling Layer. This block forms the essential component of feature extraction.
  2. The Fully Connected Block — Consists of a fully connected simple neural network architecture. This block performs the task of classification based on the input from the convolution block.

We shall define each of these layers now.

Convolutional architecture

The CONVOLUTIONAL LAYER is primarily responsible for extracting features from images. To better understand this, let’s break down the concepts of ‘filters’ and ‘convolution,’ and then delve into how they are utilized within this layer.

Filters, also known as ‘kernels,’ can be thought of as special images that represent specific features. Consider a scenario where we have an image of a curve. Let’s take this curve as an example feature that we aim to identify — whether it exists in an image or not.

In this context, the convolutional layer plays a crucial role. It applies these filters onto the input image through a process called ‘convolution.’ This process involves sliding the filter across the entire image, checking how well the filter’s pattern aligns with different parts of the image. The result of this sliding operation creates a new image, known as a ‘feature map,’ where each pixel value represents how closely the filter’s pattern matches a particular area of the input image.

Essentially, the convolutional layer acts like a spotlight that moves across the image, highlighting regions where the filter’s specific feature is found. This process enables the network to identify and emphasize different patterns, textures, or shapes in the image — an essential step in recognizing complex features in images.

Convolution is a specialized operation that combines one matrix, typically the image matrix, with another matrix, referred to as the filter matrix. This operation serves to highlight specific patterns or features within an image.

During the convolution process, each cell in the image matrix is multiplied by the corresponding cell in the filter matrix. These products are then summed together to create a single output value. Let’s consider an example to illustrate this concept: if we have a portion of the image matrix represented as A = [2 5 17], and a portion of the filter matrix as B = [1 0 1], the convolution of A with B would result in 2 * 1 + 5 * 0 + 17 * 1 = 2 + 0 + 17 = 19. Please note that this example is simplified; in practice, both matrices are two-dimensional, but the core idea of the operation remains consistent.
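As a quick sanity check, this arithmetic can be reproduced in a few lines of NumPy (an illustrative snippet, not from the original article):

```python
import numpy as np

A = np.array([2, 5, 17])   # portion of the image matrix
B = np.array([1, 0, 1])    # portion of the filter matrix

# Element-wise products, then a sum: 2*1 + 5*0 + 17*1
print(np.sum(A * B))       # 19
```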

This explanation demonstrates how the filter matrix is systematically positioned over the entire image matrix, performing the multiplication and summation at each step. This process generates an output matrix that highlights the presence of certain features, textures, or patterns within the original image — a crucial step in feature extraction for tasks like image recognition.
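To make the sliding concrete, here is a minimal sketch of a 2D "valid" convolution in plain NumPy. The function name, the toy image, and the vertical-edge filter are illustrative choices, not taken from the article:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image`, recording the sum of
    element-wise products at each position ('valid' mode)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            region = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(region * kernel)
    return out

image = np.array([[0, 0, 9, 9],
                  [0, 0, 9, 9],
                  [0, 0, 9, 9],
                  [0, 0, 9, 9]], dtype=float)

# A simple vertical-edge filter: it responds strongly where
# dark pixels sit next to bright ones.
kernel = np.array([[-1, 1],
                   [-1, 1]], dtype=float)

print(convolve2d(image, kernel))
```

On this toy image, the output is large (18) exactly along the vertical boundary between the dark and bright columns, and 0 elsewhere, which is precisely the behavior described above.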

So, the process of identifying a specific feature is accomplished by convolving the ‘filter matrix’ across the image matrix, resulting in the creation of a distinct matrix containing assigned values. Now, what do these values signify? Let’s revisit our previously discussed filter image and envision an uncomplicated ice-cream image.

An example with result 0, thus confirming absence of the feature.

Notice that the rear of the ice-cream has a similar curve-like shape, so we convolve the filter within that region of the image matrix. We then take another region of the image that does not possess the curve-like feature and perform the same operation on it with the same filter matrix. Now, take a look at the values of the corresponding operations.

Image 1 illustrates a situation where a prominent value is present, whereas Image 2 yields an outcome of 0. The key takeaway is that when a specific feature is found in a particular image area, it generates a considerably high convolution value. Conversely, in other regions, the value is small, indicating the absence of that feature. With this understanding in place, we can now proceed to utilize these concepts for feature extraction.

Our starting point is a colored image, which can be represented as three 2-dimensional matrices, one each for the red, green, and blue channels. Simultaneously, we possess a collection of filter matrices, each representing a distinct characteristic. To execute the process, we apply each filter matrix through convolution to the corresponding segment of the image matrix for the red, green, and blue channels. Subsequently, we sum the resulting values from the three channels to generate the value of a cell in the output matrix.

In doing so, we are essentially trying to find out whether a particular feature is present in the image we are trying to recognize. Now, there are a few other operations required to be performed on the image during convolution.
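A hedged sketch of this channel-wise summation, reusing the convolve2d helper defined in the earlier snippet (the shapes here are illustrative):

```python
import numpy as np

def convolve_rgb(image, filt):
    """Convolve each of the R, G, B channels with the matching
    2D slice of the filter, then sum the three results into a
    single feature map."""
    return sum(convolve2d(image[:, :, c], filt[:, :, c]) for c in range(3))

rgb = np.random.rand(6, 6, 3)    # toy 6 x 6 RGB image
filt = np.random.rand(2, 2, 3)   # one filter spanning all 3 channels
print(convolve_rgb(rgb, filt).shape)  # (5, 5): one 2D feature map per filter
```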

Padding: Padding in convolution compensates for the uneven treatment of corner and border values during the convolution process. When we mentally simulate convolution, we recognize that cells in the interior of the image matrix are covered by the filter more times than those in the corners or at the image borders. This results in a disparity where corner and border values contribute less to the overall operation.

To address this issue, an approach called padding is employed. In this method, additional rows and columns containing only zeroes are introduced around all sides of the image matrix. Although these zero values don’t convey any extra information, their purpose is to ensure that the previously underrepresented corner and border values are granted greater significance. Padding also prevents the output matrix from shrinking after every convolution. In essence, padding gives more equitable weightage to all cells in the image matrix during the convolution process.

A padding of 0’s around the actual matrix.
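With NumPy, this zero-padding is a one-liner (a small sketch, assuming p = 1):

```python
import numpy as np

m = np.array([[1, 2],
              [3, 4]])

# Surround the matrix with one row/column of zeroes on every side (p = 1).
padded = np.pad(m, pad_width=1, mode='constant', constant_values=0)
print(padded)
# [[0 0 0 0]
#  [0 1 2 0]
#  [0 3 4 0]
#  [0 0 0 0]]
```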

Striding: In the context of “strided” convolution, the filter’s movement isn’t restricted to shifting by just one row or column at a time. Instead, it is shifted by a larger step, such as 2 or 3 rows or columns with each movement. This approach is usually adopted to minimize the number of computations required and to also decrease the dimensions of the resulting output matrix. While performing this process on larger images, data loss isn’t a significant concern. Instead, the primary advantage is a substantial reduction in computational overhead, making it especially beneficial for resource-intensive tasks involving sizable images.

This implementation with stride 2 shows the particular cells that will be convolved
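The standard output-size formula, (n + 2p - f)/s + 1 rounded down, makes the computational saving easy to check; a small sketch with illustrative numbers:

```python
def conv_output_size(n, f, p=0, s=1):
    """Spatial size of a convolution output for an n x n input,
    an f x f filter, padding p, and stride s."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 2))       # 5 : plain convolution
print(conv_output_size(6, 2, s=2))  # 3 : stride 2 shrinks the output
print(conv_output_size(6, 2, p=1))  # 7 : padding enlarges the output
```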

RELU Activation: RELU, or Rectified Linear Unit, is applied to every cell of each output matrix. The function is defined as f(x) = max(0, x).

The basic intuition here is that, after convolution, if a particular position yields zero or a negative value, the feature is not present there and we denote it by ‘0’; in all other cases we keep the value. Together, the operations and functions applied to the input image form the first part of the Convolutional Block.
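In code, RELU is a one-line element-wise operation (a sketch):

```python
import numpy as np

def relu(x):
    """Keep positive responses, zero out everything else."""
    return np.maximum(0, x)

print(relu(np.array([-3.0, 0.0, 19.0])))  # [ 0.  0. 19.]
```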

Pooling:

The Pooling layer involves a process where a specific value is selected from a group of values, often the maximum or average value among them. This action serves to decrease the dimensions of the output matrix. For instance, in MAX-POOLING, the highest value within a defined section of the matrix, like a 2 x 2 region, is chosen. Consequently, this choice captures the values that signify the presence of a feature in that particular portion of the image. Through this method, extraneous information regarding feature presence is discarded, leaving only the essential details to be considered.

In the design of Convolutional Neural Network (CNN) architectures, it’s a common practice to periodically introduce a Pooling layer between consecutive convolutional blocks. The purpose is to gradually reduce the spatial dimensions of the representation. This reduction aids in diminishing the count of parameters and computational workload within the network.

An example of both max pooling and average pooling
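A minimal NumPy sketch of both variants over non-overlapping 2 x 2 regions (it assumes the feature map’s height and width are even):

```python
import numpy as np

def pool2x2(fmap, mode="max"):
    """Reduce each non-overlapping 2 x 2 block of `fmap`
    to a single value: its maximum or its average."""
    h, w = fmap.shape
    blocks = fmap.reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 1],
                 [0, 2, 5, 7],
                 [1, 1, 8, 6]], dtype=float)

print(pool2x2(fmap, "max"))   # [[6. 2.]   [2. 8.]]
print(pool2x2(fmap, "avg"))   # [[3.5 1. ] [1.  6.5]]
```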

Together, the CONVOLUTIONAL LAYER and the POOLING LAYER form the CONVOLUTIONAL BLOCK of the CNN architecture. Generally, a simple CNN architecture consists of a minimum of three of these convolutional blocks, which perform feature extraction at various levels.

Fully Connected layer:

The Fully Connected layer constitutes the final component of a CNN architecture and is primarily responsible for the classification task. It functions as a simple, fully connected neural network, comprising two or three hidden layers alongside an output layer. Typically, the output layer employs softmax regression to perform classification across an extensive range of categories.
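The softmax at the output simply turns raw class scores into probabilities that sum to 1; a quick sketch:

```python
import numpy as np

def softmax(z):
    """Convert raw scores into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # e.g. raw scores for cat / dog / bird
print(softmax(scores))              # ~[0.66 0.24 0.10]
```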

Full steps involved:

Here’s the process from the start to the end of a CNN:

  1. The input is an RGB image, typically represented as a 2D matrix for each of the 3 color channels. If each channel has a size of n x n, then the input is an n x n x 3 dimensional matrix.
  2. A set of ‘k’ filters is prepared, each with a spatial size of (f x f) and a depth of 3, so that every filter spans all three color channels.
  3. Padding is applied to the image with ‘p’ rows and columns. This increases the dimensions of the input matrix to (n + 2p) x (n + 2p) x 3.
  4. Convolution is carried out on the input image using the filter matrices, as explained earlier. Strided convolution is performed with a stride of ‘s’, giving each output matrix a size of (n + 2p - f)/s + 1 (rounded down) along each spatial dimension.
  5. Pooling is applied to the output matrices of each layer. The dimensions of these output matrices depend on the size of the pooling filter and the chosen stride length.
  6. Steps 3 to 5 are repeated multiple times, typically around three times, creating a series of convolutional and pooling layers.
  7. Once output matrices of dimensions like a x b x l are obtained, they are flattened into a 1D array. This means that all the values from these matrices are arranged sequentially in an array, which will serve as the input matrix for the Fully Connected Neural Network.
  8. The Fully Connected Neural Network processes this flattened input array to perform the required calculations, eventually producing the desired result.
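Putting these steps together, here is a minimal sketch of such an architecture in Keras. The 32 x 32 input size, the filter counts, and the 10 output classes are illustrative assumptions, not values from this article:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(32, 32, 3)),           # step 1: n x n x 3 RGB input
    # steps 3-6: three convolutional blocks (conv + RELU, then pooling)
    layers.Conv2D(16, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                          # step 7: flatten to a 1D array
    layers.Dense(64, activation="relu"),       # step 8: fully connected layer
    layers.Dense(10, activation="softmax"),    # softmax over 10 classes
])
model.summary()
```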

BACKPROPAGATION:

An overarching question arises regarding the definition of matrices for individual filters. This isn’t predetermined. During the training phase of the CNN, it’s exposed to a substantial number of images. Each time, the resulting error is fed back into the CNN, leading to adjustments in the matrix values within each layer. This fundamental process resembles the training of a Simple Neural Network and is termed BACKPROPAGATION.
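In a framework like Keras, that feedback loop is set up in two calls; a hedged sketch, assuming the model defined above and placeholder training arrays x_train and y_train:

```python
# x_train: images with shape (num_samples, 32, 32, 3)
# y_train: integer class labels; both are placeholders here.
model.compile(
    optimizer="adam",                        # gradient-based weight updates
    loss="sparse_categorical_crossentropy",  # the error that gets backpropagated
    metrics=["accuracy"],
)
model.fit(x_train, y_train, epochs=10, batch_size=32)
```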

The convolution layers are the main powerhouse of a CNN model. Automatically detecting meaningful features given only an image and a label is not an easy task. The convolution layers learn such complex features by building on top of each other: the first layers detect edges, the next layers combine them to detect shapes, and the following layers merge this information to infer, say, that this is a nose. To be clear, the CNN doesn’t know what a nose is; by seeing many of them in images, it learns to detect that pattern as a feature. The fully connected layers then learn how to use the features produced by the convolutions to correctly classify the images.


Aksshar Ramesh

AI Security Research fellow at Centre for Sustainable Cyber Security, University of Greenwich