Abstract
The deployment of deep learning models on resource-constrained devices requires the development of new optimisation techniques to effectively exploit the computational and storage capacities of these devices. Thus, the primary objective of this research is to introduce an innovative and efficient approach for fusing convolution (or fully connected), ReLU, and batch normalisation neural network layers into a unified, single-layer structure, alongside a quantisation method for this new fused layer. This approach has been evaluated on the Arduino Nano 33 BLE Sense ARM Cortex-M4 and the Arduino Portenta H7 Lite ARM Cortex-M4 and M7 processors, known for their widespread adoption in various Internet of Things devices. Depending on the microcontroller unit and compilation flag used, the fused layer reduces the overall execution time by up to 1.53\(\times\), reaching speedups of up to 2.95\(\times\) on individual layers.
1 Introduction
Over the past decade, there has been a remarkable surge in the development of artificial intelligence, with applications in almost every domain of knowledge, such as medicine [1], air quality [2], chemistry [3], among others. This rapid expansion of machine learning (ML)/deep learning (DL) demands the development of innovative techniques to enhance their performance. These efforts focus not only on improving accuracy but also on reducing the computational load, particularly when deploying models on resource-constrained devices such as wearables, microcontrollers and Internet of Things (IoT) devices [4]. However, optimising these systems remains a significant challenge in such contexts.
Layer fusion is currently one of the most widely used techniques for improving the performance of neural networks. Nevertheless, it is also considered one of the most complex techniques due to the interdependence between the deep neural network (DNN) and the processor architecture [5]. Some layer fusion methods are based on computation graphs (CG), where each operation is represented as a node containing one or more input and output tensors. Several CG-level optimisation strategies exist, including operator fusion, static memory planning pass, and constant folding [6], each offering distinct advantages. For instance, operator fusion allows small operators to be combined into a single operator, thereby reducing the number of operations during the inference process, which leads to a lower computational cost and inference time.
Given the benefits of CG optimisation, this option is widely used by several ML/DL frameworks and compilers, including TensorFlow, Apache TVM, and GLOW, by defining a set of rules or patterns to fuse specific consecutive layers. For example, the fusion of convolution, rectified linear units (ReLU), and batch normalisation can be integrated into a single layer under certain optimising conditions. TensorFlow performs this process when converting a TensorFlow model to TensorFlow Lite (TFLITE), but only if the model meets the fusion requirements. For this purpose, TensorFlow uses DL compilers such as multi-level intermediate representation (MLIR) or accelerated linear algebra (XLA). PyTorch, open neural network exchange (ONNX) and MXNet follow a similar process.
The aforementioned DL compilers are designed to effectively convert, optimise and run DNN models on a variety of devices. These compilers perform numerous optimisation procedures, including the CG optimisation mentioned above. These processes are tailored to improve the model performance and minimise the computational resources consumed by the DNN. Throughout the optimisation process, multiple factors are considered, particularly on hardware accelerators such as CPUs, GPUs, TPUs and FPGAs [6]. In the current landscape, numerous ML models are optimised to operate on resource-constrained devices, such as microcontroller units (MCUs), which pose particular challenges due to their limited computational and storage capabilities.
Essentially, DL compilers can implement various optimisations at both high and low levels. The high level performs all optimisations and transformations independently of the hardware accelerator. In contrast, the low level performs optimisations specific to the hardware, including code generation and compilation. Several of these DL compilers make use of either third-party or their own optimised linear algebra libraries (e.g. Basic Linear Algebra Subprograms (BLAS)-based libraries, the CUDA Deep Neural Network library (cuDNN), the Math Kernel Library for Deep Neural Networks (MKL-DNN), etc.) to accommodate hardware diversity [7]. Thus, by taking a model definition provided by a DL framework as input, a DL compiler can generate highly efficient code for different hardware accelerators.
This work presents a practical approach to merging Conv2D–ReLU–BN into a unified layer, referred to as FConv2D, using CG. This innovative method is developed using the TensorFlow source code, MLIR, and TensorFlow Lite for microcontrollers (TFLITE-micro) and includes quantisation analysis to approximate the transition of the new layer from 32-bit floating point to 8-bit integer format without significantly compromising model accuracy. Consequently, the proposed technique can be executed on both conventional computers and resource-constrained devices. More specifically, the main contributions of this study are summarised below.
- We propose a novel approach to fuse convolution, ReLU activation, and batch normalisation layers into a single operator;
- We define a new thresholded ReLU activation function for the proposed layer fusion;
- We perform a comprehensive performance analysis of the proposed layer fusion, achieving up to a 1.53\(\times\) reduction in overall inference execution time and up to a 2.95\(\times\) speedup on individual layers on two different MCUs.
The rest of this paper is organised as follows. Section 2 defines the problem. Section 3 comprehensively describes the proposed layer fusion’s mathematical background. Section 4 describes the procedure to implement the proposed method. Section 5 shows the experimental setup and the primary results. Section 6 provides an overview of the current studies related to this research field. Finally, Sect. 7 concludes this study.
2 Problem statement
Convolutional neural networks (CNNs) have achieved impressive results in various computer vision tasks. Many CNN architectures combine convolution (Conv2D), batch normalisation (BN), and rectified linear unit (ReLU) (or one of its variants, such as parametric ReLU—PReLU—or Leaky ReLU) layers in that order. However, the alternative configuration of Conv2D, ReLU (or even PReLU or Leaky ReLU), and BN is also widely used. While the authors of batch normalisation argue that placing BN immediately after Conv2D is likely to produce more symmetric, non-sparse, and stable activation distributions [8], some empirical evidence suggests that placing BN after the nonlinearity yields better accuracy and convergence speed [9]. To exemplify this, we trained the two variants of a VGG-like network shown in Table 1, one using the Conv2D–ReLU–BN sequence (v1), and the other using the Conv2D–BN–ReLU sequence (v2), on the CIFAR-10 dataset. Figure 1 displays the minimum, maximum and mean validation accuracy and loss, averaged over 100 different trainings for 80 epochs for both models (see Note 1). The plots demonstrate that the Conv2D–ReLU–BN configuration produces slightly superior accuracy and convergence compared to Conv2D–BN–ReLU. However, it is important to note that the underlying reasons for this effect are still extensively debated [10]. Therefore, placing BN after the nonlinearity may not be the best option for all models, but it can prove to be a valuable approach for certain model architectures.
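The two orderings can be expressed, for instance, as the following Keras blocks (an illustrative sketch only; the filter count is hypothetical and does not reproduce the exact model of Table 1):

```python
from tensorflow.keras import layers

# Conv2D–ReLU–BN block (variant v1)
def block_v1(x, filters=64):
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.Activation("relu")(x)
    return layers.BatchNormalization()(x)

# Conv2D–BN–ReLU block (variant v2)
def block_v2(x, filters=64):
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation("relu")(x)
```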
When deploying CNNs, DL inference engines apply a series of optimisations to reduce inference time. For the Conv2D–BN sequence, it is mathematically feasible to fuse the layers by integrating the BN parameters into the weights of the Conv2D layer without introducing additional operations (see Fig. 2). Consequently, the computation time for the fused Conv2D–BN–ReLU sequence is equivalent to executing only the Conv2D and ReLU layers, resulting in significant performance improvements. Although the fusion of these layers is implemented by default in many DL frameworks, the same does not hold if the nonlinearity (e.g. ReLU or its variants) sits between the Conv2D and BN layers, i.e. the Conv2D–ReLU–BN sequence, because the mathematical formulation and fused implementation of this sequence are more complex. However, not fusing this sequence hurts the inference performance of the corresponding CNNs. To provide empirical evidence of this impact, Table 2 reports the per-layer inference time, and its percentage of the total, for the VGG-like CNN using the Conv2D–ReLU–BN sequence (variant v1) executed on the ARM Cortex-M7 processor in an Arduino Portenta H7 MCU. In practice, once TensorFlow Lite transforms this sequence, it is cast into a Conv2D–ReLU layer followed by the pair of Mul and Add operations that perform the BN layer (see Fig. 3). The execution of this Mul–Add pair entails two pointwise, memory-bound operations, which account for 44.46% of the total inference time for this specific CNN.
It is important to emphasise that the same fusion principles apply to multi-layer perceptrons (MLPs) when a fully connected (FC) layer is followed by a BN layer or when a ReLU-like nonlinearity is present between the FC and BN layers. The operation performed by an FC layer, a matrix multiplication of the activations and weights, is equivalent to that carried out by a convolution layer on the activations and filters, provided that the activations are first rearranged using the widespread im2col or im2row transforms described in [11].
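As a minimal illustration of this equivalence (single channel, unit stride, valid padding; not the optimised routines of [11]), a convolution can be computed as an FC-style matrix product once the input patches are laid out by im2col:

```python
import numpy as np

def im2col(x, kh, kw):
    """Arrange all kh x kw patches of a single-channel input as rows of a matrix."""
    H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((out_h * out_w, kh * kw))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

x = np.random.rand(6, 6)          # input activations
k = np.random.rand(3, 3)          # convolution filter
y = im2col(x, 3, 3) @ k.ravel()   # convolution expressed as a matrix product
```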
This study aims to bridge this gap by introducing a novel optimisation method that integrates the Conv2D, ReLU, and BN layers into a single fused layer when implemented in frameworks like TFLITE-micro. It is worth mentioning that this study focuses solely on the Conv2D–ReLU–BN fusion as a proof of concept, leaving the evaluation of other layer blocks, such as FC–ReLU–BN, and other DNN architectures as part of future work. We hypothesise that such fusion will significantly reduce the inference time of CNN models in resource-constrained environments, such as MCUs with extremely limited memory bandwidth.
3 Layer fusion
This section describes the proposed method to fuse Conv2D, ReLU, and BN layers into a single layer, referred to as FConv2D.
3.1 Conv2D–ReLU–BN fusion
To describe the mathematical formulation of the layer fusion approach proposed in this work, we first revisit the fundamental concepts of the FC, Conv2D, ReLU, and BN layers, and then explain the fusion process for both the Conv2D/FC–BN–ReLU and Conv2D/FC–ReLU–BN layer sequences.
3.1.1 Fundamental layers
An FC layer consists of a set of m neurons and n inputs, where the output of the j-th neuron is computed as follows:
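Assuming each neuron also carries a bias term \(b_j\) (later absorbed by the fusion in Sect. 3.1.2), the output is the usual weighted sum
\[
y_j^{{\tiny \textsf {Conv/FC}}} = \sum _{i=1}^{n} w_{ij}\, x_i + b_j, \qquad j = 1, \ldots , m.
\]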
In the equation, \(x_i\) is the i-th input value, \(y_j^{{\tiny \textsf {Conv/FC}}}\) is the output value of the j-th neuron, and \(w_{ij}\) is the weight associated with the connection from the i-th input to the j-th neuron in the FC layer. The computation of outputs in a Conv2D layer follows a similar approach as in Eq. (1), except that the input neuron tensor is first transformed using the im2col or im2row transform.
The ReLU activation function for the j-th input is applied as follows:
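\[
y_j^{{\tiny \textsf {ReLU}}} = \max (0, x_j) =
\begin{cases}
x_j, & x_j > 0,\\
0, & \text {otherwise}.
\end{cases}
\]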
Finally, each output of the batch normalisation layer can be calculated using (see [8]):
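\[
y_j^{{\tiny \textsf {BN}}} = \gamma _j\, \frac{x_j - \mu _j}{\sqrt{\sigma ^2_j + \epsilon }} + \beta _j,
\]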
where \(y_j^{{\tiny \textsf {BN}}}\) represents the output value of the j-th neuron after the normalisation. To calculate this, the input value \(x_j\) is normalised by subtracting the mean \(\mu _j\) and dividing by \(\sqrt{\sigma ^2_j + \epsilon }\), where \(\sigma ^2_j\) is the variance and \(\epsilon\) is a small constant added for numerical stability. The normalised value is then scaled and shifted by the trainable parameters \(\gamma _j\) and \(\beta _j\), respectively. Note that all of these parameters are constant during the inference process.
To simplify the explanation of the fusion process in the next section, we extract the following scaling factor from Eq. (2):
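\[
s_j = \frac{\gamma _j}{\sqrt{\sigma ^2_j + \epsilon }},
\qquad \text {so that} \qquad
y_j^{{\tiny \textsf {BN}}} = s_j\,(x_j - \mu _j) + \beta _j.
\]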
3.1.2 Conv2D–BN–ReLU fusion
Considering the equations in the previous section, let us formulate the merging process for the Conv2D/FC–BN–ReLU sequence. The fusion of a Conv2D/FC layer with a BN transform is straightforward to formulate using the output \(y_j\) from Eq. (1) for the Conv2D/FC layer as the input \(x_j\) for the BN transform:
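\[
y_j^{{\tiny \textsf {Conv/FC+BN}}} = s_j \left( \sum _{i=1}^{n} w_{ij}\, x_i + b_j - \mu _j \right) + \beta _j.
\]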
If we assume that \(w'_{ij}\) and \(b'_j\) are the respective versions of the weights and biases that have absorbed the \(\mu _j\), \(\sigma _j\), \(\gamma _j\), and \(\beta _j\) parameters of the BN layer, this equation can be further simplified as:
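\[
y_j^{{\tiny \textsf {Conv/FC+BN}}} = \sum _{i=1}^{n} w'_{ij}\, x_i + b'_j,
\qquad \text {with} \qquad
w'_{ij} = s_j\, w_{ij}, \quad b'_j = s_j\,(b_j - \mu _j) + \beta _j.
\]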
With that, the fused layer output \(y_j^{{\tiny \textsf {Conv/FC+BN+ReLU}}}\) can be obtained by applying the ReLU function to \(y_{j}^{{{\text{Conv/FC + BN}}}}\):
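\[
y_j^{{\tiny \textsf {Conv/FC+BN+ReLU}}} = \max \bigl(0,\; y_j^{{\tiny \textsf {Conv/FC+BN}}}\bigr).
\]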
3.1.3 Conv2D–ReLU–BN fusion
The fusion of the Conv2D/FC–ReLU–BN layers can be derived similarly. Firstly, the incorporation of Conv2D/FC–ReLU is accomplished by using the output value \(y_j^{{\tiny \textsf {Conv/FC}}}\) as the input for the ReLU layer, i.e.
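\[
y_j^{{\tiny \textsf {Conv/FC+ReLU}}} = \max \bigl(0,\; y_j^{{\tiny \textsf {Conv/FC}}}\bigr).
\]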
Then, the batch normalisation layer is chained as follows:
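\[
y_j^{{\tiny \textsf {Conv/FC+ReLU+BN}}} =
\begin{cases}
s_j\,\bigl(y_j^{{\tiny \textsf {Conv/FC}}} - \mu _j\bigr) + \beta _j, & y_j^{{\tiny \textsf {Conv/FC}}} > 0,\\
\beta _j - s_j\, \mu _j, & \text{otherwise},
\end{cases}
\]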
where the cases are based on Eq. (3), replacing the input \(x_j\) by \(y_j^{{\tiny \textsf {Conv/FC}}}\). Since \(s_j\), \(\mu _j\) and \(\beta _j\) are constants during inference, they can be grouped into the following threshold:
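\[
\tau _j = \beta _j - s_j\, \mu _j,
\]
which is the constant value produced by the BN layer whenever the ReLU output is zero.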
Finally, using \(y_j^{{\tiny \textsf {Conv/FC+BN}}}\) as defined in Eq. (4), the constant threshold \(\tau _j\), and the sign of the scaling parameter \(\gamma _j\) as a polarity constant, we can further simplify Eq. (6) as:
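\[
y_j^{{\tiny \textsf {Conv/FC+ReLU+BN}}} =
\begin{cases}
y_j^{{\tiny \textsf {Conv/FC+BN}}}, & \gamma _j > 0 \ \text{and}\ y_j^{{\tiny \textsf {Conv/FC+BN}}} > \tau _j,\\
y_j^{{\tiny \textsf {Conv/FC+BN}}}, & \gamma _j < 0 \ \text{and}\ y_j^{{\tiny \textsf {Conv/FC+BN}}} < \tau _j,\\
\tau _j, & \text{otherwise},
\end{cases}
\]
since \(y_j^{{\tiny \textsf {Conv/FC}}} > 0\) is equivalent to \(y_j^{{\tiny \textsf {Conv/FC+BN}}} > \tau _j\) when \(\gamma _j > 0\) (and to \(y_j^{{\tiny \textsf {Conv/FC+BN}}} < \tau _j\) when \(\gamma _j < 0\)), while the lower branch of Eq. (6) equals \(\tau _j\).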
Although this fusion introduces some additional checks compared to the ReLU equation, it only requires the computation of \(y^{{{\text{Conv/FC + BN}}}}\) according to Eq. (5), since the threshold \(\tau _j\) and the polarity \(\gamma _j\) are constants during inference.
3.2 Quantising Conv2D–BN–ReLU fusion
This section elaborates on the mathematical foundations of integer quantisation schemes for the previously described Conv2D–BN–ReLU layer fusion. Given the memory constraints of the MCUs, employing 8-bit integer quantisation, both for arithmetic and storage, is essential for deploying large DNNs. For that, we utilise the quantisation scheme proposed by Jacob et al. [12], which enables an efficient implementation of the arithmetic present in Conv2D and FC layers using only integer arithmetic operations.
We first introduce the general quantisation procedure to describe this scheme, denoting quantised values with the (q) superindex. Assuming x is a tensor of real values, its quantised counterpart \(x^{(q)}\) can be computed as follows:
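\[
x^{(q)} = \frac{x}{S^x} + Z^x,
\]
with the rounding and the clipping to the 8-bit range omitted for brevity.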
where \(S^x\) and \(Z^x\) are the scaling and zero-point translation values associated with the tensor x. The dequantisation formula can be defined analogously as \(x = S^x \cdot (x^{(q)} - Z^x)\).
As shown in Eq. (5), the BN layer can be fused into the Conv2D or FC layer, and the quantised variant of this fusion can be expressed as follows:
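Assuming symmetric per-axis weight quantisation (\(Z^w = 0\), consistent with Table 3) and biases quantised with scale \(S^x S^w\), this yields, up to rounding,
\[
y_j^{(q)} = Z^y + \frac{S^x\, S^w}{S^y} \left( \sum _{i=1}^{n} \bigl(x_i^{(q)} - Z^x\bigr)\, w_{ij}^{\prime (q)} + b_j^{\prime (q)} \right),
\]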
where the input (x), output (y), weights (\(w'\)), and biases (\(b'\)) tensors are quantised using different S and Z quantisation constants.
To extend this implementation into the proposed Conv2D/FC–ReLU–BN layer fusion, quantising the threshold \(\tau\) using the same \(S^y\) and \(Z^y\) parameters is crucial. This quantisation will facilitate the internal comparison operations by the ReLU nonlinearity, i.e.
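\[
\tau _j^{(q)} = \frac{\tau _j}{S^y} + Z^y.
\]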
With \(\tau ^{(q)}\), it is possible to adapt Eq. (7) to operate only with quantised data:
It is important to note that, due to the internal implementation of TFLITE-micro, the comparisons performed by the ReLU layer within the FC/Conv2D layer occur when the data is quantised with the parameter \(S^x\cdot S^w\). Consequently, in the experimentation conducted in this study, the threshold constant \(\tau\) is quantised using the same parameters, that is, \(\tau _j^{(q)} = \frac{\tau _j}{S^x\cdot S^w}\). Table 3 outlines the quantisation requirements for the proposed layer fusion (Conv2D–ReLU–BN). The table details the tensor to be quantised, its range, granularity, and any associated restrictions. For example, the weights are quantised within a range of \(-127\) to 127 and are quantised per-axis.
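The following sketch illustrates, for a single output neuron, how the fused thresholded ReLU can be applied directly on the int32 accumulator before requantisation; all quantisation parameters and sizes are hypothetical, and the code is illustrative rather than the actual TFLITE-micro kernel.

```python
import numpy as np

# Hypothetical quantisation parameters for one output channel j
S_x, Z_x = 0.05, -3      # input scale and zero point
S_w = 0.01               # per-axis weight scale (zero point 0)
S_y, Z_y = 0.08, 5       # output scale and zero point

x_q = np.random.randint(-128, 128, size=64).astype(np.int32)   # quantised inputs
w_q = np.random.randint(-127, 128, size=64).astype(np.int32)   # quantised fused weights w'
b_q = 120                # fused bias b', quantised with scale S_x * S_w
tau, gamma = 0.7, 1.2    # fused threshold and BN scale kept from training (float)

acc = int(np.dot(x_q - Z_x, w_q)) + b_q     # int32 accumulator, scale S_x * S_w
tau_q = int(round(tau / (S_x * S_w)))       # threshold expressed in the accumulator scale

# Thresholded ReLU in the accumulator domain: the polarity is the sign of gamma
if (gamma > 0 and acc > tau_q) or (gamma < 0 and acc < tau_q):
    acc_out = acc
else:
    acc_out = tau_q

# Requantise the accumulator to the int8 output domain
y_q = int(np.clip(round(acc_out * (S_x * S_w) / S_y) + Z_y, -128, 127))
```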
4 Layer fusion process
This section briefly overviews the layer fusion process carried out in this study.
4.1 Implementation
The proposed layer fusion is performed offline to prevent overloading resource-constrained devices. The resulting fused layer integrates seamlessly into the inference process, functioning like standard layers such as Conv2D or FC layers. Although the fusion was implemented using the TensorFlow library, the underlying mathematical principles are applicable and can be utilised to combine these layers in other libraries or ML compilers.
Figure 4 illustrates the result of the layer fusion performed by, for example, TensorFlow. The figure shows that TensorFlow successfully fuses the Conv2D layer with the ReLU activation function, but the BN layer is split into two separate operators: Mul and Add. In contrast to TensorFlow, our proposed method fuses all three layers into a single operator, FConv2D. Notably, as far as we are aware, none of the examined ML libraries perform a fusion of the Conv2D–ReLU–BN layers. The following paragraphs detail the procedure to build the new layer (FConv2D). Since the proposed layer fusion essentially creates a new operator, similar to a convolutional layer but with additional parameters derived from the previously described mathematical process, it enhances the layer's functionality and flexibility.
The layer fusion process comprises three main phases: preparation, optimisation, and quantisation (see Fig. 5):
- Preparation: During the preparation stage, the Conv2D and BN operators from TensorFlow are replaced with custom operators designed specifically for the fusion process. These custom operators remain active throughout the conversion and inference stages, ensuring seamless integration. Additionally, the ReLU activation function is removed from Conv2D–ReLU–BN patterns, as the new ReLU will be incorporated into the fused operation during inference. Crucially, the custom operator preserves all fine-tuned parameters during training, which is essential for an effective layer fusion. Parameters such as the threshold (\(\tau _j\)), polarity (\(\gamma _j\)), and epsilon (\(\epsilon\)) are incorporated directly into the new fused convolutional layer (FConv2D) and are computed from the equations detailed in Sect. 3 (see the sketch after this list). The entire transformation is executed as a directed acyclic graph (DAG)-to-DAG conversion.
- Optimisation: In this phase, two pivotal operations are conducted. The first step folds the addition and subtraction operators into the newly introduced FConv2D layer. Then, the new kernel bias, threshold, and polarity are incorporated into the fused convolutional layer.
- Quantisation: This step comes into play once the layer fusion has been completed. It reduces the model size and improves the inference performance on resource-constrained devices by converting 32-bit floating point values to 8-bit integers.
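As an illustration of the parameters gathered during preparation, the following sketch computes the fused weights, bias, threshold, and polarity for one Conv2D–ReLU–BN block of a trained Keras model; the model path and layer names are hypothetical, and the snippet only mirrors the mathematics of Sect. 3, not the actual DAG-to-DAG pass.

```python
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("vgg_like_v1.h5")      # hypothetical model file
conv = model.get_layer("conv1")                            # hypothetical layer names
bn = model.get_layer("bn1")

w, b = conv.get_weights()                  # kernel (kh, kw, cin, cout) and bias (cout,)
gamma, beta, mean, var = bn.get_weights()  # BN parameters (scale/center enabled)
eps = bn.epsilon

s = gamma / np.sqrt(var + eps)             # per-channel scaling factor s_j
w_fused = w * s                            # absorbed weights w' (broadcast over cout)
b_fused = s * (b - mean) + beta            # absorbed bias b'
tau = beta - s * mean                      # threshold of the fused thresholded ReLU
polarity = np.sign(gamma)                  # only the sign of gamma matters at inference
```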
5 Experimental results
In this section, we experimentally evaluate the layer fusion optimisations described in this work using a state-of-the-art CNN on two Arduino-based MCUs. Specifically, we assess the model size and accuracy attained by a CNN while using the FConv2D layer, as well as the overall and per-convolutional-layer-block inference performance benefits of the FConv2D layer in a VGG-like CNN.
5.1 Hardware setup
The accuracy-performance evaluation is conducted on the two following MCUs:
- Arduino Nano 33 BLE Sense (Nano), a compact MCU board based on the Nordic Semiconductor nRF52840 chip, comprising an ARM Cortex-M4 processor operating at 64 MHz, 256 KiB of SRAM and 1 MiB of flash memory [13].
- Arduino Portenta H7 Lite (Port), a powerful MCU board based on the STMicroelectronics STM32H747XI chip [14]. The Port board features two processors, an ARM Cortex-M4 and a Cortex-M7, operating at 240 MHz and 480 MHz respectively, with 1 MiB of SRAM and 2 MiB of flash memory [15].
5.2 DL framework and libraries
We utilised the EloquentTinyML v2.4.0, an Arduino library that facilitates the deployment of TFLITE models on Arduino boards using the Arduino Integrated Development Environment (IDE). This package incorporates the Arduino TFLITE-micro library v2.4.0-alpha internally linked with the Common Microcontroller Software Interface Standard for Neural Networks (CMSIS-NN) library v2.0.2. The latter offers optimised functions designed explicitly for microcontrollers, enabling efficient execution of DNNs on ARM Cortex-M platforms.
To automate the compilation and uploading of the binaries to the MCUs, we utilised the arduino-cli tool v0.33.0. To enable the GCC compiler's high-performance optimisations, we instructed arduino-cli to set the -O3 optimisation flag. We also report results on the Nano Cortex-M4 processor using the GCC -Os optimisation flag, which optimises the code for size.
5.3 Testbed
To evaluate our optimisations, we trained a custom VGG-like model [16] (see Table 1) on the CIFAR-10 dataset using the Conv2D–ReLU–BN order. To fit this model into the "reduced" flash memory of the target MCUs, we performed a parameter reduction on the original model, which involved removing blocks of layers while preserving the architectural pattern. Although other DNN models exist for resource-constrained devices, such as MobileNet or SqueezeNet, the MCUs targeted in this work have strong RAM and flash memory limitations that prevent such models from fitting, even when heavily downscaled. On the other hand, as most of the Conv2D layer configurations in these DNNs are present in our VGG-like model, the presented evaluation provides enough insight into the performance improvements achieved by the FConv2D fused operator, which can be extrapolated to other DNN models.
Training the model mentioned above was conducted on Google Colab using the TensorFlow library v2.12.0. After training, the model was converted to TFLITE format using the DEFAULT optimisation flag and fully quantised to int8 using the TFLITE_BUILTINS_INT8 flag.
It is essential to note that, for full 8-bit integer quantisation, the ranges of all floating point tensors in the models need to be calibrated. To achieve this, we used a representative dataset from a small, equally balanced subset of CIFAR-10.
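For reference, a minimal sketch of this conversion step is shown below; `model` and `calibration_images` are assumed to be the trained Keras model and the balanced CIFAR-10 subset, respectively.

```python
import tensorflow as tf

def representative_dataset():
    # Small, class-balanced CIFAR-10 subset used to calibrate the tensor ranges
    for image in calibration_images:
        yield [image[None, ...].astype("float32")]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("vgg_like_int8.tflite", "wb") as f:
    f.write(converter.convert())
```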
To reduce potential measurement variations, all experiments report the average results obtained from running the DNNs for a large number of inferences. This approach ensures an accurate performance assessment of the optimised convolution algorithms on the target MCUs.
5.4 Accuracy and size evaluation
After training the aforementioned VGG-like model, we compared the accuracy and size of the TensorFlow, TFLITE, and TFLITE with Conv2D–ReLU–BN fused layers models. For the TFLITE variants, this comparison also includes two data types: int8 and float32.
As depicted in Table 4, the accuracy is nearly identical regardless of the model and data types. The accuracy is similar for the TensorFlow model and both TFLITE-optimised variants with float32. The slight difference in accuracy, \(\approx \pm 0.2\%\), arises when employing the int8 quantisation, with the standard TFLITE model exhibiting a slightly higher accuracy than the TensorFlow and TFLITE with layer fusion variants.
Therefore, the primary distinction between the different models lies in their respective sizes, with the TFLITE variants being smaller than the TensorFlow one and the TFLITE with layer fusion slightly smaller than the TFLITE one.
5.5 Per-layer performance evaluation
Once the accuracy consistency among the different TensorFlow and TFLITE models is evaluated, the next step is to generate the TFLITE models for the aforementioned MCUs to test their performance.
It is important to recall that TFLITE does not support certain operations. As a result, when converting from the TensorFlow model to TFLITE, some operators are either fused or replaced by TFLITE's built-in operators. In particular, given that the Conv2D–ReLU–BN sequence does not fall within the layer fusion patterns of the TFLITE optimisations, the ReLU layer is fused with the Conv2D layer, while the BN layer is substituted with Mul and Add layers.
Figure 6 shows the execution time of the Conv2D–ReLU–BN layers, both with and without layer fusion on the Cortex-M4 processor in Nano and the Cortex-M4/-M7 processors in Port using the compiler optimisations for binary size (-Os flag) and for performance (-O3). The dark blue colour represents the fused Conv2D–ReLU layer, the yellow colour represents the Mul layer, the Add layer is denoted in pink, and the fused layer (FConv2D) is light blue. The X-axis represents the layer IDs to be fused, while the Y-axis represents the execution time in milliseconds.
Inspecting first the results for the Cortex-M4 of Nano (see the top row of Fig. 6), we observe a consistent improvement in performance. The speedups range between 0.96\(\times\) and 2.6\(\times\) for -Os and between 1.22\(\times\) and 2.18\(\times\) for -O3 when the Conv2D–ReLU–BN layers are fused. Our layer fusion process successfully consolidates Conv2D–ReLU–BN into a single layer during the conversion from the TensorFlow model to TFLITE. This eliminates additional layers such as Mul and Add in the final optimised model, significantly reducing inference time. It is noteworthy, however, that the -Os flag is generally less efficient than -O3, with a specific instance in layer #18 where we observe a slight increase in execution time, yielding a slowdown of 0.96\(\times\). We attribute this phenomenon to the particular parameters of the convolution operation in this layer, though this behaviour does not occur when the -O3 flag is employed. From the overall results, we conclude that optimal performance for both models (with and without layer fusion) is achieved using the -O3 flag.
Similar results are obtained with the Cortex-M4 and M7 processors in Port (see Fig. 6). When compared to the Cortex-M4 of Nano, the processors in Port exhibit lower execution times, attributed to their superior computational performance. Comparing the results obtained with the Cortex-M4 and M7 processors, we identify similar trends, indicating a lower execution time for FConv2D compared to the original model without layer fusion. In this scenario, the fused layer #18 is also the only one that is slower than the layer without fusion when using the -Os flag on the Port Cortex-M4 (0.94\(\times\)), and when using the -O3 flag on the Port Cortex-M7 (0.98\(\times\)). This slowdown is due to the additional operations involving the threshold; at the same time, the margin for improvement is small, since the time spent on the multiplication and addition relative to the Conv2D–ReLU time is the smallest of all layers. Meanwhile, using the -O3 flag produces remarkable speedups with the fused layer version, ranging from 1.31\(\times\) to 1.95\(\times\) on the Port Cortex-M4 and from 0.95\(\times\) to 1.92\(\times\) on the Port Cortex-M7.
To compare the performance of the Conv2D–ReLU–BN pattern and the FConv2D fused operator against the Conv2D–BN–ReLU layer ordering, note that the Conv2D–ReLU bar (dark blue) also represents the execution time of the Conv2D–BN–ReLU pattern, since in that case TFLITE can effectively handle the fusion. Interestingly, in some scenarios the execution time of FConv2D is even lower than that of the Conv2D–ReLU layer alone (see the results for the Port Cortex-M4 with the -O3 flag in Fig. 6). This behaviour arises because, in the TFLITE-optimised model, the Conv2D–ReLU layer relies on min-max operations, whereas FConv2D uses if conditions. As a result, performance differences depend on hardware capabilities and compiler optimisations, with min-max operations often benefiting from vectorisation on modern processors.
5.6 Inference performance evaluation
Figure 7 compares the cumulative time between the original model without fusion (in blue) and the proposed model with layer fusion (in orange) for the Cortex-M4 and M7 processors of Nano and Port. This comparison considers both the -Os and -O3 flags. The plots reveal a significant reduction in the overall cumulative time when implementing the optimisation process proposed in this study compared to the original model. For instance, in the worst case, the fused layer is 1.26\(\times\) faster than the original model (see the right-hand side plot in the bottom row of Fig. 7). In any case, the achievable speedups depend on the specific processor, MCU, and flag (-Os or -O3).
As expected, the highest performance is consistently achieved with the -O3 flag, resulting in an execution time around 50% lower than the -Os alternative. Despite this, it should be noted that the execution time of the -Os fused version is closer to that of the -O3 non-fused version than to that of the -Os non-fused one.
6 Related work
This section provides a general view of state-of-the-art model optimisation using the layer fusion technique. Additionally, it highlights the key differences between the existing studies that address layer fusion and our proposal.
Layer fusion has been widely researched in the field of network optimisation. This technique mainly reduces the model size (i.e. the number of layers) while enhancing the inference performance [17]. For instance, O'Neill et al. [18] suggested a new method to combine the weights of similar layers, focusing primarily on convolutional, dense (fully connected), and attention layers. The authors compute the similarity between weights using the Bures metric and then combine the layers exhibiting a high correlation (lower distance) between weights. Consequently, the authors achieved a compression ratio of 3.3 while maintaining accuracy levels similar to those of the original models.
Another method suggested for layer fusion is known as sequence graph substitution. In this approach, consecutive operations are replaced by one or more equivalent operations to enhance the efficiency of the layer’s operation. This method is commonly employed in popular libraries such as TFLITE and Apache TVM. For instance, TensorFlow fuses the sequence of convolution, batch normalisation, and ReLU activation function into a single kernel, reducing the overall number of operations needed during the inference phase. Fang et al. [17] adopted a similar approach to TFLITE by fusing consecutive layers using a set of substitution rules. In their work, the authors, for example, replace consecutive convolutional layers with a single operation and a “split” node to reduce computational time.
Although operation substitution and layer fusion are commonly used by different libraries and ML compilers, they are usually limited to a set of rules. To circumvent these limitations, Niu et al. [19] proposed a novel loop fusion framework, which considers the input and output of each operation to classify them into several groups. These groups determine the mapping type of each operator, e.g. an operator can have a relation one-to-one, one-to-many or many-to-many with another operator, allowing a wide range of fusion alternatives among operators.
While many DNN architectures are deployed on resource-constrained devices, significant challenges remain in optimising techniques like layer fusion to effectively address memory and computational constraints. For example, existing approaches often struggle to combine quantisation with fusion, a critical integration for minimising memory usage and computational overhead in low-resource environments.
In contrast to previous works, this study introduces a layer fusion method specifically targeting three frequently used layers in many DNNs: Conv2D, BN, and ReLU. Moreover, we incorporate a quantisation process into the fusion technique, which not only reduces the computational load but also slightly decreases the model size. This dual approach addresses the identified gaps, offering a practical solution for deploying efficient DNNs on resource-limited devices while maintaining competitive performance.
7 Conclusions
We have introduced a novel optimisation procedure to fuse Conv2D–ReLU–BN layers into a unified layer, denoted as FConv2D, along with its quantisation process. This process is implemented using the CG optimisation technique. The proposed layer fusion method is tested on the Nano Cortex-M4 and the Port Cortex-M4 and M7 processors, widely employed in various IoT scenarios.
The primary goal of layer fusion is to alleviate the computational intensity of DNN models composed of Conv2D–ReLU–BN layers when deployed on resource-constrained devices. Our fusion technique reduces the inference time by at least 1.26\(\times\) compared to the original model, regardless of the MCU and compilation flag used. Remarkably, when the model with the FConv2D layer is compiled with the -O3 flag for the Port Cortex-M4, the execution time of the FConv2D layer is even lower than the time required for the Conv2D–ReLU layer alone.
These findings have significant broader implications, especially in real-world applications where edge computing and IoT devices are increasingly prevalent. By streamlining DNN models for resource-constrained devices, this optimisation enables more efficient use of hardware, leading to enhanced performance in critical areas such as real-time data processing, autonomous systems, and smart sensors. For example, improved processing speeds in healthcare can facilitate more responsive patient monitoring systems. In industrial settings, this can lead to more efficient predictive maintenance by enabling quicker data analysis from sensors on manufacturing equipment. Furthermore, in consumer electronics, this optimisation can enhance user experiences in smart home devices by reducing latency in tasks such as voice recognition and object detection. Reducing execution time also contributes to energy savings, which is crucial for battery-powered IoT devices, extending their operational life and reliability. Overall, the ability to deploy more efficient DNN models on constrained hardware opens up new possibilities for innovation across various sectors, making advanced AI applications more accessible and practical in everyday scenarios.
As part of our future work, we would like to explore alternative forms of layer fusion, considering strategies to minimise data movements between layers. In particular, we plan to evaluate models that incorporate fully connected FC–ReLU–BN layer combinations, as we have shown that these can be fused in the same numerical manner as Conv2D–ReLU–BN. This opens up new avenues for improving computational efficiency and further reducing the overall latency in DL models. Moreover, our aim is to investigate other activation functions, such as the Leaky ReLU and parametric ReLU, which introduce additional parameters. In this case, the current formulation will need to be adapted to account for these extra parameters.
Notes
1. During the model training, we started with a learning rate of 0.1 and progressively reduced it to 0.005 at epoch 80. We also used a batch size of 128.
References
Chakraborty C, Bhattacharya M, Pal S, Lee S-S (2024) From machine learning to deep learning: advances of the recent data-driven paradigm shift in medicine and healthcare. Curr Res Biotechnol 7:100164
Prado-Rujas I-I, García-Dopico A, Serrano E, Córdoba ML, Pérez MS (2024) A multivariable sensor-agnostic framework for spatio-temporal air quality forecasting based on deep learning. Eng Appl Artif Intell 127:107271
Tavakoli M, Baldi P, Carlton AM, Chiu YT, Shmakov A, Van Vranken D (2024) AI for interpretable chemistry: predicting radical mechanistic pathways via contrastive learning. Adv Neural Inf Process Syst 36
Maciá-Lillo A, Barrachina S, Fabregat G, Dolz MF (2024) Optimizing convolutions for deep learning inference on ARM Cortex-M processors. IEEE Internet Things J 11(15):26203–26219. http://doi.org/10.1109/JIOT.2024.3395335
Cai X, Wang Y, Zhang L (2021) Optimus: towards optimal layer-fusion on deep learning processors. In: Proceedings of the 22nd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems. LCTES 2021, pp. 67–79. Association for Computing Machinery, New York, NY, USA http://doi.org/10.1145/3461648.3463848
Chen T, Moreau T, Jiang Z, Zheng L, Yan E, Cowan M, Shen H, Wang L, Hu Y, Ceze L, Guestrin C, Krishnamurthy A (2018) TVM: an automated end-to-end optimizing compiler for deep learning. In: Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation. OSDI’18, pp. 579–594. USENIX Association, USA
Li M, Liu Y, Liu X, Sun Q, You X, Yang H, Luan Z, Gan L, Yang G, Qian D (2021) The deep learning compiler: a comprehensive survey. IEEE Trans Parallel Distrib Syst 32(3):708–727. http://doi.org/10.1109/TPDS.2020.3030548
Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning (ICML), PMLR 37, pp 448–456
Ducha A. CaffeNet benchmark - understanding batch normalization. http://github.com/ducha-aiki/caffenet-enchmark/blob/master/batchnorm.md. Accessed 21 Jan 2025
Batch Normalization Before or After ReLU?. http://www.reddit.com/r/MachineLearning/comments/67gonq/d_batch_normalization_before_or_after_relu/?rdt=65436. Accessed 21 Jan 2025
Barrachina S, Dolz MF, San Juan P, Quintana-Ortí ES (2022) Efficient and portable gemm-based convolution operators for deep neural network training on multicore processors. J Parallel Distrib Comput 167:240–254. http://doi.org/10.1016/j.jpdc.2022.05.009
Jacob B, Kligys S, Chen B, Zhu M, Tang M, Howard A, Adam H, Kalenichenko D (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 2704–2713 http://doi.org/10.1109/CVPR.2018.00286
Arduino Nano 33 BLE Sense. http://store.arduino.cc/arduino-nano-33-ble-sense. Accessed 21 Jan 2025
STMicroelectronics: STM32H747XI Datasheet. Datasheet. http://www.st.com/en/microcontrollers-microprocessors/stm32h747xi.html. Accessed 21 Jan 2025
Arduino Portenta H7 Lite. http://www.arduino.cc/pro/hardware/product/portenta-h7-lite. Accessed 21 Jan 2025
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: 3rd International Conference on Learning Representations (ICLR)
Fang J, Shen Y, Wang Y, Chen L (2020) Optimizing dnn computation graph using graph substitutions. Proc VLDB Endow 13(12):2734–2746. http://doi.org/10.14778/3407790.3407857
O’Neill J, Steeg GV, Galstyan A (2021) Layer-wise neural network compression via layer fusion. In: Balasubramanian, V.N., Tsang, I. (eds.) Proceedings of The 13th Asian Conference on Machine Learning. Proceedings of Machine Learning Research, vol 157, pp 1381–1396. PMLR, Online http://proceedings.mlr.press/v157/o-neill21a.html
Niu W, Guan J, Wang Y, Agrawal G, Ren B (2021) Dnnfusion: accelerating deep neural networks execution with advanced operator fusion. In: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation. PLDI 2021, pp. 883–898. Association for Computing Machinery, New York, NY, USA http://doi.org/10.1145/3453483.3454083
Acknowledgements
This research was funded by project TED2021-129334B-I00 supported by MCIN/AEI/10.13039/501100011033 and by the “European Union NextGenerationEU/PRTR”. Manuel F. Dolz was also supported by the Plan Gen–T grant CIDEXG/2022/013 of the Generalitat Valenciana. Jose I. Mestre was also supported by the FPI grant GVA ACIF/2021/281.
Funding
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Mestre, J.I., Barrachina, S., Quezada, D. et al. Deep learning inference optimisation for IoT: Conv2D-ReLU-BN layer fusion and quantisation. J Supercomput 81, 621 (2025). http://doi.org/10.1007/s11227-025-07107-y
DOI: http://doi.org/10.1007/s11227-025-07107-y