1 Introduction

Spatiotemporal processes are the evolving states of complex interactions between spatially interrelated features over time, and their accurate prediction is of great significance in many fields such as rainfall prediction and flood prevention. According to their underlying principles, existing spatiotemporal prediction methods can be divided into three categories: traditional parametric models, shallow machine learning models, and deep learning models [1]. Traditional parametric models and shallow machine learning models have strong interpretability. However, traditional parametric models (e.g., the Weather Research and Forecasting model, WRF) [2] and shallow machine learning models (e.g., the support vector machine, SVM) [3] struggle to accurately capture temporal dependencies and perform poorly on multiscale, high-dimensional data. Furthermore, they can only capture local spatial correlations between adjacent nodes and thus can hardly achieve global perception. Deep learning, although less interpretable, has significant advantages in both prediction accuracy and the handling of large-scale datasets, making up for the shortcoming of traditional parametric models and shallow machine learning models, which are only suitable for small-sample data [4].

According to the prediction target, existing spatiotemporal deep learning methods can be divided into three categories: time series deep learning prediction methods, spatial feature deep learning prediction methods, and spatiotemporal process deep learning prediction methods. Time series deep learning prediction methods, e.g., recurrent neural networks (RNNs) [5], gated recurrent unit (GRU) networks [6], long short-term memory (LSTM) networks [7], and temporal convolutional networks (TCNs) [8], can extract temporal features from data but often ignore the spatial correlation of spatiotemporal data when applied to the simulation of spatiotemporal processes. To this end, spatial feature deep learning prediction methods have been proposed. Specifically, these methods try to extract and analyze features from gridded spatiotemporal data. Techniques such as convolutional neural networks (CNNs) [9], residual neural networks (ResNets) [10], and graph convolutional neural networks (GCNs) [11] are commonly employed to capture these spatial features. However, such models often ignore the temporal association within spatiotemporal sequence data. Specifically, the spatial correlation between regions tends to shift over time in real scenarios. For instance, the runoff generated under the same rainfall intensity differs considerably between rainy and dry seasons, and so does the resulting flooding risk. As a result, characterizing temporal dependence, spatial dependence, and the relationship between them is particularly essential for spatiotemporal process prediction [12].

Deep learning prediction methods that integrate both temporal and spatial features have been continually evolving. One prominent example is ConvLSTM [13], which combines CNN and LSTM. This method extends the traditional fully connected LSTM by replacing its one-dimensional tensors with three-dimensional tensors, enabling the model to better capture spatiotemporal features within the data. This approach has been particularly effective in tasks like rainfall prediction using radar echo data. However, despite its success, ConvLSTM still relies on LSTM units, which have limitations in fully capturing complex spatiotemporal dependencies. To address this, ConvGRU [14] was introduced, where the LSTM units in ConvLSTM are replaced by gated recurrent units (GRUs). Specifically, ConvGRU maintains the temporal relationships and spatial feature extraction by applying convolution operations across different regions, improving efficiency and enhancing the model's ability to handle spatiotemporal dynamics. To further leverage the strengths of both convolutional and recurrent architectures, PredRNN [15] was introduced and applied to precipitation nowcasting for the first time. It incorporates a novel spatiotemporal LSTM (ST-LSTM) unit designed to simultaneously extract and store both spatial and temporal representations. However, the zigzag transfer of spatiotemporal memory states within this architecture made it vulnerable to the vanishing gradient problem, particularly as the model depth and the sequence length increased. To address this limitation, PredRNN++ [16] was developed, utilizing a CausalLSTM structure to better capture short-term dependencies. It also introduced the Gradient Highway, which mitigates the vanishing gradient issue, ensuring more stable gradient flow across layers and timesteps. While these advancements improved the modeling of spatiotemporal dependencies, previous models struggled to capture the nonstationary characteristics often present in time series data. To address this, the Memory in Memory (MIM) [17] framework was proposed. MIM consists of two cascaded modules designed to handle the nonstationary and near-stationary components of spatiotemporal dynamics, drawing on a differencing approach to capture higher-order nonstationarity. Further analysis of PredRNN revealed that its spatial and temporal memory states were often entangled, leading to redundant feature learning. In response, PredRNN-v2 [18] introduced a memory decoupling loss to separate the learning of spatial and temporal features, reducing redundancy and improving model efficiency. PredRNN-v2 also proposes reverse scheduled sampling (RSS), which trains the model with gradually increasing difficulty so that it captures long-term dynamic information and learns both short- and long-term dependencies more effectively. To reduce the heavy computation of recurrent networks, SimVP [19] was constructed from purely convolutional modules, greatly reducing the number of parameters. Furthermore, MMVP [20] decouples motion and appearance information to further reduce the model size. Although the above networks perform well, their architectures are usually complex and their performance varies across tasks. Therefore, designing a spatiotemporal prediction network architecture for multi-task learning remains a unique challenge.

The attention mechanism has gained significant popularity due to its success in various visual tasks, with self-attention becoming particularly widespread because of its ability to produce strong results. By dynamically adjusting weights based on input image features, the attention mechanism highlights the most important parts of complex inputs for generating the desired output. This makes it highly effective in uncovering spatiotemporal dependencies in large-scale spatiotemporal sequences. To leverage this, SA-ConvLSTM [21] was proposed, which integrates self-attention to capture long-term spatial dependencies using a Self-Attention Memory (SAM). This approach enables the model to better retain and model spatial relationships over time. Additionally, to reduce computational complexity, the authors replaced standard convolution operations with depthwise separable convolutions, which help maintain model efficiency while still capturing essential spatial features. Building on the need for more effective radar echo extrapolation, a novel interactive model called IDA-LSTM [22] was introduced. This model employs an interactive framework to improve the short-term dependence in ConvRNN models, enhancing their performance in predicting short-term spatiotemporal dynamics. Furthermore, the model incorporates a channel and spatial dual attention mechanism into the ST-LSTM structure, allowing it to retain long-term memory of past information more effectively. However, this added complexity comes at the cost of increased computational effort and a higher number of parameters, which could potentially limit its scalability in some applications. In general, though SA-ConvLSTM and IDA-LSTM reflect the growing trend of integrating attention mechanisms with spatiotemporal modeling to better capture dependencies, challenges related to computational efficiency and parameter optimization remain significant areas for future development.

Given the above limitations of existing studies, our research aims to develop a deep learning network for spatiotemporal process prediction based on a stacked architecture: STPNet. The proposed model integrates a self-attention strategy with a dual attention mechanism and a gating mechanism to effectively capture and model complex spatiotemporal dynamics. The contributions of STPNet can be summarized as follows:

1) Dual attention mechanism: Higher-order spatiotemporal dynamics are learned through the dual attention mechanism, enabling the model to focus on both channel-wise and spatial features and improving its ability to capture intricate relationships in the data.

2) Self-attention for long-term modeling: The long-term modeling limitations of traditional recurrent neural networks (RNNs) are addressed by incorporating a self-attention mechanism. This enhancement allows the construction of a spatiotemporal attention unit (STA-LSTM), which can efficiently extract both long-term and short-term spatiotemporal features from the data.

3) Zigzag spatiotemporal memory transfer: To ensure efficient information flow, the spatiotemporal memory is transferred throughout the convolutional recurrent network in a zigzag pattern. This design facilitates the effective transfer of feature knowledge from input to output, minimizing information loss and improving the network's ability to model complex spatiotemporal dependencies.

By extending conventional convolutional recurrent networks to consider both long-term and short-term spatiotemporal relationships, STPNet significantly enhances the ability to represent complex interactions between spatially related elements over time. Validations on several representative spatiotemporal datasets have demonstrated its effectiveness in predicting spatiotemporal processes. This approach provides a robust solution to address the limitations of traditional models, which often struggle to simultaneously capture spatial and temporal dependencies—both of which are crucial for accurate spatiotemporal process prediction.

The remainder of this paper is organized as follows: Sect. 2 provides a detailed description of the STPNet implementation, including the design and functionality of the shuffle attention unit (SA-unit) and the spatiotemporal attention LSTM unit (STA-LSTM unit). These components form the core of the proposed spatiotemporal prediction network architecture. Section 3 presents the experiments and analysis, which cover several aspects: the datasets used in the study, the experimental setup, ablation studies to verify the rationality of the model design, performance comparisons between STPNet and other models on the chosen datasets, as well as discussions on STPNet’s applicability and limitations in real-world scenarios. Section 4 concludes the paper, summarizing key findings and suggesting directions for future research.

2 Method

2.1 Model overview

As shown in Fig. 1, the proposed spatiotemporal process prediction model (STPNet) is constructed by stacking multiple units, and the architecture updates the spatiotemporal state along a zigzag direction as a spatiotemporal memory flow. There are two kinds of basic, novel units: shuffle attention network (SA-Net) units and spatiotemporal attention-based LSTM (STA-LSTM) units. Specifically, SA-Net is a self-attention module used to extract spatial features, and STA-LSTM is used to aggregate temporal information. Given an input spatiotemporal sequence \((X_{0}, X_{1}, \dots , X_{n})\), the features are first processed by SA-Net to obtain pixel-level pairwise relationships and channel dependencies. After that, these features are further processed by STA-LSTM units to construct deeper spatial and temporal relationships. Benefiting from the self-attention strategy adopted in the STA-LSTM unit, the current features and the memorized features are fused so that pairwise similarity scores can be computed against memorized features with long-term spatial and temporal dependence. Details of STPNet are described below:

Fig. 1
figure 1

STPNet network model architecture
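To make the data flow in Fig. 1 concrete, the following is a minimal PyTorch sketch of the forward pass it implies, not the released implementation: each input frame is refined by the SA-Net unit and then climbs a stack of STA-LSTM cells whose shared spatiotemporal memory follows the zigzag flow across timesteps. The constructor arguments `sa_unit` and `cell_cls` stand in for the SA-Net and STA-LSTM units of Sects. 2.2 and 2.3, and the 1 × 1 output head and channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class STPNetSketch(nn.Module):
    """Minimal sketch of the STPNet forward pass (hypothetical class/argument names)."""
    def __init__(self, in_ch, hid_ch, num_layers, sa_unit, cell_cls):
        super().__init__()
        self.sa = sa_unit(in_ch)                      # dual-attention refinement of inputs
        self.cells = nn.ModuleList(
            [cell_cls(in_ch if l == 0 else hid_ch, hid_ch) for l in range(num_layers)]
        )
        self.head = nn.Conv2d(hid_ch, in_ch, kernel_size=1)   # map hidden state back to frames

    def forward(self, frames):                        # frames: [B, T, C, H, W]
        h = [None] * len(self.cells)                  # hidden states H, one per layer
        c = [None] * len(self.cells)                  # temporal memories C, one per layer
        m = None                                      # spatiotemporal memory M (zigzag flow)
        outputs = []
        for t in range(frames.size(1)):
            x = self.sa(frames[:, t])                 # SA-Net refines the current frame
            for l, cell in enumerate(self.cells):
                # M enters layer 0 from the top layer of the previous timestep,
                # then climbs the stack within the current timestep (zigzag transfer).
                h[l], c[l], m = cell(x, h[l], c[l], m)
                x = h[l]
            outputs.append(self.head(h[-1]))
        return torch.stack(outputs, dim=1)
```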

2.2 Dual attention strategy

To learn higher-order nonlinear dependencies, a dual attention mechanism is embedded in STPNet. As shown in Fig. 2, the input spatiotemporal data are first processed by a shuffle attention network (SA-Net) [23] in the channel and spatial dimensions. Consider a feature map \(X\in {\mathbb{R}}^{C\times H\times W}\), where C, H, and W refer to the number of channels, the height, and the width, respectively. First, \(X\) is grouped into multiple sub-features along the channel dimension, which are then processed in parallel, i.e., \(X=[{X}_{1},\dots ,{X}_{G}]\), \({X}_{k}\in {\mathbb{R}}^{C/G\times H\times W}\). Next, for each group of sub-features, the dependencies along the spatial and channel dimensions are captured with shuffle units. Lastly, all features are integrated, and information is exchanged between groups through the channel shuffle operation.

Fig. 2
figure 2

The dual attention mechanism embedded in the proposed model

2.2.1 Channel attention

The channel attention branch is an improvement on SENet [24]: it first embeds global information by generating the channel statistic \(s\in {\mathbb{R}}^{C/2G\times 1\times 1}\) through global average pooling. The process can be described as:

$$s={\mathcal{F}}_{gp}\left({X}_{k1}\right)=\frac{1}{H\times W}{\sum }_{i=1}^{H}{\sum }_{j=1}^{W}{X}_{k1}\left(i,j\right).$$
(1)

where H and W represent the height and width of \({X}_{k1}\), respectively. After that, a fully connected layer followed by a sigmoid activation is used to obtain the attention weights and apply them to the input feature. This process can be represented as:

$$X_{k1}{\prime} = \sigma \left( {{\mathcal{F}} \left( s \right)} \right) \cdot X_{k1} = \sigma \left( {W_{1} s + b_{1} } \right) \cdot X_{k1} ,$$
(2)

where \({W}_{1}\) and \({b}_{1}\in {\mathbb{R}}^{C/2G\times 1\times 1}\) are the weight and bias of the fully connected layer, and \(\sigma\) denotes the sigmoid activation.

2.2.2 Spatial attention

Spatial attention is a complement to channel attention; it adopts a group normalization (GN) operation and focuses on informative regions. Given \({X}_{k2}\), spatial-domain statistics are first obtained by group normalization, and a depthwise convolution \({\mathcal{F}}_{\mathcal{c}}\left(\bullet \right)\) is then used to enhance them. The process can be described as follows:

$$X_{k2}^{\prime } = \sigma \left( {W_{2} \cdot GN\left( {X_{k2} } \right) + b_{2} } \right) \cdot X_{k2} ,$$
(3)

where \({W}_{2}\) and \({b}_{2}\in {\mathbb{R}}^{C/2G\times 1\times 1}\) denote the weight and bias of \({\mathcal{F}}_{\mathcal{c}}\left(\bullet \right)\). The depthwise convolution \({\mathcal{F}}_{\mathcal{c}}\left(\bullet \right)\) uses a kernel size of 3 with stride 2 and padding 1.

2.2.3 Aggregation

After the two attention branches have been computed, their outputs need to be integrated. First, a simple concatenation is applied: \({X}_{k}^{{^{\prime}}}=[{X}_{k1}^{{^{\prime}}}, {X}_{k2}^{{^{\prime}}}]\in {\mathbb{R}}^{C/G\times H\times W}\). Then, channel shuffle is used for inter-group communication. Finally, the output of SA-Net is embedded into STPNet to refine the inputs with spatial attention.
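For reference, a minimal PyTorch sketch of such a shuffle attention unit is given below. It follows the common SA-Net formulation, in which the channel branch implements Eqs. (1)–(2) and the spatial branch applies a per-channel gating after GroupNorm as in Eq. (3); the per-channel parameters `w1`, `b1`, `w2`, `b2` (used here in place of the depthwise convolution mentioned above) and the default group number are simplifying assumptions.

```python
import torch
import torch.nn as nn

class ShuffleAttentionSketch(nn.Module):
    """Minimal sketch of the SA-Net unit (Eqs. (1)-(3)); channels must be divisible by 2*groups."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.G = groups
        c = channels // (2 * groups)                  # channels per branch, C/2G
        # channel-attention parameters (Eq. (2)), acting per channel
        self.w1 = nn.Parameter(torch.zeros(1, c, 1, 1))
        self.b1 = nn.Parameter(torch.ones(1, c, 1, 1))
        # spatial-attention parameters (Eq. (3)), applied after GroupNorm
        self.w2 = nn.Parameter(torch.zeros(1, c, 1, 1))
        self.b2 = nn.Parameter(torch.ones(1, c, 1, 1))
        self.gn = nn.GroupNorm(c, c)

    @staticmethod
    def channel_shuffle(x, groups):
        b, c, h, w = x.shape
        return (x.view(b, groups, c // groups, h, w)
                 .transpose(1, 2).reshape(b, c, h, w))

    def forward(self, x):                             # x: [B, C, H, W]
        b, c, h, w = x.shape
        x = x.view(b * self.G, c // self.G, h, w)     # split into G sub-features
        x1, x2 = x.chunk(2, dim=1)                    # channel branch / spatial branch
        # channel attention: global average pooling (Eq. (1)) + gating (Eq. (2))
        s = x1.mean(dim=(2, 3), keepdim=True)
        x1 = x1 * torch.sigmoid(self.w1 * s + self.b1)
        # spatial attention: GroupNorm statistics + gating (Eq. (3))
        x2 = x2 * torch.sigmoid(self.w2 * self.gn(x2) + self.b2)
        out = torch.cat([x1, x2], dim=1).view(b, c, h, w)
        return self.channel_shuffle(out, groups=2)    # inter-group communication
```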

2.3 STA-LSTM unit

Compared to the conventional LSTM, the proposed spatiotemporal attention unit in STPNet introduces a key difference: the spatiotemporal memory state no longer comes from the previous timestep within the same layer but from the previous layer at the current timestep. Specifically, the unit utilizes the temporal memory state \({C}_{t-1}^{l}\) and the hidden state \({H}_{t-1}^{l}\) of the same layer at the previous timestep, together with the spatial memory state \({M}_{t}^{l-1}\) of the previous layer at the current timestep. As illustrated in Fig. 3, this design establishes a "circular highway" through the spatiotemporal memory flow, reducing information loss as it propagates from lower to higher layers. This enhancement allows STPNet to more effectively model short-term dynamics, improving its performance in capturing complex spatiotemporal relationships.

Fig. 3
figure 3

STA-LSTM unit structure

To capture the complex temporal features in the time series, the multi-factor dependencies, and the spatial dependencies present in the corresponding spatial raster points, we combine the gating mechanism of the conventional LSTM with cross attention to form the spatiotemporal attention-based LSTM (STA-LSTM) unit. It simultaneously considers the temporal dependencies and spatial relationships of spatiotemporal data, enabling better extraction of spatiotemporal features. As shown in Fig. 3, STA-LSTM mainly consists of two parallel pathways, which involve three memory state matrices and one input spatiotemporal tensor. For the l-th layer STA-LSTM unit in STPNet, the three state matrices are the temporal memory state \({C}_{t-1}^{l}\) of layer l at timestep t − 1, the hidden state \({H}_{t-1}^{l}\) of layer l at timestep t − 1, and the spatial memory state \({M}_{t}^{l-1}\) of layer l − 1 at timestep t. Considering the complexity of the calculations in this unit and for simplicity of description, we break the update into three interconnected sub-steps:

2.3.1 Update temporal memory state

When a spatiotemporal feature \({X}_{t}\) passes through this unit, the temporal memory state \({C}_{t}^{l}\) at timestep t is updated as follows:

$${i}_{t}=\sigma \left({W}_{xi}*{X}_{t}+{W}_{\text{hi}}*{H}_{t-1}^{l}\right),$$
(4)
$${g}_{t}=\text{tanh}\left({W}_{xg}*{X}_{t}+{W}_{\text{hg}}*{H}_{t-1}^{l}\right),$$
(5)
$${f}_{t}=\sigma \left({W}_{xf}*{X}_{t}+{W}_{\text{hf}}*{H}_{t-1}^{l}\right),$$
(6)
$${C}_{t}^{l}={{f}_{t}\odot C}_{t-1}^{l}+{i}_{t}\odot {g}_{t},$$
(7)

where \(*\) denotes a 3D convolution, \(\upsigma\) denotes the sigmoid activation function, \(tanh\) denotes the tanh activation function, \(\odot\) denotes a Hadamard product, \({i}_{t}\) is the input gate, \({g}_{t}\) is the input modulation gate, \({f}_{t}\) is the forget gate, \({W}_{xi}\) is the weight of spatiotemporal data and \({W}_{\text{hi}}\) is the weight of the hidden state in the input gate, \({W}_{xg}\) is the weight of spatiotemporal data and \({W}_{\text{hg}}\) is the weight of the hidden state in the input modulation gate, \({W}_{xf}\) is the weight of spatiotemporal data and \({W}_{\text{hf}}\) is the weight of the hidden state in the forget gate.
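A minimal PyTorch sketch of this gated update is given below; it assumes 2D convolutions over each frame in place of the convolution denoted by \(*\) and a hypothetical kernel size of 5, with the three gates computed from stacked convolution outputs.

```python
import torch
import torch.nn as nn

class TemporalMemoryUpdateSketch(nn.Module):
    """Sketch of Eqs. (4)-(7): gated update of the temporal memory C_t^l."""
    def __init__(self, in_ch, hid_ch, k=5):
        super().__init__()
        p = k // 2
        # one stacked convolution per input; names follow the weights in Eqs. (4)-(6)
        self.w_x = nn.Conv2d(in_ch, 3 * hid_ch, k, padding=p)   # W_xi, W_xg, W_xf
        self.w_h = nn.Conv2d(hid_ch, 3 * hid_ch, k, padding=p)  # W_hi, W_hg, W_hf

    def forward(self, x_t, h_prev, c_prev):
        i_x, g_x, f_x = self.w_x(x_t).chunk(3, dim=1)
        i_h, g_h, f_h = self.w_h(h_prev).chunk(3, dim=1)
        i_t = torch.sigmoid(i_x + i_h)        # input gate, Eq. (4)
        g_t = torch.tanh(g_x + g_h)           # input modulation gate, Eq. (5)
        f_t = torch.sigmoid(f_x + f_h)        # forget gate, Eq. (6)
        c_t = f_t * c_prev + i_t * g_t        # temporal memory update, Eq. (7)
        return c_t
```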

2.3.2 Refine inputs by stored hidden state and memory state

For a specific timestep t, the hidden state is updated as follows:

$${o}_{t}=\sigma \left({W}_{xo}*{X}_{t}+{W}_{\text{ho}}*{H}_{t-1}^{l}\right),$$
(8)
$${\widehat{H}}_{t}^{l}={o}_{t}\odot \text{tanh}\left({C}_{t}^{l}\right),$$
(9)

where \({o}_{t}\) is the output gate, \({\widehat{H}}_{t}^{l}\) is the intermediate hidden state, \({W}_{xo}\) is the weight of the spatiotemporal data and \({W}_{\text{ho}}\) is the weight of the hidden state in the output gate. Then, the feature map \({\widehat{H}}_{t}^{l}\) is mapped to different feature spaces, as in transformer attention, to obtain the query \({Q}_{h}={W}_{hq}{\widehat{H}}_{t}^{l}\in {\mathbb{R}}^{C\times H\times W}\), the key \({K}_{h}={W}_{hk}{\widehat{H}}_{t}^{l}\in {\mathbb{R}}^{C\times H\times W}\), and the value \({V}_{h}={W}_{hv}{\widehat{H}}_{t}^{l}\in {\mathbb{R}}^{C\times H\times W}\). Likewise, the feature map \({M}_{t}^{l-1}\) is mapped to the key \({K}_{m}={W}_{mk}{M}_{t}^{l-1}\in {\mathbb{R}}^{C\times H\times W}\) and the value \({V}_{m}={W}_{mv}{M}_{t}^{l-1}\in {\mathbb{R}}^{C\times H\times W}\). Denoting the set of weights of the aforementioned 1 × 1 convolutions as \(\left\{{W}_{hq},{W}_{hk},{W}_{hv},{W}_{mk},{W}_{mv}\right\}\), the similarity between the memory state and the current key matrix can be computed by matrix multiplication as follows:

$${e}_{\Delta }={Q}_{h}^{T}{K}_{\Delta }\in {R}^{\left(H\times W\right)\times \left(H\times W\right)},\Delta \in \{h,m\}$$
(10)

where the superscript T denotes the transpose after flattening the spatial dimensions, and Q and K represent the query and key matrices obtained above. For the proposed STPNet, the above similarity can be expressed element-wise as:

$${e}_{h;i,j}=\left({\widehat{H}}_{t,i}^{lT}{W}_{hq}^{T}\right)\left({W}_{hk}{\widehat{H}}_{t,j}^{l}\right)$$
(11)
$${e}_{m;i,j}=\left({\widehat{H}}_{t,i}^{lT}{W}_{hq}^{T}\right)\left({W}_{mk}{M}_{t,j}^{l-1}\right)$$
(12)

where the value at location (i, j) of \({e}_{h}\) (\({e}_{\text{m}}\)) measures the similarity between the input and the hidden (memory) state matrix. After that, as in conventional transformer attention, all weights are normalized by a softmax function to obtain the attention matrix. For our proposed STPNet, softmax is applied along the column axis, and the resulting attention can be expressed as:

$${{\alpha }}_{\Delta ;i,j}=\frac{\text{exp}{e}_{\Delta ;i,j}}{{\sum }_{k=1}^{H\times W}\text{exp}{e}_{\Delta ;i,k}}, \Delta \in \{h,m\}, i, j\in \{1, 2 ,\cdots , H\times W\}.$$
(13)

where \({\alpha }\) denotes the obtained attention matrix. Finally, the inputs can be refined along the two pathways and aggregated as follows:

$${Z}_{\Delta ;i}={\sum }_{j=1}^{H\times W}{{\alpha }}_{\Delta ;i,j}{V}_{\Delta ;j}=\left\{\begin{array}{c}{\sum }_{j=1}^{H\times W}{{\alpha }}_{h;i,j}\left({W}_{hv}{\widehat{H}}_{t,j}^{l}\right)\\ {\sum }_{j=1}^{H\times W}{\alpha }_{m;i,j}\left({W}_{mv}{M}_{t,j}^{l-1}\right)\end{array}\right., \Delta \in \{h,m\}, i, j\in \{1 ,2 ,\cdots , H\times W\}.$$
(14)

where Z represents the feature refined by the hidden state and the memory state. Considering that the hidden state and the memory state may conflict, all features are fused by a concatenation operation (\(Z=[{W}_{z}\left[{Z}_{h};{Z}_{m}\right], {\widehat{H}}_{t}^{l}]\)) for the next processing step.
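The following is a minimal PyTorch sketch of this cross-attention refinement (Eqs. (10)–(14)), assuming 1 × 1 convolutions for the query/key/value projections and flattened spatial dimensions for the matrix products; batch handling and layer indices are simplified.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionRefineSketch(nn.Module):
    """Sketch of Eqs. (10)-(14): refine H_hat with itself and the memory M."""
    def __init__(self, ch):
        super().__init__()
        self.q_h = nn.Conv2d(ch, ch, 1)      # W_hq
        self.k_h = nn.Conv2d(ch, ch, 1)      # W_hk
        self.v_h = nn.Conv2d(ch, ch, 1)      # W_hv
        self.k_m = nn.Conv2d(ch, ch, 1)      # W_mk
        self.v_m = nn.Conv2d(ch, ch, 1)      # W_mv
        self.w_z = nn.Conv2d(2 * ch, ch, 1)  # W_z, fuses Z_h and Z_m

    @staticmethod
    def _attend(q, k, v):                    # q, k, v: [B, C, H, W]
        b, c, h, w = q.shape
        q, k, v = q.flatten(2), k.flatten(2), v.flatten(2)   # [B, C, HW]
        e = torch.einsum('bci,bcj->bij', q, k)   # similarity scores, Eqs. (10)-(12)
        a = F.softmax(e, dim=-1)                 # normalization over keys, Eq. (13)
        z = torch.einsum('bij,bcj->bci', a, v)   # weighted aggregation, Eq. (14)
        return z.view(b, c, h, w)

    def forward(self, h_hat, m_prev):
        q = self.q_h(h_hat)
        z_h = self._attend(q, self.k_h(h_hat), self.v_h(h_hat))
        z_m = self._attend(q, self.k_m(m_prev), self.v_m(m_prev))
        z = self.w_z(torch.cat([z_h, z_m], dim=1))
        return torch.cat([z, h_hat], dim=1)      # Z = [W_z[Z_h; Z_m], H_hat]
```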

2.3.3 Update hidden state and memory state

The remaining states, i.e., the spatiotemporal memory state \({M}_{t}^{l}\) and the hidden state \({H}_{t}^{l}\) of layer l at timestep t, are obtained from the spatiotemporal data \({X}_{t}\) at timestep t, the hidden state \({H}_{t-1}^{l}\) of layer l at timestep t − 1, the temporal memory state \({C}_{t}^{l}\) of layer l at timestep t, and the spatiotemporal memory state \({M}_{t}^{l-1}\) of layer l − 1 at timestep t. This process can be represented as follows:

$${i}_{t}{\prime}=\sigma \left({W}_{m;zi}*Z+{W}_{m;hi}*{\widehat{H}}_{t}^{l}\right),$$
(15)
$${g}_{t}{\prime}=\text{tanh}\left({W}_{m;zg}*Z+{W}_{m;hg}*{\widehat{H}}_{t}^{l}\right),$$
(16)
$${M}_{t}^{l}=\left(1-{i}_{t}{\prime}\right)\odot {M}_{t}^{l-1}+{i}_{t}{\prime}\odot {g}_{t}{\prime},$$
(17)
$${o}_{t}{\prime}=\sigma \left({W}_{m;zo}*Z+{W}_{m;ho}*{\widehat{H}}_{t}^{l}\right),$$
(18)
$${H}_{t}^{l}={o}_{t}{\prime}\odot \text{tanh}\left({M}_{t}^{l}\right),$$
(19)

where \(i_{t}^{\prime }\) is the extra input gate, Z is the fused feature, \(g_{{\text{t}}}^{\prime }\) is the extra input modulation gate, and \({o}_{t}{\prime}\) is the extra output gate. \({W}_{m;zi}\) and \({W}_{m;hi}\) are the weights of the fused feature and the hidden state in the extra input gate, \({W}_{m;zg}\) and \({W}_{m;hg}\) are the corresponding weights in the extra input modulation gate, and \({W}_{m;zo}\) and \({W}_{m;ho}\) are the corresponding weights in the extra output gate.
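A minimal sketch of this gated update of \({M}_{t}^{l}\) and \({H}_{t}^{l}\) (Eqs. (15)–(19)) is shown below, again assuming 2D convolutions in place of the convolution operator and a hypothetical kernel size of 5.

```python
import torch
import torch.nn as nn

class MemoryHiddenUpdateSketch(nn.Module):
    """Sketch of Eqs. (15)-(19): update the spatiotemporal memory M_t^l and hidden state H_t^l."""
    def __init__(self, z_ch, hid_ch, k=5):
        super().__init__()
        p = k // 2
        self.w_z = nn.Conv2d(z_ch, 3 * hid_ch, k, padding=p)    # W_m;zi, W_m;zg, W_m;zo
        self.w_h = nn.Conv2d(hid_ch, 3 * hid_ch, k, padding=p)  # W_m;hi, W_m;hg, W_m;ho

    def forward(self, z, h_hat, m_prev):
        zi, zg, zo = self.w_z(z).chunk(3, dim=1)
        hi, hg, ho = self.w_h(h_hat).chunk(3, dim=1)
        i_t = torch.sigmoid(zi + hi)               # extra input gate, Eq. (15)
        g_t = torch.tanh(zg + hg)                  # extra input modulation gate, Eq. (16)
        m_t = (1 - i_t) * m_prev + i_t * g_t       # spatiotemporal memory update, Eq. (17)
        o_t = torch.sigmoid(zo + ho)               # extra output gate, Eq. (18)
        h_t = o_t * torch.tanh(m_t)                # hidden state, Eq. (19)
        return h_t, m_t
```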

For a specific timestep t, the fused feature Z aggregates \({Z}_{h}\) and \({Z}_{m}\). It is obtained with exactly the cross-attention procedure of Sect. 2.3.2: \({\widehat{H}}_{t}^{l}\) is projected by 1 × 1 convolutions to the query, while both \({\widehat{H}}_{t}^{l}\) and \({M}_{t}^{l-1}\) provide key–value pairs; the similarity scores are normalized by a softmax and used to aggregate the values (Eqs. (10)–(14)), and the result is concatenated with \({\widehat{H}}_{t}^{l}\) to form \(Z=[{W}_{z}\left[{Z}_{h};{Z}_{m}\right], {\widehat{H}}_{t}^{l}]\).

3 Experiments

3.1 Datasets

Three popular spatiotemporal prediction datasets are selected to report model performance: Moving MNIST, KTH Action, and Radar Echo. Details of these datasets are described below:

3.1.1 Moving MNIST

Moving MNIST [25] is the most popular dataset for spatiotemporal prediction. The challenging part of this dataset is to infer the motion of digits moving with random velocities. Besides, occlusion between digits leads to missing local information and deepens the complexity of short-term changes, making spatiotemporal prediction even more difficult. Moving MNIST contains 10,000 video sequences, each consisting of 20 frames. In each sequence, the digits move independently within the frame, may overlap with each other, and bounce off the edges. The training set comprises 97,800 sequences and the validation/test set contains 24,920 sequences. In our experiments, the first 10 frames are used as input and the model is expected to forecast the next 10 frames.

3.1.2 KTH action

KTH Action [26] is a dataset for human action recognition; it contains six actions performed in different indoor and outdoor scenes: walking, jogging, running, boxing, hand waving, and hand clapping. To cover nuances in performance, each action is performed by 25 different people. These variations test the capacity of the models to recognize actions independently of participant size, participant appearance, and context. In our experiments, the frames are resized to a resolution of 128 × 128. Training was performed with 16 people and testing with the remaining 9 people. The training set comprises 108,717 sequences and the validation/test set contains 4086 sequences. For each time series, the last 20 frames are predicted from the first 10 frames to validate the model's long-term prediction capability.

3.1.3 Radar echo

Radar Echo is a contest dataset from the CIKM AnalytiCup 2017. It is used to evaluate models on short-term precipitation prediction; it contains radar data, peripheral precipitation, and meteorological information. Each radar map in this dataset covers 101 × 101 km² and contains precipitation totals for the target site over the following 1–2 h. Radar maps are provided for 15 time spans and 4 altitudes. In our experiments, we selected 10,000 continuous samples as the training set and another 4000 continuous sequences as the validation/test set. During model training, the first 10 frames are used to predict the next 15 frames.

3.2 Evaluation metrics

Four widely used evaluation metrics are used to assess the prediction results: the mean squared error (MSE), the structural similarity index measure (SSIM), the learned perceptual image patch similarity (LPIPS), and the peak signal-to-noise ratio (PSNR). For the Radar Echo dataset, the Heidke skill score (HSS), the critical success index (CSI), and the mean absolute error (MAE) are used to further compare performance. Details of these metrics are described below:

MSE measures the average squared difference between the ground-truth image and the predicted image. It is the most direct indicator of the reliability of the prediction results, and its formula can be written as:

$$\text{MSE}=\frac{1}{mn}{\sum }_{i=0}^{m-1}{\sum }_{j=0}^{n-1}{\left[R\left(i,j\right)-P\left(i,j\right)\right]}^{2}$$
(25)

where m and n denote the number of rows and columns of the images, and \(R\left(i,j\right)\) and \(P\left(i,j\right)\) denote the pixel values at location \(\left(i,j\right)\) of the ground-truth image and the predicted image, respectively.

SSIM is used to evaluate the structural similarity between two images; it quantifies the properties of an image in terms of brightness, contrast, and structure and thus compares their differences. Its value ranges from 0 to 1; a higher value denotes greater similarity between the images. This metric can be described as:

$$\left\{\begin{array}{c}l\left(x,y\right)=\frac{2{\mu }_{x}{\mu }_{y}+{c}_{1}}{{\mu }_{x}^{2}+{\mu }_{y}^{2}+{c}_{1}}\\ c\left(x,y\right)=\frac{2{\sigma }_{x}{\sigma }_{y}+{c}_{2}}{{\sigma }_{x}^{2}+{\sigma }_{y}^{2}+{c}_{2}}\\ s\left(x,y\right)=\frac{{\sigma }_{xy}+{c}_{3}}{{\sigma }_{x}{\sigma }_{y}+{c}_{3}}\end{array}\right.$$
(26)
$$\text{SSIM}\left(x,y\right)=\left[l{\left(x,y\right)}^{\alpha }\cdot c{\left(x,y\right)}^{\beta }\cdot s{\left(x,y\right)}^{\gamma }\right]$$
(27)
$$\text{SSIM}\left(x,y\right)=\frac{\left(2{\mu }_{x}{\mu }_{y}+{c}_{1}\right)\left(2{\sigma }_{xy}+{c}_{2}\right)}{\left({\mu }_{x}^{2}+{\mu }_{y}^{2}+{c}_{1}\right)\left({\sigma }_{x}^{2}+{\sigma }_{y}^{2}+{c}_{2}\right)}$$
(28)

where l, c, and s represent the quantitative results for the luminance, contrast, and structure of an image. In these formulas, x and y denote the ground-truth image and the predicted image, \({\mu }_{i}\) and \({\sigma }_{i}^{2}\) are the mean and variance of the pixels in image \(i\), and \({\sigma }_{xy}\) is the covariance between x and y. To avoid a zero denominator, \({c}_{1}\), \({c}_{2}\), and \({c}_{3}\) are constants used to maintain stability, set to 0.0001, 0.0009, and 0.00045, respectively. Moreover, \(\alpha\), \(\beta\), and \(\gamma\) are set to 1 for a fair comparison of luminance, contrast, and structure.

PSNR is a reference value for image quality that measures the difference between the maximum signal and the background noise. The unit of PSNR is dB, where a larger value denotes less image distortion. Its formula can be constructed from MSE as:

$$\text{PSNR}=10\cdot {\text{log}}_{10}\left(\frac{{\text{MAX}}_{\text{R}}^{2}}{\text{MSE}}\right),$$
(29)

where \({\text{MAX}}_{\text{R}}\) is the maximum pixel value of the ground-truth image R.
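As a concrete reference, a minimal NumPy sketch of Eqs. (25) and (29) is given below, assuming 8-bit images so that \({\text{MAX}}_{\text{R}}=255\) and a nonzero MSE.

```python
import numpy as np

def mse_psnr(real, pred, max_val=255.0):
    """Compute MSE (Eq. (25)) and PSNR (Eq. (29)) for a pair of images."""
    real = real.astype(np.float64)
    pred = pred.astype(np.float64)
    mse = np.mean((real - pred) ** 2)              # average squared pixel error
    psnr = 10.0 * np.log10(max_val ** 2 / mse)     # larger PSNR = less distortion
    return mse, psnr
```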

LPIPS measures the difference between two images and is better aligned with human perception than traditional metrics. The lower the LPIPS value, the more similar the two images. Given a ground-truth reference block \(x\) and a distorted block \({x}_{0}\), the distance between \(x\) and \({x}_{0}\) can be described as:

$$d\left(x,{x}_{0}\right)={\sum }_{l}\frac{1}{{H}_{l}{W}_{l}}\sum_{h,w}{\Vert {w}_{l}\odot ({\widehat{y}}_{hw}^{l}-{\widehat{y}}_{0hw}^{l})\Vert }_{2}^{2}.$$
(30)

where \({\widehat{y}}_{hw}^{l}\) and \({\widehat{y}}_{0hw}^{l}\) are the features generated by the l-th layer from \(x\) and \({x}_{0}\), \({w}_{l}\in {R}^{{C}_{l}}\) is the corresponding weight used to rescale the features for the \({{\ell}}_{2}\) distance calculation, and \({H}_{l}\) and \({W}_{l}\) represent the height and width of the feature generated by the l-th layer.

The Heidke skill score (HSS) and the critical success index (CSI) are two metrics specific to the evaluation of radar echo prediction results. A linear conversion is adopted for the Radar Echo dataset: each pixel value is first converted to a radar reflectivity factor (dBZ) for rainfall.

$$\text{dBZ} = \text{pixel}\_\text{value} \times 95/255 - 10.$$
(31)

Then, 5 dBZ, 20 dBZ, and 40 dBZ are chosen as the thresholds to generate predicted rainfall map. After that, the counts for true positive predictions TP (when prediction and truth are both 1), true negative predictions TN (when prediction and truth are both 0), false positive predictions FP (when prediction is 1 but truth is 0), and false negative predictions FN (when prediction is 0 but truth is 1) are computed. Finally, HSS and CSI can be calculated as follows:

$$\text{HSS}=\frac{2\left(\text{TP}\times \text{TN}-\text{FN}\times \text{FP}\right)}{\left(\text{TP}+\text{FN}\right)\left(\text{FN}+\text{TN}\right)+\left(\text{TP}+\text{FP}\right)\left(\text{FP}+\text{TN}\right)},$$
(32)
$$\text{CSI}=\frac{\text{TP}}{\text{TP}+\text{FN}+\text{FP}}.$$
(33)

Apart from CSI and HSS, we also select MAE as a supplementary indicator for the prediction of radar echo precipitation processes. Its formula can be expressed as:

$$\text{MAE}=\frac{1}{mn}{\sum }_{i=0}^{m-1}{\sum }_{j=0}^{n-1}\left|\left[R\left(i,j\right)-P\left(i,j\right)\right]\right|.$$
(34)

where m and n denote the number of rows and columns of the images, and \(R\left(i,j\right)\) and \(P\left(i,j\right)\) denote the pixel values at location \(\left(i,j\right)\) of the ground-truth image and the predicted image after filtering by the thresholds.
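Putting Eqs. (31)–(33) together, the Radar Echo scoring procedure can be sketched as follows (a minimal NumPy version; handling of zero denominators is omitted for brevity):

```python
import numpy as np

def radar_scores(pred, truth, threshold_dbz):
    """Sketch of the Radar Echo evaluation: Eq. (31) conversion, then HSS and CSI (Eqs. (32)-(33))."""
    # convert pixel values to radar reflectivity (dBZ), Eq. (31)
    pred_dbz = pred * 95.0 / 255.0 - 10.0
    truth_dbz = truth * 95.0 / 255.0 - 10.0
    # binarize at the chosen threshold (5, 20, or 40 dBZ)
    p = pred_dbz >= threshold_dbz
    t = truth_dbz >= threshold_dbz
    tp = np.sum(p & t)          # prediction 1, truth 1
    tn = np.sum(~p & ~t)        # prediction 0, truth 0
    fp = np.sum(p & ~t)         # prediction 1, truth 0
    fn = np.sum(~p & t)         # prediction 0, truth 1
    hss = 2.0 * (tp * tn - fn * fp) / ((tp + fn) * (fn + tn) + (tp + fp) * (fp + tn))
    csi = tp / (tp + fn + fp)
    return hss, csi
```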

3.3 Experiments setting

Considering that different machines may lead to differences in results, all models are trained from scratch for a fair comparison. We selected 8 methods to compare against: ConvLSTM [13], PredRNN [15], PredRNN++ [16], PredRNN-v2 [18], MIM [17], ConvGRU [14], MMVP [20], and SimVP [19]. All models are configured following their official recommendations. For a fair comparison and to stay consistent with our recording method, MMVP [20] and SimVP [19] are reimplemented from the OpenSTL framework [27] in our own framework. Moreover, all models are trained for 80,000 iterations in total, and the performance at every 5000 iterations is reported. For the proposed STPNet and the reimplemented models, the learning rate is uniformly set to \({10}^{-4}\), and Adam with momentum range (0.9, 0.999) is chosen as the optimizer. During training, scheduled sampling is applied for the first 50,000 iterations, and the sampling change rate is uniformly set to \({2\times 10}^{-5}\). Details of the dataset settings are shown in Table 1. Unless otherwise specified, all other parameters remain consistent with the above. All procedures are implemented in Python with PyTorch (2.0.1) on a 64-bit machine with 128 GB RAM and an NVIDIA RTX A6000 GPU with 48 GB of memory.
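The optimizer and scheduled-sampling settings described above can be sketched as follows; the placeholder model and the omitted data loading and loss computation are assumptions, and only the hyperparameters follow this section.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for STPNet; only the optimizer and
# scheduled-sampling settings below follow the configuration in the text.
model = nn.Conv2d(1, 1, 3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))

eta = 1.0                                     # probability of feeding ground-truth frames
for iteration in range(80_000):
    if iteration < 50_000:
        eta = max(0.0, eta - 2e-5)            # scheduled sampling, change rate 2e-5 per iteration
    else:
        eta = 0.0                             # fully autoregressive after 50,000 iterations
    # ... sample a batch, mix ground-truth/predicted frames with probability eta,
    #     compute the loss, backpropagate, and call optimizer.step() ...
    if (iteration + 1) % 5000 == 0:
        pass                                  # evaluate and record metrics
```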

Table 1 Information of datasets in experiments

3.4 Experiments results and analysis

3.4.1 Ablations about model architecture

To obtain the best balance between training efficiency and prediction accuracy, comparisons of model performance under various architecture scenarios are conducted on Moving MNIST. We explored both the depth and the width of the model to realize the best potential of STPNet. Specifically, we define the number of channels of the features transferred between STA-LSTM units as the depth and the number of layers in STPNet as the layer count. It is worth noting that all ablations are conducted on the Moving MNIST dataset and that the floating point operations (FLOPs) are reported for an input size of 64 × 64. As shown in Table 2, STPNet with depth 256 and 4 layers achieves the best performance. Besides, the performance of the model gradually improves as both the depth and the number of layers increase. For example, the MSE decreases from 61.252 to 51.158 when the depth doubles. The same holds for the layers: the MSE gradually decreases from 49.130 to 47.400 as the number of layers increases from 2 to 4. Nonetheless, the increase in STPNet parameters and computational complexity is not simply linear. When the depth doubles, both the parameter count and the computational complexity increase by about 4 times; when the number of layers doubles, both increase by about 2 times. Considering real application scenarios, we select the STPNet configuration with a high accuracy ranking but fewer parameters (depth 256, 4 layers) to compare with other state-of-the-art methods in the following experiments.

Table 2 Performance of STPNet with different width and depth
3.4.1.1 Comparison on moving MNIST

To clearly demonstrate the effect of the model architecture, the metrics of all models during training are plotted as curves to compare their convergence speed. As shown in Fig. 4, a comparison of each metric over the training iterations is presented. It can be seen that the convergence of all metrics of STPNet is faster and smoother than that of all other models. Additionally, the upper limit of the proposed STPNet is also better than that of the other methods; up to the maximum number of iterations, it continued to advance steadily. In particular, STPNet was ahead of the other models in terms of the MSE and PSNR metrics from the beginning. The training curves indicate that STPNet has high potential for the dynamic modeling of small-sample data. According to Table 3, STPNet is the best on all indicators. The proposed STPNet exhibits decreases of 24.5%, 12.3%, 15.9%, 10.2%, 2.6%, 44.48%, and 34.68% in MSE and increases of 6.6%, 5.1%, 3.3%, 1.5%, 0.9%, 8.4%, and 18.53% in SSIM compared to ConvLSTM, MIM, PredRNN, PredRNN-v2, PredRNN++, MMVP, SimVP and IDA-LSTM, respectively. In terms of the LPIPS metric, which is more reliable than the other metrics for human perceptual evaluation, STPNet shows the most impressive performance. This improvement is also corroborated by Fig. 5, where the predicted results of STPNet are the most visually similar to the true future frames.

Fig. 4
figure 4

The performance of each model was assessed through iterations, using metrics including MSE, SSIM, LPIPS, and PSNR for comparison

Table 3 Performance on the test set of Moving MNIST
Fig. 5
figure 5

Samples of predictions made on the Moving MNIST test dataset

The memory footprint, the number of parameters, and the FLOPs are all reported for a Moving MNIST-like tensor of shape [1, T, 1, 64, 64], where T is the length of the input sequence.

3.4.1.2 Comparison on KTH Action

Similar to the experiments on Moving MNIST, we also report the performance of the models on KTH Action. As shown in Table 4, the best performance on the KTH Action dataset is still achieved by STPNet. Specifically, the PSNR is improved by 27.7% (from 23.58 to 30.11) compared to ConvLSTM and by 1.2% (from 29.75 to 30.11) compared to PredRNN-v2. The visualization also illustrates the effectiveness of the proposed method. It can be observed from Fig. 6 that STPNet produces clearer predictions for long-term forecasts, indicating its superiority in generating high-fidelity images and significantly improving both long-term and short-term forecasts, resulting in more precise predictions of upcoming movements and body features. Additionally, STPNet achieves the best performance on all metrics among all compared models, demonstrating its state-of-the-art (SOTA) performance.

Table 4 Performance on the test set of KTH Action
Fig. 6
figure 6

Samples of predictions made on the KTH Action test set

3.4.1.3 Comparison on Radar Echo

Table 5 shows the evaluation results on the Radar Echo dataset under different thresholds. The curves in Fig. 8 show the variation of the HSS and CSI scores for nowcasting precipitation prediction under different thresholds. To aid in understanding and contrasting the outcomes, several prediction results of the different approaches on the Radar Echo dataset are depicted in Fig. 7: the first five images in the top row correspond to the inputs, the remaining images in that row are the ground-truth outputs, and the subsequent rows display the predictions of the different models. As shown in Fig. 8, STPNet outperforms the other models significantly at both the 5 and 20 dBZ thresholds. It also achieves the best results in terms of average CSI, while its average HSS is second only to PredRNN++. Figure 8 illustrates that our approach holds a consistent lead at both the 5 and 20 dBZ thresholds at all lead times, demonstrating the substantial advantage of STPNet. Figure 7 illustrates that STPNet better captures long-term and short-term dependencies thanks to its cross attention, which better preserves the high echo value regions. PredRNN, which utilizes only spatial memory units, can only partially preserve the echo value region. However, IDA-LSTM slightly outperforms STPNet on the HSS and CSI metrics. Nonetheless, considering that the results on this dataset are far from ideal (SSIM of only 0.347 and 0.338), we consider this gap negligible.

Table 5 Performance on the test set of Radar Echo
Fig. 7
figure 7

Samples of predictions made on the Radar Echo test set

Fig. 8
figure 8

Comparison of HSS and CSI scores under different thresholds, used to investigate the performance of nowcasting precipitation prediction

In general, the above three cross-comparison experiments on the Moving MNIST, KTH Action, and Radar Echo datasets show that STPNet achieves the best performance in terms of SSIM, with improvements of 0.78%, 0.23%, and 4.97%, respectively, over the best model in the comparison group. Collectively, STPNet achieves the best performance on the Moving MNIST, KTH Action, and Radar Echo datasets.

3.4.1.4 Ablations about dual attention strategy

To verify the ability of the dual attention to learn higher-order nonlinear dependencies, we also evaluated STPNet without SA-Net in a comprehensive experiment, reported as STPNet w/o SA-Net in Tables 3, 4 and 5. From the results, it can be seen that STPNet with SA-Net leads to significant improvements in terms of MSE, SSIM, PSNR, and LPIPS. Across the datasets, the inclusion of dual attention allows the model to further learn higher-order spatiotemporal dynamic dependencies and thus further enhances its dynamic modeling capability. In particular, on the Moving MNIST dataset, the MSE is reduced by 22.14%, 9.52%, 13.29%, 7.33%, 32.62% and 42.74% compared to ConvLSTM, MIM, PredRNN, PredRNN-v2, SimVP and MMVP, second only to PredRNN++. On the KTH Action dataset, the PSNR is improved by 26.34%, 8.13%, 4.64%, 0.13%, 9.16% and 7.58% compared to ConvLSTM, PredRNN, PredRNN++, PredRNN-v2, SimVP and MMVP. The average HSS and CSI are also higher than those of PredRNN and PredRNN-v2 in radar precipitation prediction. Summing up the above improvements, even without SA-Net, STPNet extracts spatiotemporal process features better than the baseline models. This analysis further verifies the effectiveness of the attention strategy and the superiority of the lightweight dual attention in learning higher-order nonlinear dependencies.

3.5 Applicability

We selected three representative datasets to validate STPNet's long-term and short-term modeling capabilities. The leading performance of its iterative training process was verified in detail by comparing it with other models on the Moving MNIST dataset. Using the KTH Action dataset for long-timestep prediction, STPNet was verified to have better long-term prediction capability. Using the Radar Echo dataset, it was verified that STPNet has the best comprehensive performance for short-term prediction. The prediction performance of STPNet under different thresholds was also examined: although accuracy decreases as the threshold increases, STPNet achieves the best overall prediction performance for radar precipitation when all indicators are combined. The experimental results indicate that, by incorporating its attention mechanisms, STPNet has an enhanced ability to learn complex spatiotemporal dependencies and demonstrates superior dynamic modeling capability. STPNet is capable of effective prediction for a variety of spatiotemporal process tasks and performs particularly well on complex geographical spatiotemporal process prediction.

3.6 Limitation

Although the proposed model improves the prediction performance, STPNet still has some limitations compared to other methods. First, the proposed model is designed around cross attention. Although the first layer is a lightweight dual attention, which avoids the repeated stacking within the unit that would add too many parameters and too much computation, there is still room for improvement in predicting high-echo-value regions. Second, despite the selection of multiple datasets for validation, a fuller evaluation could be performed by applying the model to other study areas or additional periods of geographic spatiotemporal datasets.

4 Conclusions

The present study proposes STPNet as a novel approach for spatiotemporal process prediction. Recognizing that past features can significantly improve the accuracy of predictions at the current timestep, we developed a network framework that integrates a dual attention mechanism to capture higher-order spatiotemporal dependencies. Additionally, the architecture incorporates a cross attention memory module and supplementary memory units designed to effectively capture long-term dependencies across both spatial and temporal dimensions. These elements enable the model to recognize and retain location-specific features that are distant from other critical locations in the dataset, while also learning and memorizing global spatiotemporal relationships. Although ablation experiments confirmed the efficacy of the self-attention mechanism in extracting spatiotemporal features and of the dual attention approach in modeling higher-order nonlinear dependencies, there is still room to optimize the complexity and size of the model. In future work, we will investigate simpler and more efficient deep learning architectures for spatiotemporal prediction tasks.