1 Introduction

The widespread use of social media platforms such as Facebook, Twitter and Instagram has led to the formation of complete industrial chains of viral marketing [1]. For instance, e-commerce companies are inclined to utilize viral marketing as a promotional strategy for their innovative products promotion [2]. This involves identifying high-quality users on the platforms and personalizing their advertisements in order to enhance the appeal of the products to the targeted followers. Such phenomena are termed as influence maximization (IM) in social network analysis, which has been proven to be a NP-hard problem [3]. Its goal is to identify a specific subset of influential users to maximize the influence spread of the product information, innovative ideas and even the authoritative opinion within the social networks. Given its practical and commercial significance, it has been employed widely in numerous real-world scenarios including viral marketing, rumors suppression, revenue maximization, opinion formation, collective decision-making, and even political movements [4, 5].

Kempe et al. [3] put forth a natural hill-climbing greedy strategy for the first time to address the IM problem. It selects one node with the most marginal return in each round by estimating the expected influence of the remaining nodes based on tens of thousands of Monte Carlo simulations. Theoretical outcomes indicate that the greedy-based method can yield a solution with an accuracy of not less than \((1 - 1/e - \epsilon )\) (\(\approx \)63%) of the optimal solution. However, the high time complexity resulted by the extensive Monte Carlo simulations limits its applications in large-scale networks. In contrast, centrality-based methods [6,7,8], which return the precise rankings of the node based on its position in the network topology, are efficient in terms of computational cost. However, such measurements fail to provide performance guarantees in influence spread because they consider solely the local or global structural features of the network or node attributes, while neglecting the overlapping influence. Over the last few years, meta-heuristic algorithms [9,10,11] have been developed to explore new efficient solutions by replacing the time-consuming Monte Carlo mechanisms with formulated fitness evaluating functions. Such algorithms employ discrete evolutionary mechanisms to identify the optimal seed set by optimizing the fitness functions in a more promising way. Yet the existing meta-heuristics tend to overly focus on a single metric such as solution accuracy while neglecting others like efficiency, especially in dealing with large-scale networks [12]. Therefore, balancing the solution quality and computational cost remains a challenge in designing meta-heuristic algorithms for the IM problem.

As one of the seminal meta-heuristics, differential evolution (DE) has demonstrated its significant advantages in solving the IM problem due to its strong global search capability and high robustness [13]. Its fundamental framework is to optimize the fitness of the expected influence spread of the candidate node sets iteratively through population evolution strategies based on mutation, crossover and selection operations. However, the performance of the algorithm is affected by its myopic evolutionary rules, and the high computational cost of fitness evaluation further limits its efficiency in large-scale networks. To address the problems, a novel competitive learning-driven differential evolution (CLDE) is proposed. Specifically, a competitive mechanism divides the individuals in the population into pairs randomly, in which individuals compete based on fitness value, with the winners forming an excellent subpopulation and the losers forming a common subpopulation. Such strategy is conceived to enhance the diversity of the solutions. It then implements the global exploration and local exploitation separately on the two subpopulations to speed up the convergence of the algorithm. Meanwhile, an adaptive probabilistic updating-driven local search strategy is designed to guide the evolution of the population to avoid being influenced by the network structure. The primary contributions of this paper can be summarized as follows:

  • To the best of our knowledge, the competitive mechanism of dividing the population into an excellent subpopulation and a common subpopulation by random pairing is employed in the discrete evolution optimization for the first time to address the IM problem.

  • A global exploration based on local-influence-descending and a local exploitation within node neighborhoods are conceived for different subpopulations to accelerate the convergence of the proposed CLDE.

  • An adaptive local search strategy is designed based on the updating probability for each individual in the common subpopulation to enhance the stability and robustness of the algorithm.

The remainder of the paper is organized into several sections. Section 2 gives a brief review of the related work on IM problem. Section 3 provides the preliminary information. The details of the proposed CLDE are presented in Sect. 4. Section 5 is devoted to the design and analysis of experiments. Section 6 discusses the advantages and the potential weak spots of the CLDE, which presents the future work of our work. Section 7 concludes this work.

2 Related work

Existing methods for the IM problem can be categorized into four primary groups: greedy-based, topological-based, machine learning-based, and meta-heuristic-based methods [14, 15]. Greedy-based algorithms require a substantial number of Monte Carlo simulations to evaluate the marginal gain of each node, so high computational costs and low efficiency are two typical challenges for the greedy algorithms [3, 16]. Compared to the simulation-based greedy algorithms [17], sampling-based greedy algorithms [18, 19] are more efficient; however, the time complexity of sampling-based greedy algorithms remains a concern. Topological-based methods are a class of metrics that utilize the topological structure of networks directly to assess the importance of the nodes and guide the search or optimization processes [20, 21]. Once the network topology is given, the seed node set based on topology-based methods is determined. Although such methods are efficient, they lack theoretical guarantees for their solutions. Since users in real-world networks tend to cluster, it also leads to the issue of overlapping influence inevitably [22]. In addition, machine learning-based models have demonstrated their high efficiency when dealing with large-scale networks [23, 24]. However, the high dependence on datasets leads to a decline in their performance when datasets are insufficient or of low quality. Meanwhile, such approaches tend to be often very time-consuming during the training process, which increases the computational costs [25].

As a new branch of the artificial intelligence, meta-heuristic algorithms do not rely on the mathematical model of the problem but instead seek optimal or near-optimal solutions by simulating the evolutionary principles observed in natural, social or physical phenomena [26]. In combinatorial optimization, meta-heuristic algorithms such as particle swarm optimization, genetic algorithm, simulated annealing and tabu search have been widely applied to the problems such as scheduling, routing and facility layout [27,28,29]. To address the NP-hard IM problem, Jiang et al. [30] made the first attempt by developing a metric called expected diffusion value to estimate the influence spread of candidate nodes. It was the first meta-heuristic algorithm applied to solve the IM problem, demonstrating a speed improvement of 2\(\sim \)3 orders of magnitude compared to traditional greedy algorithms. The meta-heuristic algorithm can find the optimal solution quickly due to the use of the optimization function. Chawla and Cheney [31] proposed a neighbor-hop mutation strategy for the genetic algorithm to enhance the optimization of influence maximization by incorporating network topology into the search process. Furthermore, Cunegatti et al. [32] and Konotopska and Iacca [33] explored graph-aware evolutionary algorithms that leverage topological information to ensure better propagation. Singh et al. [34] proposed a learning automata based discrete particle swarm optimization (LAPSO-IM) to address the issue of premature convergence of the original work on the IM problem [35]. Nevertheless, due to the need for the recalculation of probability and the reassessment of actions in the reinforcement process, it results in poor efficiency easily.

Tang et al. [36] investigated seed activation strategies for the adaptive influence maximization problem and proposed a novel bio-inspired meta-heuristic called discrete scheduled particle swarm optimization to address the challenge of determining optimal seeding times during the influence spreading. Considering the unreliable inherent to transmission in social network connections, Wang et al. [10] proposed an enhanced discrete moth-flame optimization algorithm and developed a new influence estimation model based on neighboring nodes to assess the expected spread influence of node sets effectively. While the algorithm demonstrates high influence spread and robustness, it incurs a significant time cost inevitably. To avoid being trapped into local optima in solving the IM problem, Khatri et al. [37] proposed a discrete Harris hawks optimization algorithm. For networks with community structures, the algorithm introduces a novel neighborhood scouting strategy and an adaptive random population initialization method to enhance the performance. Recently, Wang et al. [38] proposed a discrete particle swarm optimization to solve the IM problem in multilayer networks. By designing novel position updating rules and velocity updating strategies, the authors addressed the complex information propagation in multilayer networks in a more efficient way.

As an effective meta-heuristic, differential evolution operates on the differences among individuals in a population and employs mechanisms including the differential mutation, crossover and selection to generate new offsprings to achieve global optimization in the search space. The DE algorithm has achieved significant results in various fields, including large-scale optimization [39, 40] and multi-objective optimization [41, 42]. In addressing the IM problem, Qiu et al. [13] proposed a novel search strategy named local influence descending and an expected diffusion impact value function to save the running time and improve the influence spread. It symbolizes the first application of the DE algorithm to the IM problem. As for the IM problem, since the probability of most nodes in the network being included in the optimal seed set is typically zero, Biswas et al. [43] proposed a two-stage VIKOR-assisted multi-operator differential evolution algorithm. To enhance the algorithm’s capability of escaping from local optima, Zhu et al. [44] developed a phased evaluation-enhanced approach that integrates the random range search mechanism named RandRDE to enhance the diversity of solutions. Furthermore, an accelerated convergence strategy is utilized to facilitate an efficient search for optimal solutions.

Based on the existing research, it is evident that meta-heuristic algorithms demonstrate promising prospects in addressing the IM problem. However, several key challenges remain to be further explored: (1) researchers need to make a trade-off between the influence spread and computation time; (2) the scalability issues in large-scale networks need to be addressed [45]; (3) developing effective methods to model information propagation in multilayer networks; (4) improving the adaptability of parameter settings. In this context, competitive learning mechanisms offer new insights for the development of novel effective and efficient meta-heuristics. Competitive learning was first introduced by Cheng and Jin [46] to solve the continuous optimization, where a competition mechanism among solutions was implemented to promote population diversity. Du et al. [47] further combined competitive and cooperative learning to improve the performance on complex optimization problems. This mechanism can maintain the population diversity and facilitates the convergence in a better way. It demonstrates promising superiority in solving the combinatorial optimization problems. Therefore, incorporating competitive learning mechanisms into the meta-heuristic algorithms for the IM problem provides new avenues to enhance algorithm performance.

3 Preliminaries

3.1 Problem statement

A given social network is typically defined as a graph \(G = (V, E)\), where V represents the nodes and E represents the connections between the nodes. The key objective of the IM problem is to identify an influential node set of size k as the initial seed set S that can maximize the influence spread through the propagation capability of the network. Specifically, it can be formulated as Eq. (1).

$$\begin{aligned} \left\{ \begin{aligned} S^{*} = \arg \max \sigma (S) \\ subject \hspace{5.0pt}to \hspace{5.0pt}|S| = k \end{aligned} \right. \end{aligned}$$
(1)

where S is the set with k candidate nodes. \(\sigma (S)\) is the expected number of nodes activated by the set of candidate nodes S. \(S^*\) is the optimal seed set with the maximum influence spread.

3.2 Independent cascade model

In the IC model, at any given timestamp, each node is in one of two states: inactive or active. According to the rules, when node u is activated at time \(t-1\), it has a single opportunity to activate its inactive neighbor v under propagation probability \(p_{uv}\) at time t. Regardless of whether node v is activated successfully by node u, node u will not have chance to activate other nodes in following steps. If node v is activated successfully, it will have an opportunity to activate its direct inactive neighbors at time \(t+1\). The diffusion process terminates if no new nodes are activated at time T.

3.3 Expected diffusion estimation function

To replace the expensive Monte Carlo simulation, Qiu et al. [13] proposed a novel objective function called expected diffusion impact value (EDIV) function, which is designed to assess the potential influence spread of a set of nodes. This function integrates the local influence from the local influence value (LFV) function and incorporates the diffusion value in two hops. The formula for calculating the probability of activating a node by the candidate seed set is given in Eq. (2).

$$\begin{aligned} PS(u)=1-\prod _{v \in Pe(u) \cap S}(1-p_{vu}) \end{aligned}$$
(2)

where v and u represent two nodes, PS(u) represents the probability of node u being activated by node set S, Pe(u) denotes the one-hop neighbors of node u, \(p_{vu}\) denotes the activation probability from v to u.

The objective function is comprised of the original influence: the one-hop influence and the two-hop influence. The original influence can be defined as the initial advantage offered by the selected nodes, which is represented by the size of the node set. Then, the one-hop influence is determined by the amalgamation of the diffusion value of the nodes’ direct neighbors with their respective local influence. LFV function is evaluated according to Eq. (3).

$$\begin{aligned} LFV(u) = 1+\sum _{v \in N_u^ {(1)} \backslash {\{u\}}}\left(p_{uv}+p_{uv}\cdot \sum _{s \in N_v^ {(1)} \backslash {\{u,v\}}}p_{vs}\right) \end{aligned}$$
(3)

where LFV(u) represents the local influence of node u, \(N_u^{(1)}\) and \(N_v^{(1)}\) denote the first-order neighbors of nodes u and v, respectively. The variables \(p_{uv}\) and \(p_{vs}\) correspond to the activation probabilities from node u to node v and from node v to node s, respectively. The beneficial influence that the set S exerts on its direct neighboring nodes can be evaluated by Eq. (4).

$$\begin{aligned} f_1(S)=\sum _{u \in N_S^{(1)} \backslash S}PS(u) \cdot LFV(u) \end{aligned}$$
(4)

where \(N_S^{(1)}\) is the set of direct neighbor nodes of the set S.

When calculating the two-hop influence, the method takes into account the complicated interactions among the two-hop nodes, the activation status of the one-hop nodes and the diffusion capability of the two-hop nodes. It can accurately estimate the two-hop influence based on Eq. (5).

$$\begin{aligned} f_2(S)=\sum _{u \in N_S^{(2)}-N_S^{(1)}-S}\left[1-\prod _{v \in Pe(u) \cap N_S^{(1)}}(1-p_{vu} \cdot PS(v))\right] \cdot LFV(u) \end{aligned}$$
(5)

where \(f_2(S)\) represents the expected influence spread of S on its two-hop nodes, \(N_S^{(2)}\) represents the two-hop neighbor nodes of S.

The CLDE utilizes the three aforementioned computational components to estimate the influence spread. It employs three different parameters to depict the importance of the above three computational parts. The EDIV value is computed by Eq. (6).

$$\begin{aligned} EDIV(S)=\alpha \cdot |S|+\beta \cdot f_1(S)+\gamma \cdot f_2(S) \end{aligned}$$
(6)

where |S| represents the size of the node set S, \(f_1(S)\) denotes the one-hop influence of S, \(f_2(S)\) represents the two-hop influence of S. \(\alpha \), \(\beta \) and \(\gamma \) denote the relative importance of the three parts in propagating the influence of S.

4 Proposed method

4.1 The framework of CLDE

Figure 1 shows the flowchart of the proposed CLDE for the IM problem. It comprises four main components: parameter initialization, population partitioning, subpopulation evolution, individuals recombination and selection.

Fig. 1
figure 1

The framework of CLDE

The details are elaborated as follows:

  1. (1)

    At the initial stage, the algorithm generates the first generation of individuals to initialize the DE population. Each individual represents a candidate node set that symbolizes a potential solution for the influence maximization.

  2. (2)

    The population partitioning operation employs a competitive mechanism to divide the population into an excellent subpopulation and a common subpopulation in accordance with the fitness value of each individual in the current generation.

  3. (3)

    The excellent subpopulation evolves through global exploration based on the rules of differential mutation and crossover operations. The differential mutation alters specific number of individuals based on the differential set to generate the mutated individuals. The crossover process involves the mutated individuals with the original individuals of the current generation in accordance with the specified crossover probability to get a crossover subpopulation.

  4. (4)

    For each individual in the common subpopulation, it searches in the entire neighborhood or the neighborhood of a specific individual in a local exploitation way according to different local search probability, and replaces the suboptimal individuals with the ones with larger degree.

  5. (5)

    Merge the excellent subpopulation after the crossover and mutation with the common subpopulation obtained by the local search to form a recombined population for the selection process.

  6. (6)

    After the recombination operation, the change in the average fitness value of the population is calculated. If the change is less than or equal to zero, the evolutionary algorithm is considered to have converged, and the internal loop is terminated. Otherwise, return to steps (3) and (4).

  7. (7)

    The selection process compares the fitness value of the individuals from the previous generation with that of the recombined individuals. Individuals with higher fitness value are selected into the next generation.

  8. (8)

    Check the termination condition: whether the iteration counter has reached the predefined maximum iterations \(g_{\max }\). If satisfied, terminate the evolution and output the optimal solution; otherwise, return to step (2).

Algorithm 1 presents the pseudocode of the CLDE for the IM problem.

Algorithm 1
figure a

Framework of CLDE for IM problem

4.2 Population initialization

The CLDE first utilizes the LFV mechanism, defined in Eq. (3), to assess the local influence of each node in the network and sorts the nodes in descending order based on their local influence. The algorithm selects one node every k nodes and adds it into the current candidate node set. Once the requisite size of the current individual has been reached, the selection process ceases and the next individual begins to generate. Algorithm 2 presents the pseudocode of the aforementioned process.

Algorithm 2
figure b

Initialization (Gpopk)

4.3 Population partition

To balance between the global exploration and local exploitation, the proposed CLDE employs a competition mechanism that randomly partitions the population into subpopulations with the same size. As described in [46, 47], in each iteration, individuals in the population \(X=\{X_1,X_2\ldots X_{pop}\}\) are randomly divided into pop/2 pairs (assuming the population size pop is even), and the competition is held between the two individuals in each pair. The individual with better fitness is considered as the winner of the competition. For a population of size pop, competitions occur among pop/2 pairs, with all pop individuals in one single round of competitions. In CLDE, all the winning individuals form the excellent subpopulation \(XE=\{XE_{1},XE_{2}\ldots XE_{pop/2}\}\), while the remaining ones form the common subpopulation \(XC=\{XC_{1},XC_{2}\ldots XC_{pop/2}\}\). The mechanism helps maintain the diversity of the population, as even individuals with lower fitness have the chances to be assigned to the excellent subpopulation through random pairing, thus preventing the population from homogenization and reducing the risk of premature convergence.

4.4 Excellent subpopulation evolution

4.4.1 Local-influence-descending search strategy

To accelerate the convergence speed of the CLDE, this paper employs the local-influence-descending (LID) search strategy [13] for the mutation and crossover operations of the population. The search strategy aims at selecting a set of nodes with relatively high influence. Then it randomly selects alternative nodes from the set to update the population. Selecting nodes with largest LFV can obtain the local optimal solutions, whereas local optimality does not guarantee global optimality. The selection range of nodes at different positions should be flexible and governed by a dynamic upper bound. Consequently, taking into account the dynamic nature of this upper bound, it should be defined as a function of the index of node i. Consequently, the rule for the up-bound is defined by Eq. (7).

$$\begin{aligned} up \_ bound=\eta \cdot i+\theta \end{aligned}$$
(7)

where \(up \_ bound\) denotes the upper bound of the selection range for the node. \(\eta \) and \(\theta \) are two control parameters. The specific process of LID search strategy is introduced in Algorithm 3.

Algorithm 3
figure c

LID(Gk)

4.4.2 Differential mutation

Specifically, the differential mutation utilizes the differences between individuals to generate mutated individuals. For the discrete IM problem, \(XE_{r1}\) is generated from the excellent subpopulation as the baseline individual, with its nodes to be replaced by nodes from the set of nodes involved in the differential operation. The generating rule is defined in Eq. (8).

$$\begin{aligned} XE_{r1}=XE.get(random(pop/2)) \end{aligned}$$
(8)

\(XE_{r2}\) and \(XE_{r3}\) are used to generate the differential set. To prevent individuals in the population from converging to the same solution, the algorithm establishes a more rational differential strategy. The algorithm selects two distinct individuals randomly from the excellent subpopulation to serve as \(XE_{r2}\) and \(XE_{r3}\). The rules are described in Eq. (9).

$$\begin{aligned} \left\{ \begin{aligned} XE_{r2}&=XE.get(random(pop/2))\\ XE_{r3}&=XE.get(random(pop/2))\\&\textrm{subject} \hspace{5.0pt}to \hspace{5.0pt}XE_{r2} \ne XE_{r3} \end{aligned} \right. \end{aligned}$$
(9)

For each individual \(XE_{r1}\), the algorithm replaces a certain number of nodes with the nodes in the differential set of \(XE_{r2}\) and \(XE_{r3}\). The number is determined by both the zoom factor F and the size of the differential set. The specific rule of the number is defined in Eq. (10).

$$\begin{aligned} N = F \cdot |XE_{r2}-XE_{r3}| \end{aligned}$$
(10)

where N represents the number of nodes to be replaced, "−" signifies the differential operation.

Furthermore, the node to be replaced is the one with less influence in the individual. Therefore, the CLDE initially employs LFV to assess the influence of each node within the present \(XE_{r1}\). Subsequently, it replaces the node with the smallest influence. The specific rule for generating the index is defined in Eq. (11).

$$\begin{aligned} j=\arg \min [LFV(ME_{i,j})] \end{aligned}$$
(11)

where \(ME_{i,j}\) denotes the jth (\(j=1,\ldots ,k\)) node to be replaced in the mutated individual \(ME_{i}\) (\(i=1,\ldots ,pop/2\)), which is generated from the ith \(XE_{r1}\).

If \(ME_{i}\) is the same as the differential set of \(XE_{r2}\) and \(XE_{r3}\), the CLDE selects a unique node from the set obtained by the LID search strategy to replace \(ME_{i,j}\). This approach ensures the continuity of the evolution while enhancing the diversity of the population. Conversely, the algorithm replaces \(ME_{i,j}\) with a randomly selected node from the differential set of \(XE_{r2}\) and \(XE_{r3}\). The updating rules are described in Eq. (12).

$$\begin{aligned} ME_{i,j}=\left\{ \begin{aligned}&uniq(LID(G,k)),\hspace{5.0pt}if \hspace{5.0pt}ME_{i}=XE_{r2}-XE_{r3}\\&(XE_{r2}-XE_{r3}).get(random(|XE_{r2}-XE_{r3}|)),\hspace{5.0pt}otherwise \end{aligned} \right. \end{aligned}$$
(12)

Figure 2 shows an illustration of the differential mutation process. Assuming that k is 5, F = 0.6, \(XE_{r1}\) = {14, 20, 7, 5, 13}, \(XE_{r2}\) = {4, 3, 50, 35, 26} and \(XE_{r3}\) = {35, 50, 55, 26, 3}. Firstly, the CLDE implements the differential operation on \(XE_{r2}\) and \(XE_{r3}\), then the difference set can be obtained. Then, parameters N and j are calculated according to Eqs. (10) and (11), respectively. Finally, it replaces the nodes at the specified indices in \(XE_{r1}\) with the nodes from the differential set. After the above-mentioned process, a mutated individual \(ME_{i}\) is obtained.

Fig. 2
figure 2

Differential mutation operation

The specific details of the mutation are explained in Algorithm 4.

Algorithm 4
figure d

Mutation(XEFpop/2, k)

4.4.3 Crossover operation

In the CLDE, after the mutation operation, a set of targeted nodes is selected to produce the crossover individual from each original individual and its corresponding mutated individual in the current generation based on the crossover probability cr. The rules are given in Eq. (13).

$$\begin{aligned} CE_{i,j} = \left\{ \begin{aligned} ME_{i,j}&,\hspace{5.0pt}if \hspace{5.0pt}random< cr \hspace{5.0pt}and \hspace{5.0pt}ME_{i,j} \notin CE_{i} \\ uniq(LID(G,k))&,\hspace{5.0pt}if\hspace{5.0pt}random< cr \hspace{5.0pt}and \hspace{5.0pt}ME_{i,j} \in CE_{i} \hspace{5.0pt}and\hspace{5.0pt}XE_{i,j} \in CE_{i} \\ XE_{i,j}&,\hspace{5.0pt}if\hspace{5.0pt}random < cr \hspace{5.0pt}and \hspace{5.0pt}ME_{i,j} \in CE_{i} \hspace{5.0pt}and \hspace{5.0pt}XE_{i,j} \notin CE_{i} \\ XE_{i,j}&,\hspace{5.0pt}if\hspace{5.0pt}random \ge cr \hspace{5.0pt}and\hspace{5.0pt}XE_{i,j} \notin CE_{i} \\ uniq(LID(G,k))&,\hspace{5.0pt}if\hspace{5.0pt}random \ge cr \hspace{5.0pt}and\hspace{5.0pt}ME_{i,j} \in CE_{i} \hspace{5.0pt}and\hspace{5.0pt}XE_{i,j} \in CE_{i} \\ ME_{i,j}&, \hspace{5.0pt}if\hspace{5.0pt}random \ge cr \hspace{5.0pt}and\hspace{5.0pt}XE_{i,j} \in CE_{i} \hspace{5.0pt}and\hspace{5.0pt}ME_{i,j} \notin CE_{i} \end{aligned} \right. \end{aligned}$$
(13)

where \(CE_{i}\), \(CE_{i,j}\), \(XE_{i,j}\) and \(ME_{i,j}\) represent the crossover individual, the node of the crossover individual \(CE_{i}\), the node of the original individual \(XE_{i}\), the node of the mutated individual \(ME_{i}\), respectively. cr denotes the predetermined crossover probability in the excellent subpopulation. Figure 3 shows an illustration of the crossover operation. Assuming that k is 5, \(cr = 0.4. 0.32, 0.54, 0.76, 0.2\) and 0.46 represent the random numbers. During this process, if the random number is larger than cr, the node \(CE_{ij}\) is filled with the node \(XE_{ij}\). Otherwise, the node is filled with the node \(ME_{ij}\).

Fig. 3
figure 3

Crossover operation

4.5 Common subpopulation evolution

For the evolution of the common subpopulation, an adaptive probabilistic updating-driven local search strategy is developed to enable further exploitation in a specific local region to improve the quality and enhance the diversity of solutions. Based on the local search probability and random numbers, the range for the solution search is selected. The formula for local search probability is defined in Eq. (14).

$$\begin{aligned} LSp_{i+1}= \left\{ \begin{aligned} (1-ar) \cdot LSp_i&, \hspace{5.0pt}if \hspace{5.0pt}EDIV(XC_{i}) < 2*fitness \_ sum/pop \\ bp+(1-bp) \cdot&LSp_i,\hspace{5.0pt}otherwise \end{aligned} \right. \end{aligned}$$
(14)

where parameters ar and bp represent the reward and penalty between 0 and 1, respectively. The variable \(LSp_i\) denotes the local search probability for the ith individual. \(EDIV(XC_{i})\) represents the EDIV of the ith individual in the common subpopulation. The variable \(fitness \_ sum\) refers to the total EDIV of all individuals in the common subpopulation. The updating formula for the common subpopulation is defined in Eq. (15).

$$\begin{aligned} v^*= \left\{ \begin{aligned}&\arg \max _{v \in V \backslash S} d(v), \hspace{5.0pt}if \hspace{5.0pt}m < LSp_i\\&\arg \max _{v \in N(XC_{i,j})} d(v), \hspace{5.0pt}otherwise \end{aligned} \right. \end{aligned}$$
(15)

where \(v^*\) represents the node with the maximum degree, m is a random variable that is drawn from the interval [0, 1]. \(N(XC_{i,j})\) represents the neighbors of the jth node of the ith individual in the common subpopulation. The degree of a node v is denoted by d(v). The details of the evolution of common population are explained in Algorithm 5.

Algorithm 5
figure e

Localsearch(GXCpop/2, karbp)

4.6 Selection operation

In the final stage, the EDIV is utilized to assess the quality of the individuals and update the population by selecting better individuals. The selection operation compares \(X_i\) with \(P_i\), and selects the one with higher EDIV value into the next generation for further evolution, where X represents the original population of the current generation and P is the population formed by the recombination of the excellent subpopulation after the crossover and differential mutation (denoted as CE) with the common subpopulation after local search (denoted as PC). Specifically, the formula for the selection operation is defined in Eq. (16).

$$\begin{aligned} X_i=\arg \max \{EDIV(X_i),EDIV(P_i)\} \end{aligned}$$
(16)

4.7 Complexity analysis

The complexity of CLDE is determined by three main factors: pop, k, and F. First, the complexity of the initialization process in Algorithm 2 is \(O(pop \cdot k)\). The process of dividing the population using a competitive mechanism needs at most pop/2 operations. Once the iteration starts, the CLDE conducts five components: differential mutation of excellent subpopulation, crossover of excellent subpopulation, local search of common subpopulation, population reorganization and population selection. Due to the size of the differential set is variable, the complexity is unclear for the differential mutation process in Algorithm 4. The size of the differential set will decrease as the iteration proceeds. Therefore, this algorithm considers only the worst-case time complexity of \(O(pop \cdot F \cdot k)\). For the excellent subpopulation crossover process, it needs \(pop/2 \cdot k\) operations. For the process of common subpopulation local search of Algorithm 5, it needs \(pop/2 \cdot k \cdot {\hat{D}}\) operations, where \({\hat{D}}\) denotes the maximum degree of the graph. The time complexity of population reorganization is O(1). For the selection process, the complexity is O(pop). Moreover, the initialization includes LFV, while the common population local search and selection process separately includes EDIV computation with a complexity of \(O(n \cdot {\hat{D}})\) and \(O(k \cdot {\hat{D}})\). In summary, the complexity of time is \(O(pop \cdot k^2 \cdot {\hat{D}}^2)\).

5 Experiments and analysis

5.1 Datasets and benchmarks

This paper evaluates the performance of the proposed CLDE based on the following considerations across six real-world social networksFootnote 1 and three synthetic networks: network density and network type. Specifically, Ca-netscienceFootnote 2 describes a network of collaborative relationships among scientists, where the nodes represent scientists and the edges represent collaboration between scientists. Email-univ \(^{2}\) describes interactions among members via email exchanges. Blog [48] is a network of communication relationships between bloggers on the MSN website. PGPgiantcompo \(^{2}\) is an unweighted and undirected graph network that describes the interactions between users who exchange information using the pretty good privacy algorithm. Ca-HepThFootnote 3 is from the "High Energy Physics—Theory" section. The description of Feather-deezer-social \(^{3}\) pertains to the network of mutual connections among Deezer users from European countries. The synthetic small-world (Watts-Strogatz, WS) network, random (Erdös-Rényi, ER) network and scale-free (Barabási-Albert, BA) network were also generated to conduct experiments. Table 1 provides the detailed information, where |V|, |E|, \(d_{ave}\) and \(c_{ave}\) represent the number of vertices, the number of edges, average degree and average clustering coefficient of the networks, respectively.

Table 1 Statistical information of the three synthetic networks and the six real-world networks

All the procedures were encoded in Python and executed on an \(\hbox {Intel}^\circledR \) \(\hbox {Xeon}^\circledR \) 5218R CPU @ 2.10 GHz with 64 GB memory in a Windows system. To evaluate the performance of the proposed CLDE, it is compared with the state-of-the-art algorithms.

  • CELF [16]: presented in 2007, it is a classic greedy-based algorithm, which selects the node with the highest marginal gain into the seed set in each iteration.

  • LIDDE [13]: put forward in 2021, it employs a differential evolution that utilizes the EDIV evaluation function to find the optimal seed set.

  • LDD [49]: introduced in 2022, it is a heuristic algorithm for identifying influential nodes based on centrality metric.

  • LAPSO-IM [34]: developed in 2019, it integrates a learning automata with discrete particle swarm optimization, with modifications made to the original evolutionary mechanisms.

  • TS-VA-MODE [43]: developed in 2022, it employs a two-stage optimization process and incorporates the multi-operator differential evolution along with an improved differential evolution featuring multiple search operators.

  • PHEE [44]: proposed in 2024, it utilizes two different strategies to search optimal solutions to improve the quality of the solutions for the IM problem.

5.2 Parameter settings

To balance both the influence spread and efficiency of the proposed algorithm, the orthogonal experimental test [50] was employed to determine the optimal parameter settings for the proposed CLDE. The orthogonal experimental test provides comprehensive information with fewer experimental trials. This is achieved by systematically selecting experimental combinations to reduce unnecessary repetitive experiments. It was observed that eight parameters, including population size pop, the zoom factor F, the crossover probability cr, the initial local search probability \(LSp_0\), the reward parameter ar, the penalty parameter bp, two control parameters \(\eta \) and \(\theta \) impact the performance of the CLDE under different settings. Therefore, this paper employs the \(L_{32}\) orthogonal array method, as shown in Table 2, which displays the parameter levels and their combinations, to configure the optima parameter settings. For the parameter setting experiments, this paper sets \(g_{max}\) to 50, k to 60 and p to 0.05. It sets the number of Monte Carlo simulations r to 1000. \(\alpha \), \(\beta \) and \(\gamma \) all take 1.

Each combination runs 50 times to obtain the average results. The bolded ones in Table 2 are the optimal value under the current network. Configuration number 9 is optimal in networks Email-univ, Feather-deezer-social, and BA, and number 30 is optimal in the networks Ca-netscience, Blog, PGPgiantcompo, Ca-HepTh, feather-deezer-social and WS, and number 31 is optimal only in the network ER. From Table 2, the results indicate that the proposed CLDE can obtain the optimal value in most cases when \(F=0.6\), \(cr=0.4\) and \(LSp_0=0.6\). And when the population size pop, the reward parameter ar, the penalty parameter bp, two control parameters \(\eta \) and \(\theta \) are larger, the influence spread obtained is relatively larger, but they increase the time complexity of the algorithm concurrently. To achieve a balance between the efficiency and effectiveness, the parameter settings in number 30 was chosen as the optimal parameter configuration for the following experiments.

Table 2 The influence spread of the orthogonal experimental test on the nine datasets at \(k=60\)

5.3 Performance experiments

5.3.1 Comparison on the influence spread

To validate the influence spread of the CLDE across different networks, comparison experiments with the six state-of-the-art algorithms were conducted on the six real-world networks and three synthetic networks. CELF and LDD require no parameter settings, while the parameter settings for the other five algorithms are shown in Table 3. The experimental parameter settings of the comparison algorithms in Table 3 follow the parameter configurations in the original literature. The comparison results of the influence spread under the IC model under propagation probability p = 0.05 and 0.1 are illustrated in Figs. 4, 5, 6 and 7.

Table 3 Parameter settings
Fig. 4
figure 4

Comparison on the influence spread of the CLDE with the six algorithms on the six real-world networks (IC model, \(p = 0.05\))

Fig. 5
figure 5

Comparison on the influence spread of the CLDE with the six algorithms on the three synthetic networks (IC model, \(p = 0.05\))

In the four figures, the x-axis represents the number of seed nodes, while the y-axis indicates the influence spread. It is evident that the influence spread increases gradually with the enlargement of the seed set. Specifically, as illustrated in Fig. 4, the CLDE outperforms the other six baselines on small-scale networks such as Ca-netscience, Email-univ, and Blog. However, in the other three real-world networks, its performance is only surpassed by the greedy-based CELF. Taking Fig. 4e as an example, when k = 60, the performance of the proposed CLDE surpasses that of LIDDE, PHEE, TS-VA-MODE, LAPSO-IM, and LDD by 6.9%, 1.1%, 42.3%, 1.1% and 0.8%, respectively. In Fig. 5, the influence spread of CLDE outperforms those of the other six baselines in most cases on the synthetic networks.

Fig. 6
figure 6

Comparison on the influence spread of the CLDE with the six algorithms on the six real-world networks (IC model, \(p = 0.1\))

Fig. 7
figure 7

Comparison on the influence spread of the CLDE with the six algorithms on the three synthetic networks (IC model, \(p = 0.1\))

In Fig. 6, the influence spread of the CLDE is slightly inferior to that of CELF on the six real-world networks, while it outperforms the other five benchmarks when the propagation probability \(p = 0.1\). As shown in Fig. 7, the CLDE returned larger influence spread in the random network ER and the small-world network WS. In the scale-free synthetic network BA, the performance of the CLDE is slightly inferior to CELF, while it outperforms the other five algorithms. Since the CELF is a greedy-based approach that selects seed nodes one by one, its performance is superior to the performance of the proposed CLDE due to it can estimate the influence spread in a more accurately way based on the Monte Carlo simulation. Overall, the proposed CLDE achieves comparable results to CELF and demonstrates superior performance compared to the other five algorithms in most cases. As demonstrated in the influence diffusion comparisons in Figs. 4, 5, 6 and 7, CLDE maintains superior stability across networks of varying scales compared to other algorithms. It can be primarily attributed to its competitive mechanism with random population partitioning strategy, which prevents it from converging to local optima effectively.

5.3.2 Comparison on efficiency

The running time of the CLDE was compared with the under six algorithms across the six real-world networks and three synthetic networks, with \(p = 0.05\) and 0.1, and k was chosen as 60. The efficiency comparison results of the seven algorithms are shown in Figs. 8 and 9. Furthermore, to clearly illustrate the comparison results, the logarithmic value of the running time was used, instead of directly plotting the running time.

As can be seen from Figs. 8 and 9, the running time of the proposed CLDE in the six real-world and three synthetic networks is significantly lower than that of the PHEE, TS-VA-MODE, CELF, LAPSO-IM and LIDDE. Although the proposed CLDE integrates more operations than the LIDDE, the running times of CLDE and LIDDE vary across different networks, with each algorithm excelling in different scenarios. Such results are promising, as the influence spread is better than that of most baselines.

Fig. 8
figure 8

The comparison on running time on different networks (IC model, \(p = 0.05\))

Fig. 9
figure 9

The comparison on running time on different networks (IC model, \(p = 0.1\))

5.4 Performance analysis of seed sets

5.4.1 Comparison on the convergence speed

The convergence of CLDE is closely related to its exploration and exploitation capabilities within the search space. Theoretically, CLDE enhances the population diversity by randomly dividing the population into excellent subpopulation and common subpopulation to reduce the risk of premature convergence. The algorithm employs global exploration for extensive search, and local exploitation to further refine the solutions, with the aim of accelerating the convergence speed in a more efficiency way. The size of the seed set was set as \(k = 60\) for this experiment. Figure 10 illustrates the comparisons of the convergence speeds of CLDE, LIDDE, PHEE, LAPSO-IM and TS-VA-MODE on the six real-world networks and three synthetic networks under propagation probability \(p = 0.05\).

As seen in the plots, firstly, the experimental results indicate that the algorithm demonstrates superior exploration and exploitation capabilities during the optimization process, enabling it to approach to the global optimum more rapidly. Furthermore, it is evident from the figures that CLDE converges significantly faster than LIDDE while demonstrating its superior performance in optimizing the EDIV function. Compared with the TS-VA-MODE, LAPSO-IM and PHEE, CLDE converges slower in most scenarios but with superior fitness value and the final influence spread. Therefore, it is important to make a trade-off between the solution quality and the convergence speed.

Fig. 10
figure 10

The comparison of converge speed between LIDDE, CLDE, PHEE, LAPSO-IM and TS-VA-MODE (IC model, \(p = 0.05\))

5.4.2 Average shortest distance among seed nodes

To further analyze the reasons why the proposed CLDE can obtain a more influential seed set, the average shortest distance [51] of seed nodes is introduced to show the performance of the CLDE. This is expressed mathematically in Eq. (17).

$$\begin{aligned} L(S) = \frac{2}{|S|(|S|-1)} \sum _{v,u \in S} d_{uv} \end{aligned}$$
(17)

where L(S) represents the average shortest distance between nodes in the seed set S, |S| denotes the number of seed nodes, \(d_{uv}\) indicates the average shortest distance between nodes u and v.

The experimental results in the nine networks at fixed \(k = 30\) and \(k = 60\) under the propagation probability \(p=0.05\) are shown in Fig. 11. As shown in Fig. 11a, b, g, h and i, the proposed CLDE shows higher average shortest distances than that of the classical CELF and other five baselines, indicating that the seed nodes obtained by CLDE are scattered wider in the network. In Fig. 11, the proposed CLDE and CELF exhibit higher average shortest distances than that of other baselines, which indicates that the seed nodes can unfold efficient propagation capability, such evidence can be found in Fig. 10. Furthermore, it also suggests that the CLDE and CELF are capable of fully leveraging the characteristics of the network structure to enhance the propagation. The longer average shortest distance signifies that the seed nodes have superior propagation abilities, enabling them to disseminate the information to a broader range of nodes. It further reflects the importance of their dispersion in the network in allowing them to cover different areas. In addition, the average shortest path distances in Fig. 11 reveal that the seed nodes selected by CLDE are dispersed more rationally within the network, such fact indicates the algorithm’s capability to conduct effective global searches.

Fig. 11
figure 11

Comparison on the average shortest distance of the seed sets of the seven algorithms on the six real-world networks and three synthetic networks (IC model, \(p = 0.05\))

5.4.3 Similarity between the seed sets

For the IM problem, the selected seed sets play a crucial role. This section employs the Jaccard coefficient to quantify the similarity between the seed sets [52]. The purpose of the seeds similarity comparison is to evaluate the consistency and stability of the seeds generated by different algorithms. The definition is mathematically expressed in Eq. (18).

$$\begin{aligned} J(S_A,S_B) = \frac{|S_A \cap S_B|}{|S_A \cup S_B|} \end{aligned}$$
(18)

where \(S_A\) and \(S_B\) represent the seed node sets returned by algorithms A and B, respectively. \(|S_A|\) denotes the size of \(S_A\). The method highlights the similarities and differences in node selection strategies across various algorithms. The similarity among seed sets acquired by different algorithms on the six real networks under p = 0.1 with a seed set size of 60 is illustrated in Fig. 12. The \(avg\_sim\) denotes the average similarity. The upper triangular matrix provides the average similarity value.

According to the influence spread shown in Fig. 6, the CELF achieved the best results compared with the six other algorithms. Therefore, the CELF can be selected as the groundtruth in the similarity comparisons. The similarity results shown in Fig. 12 further corroborate CLDE’s advantage in maintaining population diversity. As shown in the six subfigures, the similarities between the seed nodes of the CLDE and the CELF are the highest. Its high similarity with the CELF algorithm indicates its ability to identify high-quality solutions, while lower similarity with other algorithms suggests its capacity to explore a broader search space. The results suggest that CLDE is comparable to CELF in terms of influence spread in seed selection. In contrast, the other algorithms show lower seed set similarity compared to CLDE, which indicates a reduced overlap of nodes between the seed sets chosen by CLDE and other algorithms, demonstrating that CLDE is capable of avoiding the selection of nodes with low influence.

Fig. 12
figure 12

The comparison of similarities between seed sets of different algorithms (IC model, \(p = 0.1\))

5.5 Statistical analysis

To validate the effectiveness of CLDE independently, rigorous statistical tests were conducted to examine whether there are significant statistical differences in the experimental results of the CLDE compared to the six other baselines across the six real networks and the three synthetic networks. The Wilcoxon rank-sum test, of which the confidence level was set at 0.05, was performed using SPSS software, and the results are presented in Table 4. The \(N-\) and \(N+\) represent the number of instances where the targeted algorithm outperforms the baselines and the number of instances where the targeted algorithm performs worse than the baseline, respectively. The Z-value and p-value reflect the extent of performance difference between the two algorithms and the probability of the observed difference occurring by chance, respectively. The p-value less than 0.05 typically indicates a statistically significant difference, suggesting that the observed improvement (or decline) in performance is unlikely due to random variation.

The experimental results in Table 4 reveal the following findings: (1) CLDE performs well on both the real-world networks and the synthetic networks, indicating that the type of network has a minimal impact on its performance. (2) When k is small, the performance advantage of CLDE is not pronounced such as \(k = 10\). As k increases, CLDE generally outperforms other algorithms, including some scenarios of the CELF. (3) Compared to CELF, CLDE exhibits superior performance, with a p-value \(<0.05\), which means a significant difference under \(k=60\).

Table 4 The statistical results on CLDE and the other six algorithms at \(\alpha = 0.05\) on the six real networks and three synthetic networks (IC model, \(p = 0.05\))

6 Discussion

This study proposes a competitive learning-driven differential evolution algorithm for the IM problem. CLDE shares certain similarities with LIDDE in the initialization, mutation, crossover, and selection steps. However, IT introduces key innovations in the design and application of these steps, particularly evident in the following aspects: (1) Competitive mechanism: CLDE introduces a mechanism that divides the population into "excellent subpopulation" and "common subpopulation". This division allows the algorithm to implement targeted operations based on the distinct characteristics of each subpopulation to enhance the efficiency of both global and local search operations. (2) Adaptive probabilistic updating-driven local search strategy: In CLDE, we have designed an adaptive probabilistic updating-driven local search mechanism that adjusts the intensity of local search based on the dynamic changes within the population. Compared to LIDDE, CLDE places greater emphasis on the dynamic adjustment of the search process, thereby improving the algorithm’s adaptability to complex optimization problems. (3) Innovative integration of operational processes: Although both algorithms employ mutation, crossover, and selection operations, CLDE optimizes the integration and adaptation of these operations. The excellent subpopulation and common subpopulation utilize differential mutation and local search techniques to achieve a balance between the global exploration and local exploitation.

Among the CELF, LAPSO-IM, LIDDE, PHEE, TS-VA-MODE and LDD, extensive experiments on nine datasets demonstrate that the CLDE has higher influence spread and faster convergence speed. Therefore, CLDE has achieved a substantial performance improvement. However, the running time of the CLDE is longer than that of the centrality-based LDD. To improve time efficiency, future research can further explore the use of variable neighborhood search or iterated local search as alternatives to the computationally intensive local search with node neighborhood [12].

The proposed CLDE not only offers new insights for theoretical research on the IM problem based on meta-heuristics but also provides strong support for strategy formulation in practical applications. From a theoretical perspective, we are the first to apply the competitive mechanism to solve the IM problem. Through this mechanism, we can gain a more comprehensive understanding of the complexity and dynamics of influence spread. From a practical perspective, CLDE provides a strong basis for applications such as disease prevention, social promotion and product marketing. In disease prevention, CLDE can categorize users into multiple groups based on their age. Decision-makers can adopt different disease prevention strategies for each group to control the transmission and prevalence of diseases in a more effective way. Within social media platforms, CLDE can dissect user behavior patterns and pinpoint content creators or trending topics that possess the high dissemination potential, bolstering the efficacy of platform recommendations and enhancing user engagement.

7 Conclusions

To balance the influence spread and efficiency when solving the influence maximization problem especially in large-scale social networks, this study proposes an optimization algorithm named CLDE, which is a meta-heuristic approach based on differential evolution. It employs a competitive mechanism of randomly partitioning the population to enhance the diversity of the population and reduce the risk of converging to local optima. Furthermore, the CLDE incorporates an adaptive probabilistic updating-driven local search strategy to make more effective exploration of the search space across different iterations. The method achieves a better balance between local exploitation and global exploration in accelerating the convergence speed. Experimental results demonstrate that the proposed CLDE outperforms state-of-the-art algorithms in influence spread and exhibits superior convergence speed.

Future research aims to develop more precise evolutionary mechanisms to further enhance the influence spread of the algorithm. Moreover, a more rational population partitioning mechanism and local search strategy such as variable neighborhood search or iterated local search will be designed to improve the efficiency of the evolutionary algorithm.