Variance estimation for semiparametric regression models by local averaging

Zhao, Jingxin; Peng, Heng; Huang, Tao

doi:10.1007/s11749-017-0553-3

Variance estimation for semiparametric regression models by local averaging

Original Paper
Published: 20 September 2017

Volume 27, pages 453–476, (2018)
Cite this article

Download PDF

Access provided by China Pharmaceutical University

TEST Aims and scope Submit manuscript

Variance estimation for semiparametric regression models by local averaging

Download PDF

Jingxin Zhao¹,
Heng Peng¹ &
Tao Huang²

485 Accesses
8 Citations
Explore all metrics

Abstract

Variance estimation is a fundamental problem in statistical modelling and plays an important role in the inferences after model selection and estimation. In this paper, we focus on several nonparametric and semiparametric models and propose a local averaging method for variance estimation based on the concept of partial consistency. The proposed method has the advantages of avoiding the estimation of the nonparametric function and reducing the computational cost and can be easily extended to more complex settings. Asymptotic normality is established for the proposed local averaging estimators. Numerical simulations and a real data analysis are presented to illustrate the finite sample performance of the proposed method.

Model averaging prediction for nonparametric varying-coefficient models with B-spline smoothing

Article 15 January 2021

Model averaging estimation for nonparametric varying-coefficient models with multiplicative heteroscedasticity

Article 27 April 2023

Effective identification and estimation for the semiparametric measurement error model

Article 03 June 2016

Discover the latest articles and news from researchers in related subjects, suggested using machine learning.

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

In regression analysis, a resident part of the model is the error term, which is not only a representation of the randomness of the model, but also contains useful information such as the signal-to-noise ratio and the goodness-of-fit. Variance estimation is a fundamental problem in statistical modelling and plays an important role in the inferences after model selection and estimation. Moreover, it provides a benchmark of prediction error when a given model is compared to the “oracle” model. In practice, variance estimation has wide application in many areas, such as quality control (Box and Ramirez 1986), confidence interval construction (Carroll 1987), and immunoassay (Butt 1984), among others (Carroll and Ruppert 1988).

Many methods of variance estimation have been introduced throughout the decades. For parametric regression models, the variance can be estimated by the least squares method or the maximum likelihood estimation method. For nonparametric and semiparametric regression models, it is a bit more challenging to estimate the variance accurately. For example, consider a simple nonparametric regression model,

$$\begin{aligned} y_i=f(x_i)+\epsilon _i, \quad 1\le i \le n, \end{aligned}$$

(1)

where $f(\cdot )$ is an unknown smooth function, and the error terms $\epsilon _i$’s are i.i.d. with mean zero and constant variance $\sigma ^2$. Two main estimation approaches are reported in the literature. One approach is first to estimate the nonparametric function $f(\cdot )$ and then to calculate the estimate of residual variance based on the residual sum of squares (RSS) and its degrees of freedom. In theory, to estimate the nonparametric function $f(\cdot )$ accurately and efficiently, smoothness conditions are imposed on $f(\cdot )$ that may increase the possibility of model misspecification. In practice, smoothing techniques such as kernel methods or spline methods are implemented to estimate $f(\cdot )$. The selection of the bandwidth or the number and positions of basis functions is challenging and computationally intensive, which can greatly affect the estimation of $f(\cdot )$ and thus the estimations of RSS and its degrees of freedom.

Another approach is the difference-based method. This approach can directly estimate $\sigma ^2$ without knowing much information about the nonparametric function $f(\cdot )$. Rice (1984) proposed a first-order difference-based method to estimate the residual variance. The basic idea is to avoid the estimation of unknown function $f(x_i)$ and remove its effect by taking the differences for adjacent $x_i$’s and calculating the residual variance directly. More difference-based methods were introduced later, for example, by Gasser et al. (1986), Hall and Marron (1990), and Klipple and Eubank (2007). None of the above estimators achieves the asymptotic optimal efficiency for the mean squared error (Dette et al. 1998). Müller et al. (2003) proposed a class of difference-based estimators, and under certain assumptions for the difference weights, the asymptotic optimal efficiency can be achieved. Tong and Wang (2005) suggested a regression type of difference-based estimator for an equally spaced design. Moreover, Park et al. (2012) investigated the difference-based variance estimate for small sample nonparametric regression modelling problems. Brown and Levine (2007) considered the variance function estimation in nonparametric regression by the difference-based approach. The optimal convergence rates over broad function classes and bandwidths are fully characterized. Wang et al. (2008) further investigated the minimax rate of the convergence of the variance function estimation in nonparametric regression and particularly studied the effect of the unknown mean function on the estimation of the variance function. Cai et al. (2009) extended Wang et al. (2008)’s results to the variance function estimation in multivariate nonparametric regression with a fixed design.

The difference-based methods are much more robust and computationally simple, but they often assume the nonparametric regression model has certain simple structure and the design point is univariate and equally spaced. It is unclear how to generalize this to more complicated nonparametric and semiparametric models. Wang et al. (2011) and Brown et al. (2016) tried to use the difference-based approach to estimate the semiparametric partially linear and multivariate partially linear model and obtained an optimal efficient estimator of the linear components in the model. Furthermore, testing methods were also developed for the statistical inference problems. It is appropriate to try to extend the difference-based approach to more complex nonparametric or semiparametric regression models and not to focus solely on variance or variance function estimation.

In this paper, we propose a local averaging method to estimate the residual variance for nonparametric and semiparametric models based on the idea of partial consistency. First, we approximate the nonparametric function by a step constant function locally and then reparameterize nonparametric and semiparametric models into high-dimensional linear models. The ordinary least squares method can then be implemented to calculate the residual and estimate the residual variance. The reparameterized linear model is high-dimensional because the number of nuisance parameters increases with the sample size and cannot be estimated consistently. However, the residual variance can be estimated consistently and with near efficiency because of the partial consistency. Compared to the difference-based method, our proposed method can be easily generalized to more complicated models and does not require equally spaced assumptions for the design point.

The remainder of this paper is organized as follows. In Sect. 2, we first review the classical estimation methods for residual variance and then propose a local averaging variance estimation method and investigate its asymptotic properties. In Sect. 3, we extend the proposed method to semiparametric partial linear models and varying coefficient models and further propose a refined local averaging variance estimator and discuss numerical implementation issues. Furthermore, for the simple nonparametric regression model, by the ideas of moving averaging and kernel smoothing, we also propose a local moving average variance estimation method and a kernel-based variance estimation method for the unequally spaced design data. In Sect. 4, we discuss some applications of the proposed local averaging variance estimation method. Numerical studies and real data analysis are presented to illustrate the finite performance of the proposed method in Sect. 5. Some discussion and conclusions are given in Sect. 6. All of the proofs are relegated to Appendix (supplementary materials).

2 Variance estimation by local averaging

2.1 Review of classical methods

Consider the linear model,

$$\begin{aligned} y_i=\mathbf {x}_i^\mathrm{T} \varvec{\beta }+\epsilon _i \quad \text {or} \quad \mathbf {y}=\mathbf {X}\varvec{\beta }+\varvec{\epsilon }, \end{aligned}$$

where $\mathbf {y}=(y_1, \ldots , y_n)^\mathrm{T}$ is an n-vector of responses, $\mathbf {X}=(\mathbf {x}_1, \ldots , \mathbf {x}_n)^\mathrm{T}$ is an $n\times p$ design matrix, $\varvec{\beta }=(\beta _1, \ldots , \beta _p)^\mathrm{T}$ is a p-vector of parameters, and $\varvec{\epsilon }=(\epsilon _1, \ldots , \epsilon _n)$ is an n-vector of i.i.d. random errors with mean zero and variance $\sigma ^2$. The ordinary least squares estimator $\hat{\sigma }^2$ is

$$\begin{aligned} \hat{\sigma }^2=\frac{\mathbf {y}^\mathrm{T}(\mathbf {I}-\mathbf {H})\mathbf {y}}{\text{ tr }(\mathbf {I}-\mathbf {H})}, \end{aligned}$$

where $\mathbf {H}=\mathbf {X}(\mathbf {X}^\mathrm{T}\mathbf {X})^{-1}\mathbf {X}^\mathrm{T}$ is the projection operator onto the linear space generated by the column vectors of $\mathbf {X}$, $\mathbf {y}^\mathrm{T}(\mathbf {I}-\mathbf {H})\mathbf {y}$ is the residual sum of squares, and $\text{ tr }(\mathbf {I}-\mathbf {H})$ is the corresponding degrees of freedom. This estimator $\hat{\sigma }^2$ is the best linear unbiased estimator with asymptotic normality.

Consider the nonparametric regression model (1): in general, the residual vector and the RSS can be calculated, respectively, as

$$\begin{aligned} \hat{\varvec{\epsilon }}=\mathbf {y}-\hat{\mathbf {y}} = \mathbf {y} -\hat{\mathbf {f}} = (\mathbf {I}-\mathbf {S}) \mathbf {y}, \quad \text {RSS} = \mathbf {y}^\mathrm{T}(\mathbf {I}-\mathbf {S})^\mathrm{T}(\mathbf {I}-\mathbf {S})\mathbf {y}, \end{aligned}$$

where $\mathbf {S}$ is a smoothing matrix and, in practice, can be constructed by various methods such as the kernel method and the spline method. Although $f(\cdot )$ can be estimated efficiently, the bias would seriously affect the estimation of the residual variance. Moreover, it is a bit challenging to estimate the degrees of freedom of this RSS accurately because the matrix $(\mathbf {I}-\mathbf {S})$ is not a projection matrix as $\mathbf {H}$. In most situations, $\text{ tr }(\mathbf {I}-\mathbf {S})^\mathrm{T}(\mathbf {I}-\mathbf {S})$ is used to approximate the degrees of freedom of this RSS, whose calculation depends on the choice of the bandwidth h or the number and locations of basis functions. Furthermore, the optimal bandwidth h or basis functions for the efficient variance estimate would not be the same as the optimal bandwidth h or basis functions for the estimate of the mean function. Because selection of the optimal bandwidth or basis functions is involved in many complex computational procedures, it may lead to an unstable or even a biased estimate of the residual variance.

Another approach is the difference-based method proposed by Rice (1984). Consider model (1) and, without loss of generality, assume $0\le x_1 \le \cdots \le x_n \le 1$, then the difference-based estimator is defined as

$$\begin{aligned} \hat{\sigma }_{\text {R}}^2(k)=\frac{1}{2(n-k)}\sum _{i=k+1}^n(y_i-y_{i-k})^2. \end{aligned}$$

Note that $y_i-y_{i-k}=f(x_i)-f(x_{i-k})+\epsilon _i-\epsilon _{i-k}$. When $x_i$’s are equally spaced, the first term of the difference of the mean functions $f(x_i)$ and $f(x_{i-k})$ should be of the order $O(n^{-1})$ given k is constant. It is much smaller than the second term, the difference of errors $\epsilon _i$ and $\epsilon _{i-k}$, in terms of orders of magnitude, and thus can be ignored in calculating $\hat{\sigma }_{\text {R}}^2(k).$Gasser et al. (1986) proposed a second-order difference-based estimator to improve the estimation efficiency,

$$\begin{aligned} \hat{\sigma }^2_{\text {GSJ}}=\frac{1}{n-2}\sum _{i=2}^{n-1}c_i^2\hat{\epsilon }_i^2, \end{aligned}$$

where $\hat{\epsilon }_i$ is the difference between $y_i$ and the value at $x_i$ of the line joining $(x_{i-1},y_{i-1})$ and $(x_{i+1},y_{i+1})$, and $c_i$’s are chosen such that $E[c_i^2\hat{\epsilon }_i^2]=\sigma ^2$ for all i when $f(\cdot )$ is linear. Müller et al. (2003) considered a class of difference-based estimators,

$$\begin{aligned} \hat{\sigma }_{\text {MSW}}^2=\frac{1}{2\sum _{i\ne j}^nW_{ij}}\sum _{i\ne j}^n W_{ij}(y_i-y_j)^2, \end{aligned}$$

where $W_{ij}$’s are weights only depending on $x_i$’s. They showed under certain assumptions that the asymptotic optimal rate of the mean squared errors can be achieved by $\hat{\sigma }_{\text {MSW}}^2$. When $x_i$’s are equally spaced in [0, 1], i.e. $x_i=i/n$,

$$\begin{aligned} E[\hat{\sigma }_{\text {R}}^2(k)]\asymp \sigma ^2+Jd_k, \quad 1\le k\le m, \quad m=o(n), \end{aligned}$$

where $d_k=k^2/n^2$ and $J=\int _{0}^1 \{f'(x)\}^2 \text {d}x/2$. From the idea of the difference-based method, Tong and Wang (2005) considered a linear model $ s_k=\alpha +\beta d_k+e_k,$ where $s_k=\sum _{i=k+1}^n(y_i-y_{i-k})^2/\{2(n-k)\}$, $1\le k\le m$, and proposed a variance estimator,

$$\begin{aligned} \hat{\sigma }^2_{\text {T}}=\widehat{\alpha }, \end{aligned}$$

where $\widehat{\alpha }$ is the estimated intercept. This method reduces the asymptotic rate of mean squared errors to $O_p(n^{-1})$ with an optimal bandwidth m.

The difference-based methods have the advantages of avoiding the estimation of the nonparametric function and reducing the computational cost, but their extensions to more general settings are somewhat limited. For example, some methods such as Tong and Wang’s method requires the $x_i$’s to be equally spaced, and it is unclear how the variation of $x_i$’s affects the estimation of the residual variance. Moreover, the difference-based operation increases the complexity of the model structure, and it is unclear how to generalize it to more complicated nonparametric and semiparametric models.

2.2 Local averaging method

For the sake of simplicity, assume that $x_i$’s are ordered in an ascending order $x_1\le x_2\le \cdots \le x_n$, denote $x_{ij}=x_{I*(i-1)+j}$, and then split $x_i$ into $k=n/I$ groups with I observations in each group:

$$\begin{aligned} \underbrace{x_{11},x_{12},\ldots , x_{1I}}_1, \underbrace{ x_{21},x_{21},\ldots , x_{2I}}_2, \cdots \cdots \underbrace{x_{k1},x_{k2},\ldots , x_{kI}}_k. \end{aligned}$$

The number of samples I in each group is a fixed constant and independent of the sample size, and the number of group k is proportionate to the sample size. When the interval is small enough, those $x_i$’s falling in the same interval are expected to be close, so do the nonparametric function values $f(x_i)$’s. Therefore, for the ith group, we take the local average of $y_{ij}$’s as the estimated function values of $f(x_{ij})$’s:

$$\begin{aligned} \hat{f}(x_{i1})=\hat{f}(x_{i2})= \dots = \hat{f}(x_{iI})=\frac{1}{I}\sum _{j=1}^I y_{ij} \triangleq y_{i \cdot } \end{aligned}$$

and then propose a local averaging variance estimator

$$\begin{aligned} {\hat{\sigma }}_L^2=\frac{1}{n-k} \sum _{i=1}^{k}\sum _{j=1}^I(y_{ij} - \hat{f}(x_{ij}))^2 = \frac{1}{n-k} \sum _{i=1}^{k}\sum _{j=1}^I(y_{ij} - y_{i \cdot } ))^2. \end{aligned}$$

(2)

In fact, by the step function approximation, the nonparametric regression model (1) can be reparameterized and rewritten as a high-dimensional linear model:

$$\begin{aligned} y_{ij}=X_{ij}^\mathrm{T}\varvec{\alpha }+\varepsilon _{ij}^*, \quad i=1,\ldots ,k, ~~ j=1,\ldots ,I, \end{aligned}$$

(3)

where $y_{ij}=y_{I(i-1)+j}$, $X_{ij}$ is a k-vector with zero elements except one for the ith element, $\varvec{\alpha }=(\alpha _1,\ldots ,\alpha _{k})^\mathrm{T}$ is a k-vector of unknown parameters, and $\varepsilon _{ij}^*=\varepsilon _{ij}+f(x_{ij})-\frac{1}{I}\sum _{j=1}^{I} f(x_{ij})$. The ordinary least squares method can then be used to estimate the model, residuals, and the residual variance, where k can be regarded as the degrees of freedom of RSS.

Remark 1

Model (3) is quite similar to an ANOVA model in form, and one may think to apply the variance estimation approach to obtain an efficient estimate of the variance. However, model (3) is different from an ANOVA model in two significant ways. First, the parameters $\varvec{\alpha }$ in model (3) are treated as fixed effects since they depend on the unknown function $f(\cdot )$, and it would be inappropriate to regard them as random effects following some covariance matrix structure. Second, the error in model (3) combines both a real mean zero random error and an approximation error, and the effect of such an approximated error on the final variance estimate is unclear.

2.3 Theoretical properties

We need the following two regularity conditions to investigate the theoretical properties of the proposed local averaging variance estimator.

(A)
The residuals $\epsilon _i$’s are i.i.d. with $E[\epsilon _i]=0, E[\epsilon _i^2]=\sigma ^2$, and $E[\epsilon _i^4]=\mu _4<\infty $.
(B)
The function $f(\cdot )$ and its corresponding first and second derivatives are all bounded.

Theorem 1

Under Conditions (A) and (B), the local averaging variance estimator $\hat{\sigma }_L^2$ in (2) for the nonparametric model (1) is asymptotically normal,

$$\begin{aligned} \sqrt{n}(\hat{\sigma }_L^2-\sigma ^2)\overset{D}{\rightarrow }N\left( 0,\mu _4-\frac{I-3}{I-1}\sigma ^4\right) . \end{aligned}$$

Remark 2

The reparameterized linear model is high-dimensional because the number of parameters $\varvec{\alpha }$ increases in proportion to the sample size, and thus the estimation of $\varvec{\alpha }$ or $f(x_{ij})$ cannot be consistent. However, of primary interest is the estimation of the residual variance, and Theorem 1 shows that the proposed local averaging estimator $\hat{\sigma }^2_{L}$ in (2) is a consistent estimation of $\sigma ^2$. This is the so-called partial consistent phenomenon (Neyman and Scott 1948; Fan et al. 2005), where $\sigma ^2$ is called the structural parameter of the model and can be estimated consistently, and $\varvec{\alpha }$ is called the incidental parameter and cannot be estimated consistently.

3 Various extensions

3.1 Partial linear models

Consider a partial linear regression model,

$$\begin{aligned} y_i=\mathbf {z}_i^\mathrm{T}\varvec{\beta }+f(x_i)+\epsilon _i, 1\le i \le n, \end{aligned}$$

(4)

where $\mathbf {z_i}=(z_{i1}, \ldots , z_{ip})^\mathrm{T}$ is a p-vector of covariates, $\varvec{\beta }=(\beta _1,\ldots ,\beta _p)^\mathrm{T}$ is a p-vector of parameters, $f(\cdot )$ is an unknown smooth function, and the error terms $\epsilon _i$’s are i.i.d. with mean zero and constant variance $\sigma ^2$. Denote $\mathbf {y} = (y_1, \ldots , y_n)^\mathrm{T}$ and $\mathbf {Z} = (\mathbf {z_1}, \ldots , \mathbf {z_n})^\mathrm{T}$. Klipple and Eubank (2007) proposed a difference-based variance estimator for (4),

$$\begin{aligned} \hat{\sigma }^2_{K}=\frac{\mathbf {y}^\mathrm{T}\mathbf {D}^\mathrm{T}(\mathbf {I}-\mathbf {P})\mathbf {D}\mathbf {y}}{\mathrm {tr}\left\{ \mathbf {D}^\mathrm{T}(\mathbf {I}-\mathbf {P})\mathbf {D}\right\} }, \end{aligned}$$

where $\mathbf {D}$ is the so-called differencing matrix, and $\mathbf {P}=\mathbf {D}\mathbf {Z}(\mathbf {Z}^\mathrm{T}\mathbf {D}^\mathrm{T}\mathbf {D}\mathbf {Z})^{-1}\mathbf {Z}^\mathrm{T}\mathbf {D}^\mathrm{T}$ is the projection matrix. This method is quite complicated because the difference weights have to be chosen carefully to balance the bias and the variance.

By the local averaging method, we assume $x_i$’s are ordered in ascending order $x_1<x_2<\cdots <x_n$, and then split the data into $k=n/I$ groups by $x_i$’s. Denote the jth data in the ith group as $(x_{ij}, \mathbf {z}_{ij}, y_{ij}, \epsilon _{ij})=(x_{t}, \mathbf {z}_{t}, y_{t}, \epsilon _{t})$, $t=(i-1)\times I +j$. For the ith group, we approximate the function values $f(x_{ij})$’s by their average $\frac{1}{I}\sum _{j=1}^{I}f(x_{ij})$. When I is small, the approximation error $f(x_{ij})-\frac{1}{I}\sum _{j=1}^{I}f(x_{ij})$ is of the order $O(n^{-1})$, much smaller than the order of the difference of $\epsilon _{ij}$’s. Similarly, (4) can be reparameterized and rewritten as

$$\begin{aligned} y_{ij}=\alpha _i+\mathbf {z}_{ij}^\mathrm{T}\varvec{\beta }+\epsilon ^*_{ij}, 1<i<k, 1<j<I, \end{aligned}$$

(5)

where $\alpha _i=\frac{1}{I}\sum _{j=1}^{I}f(x_{ij})$ and $\epsilon ^*_{ij}=\epsilon _{ij}+f(x_{ij})-\frac{1}{I}\sum _{j=1}^{I}f(x_{ij})$. Thus, by the ordinary least squares method, we have

$$\begin{aligned} \hat{\varvec{\beta }}= & {} \left\{ \sum _{i,j}(\mathbf {z}_{ij}-\overline{\mathbf {z}}_{i\cdot })^\mathrm{T}(\mathbf {z}_{ij}-\overline{\mathbf {z}}_{i\cdot })\right\} ^{-1} \left\{ \sum _{i,j}(\mathbf {z}_{ij}-\overline{\mathbf {z}}_{i\cdot })^\mathrm{T}(y_{ij}-\overline{y}_{i\cdot })\right\} ,\\ \hat{\alpha }_i= & {} \frac{1}{I}\sum _{j=1}^{I}\left\{ y_{ij}-\mathbf {z}_{ij}^\mathrm{T}\hat{\varvec{\beta }}\right\} , \quad i=1,\ldots , k. \end{aligned}$$

Therefore, we propose a local averaging variance estimator

$$\begin{aligned} \hat{\sigma }^2_{L}=\frac{1}{n-k-p}\sum _{i=1}^{k}\sum _{j=1}^{I}(y_{ij}-\mathbf {z}_{ij}^\mathrm{T}\hat{\varvec{\beta }}-\hat{\alpha }_i)^2. \end{aligned}$$

(6)

Theorem 2

Under Conditions (A) and (B), the local averaging variance estimator $\hat{\sigma }^2_{L}$ in (6) for the partial linear model (4) is asymptotically normal,

$$\begin{aligned} \sqrt{n}\left( \hat{\sigma }^2_{L}-\sigma ^2\right) \overset{D}{\rightarrow }N\left( 0,\mu _4-\frac{I-3}{I-1}\sigma ^4\right) . \end{aligned}$$

Remark 3

The asymptotic variance of the estimator $\hat{\sigma }^2_{L}$ in (6) for the partial linear model (4) is the same as that of $\hat{\sigma }^2_{L}$ in (2) for the nonparametric model (1). This means that the estimation of $\varvec{\beta }$ has little effect on the variance estimation for the local averaging method. As shown by Fan et al. (2005), for such a high-dimensional model (6), the structural parameters $\varvec{\beta }$ can be estimated consistently and almost efficiently, whereas the incidental parameters $\varvec{\alpha }$ cannot be estimated consistently.

3.2 Varying coefficient models

Consider a varying coefficient model

$$\begin{aligned} y_i=\sum _{l=1}^p\beta _l(u_i)x_{il}+\epsilon _i = \mathbf {x}_i^\mathrm{T} \varvec{\beta }(u_i)+\epsilon _i, i=1,\ldots , n, \end{aligned}$$

(7)

where $u_i$ is the index variable, $\mathbf {x}_i=(x_{i1}, \ldots , x_{ip})^\mathrm{T}$ is a p-vector of covariates, and $\varvec{\beta }(\cdot )=(\beta _1(\cdot ), \ldots , \beta _p(\cdot ))^\mathrm{T}$ is a p-vector of unknown varying coefficient functions. By the local polynomial fitting, Zhang and Lee (2000) proposed a variance estimator

$$\begin{aligned} \hat{\sigma }^2_{\text {poly}}=\frac{1}{n}\sum \limits _{i=1}^n \hat{\sigma }^2_{\text {poly}}(u_i). \end{aligned}$$

For each given point $u_0$,

$$\begin{aligned} \hat{\sigma }^2_{\text {poly}}(u_0)=\frac{\mathbf {y}^\mathrm{T}\left\{ \mathbf {W}-\mathbf {W}\mathbf {X}(\mathbf {X}^\mathrm{T}\mathbf {W}\mathbf {X})^{-1}\mathbf {X}^\mathrm{T}\mathbf {W}\right\} \mathbf {y}}{\mathrm {tr}\left\{ \mathbf {W} -(\mathbf {X}^\mathrm{T}\mathbf {W}\mathbf {X})^{-1}\mathbf {X}^\mathrm{T}\mathbf {W}^2\mathbf {X}\right\} }, \end{aligned}$$

where $\mathbf {W}=\text {diag}(K_h(u_1-u_0),\dots ,K_h(u_n-u_0))$, $K_h(\cdot )=K(\cdot /h)/h$, $K(\cdot )$ is a kernel function and h is the bandwidth, and $\mathbf {X} = (\mathbf {x}_1, \ldots , \mathbf {x}_n)^\mathrm{T} \otimes (1, (u_1-u_0), \ldots , (u_1-u_0)^q)$, with $\otimes $ denoting the Kronecker product. As discussed above, it is not an easy task to select an appropriate bandwidth simultaneously for all varying coefficient functions. Moreover, an optimal bandwidth selection can be quite arduous and time-consuming, although the primary interest is only to estimate the residual variance.

For our proposed local averaging method, without loss of generality, we assume the index variable $u_i$’s are ordered in ascending order $u_1<u_2<\cdots <u_n$ and then split the data into $k=n/I$ groups by $u_i$’s. Denote the jth data in the ith group as $(u_{ij},\mathbf {x}_{ij},y_{ij},\epsilon _{ij})=(u_{t},\mathbf {x}_{t},y_{t},\epsilon _{t})$, $t=(i-1)*I+j$. If, for the ith group, we treat the coefficient functions as piecewise constant $\varvec{\beta }(u_{i1})=\varvec{\beta }(u_{i2})=\cdots =\varvec{\beta }(u_{iI})=\varvec{\beta }_i$, then (7) can be rewritten as

$$\begin{aligned} \mathbf {y}_i = \mathbf {X}_i\varvec{\beta }_i + \varvec{\epsilon }_i \quad \text {or} \quad \mathbf {y} = \mathbf {X}\varvec{\beta }+\varvec{\epsilon }, \end{aligned}$$

where $\mathbf {y}_i=(y_{i1},\ldots y_{iI})^\mathrm{T}$ is a I-vector of responses, $\mathbf {X}_i=(\mathbf {x}_{i1}, \ldots , \mathbf {x}_{iI})^\mathrm{T}$ is an $I \times p$ matrix of covariates, $\varvec{\beta }_i=(\beta _{i1}, \ldots \beta _{ip})^\mathrm{T}$ is a p-vector of unknown parameters, $\varvec{\epsilon }_i=(\epsilon _{i1}, \ldots ,\epsilon _{iI})^\mathrm{T}$ is an I-vector of random errors, and $\mathbf {y}=(\mathbf {y}_1,\ldots ,\mathbf {y}_k)^\mathrm{T}$, $\mathbf {X}= \text {diag}(\mathbf {X}_1, \ldots , \mathbf {X}_k)$, $\varvec{\beta }=(\varvec{\beta }_1,\ldots ,\varvec{\beta }_k)^\mathrm{T}$, $\varvec{\epsilon }=(\varvec{\epsilon }_1,\varvec{\epsilon }_2,\ldots ,\varvec{\epsilon }_k)^\mathrm{T}$. Thus, by the ordinary least squares method, we propose a local averaging variance estimator

$$\begin{aligned} \hat{\sigma }^2_{L}=\frac{\mathbf {y}^\mathrm{T}\mathbf {P}\mathbf {y}}{\text{ tr }(\mathbf {P})}, \end{aligned}$$

(8)

where $\mathbf {P}=\mathbf {I}-\mathbf {X}(\mathbf {X}^\mathrm{T}\mathbf {X})^{-1}\mathbf {X}^\mathrm{T}$.

We need the following regularity conditions to investigate the theoretical properties of the proposed local averaging variance estimator.

(A$'$) The residuals $\epsilon _i$’s are normally distributed with mean zero and variance $\sigma ^2$.
(B$'$) The coefficient functions $\beta _i(\cdot ), 1\le i\le p$, and their corresponding first and second derivatives are all bounded.
(C$'$) All predictors are bounded, $||\mathbf {x} ||<\infty $.

Theorem 3

Under Conditions (A$'$)–(C$'$), the local averaging variance estimator (8) for the varying coefficient model (7) is asymptotically normal,

$$\begin{aligned} \sqrt{n}\left( \hat{\sigma }^2_{L}-\sigma ^2\right) \overset{D}{\rightarrow }N\left( 0,\frac{2I\sigma ^4}{I-p}\right) . \end{aligned}$$

Remark 4

The normality assumption (A$'$) is assumed only to facilitate the proof. In general, the asymptotic variance is $\frac{I(\mu _4-3\sigma ^4)}{k(I-p)^2}\mathbf {p}^\mathrm{T}\mathbf {p}+\frac{2I\sigma ^4}{I-p}$, where $\mathbf {p}$ is the diagonal vector of $\mathbf {P}$. The first term disappears under the normality assumption because $\mu _4-3\sigma ^4=0$. Without the normality assumption, this asymptotic variance is still bounded. As the projection matrix $\mathbf {P}$ is idempotent and symmetric, we have

$$\begin{aligned} \frac{tr(\mathbf {P})^2}{n}=\frac{k^2(I-p)^2}{n}=\frac{(I-p)^2k}{I} \le \mathbf {p}^\mathrm{T}\mathbf {p}\le tr(\mathbf {P}^\mathrm{T}\mathbf {P})=tr(\mathbf {P})= k(I-p). \end{aligned}$$

Therefore, the asymptotic variance is bounded by $\frac{\mu _4-3\sigma ^4}{n}+\frac{2I\sigma ^4}{n(I-p)}$ and $\frac{1}{n}\frac{I}{I-p}(\mu _4-\sigma ^4)$.

Remark 5

For the nonparametric model, the partial linear model and the varying coefficient model, our proposed local averaging variance estimators all achieve the asymptotic optimal rate for the mean squared errors (Dette et al. 1998), i.e. $\text {MSE}(\hat{\sigma }^2)=n^{-1}\text {var}(\epsilon ^2)+o(n^{-1})$. The difference-based estimator proposed by Tong and Wang (2005) also achieves the asymptotic optimal rate for the mean squared errors but only under an equally spaced design setting. Moreover, our method is straightforward with light computational cost and can be easily extended to more complicated settings.

3.3 Refined local averaging variance estimator

Theorems 1–3 show that the asymptotic variance of our proposed local averaging estimators decreases as I increases. In Sect. 4, numerical studies show that our method is robust to the selection of interval length I. Even $I=2$ can lead to a consistent variance estimator, although $I>3$ for (2) and (6) and $I>p$ for (8) are required to make the method valid. Selecting an optimal I may, in practice, improve the performance of our proposed method but require additional computational cost. Here, we propose a refined local averaging variance estimator by aggregating estimates over different I’s, which is expected to be more stable.

First, consider the local averaging variance estimator $\hat{\sigma }^2_{L}$ in (2) for the nonparametric model (1). By some calculations, we can show that

$$\begin{aligned} \text {E}[\hat{\sigma }^2_{L}(I)]=\beta _1+\beta _2\frac{I(I+1)}{n^2}+o\left( \frac{1}{n^2}\right) , I=2,\ldots ,m, \end{aligned}$$

where $m=o(n)$ is a predefined constant, $\beta _1=\sigma ^2$, and $\beta _2=J=\frac{1}{12}\int _0^1{f'(x)}^2\text {d}x$. We naturally consider a linear model

$$\begin{aligned} \hat{\sigma }^2_{L}(I)=\beta _1+\beta _2\frac{I(I+1)}{n^2} +e_I,\ I=2,\ldots ,m. \end{aligned}$$

Under the normality assumption for $\epsilon $ and by Theorem 1, it is easy to show that $\text{ var }(e_I)\approx \frac{I}{I-1}\frac{2\sigma ^4}{n}$. Hence, we instead consider a linear model

$$\begin{aligned} s_I=\beta _1t_{I1}+\beta _2t_{I2}+e_I^*,\ I=2,\ldots ,m, \end{aligned}$$

where $s_I=w_I\hat{\sigma }^2(I),t_{I1}=w_I,t_{I2}=w_I\frac{I(I+1)}{n^2}$, and $w_I=\sqrt{n(I-1)/I}$. The variance of $e_I^*$ is now approximately a constant $2\sigma ^4$. By the ordinary least squares, we propose a refined local averaging variance estimator

$$\begin{aligned} \hat{\sigma }^2_*=\hat{\beta }_1=\frac{(\sum s_I t_{I1})(\sum t_{I2}^2)-(\sum t_{I1}t_{I2})(\sum \ t_{I2}s_I)}{(\sum t_{I1}^2)(\sum t_{I2}^2)-(\sum t_{I1}t_{I2})^2}. \end{aligned}$$

In this way, we use the regression technique to reduce the bias and variance of multiple local averaging variance estimators, which is expected to improve the stability of this estimator.

Similar techniques can be extended to the partial linear model and the varying coefficient model. We have, for the partial linear model,

$$\begin{aligned} \text {E}[\hat{\sigma }^2(I)]= \left( 1+\frac{I}{n(I-1)}\right) \sigma ^2+\frac{I(I+1)}{n^2}J+o\left( \frac{1}{n}\right) , \end{aligned}$$

where J is a constant depending on the nonparametric function $f(\cdot )$, and for the varying coefficient model,

$$\begin{aligned} \text {E}[\hat{\sigma }^2(I)]=\sigma ^2+C\frac{(I-1)(I+1)}{n^2}+o(1), \end{aligned}$$

where C is a constant depending on the varying coefficient functions $\varvec{\beta }(\cdot )$ and the covariance matrix $\Sigma $ of X.

3.4 Moving local average and kernel-based estimators

The proposed local averaging method has the advantages of avoiding to efficiently estimate the nonparametric functions and reducing much computational cost. However, when the number of observations in each group I is small, the bias of the nonparametric function estimate at the boundary is generally larger than that in the interior. In comparisons, by the idea of the moving average method or the K-mean method, we can simply estimate nonparametric function estimate by the mean of its nearest I observations and then propose a moving average variance estimator,

$$\begin{aligned} \hat{\sigma }^2_M(I) = \frac{\mathbf {y}^\mathrm{T}(\mathbf {I}-\mathbf {S})^\mathrm{T}(\mathbf {I}-\mathbf {S})\mathbf {y}}{\mathrm {tr}\left\{ (\mathbf {I}-\mathbf {S})^\mathrm{T}(\mathbf {I}-\mathbf {S})\right\} }, \end{aligned}$$

where $\mathbf {S}= \frac{1}{I} \mathrm {Ind}(|i-j|<(I+1)/2)$ is a projection matrix and $\mathrm {Ind}(\cdot )$ is the indicate function.

As most difference-based approaches, the proposed local average method and the moving average method assume the nonparametric regression model has certain simple structure and the design points are equally spaced. Otherwise, the usefulness of the proposed methods may be undermined. By the idea of kernel regression, when the observations are non-equally spaced, we propose a kernel-based variance estimator,

$$\begin{aligned} \hat{\sigma }^2_K(I) = \frac{\mathbf {y}^\mathrm{T}(\mathbf {I}-\mathbf {S})^\mathrm{T}(\mathbf {I}-\mathbf {S})\mathbf {y}}{\mathrm {tr}\left\{ (\mathbf {I}-\mathbf {S})^\mathrm{T}(\mathbf {I}-\mathbf {S})\right\} }. \end{aligned}$$

where $\mathbf {S}(i,j) = \mathrm {Ind}\left( |x_i-x_j|<h\right) / \sum \nolimits _{j=1}^n \mathrm {Ind}\left( |x_i-x_j|<h\right) $ is a projection matrix with the bandwidth h. In practice, the indicated function can be replaced by any kernel function. Compared to the local averaging estimate, the number of observations within each window for the kernel-based variance estimate is variant and depends on the sampling distribution of the observations and the bandwidth h.

Similar to the proposed local average method for the equally spaced design, the proposed moving average method and the kernel-based method can reduce the bias of nonparametric function estimate, though may also inflate its variance. Furthermore, we can also construct a regression model and use the weighted least squares method to derive refined moving average estimate and kernel-based estimate for the error variance. However, it is quite challenging to extend these two methods to more complex semiparametric models, as the constructed regression models are not simple linear regression models.

4 Applications of local averaging variance estimation

In this section, we focus on the simple nonparametric regression model (1) and illustrate some potential applications of the proposed local averaging variance estimation method. Similar procedures can be implemented for more complicated inference problems and for more complicated nonparametric and semiparametric models.

4.1 Confidence interval of variance estimation by local averaging

Theorem 1 shows that the asymptotic variance of the proposed local averaging variance estimate is $\mu _4-\frac{I-3}{I-1}\sigma ^4$, which depends on the unknown parameter $\sigma ^2$ and the fourth moment of the error distribution. Under the normality assumption for the error, it can be simplified as $\frac{2I}{I-1} \sigma ^4$ because $\mu _4=3\sigma ^2$. By the idea of the variance-stabilizing transformation (van der Vaart 1998), we can construct the confidence interval for $\sigma ^2$ based on the proposed local averaging variance estimate.

First, if the error follows a normal distribution, then by some calculations, the variance-stable transformation is

$$\begin{aligned} \phi (\sigma ^2)=\sqrt{\frac{I-1}{2I}} \log (\sigma ^2). \end{aligned}$$

By Theorem 1, we then have

$$\begin{aligned} \sqrt{n}(\phi (\hat{\sigma }_L^2)-\phi (\sigma ^2))\mathop {\rightarrow }\limits ^{d} N\left( 0,\phi '(\sigma ^2)^2\cdot \frac{2I}{I-1} \sigma ^4\right) \mathop {\rightarrow }\limits ^{d} N(0,1). \end{aligned}$$

This yields an asymptotic $1-\alpha $ level confidence interval for the variance $\sigma ^2$,

$$\begin{aligned}&\left( \exp \left( \sqrt{\frac{2I}{I-1}} \left( \sqrt{\frac{I-1}{2I}} \log {\hat{\sigma }_L^2}-\frac{1}{\sqrt{n}}z_{\alpha /2}\right) \right) ,\right. \nonumber \\&\quad \left. \exp \left( \sqrt{\frac{2I}{I-1}} \left( \sqrt{\frac{I-1}{2I}} \log {\hat{\sigma }_L^2}+\frac{1}{\sqrt{n}}z_{\alpha /2}\right) \right) \right) , \end{aligned}$$

(9)

where $z_\alpha $ is the $1-\alpha $ quantile of the standard normal distribution.

If the normality assumption is not satisfied by the error distribution, we can still use the variance-stable transformation

$$\begin{aligned} \phi (\sigma ^2)= \sqrt{\frac{I-1}{I-3}} \arcsin \left( \sqrt{\frac{I-3}{I-1}} \frac{\sigma ^2}{\sqrt{\mu _4}} \right) \end{aligned}$$

to stabilize the variance of the estimate. By using the proposed local averaging technique and some straightforward calculations, we can derive a consistent estimate of $\mu _4$,

$$\begin{aligned} \hat{\mu }_4= & {} \frac{I^4}{((I-1)^3+1)(I-1)}\left( \frac{1}{n}\sum \limits _{i=}^k\sum \limits _{j=1}^I(y_{ij}-y_{i\cdot })^4\right. \\&\quad \left. -\frac{6((I-1)^3+(I-1)(I-2)/2)}{I^4}\hat{\sigma }_L^4\right) , \end{aligned}$$

where $y_{ij}$ and $y_{i\cdot }$ are defined in Sect. 2. Thus, an asymptotic $1-\alpha $ level confidence interval for the variance $\sigma ^2$ is given by

$$\begin{aligned}&\left( \sqrt{\hat{\mu }_4}\sqrt{\frac{I-1}{I-3}} \sin \left( \sqrt{\frac{I-3}{I-1}} \left( \sqrt{\frac{I-1}{I-3}}\arcsin \left( \sqrt{\frac{I-3}{I-1}}\frac{\hat{\sigma }^2_L}{\sqrt{\hat{\mu }_4}}\right) -\frac{1}{\sqrt{n}}z_{\alpha /2}\right) \right) ,\right. \nonumber \\&\quad \left. \sqrt{\hat{\mu }_4}\sqrt{\frac{I-1}{I-3}} \sin \left( \sqrt{\frac{I-3}{I-1}} \left( \sqrt{\frac{I-1}{I-3}}\arcsin \left( \sqrt{\frac{I-3}{I-1}}\frac{\hat{\sigma }^2_L}{\sqrt{\hat{\mu }_4}}\right) +\frac{1}{\sqrt{n}}z_{\alpha /2}\right) \right) \right) \nonumber \\ \end{aligned}$$

(10)

4.2 Nonparametric hypothesis testing by local averaging

The proposed variance estimate and the idea of local averaging can also be used for the nonparametric testing. For example, for the nonparametric regression model (1), we consider the following hypothesis problem:

$$\begin{aligned} H_0: f(x)=a+bx \quad \text{ vs } \quad H_1: f(x) \ne a+bx \quad \text{ for } \text{ any } \text{ a } \text{ and } \text{ b }. \end{aligned}$$

By the idea of the generalized likelihood ratio of Fan et al. (2001), we can construct the following test statistic,

$$\begin{aligned} T= \frac{n(I-1)}{I} \frac{\text {RSS}_0-\text {RSS}_1}{\text {RSS}_1}, \end{aligned}$$

where $\text {RSS}_0$ is the sum of the squares of the residuals estimated by the least squares method under the null hypothesis $H_0$, and $\text {RSS}_1=(n-k) \hat{\sigma }_L^2$.

As shown by Fan et al. (2001), the null distribution of such test statistic T would be expected to hold the Wilks’ phenomenon, that is, it is model free and follows asymptotically $\chi ^2$ distribution where the degree of freedom may tend to infinity with the sample size. Hence, it is reasonable to suggest the following bootstrap procedure to approximate the null distribution of the test statistic T under the null hypothesis.

Step 1 Based on the samples $(X_i, Y_i), i=1,\ldots ,n $, first calculate the least squares estimate $\hat{a}$ and $\hat{b}$ under the null hypothesis and $\hat{\sigma }^2_L$ by the local averaging method, then compute the test statistic T.
Step 2 Construct a bootstrap example $(X_i, Y_i^*), i=1,\ldots ,n $ by
$$\begin{aligned} Y_i^*=\hat{a}+\hat{b}X_i +\varepsilon _i^*, \end{aligned}$$
where $\varepsilon _i^*$ is sampled from the normal distribution with mean zero and variance $\hat{\sigma }^2_L$. Then, based on the bootstrap sample, calculate the test statistic $T^*$.
Step 3 Repeat Step 2 for B times to obtain the bootstrapped test statistics and sort them in increasing order: $T_1^*\le \cdots \le T_B^*$.
Step 4 If $T\le T^*_{B(1-\alpha )}$, we accept the null hypothesis; otherwise, we reject the null hypothesis.

There is no need to estimate the nonparametric regression function for the proposed testing procedure, which avoids the use of a complex algorithm to select optimal tuning parameter, thus reducing the computational cost and increasing the stability of the testing results. Our numerical simulation results in S1.1 in the supplementary materials show the proposed testing procedure performs reasonably well. Further theoretical investigations are needed, especially for more complicated testing problems.

4.3 Variance function estimation by local averaging

Consider the following nonparametric regression model,

$$\begin{aligned} y_i=f(x_i)+\sigma (x_i)\varepsilon _i, \end{aligned}$$

where $\sigma (x_i)$ is a smoothing variance function, and $\varepsilon _i$ is a random variable with mean zero and variance one.

By the idea of the local averaging, define

$$\begin{aligned} \varepsilon ^{*2}_{Iij} \hat{=} \frac{I}{I-1}(y_{ij}-y_{i\cdot })^2, i=1,\ldots ,k, j=1,\ldots ,I, \end{aligned}$$

and then by some calculations, we can show that

$$\begin{aligned} \text{ E } \varepsilon ^{*2}_{Iij}=\sigma ^2(x_{ij})+O\left( \frac{1}{n}\right) . \end{aligned}$$

Hence, given $(x_{ij},\varepsilon ^{*2}_{Iij}), i=1,\ldots ,k, j=1,\ldots , I$, a local linear regression estimate $\hat{\sigma }_I^2(x_{ij})$ for the variance function $\sigma ^2(\cdot )$ can be obtained. Similar to the idea of the refined local averaging variance estimator, we can derive a refined estimate of variance function. Consider m different $I_1, \ldots , I_m$, define a weighted refined variance estimate

$$\begin{aligned} \varepsilon ^{*2}_{t} = \sum \limits _{l=1}^m I_l \varepsilon ^{*2}_{I_li_lj_l} \Bigg /{\sum \limits _{l=1}^m I_l}, \end{aligned}$$

where $t= (i_l-1)I_l+j_l,$ $l=1,\ldots ,m$, $j_l=1,\ldots ,I_l$. Then, given $(x_{t},\varepsilon ^{*2}_{t}), t=1,\ldots ,n$, a local linear regression estimate $\hat{\sigma }^2(x_{t})$ can be obtained. Such an estimate does not depend on the value of $I_l$ and is expected to be more stable.

5 Numerical studies

Example 1

Consider a nonparametric model

$$\begin{aligned} y_i=5\text {sin}(\omega \pi x_i)+\epsilon _i, 1\le i \le n, \end{aligned}$$

where $x_i=i/n$, and the random errors $\epsilon _i$’s follow an i.i.d. normal distribution with mean zero and variance $\sigma ^2$. Let $n=(100, 500)$, $\omega =(1, 2, 4)$, $\sigma =(0.5, 1.5, 4)$. For each combination of $(n, \omega , \sigma )$, we compute the local averaging variance estimate $\hat{\sigma }^2_L$ with $I=2$ and 5, the refined local averaging variance estimate $\hat{\sigma }^2_*$ based on the estimates of $\hat{\sigma }^2_{L}$ with $I=2,3,\ldots , 11$, the moving averaging-based estimate $\hat{\sigma }^2_M$ with $I=5$, and the kernel-based estimate $\hat{\sigma }^2_K$ with $I=5$. We repeat the simulation for 10000 times, and calculate the relative mean squared errors, $\text {RMSE}=n\text {MSE}/(2\sigma ^4)=\frac{n}{2\sigma ^4}(\text {bias}^2+\text {var})$ for each variance estimator. The closer the RMSE is to 1, the better is the estimator. When $x_i$ is equally spaced, the moving averaging-based variance estimate $\hat{\sigma }^2_M$ and the kernel-based variance estimate $\hat{\sigma }^2_K$ are equivalent. Hence, to make more comparable, $x_i, i=1,\ldots ,n$ is instead sampled from the [0, 1] uniform distribution for the simulation results of the kernel-based estimate $\hat{\sigma }^2_K$ with $I=5$. The proposed variance estimators are compared with $\hat{\sigma }_{\text {R}}^2, \hat{\sigma }_{\text {GSJ}}^2 \ \text {and}\ \hat{\sigma }_{\text {T}}^2(m_\mathrm{s})$. For $\hat{\sigma }_{\text {T}}^2(m_\mathrm{s})$, the data in Table 1 in Tong and Wang (2005) are referenced with $m_\mathrm{s}=n^{\frac{1}{2}}$.

Table 1 depicts the RMSE of all estimators and shows that, in general, $\text {RMSE}_{\hat{\sigma }_{\text {T}}^2(m_\mathrm{s})}<\text {RMSE}_{\hat{\sigma }^2_{L}(5)}<\text {RMSE}_{\hat{\sigma }_{\text {R}}^2}<\text {RMSE}_{\hat{\sigma }^2_{L}(2)}\bumpeq \text {RMSE}_{\hat{\sigma }_{\text {GSJ}}^2}$. As the sample size n tends to infinity, the RMSE of $\hat{\sigma }^2(I)$ tends to $\frac{I}{I-1}$ as shown in Theorem 1. Moreover, as expected, the performance of our local averaging estimator depends on the smoothness of the nonparametric function $f(\cdot )$, the sample size, and the signal-to-noise ratio. When f is rough and $\sigma $ is small, for instance, $(n,\omega ,\sigma )=(100,4,0.5)$, the RMSE of $\hat{\sigma }^2_{L}(5)$ is quite large because the bias as shown above is about $\frac{I(I+1)}{n^2}J$. When the sample size n is much larger than the group size I, the bias is negligible, and the RMSE then converges to $\frac{n}{2\sigma ^4}\text {var}$, which is $1+\frac{1}{I-1}$ for this example. For the refined local averaging variance estimate $\hat{\sigma }^2_*$, not only it generally performs better or at least as well as $\hat{\sigma }^2_{L}(2)$ and $\hat{\sigma }^2_{L}(5)$, but is also much more robust. The results of $\hat{\sigma }^2_{M}(5)$ are quite stable, and are better than $\hat{\sigma }^2_{GSJ}$. The performance of $\hat{\sigma }_K(5)$ is similar as the refined local averaging variance estimate $\hat{\sigma }^2_*$. When the sample variation is larger, such kernel-based variance estimate may out perform other estimators.

Table 1 RMSE for Example 1

Full size table

Figure 1 shows boxplots for the estimates of $\hat{\sigma }^2_{\text{ R }}, \hat{\sigma }^2_{\text{ GSJ }}, \hat{\sigma }^2_{\text{ T }}(m_\mathrm{s}) $, $\hat{\sigma }^2_{L}(2), \hat{\sigma }^2_{L}(5)$ and $\hat{\sigma }^2_{*}$ for $n=(100, 500), \omega =2$ and $\sigma =1.5$. Similar to the results shown by Table 1, the refined local averaging variance estimate $\hat{\sigma }_*^2$ performs better or at least as well as $\hat{\sigma }^2_{\text{ R }}, \hat{\sigma }^2_{\text{ GSJ }}$, $\hat{\sigma }^2_{L}(2)$ and $\hat{\sigma }^2_{L}(5)$, and is quite robust to the sample size. When the sample size is relative large, the performances of $\hat{\sigma }_*^2$ and $\hat{\sigma }_L^2(5)$ are quite comparable to that of the best estimate $\hat{\sigma }^{2}_{\text{ T }}(m_\mathrm{s})$.

Table 2 MSE for Example 1 with small sample size $n=15$

Full size table

To assess the performance of the proposed variance estimates for small sample size data, we consider the above example with $n=15$ and compare our proposed variance estimators with $\hat{\sigma }^2_{GSJ}$, $\hat{\sigma }^2_T(m_\mathrm{s})$ and $\hat{\sigma }_{ols}^2(d_k,m_1)$ proposed by Park et al. (2012) for small sample nonparametric regression. The numerical simulation results of RMSE of these estimates are shown in Table 2. When the signal-to-noise ratio is large, say $\sigma =0.01$, $\hat{\sigma }_{GSL}$ and $\hat{\sigma }_{ols}^2(d_k,m_1)$ perform much better than other estimators. When the signal-to-noise ratio is small, the performances of our proposed estimators, especially the moving averaging-based estimator with $I=2$, are comparable to other estimators (Table 2).

Example 2

Consider a bivariate nonparametric model

$$\begin{aligned} y_i= f(x_i,u_i)+\epsilon _i,\quad {0.5cm}i=1,\ldots , n, \end{aligned}$$

where $x_i=i/n$, $u_i$ is bernoulli distributed with probability 0.5, the bivariate function $f(x_i,u_i) =5\sin (\pi x_i)$ when $u_i=0$ and $f(x_i,u_i) =5\sin (2\pi x_i)$ when $u_i=1$, and the random errors $\epsilon _i$’s follow an i.i.d. normal distribution with mean zero and variance $\sigma ^2$. Our local averaging method continues to apply. The data can be naturally split into subgroups according to the categorical variables. Within each subgroup, the above bivariate nonparametric model reduces to a univariate nonparametric model and between subgroups, the univariate nonparametric model may be different. The local average method can be applied to obtain a variance estimate for each subgroup, and then a single variance estimator is calculated by taking a weighted average of these within-subgroup variance estimates. We consider $n=(200, 400, 800)$, $\sigma =(0.5,1.5,4)$, and for each combination of $(n, \sigma )$, run the simulation for 10000 times. The results of local average estimators of $\hat{\sigma }^2_{L}(2)$,$\hat{\sigma }^2_{L}(5)$ and $\hat{\sigma }^2_{L}(10)$ are summarized in Table 3.

Table 3 Simulation results for Example 2

Full size table

Table 4 Simulation results for Example 3

Full size table

Example 3

Consider a semiparametric partial linear model

$$\begin{aligned} y_i=\mathbf {z}_i ^\mathrm{T}\varvec{\beta }+f(x_i)+\epsilon _i, 1\le i \le n, \end{aligned}$$

where $\varvec{\beta }=(1,3,0,0,0)^\mathrm{T}$, $x_i=i/n$, $f(x_i)=5sin(\omega \pi x_i)$, and the random errors $\epsilon _i$’s follow an i.i.d. normal distribution with mean zero and variance $\sigma ^2$. $\mathbf {z}_i$ follows a multivariate normal distribution with mean zero and covariance matrix with 1 on the diagonal and 0.5 on the off-diagonal. Similar to Example 1, we consider $\omega =(1,2, 4)$, $\sigma =(0.5,1.5,4)$, and $n=(100, 500)$, and for each combination of $(\omega , \sigma , n)$, run the simulation for 10000 times. We calculate $\hat{\sigma }^2_{L}(I)$ for $I=2$ and 5, respectively, and then compare them to the difference-based estimator $\hat{\sigma }^2_{\text {K}}$ proposed by Klipple and Eubank (2007) with $m=2$ and the GSJ weights. It is worth mentioning that under this setting, the estimator $\hat{\sigma }^2_{\text {K}}$ is the same as the estimator proposed by Wang et al. (2011). Table 4 depicts the RMSE of $\hat{\sigma }^2_{L}(2), \hat{\sigma }^2_{L}(5)$, $\hat{\sigma }^2_{\text {K}}$, and $\hat{\sigma }^2_*$ with $m=11$. It shows that in general, $\text {RMSE}_{\hat{\sigma }^2_{L}(2)}\bumpeq \text {RMSE}_{\hat{\sigma }^2_{K}}<\text {RMSE}_{\hat{\sigma }^2_*} <\text {RMSE}_{\hat{\sigma }^2_{L}(5)}$, except for some unstable results for $\hat{\sigma }^2_{L}(5)$ with $\sigma =0.5$.

Example 4

Consider a semiparametric additive model

$$\begin{aligned} y_i=\mathbf {z}_i^\mathrm{T}\varvec{\beta }+f_1(x_{i,1})+f_2(x_{i,2})+f_3(x_{i,3})+\epsilon _i,\quad {0.5cm}i=1,\ldots ,n, \end{aligned}$$

where $\varvec{\beta }=(1,3,0,0,0)^\mathrm{T}$, $f_1(x)=-\sin (2x)$, $f_2(x)=x^2-25/12$, $f_3(x)=\text {exp}(-x)-2\sinh (5/2)/5$, and $\mathbf {z}_i$ is the same as in Example 2. $\mathbf {x}_i=(x_{i,1},x_{i,2},x_{i,3})$ is a three-dimensional random vector, each marginal distribution is a uniform distribution on [0, 1], and the correlation matrix is compound symmetric with $\rho $ on the off-diagonal. Thus, $\mathbf {x}$ is ensured to be bounded. We consider $\rho =(0.25, 0.75)$, $n=(200,400)$, and $\sigma =(0.5, 1.5, 4)$ and run the simulations for 10,000 times.

Table 5 Simulation results for Example 4

Full size table

Table 6 Simulation results for Example 5

Full size table

Rather than take $\mathbf {x}_i=(x_{i,1},x_{i,2},x_{i,3})$ as a whole and cut in on a three-dimensional space, we group $x_{i,1}, x_{i,2}$, and $x_{i,3}$ separately. For each $p, p=1,2,3$, we order $x_{i,p}$ in an ascending order and split $x_{i,p}$ into $k = n/I$ groups with I observations in each group. Denote $x_{ij,p}=x_{I*(i-1)+j,p}$, then for the ith group, we use the average to estimate the function values within this group, i.e. $\hat{f}_p(x_{ij,p})=\frac{1}{I}\sum _{j=1}^{I}f_p(x_{ij,p})\triangleq \alpha _{i,p}$. Thus, the semiparametric additive model can be written as

$$\begin{aligned} y_{i}= & {} \mathbf {z}_{i}^\mathrm{T}\varvec{\beta }+\alpha _{i_1,1}+\alpha _{ i_2,2}+\alpha _{ i_3 ,3}+\epsilon _{i},\\&\quad {0.25cm} i=1,\ldots ,n, \quad {0.25cm} i_1,i_2,i_3=1,\ldots ,k. \end{aligned}$$

The residual variance can then be estimated by the ordinary least squares method via a high-dimensional linear regression model

$$\begin{aligned} \mathbf {y}=\mathbf {D}\varvec{\theta }+\varvec{\epsilon }, \end{aligned}$$

where $\varvec{\theta }=(\varvec{\beta }^\mathrm{T},\alpha _{1,1},\ldots ,\alpha _{k,1},\alpha _{ 1,2},\ldots ,\alpha _{k,2},\alpha _{ 1 ,3},\ldots ,\alpha _{ k ,3})^\mathrm{T}$, $\mathbf {D}$ is the design matrix with the first five columns equal to $\mathbf {Z}$, and the rest is a zero–one matrix indicating the presence of $\alpha _{i,p}$. Table 5 depicts the results of $\hat{\sigma }^2_{L}(4)$, $\hat{\sigma }^2_{L}(5)$, and $\hat{\sigma }^2_{L}(10)$. It shows that the estimation results are better as the sample size increases and are quite robust to the correlation coefficient $\rho $ of the predictor $\mathbf {x}$.

Example 5

Consider a varying coefficient model

$$\begin{aligned} y_i=\text {sin}(2\pi u_i)x_{i1}+4u_i(1-u_i)x_{i2}+\epsilon _i, 1\le i \le n, \end{aligned}$$

where $u_i$’s follow a uniform distribution on [0, 1], $(x_{i1},x_{i2})$’s follow an i.i.d. bivariate normal distribution with means 0, variances 1, and correlation coefficient $1/\sqrt{2}$, and the errors $\epsilon _i$’s follow an i.i.d. normal distribution with mean zero and variance $\sigma ^2$. We consider $n=(100, 500)$ and $\sigma =(0.5, 1.5, 4)$ and run the simulation for 10,000 times. We calculate $\hat{\sigma }^2_{L}(I)$ for $I=5, 10$, respectively, and $\hat{\sigma }^2_*$ with $m=11$. We also calculate the variance estimator $\hat{\sigma }^2_{\text {poly}}$ proposed by Zhang and Lee (2000) with the tricube kernel function, $q=3$ and $h=0.3$ at $u_i=0.5$.

Table 6 lists the sample average and the sample standard deviation of all estimators. It shows that the sample average of each estimator becomes closer to the true value and the standard deviation decreases as the sample size increases. The sample standard deviation of local averaging variance estimators is smaller than that of Zhang and Lee’s variance estimator, for which the sample standard deviation of $\hat{\sigma }^2_{L}(5)$ is about half of that of $\hat{\sigma }^2_{\text {poly}}$.

6 Conclusion and discussion

The residual sum of squares method and difference-based method are two classical approaches in variance estimation. The residual sum of squares method is natural but needs to estimate the unknown function accurately and efficiently, whereas the difference-based method is robust and requires light computational cost but is limited to the univariate case and may not be optimal.

Our proposed local averaging method has the advantages of avoiding to efficiently estimate the nonparametric functions and reducing much computational cost, and it can be easily implemented for various nonparametric and semiparametric models. The basic assumption is that the unknown nonparametric function is smooth enough and can be approximated well by a constant step function locally. Thus, we can reparameterize the nonparametric and semiparametric models into a high-dimensional linear model and estimate the residual variance directly. Under some regularity conditions, we have proved that the local averaging variance estimator is asymptotically normal and achieves the asymptotic optimal rate $O_p(n^{-1})$. Simulation studies show that the local averaging estimator unavoidably involves bias. As the bias is closely related to the group size, we propose a refined local averaging variance estimator by aggregating variance estimates of different group sizes. Moreover, following the suggestion of one referee, by the idea of kernel regression, we also propose a local moving average variance estimation method and a kernel-based variance estimation method. These two methods may improve the efficiency and stability of the local average estimate for the unequally spaced design.

References

Albright SC, Winston WL, Zappe CJ (1999) Data analysis and decision making with Microsoft excel. Duxbury, Pacific Grove, CA
Box GEP, Ramirez J (1986) Studies in quality improvement: signal to noise ratios, performance criteria and statistical analysis: part II. Center for quality and productive improvement. Report 12, University of Wisconsin
Brown LD, Levine M (2007) Variance estimation in nonparametric regression via the differnce sequence method. Ann Stat 35:2219–2232
Article MATH Google Scholar
Brown LD, Levine M, Wang L (2016) A semiparametric multivariate partially linear model: a difference approach. J Stat Plan Inference 178:99–111
Article MathSciNet MATH Google Scholar
Butt WR (1984) Practical immunoassay: the state of the art. CRC Press
Cai TT, Levine M, Wang L (2009) Variance function estimation in multivariate nonparametric regression with fixed design. J Multivar Anal 100:126–136
Article MathSciNet MATH Google Scholar
Carroll RJ (1987) Truncated and censored samples from normal populations. J Am Stat Assoc 82(399):952–952
Article Google Scholar
Carroll RJ, Ruppert D, (1988) Transformation and weighting in regression. CRC Press
Cui X, Lu Y, Peng H (2014) Estimation of partially linear regression model under partial consistency property. arXiv:1401.2163
Dette H, Munk A, Wagner T (1998) Estimating the variance in nonparametric regression—what is a reasonable choice? J R Stat Soc B 60:751–764
Article MathSciNet MATH Google Scholar
Fan J, Peng H (2004) Nonconcave penalized likelihood with a diverging number of parameters. Ann Stat 32:928–961
Article MathSciNet MATH Google Scholar
Fan J, Yao Q (1998) Efficient estimation of conditional variance functions in stochastic regression. Biometrika 85:645–660
Article MathSciNet MATH Google Scholar
Fan J, Zhang C, Zhang J (2001) Generalized likelihood ratio statistics and Wilks phenomenon. Ann Stat 29:153–193
Article MathSciNet MATH Google Scholar
Fan J, Peng H, Huang T (2005) Semilinear high-dimensional model for normalization of microarray data: a theoretical analysis and partial consistency. J Am Stat Assoc 100:781–796
Article MathSciNet MATH Google Scholar
Fan J, Guo S, Hao N (2012) Variance estimation using refitted cross-validation in ultrahigh dimensional regression. J R Stat Soc B 74:37–65
Article MathSciNet Google Scholar
Gasser T, Sroka L, Jennen-Steinmetz C (1986) Residual variance and residual pattern in nonlinear regression. Biometrika 73:625–633
Article MathSciNet MATH Google Scholar
Hall P, Marron JS (1990) On variance estimation in nonparametric regression. Biometrika 77:415–419
Article MathSciNet MATH Google Scholar
Klipple K, Eubank RL (2007) Difference based variance estimators for partially linear models. Festschrift in Honor of Distinguished Professor Mir Masoom Ali On the Occasion of His Retirement, Muncie, IN, USA, pp 313–323
Müller U, Schick A, Wefelmeyer W (2003) Estimating the error variance in nonparametric regression by a covariate-matched U-statistic. Stat J Theor Appl Stat 37:179–188
MathSciNet MATH Google Scholar
Neyman J, Scott EL (1948) Consistent estimates based on partially consistent observations. Econometrica 16:1–32
Article MathSciNet MATH Google Scholar
Park CG, Kim I, Lee Y-S (2012) Error variance estimation via least squares for small sample nonparametric regression. J Stat Plan Inference 142:2369–2385
Article MathSciNet MATH Google Scholar
Rice J (1984) Bandwidth choice for nonparametric regression. Ann Stat 12:1215–1230
Article MathSciNet MATH Google Scholar
Tong T, Wang Y (2005) Estimating residual variance in nonparametric regression using least squares. Biometrika 92:821–830
Article MathSciNet MATH Google Scholar
van der Vaart AW (1998) Asymptotic statistics. Cambridge University Press, Cambridge
Book MATH Google Scholar
Wang L, Brown LD, Cai TT, Levine M (2008) Effect of mean on variance function estimation in nonparametric regression. Ann Stat 36:619–641
Article MathSciNet MATH Google Scholar
Wang L, Brown LD, Cai TT (2011) A difference based approach to the semiparametric partial linear model. Electron J Stat 5:619–641
Article MathSciNet MATH Google Scholar
Zhang W, Lee SY (2000) Variable bandwidth selection in varying-coefficient models. J Multivar Anal 74:116–134
Article MathSciNet MATH Google Scholar

Download references

Acknowledgements

The authors would like to thank the Editor, associate editor, and anonymous referees for their helpful comments and suggestions that have helped us to substantially improve the quality of the paper.

Author information

Authors and Affiliations

Department of Mathematics, Hong Kong Baptist University, Hong Kong, China
Jingxin Zhao & Heng Peng
School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
Tao Huang

Authors

Jingxin Zhao
View author publications
Search author on:PubMed Google Scholar
Heng Peng
View author publications
Search author on:PubMed Google Scholar
Tao Huang
View author publications
Search author on:PubMed Google Scholar

Corresponding author

Correspondence to Tao Huang.

Additional information

Tao Huang’s research was supported in part by the State Key Program in the Major Research Plan of NSFC (No. 91546202), NSFC (No. 11771268) and Program for Innovative Research Team of SHUFE. Heng Peng’s research was supported part by CEGR Grant of the Research Grant Council of Hong Kong (Nos. HKBU202012 and HKBU12302615), FRG grant from the Hong Kong Baptist University (Nos. FRG214-15/064 and FRG2/16-17/042).

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 353 KB)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Zhao, J., Peng, H. & Huang, T. Variance estimation for semiparametric regression models by local averaging. TEST 27, 453–476 (2018). http://doi.org/10.1007/s11749-017-0553-3

Download citation

Received: 18 May 2016
Accepted: 12 September 2017
Published: 20 September 2017
Issue Date: June 2018
DOI: http://doi.org/10.1007/s11749-017-0553-3

Keywords

Mathematics Subject Classification

62G08

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Variance estimation for semiparametric regression models by local averaging

Abstract

Similar content being viewed by others

Model averaging prediction for nonparametric varying-coefficient models with B-spline smoothing

Model averaging estimation for nonparametric varying-coefficient models with multiplicative heteroscedasticity

Effective identification and estimation for the semiparametric measurement error model

Explore related subjects

1 Introduction

2 Variance estimation by local averaging

2.1 Review of classical methods

2.2 Local averaging method

Remark 1

2.3 Theoretical properties

Theorem 1

Remark 2

3 Various extensions

3.1 Partial linear models

Theorem 2

Remark 3

3.2 Varying coefficient models

Theorem 3

Remark 4

Remark 5

3.3 Refined local averaging variance estimator

3.4 Moving local average and kernel-based estimators

4 Applications of local averaging variance estimation

4.1 Confidence interval of variance estimation by local averaging

4.2 Nonparametric hypothesis testing by local averaging

4.3 Variance function estimation by local averaging

5 Numerical studies

Example 1

Example 2

Example 3

Example 4

Example 5

6 Conclusion and discussion

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Electronic supplementary material

Supplementary material 1 (pdf 353 KB)

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Mathematics Subject Classification