Chapter 2 Non-parametric regression
In this chapter we discuss different ideas and methods for nonparametric regression (Martinez & Martinez, 2007). Before explaining the nonparametric methods we recall the linear regression model; this will help us gain a better understanding of the different approaches to this problem. A refinement of the linear regression model that can accommodate nonlinear patterns is polynomial regression:
\[\begin{equation} Y_{i}=\beta_{0}+\beta_{1}X_{i}+\beta_{2}X_{i}^{2}+\dots+\beta_{p}X_{i}^{p}+\epsilon_{i}, \quad p \in\mathbb{N} \tag{2.1} \end{equation}\]
We have to note that although we have used polynomial terms such as \(X^2\), the model is still linear with respect to the coefficients \(\beta_{i}\). This model is considered parametric because we assume a certain model for the relationship between the predictors and the response and then estimate its coefficients. In general, parametric models carry certain assumptions and are considered less flexible. We usually fit these models by estimating the parameters with various estimation methods. In the case of linear regression, the model can be written in matrix form as \(Y=X\beta+\epsilon\), where we assume \(E(\epsilon)=0\) and \(V(\epsilon)=\sigma^{2}I\). The estimate of the parameter vector is obtained in closed form as \(\hat\beta=(X^{T}X)^{-1}X^{T}Y\).
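As a concrete illustration, the short R sketch below fits a polynomial regression by evaluating \(\hat\beta=(X^{T}X)^{-1}X^{T}Y\) directly; the simulated data and the chosen degree are hypothetical and purely for demonstration.

```r
## Minimal sketch: polynomial regression via the closed-form least-squares
## estimator beta-hat = (X'X)^{-1} X'Y. Data are simulated for illustration.
set.seed(1)
n <- 100
x <- seq(0, 1, length.out = n)
y <- 2 + 3 * x - 5 * x^2 + rnorm(n, sd = 0.2)   # true relationship is quadratic

p <- 3                                  # degree of the fitted polynomial
X <- outer(x, 0:p, `^`)                 # design matrix: columns 1, x, x^2, x^3
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

## the same fit with the built-in lm() for comparison
fit <- lm(y ~ poly(x, p, raw = TRUE))
```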
In the nonparametric case we do not assume a parametric form for our model; instead we try to fit a more flexible model that can be expressed in the following form:
\[Y_{i}= f(X_{i})+\epsilon_{i}\]
Generally speaking, the function \(f\) in this equation is considered to be a smooth and bounded function that allows a nonlinear relationship between the predictor and the response. In this project we focus on the case where there is one response and one predictor variable. In this problem we look for functions that minimize the error, which is essentially the difference between the true values of the function and the estimated values. This is given by the following expression: \[E[(Y_i -\hat f(X_{i}))^{2}]\]
where \(\hat f(X_{i})\) is the estimate of the function \(f\) at the \(i\)-th position. In the following we discuss the kernel smoothing method.
2.1 Kernel Smoothing
The kernel smoothing estimator estimates the regression function locally, which means it fits an estimate \(\hat y_0\) for each point \(x_0\). This estimate is a \(d\)-th degree polynomial fitted by weighted least squares on the points in a neighborhood of the point \(x_0\). In order to get a better estimate, closer points are given larger weights and points further away are given smaller weights. This is achieved by using a kernel function, which usually has a smoothing parameter that controls the bandwidth of the neighborhood and thereby influences the weights that the points receive. The polynomial at each point can be expressed by the following equation: \[\begin{equation} \hat Y=\beta_0+\beta_1(X_i-x)+\dots+\beta_d(X_i-x)^d \tag{2.2} \end{equation}\]
Now we associate each point with the weight obtained from our kernel function
\[\begin{equation} K_h(X_i-x)=\frac{1}{h}K(\frac{X_i-x}{h}) \tag{2.3} \end{equation}\]
where in (2.3), \(h\) is the bandwidth. If the bandwidth is very small, the curve becomes non-smooth and wiggly because the estimate only depends on the points in a very small neighborhood; this may result in overfitting. In contrast, if \(h\) is chosen too large, the estimated curve becomes smoother and the bias of the model grows. We look for the \(\hat\beta_i\) which minimize the following expression: \[\sum^n_{i=1}K_h(X_i-x)\big(Y_i-\beta_0-\beta_1(X_i-x)-\dots-\beta_d(X_i-x)^d\big)^2\] Now if we put our terms in matrix form we have the following: \[X_x= \begin{pmatrix} 1 & X_1-x & \dots& (X_1-x)^d \\ \vdots & \vdots &\ddots & \vdots \\ 1 & X_n-x & \dots & (X_n-x)^d \end{pmatrix} \] and the weights can be put in a diagonal matrix as follows:
\[W_x= \begin{pmatrix} K_h(X_1-x) & 0 & \dots& 0 \\ \vdots & \vdots &\ddots & \vdots \\ 0 & 0 & \dots & K_h(X_n-x) \end{pmatrix} \] Now the estimate \(\hat \beta=(\hat \beta_0,\hat \beta_1,\hat \beta_2,\dots,\hat \beta_d)\) can be obtained from the following equation: \[\hat\beta=(X_x^TW_xX_x)^{-1}X_x^TW_xY\] The estimator \(\hat f(x)\) is the intercept coefficient \(\hat\beta_0\) of the fitted local polynomial. We can extract it by simply multiplying the estimated \(\hat\beta\) by \(e_1^T=(1,\underbrace{0,\dots,0}_{d\text{ times}})\), the first vector of the standard basis of the \((d+1)\)-dimensional vector space, where \(d\) is the degree of the polynomial in (2.2). Therefore we have the following:
\[\hat f(x)= e_1^T(X_x^TW_xX_x)^{-1}X_x^TW_xY\]
More information on this topic can be found in (Martinez & Martinez, 2007).
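To make the procedure concrete, a minimal R sketch is given below. It implements the local polynomial estimator \(\hat f(x)= e_1^T(X_x^TW_xX_x)^{-1}X_x^TW_xY\) with a Gaussian kernel; the simulated data, the degree \(d=1\) and the hand-picked bandwidth are illustrative assumptions, not prescriptions.

```r
## Local polynomial (kernel) smoother: fhat(x0) = e1' (Xx' Wx Xx)^{-1} Xx' Wx Y.
## Gaussian kernel; data, bandwidth and degree are illustrative choices.
local_poly <- function(x0, xs, ys, h, d = 1) {
  w  <- dnorm((xs - x0) / h) / h            # kernel weights K_h(X_i - x0)
  Xx <- outer(xs - x0, 0:d, `^`)            # local design matrix
  W  <- diag(w)
  beta_hat <- solve(t(Xx) %*% W %*% Xx, t(Xx) %*% W %*% ys)
  beta_hat[1]                               # intercept = fitted value at x0
}

set.seed(1)
x <- sort(runif(200))
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)
grid <- seq(0, 1, length.out = 100)
fhat <- sapply(grid, local_poly, xs = x, ys = y, h = 0.05)
```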
2.2 Penalized spline
In this family of estimators, instead of using a single polynomial for the whole data set, we can use different polynomials for different parts of the data. For example:
\[ Y_i = \begin{cases} \beta_{01}+\beta_{11}x_i+\beta_{21}x_i^2+\varepsilon_i & \quad\text{if}\quad x_i<\phi \\ \beta_{02}+\beta_{12}x_i+\beta_{22}x_i^2+\varepsilon_i & \quad\text{if}\quad x_i\geq\phi \\ \end{cases} \]
where \(\phi\) is called a knot. At the boundaries between the regions the polynomials should be continuous, and it is more common to require continuous derivatives up to order two. In general, a linear spline with knots \(\phi_k\), \(k=1,\dots,K\), is a piece-wise linear polynomial that is continuous at each knot; this means that the set of all linear splines with fixed knots is a vector space, and therefore a basis can be chosen for it. In the following we introduce two common bases for this vector space. A cubic spline is a piece-wise cubic polynomial with continuous derivatives up to order 2 at each knot; we can write the equation as follows \[y=\sum^{K+4}_{m=1}\beta_mh_m(x)+\epsilon\] where \[h_k(x)=x^{k-1},\quad k=1,\dots,4\] and \[h_{k+4}(x)=(x-\phi_k)^3_+,\quad k=1,\dots,K,\] where \((x-\phi_k)_+=x-\phi_k\) if \(x>\phi_k\) and \((x-\phi_k)_+=0\) otherwise. In general, an order-\(M\) spline is defined as below
\[h_k(x)=x^{k-1},\quad k=1,\dots,M,\] \[h_{k+M}(x)=(x-\phi_k)^{M-1}_+,\quad k=1,\dots,K,\]
where \[ (x-\phi_k)^{M-1}_+ = \begin{cases} (x-\phi_k)^{M-1} & \quad\text{if}\quad x>\phi_k \\ 0 & \quad \text{otherwise.}\\ \end{cases} \]
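To make the truncated power basis concrete, the R sketch below constructs the cubic case (\(M=4\)) for a few hypothetical knots and fits the coefficients \(\beta_m\) by ordinary least squares; the data and knot positions are chosen only for illustration.

```r
## Cubic regression spline via the truncated power basis: h_k(x) = x^{k-1}
## for k = 1..4 and h_{k+4}(x) = (x - phi_k)^3_+. Knots and data are illustrative.
truncated_power_basis <- function(x, knots) {
  H <- outer(x, 0:3, `^`)                               # 1, x, x^2, x^3
  for (phi in knots) H <- cbind(H, pmax(x - phi, 0)^3)  # (x - phi_k)^3_+
  H
}

set.seed(1)
x <- seq(0, 1, length.out = 150)
y <- sin(2 * pi * x) + rnorm(150, sd = 0.25)
knots <- c(0.25, 0.5, 0.75)

H <- truncated_power_basis(x, knots)
beta_hat <- solve(t(H) %*% H, t(H) %*% y)   # least-squares spline coefficients
fitted_spline <- H %*% beta_hat
```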
However, in practice the cubic spline is the highest order we use. Another class of bases that is frequently used is the B-splines (Burden, Faires, & Reynolds, 2001). They were introduced in 1946 by I. J. Schoenberg (Schoenberg, 1946), and in 1972 Carl de Boor (De Boor, 1972) described a recursion formula for their evaluation with high computational stability. Now let \(\phi_0,\phi_1,\phi_2,\dots,\phi_K\) be the knots, sorted in non-decreasing order. The elements of the basis are the following: \[B_{i,1}(x)= \begin{cases} 1 & \quad\text{if}\quad \phi_i\leq x \leq \phi_{i+1} \\ 0 & \quad\text{otherwise} \\ \end{cases}\] where \(i=1,\dots,(K+2M-1)\). The term \(B_{i,m}(x)\) can be calculated by the recursion \[B_{i,m}(x)=\frac{x-\phi_i}{\phi_{i+m-1}-\phi_i}B_{i,m-1}(x)+\frac{\phi_{i+m}-x}{\phi_{i+m}-\phi_{i+1}}B_{i+1,m-1}(x)\] where \(i=1,\dots,K+2M-m\). If the number of knots is large, overfitting may occur, so to avoid overfitting penalized splines were introduced. A penalized spline is a function which minimizes the following criterion: \[\begin{equation} RSS(f,\lambda)=\sum^{N}_{i=1}(y_i-f(x_i))^2+\lambda\int(f''(t))^2dt \tag{2.4} \end{equation}\]
where \(\lambda\) is fixed and called the smoothing parameter, and \(RSS\) is called the penalized residual sum of squares. The first term is the usual measure of goodness of fit and the second term penalizes the curvature of the function \(f\). A larger \(\lambda\) results in a less flexible, smoother curve; in contrast, a smaller \(\lambda\) results in a more flexible model and a more wiggly curve.
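For illustration, the sketch below builds a cubic B-spline basis with splines::bs() and fits a penalized (smoothing) spline with smooth.spline(), which minimizes a criterion of the form (2.4) and selects the smoothing parameter by cross-validation; the simulated data and knot positions are hypothetical.

```r
## Illustrative use of built-in spline tools on simulated data.
library(splines)

set.seed(1)
x <- seq(0, 1, length.out = 150)
y <- sin(2 * pi * x) + rnorm(150, sd = 0.25)

## cubic B-spline basis with three interior knots, fitted by least squares
B      <- bs(x, knots = c(0.25, 0.5, 0.75), degree = 3)
fit_bs <- lm(y ~ B)

## penalized (smoothing) spline: minimizes a criterion of the form (2.4);
## the smoothing parameter is chosen by (generalized) cross-validation by default
fit_ss <- smooth.spline(x, y)
fhat   <- predict(fit_ss, x)$y
```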
2.3 First generation wavelets
Wavelets and wavelet transforms first appeared in the mathematical literature around 30 years ago (Grossmann & Morlet, 1984). The continuous wavelet transform was created by modifying the windowed Fourier transform, which is presented below:
\[F_h[f](t,\omega)=\frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}}f(y)\overline{h(y-t)}e^{-i\omega y}dy\]
where \(f\in L_2(\mathbb{R})\) is a time-continuous signal and \(h\in L_2(\mathbb{R})\). We merge the window function \(h\) and the Fourier transform component \(e^{-i \omega y}\) into one window function, which we call \(\psi\), and which has the property of being scalable (Jansen & Oonincx, 2005), (Apostol, 1974). After this modification we end up with the continuous wavelet transform: \[W_\psi[f](a,b)=\frac{1}{\sqrt{a}}\int_{\mathbb{R}}f(t)\overline{\psi\Big(\frac{t-b}{a}\Big)}dt\] where \(\psi \in L_2(\mathbb{R})\) is a wavelet function, \(0<a<+\infty\) is called the scaling parameter and \(b\) the localization parameter, and the wavelet transform depends continuously on them (Jansen & Oonincx, 2005). These parameters correspond to the two desirable features of wavelets. Localization means that a discontinuity in the regression function \(f(x)\) only affects the \(\psi(x)\) near it; this gives us the ability to detect localized patterns in the data. The scaling feature means that the domain of each \(\psi(x)\) can be scaled separately, so that we can ‘zoom in’ and analyze the regression function at a set of desired scales.
We can discretize the wavelet transform by restricting the parameters to a discrete subset and replacing the double integral in the reconstruction by a double summation; a common choice for the discrete subset is a dyadic lattice. We then use a particular basis of \(L_2(\mathbb{R})\) for the transformation, which commonly has the following form \[\{\psi(2^jt-k)\mid j,k\in\mathbb{Z}\}\]
Therefore, the reconstruction of the signal takes the following form:
\[f(t)=\sum_{j=-\infty}^{\infty}\sum_{k=-\infty}^{\infty}\alpha_{j,k}\psi_{j,k}(t)\]
where \(\psi_{j,k}(t)=2^{j/2}\psi(2^{j}t-k)\) and \(\alpha\) is a double-indexed coefficient sequence that satisfies
\[m||\alpha||^2_2 \leq ||f||^2_2 \leq M||\alpha||^2_2 \]
for some constants \(0<m,M<+\infty\); this guarantees that \(f\) is bounded. It can be shown that the computational complexity of this method is \(\mathcal{O}(N)\), compared to \(\mathcal{O}(N\log N)\) for the widely used fast Fourier transform, which adds the benefit of faster computation (Jansen & Oonincx, 2005). For estimation we apply the wavelet transform to the data to obtain the wavelet coefficients. The coefficients that fall below the error threshold are dropped, and the inverse transform is then applied to obtain the fitted values.
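As a small illustration of this estimation procedure, the sketch below uses the wavethresh package (Nason, 2016) on simulated, equally spaced data whose length is a power of two; it follows the transform–threshold–invert recipe described above, using the package defaults (the exact arguments and thresholding policies are left to the package documentation).

```r
## Sketch of wavelet-shrinkage regression with wavethresh (Nason, 2016).
## Simulated data; n must be a power of two on an equally spaced grid.
library(wavethresh)

set.seed(1)
n <- 2^8                                 # n = 256 = 2^l
t <- (1:n) / n                           # regular grid t_i = i/n
y <- sin(4 * pi * t) + rnorm(n, sd = 0.3)

ywd  <- wd(y)                            # forward discrete wavelet transform
ywdT <- threshold(ywd)                   # drop coefficients below the threshold
fhat <- wr(ywdT)                         # inverse transform gives fitted values
```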
There are some problems surrounding this approach. First, in practice the number of observations should be a power of two, \(n=2^l\) for some \(l\in \mathbb{N}\). However, this is not the major issue and it can be tackled rather easily. The major problem is that the observed data have to lie on a regularly spaced grid such as \(t_i=\frac{i}{n}\). One approach to solving this problem is to interpolate the data onto a regularly spaced grid, which has been incorporated into the R package wavethresh (Nason, 2016). This may seem a good idea, but in two-dimensional cases it has proved to be computationally intensive and not useful in practice (Herrick, 2000). This has encouraged researchers to develop new techniques, called “second-generation wavelets”, to overcome this problem. In the next section we discuss this in more detail.
2.4 Second generation wavelets
In the previous section we introduced the first generation wavelet method and listed some of its prominent features. We discussed the localization and scalability of wavelets and their relatively high computational efficiency. These qualities give this method a remarkable place among regression estimation methods. We also noted that the main disadvantage of this method is its reliance on the Fourier transform, which restricts it to an equally spaced grid. While there are solutions, such as interpolating the data onto an equally spaced grid, performing the wavelet transformation and then interpolating back to the actual grid, these solutions have proven to be computationally expensive.
Consequently, a new approach called second generation wavelets (Sweldens, 1998) has been developed over the past two decades, which not only does not rely on Fourier transforms for computation but also retains all the good properties of the wavelet method. The idea behind this new method is called the lifting scheme (Sweldens, 1996).
We now construct the lifting scheme and briefly explain how it works. Suppose \(\lambda_{0,k}=f(k)\) for \(k\in\mathbb{Z}\). We would like to be able to extract the information in the data from smaller sub-samples with larger distances between elements. In other words, our aim is to capture the information in fewer coefficients; however, this may not be achievable exactly, and in that case an approximation within an acceptable margin of error suffices. We reduce the number of coefficients by splitting them into two separate sequences of even and odd samples, using the definition below:
\[\lambda_{-1,k}:= \lambda_{0,2k} \quad \text{for}\quad k \in\mathbb{Z}\]
We need to record the information that is lost in passing from \(\lambda_{0,k}\) to \(\lambda_{-1,k}\) in this procedure. This difference is recorded in \(\gamma_{-1,k}\), and we call these the wavelet coefficients. These coefficients are calculated by averaging two adjacent even samples and subtracting this average from the odd sample in between, as can be seen in the following:
\[\gamma_{-1,k}=\lambda_{0,2k+1}-\frac{\lambda_{-1,k}+\lambda_{-1,k+1}}{2}\]
As in the first generation wavelet methods, the coefficients that fall below the threshold are ignored. In other words, the wavelet coefficients measure the extent to which the function departs from being linear. This method does not depend on the actual distances between the points of the grid, so it is not restricted to a dyadic or equally spaced grid.
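A minimal base-R sketch of one level of this split-and-predict step is given below. It is a plain real-valued illustration of the lifting idea, not the complex-valued scheme of the package discussed next; the simulated signal and the boundary handling (repeating the last even sample) are illustrative assumptions.

```r
## One level of the lifting scheme described above (simulated signal, base R only).
## Even samples become the coarse approximation lambda_{-1,k}; the detail
## (wavelet) coefficients gamma_{-1,k} are the odd samples minus the average
## of their two neighbouring even samples.
lift_one_level <- function(lambda0) {
  n      <- length(lambda0)
  even   <- lambda0[seq(1, n, by = 2)]          # lambda_{-1,k} = lambda_{0,2k}
  odd    <- lambda0[seq(2, n, by = 2)]          # lambda_{0,2k+1}
  nextev <- c(even[-1], even[length(even)])     # lambda_{-1,k+1}; last value repeated at the boundary
  gamma  <- odd - (even + nextev) / 2           # wavelet (detail) coefficients
  list(coarse = even, detail = gamma)
}

set.seed(1)
x    <- sin(2 * pi * (1:64) / 64) + rnorm(64, sd = 0.1)
lev1 <- lift_one_level(x)
```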
The package that has been used in this project is CNLTreg (Nunes & Knight, 2018), which was written by the authors of the article (Hamilton et al., 2018). In this article they present a new complex-valued approach to second generation wavelets, in which the lifting scheme is constructed in two branches: a real-valued branch and an imaginary branch.