
Kernel methods



Kernel method (Wikipedia)


  1. In probability and statistics, density estimation is the construction of an estimate, based on observed data, of an unobservable underlying probability density function. 
  2. The unobservable density function is thought of as the density according to which a large population is distributed; the data are usually thought of as a random sample from that population.
  3.  A kernel is a non-negative real-valued integrable function K satisfying the following two requirements:
    • \int_{-\infty}^{+\infty}K(u)\,du = 1\,;
    • K(-u) = K(u) \mbox{ for all values of } u\,.
  4. The density estimates in the figure below are kernel density estimates using a Gaussian kernel: a Gaussian density function is placed at each data point, and the sum of these density functions is computed over the range of the data.
[Figure] Comparison of the histogram (left) and the kernel density estimate (right) constructed from the same data. The 6 individual kernels are the red dashed curves, and the kernel density estimate is the blue curve. The data points are shown as a rug plot on the horizontal axis.
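A minimal sketch of the Gaussian kernel density estimate described in item 4; the six data points and the bandwidth h below are illustrative assumptions, not values from the figure.

import numpy as np

def gaussian_kernel(u):
    # Standard normal density: non-negative, symmetric, integrates to 1.
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x_grid, data, h):
    # f_hat(x) = (1 / (N*h)) * sum_i K((x - x_i) / h):
    # one scaled Gaussian per data point, summed over the sample.
    u = (x_grid[:, None] - data[None, :]) / h
    return gaussian_kernel(u).sum(axis=1) / (len(data) * h)

data = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])   # six illustrative observations
x_grid = np.linspace(-6.0, 10.0, 400)
f_hat = kde(x_grid, data, h=1.5)                     # bandwidth picked by hand
print(np.trapz(f_hat, x_grid))                       # approximately 1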
  1. Kernel methods (KMs) approach the problem of pattern analysis by mapping the data into a high-dimensional feature space, where each coordinate corresponds to one feature of the data items, transforming the data into a set of points in a Euclidean space. In that space, a variety of methods can be used to find relations in the data. Since the mapping can be quite general (not necessarily linear, for example), the relations found in this way are accordingly very general. This approach is called the kernel trick.
    KMs owe their name to the use of kernel functions, which enable them to operate in the feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the feature space, as in the sketch below.
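A minimal sketch of this idea: for the degree-2 polynomial kernel k(x, z) = (x·z)^2, the kernel value equals the inner product of an explicit quadratic feature map, so the feature-space coordinates never need to be computed. The feature map phi is written out here only to verify the identity; it is not something one would construct in practice.

import numpy as np

def poly_kernel(x, z):
    # Degree-2 homogeneous polynomial kernel: k(x, z) = (x . z)^2
    return np.dot(x, z) ** 2

def phi(x):
    # Explicit feature map for 2-dimensional inputs:
    # phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2), so <phi(x), phi(z)> = (x . z)^2.
    return np.array([x[0]**2, x[1]**2, np.sqrt(2.0) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(poly_kernel(x, z))        # 1.0, computed without touching the feature space
print(np.dot(phi(x), phi(z)))   # 1.0, the same value computed in the feature space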
In the table below, 1{…} is the indicator function.

Kernel | K(u) | \int u^2K(u)\,du | \int K^2(u)\,du
Uniform | K(u) = \frac12 \,\mathbf{1}_{\{|u|\leq1\}} | \frac13 | \frac12
Triangular | K(u) = (1-|u|) \,\mathbf{1}_{\{|u|\leq1\}} | \frac{1}{6} | \frac{2}{3}
Epanechnikov | K(u) = \frac{3}{4}(1-u^2) \,\mathbf{1}_{\{|u|\leq1\}} | \frac{1}{5} | \frac{3}{5}
Quartic (biweight) | K(u) = \frac{15}{16}(1-u^2)^2 \,\mathbf{1}_{\{|u|\leq1\}} | \frac{1}{7} | \frac{5}{7}
Triweight | K(u) = \frac{35}{32}(1-u^2)^3 \,\mathbf{1}_{\{|u|\leq1\}} | \frac{1}{9} | \frac{350}{429}
Gaussian | K(u) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}u^2} | 1 | \frac{1}{2\sqrt\pi}
Cosine | K(u) = \frac{\pi}{4}\cos\left(\frac{\pi}{2}u\right) \mathbf{1}_{\{|u|\leq1\}} | 1-\frac{8}{\pi^2} | \frac{\pi^2}{16}
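A minimal sketch showing how the two moment columns can be checked numerically for a few of these kernels; the kernel definitions follow the K(u) column, and scipy is an assumed dependency.

import numpy as np
from scipy.integrate import quad

# (kernel K(u) on its support, lower integration limit, upper integration limit)
kernels = {
    "uniform":      (lambda u: 0.5,                                    -1.0, 1.0),
    "epanechnikov": (lambda u: 0.75 * (1.0 - u**2),                    -1.0, 1.0),
    "quartic":      (lambda u: 15.0/16.0 * (1.0 - u**2)**2,            -1.0, 1.0),
    "gaussian":     (lambda u: np.exp(-0.5*u**2) / np.sqrt(2*np.pi),   -np.inf, np.inf),
}

for name, (K, a, b) in kernels.items():
    mass = quad(K, a, b)[0]                       # should be 1 for every kernel
    m2   = quad(lambda u: u**2 * K(u), a, b)[0]   # column: int u^2 K(u) du
    l2   = quad(lambda u: K(u)**2, a, b)[0]       # column: int K^2(u) du
    print(f"{name:>12}: {mass:.4f}  {m2:.4f}  {l2:.4f}")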


A kernel smoother is a statistical technique for estimating a real-valued function f(X)\,\,\left( X\in \mathbb{R}^{p} \right) from its noisy observations when no parametric model for the function is known. The estimated function is smooth, and the level of smoothness is set by a single parameter.

Nearest neighbor smoother

The idea of the nearest neighbor smoother is the following: for each point X0, take the m nearest neighbors and estimate the value of Y(X0) by averaging the values of these neighbors.
Formally, h_m (X_0)=\left\| X_0 - X_{[m]} \right\|, where X_{[m]} is the mth-closest neighbor to X0, and
D(t)= \begin{cases}
1/m & \text{if } |t| \le 1 \\
0 & \text{otherwise}
\end{cases}
so that the kernel K_{h_m}(X_0,X)=D\!\left( \tfrac{\left\| X-X_0 \right\|}{h_m(X_0)} \right) gives weight 1/m to each of the m nearest neighbors and weight 0 to all other points.
Example:
[Figure: NNSmoother.svg, nearest neighbor smoother]
In this example, X is one-dimensional. For each X0, \hat{Y}(X_0) is the average of the 16 points closest to X0 (shown in red). The resulting estimate is not smooth.
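A minimal sketch of the nearest neighbor smoother described above; the synthetic data, the underlying sine function, and m = 16 are illustrative assumptions.

import numpy as np

def nn_smoother(x0, x, y, m=16):
    # Average the y-values of the m training points closest to x0,
    # i.e. give each of the m nearest neighbors weight 1/m and all others weight 0.
    idx = np.argsort(np.abs(x - x0))[:m]
    return y[idx].mean()

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 100))                    # synthetic 1-D inputs
y = np.sin(4.0 * x) + rng.normal(scale=0.3, size=x.size)   # noisy observations

x_grid = np.linspace(0.0, 1.0, 200)
y_hat = np.array([nn_smoother(x0, x, y) for x0 in x_grid])
# y_hat changes in steps as the neighbor set changes, hence it is not smooth.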

Kernel average smoother

The idea of the kernel average smoother is the following. For each data point X0, choose a constant distance size λ (kernel radius, or window width for p = 1 dimension), and compute a weighted average for all data points that are closer than λ to X0 (the closer to X0 points get higher weights).
Formally, h_\lambda(X_0) = \lambda = \text{constant}, and D(t) is one of the popular kernels listed in the table above.
Example:
[Figure: KernelSmoother.svg, kernel average smoother]
For each X0 the window width is constant, and the weight of each point in the window is indicated schematically by the yellow shading in the figure. The estimate is smooth, but the boundary points are biased: when X0 is close to the boundary, the window contains unequal numbers of points to the left and to the right of X0.
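A minimal sketch of the kernel average smoother described above, using the Epanechnikov kernel from the table; the synthetic data and the window width λ = 0.2 are illustrative assumptions.

import numpy as np

def epanechnikov(t):
    return 0.75 * (1.0 - t**2) * (np.abs(t) <= 1.0)

def kernel_average_smoother(x0, x, y, lam=0.2):
    # Weighted average of the points within distance lam of x0;
    # points closer to x0 receive larger weights D(|x - x0| / lam).
    w = epanechnikov(np.abs(x - x0) / lam)
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 100))
y = np.sin(4.0 * x) + rng.normal(scale=0.3, size=x.size)

x_grid = np.linspace(0.0, 1.0, 200)
y_hat = np.array([kernel_average_smoother(x0, x, y) for x0 in x_grid])
# The estimate is smooth, but near x = 0 and x = 1 the window is one-sided,
# which produces the boundary bias discussed above.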

Local linear regression

In the two previous sections we assumed that the underlying Y(X) function is locally constant, and therefore we could use a weighted average for the estimation. The idea of local linear regression is to fit locally a straight line (or a hyperplane for higher dimensions) rather than a constant (horizontal line). After fitting the line, the estimate \hat{Y}(X_{0}) is given by the value of this line at the point X0. By repeating this procedure for each X0, one obtains the estimated function \hat{Y}(X). As in the previous section, the window width is constant: h_\lambda(X_0) = \lambda = \text{constant}. Formally, the local linear regression is computed by solving a weighted least squares problem.
For one dimension (p = 1):
\min_{\alpha (X_0),\,\beta (X_0)} \sum\limits_{i=1}^N K_{h_{\lambda }}(X_0,X_i)\left( Y(X_i)-\alpha (X_0)-\beta (X_{0})X_i \right)^2
which gives the estimate
\hat{Y}(X_{0})=\alpha (X_{0})+\beta (X_{0})X_{0}
The closed form solution is given by:
\hat{Y}(X_0)=\left( 1,X_0 \right)\left( B^{T}W(X_0)B \right)^{-1}B^{T}W(X_0)y
where:
  • y=\left( Y(X_1),\dots,Y(X_N) \right)^T
  • W(X_0)= \operatorname{diag} \left( K_{h_{\lambda }}(X_0,X_i) \right)_{N\times N}
  • B^{T}=\left( \begin{matrix}
   1 & 1 & \dots & 1  \\
   X_{1} & X_{2} & \dots & X_{N}  \\
\end{matrix} \right)
Example:
[Figure: Localregressionsmoother.svg, local linear regression]
The resulting function is smooth, and the bias at the boundary points is removed.
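A minimal sketch of the closed-form local linear estimate above, with B, W(X0), and y built as defined; the Epanechnikov kernel, the synthetic data, and λ = 0.2 are assumed choices.

import numpy as np

def epanechnikov(t):
    return 0.75 * (1.0 - t**2) * (np.abs(t) <= 1.0)

def local_linear(x0, x, y, lam=0.2):
    # Y_hat(x0) = (1, x0) (B^T W(x0) B)^{-1} B^T W(x0) y
    B = np.column_stack([np.ones_like(x), x])            # N x 2 design matrix
    W = np.diag(epanechnikov(np.abs(x - x0) / lam))      # kernel weights on the diagonal
    coef = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)     # (alpha(x0), beta(x0))
    return np.array([1.0, x0]) @ coef

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 100))
y = np.sin(4.0 * x) + rng.normal(scale=0.3, size=x.size)

x_grid = np.linspace(0.0, 1.0, 200)
y_hat = np.array([local_linear(x0, x, y) for x0 in x_grid])
# Fitting a local line instead of a local constant removes the boundary bias.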

Local polynomial regression

Instead of fitting locally linear functions, one can fit polynomial functions.
For p=1, one should minimize:
\min_{\alpha (X_{0}),\,\beta _{j}(X_{0}),\,j=1,\dots,d} \sum\limits_{i=1}^{N} K_{h_{\lambda }}(X_{0},X_{i})\left( Y(X_{i})-\alpha (X_{0})-\sum\limits_{j=1}^{d}{\beta _{j}(X_{0})X_{i}^{j}} \right)^{2}
with \hat{Y}(X_{0})=\alpha (X_{0})+\sum\limits_{j=1}^{d}{\beta _{j}(X_{0})X_{0}^{j}}
In the general case (p > 1), one should minimize:
\hat{\beta }(X_{0})=\underset{\beta (X_{0})}{\arg \min }\,\sum\limits_{i=1}^{N} K_{h_{\lambda }}(X_{0},X_{i})\left( Y(X_{i})-b(X_{i})^{T}\beta (X_{0}) \right)^{2}
b(X)=\left( 1,\ X_{1},\ X_{2},\,\dots,\ X_{1}^{2},\ X_{2}^{2},\,\dots,\ X_{1}X_{2},\,\dots \right)
\hat{Y}(X_{0})=b(X_{0})^{T}\hat{\beta }(X_{0})
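A minimal sketch of the p = 1 local polynomial fit above: the design matrix simply gains the columns X_i^j up to degree d. The degree d = 2, the Epanechnikov kernel, the synthetic data, and λ = 0.3 are assumed choices.

import numpy as np

def epanechnikov(t):
    return 0.75 * (1.0 - t**2) * (np.abs(t) <= 1.0)

def local_poly(x0, x, y, d=2, lam=0.3):
    # Weighted least squares with polynomial basis (1, X, X^2, ..., X^d)
    # and weights K_h(x0, x_i); the estimate is the fitted polynomial at x0.
    B = np.vander(x, N=d + 1, increasing=True)           # columns 1, X, ..., X^d
    W = np.diag(epanechnikov(np.abs(x - x0) / lam))
    beta = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)
    b0 = np.array([x0**j for j in range(d + 1)])         # basis evaluated at x0
    return b0 @ beta

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 100))
y = np.sin(4.0 * x) + rng.normal(scale=0.3, size=x.size)
y_hat = np.array([local_poly(x0, x, y) for x0 in np.linspace(0.0, 1.0, 200)])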






