Kernel trick
The kernel trick was first published by Aizerman, Braverman, and Rozonoer in the 1964 paper Theoretical foundations of the potential function method in pattern recognition learning.
The kernel trick uses Mercer's condition, which states that any symmetric, positive semi-definite kernel K(x, y) can be expressed as a dot product in a high-dimensional space.
More specifically, if the arguments to the kernel lie in a measurable space X, and if the kernel is positive semi-definite, i.e.,
- <math>\sum_{i,j} K(x_i,x_j) c_i c_j \ge 0</math>
for any finite sequence x1, ..., xn of points in X and any sequence c1, ..., cn of real numbers, then there exists a function φ(x) whose range lies in an inner product space of possibly high (even infinite) dimension, such that
- <math>K(x,y) = \phi(x)\cdot\phi(y).</math>
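As a concrete illustration (an added sketch, not part of the original text): for the homogeneous polynomial kernel K(x, y) = (x · y)² on R², one explicit feature map is φ(x) = (x1², √2 x1x2, x2²), and the identity K(x, y) = φ(x) · φ(y) can be checked numerically. The kernel choice and names below are purely illustrative.

```python
import numpy as np

def poly_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: K(x, y) = (x . y)^2."""
    return np.dot(x, y) ** 2

def phi(x):
    """An explicit feature map for this kernel on R^2, mapping into R^3."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# The kernel computes the feature-space dot product without ever forming phi.
assert np.isclose(poly_kernel(x, y), np.dot(phi(x), phi(y)))
print(poly_kernel(x, y))  # 1.0
```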
The kernel trick can be applied to any algorithm that depends solely on dot products between vectors: wherever a dot product appears, it is replaced with the kernel function. A linear algorithm is thereby transformed into a non-linear one, namely the same linear algorithm operating in the range space of φ. Because the kernel is used in place of explicit coordinates, φ is never actually computed; this is desirable because the feature space may be infinite-dimensional (as is the case for the Gaussian kernel). A sketch of this substitution follows below.
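To make the substitution concrete, here is a minimal Python sketch (an addition for illustration, assuming a Gaussian kernel and toy data; none of the names come from the original text) of the perceptron in dual form. Every dot product between training points in the linear algorithm is replaced by a kernel evaluation, turning the linear perceptron into a non-linear classifier while φ is never computed.

```python
import numpy as np

def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel; its feature space is infinite-dimensional."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def train_kernel_perceptron(X, labels, kernel, epochs=10):
    """Dual-form perceptron: every dot product x_j . x_i of the linear
    algorithm is replaced by a kernel evaluation K(x_j, x_i)."""
    n = len(X)
    alpha = np.zeros(n)               # one dual coefficient per training point
    for _ in range(epochs):
        for i in range(n):
            # Decision value f(x_i) = sum_j alpha_j * y_j * K(x_j, x_i)
            f = sum(alpha[j] * labels[j] * kernel(X[j], X[i]) for j in range(n))
            if labels[i] * f <= 0:    # misclassified (or undecided): update
                alpha[i] += 1.0
    return alpha

def predict(X, labels, alpha, kernel, x):
    f = sum(alpha[j] * labels[j] * kernel(X[j], x) for j in range(len(X)))
    return 1 if f > 0 else -1

# XOR-style toy data: not linearly separable in the input space.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
labels = np.array([1, 1, -1, -1])
alpha = train_kernel_perceptron(X, labels, gaussian_kernel)
print([predict(X, labels, alpha, gaussian_kernel, x) for x in X])  # [1, 1, -1, -1]
```

On this XOR-style data, which no linear perceptron can separate, the kernelized version classifies all four points correctly.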
The kernel trick has been applied to several algorithms in machine learning and statistics, including the perceptron, support vector machines, and principal component analysis.
The coiner of the term kernel trick is unknown.
References
- M. Aizerman, E. Braverman, and L. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.