Supervised, Semi-supervised, and Unsupervised Learning

Kernel Based Algorithms for Mining Huge Data Sets

Te-Ming Huang, Vojislav Kecman, Ivica Kopriva
Series: Studies in Computational Intelligence, Vol. 17
Springer-Verlag, Berlin, Heidelberg, 2006
260 pp., 96 illus., Hardcover, ISBN 3-540-31681-7

This is the first book that treats the fields of supervised, semi-supervised and unsupervised machine learning in a unifying way. In particular, it is the first textbook presentation of the standard and improved graph-based semi-supervised (manifold) algorithms. The book presents both the theory and the algorithms for mining huge data sets by using support vector machines (SVMs) in an iterative way. It shows in detail and with great care how kernel-based SVMs can be used for dimensionality reduction (feature elimination). The book also shows the similarities and differences between the two most popular unsupervised techniques, principal component analysis (PCA) and independent component analysis (ICA). It demonstrates that PCA, which decorrelates data pairs, is optimal for Gaussian sources and suboptimal for non-Gaussian ones. It also points out the necessity of using ICA for non-Gaussian sources, as well as ICA’s inefficiency in the case of Gaussian ones. The PCA algorithm known as the whitening, or sphering, transform is derived. Batch and adaptive ICA algorithms are derived through the minimization of mutual information, which is an exact measure of statistical (in)dependence between data pairs.
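
As a rough illustration of the whitening (sphering) transform mentioned above, the following is a minimal Python sketch and not the book's MATLAB routine. It assumes two-dimensional, non-degenerate data so that the eigendecomposition of the covariance matrix can be written in closed form without external libraries. The function names `whiten_2d` and `covariance_2d` are hypothetical and chosen only for this example.

```python
import math
import random


def covariance_2d(points):
    """Return the sample covariance of (x, y) pairs as ((a, b), (b, c))."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    a = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    c = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    return ((a, b), (b, c))


def whiten_2d(points):
    """PCA-whiten (x, y) pairs: zero mean and identity sample covariance.

    Assumes the covariance matrix has two strictly positive eigenvalues.
    """
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    (a, b), (_, c) = covariance_2d(points)

    # Closed-form eigendecomposition of a symmetric 2x2 matrix:
    # the principal axes are the coordinate axes rotated by theta.
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    half_diff = math.hypot(0.5 * (a - c), b)
    lam1 = 0.5 * (a + c) + half_diff  # variance along (cos_t, sin_t)
    lam2 = 0.5 * (a + c) - half_diff  # variance along (-sin_t, cos_t)

    # Project onto the principal axes (decorrelate), then rescale each
    # component to unit variance (sphere).
    return [
        ((cos_t * x + sin_t * y) / math.sqrt(lam1),
         (-sin_t * x + cos_t * y) / math.sqrt(lam2))
        for x, y in centered
    ]


# Usage: mix two independent Gaussian sources, then whiten the mixture.
rng = random.Random(0)
data = []
for _ in range(500):
    s1, s2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    data.append((2.0 * s1 + s2, s1 - 0.5 * s2))

whitened = whiten_2d(data)
print(covariance_2d(data))      # correlated mixture
print(covariance_2d(whitened))  # close to the identity matrix
```

The second printed covariance should equal the identity matrix up to floating-point rounding, which is all that whitening guarantees: the components are uncorrelated and have unit variance. For non-Gaussian sources, uncorrelated components are not necessarily independent, which is the motivation for ICA given in the text.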

The theory presented is followed by software and/or algorithmic solutions, which make the presentation much easier to understand. The book is rich in graphics and contains many examples, which help the reader understand the concepts in a pleasant way and enable him/her to develop his/her own code for solving the problems. All the algorithms presented are applied to several real-world benchmark problems in bioinformatics (gene microarrays), text categorization, and numeral recognition, as well as in the de-mixing of images and audio signals (blind source separation).

The book covers a broad range of machine learning algorithms and is aimed at senior undergraduate students, graduate students, and practicing researchers and scientists who want to use and develop kernel-based models rather than simply study them.

This book is accompanied by this site, from which readers can download the data and the software ISDA and SemiL for modeling huge data sets in a supervised and semi-supervised manner, respectively. In addition, the site contains MATLAB-based PCA and ICA routines for unsupervised learning, as well as a MATLAB implementation of a conjugate gradient algorithm for solving linear systems of equations with box constraints. It also contains some other material used in the book and additional links to related websites. Readers may therefore find it helpful to visit this site occasionally and download the newest versions of the software and/or data files introduced in the book.