Table of Contents
Front Cover
Machine Learning
Copyright
Contents
About the Author
Preface
Acknowledgments
Notation
1 Introduction
1.1 The Historical Context
1.2 Artificial Intelligence and Machine Learning
1.3 Algorithms Can Learn What Is Hidden in the Data
1.4 Typical Applications of Machine Learning
Speech Recognition
Computer Vision
Multimodal Data
Natural Language Processing
Robotics
Autonomous Cars
Challenges for the Future
1.5 Machine Learning: Major Directions
1.5.1 Supervised Learning
Classification
Regression
1.6 Unsupervised and Semisupervised Learning
1.7 Structure and a Road Map of the Book
References
2 Probability and Stochastic Processes
2.1 Introduction
2.2 Probability and Random Variables
2.2.1 Probability
Relative Frequency Definition
Axiomatic Definition
2.2.2 Discrete Random Variables
Joint and Conditional Probabilities
Bayes Theorem
2.2.3 Continuous Random Variables
2.2.4 Mean and Variance
Complex Random Variables
2.2.5 Transformation of Random Variables
2.3 Examples of Distributions
2.3.1 Discrete Variables
The Bernoulli Distribution
The Binomial Distribution
The Multinomial Distribution
2.3.2 Continuous Variables
The Uniform Distribution
The Gaussian Distribution
The Central Limit Theorem
The Exponential Distribution
The Beta Distribution
The Gamma Distribution
The Dirichlet Distribution
2.4 Stochastic Processes
2.4.1 First- and Second-Order Statistics
2.4.2 Stationarity and Ergodicity
2.4.3 Power Spectral Density
Properties of the Autocorrelation Sequence
Power Spectral Density
Transmission Through a Linear System
Physical Interpretation of the PSD
2.4.4 Autoregressive Models
2.5 Information Theory
2.5.1 Discrete Random Variables
Information
Mutual and Conditional Information
Entropy and Average Mutual Information
2.5.2 Continuous Random Variables
Average Mutual Information and Conditional Information
Relative Entropy or Kullback-Leibler Divergence
2.6 Stochastic Convergence
Convergence Everywhere
Convergence Almost Everywhere
Convergence in the Mean-Square Sense
Convergence in Probability
Convergence in Distribution
Problems
References
3 Learning in Parametric Modeling: Basic Concepts and Directions
3.1 Introduction
3.2 Parameter Estimation: the Deterministic Point of View
3.3 Linear Regression
3.4 Classification
Generative Versus Discriminative Learning
3.5 Biased Versus Unbiased Estimation
3.5.1 Biased or Unbiased Estimation?
3.6 The Cramér-Rao Lower Bound
3.7 Sufficient Statistic
3.8 Regularization
Inverse Problems: Ill-Conditioning and Overfitting
3.9 The Bias-Variance Dilemma
3.9.1 Mean-Square Error Estimation
3.9.2 Bias-Variance Tradeoff
3.10 Maximum Likelihood Method
3.10.1 Linear Regression: the Nonwhite Gaussian Noise Case
3.11 Bayesian Inference
3.11.1 The Maximum a Posteriori Probability Estimation Method
3.12 Curse of Dimensionality
3.13 Validation
Cross-Validation
3.14 Expected Loss and Empirical Risk Functions
Learnability
3.15 Nonparametric Modeling and Estimation
Problems
MATLAB® Exercises
References
4 Mean-Square Error Linear Estimation
4.1 Introduction
4.2 Mean-Square Error Linear Estimation: the Normal Equations
4.2.1 The Cost Function Surface
4.3 A Geometric Viewpoint: Orthogonality Condition
4.4 Extension to Complex-Valued Variables
4.4.1 Widely Linear Complex-Valued Estimation
Circularity Conditions
4.4.2 Optimizing With Respect to Complex-Valued Variables: Wirtinger Calculus
4.5 Linear Filtering
4.6 MSE Linear Filtering: a Frequency Domain Point of View
Deconvolution: Image Deblurring
4.7 Some Typical Applications
4.7.1 Interference Cancelation
4.7.2 System Identification
4.7.3 Deconvolution: Channel Equalization
4.8 Algorithmic Aspects: the Levinson and Lattice-Ladder Algorithms
Forward and Backward MSE Optimal Predictors
4.8.1 The Lattice-Ladder Scheme
Orthogonality of the Optimal Backward Errors
4.9 Mean-Square Error Estimation of Linear Models
4.9.1 The Gauss-Markov Theorem
4.9.2 Constrained Linear Estimation: the Beamforming Case
4.10 Time-Varying Statistics: Kalman Filtering
Problems
MATLAB® Exercises
References
5 Online Learning: the Stochastic Gradient Descent Family of Algorithms
5.1 Introduction
5.2 The Steepest Descent Method
5.3 Application to the Mean-Square Error Cost Function
Time-Varying Step Sizes
5.3.1 The Complex-Valued Case
5.4 Stochastic Approximation
Application to the MSE Linear Estimation
5.5 The Least-Mean-Squares Adaptive Algorithm
5.5.1 Convergence and Steady-State Performance of the LMS in Stationary Environments
Convergence of the Parameter Error Vector
5.5.2 Cumulative Loss Bounds
5.6 The Affine Projection Algorithm
Geometric Interpretation of APA
Orthogonal Projections
5.6.1 The Normalized LMS
5.7 The Complex-Valued Case
The Widely Linear LMS
The Widely Linear APA
5.8 Relatives of the LMS
The Sign-Error LMS
The Least-Mean-Fourth (LMF) Algorithm
Transform-Domain LMS
5.9 Simulation Examples
5.10 Adaptive Decision Feedback Equalization
5.11 The Linearly Constrained LMS
5.12 Tracking Performance of the LMS in Nonstationary Environments
5.13 Distributed Learning: the Distributed LMS
5.13.1 Cooperation Strategies
Centralized Networks
Decentralized Networks
5.13.2 The Diffusion LMS
5.13.3 Convergence and Steady-State Performance: Some Highlights
5.13.4 Consensus-Based Distributed Schemes
5.14 A Case Study: Target Localization
5.15 Some Concluding Remarks: Consensus Matrix
Problems
MATLAB® Exercises
References
6 The Least-Squares Family
6.1 Introduction
6.2 Least-Squares Linear Regression: a Geometric Perspective
6.3 Statistical Properties of the LS Estimator
The LS Estimator Is Unbiased
Covariance Matrix of the LS Estimator
The LS Estimator Is BLUE in the Presence of White Noise
The LS Estimator Achieves the Cramér-Rao Bound for White Gaussian Noise
Asymptotic Distribution of the LS Estimator
6.4 Orthogonalizing the Column Space of the Input Matrix: the SVD Method
Pseudoinverse Matrix and SVD
6.5 Ridge Regression: a Geometric Point of View
Principal Components Regression
6.6 The Recursive Least-Squares Algorithm
Time-Iterative Computations
Time Updating of the Parameters
6.7 Newton's Iterative Minimization Method
6.7.1 RLS and Newton's Method
6.8 Steady-State Performance of the RLS
6.9 Complex-Valued Data: the Widely Linear RLS
6.10 Computational Aspects of the LS Solution
Cholesky Factorization
QR Factorization
Fast RLS Versions
6.11 The Coordinate and Cyclic Coordinate Descent Methods
6.12 Simulation Examples
6.13 Total Least-Squares
Geometric Interpretation of the Total Least-Squares Method
Problems
MATLAB® Exercises
References
7 Classification: a Tour of the Classics
7.1 Introduction
7.2 Bayesian Classification
The Bayesian Classifier Minimizes the Misclassification Error
7.2.1 Average Risk
7.3 Decision (Hyper)Surfaces
7.3.1 The Gaussian Distribution Case
Minimum Distance Classifiers
7.4 The Naive Bayes Classifier
7.5 The Nearest Neighbor Rule
7.6 Logistic Regression
7.7 Fisher's Linear Discriminant
7.7.1 Scatter Matrices
7.7.2 Fisher's Discriminant: the Two-Class Case
7.7.3 Fisher's Discriminant: the Multiclass Case
7.8 Classification Trees
7.9 Combining Classifiers
No Free Lunch Theorem
Some Experimental Comparisons
Schemes for Combining Classifiers
7.10 The Boosting Approach
The AdaBoost Algorithm
The Log-Loss Function
7.11 Boosting Trees
Problems
MATLAB® Exercises
References
8 Parameter Learning: a Convex Analytic Path
8.1 Introduction
8.2 Convex Sets and Functions
8.2.1 Convex Sets
8.2.2 Convex Functions
8.3 Projections Onto Convex Sets
8.3.1 Properties of Projections
8.4 Fundamental Theorem of Projections Onto Convex Sets
8.5 A Parallel Version of POCS
8.6 From Convex Sets to Parameter Estimation and Machine Learning
8.6.1 Regression
8.6.2 Classification
8.7 Infinitely Many Closed Convex Sets: the Online Learning Case
8.7.1 Convergence of APSM
Some Practical Hints
8.8 Constrained Learning
8.9 The Distributed APSM
8.10 Optimizing Nonsmooth Convex Cost Functions
8.10.1 Subgradients and Subdifferentials
8.10.2 Minimizing Nonsmooth Continuous Convex Loss Functions: the Batch Learning Case
The Subgradient Method
The Generic Projected Subgradient Scheme
The Projected Gradient Method (PGM)
Projected Subgradient Method
8.10.3 Online Learning for Convex Optimization
The PEGASOS Algorithm
8.11 Regret Analysis
Regret Analysis of the Subgradient Algorithm
8.12 Online Learning and Big Data Applications: a Discussion
Approximation, Estimation, and Optimization Errors
Batch Versus Online Learning
8.13 Proximal Operators
8.13.1 Properties of the Proximal Operator
8.13.2 Proximal Minimization
Resolvent of the Subdifferential Mapping
8.14 Proximal Splitting Methods for Optimization