Machine Learning: a Lecture Note
Abstract: This lecture note is intended to equip early-year master's and PhD students in data science or a related discipline with foundational ideas in machine learning. It starts with basic ideas in modern machine learning, with classification as the main target task. These basic ideas include loss formulation, backpropagation, stochastic gradient descent, generalization, model selection, as well as fundamental building blocks of artificial neural networks. Building on these ideas, the lecture note explores in depth the probabilistic approach to unsupervised learning, covering directed latent variable models, products of experts, generative adversarial networks and autoregressive models. Finally, the note ends by covering a diverse set of further topics, such as reinforcement learning, ensemble methods and meta-learning. After reading this lecture note, a student should be ready to embark on studying and researching more advanced topics in machine learning and, more broadly, artificial intelligence.
Summary
- The note presents an energy-function framework that unifies inference, learning, and various machine learning paradigms.
- The note details how gradient-based techniques like backpropagation and SGD optimize parameters in models including classifiers and neural networks.
- It explores probabilistic modeling and advanced topics such as VAEs, RBMs, and adaptive optimizers for scalable, efficient ML implementations.
This lecture note presents machine learning concepts through the lens of an energy function. An energy function e(x,z,θ) assigns a real value to a pair of an observed instance x and a latent instance z, parametrized by θ. Low energy indicates high compatibility or preference. This unifying perspective allows deriving various machine learning paradigms by minimizing the energy function with respect to different variables.
Inference: Given a partial observation, inference means minimizing the energy function with respect to the unobserved part. For instance, in supervised learning with observed pairs (x,y) and no latent variable z, predicting the output ŷ for a new input x′ involves minimizing e([x′,y],∅,θ) over possible y. In clustering, with observed x and latent cluster assignment z, inference is ẑ = argmin_z e(x,z,θ).
Learning: Estimating the parameter θ typically involves minimizing the expected energy on observed data, often with regularization to ensure high energy for undesirable inputs. When latent variables exist, learning requires simultaneously solving the inference problem. The lecture emphasizes that all algorithms are presented to work with scalable implementations, particularly stochastic gradient descent (SGD) and its variants.
The core machine learning problem is thus decomposed into three aspects:
- Defining the energy function e (parametrization).
- Estimating θ from data (learning).
- Inferring missing parts given partial observations (inference).
Classification
Classification is presented as a supervised learning problem where the output y is a discrete category. With no latent variable, inference is ŷ(x) = argmin_{y′∈Y} e([x,y′],∅,θ). A common parametrization uses a feature extractor f(x,θ) that outputs a vector of scores for each category, with e([x,y],∅,θ) = 1(y)⊤f(x,θ), where 1(y) is the one-hot vector for category y. A simple example is linear classification: f(x,θ) = Wx + b, where θ = (W,b).
Learning θ is framed as minimizing an average loss function over a training dataset.
- Zero-One Loss: L_{0-1}([x,y],θ) = 1(y ≠ ŷ(x)). This loss is non-differentiable and piecewise constant, making gradient-based optimization difficult.
- Margin Loss (Hinge Loss): L_margin([x,y],θ) = max(0, m + e([x,y],∅,θ) − e([x,ŷ′],∅,θ)), where ŷ′ is the most competitive incorrect output, i.e. the lowest-energy y′ ≠ y. This loss encourages a margin m > 0 between the true output's energy and the best incorrect output's energy. The perceptron loss is a special case with m = 0. The gradient is non-zero only when the margin is violated.
- Softmax and Cross-Entropy Loss: This approach converts energy scores a_y = e([x,y],∅,θ) into a probability distribution pθ(y∣x) using the softmax function: pθ(y∣x) = exp(−e([x,y],∅,θ)) / Σ_{y′∈Y} exp(−e([x,y′],∅,θ)). This form can be derived by maximizing entropy subject to normalization constraints. The learning objective is the negative log-likelihood, or cross-entropy loss: L_ce([x,y],θ) = −log pθ(y∣x). Its gradient is ∇θe([x,y],∅,θ) − E_{y′∣x;θ}[∇θe([x,y′],∅,θ)]. This "Boltzmann machine learning" rule involves a "positive phase" (decreasing the energy of the correct output) and a "negative phase" (increasing the expected energy over all outputs, weighted by the model's probability). The cross-entropy loss is widely used in practice due to its differentiability and probabilistic interpretation.
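As a concrete illustration, the softmax conversion from energies to probabilities and the resulting cross-entropy loss can be sketched in a few lines of NumPy (the function names here are our own, not from the note):

```python
import numpy as np

def softmax_from_energies(a):
    """Convert per-class energies a_y into p(y|x) = exp(-a_y) / sum_y' exp(-a_y')."""
    z = -a - np.max(-a)                  # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def cross_entropy(a, y):
    """Cross-entropy loss -log p(y|x), with y the index of the correct class."""
    return -np.log(softmax_from_energies(a)[y])

energies = np.array([1.2, -0.3, 0.5])    # lower energy = more compatible
probs = softmax_from_energies(energies)
loss = cross_entropy(energies, 1)        # class 1 has the lowest energy here
```

Because the probabilities are normalized, decreasing the correct class's energy (the positive phase) necessarily shifts probability mass toward it and away from the other classes.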
Backpropagation
Backpropagation is presented as the method for computing gradients of the loss function with respect to model parameters, particularly for composite, differentiable energy functions (like neural networks). It's a form of reverse-mode automatic differentiation.
For a linear energy function e([x,y],∅,θ) = −w_y⊤x − b_y, the gradients are simple: ∇_{w_y}e = −x and ∇_{b_y}e = −1. Applying this to the perceptron loss yields updates that lower the energy of the correct output and raise the energy of the predicted incorrect output.
The core idea of backpropagation is shown by considering the gradient of the loss with respect to an intermediate transformation of the input, h = F(x,θ′). If the loss gradient w.r.t. h is ∇_h L, then the gradient w.r.t. θ′ can be computed using the chain rule. This "back-propagation" of the gradient signal ∇_h L through the transformation F (e.g., F(x) = σ(U⊤x+c)) allows computing gradients w.r.t. the parameters U and c. For h = σ(U⊤x+c), ∇_U L = x(∇_h L ⊙ h′)⊤ and ∇_c L = ∇_h L ⊙ h′, where h′ = σ′(U⊤x+c). The term h′ accounts for the non-linearity's derivative. By stacking such differentiable transformations, the gradient can be propagated backward through the entire network. The feasibility of computing these gradients efficiently makes backpropagation the standard for training neural networks. Libraries like PyTorch and JAX automate this process.
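A minimal sketch of these backpropagation formulas, checked against a finite-difference approximation (the toy squared loss and all variable names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
U = rng.normal(size=(3, 4))
c = rng.normal(size=4)
t = rng.normal(size=4)                     # arbitrary target for a toy loss

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

def loss(U, c):
    h = sigmoid(U.T @ x + c)
    return 0.5 * np.sum((h - t) ** 2)      # L = 1/2 ||h - t||^2

# Forward pass
s = U.T @ x + c
h = sigmoid(s)
grad_h = h - t                             # dL/dh for the squared loss
h_prime = h * (1 - h)                      # sigma'(s) for the logistic sigmoid

# Backprop formulas from the text
grad_U = np.outer(x, grad_h * h_prime)     # x (grad_h ⊙ h')^T
grad_c = grad_h * h_prime

# Finite-difference check on one entry of U
eps = 1e-6
U2 = U.copy(); U2[0, 0] += eps
fd = (loss(U2, c) - loss(U, c)) / eps
```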
Stochastic Gradient Descent
Minimizing the average loss f(θ) = (1/N) Σ_{i=1}^N f_i(θ) for large N and large |θ| is computationally expensive. SGD addresses this by using a stochastic gradient estimate g_{i_t} = ∇f_{i_t}(θ_t) based on a single example i_t (or a small minibatch) at each step: θ_{t+1} = θ_t − α_t g_{i_t}.
The Descent Lemma states f(y) ≤ f(x) + ∇f(x)⊤(y−x) + (L/2)∥y−x∥² for an L-Lipschitz gradient. For full gradient descent with step α_t, this implies f(θ_{t+1}) ≤ f(θ_t) − (α_t − (L/2)α_t²)∥∇f(θ_t)∥², suggesting an optimal step of 1/L. For SGD, the expected loss at the next step satisfies E[f(θ_{t+1})] ≤ f(θ_t) − α_t∥∇f(θ_t)∥² + (Lα_t²/2) E∥g_{i_t}∥². The variance of the stochastic gradient introduces a positive term. To ensure expected progress or convergence near a minimum, the learning rate α_t typically needs to decrease over time.
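A tiny runnable sketch of SGD with a decaying step size on the toy objective f_i(θ) = (θ − a_i)², whose minimizer is the mean of the a_i (the particular schedule is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(loc=3.0, size=1000)   # f_i(theta) = (theta - a_i)^2; minimum at mean(a)

theta = 0.0
for t in range(5000):
    i = rng.integers(len(a))          # pick one example uniformly at random
    g = 2.0 * (theta - a[i])          # stochastic gradient of f_i
    alpha = 0.5 / (1.0 + 0.01 * t)    # decreasing step size, as the analysis suggests
    theta -= alpha * g
```

With a constant step size the iterate would keep bouncing in a noise ball around the minimum; the decaying schedule shrinks that ball over time.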
Adaptive Learning Rate Methods: Instead of a fixed scalar learning rate, adaptive methods adjust the learning rate for each parameter dynamically based on past gradients.
- Adagrad (Duchi et al., 2011) scales the learning rate for each parameter inversely to the square root of the sum of its past squared gradients. This helps parameters with small gradients take larger steps and vice versa. Limitation: Learning rates monotonically decay, potentially stopping training prematurely.
- Adam (Kingma & Ba, 2014) uses exponential moving averages of both the gradient (momentum, m_t) and the squared gradient (second moment, v_t). The update rule, applied per parameter, is θ_t ← θ_{t−1} − α m_t / (√v_t + ϵ). Adam's per-parameter learning rates do not decay monotonically, and it often performs better in practice, especially for non-convex optimization. Adam or its variants are the de facto standard optimizers.
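The Adam update can be sketched as follows; the hyperparameter values are the common defaults, and the bias-correction terms follow the original formulation:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of grad and squared grad."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias correction for the moving averages
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize ||theta||^2 as a toy problem
theta = np.array([5.0, -3.0])
m = np.zeros(2); v = np.zeros(2)
for t in range(1, 3001):
    grad = 2 * theta                      # gradient of ||theta||^2
    theta, m, v = adam_step(theta, grad, m, v, t)
```

Note how the effective step size α·m̂/(√v̂ + ε) is roughly bounded by α regardless of the raw gradient magnitude, which is one reason Adam is robust to gradient scaling.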
Generalization and Model Selection
The goal is to minimize the expected risk R(θ) = E_data[L([x,y],θ)], which is intractable. We instead minimize the empirical risk R̂(θ) = (1/N) Σ_{n=1}^N L([x_n,y_n],θ) on a training set D. Generalization bounds quantify the likely gap |R(θ) − R̂(θ)|. Using Hoeffding's inequality, for a fixed θ, R(θ) < R̂(θ) + √(log(2/δ)/(2N)) with probability at least 1−δ. For a finite hypothesis space Θ of size |Θ|, a union bound gives R(θ) < R̂(θ) + √((log|Θ| + log(1/δ))/(2N)) with probability at least 1−δ, simultaneously for all θ ∈ Θ. This shows that the generalization gap grows with model complexity (the size of Θ) and shrinks with the data size N. For infinite hypothesis spaces, concepts like the VC dimension [cs/9901010] or PAC-Bayesian bounds are needed. PAC-Bayesian bounds [cs/9901010, cs/9807005] offer a more actionable perspective, relating the expected risk under a distribution of models Q(θ) to the empirical risk under Q and the KL divergence between Q and a prior P: R(Q) ≤ R̂(Q) + √((D_KL(Q∥P) + log((N+1)/δ))/(2N)). This suggests that minimizing empirical risk is important, but keeping the learned model distribution Q close to a prior P also helps generalization.
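To make the finite-hypothesis bound concrete, here is a small numeric sketch of how the gap term √((log|Θ| + log(1/δ))/(2N)) behaves (the specific numbers are arbitrary):

```python
import numpy as np

def gap_bound(num_hypotheses, N, delta=0.05):
    """Union-bound generalization gap for a finite hypothesis space."""
    return np.sqrt((np.log(num_hypotheses) + np.log(1.0 / delta)) / (2.0 * N))

small_space = gap_bound(10, N=10_000)       # few hypotheses: tight gap
large_space = gap_bound(10**9, N=10_000)    # many hypotheses: larger gap
more_data = gap_bound(10**9, N=1_000_000)   # more data: gap shrinks again
```

The logarithmic dependence on |Θ| is what makes the bound usable at all: a billion-fold larger hypothesis space only widens the gap by a small constant factor.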
Bias, Variance, and Uncertainty: The expected squared error can be decomposed into irreducible error (noise in data, aleatoric uncertainty), bias2 (how well the average prediction across model variations matches the true mean), and variance (how much predictions vary across model variations, epistemic uncertainty). Complex models tend to have low bias but high variance; simple models have high bias but low variance. Learning involves balancing this tradeoff.
Uncertainty in Error Rate:
- Confidence Interval: Quantifies uncertainty in the test set error for a fixed model, assuming the test set is a random sample. Using the Central Limit Theorem (for large test sets), a confidence interval for accuracy can be estimated using the sample mean and variance of the per-instance loss.
- Credible Interval: Quantifies uncertainty in test error due to model variation (e.g., from different training runs, random initialization). This requires considering a distribution over models p(θ∣D).
- Training Set Variation: The uncertainty due to the specific training set realization can be assessed using methods like Bootstrap resampling, creating multiple "training sets" by sampling with replacement and training models on each.
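The bootstrap idea in the last bullet can be sketched directly; here the per-instance test losses are simulated rather than coming from a real model:

```python
import numpy as np

rng = np.random.default_rng(2)
losses = rng.binomial(1, 0.12, size=500)    # per-instance 0-1 losses on a test set

# Bootstrap: resample the test set with replacement and recompute the error rate
boot = np.array([
    losses[rng.integers(0, len(losses), size=len(losses))].mean()
    for _ in range(2000)
])
lo, hi = np.percentile(boot, [2.5, 97.5])   # 95% interval for the error rate
```

The same resampling trick applies to training sets: resample D, retrain on each replicate, and look at the spread of the resulting models' test errors.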
Hyperparameter Tuning: Hyperparameters λ control the learning process itself. Tuning involves finding λ that minimizes the generalization error, typically estimated on a validation set D_val: Tune(D_val, D) = argmin_λ E_ϵ[ R̂(Learn(D; λ, ϵ); D_val) ], where ϵ denotes the randomness of learning. Because the learning process Learn is often a blackbox function of λ, blackbox optimization methods are used:
- Random Search (Bergstra & Bengio, 2012): Sample candidate λ's from a prior distribution and evaluate them in parallel on the validation set.
- Sequential Model-Based Optimization (SMBO) (Snoek et al., 2012): Build an uncertainty-aware model (e.g., a Gaussian process) that predicts the validation risk of λ given previously evaluated λ-risk pairs. Use this model's prediction and uncertainty to select the next promising λ to evaluate, often by maximizing an acquisition function such as Expected Improvement.
A separate test set Dtest must be used for final, unbiased evaluation after hyperparameters are tuned, often by training the final model on D∪Dval with the best λ.
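A minimal random-search sketch, with a synthetic stand-in for the train-and-validate loop (the quadratic "risk" surface and the log-uniform prior over learning rates are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def validation_risk(lam):
    """Stand-in for training with hyperparameter lam and scoring on D_val.

    The true optimum is lam = 1e-2; small noise mimics training stochasticity.
    """
    return (np.log10(lam) + 2.0) ** 2 + 0.01 * rng.normal()

# Random search: sample learning rates log-uniformly over [1e-5, 1] and keep the best
candidates = 10 ** rng.uniform(-5, 0, size=50)
risks = np.array([validation_risk(lam) for lam in candidates])
best_lam = candidates[np.argmin(risks)]
```

Sampling on a log scale matters here: learning rates spanning several orders of magnitude would be covered very unevenly by a uniform prior on the raw value.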
Building Blocks of Neural Networks
Beyond simple linear transformations, neural networks are built from various differentiable blocks:
- Normalization: Techniques applied within the network to normalize activations, improving optimization conditioning (making Hessian closer to identity).
- Batch Normalization (Batch Norm) (Ioffe & Szegedy, 2015): Normalizes inputs across the batch dimension (mean and variance are computed per feature across the batch). Has different behavior during training (uses batch statistics) and inference (uses population statistics, typically running averages).
- Layer Normalization (Layer Norm) (Ba et al., 2016): Normalizes inputs across feature dimensions within each instance. Avoids the training/inference discrepancy of Batch Norm but can break relationships between instances if not used carefully (e.g., magnitude-based classification).
- Convolutional Blocks: Leverage spatial (or temporal) structure by applying learned filters locally and repeatedly. This introduces translation equivariance (or invariance when combined with pooling/reduction) and is suitable for grid-like data (images, time series).
- Recurrent Blocks: Process sequential data by applying the same function iteratively, maintaining a hidden state (memory) that depends on previous inputs. Allows unbounded context size in principle. Examples like Gated Recurrent Units (GRUs) (Cho et al., 2014) use gating mechanisms to mitigate vanishing gradients [9402007].
- Attention Blocks: Operate on sets or sequences, allowing each output element to be computed by aggregating information from all input elements based on learned compatibility scores (attention weights). This provides permutation equivariance for sets. For sequences, positional encoding (additive or relative (Su et al., 2021)) is added to input features to inject position information. Masking (causal masking) is used for autoregressive sequence processing so output at position i only depends on inputs up to position i−1. Multi-headed attention allows learning different aggregation patterns.
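The attention bullet above can be made concrete with a single-head causal self-attention sketch in NumPy (here the mask lets position i attend to positions ≤ i, the usual decoder convention; all weights are random, for illustration only):

```python
import numpy as np

def causal_attention(X, Wq, Wk, Wv):
    """Single-head self-attention with a causal mask: output i uses inputs <= i."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)   # positions j > i
    scores[mask] = -np.inf                                   # masked out before softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(4)
T, d = 5, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Y = causal_attention(X, Wq, Wk, Wv)

# Causality check: perturbing the last input leaves all earlier outputs unchanged
X2 = X.copy(); X2[-1] += 1.0
Y2 = causal_attention(X2, Wq, Wk, Wv)
```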
Probabilistic Machine Learning and Unsupervised Learning
Probabilistic models explicitly define probability distributions. An energy function e(x,z,θ) can define a joint distribution p(x,z;θ)∝exp(−e(x,z,θ)). Directed models decompose p(x,z) as p(z)p(x∣z). Unsupervised learning focuses on modeling the data distribution p(x), often by marginalizing out latent variables z: p(x)=∫p(z)p(x∣z)dz.
Variational Inference (VI) [9803056]: Approximates the intractable true posterior p(z∣x) with a tractable distribution q(z;ϕ(x)) by minimizing their KL divergence DKL(q∥p). This minimization is equivalent to maximizing the Evidence Lower Bound (ELBO): logp(x;θ)≥Ez∼q[logp(x∣z;θ)]−DKL(q(z;ϕ(x))∥p(z)). Maximizing the ELBO jointly optimizes q (finding a better approximation to p(z∣x)) and θ (improving the model's fit to x).
- Variational Gaussian Mixture Models (MoG): With z as a discrete cluster ID, p(z) a categorical prior, and p(x∣z) a Gaussian, the ELBO can be optimized via Expectation-Maximization (EM). In this case, the optimal q(z∣x) is the true posterior, making the ELBO tight. This connects to K-Means clustering as a hard-assignment version of EM (or MoG at β→0).
- Continuous Latent Variable Models: With z a continuous vector and p(x∣z) Gaussian with mean F(z;θ), the ELBO is E_{z∼q}[−(1/2)∥x − F(z;θ)∥²] − D_KL(q(z;ϕ(x))∥p(z)), up to constants. For linear F and Gaussian q, p, this recovers Probabilistic PCA [9905026]. In general, however, the gradient w.r.t. ϕ(x) is intractable.
- Variational Autoencoders (VAEs) (Kingma et al., 2013, Rezende et al., 2014): Address tractability for nonlinear F and Gaussian q by:
- 1. Amortized Inference: Using an inference network G(x;θG) to predict the parameters ϕ(x) of q(z∣x) (e.g., mean and variance for a Gaussian q). This avoids storing per-instance parameters and enables generalization to new data.
- 2. Reparametrization Trick: Sampling z∼q(z∣ϕ(x)) as z=g(ϵ;ϕ(x)) where ϵ is sampled from a fixed distribution (e.g., N(0,I) for Gaussian q). This makes the sample z a deterministic function of ϕ(x) and ϵ, allowing backpropagation through the sampling process.
- The VAE objective becomes maximizing E_ϵ[−(1/2)∥x − F(G(x;θ_G) + σϵ; θ)∥²] − D_KL(q∥p) (using the reparametrization). This can be trained end-to-end with SGD. The objective has a reconstruction term and a prior-matching/regularization term for q.
- Importance Sampling: Provides an unbiased way to estimate the marginal likelihood p(x) from VAEs or other generative models after training. By sampling z from the learned approximate posterior q(z∣x) instead of the prior p(z) and re-weighting, E_{z∼p(z)}[p(x∣z)] = E_{z∼q(z∣x)}[ p(x∣z) p(z) / q(z∣x) ]. The learned q serves as an effective proposal distribution for a low-variance estimate.
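The reparametrization trick in the list above amounts to one line of code; this sketch checks that z = μ + σϵ reproduces the intended Gaussian and that the sample is differentiable in μ (the specific numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)

def reparam_sample(mu, log_var, eps):
    """z = mu + sigma * eps: a deterministic function of (mu, sigma) and noise eps."""
    return mu + np.exp(0.5 * log_var) * eps

mu, log_var = 1.5, np.log(0.25)            # q(z|x) = N(1.5, 0.25)
eps = rng.normal(size=100_000)             # fixed noise from N(0, 1)
z = reparam_sample(mu, log_var, eps)

# Gradients flow through the sample: dz/dmu = 1 for every draw
h = 1e-6
dz_dmu = (reparam_sample(mu + h, log_var, eps) - z) / h
```

Because the randomness is isolated in ϵ, the same finite-difference (or autodiff) machinery that works for any deterministic function also works through the "sampling" step.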
Undirected Generative Models
Undirected models (e.g., Boltzmann Machines) define a joint distribution p(x,z)∝exp(−e(x,z)) directly from an energy function, without assuming a causal direction. The challenge is the intractable normalization constant (partition function) in high dimensions.
- Restricted Boltzmann Machines (RBMs) (Smolensky, 1986) use a bipartite graph between observed x and binary latent z. The energy is e(x,z) = −x⊤Wz − x⊤b − z⊤c. Marginalizing out z yields p(x) as a Product of Experts (PoE) (Hinton, 2002). Training minimizes the negative log-likelihood, requiring the gradient ∇θe(x,θ) − E_{x′∼p(x′;θ)}[∇θe(x′,θ)]. The expectation over p(x′;θ) is intractable.
- MCMC Sampling (Hastings, 1970) is used to approximate the expectation by drawing samples from p(x;θ). Gibbs sampling is tractable for RBMs thanks to their factorized conditional distributions p(z∣x) and p(x∣z).
- Contrastive Divergence (CD) (Hinton, 2002) approximates the negative phase using only a few steps of Gibbs sampling starting from training data.
- Persistent Contrastive Divergence (PCD) runs MCMC chains in the background, persisting samples across SGD steps, which helps the samples better track the evolving model distribution and converges asymptotically to exact log-likelihood gradients.
- Energy-Based Generative Adversarial Networks (GANs) (Zhao et al., 2016; Goodfellow et al., 2014) train a generator g(ϵ;θg) (a sampler driven by a simple distribution p(ϵ)) and an energy function e(x;θ) in an adversarial minimax game. The energy function is trained to assign low energy to training data and high energy to generated samples; the generator is trained to produce samples with low energy. This avoids MCMC sampling when training the energy function. The generator's objective may include terms such as MMD (Gretton et al., 2012) to encourage diversity, and an autoencoder's reconstruction error can serve as the energy function.
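For the RBM bullets above, block Gibbs sampling and the CD-k negative phase can be sketched as follows (the tiny dimensions, random weights, and random "data point" are our own, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

# Tiny RBM: e(x, z) = -x^T W z - x^T b - z^T c, with binary x and z
nv, nh = 6, 4
W = 0.1 * rng.normal(size=(nv, nh))
b = np.zeros(nv)
c = np.zeros(nh)

def gibbs_step(x):
    """One round of block Gibbs: sample z | x, then x | z."""
    z = (rng.random(nh) < sigmoid(x @ W + c)).astype(float)
    x = (rng.random(nv) < sigmoid(W @ z + b)).astype(float)
    return x, z

# CD-k: start the negative chain at a data point and run k Gibbs steps
x_data = (rng.random(nv) < 0.5).astype(float)   # stand-in for a training example
x_neg = x_data.copy()
for _ in range(5):                               # k = 5
    x_neg, z_neg = gibbs_step(x_neg)
```

PCD would differ only in where the chain starts: instead of resetting to a data point every SGD step, x_neg would be carried over from the previous step.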
Autoregressive Models
Instead of latent variables, autoregressive models factorize the joint distribution of an observation X=(x1,…,xd) using the chain rule: p(X)=∏i=1dp(xi∣x<i). A single neural network models all conditional distributions p(xi∣x<i).
- Implemented using recurrent networks (like GRUs) or masked attention networks (like Transformers (Vaswani et al., 2017) with causal masking (Sutskever et al., 2014)). Causal masking ensures p(xi∣x<i) only depends on elements x1,…,xi−1.
- Major advantages: Exact computation of log-probability for any X and exact, tractable sampling (by generating x1, then x2 conditioned on x1, etc.). This contrasts with the intractability in latent variable and undirected models. Used widely in sequence modeling (e.g., LLMs (Brown et al., 2020)).
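The two advantages, exact log-probability and exact ancestral sampling, can both be demonstrated with a toy autoregressive model over binary vectors (the logistic conditionals and the strictly lower-triangular weight matrix are our own construction):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(7)
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

d = 4
# Strictly lower-triangular weights: x_i may depend only on x_<i
w = rng.normal(size=(d, d)) * np.tril(np.ones((d, d)), k=-1)

def log_prob(x):
    """Exact log p(X) = sum_i log p(x_i | x_<i) under logistic conditionals."""
    p = sigmoid(w @ x)                     # row i uses only x_<i, thanks to the mask
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

def sample():
    """Exact ancestral sampling: draw x_1, then x_2 | x_1, and so on."""
    x = np.zeros(d)
    for i in range(d):
        p_i = sigmoid(w[i] @ x)            # uses only already-sampled entries
        x[i] = float(rng.random() < p_i)
    return x

x = sample()
lp = log_prob(x)

# Sanity check: p(X) sums to one over all 2^d binary sequences
total = sum(np.exp(log_prob(np.array(v, dtype=float))) for v in product([0, 1], repeat=d))
```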
Further Topics
- Reinforcement Learning (RL) (Mnih et al., 2016): Learning to maximize expected reward from actions taken in an environment. In a single-step setting, training a classifier to maximize expected reward involves the Policy Gradient: ∇θEy∣x;θ[R(y)]=Ey∣x;θ[R(y)∇θlogp(y∣x;θ)]. This can be estimated with a single sample (R(y~)−b(x))∇θlogp(y~∣x;θ), where b(x) is a learned baseline to reduce variance. For multi-step RL with a Markov environment and cumulative discounted reward, Temporal Difference (TD) learning trains a value function (critic) by minimizing the difference between the estimated value and a bootstrapped target using the immediate reward and the value of the next state. Actor-Critic methods train a policy (actor) using the value function (critic) to provide a low-variance estimate of the advantage.
- Ensemble Methods: Combining multiple models to improve performance and estimate uncertainty.
- Bagging (Breiman, 1996): Average predictions from models trained on bootstrap resamples of the training data. Theory shows that averaging reduces variance. Stochasticity in SGD training (initialization, minibatches) implicitly generates diverse models suitable for ensembling.
- Bayesian ML: Defines a posterior distribution p(θ∣D) ∝ p(θ) Π_{x∈D} p(x∣θ). Prediction involves marginalizing over θ (averaging the predictions of p(x∣θ) weighted by the posterior). Sampling models from the posterior and averaging them corresponds to bagging. SGD, especially with appropriate learning rates or injected noise, can be seen as an approximate sampler from the posterior (Welling & Teh, 2011; Mandt et al., 2017).
- Gradient Boosting (Friedman, 2001): Iteratively trains weak learners to fit the negative gradient of the loss function w.r.t. the current ensemble's prediction.
- Regression (Mixture Density Networks, MDNs) (Bishop, 1994): For continuous outputs y ∈ R^d, model p(y∣x) as a mixture of simple distributions (e.g., Gaussians) conditioned on x. A neural network predicts the parameters of the mixture components (means, variances, weights) from x. This captures multimodal predictive distributions, unlike standard regression. Training maximizes the log-likelihood. Sampling from the learned distribution is easy, allowing estimation of credible regions.
- Causality: Most ML relies on association. Causal inference aims to understand relationships under intervention, which is crucial for robustness to distribution shifts (Out-of-Distribution generalization) and controlling systems. Examples like confounding (common unobserved causes) and selection bias (conditioning on a common effect) show how correlation arises without causation. This is a complex topic beyond standard ML.
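As a concrete instance of the single-step policy gradient described in the reinforcement learning bullet, here is a REINFORCE sketch with a running-average baseline (the 3-armed bandit setup and all constants are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)

def softmax(a):
    z = np.exp(a - a.max())
    return z / z.sum()

theta = np.zeros(3)                 # logits of a softmax policy over 3 actions
baseline = 0.0                      # b(x): here a single running average
for step in range(2000):
    p = softmax(theta)
    y = rng.choice(3, p=p)          # sample an action from the policy
    reward = 1.0 if y == 2 else 0.0 # action 2 is the only rewarded one

    # grad log p(y) for a softmax policy: one_hot(y) - p
    grad_log_p = -p
    grad_log_p[y] += 1.0

    theta += 0.1 * (reward - baseline) * grad_log_p   # REINFORCE update
    baseline += 0.05 * (reward - baseline)            # update the baseline

probs_final = softmax(theta)
```

Subtracting the baseline leaves the gradient estimate unbiased (the score function has zero mean) while substantially reducing its variance.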
The note concludes by highlighting that this lecture covers foundational machine learning concepts through the unifying lens of energy functions, optimization, and probabilistic modeling, providing a basis for understanding and implementing modern AI techniques.