ST349 ML Framework Week 2: Coding and Learning set-ups

Department of Statistics, University of Warwick

Authors

Teaching Assistant: Hugo Queniat

Professor: Wenkai Xu

Published

January 22, 2026

1. Install relevant Python tools and packages

Install Python and IDE

We will use Anaconda (and Jupyter Notebook) for in-class demonstrations; it can be downloaded from here.

Install Packages

Recommended Packages to Install: sklearn (scikit-learn), torch and torchvision (PyTorch), PIL (Pillow), and rpy2.

You can install these via Anaconda Prompt/Terminal using the following commands:

Code
# Basic install command
# (note: sklearn is packaged as scikit-learn, and PIL as pillow)
conda install -c conda-forge scikit-learn pytorch torchvision pillow rpy2

# For sklearn specific issues:
# conda install conda-forge::scikit-learn

# To list all the installed packages 
conda list

# To check whether a specific package is installed
conda list scikit-learn

Install Jupyter

JupyterLab or Jupyter Notebook will be used for demonstrations. If they are not included (e.g. you started with Miniconda), you need to install them manually, e.g.

Code
conda install -c conda-forge jupyterlab

# To activate/open Jupyterlab
jupyter lab

Some useful shortcut keys for working in Jupyter notebooks can be found (with interactive demos) here.

Handling different projects

For a different project, a separate virtual environment that contains the necessary packages at their required versions can be helpful. To create a virtual environment:

Code
conda update conda
conda create -n ENVNAME

# If need to specify a particular version of Python in Anaconda, one can use
# conda create -n ENVNAME python=x.x anaconda

# To activate it
conda activate ENVNAME

# To deactivate it
conda deactivate

# To check installed packages in the virtual environment
conda list -n ENVNAME

# To delete a virtual environment
conda remove -n ENVNAME --all

2. Linear Algebra and Matrix Operations

Many machine learning algorithms, especially deep neural networks, heavily involve matrix and tensor operations. Here we review a few commonly used concepts in linear algebra on matrices.

A matrix is a rectangular two-dimensional array of scalars, written as an \(m \times n\) matrix \(A \in \mathbb{R}^{m \times n}\), where \(m\) is the number of rows and \(n\) is the number of columns.

For example, the \(2 \times 3\) matrix

\[ A = \begin{pmatrix} 2.0 & 3.1 & 5.2 \\ 4.0 & 5.0 & 1.1 \end{pmatrix} \]

has \(2\) rows and \(3\) columns.
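As a quick check, this matrix can be constructed with NumPy (a small sketch; note that NumPy uses 0-based indexing, so the entry \(A_{1 2}\) is `A[0, 1]`):

```python
import numpy as np

# The 2x3 example matrix A
A = np.array([[2.0, 3.1, 5.2],
              [4.0, 5.0, 1.1]])

print(A.shape)   # (2, 3): 2 rows, 3 columns
print(A[0, 1])   # 3.1, the entry A_{12} (0-based indexing)
```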

Definitions

  • Matrix element
    Each scalar value is indexed by its position \(i j\).
    The entry \(A_{i j}\) denotes the element in row \(i\) and column \(j\) of \(A\).
    For example, \(A_{1 2} = 3.1\).

  • Square matrix
    A matrix with the same number of rows and columns.

  • Diagonal entries
    For a square matrix \(M \in \mathbb{R}^{d \times d}\), the entries \(M_{i i}\) are called diagonal entries.

  • Diagonal matrix
    A \(k \times k\) matrix with zero entries everywhere except possibly on the diagonal.
    It is often written as \[ D = \mathrm{diag}(d_1, d_2, \dots, d_k) \]

  • Identity matrix
    The identity matrix \(I_d \in \mathbb{R}^{d \times d}\) has ones on the diagonal and zeros elsewhere.

  • Matrix addition
    For matrices \(A\) and \(B\) of the same dimension \(\mathbb{R}^{m \times n}\), the sum \(S = A + B\) is defined entrywise by \[ S_{i j} = A_{i j} + B_{i j} \]

  • Matrix multiplication
    Let \(A \in \mathbb{R}^{m \times n}\) and \(B \in \mathbb{R}^{n \times r}\).
    The product \(M = A B \in \mathbb{R}^{m \times r}\) is defined by \[ M_{i j} = \sum_{k = 1}^{n} A_{i k} B_{k j} \]

    Matrix multiplication is not commutative in general, meaning \(A B \neq B A\).
    The identity matrix satisfies \(I A = A = A I\) whenever the dimensions allow.

  • Scalar multiplication
    For a scalar \(\lambda \in \mathbb{R}\) and a matrix \(A \in \mathbb{R}^{m \times n}\), \[ \lambda A \text{ has entries } \left(\lambda A\right)_{i j} = \lambda A_{i j} \]

  • Matrix inverse
    If \(A\) is square and there exists a matrix \(B\) such that \(A B = B A = I\), then \(B\) is the inverse of \(A\), denoted \(A^{-1}\).

  • Determinant
    The determinant of a square matrix \(A\) is denoted \(|A|\).
    The identity matrix satisfies \(|I_d| = 1\).
    For a diagonal matrix \[ D = \mathrm{diag}(d_1, \dots, d_k) \] the determinant is \[ |D| = \prod_{i = 1}^{k} d_i \]

  • Transpose
    The transpose of a matrix \(A\) is denoted \(A^T\).
    If \(A \in \mathbb{R}^{2 \times 3}\), then \(A^T \in \mathbb{R}^{3 \times 2}\) and \[ A_{i j} = \left(A^T\right)_{j i} \]

  • Orthonormal matrix
    A matrix \(U \in \mathbb{R}^{n \times n}\) is orthonormal if \[ U^T U = I_n \]

  • Eigendecomposition
    A symmetric matrix \(M \in \mathbb{R}^{n \times n}\) admits an eigendecomposition \[ M = U D U^T \] where \(U\) is orthonormal and \(D\) is diagonal.
    The diagonal entries of \(D\) are the eigenvalues of \(M\), and the columns of \(U\) are the eigenvectors of \(M\).
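These operations can be verified numerically with NumPy (a small sketch; the symmetric matrix `S` is a hypothetical example, since the eigendecomposition \(M = U D U^T\) with orthonormal \(U\) requires symmetry):

```python
import numpy as np

# The 2x3 example matrix from above
A = np.array([[2.0, 3.1, 5.2],
              [4.0, 5.0, 1.1]])
B = np.ones((3, 2))

M = A @ B            # matrix product: (2x3) @ (3x2) -> (2x2)
At = A.T             # transpose: (2x3) -> (3x2)

# Eigendecomposition of a symmetric matrix: S = U D U^T
S = np.array([[2.0, 1.0],
              [1.0, 3.0]])
eigvals, U = np.linalg.eigh(S)   # eigenvalues and orthonormal eigenvectors
D = np.diag(eigvals)
S_rebuilt = U @ D @ U.T

print(M.shape)                    # (2, 2)
print(np.allclose(S, S_rebuilt))  # True: the decomposition reconstructs S
print(np.linalg.det(np.eye(3)))   # 1.0, since |I_d| = 1
```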

3. Statistical Estimation

(a) Derivation of MLE for Multivariate Gaussian

  • Given \(n\) independent and identically distributed (i.i.d.) observations \(x_1, \dots, x_n \in \mathbb{R}^d\), derive the Maximum Likelihood Estimator (MLE) for a Gaussian distribution \(\mathcal{N}(\mu, \Sigma)\),

\[ p_{\mu, \Sigma}(x) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp \left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right) \]

Step 1: The Log-Likelihood

The likelihood function is the product of individual densities (i.i.d. observations): \[ L(\mu, \Sigma) = \prod_{i=1}^n \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp \left( -\frac{1}{2} (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) \right) \]

We want to maximize the log-likelihood \(\mathcal{L}(\mu, \Sigma) = \ln L(\mu, \Sigma)\): \[ \begin{aligned} \mathcal{L}(\mu, \Sigma) &= \sum_{i=1}^n \left( -\frac{d}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma| - \frac{1}{2}(x_i - \mu)^T \Sigma^{-1} (x_i - \mu) \right) \\ &= -\frac{nd}{2}\ln(2\pi) - \frac{n}{2}\ln|\Sigma| - \frac{1}{2} \sum_{i=1}^n (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) \end{aligned} \]

Step 2: MLE for \(\mu\)

To find \(\hat{\mu}\), take the derivative w.r.t. \(\mu\) and set it to 0. We use the fact that \(\nabla_\mu (x - \mu)^T \Sigma^{-1} (x - \mu) = -2\Sigma^{-1}(x - \mu)\) (the covariance matrix \(\Sigma\) is symmetric):

\[ \frac{\partial \mathcal{L}}{\partial \mu} = -\frac{1}{2} \sum_{i=1}^n \left( -2 \Sigma^{-1} (x_i - \mu) \right) = \Sigma^{-1} \sum_{i=1}^n (x_i - \mu) = 0 \]

Since \(\Sigma^{-1}\) is positive definite, the sum term must be zero: \[ \sum_{i=1}^n x_i - n\mu = 0 \implies \hat{\mu}_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^n x_i \]

Step 3: MLE for \(\Sigma\)

To find \(\hat{\Sigma}\), we differentiate w.r.t. \(\Sigma\). First, rewrite the sum using the trace trick \(\text{tr}(x^T A x) = \text{tr}(A x x^T)\): \[ \sum_{i=1}^n (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) = \text{tr}\left( \Sigma^{-1} \sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T \right) \]

Using matrix calculus identities \(\frac{\partial \ln|\Sigma|}{\partial \Sigma} = \Sigma^{-1}\) and \(\frac{\partial \text{tr}(\Sigma^{-1} S)}{\partial \Sigma} = -\Sigma^{-1} S \Sigma^{-1}\):

\[ \frac{\partial \mathcal{L}}{\partial \Sigma} = -\frac{n}{2}\Sigma^{-1} + \frac{1}{2} \Sigma^{-1} \left( \sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T \right) \Sigma^{-1} = 0 \]

Multiply by \(\Sigma\) on both sides to isolate the term: \[ -n \Sigma + \sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T = 0 \]

Substituting \(\hat{\mu}\) for \(\mu\): \[ \hat{\Sigma}_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{\mu})(x_i - \hat{\mu})^T \]
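The closed-form estimators can be checked against simulated data; a minimal NumPy sketch (the true parameters `mu_true` and `Sigma_true` are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 2
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=n)

# MLE from the derivation above
mu_hat = X.mean(axis=0)
centered = X - mu_hat
Sigma_hat = (centered.T @ centered) / n   # note: 1/n, not 1/(n-1)

print(mu_hat)      # close to [1.0, -2.0]
print(Sigma_hat)   # close to Sigma_true
```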

(b) Exponential Family MLE

  • Consider the Exponential Family distribution, parametrised by \(\theta \in \mathbb{R}^d\), \[ p_{\theta}(x)=\frac{1}{Z(\theta)}\exp\{\sum_{i\in[d]}\theta_{i}T_{i}(x)\}, \quad (2) \] where \(T_{i}(x)\in\mathbb{R}\) are scalar-valued sufficient statistics. Derive (as much as possible) the MLE for i.i.d. observations \(x_{1},\dots,x_{n}\). (The result may not be in closed form.)

Step 1: The Log-Likelihood

The likelihood for \(n\) observations is the product of individual densities: \[ L(\theta) = \prod_{j=1}^n p_{\theta}(x_j) = \prod_{j=1}^n \frac{1}{Z(\theta)} \exp \left\{ \sum_{i=1}^d \theta_i T_i(x_j) \right\} \] (Note: to avoid confusion with indices, we use \(j\) to index the \(n\) observations and \(i\) to index the \(d\) parameters.)

Taking the log-likelihood \(\ell(\theta) = \ln L(\theta)\): \[ \begin{aligned} \ell(\theta) &= \sum_{j=1}^n \left( -\ln Z(\theta) + \sum_{i=1}^d \theta_i T_i(x_j) \right) \\ &= -n \ln Z(\theta) + \sum_{j=1}^n \sum_{i=1}^d \theta_i T_i(x_j) \\ &= -n \ln Z(\theta) + \sum_{j=1}^n \theta^T \mathbf{T}(x_j) \end{aligned} \] where \(\mathbf{T}(x)\) is the vector of sufficient statistics \([T_1(x), \dots, T_d(x)]^T\).

Step 2: The Gradient

To find the maximum, we take the gradient with respect to the parameter vector \(\theta\). \[ \nabla_\theta \ell(\theta) = -n \nabla_\theta \ln Z(\theta) + \sum_{j=1}^n \mathbf{T}(x_j) \]

Step 3: Analyzing \(\nabla_\theta \ln Z(\theta)\)

Recall that the partition function is \(Z(\theta) = \int \exp(\theta^T \mathbf{T}(x)) dx\). The gradient of the log-partition function is the expected value of the sufficient statistics (a standard property of exponential families):

\[ \nabla_\theta \ln Z(\theta) = \frac{1}{Z(\theta)} \nabla_\theta Z(\theta) = \mathbb{E}_\theta[\mathbf{T}(x)] \]

Step 4: Setting the Gradient to Zero

Substituting this back into the gradient equation and setting it to 0:

\[ -n \mathbb{E}_{\hat{\theta}}[\mathbf{T}(x)] + \sum_{j=1}^n \mathbf{T}(x_j) = 0 \]

Rearranging the terms and dividing by \(n\):

\[ \mathbb{E}_{\hat{\theta}}[\mathbf{T}(x)] = \frac{1}{n} \sum_{j=1}^n \mathbf{T}(x_j) \]

Conclusion: The MLE \(\hat{\theta}\) satisfies the condition that the model expectation of the sufficient statistics equals the empirical average of the sufficient statistics from the data.
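As a concrete illustration (our choice, not part of the derivation above), take the Poisson distribution: with natural parameter \(\theta = \ln \lambda\) and sufficient statistic \(T(x) = x\), we have \(\mathbb{E}_\theta[T(x)] = e^\theta = \lambda\), so the moment-matching condition gives \(\hat{\lambda} = \bar{x}\), i.e. \(\hat{\theta} = \ln \bar{x}\):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.poisson(lam=3.5, size=10000)

# Moment matching: E_theta[T(x)] = empirical mean of T(x), with T(x) = x.
# For Poisson, E[x] = lambda = exp(theta), so theta_hat = ln(mean(x)).
lam_hat = x.mean()
theta_hat = np.log(lam_hat)

print(lam_hat)   # close to the true rate 3.5
```

Here the moment-matching equation happens to have a closed-form solution; in general it must be solved numerically.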

(c) Mixture of Gaussian (MoG) MLE

  • Consider the Mixture of Gaussians (MoG) example in Example 1.1.1, with a set of \(k\) Normal distributions \(\mathcal{N}(\mu_{i},\Sigma_{i})\), \(i\in[k]\), and weights \(\pi_{i}\ge0\) s.t. \(\sum_{i}\pi_{i}=1\). The MoG model has density of the form \[ p(x)=\sum_{i\in[k]}\pi_{i}\mathcal{N}(x \mid \mu_{i},\Sigma_{i}) \quad (3) \] Let \(x_{1},\dots,x_{n}\) be the observed samples. Can we derive the MLE for \(\pi_{i},\mu_{i},\Sigma_{i}\), \(i\in[k]\)? If not, what can we do?

Step 1: The Log-Likelihood

Given independent samples \(x_1, \dots, x_n\), the log-likelihood is: \[ \ell(\Theta) = \sum_{j=1}^n \ln p(x_j) = \sum_{j=1}^n \ln \left( \sum_{i=1}^k \pi_i \mathcal{N}(x_j | \mu_i, \Sigma_i) \right) \]

Step 2: The Problem (Why We Cannot Derive a Closed-Form MLE)

If we try to take the derivative with respect to \(\mu_i\) and set it to 0, we encounter a problem. Unlike the single Gaussian case, the logarithm cannot be pushed inside the summation: \[ \ln \left( \sum \dots \right) \neq \sum \ln (\dots) \] This means the \(\ln\) does not cancel the \(\exp\) inside the Gaussian. The derivative yields a complex non-linear system in which the parameters of all components are coupled together.

Step 3: What Can We Do? (The EM Algorithm)

Since a closed-form solution is not available, we typically use the Expectation-Maximization (EM) Algorithm.

  1. Introduce Latent Variables: We assume there is a hidden variable \(z_j \in \{1, \dots, k\}\) for each data point that tells us which Gaussian component generated it.
  2. E-Step (Expectation): Estimate the probability (responsibility) that component \(i\) generated point \(x_j\) using current parameters.
  3. M-Step (Maximization): Update parameters \(\pi_i, \mu_i, \Sigma_i\) by maximizing the expected log-likelihood (which is tractable because treating \(z\) as known removes the sum inside the log).
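The three steps above can be sketched for a one-dimensional, two-component mixture (a toy illustration with scalar variances and synthetic data, not production code):

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy data: two 1-D Gaussian clusters
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])

k = 2
pi = np.full(k, 1.0 / k)      # mixing weights
mu = np.array([-1.0, 1.0])    # initial means
var = np.ones(k)              # initial variances

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: responsibilities r[j, i] = P(z_j = i | x_j)
    dens = pi * normal_pdf(x[:, None], mu, var)        # shape (n, k)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted MLE updates
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(np.sort(mu))   # close to the true means [-2.0, 3.0]
```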

Alternative: Direct Optimization with PyTorch

While EM is the classical statistical approach, in modern ML frameworks like PyTorch, we can often solve this using Gradient Descent.

Since the log-likelihood is differentiable, we can treat the negative log-likelihood as a Loss Function:

\[ \mathcal{L}(\theta) = -\sum_{j=1}^n \ln \left( \sum_{i=1}^k \pi_i \mathcal{N}(x_j | \mu_i, \Sigma_i) \right) \]

We can optimize parameters \(\theta = \{\pi, \mu, \Sigma\}\) directly using torch.optim.Adam.

Handling Constraints:

  • To ensure \(\sum_i \pi_i = 1\), we optimize raw scores (logits) and apply a softmax.
  • To ensure \(\Sigma_i\) is positive definite, we optimize the Cholesky factor \(L_i\) such that \(\Sigma_i = L_i L_i^T\).
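A minimal PyTorch sketch of this approach for one-dimensional data (the toy data and parameter names are ours; in 1-D we optimise log standard deviations for positivity rather than Cholesky factors):

```python
import torch

torch.manual_seed(0)
# Toy 1-D data from two clusters
x = torch.cat([torch.randn(300) * 0.5 - 2.0, torch.randn(700) + 3.0])

k = 2
logits = torch.zeros(k, requires_grad=True)          # softmax -> mixing weights
mu = torch.tensor([-1.0, 1.0], requires_grad=True)   # component means
log_std = torch.zeros(k, requires_grad=True)         # exp -> standard deviations

opt = torch.optim.Adam([logits, mu, log_std], lr=0.05)

for _ in range(500):
    opt.zero_grad()
    pi = torch.softmax(logits, dim=0)   # sums to 1 by construction
    std = torch.exp(log_std)            # positive by construction
    comp = torch.distributions.Normal(mu, std)
    # log sum_i pi_i N(x_j | mu_i, sigma_i^2), computed stably
    log_probs = torch.logsumexp(torch.log(pi) + comp.log_prob(x[:, None]), dim=1)
    loss = -log_probs.sum()             # negative log-likelihood
    loss.backward()
    opt.step()

print(mu.detach().sort().values)        # close to the true means [-2.0, 3.0]
```

Gradient descent on the NLL can get stuck in poor local optima just as EM can, so in practice both methods benefit from multiple restarts or careful initialisation.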