1. Likelihood Function
In a probabilistic model, the likelihood function measures how likely the observed data is under a particular setting of the parameters. Given observed data \( x \) and model parameters \( \theta \), the likelihood function is:

$$
\mathcal{L}(\theta \mid x) = p(x \mid \theta)
$$

For \( N \) independent and identically distributed observations \( x_1, \dots, x_N \), the likelihood is a product of per-observation probabilities:

$$
\mathcal{L}(\theta) = \prod_{i=1}^{N} p(x_i \mid \theta)
$$
2. Log-Likelihood
Since likelihoods are products of many probabilities less than one, they are often extremely small numbers, and working with them directly can cause numerical underflow. Instead, we take the logarithm of the likelihood, which turns the product into a sum:

$$
\ell(\theta) = \log \mathcal{L}(\theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta)
$$
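As a quick illustration (a minimal sketch, not from the original text), the snippet below compares the raw likelihood and the log-likelihood of a sequence of coin flips under an assumed Bernoulli model; the data and the parameter value `theta = 0.7` are made up for the example.

```python
import numpy as np

# Hypothetical data: 1 = heads, 0 = tails (200 observations)
flips = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1] * 20)
theta = 0.7  # assumed probability of heads

# Likelihood: product of per-observation probabilities
per_obs_prob = np.where(flips == 1, theta, 1 - theta)
likelihood = np.prod(per_obs_prob)

# Log-likelihood: sum of log-probabilities (numerically stable)
log_likelihood = np.sum(np.log(per_obs_prob))

print(f"likelihood     = {likelihood:.3e}")   # vanishingly small; longer sequences underflow to 0.0
print(f"log-likelihood = {log_likelihood:.3f}")
```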
3. Negative Log-Likelihood (NLL)
Most machine learning models minimize a loss function rather than maximize a likelihood. To turn log-likelihood maximization into an equivalent minimization problem, we take the negative of the log-likelihood:

$$
\text{NLL}(\theta) = -\ell(\theta) = -\sum_{i=1}^{N} \log p(x_i \mid \theta)
$$
(a) NLL for a Bernoulli Distribution (Binary Classification)
$$
\text{NLL} = -\big[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\big]
$$

where:
- \( y \) is the true label (0 or 1),
- \( \hat{y} \) is the predicted probability of the positive class.
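A minimal sketch of this formula in code, assuming NumPy arrays of labels and predicted probabilities (the names `y_true` and `y_pred` are illustrative):

```python
import numpy as np

def bernoulli_nll(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-12) -> float:
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    # Clip to avoid log(0) when predictions are exactly 0 or 1
    y_pred = np.clip(y_pred, eps, 1 - eps)
    per_example = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(per_example.mean())

# Example: three predictions, two confident and correct, one less certain
print(bernoulli_nll(np.array([1, 0, 1]), np.array([0.9, 0.1, 0.6])))
```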
(b) NLL for a Gaussian Distribution (Regression)

If the target \( y \) is modeled as Gaussian with predicted mean \( \hat{y} \) and variance \( \sigma^2 \), the NLL of a single observation is:

$$
\text{NLL} = \frac{(y - \hat{y})^2}{2\sigma^2} + \frac{1}{2}\log\left(2\pi\sigma^2\right)
$$

With a fixed \( \sigma \), minimizing this NLL is equivalent to minimizing the mean squared error.
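A sketch of the Gaussian case under the assumption of a fixed, known variance (the value `sigma = 1.0` and the example arrays are arbitrary):

```python
import numpy as np

def gaussian_nll(y_true: np.ndarray, y_pred: np.ndarray, sigma: float = 1.0) -> float:
    """Mean negative log-likelihood of targets under a Gaussian with fixed variance."""
    var = sigma ** 2
    per_example = (y_true - y_pred) ** 2 / (2 * var) + 0.5 * np.log(2 * np.pi * var)
    return float(per_example.mean())

# With sigma fixed, ranking models by this NLL matches ranking them by MSE
print(gaussian_nll(np.array([2.0, 0.5, -1.0]), np.array([1.8, 0.7, -0.9])))
```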
4. Cross-Entropy Loss
General Formula for Cross-Entropy
$$
\text{CE} = -\sum_{c=1}^{C} y_c \log \hat{y}_c
$$

where:
- \( C \) is the number of classes,
- \( y_c \) is the true probability of class \( c \) (usually 1 for the correct class, 0 for the others),
- \( \hat{y}_c \) is the predicted probability of class \( c \).
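The general formula, sketched for two discrete distributions (the example distributions below are illustrative):

```python
import numpy as np

def cross_entropy(p_true: np.ndarray, q_pred: np.ndarray, eps: float = 1e-12) -> float:
    """Cross-entropy H(p, q) = -sum_c p_c * log(q_c) for discrete distributions."""
    q_pred = np.clip(q_pred, eps, 1.0)
    return float(-np.sum(p_true * np.log(q_pred)))

# One-hot true distribution over 4 classes vs. a softer prediction
p = np.array([0.0, 1.0, 0.0, 0.0])
q = np.array([0.1, 0.7, 0.1, 0.1])
print(cross_entropy(p, q))  # = -log(0.7), roughly 0.357
```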
(a) Cross-Entropy for Binary Classification
$$
\text{BCE} = -\big[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\big]
$$

where:
- \( y \) is the true label (0 or 1),
- \( \hat{y} \) is the predicted probability for class 1.

This is the same expression as the Bernoulli NLL above.
(b) Cross-Entropy for Multi-Class Classification
For multi-class classification with softmax outputs \( \hat{y}_c \):

$$
\text{CE} = -\sum_{c=1}^{C} y_c \log \hat{y}_c
$$

Since only one \( y_c = 1 \) (one-hot encoding), this simplifies to:

$$
\text{CE} = -\log \hat{y}_k
$$

where \( k \) is the index of the correct class.
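A minimal sketch of the multi-class case, computing softmax probabilities from raw scores and then the cross-entropy of a one-hot target (the logits and the true class are arbitrary examples):

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

def softmax_cross_entropy(logits: np.ndarray, true_class: int) -> float:
    """Cross-entropy with a one-hot target: -log of the probability of the true class."""
    probs = softmax(logits)
    return float(-np.log(probs[true_class]))

# Example: three classes; the model favors class 2, which is also the true class
logits = np.array([1.0, 0.5, 3.0])
print(softmax_cross_entropy(logits, true_class=2))  # small loss, since class 2 gets high probability
```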
5. Relationship Between NLL and Cross-Entropy
Cross-entropy with one-hot targets is equivalent to the negative log-likelihood of the correct class under the model's softmax probabilities:

$$
\text{CE} = -\log \hat{y}_k = \text{NLL}
$$

Minimizing cross-entropy is therefore the same as maximizing the likelihood of the training labels.
6. Perplexity
Perplexity is the exponentiation of the average negative log-likelihood of a sequence. For a given sequence of words \( w_1, w_2, \dots, w_N \), the perplexity is calculated as:

$$
\text{PPL} = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_1, \dots, w_{i-1}) \right)
$$

A lower perplexity means the model assigns higher probability to the observed sequence.
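As a rough sketch (the per-token probabilities below are made up; in practice they would come from a language model):

```python
import numpy as np

def perplexity(token_log_probs: np.ndarray) -> float:
    """Perplexity from per-token log-probabilities log p(w_i | w_1, ..., w_{i-1})."""
    return float(np.exp(-np.mean(token_log_probs)))

# Hypothetical probabilities a language model assigned to each token in a sequence
token_probs = np.array([0.2, 0.1, 0.4, 0.25, 0.05])
print(perplexity(np.log(token_probs)))  # roughly 6.3
```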