16. Multinomial Logit#

Multinomial logit models generalize logistic regression to responses with more than two categories. Assume K possible categories and N independent observations.

\[\begin{split} \begin{align*} y_i & \sim \text{Multinomial}(\mathbf{p}_i , n_i), \qquad i=1,\dots,N\\ \mathbf{p}_i & = (p_{i1}, p_{i2}, \dots , p_{iK}) \\ y_i & = (y_{i1}, y_{i2}, \dots , y_{iK}),\; y_{ij}=1,\; y_{ik}=0 \;\text{for } k\neq j,\; j \in \{1,\dots,K\} \end{align*} \end{split}\]

The second parameter of the Multinomial distribution is the number of trials, \(n_i\). When \(n_i = 1\), as in the vector notation above, the response \(y_i\) is a one-hot vector: a single 1 marks the observed category and the remaining entries are 0.

For example, the \(i\)‑th response could be \(y_i = (0,0,0,1,0)\), meaning the 4th category is true and categories 1, 2, 3, and 5 are false:

\[\begin{split} \begin{align*} y_i & = (0,0,0,1,0) \\ K & = 5 \\ y_{i4} & = 1 \\ y_{ik} & = 0 \quad \text{for } k \neq 4 \\ n_i &= 1 \\ \end{align*} \end{split}\]
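To make this concrete, here is a tiny NumPy sketch that draws one such one-hot response (the probability vector is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
p_i = np.array([0.1, 0.2, 0.1, 0.5, 0.1])  # hypothetical category probabilities, K = 5
y_i = rng.multinomial(1, p_i)              # n_i = 1 trial -> a one-hot vector
print(y_i)                                 # e.g., [0 0 0 1 0]
```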

Just as logistic regression produces a single probability \(p\) of category 1 versus category 0, the multinomial logit model produces a vector of probabilities \(\mathbf{p}_i = (p_{i1}, p_{i2}, \dots , p_{iK})\). To obtain these probabilities, a linear combination \(\eta_{ij}\) of the \(\beta\) coefficients and the \(x\) predictors is computed for each category, and the \(\eta\)'s are normalized with the softmax function so that the probabilities sum to 1:

\[\begin{split} \begin{align*} \eta_{ij} & = \beta_{0j} + \beta_{1j} x_{i1} + \dots + \beta_{p-1,j} x_{i,p-1}\\ p_{ij} & = \frac{e^{\eta_{ij}}}{\sum_{k=1}^K e^{\eta_{ik}}} \end{align*} \end{split}\]
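A minimal NumPy sketch of the softmax step (the \(\eta\) values here are made up):

```python
import numpy as np

def softmax(eta):
    """Normalize linear predictors into probabilities that sum to 1."""
    z = np.exp(eta - eta.max())  # subtract the max for numerical stability
    return z / z.sum()

eta = np.array([0.3, 1.1, -0.4])  # hypothetical eta's for K = 3 categories
print(softmax(eta))               # probabilities summing to 1
print(softmax(eta + 5.0))         # identical: softmax ignores a common shift
```

The last line previews the identifiability issue discussed below: adding the same constant to every \(\eta_{ij}\) leaves the probabilities unchanged.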

There is a \(\beta\) coefficient for each category \(j\) and each predictor. Putting it all together, the Bayesian model is:

\[\begin{split} \begin{align*} y_i & \sim \text{Multinomial}(\mathbf{p}_i , n_i) && \text{likelihood}\\[4pt] \eta_{ij} & = \beta_{0j} + \beta_{1j} x_{i1} + \dots + \beta_{p-1,j} x_{i,p-1} && \text{deterministic} \\[4pt] p_{ij} & = \frac{e^{\eta_{ij}}}{\sum_{k=1}^K e^{\eta_{ik}}} && \text{deterministic} \\[4pt] \beta_{mj} & \sim N(0,\sigma_j^2), \quad m = 0,\dots,p-1 && \text{prior} \end{align*} \end{split}\]

Because the softmax is unchanged when the same constant is added to every \(\eta_{ij}\), this model is over-parameterized, and one category must be chosen as a reference or baseline to make the model identifiable.

The usual choice is \(j = 1\), in which case

\[ \beta_{0 1} = \beta_{1 1} = \dots = \beta_{p-1,\,1} = 0 , \]

and only the \(p\,(K-1)\) coefficients for \(j=2,\dots ,K\) are estimated.
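Here is a minimal PyMC sketch of this model with category 1 as the baseline. The data are synthetic and the priors illustrative, so treat it as a template under these assumptions rather than a definitive implementation:

```python
import numpy as np
import pymc as pm
import pytensor.tensor as pt

# synthetic data: N observations, 2 predictors, K = 3 categories
rng = np.random.default_rng(0)
N, K = 200, 3
X = rng.normal(size=(N, 2))
y = rng.integers(0, K, size=N)  # observed labels 0, ..., K-1 (placeholder)

with pm.Model() as model:
    # coefficients only for the K - 1 non-baseline categories
    beta0 = pm.Normal("beta0", 0, 10, shape=K - 1)     # intercepts
    beta = pm.Normal("beta", 0, 10, shape=(2, K - 1))  # slopes
    eta = beta0 + pt.dot(X, beta)                      # shape (N, K - 1)
    # baseline category gets eta = 0, i.e., all its betas fixed at zero
    eta_full = pt.concatenate([pt.zeros((N, 1)), eta], axis=1)
    p = pm.math.softmax(eta_full, axis=1)
    # with n_i = 1, the Multinomial likelihood reduces to a Categorical one
    pm.Categorical("likelihood", p=p, observed=y)
    idata = pm.sample()
```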

With this constraint the model can be interpreted in familiar log‑odds form:

\[ \log\frac{p_{ij}}{p_{i1}} \;=\; \beta_{0j} + \beta_{1j}x_{i1} + \dots + \beta_{p-1,\,j}x_{i,p-1}, \qquad j = 2,\dots ,K . \]

So each \(\beta_{mj}\) measures the change in the log‑odds of choosing category \(j\) versus the baseline category per one‑unit increase in predictor \(x_{m}\).
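For example, with a hypothetical coefficient value \(\beta_{1j} = 0.7\), a one-unit increase in \(x_{i1}\) multiplies the odds of category \(j\) relative to the baseline by

\[ e^{\beta_{1j}} = e^{0.7} \approx 2.01 , \]

i.e., it roughly doubles them.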

You can also enforce a different identifiability constraint (e.g., sum‑to‑zero, as we learned for ANOVA).

Just as in binary logistic regression, the link does not have to be logit: multinomial probit or complementary log-log links are possible by replacing the softmax with the appropriate multivariate CDF. In practice, the logit link is the most common.

Authors#

  • Jason Naramore, August 2024.

  • Aaron Reding, April 2025.