Lecture 3: Undirected Graphical Models
An introduction to undirected graphical models
Review
In addition to the I-map concept introduced in the last lecture, today’s lecture also covers minimal I-maps.
Minimal I-maps
A DAG \(\mathcal{G}\) is a minimal I-map if it is an I-map for a distribution \(P\), and if the removal of even a single edge from \(\mathcal{G}\) renders it not an I-map.

A distribution may have several minimal I-maps, each corresponding to a specific node ordering.

The fact that \(\mathcal{G}\) is a minimal I-map for \(P\) is far from a guarantee that \(\mathcal{G}\) captures the independence structure in \(P\).
“Bayes-ball” Algorithm
The “Bayes-ball” algorithm lets us read independences directly off a graphical model. We say \(X\) is d-separated from \(Z\) given \(Y\) if we cannot send a ball from any node in \(X\) to any node in \(Z\). The conditioning statement (“given \(Y\)”) is represented by shading the node in the graph. Examples of three basic directed graphical structures are shown below.

In (a) and (b), the shaded \(Y\) node blocks the ball from going between nodes \(X\) and \(Z\). This gives the independence relation that was introduced in the last lecture: \(X \perp Z \mid Y\).

(c), also called the “V-structure”, is a special case. In contrast to the first two examples, the ball can go between \(X\) and \(Z\) if the node \(Y\) is shaded, and is blocked otherwise. Therefore, the graph on the right yields \(X \perp Z\).
With these basic structures, we can apply the rules on a DAG. For example, let us try to find whether \(X_2\) and \(X_3\) are independent given \(X_1\) and \(X_6\).
After shading \(X_1\) and \(X_6\), the ball cannot go from \(X_2\) to \(X_3\) through \(X_1\) because it is blocked; however, \(X_2\), \(X_6\), and \(X_5\) form a “V-structure” with \(X_6\) shaded, so the ball can go along the path \(X_2\), \(X_6\), \(X_5\), \(X_3\). Therefore, the independence statement does not hold.
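The three basic structures can also be checked mechanically. The sketch below is not a literal ball-passing implementation: it uses the equivalent moralized-ancestral-graph test for d-separation, and the function name `d_separated` and the dictionary encoding of the DAG (each node mapped to its list of parents) are assumptions made for illustration.

```python
from collections import deque

def d_separated(parents, xs, zs, given):
    """Return True if xs and zs are d-separated given `given`.

    `parents` maps each node to the list of its parents in the DAG.
    """
    xs, zs, given = set(xs), set(zs), set(given)

    # 1. Keep only the nodes of interest and their ancestors.
    keep, stack = set(), list(xs | zs | given)
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, []))

    # 2. Moralize: link each node to its parents and connect co-parents.
    adj = {n: set() for n in keep}
    for child in keep:
        ps = list(parents.get(child, []))
        for p in ps:
            adj[child].add(p)
            adj[p].add(child)
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)

    # 3. Drop the conditioning nodes and test plain graph reachability.
    seen = set(xs - given)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        if node in zs:
            return False
        for nb in adj[node]:
            if nb not in given and nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return True

chain = {"X": [], "Y": ["X"], "Z": ["Y"]}        # (a) X -> Y -> Z
fork = {"Y": [], "X": ["Y"], "Z": ["Y"]}         # (b) X <- Y -> Z
collider = {"X": [], "Z": [], "Y": ["X", "Z"]}   # (c) X -> Y <- Z

for name, dag in [("chain", chain), ("fork", fork), ("collider", collider)]:
    print(name,
          "| Y shaded:", d_separated(dag, {"X"}, {"Z"}, {"Y"}),
          "| Y unshaded:", d_separated(dag, {"X"}, {"Z"}, set()))
```

For the chain and the fork, shading \(Y\) separates \(X\) from \(Z\); for the V-structure the behavior is reversed, matching the rules above.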
Limits of Directed and Undirected GMs
From a representational perspective, we aim to find a graph \(\mathcal{G}\) that precisely captures the independencies in a given distribution \(P\). This goal of learning GMs motivates the following definition.
Perfect Maps
We say that a graph \(\mathcal{G}\) is a perfect map (P-map) for a set of independencies \(\mathcal{I}\) if \(\mathcal{I}(\mathcal{G}) = \mathcal{I}\). We say that \(\mathcal{G}\) is a perfect map for \(P\) if \(\mathcal{I}(\mathcal{G}) = \mathcal{I}(P)\). That is, \(\text{sep}_{\mathcal{G}}(X;Z \vert Y) \iff P \models (X \perp Z \vert Y)\).
 The P-map of a distribution is unique up to I-equivalence between networks. That is, a distribution \(P\) can have many P-maps, but all of them are I-equivalent.
An arbitrary distribution \(P\), however, does not necessarily admit a perfect map as either an undirected or a directed GM. Two such examples are shown below.
Left: A distribution entailing \(A\perp C\vert \{B,D\}\) and \(B\perp D\vert \{A,C\}\), which has no perfect map as a DGM. Right: The v-structure, which has no perfect map as a UGM.
Undirected Graphical Models - Overview
 There can only be symmetric relationships between a pair of nodes (random variables). In other words, there is no causal effect from one random variable to another.
 The model can represent properties and configurations of a distribution, but it cannot generate samples explicitly.
 Each node has strong correlations with its neighbors.
Example
Let each node represent an image patch. It is impossible to tell what is inside this image patch by isolating it from others. However, when we look at its neighboring image patches, we can see that it’s an image patch of water. Because the relationships between neighboring image patches should be symmetric, an image is best represented by an undirected graphical model. This particular undirected graphical model is also known as the grid model.
Quantitative Specification
Cliques
 Cliques are subgraphs that are fully connected.
 A maximal clique is a clique such that any superset (any bigger subgraph that contains this subgraph) is not a complete graph.
 A sub-clique is a not-necessarily-maximal clique.
Example
Potential Functions
Each clique can be associated with a potential function \(\psi\), which can be understood as a provisional function of its arguments that assigns a pre-probabilistic score to their joint configuration. This potential function can be somewhat arbitrary, but must be non-negative.
Why cliques? Each component of the clique contributes to the overall potential function.
Example
For \(\psi_c(X_1, X_2)\),
Potential functions are not necessarily probabilistic:
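This model implies that \(X \perp Z \mid Y\). This independence statement implies (by definition) that the joint must factorize as:
\[P(X,Y,Z)=P(Y)P(X\mid Y)P(Z\mid Y)\]
Probability distributions can be used as potential functions. However, in this case, we cannot let all potentials be marginal probabilities, nor can we let all of them be conditional probabilities: the factorization can be grouped as \(P(X,Y)P(Z\mid Y)\) or as \(P(X\mid Y)P(Y,Z)\), each of which mixes a marginal with a conditional. So the potential functions for this graph cannot all be probability distributions of the same kind.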
Gibbs Distribution and Undirected Graphical Model Definition
Given an undirected graph \(H\) and clique potential functions \(\psi_c\) associated with the cliques of \(H\), we say \(P(X_1, ..., X_n)\) is a Gibbs distribution over \(H\) if it can be represented as
\[P(X_1, ..., X_n)=\frac{1}{Z}\prod_{c\in C}{\psi_c(\bold{x_c})}\]where \(Z\) is the normalizing constant, also known as the partition function. Upper case \(C\) denotes the set of all cliques, and lower case \(c\) denotes a clique associated with a set of random variables \(\bold{x_c}\).
An undirected graphical model represents a distribution \(P(X_1, ..., X_n)\) defined by an undirected graph \(H\), a set of positive potential functions \(\psi_c\), and the associated cliques of \(H\), such that
\[P(X_1, ..., X_n)=\frac{1}{Z}\prod_{c\in C}{\psi_c(\bold{x_c})}\]
Note that this distribution is the Gibbs distribution.
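As a concrete illustration of the definition, the sketch below builds a Gibbs distribution by brute-force enumeration for a three-node chain \(X - Y - Z\) with binary variables. The potential values are hypothetical scores chosen for the example (they are deliberately not probabilities), and the helper name `cond_indep_x_z_given_y` is an assumption.

```python
import itertools

# Hypothetical nonnegative clique potentials for the chain X - Y - Z.
# The entries are arbitrary scores; neither table sums to one.
psi_xy = {(0, 0): 1.5, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 1.3}
psi_yz = {(0, 0): 1.0, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 1.0}

states = list(itertools.product([0, 1], repeat=3))

# Unnormalized score of each joint configuration (x, y, z).
score = {(x, y, z): psi_xy[x, y] * psi_yz[y, z] for x, y, z in states}

partition = sum(score.values())                    # the partition function Z
P = {s: v / partition for s, v in score.items()}   # the Gibbs distribution

def cond_indep_x_z_given_y(P, tol=1e-12):
    """Check P(x, z | y) == P(x | y) P(z | y) for every configuration."""
    for y in [0, 1]:
        p_y = sum(P[x, y, z] for x in [0, 1] for z in [0, 1])
        for x in [0, 1]:
            for z in [0, 1]:
                p_xz = P[x, y, z] / p_y
                p_x = sum(P[x, y, zz] for zz in [0, 1]) / p_y
                p_z = sum(P[xx, y, z] for xx in [0, 1]) / p_y
                if abs(p_xz - p_x * p_z) > tol:
                    return False
    return True

print("sum of P:", sum(P.values()))
print("X indep. of Z given Y:", cond_indep_x_z_given_y(P))
```

Enumeration costs \(2^n\) evaluations, so this is only practical for tiny models; it is meant to make the role of \(Z\) visible, not to suggest an inference method.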
Example UGM Models
Depending on the question of interest, different representations may be more appropriate.
Using Max Cliques
We only need to represent discrete nodes with two 3D tables instead of one 4D table.
Using Pairwise Cliques
We only need to represent discrete nodes with five 2D tables instead of one 4D table.
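As a rough size comparison of the representations above (a sketch assuming every node takes \(k\) discrete values, and counting raw table entries rather than free parameters):

\[\underbrace{k^4}_{\text{one 4D table}} \qquad \underbrace{2k^3}_{\text{two 3D tables}} \qquad \underbrace{5k^2}_{\text{five 2D tables}}\]

For binary nodes (\(k=2\)) these are 16, 16 and 20 entries, so nothing is saved; for \(k=10\) they are 10000, 2000 and 500 entries, and the smaller clique tables become much cheaper to store.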
Using Canonical Representation
Even if we use a fine-grained representation, the Markov network is often over-parameterized. For any given distribution, there are multiple choices of parameters that describe it in the model. As shown above, we can choose either max cliques or pairwise cliques to represent this model. Furthermore, ambiguities can arise in clique structures. For example, given a pair of cliques \(\{A, B\}\) and \(\{B, C\}\), the information about \(B\) can be placed in either of the two cliques, resulting in many ways to specify the same distribution.
The canonical representation provides a natural approach to avoid this problem. It is defined over all nonempty cliques as shown below.
Qualitative Specification
Global Markov Independency
Suppose we are given the following UGM, denoted by \(H\):
\(Y\) separates \(X\) and \(Z\) if every path from a node in \(X\) to a node in \(Z\) passes through a node in \(Y\):
\[sep_H(X;Z\mid Y)\]A probability distribution satisfies the global Markov property if for any disjoint \(X\), \(Y\), \(Z\) such that \(Y\) separates \(X\) and \(Z\), \(X\) is independent of \(Z\) given \(Y\).
\[\mathcal{I}(H)=\{X\perp Z \mid Y:sep_H(X;Z\mid Y)\}\]
Local Markov Independency
For each node \(X_i\in \bold{V}\), there is a unique Markov blanket of \(X_i\), denoted \(MB_{X_i}\), which is the set of neighbors of \(X_i\) in the graph.
The local Markov independencies (\(\mathcal{I}_l\)) associated with \(H\) are:
\[\mathcal{I}_l(H)=\{X_i \perp (\bold{V}\setminus\{X_i\}\setminus MB_{X_i})\mid MB_{X_i}:\forall i\}\]In other words, \(X_i\) is independent of the rest of the nodes given its immediate neighbors \(MB_{X_i}\).
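Both notions reduce to simple graph operations. The sketch below assumes an adjacency-set encoding of the undirected graph and uses a hypothetical 4-cycle as the example; the function names `separated` and `markov_blanket` are illustrative.

```python
from collections import deque

def separated(adj, xs, zs, ys):
    """True if every path from a node in xs to a node in zs passes through ys."""
    xs, zs, ys = set(xs), set(zs), set(ys)
    seen = set(xs - ys)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        if node in zs:
            return False
        for nb in adj[node]:
            # Nodes in ys are "observed" and block the search.
            if nb not in ys and nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return True

def markov_blanket(adj, node):
    """In an undirected graph the Markov blanket is the set of neighbors."""
    return set(adj[node])

# Hypothetical 4-cycle A - B - C - D - A.
adj = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"A", "C"}}

mb = markov_blanket(adj, "A")
rest = set(adj) - {"A"} - mb
print("MB(A) =", sorted(mb))
print("A separated from the rest given MB(A):", separated(adj, {"A"}, rest, mb))
```

The last line checks the local Markov statement for \(A\) as a special case of global separation: removing the blanket disconnects \(A\) from every remaining node.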
Soundness and Completeness of Global Markov Property
The global Markov property for UGMs is similar to its variant for DGMs, in the sense that they both attain similar soundness and completeness results.
Soundness
Theorem: Let \(P\) be a distribution over \(\mathcal{X}\), and \(\mathcal{G}\) a Markov network structure over \(\mathcal{X}\). If \(P\) is a Gibbs distribution that factorizes over \(\mathcal{G}\), then \(\mathcal{G}\) is an I-map for \(P\).
Proof: Let \(X,Y,Z\) be three disjoint subsets in \(\mathcal{X}\) such that \(Z\) separates \(X\) and \(Y\) in \(\mathcal{G}\). We will show that \(P\models (X\perp Y\vert Z)\).
First, assume that \(X\cup Y\cup Z=\mathcal{X}\). Since \(Z\) separates \(X\) from \(Y\), there is no direct edge between \(X\) and \(Y\), so any clique in \(\mathcal{G}\) is fully contained in either \(X\cup Z\) or \(Y\cup Z\). Let \(\mathcal{I}_{X}\) be the indices of the cliques that are contained in \(X\cup Z\), and \(\mathcal{I}_{Y}\) be the indices of the remaining cliques. We know that
\[P(X_1,\cdots,X_n) = \frac{1}{Z}\prod_{i\in \mathcal{I}_X}\phi_i (D_i)\cdot \prod_{i\in \mathcal{I}_Y}\phi_i (D_i).\]Here \(\frac{1}{Z}\) is the normalizing constant, not the separating set. No term in the first product involves a variable in \(Y\), and no term in the second product involves a variable in \(X\). Hence, we can rewrite this product in the form:
\[P(X_1,\cdots,X_n) = \frac{1}{Z}f(X,Z)g(Y,Z),\]and we observe that independence follows.
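To spell out why the product form gives independence (a short derivation; \(\kappa\) stands for the normalizing constant \(\frac{1}{Z}\), a renaming used only here to avoid confusion with the set \(Z\)), marginalize the factorized joint:

\[P(X,Z)=\kappa\, f(X,Z)\sum_{y} g(y,Z),\qquad P(Y,Z)=\kappa\, g(Y,Z)\sum_{x} f(x,Z),\qquad P(Z)=\kappa \sum_{x} f(x,Z)\sum_{y} g(y,Z).\]

Multiplying the first two expressions gives

\[P(X,Z)\,P(Y,Z)=\kappa\, f(X,Z)\,g(Y,Z)\cdot \kappa\sum_{x} f(x,Z)\sum_{y} g(y,Z)=P(X,Y,Z)\,P(Z),\]

so whenever \(P(Z)>0\) we have \(P(X,Y\mid Z)=P(X\mid Z)\,P(Y\mid Z)\), which is exactly \(P\models (X\perp Y\vert Z)\).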
If \(X\cup Y\cup Z\) is a strict subset of \(\mathcal{X}\), let \(U = \mathcal{X}\setminus (X\cup Y\cup Z)\). We can partition \(U\) into two disjoint sets \(U_1\) and \(U_2\) such that \(Z\) separates \(X\cup U_1\) from \(Y\cup U_2\) in \(\mathcal{G}\). Using our argument from the partition case, we have that \(P\models ((X\cup U_1)\perp (Y\cup U_2)\vert Z)\). Applying the decomposition property of conditional independence, we obtain \(P\models (X\perp Y\vert Z)\). \(\square\)
Completeness (Hammersley-Clifford theorem)
Theorem: Let \(P\) be a positive distribution over \(\mathcal{X}\), and \(\mathcal{G}\) a Markov network graph over \(\mathcal{X}\). If \(\mathcal{G}\) is an I-map for \(P\), then \(P\) is a Gibbs distribution that factorizes over \(\mathcal{G}\).
This result shows that, for positive distributions, the global independencies imply that the distribution factorizes according to the network structure. Thus, for this class of distributions, we have that a distribution \(P\) factorizes over a Markov network \(\mathcal{G}\) if and only if \(\mathcal{G}\) is an I-map for \(P\).
Other Markov Properties
For UGMs, we defined I-maps in terms of global Markov properties. We will now define local independence. Intuitively, when two variables are not directly linked, there must be some way of rendering them conditionally independent. Specifically, we can require that \(X\) and \(Y\) be independent given all other nodes in the graph.
Pairwise Independencies
Let \(\mathcal{G}\) be a Markov network. We define the pairwise independencies associated with \(\mathcal{G}\) to be
\[\mathcal{I}_P(\mathcal{G}) = \{(X\perp Y\vert \mathcal{X}\setminus\{X,Y\}): X - Y\notin \mathcal{G}\}\]To illustrate this idea, observe that in the figure above, the variables of interest, \(A\) and \(E\), are conditionally independent given all other nodes in the graph, \(\{B,C,D\}\).
Pairwise and local independencies are also related. Their relationships are described in the following propositions and theorem.
Proposition
1. For any Markov network \(\mathcal{G}\) and any distribution \(P\), we have that if \(P\models \mathcal{I}_l(\mathcal{G})\) then \(P\models \mathcal{I}_P(\mathcal{G})\).
2. For any Markov network \(\mathcal{G}\) and any distribution \(P\), we have that if \(P\models \mathcal{I}(\mathcal{G})\) then \(P\models \mathcal{I}_l(\mathcal{G})\).
3. Let \(P\) be a positive distribution. If \(P\) satisfies \(\mathcal{I}_P(\mathcal{G})\), then \(P\) satisfies \(\mathcal{I}(\mathcal{G})\).
Theorem
The following are equivalent for a positive distribution \(P\):
\(P\models \mathcal{I}_l(\mathcal{G})\)
\(P\models \mathcal{I}_P(\mathcal{G})\)
\(P\models \mathcal{I}(\mathcal{G})\)
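These statements can be checked numerically on a small example. The sketch below assumes a hypothetical 4-cycle \(A - B - C - D - A\) (indexed 0 to 3) with one strictly positive pairwise potential shared by all edges, builds the Gibbs distribution by enumeration, and tests the pairwise independencies for the two non-adjacent pairs. The potential values and helper names are illustrative choices.

```python
import itertools
import math

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]   # hypothetical 4-cycle A-B-C-D-A

def psi(a, b):
    """A strictly positive pairwise potential that favors agreement."""
    return 2.0 if a == b else 1.0

states = list(itertools.product((0, 1), repeat=4))
score = {s: math.prod(psi(s[i], s[j]) for i, j in edges) for s in states}
Z = sum(score.values())
P = {s: v / Z for s, v in score.items()}

def marginal(P, keep):
    """Marginal over the variables whose indices are listed in `keep`."""
    out = {}
    for s, p in P.items():
        key = tuple(s[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def cond_indep(P, i, j, given, tol=1e-12):
    """Check X_i indep. X_j | X_given via P(i,j,g) P(g) == P(i,g) P(j,g)."""
    p_ijg = marginal(P, (i, j) + given)
    p_ig = marginal(P, (i,) + given)
    p_jg = marginal(P, (j,) + given)
    p_g = marginal(P, given)
    for key, p in p_ijg.items():
        a, b, g = key[0], key[1], key[2:]
        if abs(p * p_g[g] - p_ig[(a,) + g] * p_jg[(b,) + g]) > tol:
            return False
    return True

print("A indep. C given {B, D}:", cond_indep(P, 0, 2, (1, 3)))
print("B indep. D given {A, C}:", cond_indep(P, 1, 3, (0, 2)))
print("A indep. C marginally: ", cond_indep(P, 0, 2, ()))
```

The first two checks are the pairwise independencies of the cycle, which here coincide with the global ones; the third shows that \(A\) and \(C\) remain dependent when nothing is observed.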
Exponential Form
Since we don’t want to constrain the clique potentials to be positive in all situations, the exponential form is used to represent a clique potential \(\Phi_c(x_c)\) in an unconstrained form using a real-valued “energy” function \(\phi_c(x_c)\):
\[\Phi_c(x_c)=\exp\bigg\{\phi_c(x_c)\bigg\}\]This then gives the joint probability a nice additive structure
\[p(x)=\frac{1}{Z}\exp\bigg\{\sum_{c\in C}\phi_c(x_c)\bigg\}=\frac{1}{Z}\exp\bigg\{H(x)\bigg\}\]where the sum in the exponent is called the “free energy”:
\[H(x) = \sum_{c\in C}\phi_c(x_c)\]This form of representation is called the “Boltzmann distribution” in physics, and a log-linear model in statistics.
Undirected Graph Examples
In this section, we cover several well-known undirected graphical models: Boltzmann Machine (BM), Ising model, Restricted Boltzmann Machine (RBM), and Conditional Random Field (CRF).
Boltzmann Machine (BM)
A Boltzmann Machine is a fully connected graph with pairwise (edge) potentials on binary-valued nodes. One example is shown in the following figure:
Its probability distribution can be written as:
\[p(x_1,x_2,x_3,x_4)=\frac{1}{Z}\exp\bigg\{\sum_{ij}\phi_{ij}(x_i,x_j)\bigg\}\]It could also be written in a quadratic way:
\[p(x_1,x_2,x_3,x_4)=\frac{1}{Z}\exp\bigg\{\sum_{ij}\theta_{ij}x_ix_j+\sum_i\alpha_ix_i+C\bigg\}\]Hence the overall free energy function has the form:
\[H(x)=\sum_{ij}(x_i-\mu)\Theta_{ij}(x_j-\mu)=(x-\mu)^T\Theta(x-\mu)\]which can then be solved using quadratic programming.
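The quadratic form of the Boltzmann Machine can be made concrete on four nodes. The sketch below assumes units taking values in \(\{0,1\}\) and hypothetical couplings `theta` and biases `alpha`; it normalizes by enumeration and compares the conditional of one unit given the others with the logistic (sigmoid) closed form that follows from the quadratic exponent under this encoding.

```python
import itertools
import math

n = 4
# Hypothetical couplings theta_ij (one entry per edge, i < j) and biases alpha_i.
theta = {(0, 1): 0.8, (0, 2): -0.5, (0, 3): 0.3,
         (1, 2): 0.1, (1, 3): -0.7, (2, 3): 0.4}
alpha = [0.2, -0.1, 0.0, 0.5]

def exponent(x):
    """sum_{i<j} theta_ij x_i x_j + sum_i alpha_i x_i (the constant C cancels in Z)."""
    pair = sum(t * x[i] * x[j] for (i, j), t in theta.items())
    single = sum(a * xi for a, xi in zip(alpha, x))
    return pair + single

states = list(itertools.product((0, 1), repeat=n))
Z = sum(math.exp(exponent(x)) for x in states)
P = {x: math.exp(exponent(x)) / Z for x in states}

def conditional_on(i, x):
    """p(x_i = 1 | all other units), computed from the joint by enumeration."""
    x1 = x[:i] + (1,) + x[i + 1:]
    x0 = x[:i] + (0,) + x[i + 1:]
    return P[x1] / (P[x0] + P[x1])

def sigmoid_rule(i, x):
    """Closed form for {0,1} units: sigmoid of the total input to unit i."""
    field = alpha[i] + sum(t * x[b if a == i else a]
                           for (a, b), t in theta.items() if i in (a, b))
    return 1.0 / (1.0 + math.exp(-field))

x = (1, 0, 1, 1)
print(conditional_on(0, x), sigmoid_rule(0, x))
```

Setting some entries of `theta` to zero removes the corresponding edges, which is the sense in which a sparse model such as the Ising model is a special case.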
Ising model
In the Ising model, nodes are arranged in a regular topology (often a regular packing grid) and connected only to their geometric neighbors. It is like a sparse Boltzmann Machine. There is also the multi-state Ising model (also called the Potts model), in which nodes can take multiple values instead of just binary values. One example of an Ising model is shown in the following figure:
Its probability distribution can be written as
\[p(X)=\frac{1}{Z}\exp\bigg\{\sum_{i,j\in N_i}\theta_{ij}X_iX_j+\sum_i \theta_{i0}X_i\bigg\}\]
Restricted Boltzmann Machine (RBM)
The Restricted Boltzmann Machine is a bipartite graph with connections between one layer of hidden units and one layer of visible units. One example is shown in the following figure:
Its probability distribution can be written as
\[p(x,h\mid\theta)=\exp\bigg\{\sum_i\theta_i\phi_i(x_i) + \sum_j\theta_j\phi_j(h_j) + \sum_{i,j}\theta_{i,j}\phi_{i,j}(x_i,h_j)-A(\theta)\bigg\}\]The RBM has some appealing properties. For example, the hidden factors are marginally dependent but conditionally independent given observations on the visible nodes. These properties enable iterative Gibbs sampling for inference and learning on RBMs. If the edges in an RBM were directed, there would be plenty of V-structures in the graph (lots of dependences) that would increase the inference difficulty.
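The conditional independence of the hidden units is what makes block Gibbs sampling simple. The sketch below is a minimal illustration under stated assumptions: all units are binary in \(\{0,1\}\), the feature functions are the identity (\(\phi_i(x_i)=x_i\), \(\phi_j(h_j)=h_j\), \(\phi_{i,j}(x_i,h_j)=x_ih_j\)), and the weights `W`, `b`, `c` are hypothetical values.

```python
import math
import random

# Hypothetical tiny RBM: 3 visible units, 2 hidden units, all binary {0,1}.
W = [[0.5, -0.3], [0.2, 0.8], [-0.6, 0.1]]   # W[i][j] couples x_i and h_j
b = [0.1, -0.2, 0.0]                          # visible biases
c = [0.3, -0.1]                               # hidden biases

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def p_h_given_x(x):
    """p(h_j = 1 | x) for each j; hidden units are independent given x."""
    return [sigmoid(c[j] + sum(W[i][j] * x[i] for i in range(len(b))))
            for j in range(len(c))]

def p_x_given_h(h):
    """p(x_i = 1 | h) for each i; visible units are independent given h."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(c))))
            for i in range(len(b))]

def gibbs_step(x, rng):
    """One sweep of block Gibbs sampling: sample h | x, then x | h."""
    h = [int(rng.random() < p) for p in p_h_given_x(x)]
    x_new = [int(rng.random() < p) for p in p_x_given_h(h)]
    return x_new, h

def unnormalized(x, h):
    """exp of the RBM exponent without the log-partition term A(theta)."""
    e = sum(b[i] * x[i] for i in range(len(b)))
    e += sum(c[j] * h[j] for j in range(len(c)))
    e += sum(W[i][j] * x[i] * h[j] for i in range(len(b)) for j in range(len(c)))
    return math.exp(e)

rng = random.Random(0)
x = [1, 0, 1]
for _ in range(5):
    x, h = gibbs_step(x, rng)
print("sample after 5 sweeps:", x, h)
```

Each half-step samples a whole layer at once because, given the other layer, the units in a layer do not interact; `unnormalized` is included so the factorized conditionals can be compared against direct enumeration.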
Conditional Random Field (CRF)
A conditional random field is the undirected analogue of an HMM. It allows arbitrary dependencies on the input. For example, when labeling \(X_i\), future observations can be taken into account. An example of a CRF is shown in the figure:
The probability distribution could be written as
\[p_\theta(y\mid x)=\frac{1}{Z(\theta,x)}\exp\bigg\{\sum_c\theta_cf_c(x,y_c)\bigg\}\]
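A small linear-chain instance shows how the input-dependent partition function \(Z(\theta,x)\) enters. The sketch below is an illustration under assumptions not in the notes: binary labels and observations, two indicator feature types (label matches observation, adjacent labels agree), and hypothetical weights `theta_emit` and `theta_trans`. Normalization is done by enumerating all label sequences, which is only feasible for very short chains.

```python
import itertools
import math

LABELS = (0, 1)
theta_emit, theta_trans = 1.2, 0.7   # hypothetical feature weights

def score(x, y):
    """sum_c theta_c f_c(x, y_c) with indicator features on a linear chain."""
    emit = sum(theta_emit * (y_t == x_t) for x_t, y_t in zip(x, y))
    trans = sum(theta_trans * (a == b) for a, b in zip(y, y[1:]))
    return emit + trans

def crf_distribution(x):
    """Return (Z(theta, x), {y: p(y | x)}) by enumerating all label sequences."""
    ys = list(itertools.product(LABELS, repeat=len(x)))
    weights = {y: math.exp(score(x, y)) for y in ys}
    Z = sum(weights.values())
    return Z, {y: w / Z for y, w in weights.items()}

Z_a, p_a = crf_distribution((0, 1, 1))
Z_b, p_b = crf_distribution((0, 1, 0))
print("most likely labeling for x=(0,1,1):", max(p_a, key=p_a.get))
print("Z depends on x:", Z_a, Z_b)
```

Unlike the joint models above, the distribution is normalized separately for each input \(x\), so only the labels \(y\) are modeled.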