Lecture 2: Bayesian Networks

Overview of Bayesian Networks, their properties, and how they can be helpful to model the joint probability distribution over a set of random variables. Concludes with a summary of relevant sections from the textbook reading.

Motivation: representing a joint distribution over many random variables explicitly is computationally expensive (the number of parameters grows exponentially with the number of variables), so we need methodologies to represent joint distributions compactly.

Two types of Graphical Models

Directed Graphs (Bayesian Networks)

A directed acyclic graph, $\mathcal{G}$, is made up of a set of nodes, $\mathcal{V}$, and a set of directed edges, $\mathcal{E}$, where edges represent causal relationships between nodes. Nodes in the graph represent a set of random variables, $\{X_1,\dots,X_N\}$, with a one-to-one map between nodes and random variables.

Directed graph.

The joint probability of the above directed graph can be written as follows:

\[P(X_1, \dots ,X_8) = P(X_1)P(X_2)P(X_3|X_1)P(X_4|X_2)P(X_5|X_2)P(X_6|X_3,X_4)P(X_7|X_6)P(X_8|X_5,X_6)\]

Undirected Graphs (Markov Random Fields)

An undirected graph contains nodes that are connected via non-directional edges.

Undirected graph.

The joint probability of the above undirected graph can be written as follows:

\begin{aligned} P(X_1, \dots ,X_8) = \frac{1}{Z} \exp\bigg( & E(X_1) + E(X_2) + \\ & E(X_3, X_1) + E(X_4, X_2) + E(X_5, X_2) + \\ & E(X_6,X_3,X_4) + \\ & E(X_7,X_6) + E(X_8,X_5,X_6) \bigg) \end{aligned}
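As a rough illustration of this representation, here is a minimal brute-force sketch, assuming binary variables and made-up energy functions over the cliques shown above (the specific functions and numbers are placeholders, not from the lecture); the partition function $Z$ is computed by enumerating all $2^8$ assignments:

```python
import itertools
import math

# Hypothetical energy functions over the cliques shown in the graph above;
# each maps an assignment of its variables (0/1 here) to a real number.
energies = {
    ("X1",):            lambda x1: 0.5 * x1,
    ("X2",):            lambda x2: -0.2 * x2,
    ("X3", "X1"):       lambda x3, x1: 1.0 * x3 * x1,
    ("X4", "X2"):       lambda x4, x2: 0.3 * x4 * x2,
    ("X5", "X2"):       lambda x5, x2: -0.7 * x5 * x2,
    ("X6", "X3", "X4"): lambda x6, x3, x4: x6 * (x3 + x4),
    ("X7", "X6"):       lambda x7, x6: 0.4 * x7 * x6,
    ("X8", "X5", "X6"): lambda x8, x5, x6: -x8 * x5 * x6,
}
variables = [f"X{i}" for i in range(1, 9)]

def unnormalized(assignment):
    """exp(sum of clique energies), following the formula above, for one full assignment."""
    total = sum(E(*(assignment[v] for v in scope)) for scope, E in energies.items())
    return math.exp(total)

# Partition function Z by brute-force enumeration over the 2^8 assignments.
Z = sum(unnormalized(dict(zip(variables, vals)))
        for vals in itertools.product([0, 1], repeat=len(variables)))

def joint(assignment):
    return unnormalized(assignment) / Z

print(joint({v: 0 for v in variables}))
```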

Notation

The Dishonest Casino

Let $\textbf{x}$ be a sequence of random variables, $x_1,\dots,x_T$, where $x_t$ is the outcome of a casino's die roll, $x \in \{1,2,3,4,5,6\}$. Also, let $\textbf{y}$ be a parse (a sequence of hidden states), $y_1,\dots,y_T$, where $y_t$ indicates whether the die used at time $t$ was fair or biased, $y \in \{0,1\}$. The biased die follows the distribution below:

| $p(x=1)$ | $p(x=2)$ | $p(x=3)$ | $p(x=4)$ | $p(x=5)$ | $p(x=6)$ |
|---|---|---|---|---|---|
| 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.5 |

Some questions we might want to ask are, for example, how likely a given parse is and what the marginal and posterior distributions over the hidden states are.

One way we could model this casino problem is as a Hidden Markov Model (HMM), where the $x_t$ are our observed variables and the $y_t$ are our hidden variables.

Hidden Markov Model.

The hidden variables all satisfy the Markov property: the past is conditionally independent of the future given the present:

\[y_{t-1} \, \bot \, \{y_{t+1}, \dots ,y_{T}\} \mid y_t\]

This property is also explicitly highlighted in the topology of the graph.

Furthermore, the joint probability of an observation sequence $\mathbf{x}$ and a parse $\mathbf{y}$ under our HMM can be computed as follows:

\begin{aligned} p(\mathbf{x,y}) &= p(y_1)\prod\limits_{t=1}^T p(x_t\,|\,y_t)\prod\limits_{t=2}^{T} p(y_t\,|\,y_{t-1}) \\ &= p(y_1)p(x_1|y_1)p(y_2|y_1)p(x_2|y_2)\cdots p(y_T|y_{T-1})p(x_T|y_T) \\ &= p(y_1)p(y_2|y_1)\cdots p(y_T|y_{T-1})\times p(x_1|y_1)p(x_2|y_2)\cdots p(x_T|y_T)\\ &= p(y_1,\dots,y_T)\,p(x_1,\dots,x_T|y_1,\dots,y_T) \end{aligned}

The marginal and posterior distributions can be computed as follows:

\begin{aligned} p(\mathbf{x}) &= \sum\limits_{\mathbf{y}} p(\mathbf{x},\mathbf{y}) \\ p(\mathbf{y}\,|\,\mathbf{x}) &= \frac{p(\mathbf{x},\mathbf{y})}{p(\mathbf{x})} \end{aligned}
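A minimal sketch of these computations for the dishonest casino. The emission table for the biased die comes from the table above and the fair die is uniform; the start and transition probabilities are illustrative assumptions, not values from the lecture. `p_joint` implements the factorization of $p(\mathbf{x,y})$, and the marginal and posterior are obtained by brute-force enumeration over parses (fine for small $T$).

```python
import itertools

# States: 0 = fair, 1 = biased.  Emission probabilities for faces 1..6.
emit = {0: [1/6] * 6,
        1: [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]}   # biased die from the table above
start = [0.5, 0.5]                            # assumed p(y_1)
trans = [[0.95, 0.05],                        # assumed p(y_t | y_{t-1})
         [0.10, 0.90]]

def p_joint(x, y):
    """p(x, y) = p(y_1) prod_t p(x_t | y_t) prod_{t>=2} p(y_t | y_{t-1})."""
    p = start[y[0]] * emit[y[0]][x[0] - 1]
    for t in range(1, len(x)):
        p *= trans[y[t - 1]][y[t]] * emit[y[t]][x[t] - 1]
    return p

def p_marginal(x):
    """p(x) = sum_y p(x, y), enumerating all 2^T parses."""
    return sum(p_joint(x, y) for y in itertools.product([0, 1], repeat=len(x)))

def p_posterior(x, y):
    """p(y | x) = p(x, y) / p(x)."""
    return p_joint(x, y) / p_marginal(x)

rolls = [6, 6, 3, 6, 6]
print(p_posterior(rolls, (1, 1, 1, 1, 1)))   # probability the die was biased throughout
```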

Bayesian Network

Bayesian Network: Factorization Theorem

We define $Pa_{X_i}$ and $NonDescendants_{X_i}$ as the set of parent nodes and the set of non-descendant nodes of the $i$th node, respectively. The topology of a directed graph asserts the following set of conditional independence statements:

\[\{ X_i \, \bot \, NonDescendants_{X_i} \, | \, Pa_{X_i} \}\]
Directed Graph.

As a result, the joint probability of the above directed graph can be written as follows:

\begin{aligned} P(X_1, \dots ,X_8) &= \prod\limits_{i =1}^8 P(X_i \, | \, Pa_{X_i}) \\ &= P(X_1)P(X_2)P(X_3|X_1)P(X_4|X_2)P(X_5|X_2)P(X_6|X_3,X_4)P(X_7|X_6)P(X_8|X_5,X_6) \end{aligned}
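A minimal sketch of this factorization in code, assuming binary variables; only the graph structure is taken from the figure above, and the CPT numbers are placeholders.

```python
import itertools

# Graph structure from the figure above: node -> tuple of parents.
parents = {
    "X1": (), "X2": (),
    "X3": ("X1",), "X4": ("X2",), "X5": ("X2",),
    "X6": ("X3", "X4"), "X7": ("X6",), "X8": ("X5", "X6"),
}

# Placeholder CPTs: cpt[node][parent assignment] = P(node = 1 | parents).
cpt = {
    node: {pa_vals: 0.5 for pa_vals in itertools.product([0, 1], repeat=len(pas))}
    for node, pas in parents.items()
}
cpt["X6"][(1, 1)] = 0.9          # e.g. X6 is likely 1 when both parents are 1

def joint(assignment):
    """P(X1, ..., X8) = prod_i P(X_i | Pa_{X_i})."""
    p = 1.0
    for node, pas in parents.items():
        pa_vals = tuple(assignment[q] for q in pas)
        p1 = cpt[node][pa_vals]
        p *= p1 if assignment[node] == 1 else 1.0 - p1
    return p

print(joint({f"X{i}": 1 for i in range(1, 9)}))
```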

Specification of a Directed Graphical Model

There are two components to any GM: the qualitative specification (the graph structure and the independence assumptions it encodes) and the quantitative specification (the local probability distributions attached to the nodes, e.g. CPTs or CPDs).

Sources of Qualitative Specifications

Where do our assumptions come from?

Local Structures & Independencies

Common parent example.
Cascade example.
V-structure example.

In class example of v-structure

My clock being late (event $A$) and a traffic jam (event $B$) are independent events, and either could cause me to be late for class (event $C$). However, observing my lateness couples the two events: if we learn that my clock was late, that can 'explain away' my lateness, making a traffic jam less likely. In other words, $A$ and $B$ become coupled once $C$ is known, because they jointly cause the third event: $P(A,B)=P(A)P(B)$, but $P(A,B \vert C)$ does not decompose into the product $P(A \vert C)P(B \vert C)$.
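To make 'explaining away' concrete, here is a minimal sketch with made-up probabilities (the numbers are illustrative only): $A$ and $B$ are independent a priori, and $C$ is likely whenever either occurs. Conditioning on $C$ couples them, and additionally learning $A$ lowers the probability of $B$.

```python
import itertools

# Made-up prior and conditional probabilities.
pA, pB = 0.3, 0.2                         # P(A=1), P(B=1); A and B independent
pC = {(0, 0): 0.05, (0, 1): 0.8,          # P(C=1 | A, B)
      (1, 0): 0.8,  (1, 1): 0.95}

def p(a, b, c):
    pa = pA if a else 1 - pA
    pb = pB if b else 1 - pB
    pc = pC[(a, b)] if c else 1 - pC[(a, b)]
    return pa * pb * pc

def cond(query, given):
    """P(query vars | given vars) by enumeration over A, B, C."""
    num = den = 0.0
    for a, b, c in itertools.product([0, 1], repeat=3):
        world = {"A": a, "B": b, "C": c}
        if all(world[k] == v for k, v in given.items()):
            den += p(a, b, c)
            if all(world[k] == v for k, v in query.items()):
                num += p(a, b, c)
    return num / den

print(cond({"B": 1}, {"C": 1}))            # P(B | C): a traffic jam is plausible
print(cond({"B": 1}, {"C": 1, "A": 1}))    # P(B | C, A): the late clock explains it away
```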

I-maps

Facts about I-map

In class examples

I-maps examples.

$X$ and $Y$ are independent only in Graph 1.

Below we have two tables showing two joint distributions over $X$ and $Y$, $P_1$ and $P_2$. Find the I-maps:

Solution for $P_1$: Graph 1.

Solution for $P_2$: both Graph 2 and Graph 3, since they assert no independencies and are therefore equivalent in terms of their independence sets. Graph 1 asserts an independence that does not hold in $P_2$, so it is not an I-map of $P_2$.

What is $I(G)$

Local Markov assumptions of BN

A Bayesian network structure $\mathcal{G}$ is a directed acyclic graph whose nodes represent random variables $X_1, \dots, X_n$ (see also Definition 3.1 in Section 3.2.1 of the textbook).

Local Markov assumptions

Definition

Let \({Pa}_{X_i}\) denote the parents of \(X_i\) in \(\mathcal{G}\), and \(NonDescendants_{X_i}\) denote the variables in the graph that are not descendants of $X_i$. Then $\mathcal{G}$ encodes the following set of local conditional independence assumptions $I_{\ell}(\mathcal{G})$:

\[I_{\ell} (\mathcal{G}): \left\{ X_i \perp NonDescendants_{X_i} \vert Pa_{X_i}: \forall i \right\}\]

In other words, each node $X_i$ is independent of its non-descendants given its parents.
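A minimal sketch that enumerates these local independence statements from a parents dictionary; the 4-node DAG at the end is a hypothetical example, not one from the lecture.

```python
def descendants(children, node):
    """All descendants of `node` (excluding the node itself)."""
    out, stack = set(), list(children.get(node, ()))
    while stack:
        n = stack.pop()
        if n not in out:
            out.add(n)
            stack.extend(children.get(n, ()))
    return out

def local_markov(parents):
    """I_l(G): one (X_i independent of NonDescendants_i given Pa_i) statement per node
    (parents go into the conditioning set rather than the independence set)."""
    nodes = set(parents)
    children = {n: [c for c, pas in parents.items() if n in pas] for n in nodes}
    statements = []
    for x in nodes:
        nondesc = nodes - descendants(children, x) - {x} - set(parents[x])
        statements.append((x, sorted(nondesc), sorted(parents[x])))
    return statements

# Hypothetical 4-node DAG: a -> c, b -> c, c -> d.
for x, nd, pa in local_markov({"a": (), "b": (), "c": ("a", "b"), "d": ("c",)}):
    print(f"{x} is independent of {nd} given {pa}")
```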

D-separation criterion for Bayesian networks

Definition 1

Variables $X$ and $Y$ are D-separated (conditionally independent) given $Z$ if they are separated in the moralized ancestral graph. (D stands for Directed edges.)

Graph separation example.

If there is any path from one node to the other that does not pass through the given (observed) nodes, then the two nodes are not conditionally independent (as in the example above, where the two nodes are not conditionally independent).
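A minimal sketch of this test (Definition 1), assuming the DAG is given as a dictionary mapping each node to its parents: restrict the graph to the ancestors of the query and evidence nodes, 'marry' co-parents and drop edge directions (moralization), delete the evidence nodes, and check whether the two nodes are still connected.

```python
from collections import deque

def ancestors(parents, nodes):
    """All ancestors of `nodes`, including the nodes themselves."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(parents, x, y, z):
    """True iff x and y are separated given the evidence set z (Definition 1)."""
    keep = ancestors(parents, {x, y} | set(z))
    # Moralize: undirected parent-child edges plus edges between co-parents.
    adj = {n: set() for n in keep}
    for child in keep:
        pas = [p for p in parents.get(child, ()) if p in keep]
        for p in pas:
            adj[child].add(p)
            adj[p].add(child)
        for i, p in enumerate(pas):           # marry co-parents
            for q in pas[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)
    # Remove evidence nodes and test connectivity with BFS.
    blocked = set(z)
    frontier, visited = deque([x]), {x}
    while frontier:
        n = frontier.popleft()
        if n == y:
            return False
        for m in adj[n]:
            if m not in visited and m not in blocked:
                visited.add(m)
                frontier.append(m)
    return True
```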

Practical definition of $I(\mathcal{G})$

Global Markov properties of Bayesian networks

Graph separation example.
\[I(\mathcal{G}): \left\{ X \perp Z \vert Y: \text{dsep}_{\mathcal{G}} ( X ; Z \vert Y) \right\}\]

I(G) example

In this graph there are two types of active trail structures (see section 3.3.1 from the reading below for definition):

To find the independencies, consider all trails with length greater than 1 (since a node cannot be independent of its parent).

Trails of length 2:

Trails of length 3 (only $x_2 \rightleftharpoons x_3 \rightleftharpoons x_1 \rightleftharpoons x_4$):

Trails between sets of nodes

Full $I(\mathcal{G})$

Putting the above together, we have the following independencies:

\begin{aligned} I(\mathcal{G})= &\{(x_2 \perp x_1), (x_2 \perp x_1 \vert x_4), \\ &(x_3 \perp x_4 \vert x_1), (x_3 \perp x_4 \vert \{x_1, x_2\}), \\ &(x_2 \perp x_4), (x_2 \perp x_4 \vert x_1), (x_2 \perp x_4 \vert \{x_1, x_3\}), \\ &(x_2 \perp \{x_1, x_4\}) \} \end{aligned}
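For instance, assuming the graph in this example is $x_1 \rightarrow x_3$, $x_2 \rightarrow x_3$, $x_1 \rightarrow x_4$ (an assumed structure consistent with the trails listed above; the original figure is not reproduced here), the d-separation sketch from earlier can check these statements:

```python
# Assumed structure, consistent with the trails listed above.
parents = {"x1": (), "x2": (), "x3": ("x1", "x2"), "x4": ("x1",)}

assert d_separated(parents, "x2", "x1", set())          # (x2 independent of x1)
assert d_separated(parents, "x3", "x4", {"x1"})         # (x3 independent of x4 given x1)
assert d_separated(parents, "x2", "x4", {"x1", "x3"})   # (x2 independent of x4 given {x1, x3})
assert not d_separated(parents, "x2", "x1", {"x3"})     # conditioning on the collider x3 couples x2 and x1
```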

The Equivalence Theorem

For a graph $G$, let $D_1$ denote the family of all distributions that satisfy $I(G)$, and let $D_2$ denote the family of all distributions that factor according to $G$. Then $D_1 \equiv D_2$, which can be expressed as:

\[P(X)=\prod_{i=1}^{d} P(X_i \vert X_{\pi_i})\]

This means separation properties in the graph imply independence properties about the associated variables.

Conditional Probability Tables (CPTs)

This is an example of the quantitative specification with discrete probabilities: each node stores a table giving its conditional distribution for every configuration of its parents.
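A minimal sketch of how a CPT might be stored in code, with made-up numbers, for a binary node $G$ with two binary parents $A$ and $B$:

```python
# cpt_G[(a, b)] is the distribution of G given A = a, B = b (made-up numbers).
cpt_G = {
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.6, 1: 0.4},
    (1, 0): {0: 0.3, 1: 0.7},
    (1, 1): {0: 0.1, 1: 0.9},
}
print(cpt_G[(1, 0)][1])   # P(G = 1 | A = 1, B = 0)
```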

Conditional Probability Densities (CPDs)

The conditional distributions can also be continuous densities, e.g. Gaussian distributions whose parameters depend on the values of the parent variables.
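A minimal sketch of a continuous CPD; one common choice (an assumption here, not specified in the lecture) is a Gaussian whose mean is a linear function of the parent values.

```python
import math
import random

def gaussian_cpd(parent_values, weights, bias, sigma):
    """p(x | parents) = N(x ; bias + weights . parents, sigma^2)."""
    mean = bias + sum(w * v for w, v in zip(weights, parent_values))
    def density(x):
        return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    def sample():
        return random.gauss(mean, sigma)
    return density, sample

density, sample = gaussian_cpd(parent_values=[1.0, -2.0], weights=[0.5, 0.3], bias=0.1, sigma=1.0)
print(density(0.0), sample())
```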

Summary of BN

Soundness and Completeness of D-separation

Soundness and completeness are two desirable properties of d-separation, formally defined in Section 3.3.2 below.

Example (follows Example 3.3 from textbook):

Consider a distribution $P$ over two independent variables $A$ and $B$. Recall that every independence we can observe from an I-map is by definition encoded in $P$ (see Definition 3.2-3.3 from the readings below). One I-map for $P$ would be the DAG $A \rightarrow B$ since it encodes no independencies. This is because $\varnothing$ is a subset of every set.

However, we can choose the conditional probability table so that independencies hold in \(P\) which do not follow from d-separation. One such conditional probability table is any table in which $B$'s conditional distribution is the same for both values of $A$, for example:

| $P(B \vert A)$ | $b^0$ | $b^1$ |
|---|---|---|
| $a^0$ | 0.4 | 0.6 |
| $a^1$ | 0.4 | 0.6 |

Observe that in this table, $B$ is independent of $A$. But this is not reflected in the I-map $A \rightarrow B$.

Theorem

Let $G$ be a BN graph. If $X$ and $Y$ are not d-separated given $Z$ in $G$, then $X$ and $Y$ are dependent in some distribution $P$ that factorizes over $G$.

Theorem

For almost all distributions $P$ that factorize over $G$, i.e., for all distributions except for a set of “measure zero” in the space of CPD parameterizations, we have that $I(P)=I(G)$.


Readings

The Bayesian network representation

3.1.1 Exploiting Independence Properties

Standard vs compact parametrization of independent random variables

Given random variables $X_i$ each representing the outcome of an independent coin toss, the standard parametrization of their joint distribution would be as follows:

\begin{aligned} P(X_1, \dots X_n) &= P(X_1=x_1, X_2=x_2, \dots, X_n=x_n) \\ &= P(X_1=x_1) P(X_2=x_2) \dots P(X_n=x_n) \end{aligned}

Since there are 2 possibilities for each outcome $x_i$, this representation requires $2^n$ parameters. However, only $2^n - 1$ of them are independent parameters, since the last probability is fully determined by the first $2^n - 1$ (all probabilities sum to 1).

One simple way to reduce the number of parameters needed would be to represent the probability that each coin toss lands heads as $\theta_1, \dots, \theta_n$. Then we have the following compact representation requiring only $n$ parameters:

\[P(X_1, ... X_n) = \prod_i \theta_{X_i}\]

3.1.3 Naive Bayes

We can further express the joint distribution in terms of conditional probabilities. This is done in the Naive Bayes model.

Instances of this model include a class variable $C$ and a set of observed features $X_1, \dots, X_n$.

The model makes the strong ‘naive’ conditional independence assumption:

\[P(x_i | C, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i | C)\]

In words, the features are conditionally independent given the class of the instance. Thus the joint distribution of the Naive Bayes model factorizes as follows:

\[P(C, X_1, \dots, X_n) = P(C) \prod_i P(X_i |C)\]
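A minimal sketch of this factorization with made-up parameters (two hypothetical classes and three binary features); `posterior` applies Bayes' rule on top of the factorized joint.

```python
# Made-up Naive Bayes parameters.
p_class = {"spam": 0.4, "ham": 0.6}                 # P(C)
p_feat = {"spam": [0.8, 0.6, 0.3],                  # P(X_i = 1 | C) for i = 1..3
          "ham":  [0.1, 0.4, 0.5]}

def joint(c, x):
    """P(C = c, X_1 = x_1, ..., X_n = x_n) = P(c) * prod_i P(x_i | c)."""
    p = p_class[c]
    for theta, xi in zip(p_feat[c], x):
        p *= theta if xi == 1 else 1 - theta
    return p

def posterior(x):
    """P(C | x) via Bayes' rule."""
    z = sum(joint(c, x) for c in p_class)
    return {c: joint(c, x) / z for c in p_class}

print(posterior([1, 1, 0]))
```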

3.2.1 Bayesian networks

Bayesian networks use a graph whose nodes are the random variables in the domain, and whose edges represent conditional probability statements. Unlike in the Naive Bayes model, Bayesian networks can also represent distributions that do not satisfy the naive conditional independence assumption.

Definition 3.1: Bayesian Network (BN)

A Bayesian network $\mathcal{G}$ is a directed acyclic graph whose nodes represent random variables $X_1, \dots, X_n$. Let $Pa_{X_i}^{\mathcal{G}}$ denote the parents of $X_i$ in $\mathcal{G}$, and $NonDescendants_{X_i}$ denote the variables in the graph which are not descendants of $X_i$. Then $\mathcal{G}$ encodes the following set of conditional independence assumptions, also called the 'local independencies' and denoted by $I_l(\mathcal{G})$:

For each variable $X_i: (X_i \perp NonDescendants_{X_i} \vert \text{Pa}_{X_i}^{\mathcal{G}})$

3.2.3 Graphs and Distributions

In this section it is shown that the distribution $P$ satisfies the local independencies associated with $\mathcal{G} \iff P$ is representable as a set of conditional probability distributions (CPDs) associated with $\mathcal{G}$.

Definition 3.2-3.3: I-Map

Let $P$ be a distribution over $\mathcal{X}$. We define $I(P)$ to be the set of independence assertions of the form $X \perp Y \vert Z$ that hold in $P$.

Let $\mathcal{K}$ be any graph object associated with a set of independencies $I(\mathcal{K})$. We say $\mathcal{K}$ is an I-map for $I$ if $I(\mathcal{K}) \subseteq I$.

Note this means that $\mathcal{G}$ is an I-map for $P $ if $\mathcal{G}$ is an I-map for $I(P)$.

I-Map to factorization

In this section it is proven (see text) that the conditional independence assumptions implied by the BN structure $\mathcal{G}$ allow us to factorize the distribution $P$ for which $\mathcal{G}$ is an I-map into conditional probability distributions.

Definition 3.4 Factorization

Let $P$ be a distribution and $\mathcal{G}$ be a BN graph over random variables $X_1, …X_n $. We say that $P$ factorizes according to $\mathcal{G}$ if $P$ can be expressed as a product:

\[P(X_1, ... X_n) = \prod_i P( X_i | \text{Pa}_{X_i}^{\mathcal{G}})\]

Reduction in number of parameters

In a distribution over $n$ binary variables, specifying the joint distribution requires $2^n - 1$ independent parameters (as stated in 3.1.1). However, if the distribution factorizes according to $\mathcal{G}$, where each node has at most $k$ parents, then the number of independent parameters is less than $n \cdot 2^k$. Since $k$ is usually small, this represents an exponential reduction in the number of parameters.
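A quick way to see the reduction, assuming binary variables: compare $2^n - 1$ with the per-node cost $\sum_i 2^{|Pa_{X_i}|}$ (each binary node needs one independent parameter per parent configuration). The counts below use the 8-node example graph from earlier in these notes.

```python
# Number of parents per node in the 8-node example graph (binary variables).
num_parents = {"X1": 0, "X2": 0, "X3": 1, "X4": 1, "X5": 1,
               "X6": 2, "X7": 1, "X8": 2}

full_joint = 2 ** len(num_parents) - 1                    # 255 independent parameters
factorized = sum(2 ** k for k in num_parents.values())    # 1+1+2+2+2+4+2+4 = 18
print(full_joint, factorized)
```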

Factorization to I-map

The converse also holds, as given by the following theorem:

Theorem 3.2

Let $\mathcal{G}$ be a BN graph over random variables $X$ and $P$ be a distribution over the same space. If $P$ factorizes according to $\mathcal{G}$, then $\mathcal{G}$ is an I-map for $P$.

Box 3.C Knowledge engineering

Building a BN in the real world requires many steps, including picking the relevant variables and their domains, deciding on the graph structure, and eliciting or estimating the probabilities.

Done correctly, the model will be useful as well as not too complex to use (see Box 3.D).

3.3.1 D-separation

Objective: determine which independencies hold for every distribution $P$ which factorizes over $G$.

Definition 2.16 Trail

We say that $X_1, \dots, X_k$ form a trail in the graph $\mathcal{K} = (\mathcal{X}, \mathcal{E})$ if, for every $i = 1, \dots, k-1$, we have $X_i \rightleftharpoons X_{i+1}$.

Types of active two-edge trails

By examining 2-edge connections between nodes $X,Y,Z$ four types of active trails, i.e. trails along which influence can flow from $X$ to $Z$, are apparent:

  1. Causal trail $X \rightarrow Z \rightarrow Y$ : active iff Z is not observed.
  2. Evidential trail $X \leftarrow Z \leftarrow Y$ : active iff Z is not observed.
  3. Common cause $X \leftarrow Z \rightarrow Y$ : active iff Z is not observed.
  4. Common effect $X \rightarrow Z \leftarrow Y$ : active iff Z or one of its descendants is observed.

For influence to flow through a longer trail $ X_1 \rightleftharpoons … \rightleftharpoons X_n $, every two-edge trail within must allow influence flow. This is summarized as follows:

Definition 3.6 active trail

Let $\mathcal{G}$ be a BN structure and $X_1 \rightleftharpoons \dots \rightleftharpoons X_n$ be a trail in $\mathcal{G}$. Let $Z$ be a subset of observed variables. Then the trail $X_1 \rightleftharpoons \dots \rightleftharpoons X_n$ is active given $Z$ if:

  1. Whenever we have a v-structure $X_{i-1} \rightarrow X_i \leftarrow X_{i+1}$ along the trail, $X_i$ or one of its descendants is in $Z$.
  2. No other node along the trail is in $Z$.

Further, for directed graphs that may have more than one trail between nodes “directed separation” (d-separation) gives a notion of separation between nodes.

Definition 3.7 D-separation

Let $X,Y,Z$ be three sets of nodes in $\mathcal{G}$. We say that $X$ and $Y$ are d-separated given $Z$ (denoted $\text{d-sep}_{\mathcal{G}} (X; Y \vert Z)$) if there is no active trail between any node in $X$ and any node in $Y$ given $Z$.

3.3.2 Soundness and Completeness

As a method d-separation has the following properties (proof in text):

Soundness: if $P$ factorizes over $\mathcal{G}$ and $X, Y$ are d-separated given $Z$, then $P$ satisfies $X \perp Y \vert Z$; i.e. every independence determined by d-separation is satisfied by the underlying distribution.

Completeness: if $X \perp Y \vert Z$ holds in every distribution that factorizes over $\mathcal{G}$, then $X$ and $Y$ are d-separated given $Z$; i.e. d-separation detects all independencies that are guaranteed by the graph structure (a particular distribution may satisfy additional independencies, as in the example above).