Bayesian Network – Matlab

The above link is the original source for Bayesian networks in Matlab. I have rearranged the material in the way I would like to study it.

1. Overview

A Bayesian network is a directed acyclic graph (DAG) that represents a set of random variables and their conditional dependencies. Here is a simple example with four binary variables: C (cloudy), S (sprinkler), R (rain), and W (wet grass). An arc from one variable $x_1$ to another variable $x_2$ indicates that $x_1$ directly influences $x_2$ (informally, that $x_1$ causes $x_2$). The conditional probability distribution (CPD) at each node is represented as a conditional probability table (CPT). Once the graph and the tables are in place, we can use Bayes' rule to calculate posterior probabilities (this step is called inference). Here I focus on how to construct a Bayesian network with the Bayes Net Toolbox (BNT) in Matlab.
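As a sketch of what the network encodes, and assuming the usual sprinkler structure that the code in the next section builds (C → S, C → R, S → W, R → W), the joint distribution factorizes into one CPT per node:

$$P(C, S, R, W) = P(C)\,P(S \mid C)\,P(R \mid C)\,P(W \mid S, R)$$

Inference then amounts to applying Bayes' rule to this joint. For example, the posterior probability that the sprinkler was on given that the grass is wet is

$$P(S \mid W) = \frac{P(S, W)}{P(W)} = \frac{\sum_{c, r} P(c, S, r, W)}{\sum_{c, s, r} P(c, s, r, W)}$$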

2. Generate DATA

In reality, we may not have the conditional probability tables. What we may have instead is the value of each node on each day (true = 2, false = 1 for a binary variable). For example, one day we observe that it is cloudy (C=2) and raining (R=2), the sprinkler is off (S=1), and the grass is wet (W=2). This is one observation: (C=2, S=1, R=2, W=2). We may have thousands of records like this. Here is a way to generate such data for training purposes.

```
%% Graph structure
N = 4;                      % Number of nodes
dag = zeros(N, N);          % Adjacency matrix: dag(i,j) = 1 means an arc i -> j
C = 1; S = 2; R = 3; W = 4;
dag(C, [R S]) = 1;
dag(R, W) = 1;
dag(S, W) = 1;

%% Creating the Bayes net shell
discrete_nodes = 1:N;
node_sizes = 2*ones(1, N);  % The number of values node i can take on
bnet = mk_bnet(dag, node_sizes, 'names', {'C', 'S', 'R', 'W'}, 'discrete', discrete_nodes);

%% Parameters (one CPT per node)
bnet.CPD{C} = tabular_CPD(bnet, C, [0.5, 0.5]);
bnet.CPD{R} = tabular_CPD(bnet, R, [0.8, 0.2, 0.2, 0.8]);
bnet.CPD{S} = tabular_CPD(bnet, S, [0.5, 0.9, 0.5, 0.1]);
bnet.CPD{W} = tabular_CPD(bnet, W, [1, 0.1, 0.1, 0.01, 0, 0.9, 0.9, 0.99]);

%% Generate some data from the sprinkler network, using the fixed parameters above
nsamples = 281;             % Number of samples
samples = cell(N, nsamples);
seed = 0; rand('state', seed);  % Fix the seed to make it repeatable
                                % (legacy syntax; newer Matlab versions use rng(seed)).
for i = 1:nsamples
    samples(:, i) = sample_bnet(bnet);  % samples{j,i}: the value of the j'th node in case i
end
data = cell2num(samples);   % N-by-nsamples numeric matrix
```
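To make the CPT vectors above concrete, here is a minimal sketch in plain Matlab (no BNT calls) that reshapes them into tables and performs the inference step from the overview by brute-force enumeration. It assumes BNT's CPT convention that parents come first in node order, the child comes last, and the first index varies fastest; the variable names (`p_C`, `p_S`, `p_R`, `p_W`, `joint`) are introduced here for illustration only.

```matlab
% CPT vectors copied from the parameter section; index 1 = false, 2 = true.
p_C = [0.5, 0.5];                                                 % p_C(c)       = P(C=c)
p_S = reshape([0.5, 0.9, 0.5, 0.1], [2 2]);                       % p_S(c, s)    = P(S=s | C=c)
p_R = reshape([0.8, 0.2, 0.2, 0.8], [2 2]);                       % p_R(c, r)    = P(R=r | C=c)
p_W = reshape([1, 0.1, 0.1, 0.01, 0, 0.9, 0.9, 0.99], [2 2 2]);   % p_W(s, r, w) = P(W=w | S=s, R=r)

% Joint distribution P(C,S,R,W) as the product of the four CPTs.
joint = zeros(2, 2, 2, 2);
for c = 1:2
    for s = 1:2
        for r = 1:2
            for w = 1:2
                joint(c, s, r, w) = p_C(c) * p_S(c, s) * p_R(c, r) * p_W(s, r, w);
            end
        end
    end
end

% Bayes' rule by enumeration: P(S=2 | W=2) = P(S=2, W=2) / P(W=2).
wet = joint(:, :, :, 2);            % slice with W = true
sprinkler_and_wet = joint(:, 2, :, 2);
p_wet = sum(wet(:));
p_S_given_wet = sum(sprinkler_and_wet(:)) / p_wet;

fprintf('P(W=2)       = %.4f\n', p_wet);
fprintf('P(S=2 | W=2) = %.4f\n', p_S_given_wet);   % about 0.43
```

Enumeration like this only works for tiny networks; for anything larger, an inference engine from the toolbox is the practical route.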

%%%%%%%%%%%%%%% Table of DATA %%%%%%%%%%%%%
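To illustrate how records of this kind relate back to a CPT, the following sketch estimates P(S | C) by counting co-occurrences. The ten records in `toy` are made up for illustration and are not the sampled `data`; in BNT, `learn_params` performs maximum-likelihood estimation of this kind for a whole network from complete data.

```matlab
% Hypothetical toy records: row 1 = C, row 2 = S, one column per day (1 = false, 2 = true).
toy = [1 1 1 1 2 2 2 2 2 2;
       1 2 2 1 1 1 1 2 1 1];
C = 1; S = 2;

% counts(c, s) = number of days with C = c and S = s.
counts = accumarray([toy(C, :)', toy(S, :)'], 1, [2 2]);

% Normalize each row so that cpt_S(c, :) sums to 1: cpt_S(c, s) estimates P(S=s | C=c).
cpt_S = bsxfun(@rdivide, counts, sum(counts, 2));

disp(counts);
disp(cpt_S);
```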

3. Exploratory Data Analysis (EDA)

Our final goal is to recover the original model. As a first step, we use the Matlab function clustergram() (from the Bioinformatics Toolbox) to cluster the data.

```
cgo = clustergram(data);  % Pass the data matrix itself, not the string 'data'.
variable_set = {'Cloudy', 'Sprinkler', 'Rain', 'Wet grass'};
set(cgo, 'RowLabels', variable_set);  % Add meaningful names to rows.
get(cgo)  % Display the properties of the clustergram object.
```
```
Cluster: 'ALL'
RowPDist: {'Euclidean'}
ColumnPDist: {'Euclidean'}
Dendrogram: {}
OptimalLeafOrder: 1
LogTrans: 0
DisplayRatio: [0.2000 0.2000]
RowGroupMarker: []
ColumnGroupMarker: []
ShowDendrogram: 'on'
Standardize: 'ROW'
Symmetric: 1
DisplayRange: 3
Colormap: [11x3 double]
ImputeFun: []
ColumnLabels: {1x281 cell}
RowLabels: {4x1 cell}
ColumnLabelsRotate: 90
RowLabelsRotate: 0
ColumnLabelsLocation: 'bottom'
RowLabelsLocation: 'right'
Annotate: 'off'
AnnotPrecision: 2
AnnotColor: 'w'
ColumnLabelsColor: []
RowLabelsColor: []
LabelsWithMarkers: 0
```

%%%%%%%%%%%%%%%%%%%%%%%%%%%%

```
cgo2 = clustergram(data, 'Standardize', 'column');
set(cgo2, 'RowLabels', variable_set);  % Add meaningful names to rows.
get(cgo2);
```

Standardize: 'COLUMN'

%%%%%%%%%%%%%%%%%%%%%%%%%%%%

```
cgo3 = clustergram(data, 'Standardize', 'column', 'RowPDist', 'cityblock');
set(cgo3, 'RowLabels', variable_set);  % Add meaningful names to rows.
get(cgo3);
```

RowPDist: {'cityblock'}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

```
cgo4 = clustergram(data, 'Standardize', 'none', 'Symmetric', false, 'RowPDist', 'cityblock');
set(cgo4, 'RowLabels', variable_set);  % Add meaningful names to rows.
get(cgo4);
```
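To see what the distance option means for this kind of data, here is a small sketch in plain Matlab (it does not call clustergram or reproduce its internals). With unstandardized values coded 1/2, the city-block distance between two variables is simply the number of observations on which they disagree, and the Euclidean distance is the square root of that count. The matrix `toy` is a made-up example in the same layout as `data` (rows are variables, columns are observations).

```matlab
% Hypothetical toy data: 4 variables, 6 observations, coded 1 = false, 2 = true.
toy = [2 2 1 1 2 1;    % Cloudy
       1 1 2 1 1 2;    % Sprinkler
       2 2 1 1 2 1;    % Rain
       2 2 2 1 2 2];   % Wet grass

n_vars = size(toy, 1);
city = zeros(n_vars);        % city-block (L1) distance between rows
eucl = zeros(n_vars);        % Euclidean (L2) distance between rows
mismatches = zeros(n_vars);  % number of observations where two rows differ

for a = 1:n_vars
    for b = 1:n_vars
        d = toy(a, :) - toy(b, :);
        city(a, b) = sum(abs(d));
        eucl(a, b) = sqrt(sum(d .^ 2));
        mismatches(a, b) = sum(toy(a, :) ~= toy(b, :));
    end
end

disp(city);                        % pairwise city-block distances
disp(isequal(city, mismatches));   % same as the mismatch counts for 1/2-coded data
```

Once rows or columns are standardized, the values are no longer 1 and 2, so this identity with mismatch counts no longer holds and the two metrics can rank variable pairs differently.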

4. Exhaustive Search

```
dags = mk_all_dags(N);
score = score_dags(data, node_sizes, dags);
```
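Because every DAG on N nodes is enumerated and scored, exhaustive search is only practical for very small N. As a rough check on the size of the search space, the sketch below counts labelled DAGs with Robinson's recurrence, $a_n = \sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k} 2^{k(n-k)} a_{n-k}$ with $a_0 = 1$. It is plain Matlab and independent of BNT; `max_n` and `a` are names chosen here.

```matlab
max_n = 5;
a = zeros(1, max_n + 1);   % a(n + 1) holds the number of labelled DAGs on n nodes
a(1) = 1;                  % a_0 = 1 (the empty graph)

for n = 1:max_n
    total = 0;
    for k = 1:n
        % Inclusion-exclusion over the k nodes that have no incoming arcs.
        total = total + (-1)^(k + 1) * nchoosek(n, k) * 2^(k * (n - k)) * a(n - k + 1);
    end
    a(n + 1) = total;
end

fprintf('%d nodes: %d DAGs\n', [1:max_n; a(2:end)]);
```

The count is 543 for four nodes and already 29281 for five, which is why heuristic search is usually preferred beyond a handful of variables.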

For N=4 variables, there are 543 DAGs. With 281 observations, the three best-scoring DAGs are 491, 503, and 504, of which DAG 504 is the correct one. However, the search does not converge to a unique DAG as the number of observations increases. For example, with 2,810 observations the best-scoring DAG is 504; with 4,000 observations it is 503; and with 10,000 observations it is 491.
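One possible explanation, offered as an assumption rather than something checked against the indices returned by mk_all_dags, is Markov equivalence. Two DAGs encode the same conditional independencies exactly when they have the same skeleton and the same v-structures, and observational data alone cannot reliably separate such DAGs. The sketch below (plain Matlab; all helper names are introduced here) shows that the sprinkler DAG has two equivalent variants, obtained by reversing either C → S or C → R, while reversing both creates a new v-structure at C and leaves the class. It has not been verified here that DAGs 491 and 503 are these two variants.

```matlab
C = 1; S = 2; R = 3; W = 4; N = 4;

true_dag = zeros(N);
true_dag(C, [S R]) = 1;
true_dag([S R], W) = 1;

rev_s = true_dag;   rev_s(C, S) = 0;     rev_s(S, C) = 1;      % S -> C instead of C -> S
rev_r = true_dag;   rev_r(C, R) = 0;     rev_r(R, C) = 1;      % R -> C instead of C -> R
rev_both = rev_s;   rev_both(C, R) = 0;  rev_both(R, C) = 1;   % S -> C <- R

% Skeleton: the undirected version of the graph.
skeleton = @(g) (g + g') > 0;
% v_at(g, z): pairs of parents of node z that are not adjacent to each other.
v_at = @(g, z) (g(:, z) * g(:, z)') > 0 & ~skeleton(g) & ~eye(size(g, 1));
v_all = @(g) arrayfun(@(z) v_at(g, z), 1:size(g, 1), 'UniformOutput', false);
% Same skeleton and same v-structures means same Markov equivalence class.
same_class = @(g1, g2) isequal(skeleton(g1), skeleton(g2)) && isequal(v_all(g1), v_all(g2));

fprintf('reverse C->S only:  %d\n', same_class(true_dag, rev_s));
fprintf('reverse C->R only:  %d\n', same_class(true_dag, rev_r));
fprintf('reverse both arcs:  %d\n', same_class(true_dag, rev_both));
```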