17.7.3.3 Algorithms (Hierarchical Cluster Analysis)

1 Distance Matrix
- 1.1 Cluster Observations
- 1.2 Cluster Variables
2 Linkage Method
3 Dendrogram
4 Group Objects

Hierarchical Cluster Analysis is used to build a hierarchical tree. It starts with n clusters, each with a single object and then at each of n − 1 stages, merges two clusters to form a larger cluster, until all objects are in a single cluster. The process can be shown in a Dendrogram.

Objects to be clustered in Hierarchical Cluster Analysis can be observations or variables.

Distance Matrix

A distance, or dissimilarity, matrix is a symmetric matrix with zero diagonal elements. The ijth element represents how far or how dissimilar the ith and jth objects are. Methods to calculate the distance between two objects are different for clustering observations and clustering variables.

Cluster Observations

Origin supports standardization of data before distance calculation for clustering observations. Observations containing one or more missing values will be excluded in the analysis.

Standardize Variables
1. Normalize to N(0,1)
  For a variable $x_j\$ , it is normalized as: $(x_j-\bar{x}_j)/\sigma_j$ , where $\bar{x}_j\$ and $\sigma_j\$ are the mean and standard deviation of the variable. The standardized variable will have zero mean and unit standard deviation.
2. Normalize to (0,1)
  For a variable $x_j\$ , it is normalized as: $(x_j-\mathrm{min}(x_j))/(\mathrm{max}(x_j)-\mathrm{min}(x_j))\$ . The variable will be standardized in the range of 0 and 1.

Origin supports these distance types:

Distance Type
For a standardized matrix with n observations and p variables, the distance between the ith observation and the kth observation can be expressed as follows.
1. Euclidean
  $d_{ik}=\Big\{\sum_{j=1}^p (x_{ij}-x_{kj})^2 \Big\}^{\frac{1}{2}}$
2. Squared Euclidean
  $d_{ik}=\sum_{j=1}^p (x_{ij}-x_{kj})^2$
3. Absolute (City block metric)
  $d_{ik}=\sum_{j=1}^p |x_{ij}-x_{kj}|$
4. Cosine
  $d_{ik}=1-\frac{\sum_{j=1}^p x_{ij}x_{kj}}{\sqrt{\sum_{j=1}^p x_{ij}^2}\sqrt{\sum_{j=1}^p x_{kj}^2}}$
5. Pearson correlation
  $d_{ik}=1-\frac{\sum_{j=1}^p (x_{ij}- \bar x_i)(x_{kj}- \bar x_k)}{\sqrt{\sum_{j=1}^p (x_{ij}- \bar x_i)^2}\sqrt{\sum_{j=1}^p (x_{kj}- \bar x_k)^2}}$
  where $\bar x_i=\sum_{j=1}^p x_{ij}$ and $\bar x_k=\sum_{j=1}^p x_{kj}$
6. Jaccard
  $d_{ik}=\frac{\sum_{j=1}^p ( (x_{ij} \ne x_{kj}) \bigcap (x_{ij} \ne 0 \bigcup x_{kj} \ne 0) ) ? 1:0 }{\sum_{j=1}^p (x_{ij} \ne 0 \bigcup x_{kj} \ne 0) ) ? 1:0}$

Cluster Variables

Origin supports two distance types for clustering variables. Observations will be excluded in the calculation of the correlation between two variables if missing values exist in either of the variables.

Distance Type
For a matrix with n observations and p variables, the distance between the jth variable and the kth variable can be expressed as follows:
1. Correlation
  $d_{jk}=1-\mathrm{corr}(x_j, x_k)\$ where $\mathrm{corr}(x_j, x_k)\$ is the correlation between the jth variable and the kth variable.
2. Absolute correlation
  $d_{jk}=1-|\mathrm{corr}(x_j, x_k)|\$

Linkage Method

At each stage, the two clusters that are nearest are merged. Origin provides several methods to calculate the distance between the new cluster and other clusters. Let clusters j and k be merged as cluster jk. Let $n_i\$ , $n_j\$ and $n_k\$ be number of objects of Cluster i, Cluster j and Cluster k respectively, and let $d_{ij}\$ , $d_{ik}\$ and $d_{jk}\$ be the distance between two clusters. The distance between Cluster jk and Cluster i $d_{i,jk}\$ can then be calculated in the following ways:

Single link or Nearest neighbor
$d_{i,jk}=\mathrm{min}(d_{ij},d_{ik})\$
Complete link or Furthest neighbor
$d_{i,jk}=\mathrm{max}(d_{ij},d_{ik})\$
Group average
$d_{i,jk}=\frac{n_j}{n_j+n_k}d_{ij}+\frac{n_k}{n_j+n_k}d_{ik}\$
Centroid
$d_{i,jk}=\frac{n_j}{n_j+n_k}d_{ij}+\frac{n_k}{n_j+n_k}d_{ik}-\frac{n_jn_k}{(n_j+n_k)^2}d_{jk}\$
Median
$d_{i,jk}=\frac{1}{2}d_{ij}+\frac{1}{2}d_{ik}-\frac{1}{4}d_{jk}\$
Minimum variance or Ward
$d_{i,jk}=\frac{(n_i+n_j)d_{ij}+(n_i+n_k)d_{ik}-n_id_{jk}}{n_i+n_j+n_k}\$

If clusters j and k, j<k, merge then the new cluster will be referred to as Cluster j in the Cluster Stages table.

Dendrogram

The Dendrogram plot is a hierarchical tree that shows the distance at which two clusters merge. Each stage is represented as a unit in the Dendrogram. The top of the unit for each stage represents the new cluster by the merging of two clusters. Its height corresponds to the distance between two merged clusters.

The end-points of the Dendrogram represent n objects. n objects in the Dendrogram are sorted so that the clusters that merge are adjacent. The first end-point in the Dendrogram always corresponds to the first object.

Group Objects

Membership of n objects for specified k clusters can be determined from the information in the Dendrogram plot or Cluster Stages table. k clusters exist at the n-kth stage, which means membership of each object can be known from the first n-kth stages. Object 1 always belongs to Cluster 1.

Cluster centers, the distance between cluster centers, and the distance between observations and clusters, are calculated for clustering observations. Note that observations are standardized in the calculation if standardization is chosen in the analysis.