
statistiXL Features

Hierarchical Clustering

Cluster Analysis, also called Numerical Classification, is used to arrange objects of interest into a branching hierarchy of groups (a tree, or dendrogram) based on how similar or dissimilar the objects are across a number of attributes that are known for each object. For example, countries (the objects or cases) could be clustered based on socioeconomic attributes such as population size, average annual income, life expectancy and annual per capita expenditure on health. Such an analysis would show which countries are most similar in terms of these attributes.

Such hierarchical clustering can be either agglomerative, where clustering starts with the individual cases and proceeds by grouping the most similar cases together, or divisive, where the analysis starts with all cases in a single group and proceeds by splitting each group in two until only individual cases remain. statistiXL currently supports Agglomerative Cluster Analysis, which can be used for data exploration or reduction, model or hypothesis testing, and tree (dendrogram) summary of groups.
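The agglomerative procedure described above can be sketched in a few lines of Python. This is a minimal illustration only, not statistiXL's implementation: the function name agglomerate, the nearest-neighbour rule used to compare groups, and the four-case distance matrix are all assumptions chosen for the example.

```python
from itertools import combinations


def agglomerate(dist, labels):
    """Naive agglomerative clustering with a nearest-neighbour rule.

    dist   -- symmetric matrix (list of lists) of pairwise distances
    labels -- one name per case
    Returns the fusions in order as (group_a, group_b, distance) tuples.
    """
    # Start with every case in its own cluster (tuples of case indices).
    clusters = [(i,) for i in range(len(labels))]
    fusions = []
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest.
        best = None
        for a, b in combinations(clusters, 2):
            d = min(dist[i][j] for i in a for j in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Fuse the two clusters and record the fusion.
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        fusions.append((tuple(labels[i] for i in a),
                        tuple(labels[i] for i in b), d))
    return fusions


# Hypothetical distances between four cases, for illustration only.
labels = ["A", "B", "C", "D"]
dist = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 6.0],
    [4.0, 3.0, 0.0, 2.0],
    [5.0, 6.0, 2.0, 0.0],
]
fusions = agglomerate(dist, labels)
for a, b, d in fusions:
    print(a, "+", b, "at distance", d)
```

With these made-up distances, A and B fuse first (distance 1.0), then C and D (2.0), and the two pairs finally join at 3.0. A divisive method would work in the opposite direction, starting from the single group of four.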

statistiXL provides a variety of options for hierarchical agglomerative clustering. Firstly, there is a choice of methods for deriving the similarity, dissimilarity or distance matrix that forms the basis for grouping cases together. Methods are provided for clustering binomial data (e.g. Jaccard, Hamann, Kulczynski, Pattern Difference, Euclidean distance, squared Euclidean distance), quantitative data (e.g. Bray & Curtis, Canberra, City Block, Euclidean distance, squared Euclidean distance, Pearson correlation), or mixed data. Secondly, once two groups have been fused, the matrix is recalculated using a combinatorial algorithm and the next fusion is determined. statistiXL provides a range of combinatorial algorithms, including nearest neighbour, furthest neighbour, median, centroid, group average, Ward’s and flexible methods.
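To make the two stages concrete, the sketch below gives textbook definitions of a few of the measures named above, followed by a Lance-Williams style update of the kind usually meant by a combinatorial algorithm. The function names, the unnormalised form of the Canberra measure, and the three linkage rules shown are assumptions for illustration; the exact formulas used by statistiXL may differ.

```python
import math


def jaccard_dissimilarity(x, y):
    """1 - Jaccard similarity for two binary (0/1) attribute vectors."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return 1.0 - both / either if either else 0.0


def city_block(x, y):
    """Sum of absolute differences between attribute values."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def bray_curtis(x, y):
    """Sum of absolute differences over sum of totals (non-negative data)."""
    return city_block(x, y) / sum(a + b for a, b in zip(x, y))


def canberra(x, y):
    """Sum of |a - b| / (|a| + |b|), skipping attributes where both are 0."""
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(x, y) if a or b)


def lance_williams(d_ki, d_kj, d_ij, n_i, n_j, method):
    """Distance from group k to the new group formed by fusing i and j.

    d_ki, d_kj, d_ij -- distances before the fusion
    n_i, n_j         -- number of cases in groups i and j
    """
    if method == "nearest":
        ai, aj, beta, gamma = 0.5, 0.5, 0.0, -0.5
    elif method == "furthest":
        ai, aj, beta, gamma = 0.5, 0.5, 0.0, 0.5
    elif method == "group_average":
        total = n_i + n_j
        ai, aj, beta, gamma = n_i / total, n_j / total, 0.0, 0.0
    else:
        raise ValueError("unknown method: " + method)
    return ai * d_ki + aj * d_kj + beta * d_ij + gamma * abs(d_ki - d_kj)


# Made-up attribute vectors, for illustration only.
print(jaccard_dissimilarity([1, 1, 0, 0], [1, 0, 1, 0]))
print(euclidean([1, 2, 3], [4, 6, 3]))
print(lance_williams(3.0, 5.0, 1.0, 1, 1, "nearest"))
```

The update needs only the distances already held in the matrix, which is why the matrix can be recalculated after each fusion without returning to the raw attribute data. With the nearest neighbour coefficients the result is the smaller of the two old distances, and with the furthest neighbour coefficients it is the larger.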

Results are presented in both tabular and graphical form. They start with the similarity (or dissimilarity or distance) matrix derived from the attribute data according to the selected method. The order of fusion of cases is then given, with the corresponding similarity (or distance) at each fusion, until the final, completely fused group (the root) is reached. A cophenetic correlation coefficient is provided to indicate how closely the final hierarchical pattern matches the initial similarity (or distance) matrix. A dendrogram (tree graph) summarises the clustering pattern graphically. It starts with all individuals as separate clusters and shows the sequence of fusions back to a single root, and it can be either text (character) based or presented as an Excel plot.
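The cophenetic correlation coefficient mentioned above can be illustrated as follows. The cophenetic distance between two cases is the level at which they first end up in the same group, and the coefficient is the ordinary Pearson correlation between those levels and the original distances. This is a sketch under stated assumptions: the function names, the representation of fusions as (members_a, members_b, level) tuples, and the four-case data are hypothetical and are not taken from statistiXL.

```python
import math
from itertools import combinations


def cophenetic_matrix(n, fusions):
    """Level at which each pair of cases first shares a group.

    fusions -- list of (members_a, members_b, level), where the members
               are tuples of case indices and fusions are in order
    """
    coph = [[0.0] * n for _ in range(n)]
    for a, b, level in fusions:
        for i in a:
            for j in b:
                coph[i][j] = coph[j][i] = level
    return coph


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def cophenetic_correlation(dist, fusions):
    """Correlate original distances with the distances implied by the tree."""
    n = len(dist)
    coph = cophenetic_matrix(n, fusions)
    pairs = list(combinations(range(n), 2))
    original = [dist[i][j] for i, j in pairs]
    implied = [coph[i][j] for i, j in pairs]
    return pearson(original, implied)


# Hypothetical distances and fusion order for four cases (indices 0..3).
dist = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 6.0],
    [4.0, 3.0, 0.0, 2.0],
    [5.0, 6.0, 2.0, 0.0],
]
fusions = [((0,), (1,), 1.0), ((2,), (3,), 2.0), ((0, 1), (2, 3), 3.0)]
r = cophenetic_correlation(dist, fusions)
print(round(r, 2))
```

For this made-up example the coefficient is roughly 0.83. A value close to 1 suggests that the tree preserves the pairwise distances well, while a low value suggests that the hierarchy distorts them.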

The help file included with statistiXL provides an introduction to and detailed overview of clustering, and describes the general input and output options. Eleven clustering examples are provided to show how statistiXL is used to analyse different types of data (e.g. binomial, quantitative or mixed) and to illustrate the commonly used combinations of similarity/distance matrices and combinatorial algorithms.