Cluster Analysis
Cluster Analysis, also called Numerical Classification, is used to arrange objects of
interest into a branching hierarchy of groups (a tree, or dendrogram) based on
how similar or dissimilar the objects are in terms of a number of attributes
that are known for each object. For example, countries (the objects or cases)
could be clustered based on a number of socioeconomic attributes such as
population size, average annual income, life expectancy and annual per capita
expenditure on health. The outcome of such an analysis would be to show which
countries are most similar in terms of these attributes.
Such hierarchical clustering can be either agglomerative, where clustering
starts with the individual cases and proceeds by grouping the most similar
cases together, or divisive, where the analysis starts with all cases in a
single group and proceeds by dividing groups into two until only individual
cases remain. statistiXL currently supports Agglomerative Cluster Analysis,
which can be used for data exploration or reduction, model or hypothesis
testing, and summarising groups as a tree (dendrogram).
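The agglomerative procedure described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not statistiXL's implementation: the function name `agglomerate`, the four cases and their distances are all hypothetical, and the nearest neighbour rule is assumed for measuring the distance between groups.

```python
from itertools import combinations


def agglomerate(dist):
    """Naive nearest neighbour agglomerative clustering (illustrative only).

    dist is a symmetric matrix (list of lists) of pairwise distances.
    Returns the fusions in order as (members_a, members_b, distance).
    """
    # Start with every case in its own cluster.
    clusters = [(i,) for i in range(len(dist))]
    fusions = []

    def gap(pair):
        # Nearest neighbour rule: smallest distance between any two members.
        a, b = pair
        return min(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        # Fuse the two closest clusters, then repeat until one group remains.
        a, b = min(combinations(clusters, 2), key=gap)
        fusions.append((a, b, gap((a, b))))
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
    return fusions


# Hypothetical distance matrix for four cases (indices 0 to 3).
dist = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 6.0],
    [4.0, 3.0, 0.0, 2.0],
    [5.0, 6.0, 2.0, 0.0],
]

fusions = agglomerate(dist)
for members_a, members_b, distance in fusions:
    print(members_a, "+", members_b, "at distance", distance)
```

Each pass searches every pair of clusters, which is simple to follow but slow for large data sets; it is meant only to show the order in which fusions arise.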
statistiXL provides a range of options for hierarchical agglomerative
clustering. First, several methods are available for deriving the similarity
matrix, dissimilarity matrix or distance matrix that forms the basis for
grouping cases together. Methods are provided for clustering binomial data
(e.g. Jaccard, Hamann, Kulczynski, Pattern Difference, Euclidean Distance,
Squared Euclidean Distance), quantitative data (e.g. Bray & Curtis,
Canberra, City Block, Euclidean Distance, Squared Euclidean Distance,
Pearsonian correlation), or mixed data. Second, once two groups have been
fused, the matrix is recalculated using a combinatorial algorithm and the next
fusion is determined. The combinatorial algorithms available include nearest
neighbour, furthest neighbour, median, centroid, group average, Ward’s and
flexible methods.
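To make the two steps concrete, the sketch below shows textbook definitions of a few of the measures named above, followed by one common way of recalculating distances after a fusion, the Lance-Williams recurrence. The source does not say that statistiXL uses this recurrence, so treat it as an assumption; the function names and the method labels ("nearest", "furthest", "group_average") are hypothetical.

```python
import math


def euclidean(x, y):
    # Straight-line distance between two attribute vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def city_block(x, y):
    # Sum of absolute differences across attributes.
    return sum(abs(a - b) for a, b in zip(x, y))


def bray_curtis(x, y):
    # Assumes non-negative attributes with a non-zero combined total.
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(a + b for a, b in zip(x, y))


def jaccard_similarity(x, y):
    # For 0/1 data: joint presences divided by attributes present in either
    # case. Assumes at least one attribute is present in one of the cases.
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either


def lance_williams(d_ki, d_kj, d_ij, n_i, n_j, method):
    """Distance from group k to the new group formed by fusing i and j.

    d_ki, d_kj, d_ij are the distances before the fusion; n_i and n_j are
    the numbers of cases in groups i and j.
    """
    total = n_i + n_j
    coefficients = {
        "nearest": (0.5, 0.5, 0.0, -0.5),
        "furthest": (0.5, 0.5, 0.0, 0.5),
        "group_average": (n_i / total, n_j / total, 0.0, 0.0),
    }
    a_i, a_j, beta, gamma = coefficients[method]
    return a_i * d_ki + a_j * d_kj + beta * d_ij + gamma * abs(d_ki - d_kj)


# Hypothetical values: group k is 4.0 from i and 3.0 from j; i and j are 1.0 apart.
print(lance_williams(4.0, 3.0, 1.0, 1, 1, "nearest"))
print(lance_williams(4.0, 3.0, 1.0, 1, 1, "furthest"))
```

With the nearest neighbour coefficients the update reduces to the smaller of the two old distances, and with the furthest neighbour coefficients to the larger, which is a quick way to check the formula by hand.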
Results are presented both in tabular and graphical form. They start with the
similarity (or dissimilarity or distance) matrix derived from the attribute
data according to the selected method. Then, the order of fusion of cases is
given, with the corresponding similarity (or distance), until the final
completely fused group (root) is reached. A cophenetic correlation coefficient
is provided to indicate how closely the final hierarchical pattern reproduces
the initial similarity (or distance) matrix. A dendrogram (tree graph) is
provided to graphically summarise the clustering pattern. The dendrogram starts
with all individuals as separate clusters and shows the sequence of fusions
back to a single root. It can be either text (character) based or presented as
an Excel™ plot.
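As a rough illustration of the cophenetic correlation coefficient, the sketch below assumes the usual definition: the Pearson correlation between the original pairwise distances and the cophenetic distances, where the cophenetic distance between two cases is the level at which they are first fused into the same group. The function names, the distance matrix and the fusion list are hypothetical and are not taken from statistiXL output.

```python
import math
from itertools import combinations


def cophenetic_matrix(n, fusions):
    """Build the cophenetic distances from an ordered list of fusions.

    Each fusion is (members_a, members_b, level). Two cases first share a
    group at the fusion that joins the groups containing them.
    """
    coph = [[0.0] * n for _ in range(n)]
    for members_a, members_b, level in fusions:
        for i in members_a:
            for j in members_b:
                coph[i][j] = coph[j][i] = level
    return coph


def pearson(xs, ys):
    # Plain Pearson product-moment correlation.
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def cophenetic_correlation(dist, fusions):
    # Compare each pair of cases once (upper triangle of both matrices).
    n = len(dist)
    coph = cophenetic_matrix(n, fusions)
    pairs = list(combinations(range(n), 2))
    original = [dist[i][j] for i, j in pairs]
    fused = [coph[i][j] for i, j in pairs]
    return pearson(original, fused)


# Hypothetical distances for four cases and a fusion order for them.
dist = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 6.0],
    [4.0, 3.0, 0.0, 2.0],
    [5.0, 6.0, 2.0, 0.0],
]
fusions = [((0,), (1,), 1.0), ((2,), (3,), 2.0), ((0, 1), (2, 3), 3.0)]

r = cophenetic_correlation(dist, fusions)
print(round(r, 3))
```

A value near 1 suggests the tree distorts the original distances very little; lower values suggest that the chosen combination of distance measure and combinatorial algorithm represents the data less faithfully.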
The help file included with statistiXL provides an introduction and detailed
overview of clustering, and describes the general input and output options.
Eleven clustering examples are provided to show how statistiXL is used to
analyse different types of data (e.g.
binomial, quantitative or mixed) and to illustrate the commonly used
combinations of similarity/distance matrices and combinatorial algorithms.