Overview
Clustering in data mining is a technique used to group similar data points
together based on their features and characteristics. It is an unsupervised
learning method that helps to identify patterns in large datasets and segment
them into smaller groups or subsets.
What is Clustering in Data Mining?
Clustering
in data mining is a technique that groups similar data points together based on
their features and characteristics. It can also be referred to as a process of
grouping a set of objects so that objects in the same group (called a cluster)
are more similar to each other than those in other groups (clusters). It is an
unsupervised learning technique that aims to identify similarities and patterns
in a dataset.
Clustering
techniques in data mining can be used in various applications, such as image
segmentation, document clustering, and customer
segmentation. The goal is to obtain meaningful insights from the data
and improve decision-making processes.

A cluster can have the following properties:
- The data points within a
cluster are similar to each other based on some pre-defined criteria or
similarity measures.
- The clusters are distinct
from each other, and the data points in one cluster are different from
those in another cluster.
- The data points within a
cluster are closely packed together.
- A cluster is often
represented by a centroid or a center point that summarizes the properties
of the data points within the cluster.
- A cluster can have any
number of data points, but a good cluster should not be too small or too
large.
SIMILARITIES AND DISTANCE MEASURES
Clustering is a fundamental concept in data analysis and
machine learning, where the goal is to group similar data points into clusters
based on their characteristics. One of the most critical aspects of clustering
is the choice of distance measure, which determines how similar or dissimilar
two data points are.
Distance measures are mathematical functions that quantify how similar or different two data points are. The choice of distance measure can significantly impact the clustering results, as it influences the shape and structure of the clusters and the quality of the insights derived from them. A well-chosen distance measure leads to meaningful clusters that reveal hidden patterns in the data.
Common Distance Measures
There
are several types of distance measures, each with its strengths and weaknesses.
Here are some of the most commonly used distance measures in clustering:
1. Euclidean Distance
The Euclidean distance is the most widely used distance measure in clustering. It
calculates the straight-line distance between two points in n-dimensional
space. The formula for Euclidean distance is:
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
where p and q are two data points and n is the number of dimensions.
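As a quick illustration (not part of the original text), here is a minimal NumPy sketch of the Euclidean distance; the function name and sample points are arbitrary.

```python
import numpy as np

def euclidean_distance(p, q):
    """Straight-line (L2) distance between two n-dimensional points."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))

print(euclidean_distance([1, 2], [4, 6]))  # 5.0
```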
2. Manhattan Distance
The Manhattan distance, sometimes referred to as the L1 distance or city-block distance, is the sum of the absolute differences between the Cartesian coordinates of two points. It corresponds to the total distance traveled along each dimension to move from one data point to the other. This metric is often preferred over the Euclidean distance for categorical data, since it is less susceptible to outliers. The formula is:
d(p, q) = \sum_{i=1}^{n} |p_i - q_i|
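A corresponding sketch for the Manhattan distance (again, an illustrative example rather than part of the source):

```python
import numpy as np

def manhattan_distance(p, q):
    """Sum of absolute coordinate differences (L1 / city-block distance)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(np.abs(p - q))

print(manhattan_distance([1, 2], [4, 6]))  # 7.0
```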
3. Cosine Similarity
Rather than concentrating on the exact distance between data points, the cosine similarity measure looks at their orientation. It calculates the cosine of the angle between two data points, with a higher cosine value indicating greater similarity. This measure is often used for text data analysis.
\mathrm{similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}
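A small sketch of the formula above (illustrative only; the vectors are made up):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1 means identical orientation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([1, 0, 1], [1, 1, 0]))  # 0.5
```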
4. Minkowski Distance
Minkowski distance is a generalized form of
both Euclidean and Manhattan distances, controlled by a parameter p. The
Minkowski distance allows adjusting the power parameter (p). When p=1, it’s
equivalent to Manhattan distance; when p=2, it’s Euclidean distance.
d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}
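To see how the parameter p generalizes the previous two measures, here is a short illustrative sketch (not from the source):

```python
import numpy as np

def minkowski_distance(x, y, p=2):
    """Generalized distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

print(minkowski_distance([1, 2], [4, 6], p=1))  # 7.0, same as Manhattan
print(minkowski_distance([1, 2], [4, 6], p=2))  # 5.0, same as Euclidean
```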
5. Jaccard Index
This measure is ideal for binary data, where features can only take values of 0 or 1. It calculates the ratio of the number of features shared by two data points to the total number of features present in either point. In other words, the Jaccard index measures the similarity between two sets by comparing the size of their intersection to the size of their union.
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
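A minimal sketch of the Jaccard index on two small sets (the example sets are invented for illustration):

```python
def jaccard_index(a, b):
    """|A intersect B| / |A union B| for two sets of binary features."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(jaccard_index({"red", "round"}, {"red", "square"}))  # 1/3, about 0.33
```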
OUTLIERS
Outliers are sample points with values much different from those of the remaining set of data. Outliers may represent errors in the data, or they may be correct data values that are simply very different from the remaining data. For example, a person who is 2.5 meters tall is much taller than most people; in analyzing the heights of individuals, this value would probably be viewed as an outlier.
This problem is illustrated in Figure 5.3. Here if
three clusters are found (solid line), the outlier will occur in a cluster by
itself. However, if two clusters are found (dashed line), the two (obviously)
different sets of data will be placed in one cluster because they are closer
together than the outlier. This problem is complicated by the fact that many
clustering algorithms actually have as input the number of desired clusters to
be found. Clustering algorithms may actually find and remove outliers to ensure
that they perform better. However, care must be taken in actually removing
outliers.
For example, suppose that the data mining problem is
to predict flooding. Extremely high water level values occur very infrequently,
and when compared with the normal water level values may seem to be outliers.
However, removing these values may not allow the data mining algorithms to work
effectively because there would be no data that showed that floods ever
actually occurred. Outlier detection, or outlier mining, is the process of
identifying outliers in a set of data. Clustering, or other data mining,
algorithms may then choose to remove or treat these values differently. Some
outlier detection techniques are based on statistical methods. These usually assume that the set of data follows a known distribution and that outliers can be detected by well-known tests such as discordancy tests. However, these tests are not very realistic for real-world data, because real-world data values may not follow well-defined distributions. Also, most of these tests assume a single attribute value, while many attributes are involved in real-world datasets. Alternative detection techniques may instead be based on distance measures.
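As a rough sketch of such a distance-based approach (an illustrative assumption on my part, not a method prescribed by the text), one can flag points whose average distance to all other points exceeds a chosen threshold; both the threshold and the sample heights below are made up:

```python
import numpy as np

def distance_based_outliers(points, threshold):
    """Flag points whose mean Euclidean distance to all other points
    exceeds a problem-specific threshold."""
    points = np.asarray(points, dtype=float)
    flags = []
    for i, p in enumerate(points):
        others = np.delete(points, i, axis=0)
        mean_dist = np.mean(np.sqrt(np.sum((others - p) ** 2, axis=1)))
        flags.append(mean_dist > threshold)
    return flags

heights = [[1.6], [1.7], [1.75], [1.8], [2.5]]   # the 2.5 m person from the text
print(distance_based_outliers(heights, threshold=0.5))  # only the last point is flagged
```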

Hierarchical Clustering
in Data Mining
The hierarchical clustering method works by grouping data into a tree of clusters. Hierarchical clustering begins by treating every data point as a separate cluster. Then, it repeatedly executes the following two steps:
1. Identify the two clusters that are closest together, and
2. Merge the two most similar clusters.
These steps are repeated until all the clusters are merged together.
In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. A dendrogram (a tree-like diagram that records the sequences of merges or splits) graphically represents this hierarchy; it is an inverted tree that describes the order in which points are merged (bottom-up view) or clusters are broken up (top-down view).
Hierarchical
clustering is a
method of cluster analysis in data mining
that creates a hierarchical representation of the clusters in a dataset. The
method starts by treating each data point as a separate cluster and then
iteratively combines the closest clusters until a stopping criterion is reached.
Hierarchical clustering has several advantages:
- The ability to handle non-convex clusters and clusters of different sizes and densities.
- The ability to handle missing data and noisy data.
- The ability to reveal the hierarchical structure of the data, which can be useful for understanding the relationships among the clusters.
Drawbacks of Hierarchical Clustering
- The need for a criterion to stop the clustering process and determine the final number of clusters.
- The computational cost and memory requirements of the method can be high, especially for large datasets.
- The results can be sensitive to the initial conditions, linkage criterion, and distance metric used.
In summary, Hierarchical clustering is a method of data mining
that groups similar data points into clusters by creating a hierarchical
structure of the clusters.
This method can handle different types of data and reveal the relationships among the clusters. However, it can have a high computational cost, and its results can be sensitive to the conditions listed above.
Types of Hierarchical Clustering
Basically, there are two types of hierarchical clustering:
1. Agglomerative clustering
2. Divisive clustering
1. Agglomerative Clustering
Initially, consider every data point as an individual cluster, and at every step merge the nearest pair of clusters (it is a bottom-up method). At first, every data point is considered an individual entity or cluster. At every iteration, the clusters merge with other clusters until one cluster is formed.
The algorithm for Agglomerative Hierarchical
Clustering is:
1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other clusters (compute the proximity matrix).
3. Merge the clusters that are most similar or closest to each other.
4. Recalculate the proximity matrix for each cluster.
5. Repeat steps 3 and 4 until only a single cluster remains.
Let’s see the graphical
representation of this algorithm using a dendrogram.
Note: This is just a demonstration of how the actual algorithm works; no calculation has been performed below, and all the proximities among the clusters are assumed.

- Step-1: Consider each alphabet as a single cluster and calculate the distance of one cluster from all the other clusters.
- Step-2: In the second step, comparable clusters are merged together to form a single cluster. Let's say cluster (B) and cluster (C) are very similar to each other, so we merge them in this step; similarly for clusters (D) and (E). At the end of this step we get the clusters [(A), (BC), (DE), (F)].
- Step-3: We recalculate the proximity according to the algorithm and merge the two nearest clusters ((DE) and (F)) together to form the new clusters [(A), (BC), (DEF)].
- Step-4: Repeating the same process, the clusters DEF and BC are comparable and are merged together to form a new cluster. We're now left with the clusters [(A), (BCDEF)].
- Step-5: At last, the two remaining clusters are merged together to form a single cluster [(ABCDEF)].
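To reproduce this kind of bottom-up merging and the resulting dendrogram in practice, a sketch using SciPy is shown below; the six 2-D points standing in for A–F are invented, and single linkage is just one possible choice of merge criterion:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Six made-up 2-D points standing in for the clusters A..F in the walkthrough above.
points = np.array([[0.0, 0.0], [1.0, 1.0], [1.1, 1.1],
                   [5.0, 5.0], [5.1, 5.1], [8.0, 8.0]])
labels = ["A", "B", "C", "D", "E", "F"]

# Bottom-up (agglomerative) merging; 'single' linkage merges the closest pair first.
Z = linkage(points, method="single")
dendrogram(Z, labels=labels)
plt.show()
```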
Divisive Hierarchical clustering
We can say that divisive hierarchical clustering is precisely the opposite of agglomerative hierarchical clustering. In divisive hierarchical clustering, we start by taking all of the data points as a single cluster, and in every iteration we split off the data points that are not comparable with the rest of their cluster. In the end, we are left with N clusters.
PARTITIONAL ALGORITHMS
Nonhierarchical or
partitional clustering creates the clusters in one step as opposed to several
steps. Only one set of clusters is created, although several different sets of
clusters may be created internally within the various algorithms. Since only
one set of clusters is output, the user must input the desired number, k, of
clusters.
Minimum Spanning Tree
Since we have agglomerative and divisive algorithms based on the use of an MST, we also present a partitional MST algorithm. This is a very simplistic approach, but it illustrates how partitional algorithms work. The algorithm is shown in Algorithm 5.4. Since the clustering problem is to define a mapping, the output of this algorithm shows the clusters as a set of ordered pairs (ti, j) where f(ti) = Kj.
Partitional MST algorithm:
M = MST(A);
identify inconsistent edges in M;
remove k - 1 inconsistent edges;
create output representation;
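A minimal sketch of this idea using SciPy is given below; for simplicity it treats the k − 1 longest MST edges as the "inconsistent" ones to remove, which is an assumption on my part (real implementations usually apply a statistical inconsistency test), and the sample points are invented:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mst_partition(points, k):
    """Build an MST over the complete distance graph, then drop the k-1
    longest edges so that k connected components (clusters) remain."""
    dist = squareform(pdist(points))              # full pairwise distance matrix
    mst = minimum_spanning_tree(dist).toarray()   # weighted MST edges
    heaviest = np.argsort(mst, axis=None)[::-1][:k - 1]
    mst.flat[heaviest] = 0                        # remove the k-1 heaviest edges
    _, labels = connected_components(mst, directed=False)
    return labels                                 # cluster label j for each item ti

points = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]])
print(mst_partition(points, k=3))  # e.g. [0 0 1 1 2]
```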
Squared Error Clustering Algorithm
The squared error clustering algorithm minimizes the squared error. The squared error for a cluster is the sum of the squared Euclidean distances between each element in the cluster and the cluster centroid Ck. Given a cluster Ki, let the set of items mapped to that cluster be {ti1, ti2, ..., tim}. The squared error is then defined as
se_{K_i} = \sum_{j=1}^{m} \lVert t_{ij} - C_k \rVert^2
ALGORITHM
Squared error algorithm:
assign each item ti to a cluster;
calculate the center for each cluster;
repeat
    assign each item ti to the cluster which has the closest center;
    calculate the new center for each cluster;
    calculate the squared error;
until the difference between successive squared errors is below a threshold;
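A short sketch of the squared error computed for a given assignment of points to clusters (the points and labels are arbitrary examples):

```python
import numpy as np

def squared_error(points, labels):
    """Sum of squared Euclidean distances of each point to its cluster centroid."""
    points, labels = np.asarray(points, dtype=float), np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        members = points[labels == c]
        centroid = members.mean(axis=0)
        total += np.sum((members - centroid) ** 2)
    return total

pts = [[0, 0], [0, 2], [5, 5], [5, 7]]
print(squared_error(pts, [0, 0, 1, 1]))  # 2.0 + 2.0 = 4.0
```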
K-Means Clustering
K-means is an iterative clustering algorithm in which items are moved among sets of clusters until the desired set is reached. As such, it may be viewed as a type of squared error algorithm, although the convergence criteria need not be defined based on the squared error. A high degree of similarity among elements in a cluster is obtained, while a high degree of dissimilarity among elements in different clusters is achieved simultaneously. The cluster mean of Ki = {ti1, ti2, ..., tim} is defined as
m_i = \frac{1}{m} \sum_{j=1}^{m} t_{ij}
ALGORITHM
K-means algorithm:
assign initial values for means m1, m2, ..., mk;
repeat
    assign each item ti to the cluster which has the closest mean;
    calculate the new mean for each cluster;
until convergence criteria is met;
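A bare-bones NumPy sketch that follows the pseudocode above (initial means chosen at random from the data, convergence when the means stop moving); the sample points are invented, and a production implementation would normally use an existing library:

```python
import numpy as np

def k_means(points, k, n_iter=100, seed=0):
    """Alternate (a) assign each item to the closest mean and (b) recompute means."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    means = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_means = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(k)])
        if np.allclose(new_means, means):   # convergence criterion
            break
        means = new_means
    return labels, means

pts = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
print(k_means(pts, k=2))
```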
Nearest Neighbor Algorithm
An
algorithm similar to the single link technique is called the nearest neighbor
algorithm. With this serial algorithm, items are iteratively merged into the
existing clusters that are closest. In this algorithm a threshold, t, is used
to determine if items will be added to existing clusters or if a new cluster is
created.
Nearest neighbor algorithm:
K1 = {t1};
K = {K1};
k = 1;
for i = 2 to n do
    find the tm in some cluster Km in K such that dis(ti, tm) is the smallest;
    if dis(ti, tm) ≤ t then
        Km = Km ∪ {ti}
    else
        k = k + 1;
        Kk = {ti};
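The following sketch mirrors the pseudocode above; the threshold t and the sample points are arbitrary illustrative values:

```python
import numpy as np

def nearest_neighbor_clustering(points, t):
    """Each new item joins the cluster of its nearest already-clustered item
    if that distance is <= t, otherwise it starts a new cluster."""
    points = np.asarray(points, dtype=float)
    labels = [0]                                   # t1 starts cluster K1
    for i in range(1, len(points)):
        dists = np.linalg.norm(points[:i] - points[i], axis=1)
        nearest = int(dists.argmin())
        labels.append(labels[nearest] if dists[nearest] <= t else max(labels) + 1)
    return labels

pts = [[0, 0], [0.5, 0], [5, 5], [5.2, 5], [10, 10]]
print(nearest_neighbor_clustering(pts, t=1.0))  # [0, 0, 1, 1, 2]
```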
Genetic Algorithms
There have been clustering techniques based on the use of genetic algorithms. To determine how to perform clustering with genetic algorithms, we first must determine how to represent each cluster. One simple approach would be to use a bit-map representation for each possible cluster. So, given a database with four items, {A, B, C, D}, we would represent one solution to creating two clusters as 1001 and 0110. This represents the two clusters {A, D} and {B, C}.
GA clustering algorithm:
randomly create an initial solution;
repeat
    use crossover to create a new solution;
until termination criteria is met;
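As a rough illustration of the bit-map representation and the crossover step only (fitness evaluation, selection, mutation, and the termination test are omitted, and all names here are invented for the sketch):

```python
import random

# Bit-map representation from the text: for items {A, B, C, D}, the string "1001"
# puts A and D in one cluster and B and C in the other.
def decode(bits, items):
    cluster_one = [x for x, b in zip(items, bits) if b == "1"]
    cluster_zero = [x for x, b in zip(items, bits) if b == "0"]
    return cluster_one, cluster_zero

def crossover(parent_a, parent_b):
    """Single-point crossover on two bit strings, producing one child solution."""
    point = random.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:]

items = ["A", "B", "C", "D"]
print(decode("1001", items))       # (['A', 'D'], ['B', 'C'])
print(crossover("1001", "0110"))   # e.g. '1010' or '1110', depending on the cut point
```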