Overview
Clustering in data mining is a technique used to group similar data points
together based on their features and characteristics. It is an unsupervised
learning method that helps to identify patterns in large datasets and segment
them into smaller groups or subsets.
What is Clustering in Data Mining?
Clustering
in data mining is a technique that groups similar data points together based on
their features and characteristics. It can also be referred to as a process of
grouping a set of objects so that objects in the same group (called a cluster)
are more similar to each other than those in other groups (clusters). It is an
unsupervised learning technique that aims to identify similarities and patterns
in a dataset.
Clustering
techniques in data mining can be used in various applications, such as image
segmentation, document clustering, and customer
segmentation. The goal is to obtain meaningful insights from the data
and improve decision-making processes.

A cluster can have the following properties:
- The data points within a
cluster are similar to each other based on some pre-defined criteria or
similarity measures.
- The clusters are distinct
from each other, and the data points in one cluster are different from
those in another cluster.
- The data points within a
cluster are closely packed together.
- A cluster is often
represented by a centroid or a center point that summarizes the properties
of the data points within the cluster.
- A cluster can have any
number of data points, but a good cluster should not be too small or too
large.
SIMILARITIES AND DISTANCE MEASURES
Clustering is a fundamental concept in data analysis and
machine learning, where the goal is to group similar data points into clusters
based on their characteristics. One of the most critical aspects of clustering
is the choice of distance measure, which determines how similar or dissimilar
two data points are.
Distance measures are mathematical functions that quantify how similar or different two data points are. The choice of distance measure can significantly impact the clustering results, as it influences the shape and structure of the clusters and the quality of the insights derived from them. A well-chosen distance measure leads to meaningful clusters that reveal hidden patterns in the data.
Common Distance Measures
There
are several types of distance measures, each with its strengths and weaknesses.
Here are some of the most commonly used distance measures in clustering:
1. Euclidean Distance
The Euclidean distance is the most widely used distance measure in clustering. It
calculates the straight-line distance between two points in n-dimensional
space. The formula for Euclidean distance is:
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
where p and q are two data points and n is the number of dimensions.
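As a quick illustration (not part of the original text), here is a minimal NumPy sketch of the Euclidean distance; the function name and sample points are arbitrary.

```python
import numpy as np

def euclidean_distance(p, q):
    """Straight-line (L2) distance between two n-dimensional points."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))

print(euclidean_distance([1, 2], [4, 6]))  # 5.0
```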
2. Manhattan Distance
The Manhattan distance, sometimes referred to as the L1 distance or city-block distance, is the sum of the absolute differences between the Cartesian coordinates of two points. It corresponds to the total distance traveled along each dimension to move from one data point to the other. This metric is often preferred over the Euclidean distance for categorical data, since it is less susceptible to outliers. The formula is:
d(p, q) = \sum_{i=1}^{n} |p_i - q_i|
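A corresponding sketch for the Manhattan distance (again, an illustrative example rather than part of the source):

```python
import numpy as np

def manhattan_distance(p, q):
    """Sum of absolute coordinate differences (L1 / city-block distance)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(np.abs(p - q))

print(manhattan_distance([1, 2], [4, 6]))  # 7.0
```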
3. Cosine Similarity
Rather than concentrating on the exact distance between data points, the cosine similarity measure looks at their orientation. It calculates the cosine of the angle between two data points, with a higher cosine value indicating greater similarity. This measure is often used for text data analysis.
\mathrm{similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}
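A small sketch of the formula above (illustrative only; the vectors are made up):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1 means identical orientation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([1, 0, 1], [1, 1, 0]))  # 0.5
```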
4. Minkowski Distance
Minkowski distance is a generalized form of
both Euclidean and Manhattan distances, controlled by a parameter p. The
Minkowski distance allows adjusting the power parameter (p). When p=1, it’s
equivalent to Manhattan distance; when p=2, it’s Euclidean distance.
d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}
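To see how the parameter p generalizes the previous two measures, here is a short illustrative sketch (not from the source):

```python
import numpy as np

def minkowski_distance(x, y, p=2):
    """Generalized distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

print(minkowski_distance([1, 2], [4, 6], p=1))  # 7.0, same as Manhattan
print(minkowski_distance([1, 2], [4, 6], p=2))  # 5.0, same as Euclidean
```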
5. Jaccard Index
This measure is ideal for binary data, where features can only take values of 0 or 1. It calculates the ratio of the number of features shared by two data points to the total number of features present in either point. In other words, the Jaccard index measures the similarity between two sets by comparing the size of their intersection to the size of their union.
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
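A minimal sketch of the Jaccard index on two small sets (the example sets are invented for illustration):

```python
def jaccard_index(a, b):
    """|A intersect B| / |A union B| for two sets of binary features."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(jaccard_index({"red", "round"}, {"red", "square"}))  # 1/3, about 0.33
```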
OUTLIERS
Outliers are sample points with values much different from those of the remaining set of data. Outliers may represent errors in the data, or they may be correct data values that are simply very different from the remaining data. For example, a person who is 2.5 meters tall is much taller than most people; in analyzing the heights of individuals, this value would probably be viewed as an outlier.
This problem is illustrated in Figure 5.3. Here if
three clusters are found (solid line), the outlier will occur in a cluster by
itself. However, if two clusters are found (dashed line), the two (obviously)
different sets of data will be placed in one cluster because they are closer
together than the outlier. This problem is complicated by the fact that many
clustering algorithms actually have as input the number of desired clusters to
be found. Clustering algorithms may actually find and remove outliers to ensure
that they perform better. However, care must be taken in actually removing
outliers.
For example, suppose that the data mining problem is
to predict flooding. Extremely high water level values occur very infrequently,
and when compared with the normal water level values may seem to be outliers.
However, removing these values may not allow the data mining algorithms to work
effectively because there would be no data that showed that floods ever
actually occurred. Outlier detection, or outlier mining, is the process of
identifying outliers in a set of data. Clustering, or other data mining,
algorithms may then choose to remove or treat these values differently. Some
outlier detection techniques are based on statistical methods. These usually assume that the set of data follows a known distribution and that outliers can be detected by well-known tests such as discordancy tests. However, these tests are not very realistic for real-world data, because real-world data values may not follow well-defined distributions. Also, most of these tests assume a single attribute value, while many attributes are involved in real-world datasets. Alternative detection techniques may instead be based on distance measures.
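As a rough sketch of such a distance-based approach (an illustrative assumption on my part, not a method prescribed by the text), one can flag points whose average distance to all other points exceeds a chosen threshold; both the threshold and the sample heights below are made up:

```python
import numpy as np

def distance_based_outliers(points, threshold):
    """Flag points whose mean Euclidean distance to all other points
    exceeds a problem-specific threshold."""
    points = np.asarray(points, dtype=float)
    flags = []
    for i, p in enumerate(points):
        others = np.delete(points, i, axis=0)
        mean_dist = np.mean(np.sqrt(np.sum((others - p) ** 2, axis=1)))
        flags.append(mean_dist > threshold)
    return flags

heights = [[1.6], [1.7], [1.75], [1.8], [2.5]]   # the 2.5 m person from the text
print(distance_based_outliers(heights, threshold=0.5))  # only the last point is flagged
```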

Hierarchical Clustering
in Data Mining
The hierarchical clustering method works by grouping data into a tree of clusters. Hierarchical clustering begins by treating every data point as a separate cluster. Then, it repeatedly executes the following two steps:
1. Identify the two clusters that are closest together, and
2. Merge the two most similar clusters.
These steps are repeated until all the clusters are merged together.
In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. A dendrogram (a tree-like diagram that records the sequences of merges or splits) graphically represents this hierarchy; it is an inverted tree that describes the order in which points are merged (bottom-up view) or clusters are broken up (top-down view).
Hierarchical
clustering is a
method of cluster analysis in data mining
that creates a hierarchical representation of the clusters in a dataset. The
method starts by treating each data point as a separate cluster and then
iteratively combines the closest clusters until a stopping criterion is reached.
Hierarchical clustering has several advantages:
- The ability to handle non-convex clusters and clusters of different sizes and densities.
- The ability to handle missing data and noisy data.
- The ability to reveal the hierarchical structure of the data, which can be useful for understanding the relationships among the clusters.
Drawbacks of Hierarchical Clustering
- The need for a criterion to stop the clustering process and determine the final number of clusters.
- The computational cost and memory requirements of the method can be high, especially for large datasets.
- The results can be sensitive to the initial conditions, linkage criterion, and distance metric used.
In summary, Hierarchical clustering is a method of data mining
that groups similar data points into clusters by creating a hierarchical
structure of the clusters.
This method can handle different types of data and reveal the relationships among the clusters. However, it can have a high computational cost, and its results can be sensitive to the conditions listed above.
Types of Hierarchical Clustering
Basically, there are two types of hierarchical clustering:
1. Agglomerative clustering
2. Divisive clustering
1. Agglomerative Clustering
Initially, consider every data point as an individual cluster, and at every step merge the nearest pair of clusters (it is a bottom-up method). At first, every data point is considered an individual entity or cluster. At every iteration, the clusters merge with other clusters until one cluster is formed.
The algorithm for Agglomerative Hierarchical
Clustering is:
1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other clusters (compute the proximity matrix).
3. Merge the clusters that are most similar or closest to each other.
4. Recalculate the proximity matrix for each cluster.
5. Repeat steps 3 and 4 until only a single cluster remains.
Let’s see the graphical
representation of this algorithm using a dendrogram.
Note: This is just a demonstration of how the actual algorithm works; no calculation has been performed below, and all the proximities among the clusters are assumed.

- Step-1: Consider each alphabet as a single cluster and calculate the distance of one cluster from all the other clusters.
- Step-2: In the second step, comparable clusters are merged together to form a single cluster. Let's say cluster (B) and cluster (C) are very similar to each other, so we merge them in this step; similarly for clusters (D) and (E). At the end of this step we get the clusters [(A), (BC), (DE), (F)].
- Step-3: We recalculate the proximity according to the algorithm and merge the two nearest clusters ((DE) and (F)) together to form the new clusters [(A), (BC), (DEF)].
- Step-4: Repeating the same process, the clusters DEF and BC are comparable and are merged together to form a new cluster. We're now left with the clusters [(A), (BCDEF)].
- Step-5: At last, the two remaining clusters are merged together to form a single cluster [(ABCDEF)].
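To reproduce this kind of bottom-up merging and the resulting dendrogram in practice, a sketch using SciPy is shown below; the six 2-D points standing in for A–F are invented, and single linkage is just one possible choice of merge criterion:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Six made-up 2-D points standing in for the clusters A..F in the walkthrough above.
points = np.array([[0.0, 0.0], [1.0, 1.0], [1.1, 1.1],
                   [5.0, 5.0], [5.1, 5.1], [8.0, 8.0]])
labels = ["A", "B", "C", "D", "E", "F"]

# Bottom-up (agglomerative) merging; 'single' linkage merges the closest pair first.
Z = linkage(points, method="single")
dendrogram(Z, labels=labels)
plt.show()
```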
Divisive Hierarchical clustering
We can say that divisive hierarchical clustering is precisely the opposite of agglomerative hierarchical clustering. In divisive hierarchical clustering, we start by taking all of the data points as a single cluster, and in every iteration we split off the data points that are not comparable with the rest of their cluster. In the end, we are left with N clusters.
PARTITIONAL ALGORITHMS
Nonhierarchical or
partitional clustering creates the clusters in one step as opposed to several
steps. Only one set of clusters is created, although several different sets of
clusters may be created internally within the various algorithms. Since only
one set of clusters is output, the user must input the desired number, k, of
clusters.
Minimum Spanning Tree
Since we have agglomerative and divisive algorithms based on the use of an MST, we also present a partitional MST algorithm. This is a very simplistic approach, but it illustrates how partitional algorithms work. The algorithm is shown in Algorithm 5.4. Since the clustering problem is to define a mapping, the output of this algorithm shows the clusters as a set of ordered pairs (ti, j) where f(ti) = Kj.
Partitional MST algorithm:
M = MST(A);
identify inconsistent edges in M;
remove k - 1 inconsistent edges;
create output representation;
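A minimal sketch of this idea using SciPy is given below; for simplicity it treats the k − 1 longest MST edges as the "inconsistent" ones to remove, which is an assumption on my part (real implementations usually apply a statistical inconsistency test), and the sample points are invented:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mst_partition(points, k):
    """Build an MST over the complete distance graph, then drop the k-1
    longest edges so that k connected components (clusters) remain."""
    dist = squareform(pdist(points))              # full pairwise distance matrix
    mst = minimum_spanning_tree(dist).toarray()   # weighted MST edges
    heaviest = np.argsort(mst, axis=None)[::-1][:k - 1]
    mst.flat[heaviest] = 0                        # remove the k-1 heaviest edges
    _, labels = connected_components(mst, directed=False)
    return labels                                 # cluster label j for each item ti

points = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]])
print(mst_partition(points, k=3))  # e.g. [0 0 1 1 2]
```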
Squared Error Clustering Algorithm
The squared error clustering algorithm minimizes the squared error. The squared error for a cluster is the sum of the squared Euclidean distances between each element in the cluster and the cluster centroid Ck. Given a cluster Ki, let the set of items mapped to that cluster be {ti1, ti2, ..., tim}. The squared error is then defined as
se_{K_i} = \sum_{j=1}^{m} \lVert t_{ij} - C_k \rVert^2
ALGORITHM
Squared error algorithm:
assign each item ti to a cluster;
calculate the center for each cluster;
repeat
    assign each item ti to the cluster which has the closest center;
    calculate the new center for each cluster;
    calculate the squared error;
until the difference between successive squared errors is below a threshold;
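A short sketch of the squared error computed for a given assignment of points to clusters (the points and labels are arbitrary examples):

```python
import numpy as np

def squared_error(points, labels):
    """Sum of squared Euclidean distances of each point to its cluster centroid."""
    points, labels = np.asarray(points, dtype=float), np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        members = points[labels == c]
        centroid = members.mean(axis=0)
        total += np.sum((members - centroid) ** 2)
    return total

pts = [[0, 0], [0, 2], [5, 5], [5, 7]]
print(squared_error(pts, [0, 0, 1, 1]))  # 2.0 + 2.0 = 4.0
```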
K-Means Clustering
K-means is an iterative clustering algorithm in which items are moved among sets of clusters until the desired set is reached. As such, it may be viewed as a type of squared error algorithm, although the convergence criteria need not be defined based on the squared error. A high degree of similarity among elements in a cluster is obtained, while a high degree of dissimilarity among elements in different clusters is achieved simultaneously. The cluster mean of Ki = {ti1, ti2, ..., tim} is defined as
m_i = \frac{1}{m} \sum_{j=1}^{m} t_{ij}
ALGORITHM
K-means algorithm:
assign initial values for means m1, m2, ..., mk;
repeat
    assign each item ti to the cluster which has the closest mean;
    calculate the new mean for each cluster;
until convergence criteria is met;
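A bare-bones NumPy sketch that follows the pseudocode above (initial means chosen at random from the data, convergence when the means stop moving); the sample points are invented, and a production implementation would normally use an existing library:

```python
import numpy as np

def k_means(points, k, n_iter=100, seed=0):
    """Alternate (a) assign each item to the closest mean and (b) recompute means."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    means = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_means = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(k)])
        if np.allclose(new_means, means):   # convergence criterion
            break
        means = new_means
    return labels, means

pts = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
print(k_means(pts, k=2))
```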
Nearest Neighbor Algorithm
An
algorithm similar to the single link technique is called the nearest neighbor
algorithm. With this serial algorithm, items are iteratively merged into the
existing clusters that are closest. In this algorithm a threshold, t, is used
to determine if items will be added to existing clusters or if a new cluster is
created.
Nearest neighbor algorithm:
K1 = {t1};
K = {K1};
k = 1;
for i = 2 to n do
    find the tm in some cluster Km in K such that dis(ti, tm) is the smallest;
    if dis(ti, tm) ≤ t then
        Km = Km ∪ {ti}
    else
        k = k + 1;
        Kk = {ti};
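The following sketch mirrors the pseudocode above; the threshold t and the sample points are arbitrary illustrative values:

```python
import numpy as np

def nearest_neighbor_clustering(points, t):
    """Each new item joins the cluster of its nearest already-clustered item
    if that distance is <= t, otherwise it starts a new cluster."""
    points = np.asarray(points, dtype=float)
    labels = [0]                                   # t1 starts cluster K1
    for i in range(1, len(points)):
        dists = np.linalg.norm(points[:i] - points[i], axis=1)
        nearest = int(dists.argmin())
        labels.append(labels[nearest] if dists[nearest] <= t else max(labels) + 1)
    return labels

pts = [[0, 0], [0.5, 0], [5, 5], [5.2, 5], [10, 10]]
print(nearest_neighbor_clustering(pts, t=1.0))  # [0, 0, 1, 1, 2]
```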
Genetic Algorithms
There have been clustering techniques based on the use of genetic algorithms. To determine how to perform clustering with genetic algorithms, we first must determine how to represent each cluster. One simple approach would be to use a bit-map representation for each possible cluster. So, given a database with four items, {A, B, C, D}, we would represent one solution to creating two clusters as 1001 and 0110. This represents the two clusters {A, D} and {B, C}.
GA clustering algorithm:
randomly create an initial solution;
repeat
    use crossover to create a new solution;
until termination criteria is met;
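As a rough illustration of the bit-map representation and the crossover step only (fitness evaluation, selection, mutation, and the termination test are omitted, and all names here are invented for the sketch):

```python
import random

# Bit-map representation from the text: for items {A, B, C, D}, the string "1001"
# puts A and D in one cluster and B and C in the other.
def decode(bits, items):
    cluster_one = [x for x, b in zip(items, bits) if b == "1"]
    cluster_zero = [x for x, b in zip(items, bits) if b == "0"]
    return cluster_one, cluster_zero

def crossover(parent_a, parent_b):
    """Single-point crossover on two bit strings, producing one child solution."""
    point = random.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:]

items = ["A", "B", "C", "D"]
print(decode("1001", items))       # (['A', 'D'], ['B', 'C'])
print(crossover("1001", "0110"))   # e.g. '1010' or '1110', depending on the cut point
```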