Tuesday, April 22, 2025

DATA MINING (Association Rules)

 

What are Association Rules in Data Mining?

An association rule is an if-then statement that shows the probability of a relationship between data items.

These types of relationships occur in large data sets in various databases.

Association rule mining has many applications in data mining and is widely used to discover correlations, for example in transactional sales data or medical data sets.

What are Use Cases for Association Rules?

Association rules have many practical use cases across various industries and domains.

Here are some common use cases for association rules:

Market Basket Analysis:
This is one of the most famous applications. Retailers can use association rules to discover item associations in customer shopping baskets. For example, if customers who buy chips are also likely to buy salsa, stores can optimize product placements and marketing strategies.

Healthcare:

Disease Diagnosis: Association rules can identify patterns in patient health records, such as combinations of symptoms, test results, or patient characteristics indicative of certain diseases.

Treatment Recommendations: Association rules can suggest suitable treatments or interventions based on the patient's medical history and condition, improving personalized healthcare.

Financial Services:

Fraud Detection: With the help of association rules, banks and credit card companies can detect fraudulent transactions by identifying unusual spending patterns or sequences of transactions associated with fraud.

Cross-Selling: Financial institutions can recommend additional products or services to customers based on their transaction history and financial behaviour.

Market Research:

Consumer Behavior Analysis: Marketers can identify consumer preferences by analyzing purchase histories and demographic data, leading to better-targeted advertising and product development.

Product Placement Optimization: Understanding which products are often purchased together helps optimize product placements in physical stores and online marketplaces.

Web Usage Analysis:

    • Website Optimization: With the help of association rules, website owners can analyze user behaviour on their websites. For instance, understanding which pages are visited together can help improve website navigation and content recommendations.

Manufacturing:

    • Quality Control: Manufacturers can identify factors or conditions associated with product defects, which helps improve quality control processes.
    • Production Optimization: Discovering associations among different production variables can lead to more efficient manufacturing processes.

Telecommunications:

    • Network Management: In the telecom industry, we can detect patterns in network traffic that may indicate issues or anomalies with the help of association rules.
    • Customer Churn Prediction: Telecom companies can identify factors associated with customer churn and take preventive measures to retain customers.

Inventory Management:

    • Supply Chain Optimization: Understanding the relationships between various items in a supply chain can help optimize inventory levels, reduce carrying costs, and improve order fulfilment.

Social Network Analysis:

    • Friendship Recommendations: Social media platforms can use association rules to suggest new friends or connections based on common interests, connections, or behaviours.

Text Mining:

    • Content Recommendation: In content recommendation systems (e.g., Netflix or Amazon), association rules can recommend movies, books, or products to users based on their past interactions and preferences.

How do Association Rules Work?

Association rules are fundamental in data mining and machine learning, aiming to discover interesting relationships and patterns within large datasets. These rules identify associations or dependencies between items or attributes in the data.

The primary algorithm used for association rule mining is the Apriori algorithm, which follows a systematic process to generate these rules:

1. Frequent Itemset Generation:

The algorithm starts by identifying frequent itemsets in the dataset. A frequent itemset is a collection of items (or attributes) that occurs frequently in the data.

The frequency of an itemset is measured by a metric called support, which is the proportion of transactions or records in which the itemset appears.

The Apriori algorithm uses a bottom-up approach: it first looks for frequent individual items and then gradually combines them to find larger frequent itemsets.
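As a concrete illustration, here is a minimal Python sketch of this bottom-up search, assuming a handful of made-up transactions and a minimum support of 0.5 (both chosen only for the example):

transactions = [{"chips", "salsa", "soda"},
                {"chips", "salsa"},
                {"chips", "soda"},
                {"bread", "butter"}]          # hypothetical toy data
min_support = 0.5                             # fraction of transactions required

def support(itemset, transactions):
    # Proportion of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level 1: frequent individual items.
items = {i for t in transactions for i in t}
levels = [{frozenset([i]) for i in items
           if support(frozenset([i]), transactions) >= min_support}]

# Level k: combine frequent (k-1)-itemsets into size-k candidates and
# keep only those that still meet the minimum support.
k = 2
while levels[-1]:
    candidates = {a | b for a in levels[-1] for b in levels[-1] if len(a | b) == k}
    levels.append({c for c in candidates
                   if support(c, transactions) >= min_support})
    k += 1

print(set().union(*levels))   # all frequent itemsets found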

2. Association Rule Generation:

After the frequent itemsets are identified, association rules are generated from them.

Each association rule is written as an "if-then" statement, where the "if" part is called the antecedent (premise) and the "then" part is called the consequent (conclusion).

The Apriori algorithm combines items within frequent itemsets to generate potential association rules.

3. Rule Pruning:

Some criteria are applied to ensure that only meaningful rules are generated. The most useful criteria are as follows.

  1. Support Threshold: A rule must have a minimum support to be considered valid. This ensures that the rule applies to a sufficient number of transactions.
  2. Confidence Threshold: A rule must have a minimum confidence level to be considered interesting. Confidence is the probability that the antecedent implies the consequent, and it measures the strength of the association.
  3. Lift Threshold: Lift is a measure that compares the observed support of the rule to what would be expected if the items in the rule were independent. A lift value greater than 1 indicates a positive association, while a lift value less than 1 indicates a negative association.
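To make these three measures concrete, the short sketch below computes support, confidence, and lift for one hypothetical rule {chips} -> {salsa} over invented transactions:

transactions = [{"chips", "salsa"}, {"chips", "salsa", "soda"},
                {"chips"}, {"salsa"}, {"bread"}]        # made-up data
antecedent, consequent = {"chips"}, {"salsa"}

n = len(transactions)
support_a  = sum(antecedent <= t for t in transactions) / n
support_c  = sum(consequent <= t for t in transactions) / n
support_ac = sum((antecedent | consequent) <= t for t in transactions) / n

confidence = support_ac / support_a      # P(consequent | antecedent)
lift = confidence / support_c            # > 1 positive, < 1 negative association

print(f"support={support_ac:.2f} confidence={confidence:.2f} lift={lift:.2f}")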

4. Iterative Process:

The Apriori algorithm iterates by generating itemsets, creating rules, and pruning rules until no more valid rules can be generated.

During this iteration, the algorithm relies on the "downward closure property," which states that if an itemset is frequent, all of its subsets are also frequent. This property helps reduce the computational complexity of the algorithm.

5. Output:

The final output of association rule mining is a set of association rules that meet the specified support and confidence thresholds.

These rules can be ranked based on their interestingness or strength, allowing analysts to focus on the most relevant and actionable rules.

Association Rule Algorithms

The most common algorithms used for association rule mining are AIS, SETM, Apriori, and variations of the latter.

  1. AIS Algorithm:
    In the AIS algorithm, candidate itemsets are generated and counted as the database is scanned.

The AIS algorithm first determines the large itemsets in the transaction data.

After that, new candidate itemsets are created by extending the large itemsets with other items in the transaction data.

  2. SETM Algorithm:
    The SETM algorithm also generates candidate itemsets as it scans the database, but it counts them only at the end of the pass. The generation of new candidate itemsets is similar to that in the AIS algorithm, but the transaction ID of each generating transaction is saved with the candidate itemset in a sequential structure, and the support counts are produced by aggregating this structure after the pass is complete. The disadvantage of both the SETM and AIS algorithms is that each can generate and count many small candidate itemsets, as noted by Dr. Saed Sayad, author of Real-Time Data Mining.
  3. Apriori Algorithm:
    In the Apriori algorithm, each large itemset from the previous pass is joined with itself to generate all itemsets whose size is larger by one. Each generated itemset that has a subset which is not large is then deleted; the remaining itemsets are the candidates. The Apriori algorithm exploits the fact that any subset of a frequent itemset must itself be frequent. With this approach, the algorithm reduces the number of candidates being considered by only exploring the itemsets whose support count is greater than the minimum support count, according to Sayad.
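A brief sketch of the join-and-prune step the Apriori algorithm performs, assuming the frequent (k-1)-itemsets have already been found (the example 2-itemsets are invented):

from itertools import combinations

def apriori_gen(frequent_km1, k):
    # Join: union pairs of frequent (k-1)-itemsets to form size-k candidates.
    candidates = {a | b for a in frequent_km1 for b in frequent_km1 if len(a | b) == k}
    # Prune: drop any candidate with a (k-1)-subset that is not frequent.
    return {c for c in candidates
            if all(frozenset(s) in frequent_km1 for s in combinations(c, k - 1))}

f2 = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]}
print(apriori_gen(f2, 3))   # only {A, B, C} survives the prune step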

Parallel and Distributed Algorithms

In data mining, parallel and distributed algorithms for association rules are techniques that use multiple processors or computers to discover relationships between items in large datasets. By dividing the workload and processing data simultaneously, they significantly speed up association rule mining, especially for massive datasets that would be too large for a single machine to handle effectively.

Key points about parallel and distributed association rule algorithms:

·         Problem with large datasets:

Traditional association rule mining algorithms, like Apriori, can become computationally expensive when dealing with large transaction datasets, requiring efficient parallelization strategies to manage the processing time. 

·         Data partitioning:

The core concept is to divide the data into smaller subsets and distribute them across multiple processors or nodes in a distributed system, allowing each processor to independently calculate frequent itemsets within its data partition. 

·         Types of parallelism:

·         Data parallelism: Distributing the data across multiple processors and performing the same operation on each data subset in parallel. 

·         Task parallelism: Breaking down the association rule mining process into smaller tasks (like candidate itemset generation) and assigning them to different processors. 

Common approaches to parallel and distributed association rule mining:

·         Distributed Apriori:

A widely used method where the Apriori algorithm is adapted to a distributed environment, with each node calculating frequent itemsets locally and then communicating with other nodes to aggregate results and generate candidate itemsets. 

·         MapReduce framework:

Utilizing the MapReduce paradigm to parallelize the counting of itemsets, where the "map" phase counts local frequencies and the "reduce" phase combines results to identify global frequent itemsets (a small sketch of this idea follows this list).

·         Pregel-based algorithms:

Leveraging the Pregel model for iterative computation where each node in the distributed system communicates with its neighbors to update local information, facilitating efficient candidate generation and frequent itemset discovery. 
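A minimal single-machine sketch of the MapReduce-style counting mentioned above: each "mapper" counts item frequencies in its own data partition, and a "reducer" sums the partial counts. The two partitions, the data, and the minimum count are all invented for the example; a real deployment would use a framework such as Hadoop or Spark.

from collections import Counter
from functools import reduce

partitions = [                                        # hypothetical data split across two nodes
    [{"chips", "salsa"}, {"chips", "soda"}],
    [{"chips", "salsa"}, {"bread", "butter"}],
]

def map_phase(partition):
    # Count local item frequencies within one partition.
    counts = Counter()
    for transaction in partition:
        counts.update(transaction)
    return counts

def reduce_phase(a, b):
    # Combine partial counts from two mappers.
    return a + b

local_counts = [map_phase(p) for p in partitions]     # would run in parallel in practice
global_counts = reduce(reduce_phase, local_counts)    # aggregated on one node
min_count = 2
print({item for item, c in global_counts.items() if c >= min_count})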

Benefits of parallel and distributed association rule mining:

·         Scalability:

Enables processing of large datasets that would be impractical to analyze on a single machine.

·         Faster execution time:

By utilizing multiple processors, the overall computation time for association rule mining can be significantly reduced. 

Challenges in parallel and distributed association rule mining:

·         Communication overhead:

Coordinating data exchange between distributed nodes can introduce latency and impact performance.

·         Load balancing:

Ensuring that all processors are working on roughly the same amount of data to maximize efficiency. 

DATA MINING (CLUSTERING)

Overview

Clustering in data mining is a technique used to group similar data points together based on their features and characteristics. It is an unsupervised learning method that helps to identify patterns in large datasets and segment them into smaller groups or subsets.

What is Clustering in Data Mining?

Clustering in data mining is a technique that groups similar data points together based on their features and characteristics. It can also be referred to as a process of grouping a set of objects so that objects in the same group (called a cluster) are more similar to each other than those in other groups (clusters). It is an unsupervised learning technique that aims to identify similarities and patterns in a dataset.

Clustering techniques in data mining can be used in various applications, such as image segmentation, document clustering, and customer segmentation. The goal is to obtain meaningful insights from the data and improve decision-making processes.


A cluster can have the following properties -

  • The data points within a cluster are similar to each other based on some pre-defined criteria or similarity measures.
  • The clusters are distinct from each other, and the data points in one cluster are different from those in another cluster.
  • The data points within a cluster are closely packed together.
  • A cluster is often represented by a centroid or a center point that summarizes the properties of the data points within the cluster.
  • A cluster can have any number of data points, but a good cluster should not be too small or too large.

SIMILARITIES AND DISTANCE MEASURES

Clustering is a fundamental concept in data analysis and machine learning, where the goal is to group similar data points into clusters based on their characteristics. One of the most critical aspects of clustering is the choice of distance measure, which determines how similar or dissimilar two data points are.

Distance measures are mathematical functions that determine how similar or different two data points are. The choice of distance measure can significantly impact the clustering results, as it influences the shape and structure of the clusters.

The choice of distance measure significantly impacts the quality of the clusters formed and the insights derived from them. A well-chosen distance measure can lead to meaningful clusters that reveal hidden patterns in the data.

Common Distance Measures

There are several types of distance measures, each with its strengths and weaknesses. Here are some of the most commonly used distance measures in clustering:

1. Euclidean Distance

The Euclidean distance is the most widely used distance measure in clustering. It calculates the straight-line distance between two points in n-dimensional space. The formula for Euclidean distance is:

$d(p,q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$
where,

·         p and q are two data points

·         and n is the number of dimensions.

2. Manhattan Distance

The Manhattan distance, sometimes referred to as the L1 distance or city block distance, is the total of the absolute differences between the Cartesian coordinates of two points.

The Manhattan distance computes the total distance travelled along each dimension to reach the other data point. It is less susceptible to outliers than the Euclidean distance, which makes it effective for categorical data. The formula is:

$d(p,q) = \sum_{i=1}^{n} |p_i - q_i|$

3. Cosine Similarity

Rather than concentrating on the exact distance between data points, the cosine similarity measure looks at their orientation. It calculates the cosine of the angle between two data points, with a higher cosine value indicating greater similarity. This measure is often used for text data analysis.

$\text{similarity}(A,B) = \frac{A \cdot B}{\|A\| \, \|B\|}$
4. Minkowski Distance

Minkowski distance is a generalized form of both Euclidean and Manhattan distances, controlled by a parameter p. The Minkowski distance allows adjusting the power parameter (p). When p=1, it’s equivalent to Manhattan distance; when p=2, it’s Euclidean distance.

$d(x,y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
5. Jaccard Index

This measure is ideal for binary data, where features can only take values of 0 or 1. It calculates the ratio of the number of features shared by two data points to the total number of features. Jaccard Index measures the similarity between two sets by comparing the size of their intersection and union.

$J(A,B) = \frac{|A \cap B|}{|A \cup B|}$
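The sketch below evaluates each of these measures on two small example vectors (and two example sets for the Jaccard index); the values are arbitrary and NumPy is assumed to be available:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))            # straight-line distance
manhattan = np.sum(np.abs(x - y))                    # city block / L1 distance
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
p = 3                                                # Minkowski power parameter
minkowski = np.sum(np.abs(x - y) ** p) ** (1 / p)

A, B = {"a", "b", "c"}, {"b", "c", "d"}
jaccard = len(A & B) / len(A | B)                    # |A ∩ B| / |A ∪ B|

print(euclidean, manhattan, cosine, minkowski, jaccard)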

OUTLIERS
Outliers are sample points with values much different from those of the remaining set of data. Outliers may represent errors in the data or could be correct data values that are simply much different from the remaining data. A person who is 2.5 meters tall is much taller than most people; in analyzing the height of individuals, this value would probably be viewed as an outlier.

This problem is illustrated in Figure 5.3. Here if three clusters are found (solid line), the outlier will occur in a cluster by itself. However, if two clusters are found (dashed line), the two (obviously) different sets of data will be placed in one cluster because they are closer together than the outlier. This problem is complicated by the fact that many clustering algorithms actually have as input the number of desired clusters to be found. Clustering algorithms may actually find and remove outliers to ensure that they perform better. However, care must be taken in actually removing outliers.

For example, suppose that the data mining problem is to predict flooding. Extremely high water level values occur very infrequently and, when compared with the normal water level values, may seem to be outliers. However, removing these values may not allow the data mining algorithms to work effectively because there would be no data showing that floods ever actually occurred. Outlier detection, or outlier mining, is the process of identifying outliers in a set of data. Clustering, or other data mining, algorithms may then choose to remove or treat these values differently. Some outlier detection techniques are based on statistical methods. These usually assume that the set of data follows a known distribution and that outliers can be detected by well-known tests such as discordancy tests. However, these tests are not very realistic for real-world data because real-world data values may not follow well-defined distributions. Also, most of these tests assume a single attribute value, while many attributes are involved in real-world datasets. Alternative detection techniques may be based on distance measures.

[Figure 5.3: clustering a data set with an outlier, showing the three-cluster (solid line) and two-cluster (dashed line) solutions]

Hierarchical Clustering in Data Mining

The hierarchical clustering method works by grouping data into a tree of clusters. Hierarchical clustering begins by treating every data point as a separate cluster. Then, it repeatedly executes the following steps:

1.      Identify the two clusters that are closest together, and

2.      Merge the two most comparable clusters. These steps continue until all the clusters are merged together.

In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. A diagram called a dendrogram (a tree-like diagram that records the sequences of merges or splits) graphically represents this hierarchy; it is an inverted tree that describes the order in which points are merged (bottom-up view) or clusters are broken up (top-down view).

 

Hierarchical clustering is a method of cluster analysis in data mining that creates a hierarchical representation of the clusters in a dataset. The method starts by treating each data point as a separate cluster and then iteratively combines the closest clusters until a stopping criterion is reached. 

Hierarchical clustering has several advantages:

·         The ability to handle non-convex clusters and clusters of different sizes and densities.

·         The ability to handle missing data and noisy data.

·         The ability to reveal the hierarchical structure of the data, which can be useful for understanding the relationships among the clusters.

Drawbacks of Hierarchical Clustering

·         The need for a criterion to stop the clustering process and determine the final number of clusters.

·         The computational cost and memory requirements of the method can be high, especially for large datasets.

·         The results can be sensitive to the initial conditions, linkage criterion, and distance metric used.

In summary, hierarchical clustering is a method of data mining that groups similar data points into clusters by creating a hierarchical structure of the clusters.

This method can handle different types of data and reveal the relationships among the clusters. However, it can have a high computational cost, and the results can be sensitive to the conditions mentioned above.

Types of Hierarchical Clustering

Basically, there are two types of hierarchical Clustering:

1.      Agglomerative Clustering

2.      Divisive clustering

1. Agglomerative Clustering

Initially, consider every data point as an individual cluster and, at every step, merge the nearest pair of clusters (it is a bottom-up method). At first, every data point is considered an individual entity or cluster. At every iteration, clusters merge with other clusters until one cluster is formed.

The algorithm for Agglomerative Hierarchical Clustering is:

·         Consider every data point as an individual cluster

·         Calculate the similarity of each cluster with all the other clusters (compute the proximity matrix)

·         Merge the clusters that are highly similar or close to each other

·         Recalculate the proximity matrix for the new clusters

·         Repeat steps 3 and 4 until only a single cluster remains.

Let’s see the graphical representation of this algorithm using a dendrogram. 

Note: This is just a demonstration of how the actual algorithm works; no calculation has been performed below, and the proximities among the clusters are assumed.

[Figure: dendrogram showing the sequence of merges for clusters A–F]

·         Step-1: Consider each alphabet as a single cluster and calculate the distance of one cluster from all the other clusters.

·         Step-2: In the second step, comparable clusters are merged together to form a single cluster. Let's say cluster (B) and cluster (C) are very similar to each other, so we merge them; similarly, clusters (D) and (E) are merged, and we get the clusters [(A), (BC), (DE), (F)].

·         Step-3: We recalculate the proximity according to the algorithm and merge the two nearest clusters ([(DE), (F)]) together to form the new clusters [(A), (BC), (DEF)].

·         Step-4: Repeating the same process, the clusters DEF and BC are comparable and merged together to form a new cluster. We're now left with clusters [(A), (BCDEF)].

·         Step-5: At last, the two remaining clusters are merged together to form a single cluster [(ABCDEF)].
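A short sketch of this agglomerative procedure using SciPy's hierarchical clustering routines; the six two-dimensional points standing in for A–F are invented for the illustration:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

points = np.array([[0.0, 0.0],    # A
                   [5.0, 5.0],    # B
                   [5.2, 5.1],    # C
                   [9.0, 9.0],    # D
                   [9.1, 9.2],    # E
                   [8.0, 8.5]])   # F

# Single linkage merges the two closest clusters at every step.
Z = linkage(points, method="single")

# Cutting the dendrogram gives a chosen number of flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)                     # cluster label for each of A..F

# dendrogram(Z) would draw the merge tree (requires matplotlib).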

2. Divisive Hierarchical Clustering

Divisive hierarchical clustering is precisely the opposite of agglomerative hierarchical clustering. In divisive hierarchical clustering, we start with all of the data points in a single cluster and, in every iteration, separate out the data points that are not similar to the rest of their cluster. In the end, we are left with N clusters.

PARTITIONAL ALGORITHMS:

Nonhierarchical or partitional clustering creates the clusters in one step as opposed to several steps. Only one set of clusters is created, although several different sets of clusters may be created internally within the various algorithms. Since only one set of clusters is output, the user must input the desired number, k, of clusters.

Minimum Spanning Tree: Since there are agglomerative and divisive algorithms based on the use of an MST, we also present a partitional MST algorithm. This is a very simplistic approach, but it illustrates how partitional algorithms work. The algorithm is shown below (Algorithm 5.4 in the source text). Since the clustering problem is to define a mapping, the output of this algorithm shows the clusters as a set of ordered pairs $(t_i, j)$ where $f(t_i) = K_j$.

Partitional MST algorithm:

M = MST(A);

identify inconsistent edges in M;

remove k - 1 inconsistent edges;

create output representation;
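A rough Python sketch of this idea, using SciPy to build the MST and then cutting the k - 1 longest edges as a simple stand-in for the "inconsistent" edges (that substitution, the points, and k are assumptions made for the example):

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import cdist

A = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]])   # hypothetical points
k = 3                                                      # desired number of clusters

mst = minimum_spanning_tree(cdist(A, A)).toarray()         # MST of the complete distance graph

# Remove the k-1 longest MST edges (used here as the "inconsistent" edges).
edges = np.argwhere(mst > 0)
longest = edges[np.argsort(mst[mst > 0])[-(k - 1):]]
for i, j in longest:
    mst[i, j] = 0

# Each connected component of the pruned MST becomes one cluster.
n_clusters, labels = connected_components(mst, directed=False)
print(list(enumerate(labels)))                             # ordered pairs (t_i, cluster j)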

Squared Error Clustering Algorithm

The squared error clustering algorithm minimizes the squared error. The squared error for a cluster is the sum of the squared Euclidean distances between each element in the cluster and the cluster centroid, $C_k$. Given a cluster $K_i$, let the set of items mapped to that cluster be $\{t_{i1}, t_{i2}, \ldots, t_{im}\}$. The squared error is defined as

$se_{K_i} = \sum_{j=1}^{m} \| t_{ij} - C_k \|^2$

ALGORITHM

Squared error algorithm:

assign each item ti to a cluster;

calculate the center for each cluster;

repeat

    assign each item ti to the cluster which has the closest center;

    calculate the new center for each cluster;

    calculate the squared error;

until the difference between successive squared errors is below a threshold;
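For reference, a small sketch of the squared-error measure itself, given points and a cluster assignment that are both made up for the example:

import numpy as np

points = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]])
labels = np.array([0, 0, 1, 1])      # hypothetical assignment of each point to a cluster

def squared_error(points, labels):
    # Sum of squared Euclidean distances from each point to its cluster centroid.
    total = 0.0
    for c in np.unique(labels):
        members = points[labels == c]
        centroid = members.mean(axis=0)
        total += np.sum((members - centroid) ** 2)
    return total

print(squared_error(points, labels))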

K-Means Clustering

K-means is an iterative clustering algorithm in which items are moved among sets of clusters until the desired set is reached. As such, it may be viewed as a type of squared error algorithm, although the convergence criteria need not be defined based on the squared error. A high degree of similarity among elements in a cluster is obtained, while a high degree of dissimilarity among elements in different clusters is achieved simultaneously. The cluster mean of $K_i = \{t_{i1}, t_{i2}, \ldots, t_{im}\}$ is defined as

$m_i = \frac{1}{m} \sum_{j=1}^{m} t_{ij}$

ALGORITHM

assign initial values for means m1, m2, ..., mk;

repeat

    assign each item ti to the cluster which has the closest mean;

    calculate the new mean for each cluster;

until the convergence criteria are met;
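A compact NumPy sketch of this loop; the initial means are chosen as random data points, and the data values are invented for the example:

import numpy as np

def k_means(data, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    means = data[rng.choice(len(data), size=k, replace=False)]   # initial means
    for _ in range(n_iter):
        # Assign each item to the cluster with the closest mean.
        dists = np.linalg.norm(data[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each cluster mean; stop once the means no longer move.
        new_means = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, labels

data = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.2, 4.9], [9.0, 1.0], [8.8, 1.2]])
means, labels = k_means(data, k=3)
print(labels)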

 

Nearest Neighbor Algorithm

 

An algorithm similar to the single link technique is called the nearest neighbor algorithm. With this serial algorithm, items are iteratively merged into the existing clusters that are closest. In this algorithm a threshold, t, is used to determine if items will be added to existing clusters or if a new cluster is created.

 

 

Nearest neighbor algorithm:

K1 = {t1};

K = {K1};

k = 1;

for i = 2 to n do

    find the tm in some cluster Km in K such that dis(ti, tm) is the smallest;

    if dis(ti, tm) <= t then

        Km = Km ∪ {ti}

    else

        k = k + 1;

        Kk = {ti};
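A cleaned-up Python version of this serial threshold procedure; the points and the threshold t are chosen arbitrarily for the illustration:

import numpy as np

def nearest_neighbor_clustering(items, t):
    clusters = [[items[0]]]                      # K1 = {t1}
    for ti in items[1:]:
        # Find the already-clustered item tm closest to ti.
        best_cluster, best_dist = None, np.inf
        for cluster in clusters:
            for tm in cluster:
                d = np.linalg.norm(np.asarray(ti) - np.asarray(tm))
                if d < best_dist:
                    best_cluster, best_dist = cluster, d
        if best_dist <= t:                       # close enough: join the existing cluster
            best_cluster.append(ti)
        else:                                    # otherwise start a new cluster
            clusters.append([ti])
    return clusters

print(nearest_neighbor_clustering([(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)], t=2.0))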

Genetic Algorithms

There have been clustering techniques based on the use of genetic algorithms. To determine how to perform clustering with genetic algorithms, we first must determine how to represent each cluster. One simple approach would be to use a bit-map representation for each possible cluster. So, given a database with four items, {A, B, C, D}, we would represent one solution to creating two clusters as 1001 and 0110. This represents the two clusters {A, D} and {B, C}.

GA clustering algorithm:

randomly create an initial solution;

repeat

    use crossover to create a new solution;

until termination criteria are met;
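A toy sketch of the bit-map representation and the crossover step described above for the four items {A, B, C, D}; the crossover point is an arbitrary choice, and fitness evaluation and the termination test are omitted:

items = ["A", "B", "C", "D"]

def decode(bits):
    # Turn a bit string into the cluster of items it selects.
    return {items[i] for i, b in enumerate(bits) if b == "1"}

def crossover(parent_a, parent_b, point=2):
    # One-point crossover: swap the tails of the two bit strings.
    return parent_a[:point] + parent_b[point:], parent_b[:point] + parent_a[point:]

# The two-cluster solution from the text: 1001 -> {A, D}, 0110 -> {B, C}.
sol_a, sol_b = "1001", "0110"
print(decode(sol_a), decode(sol_b))

child_a, child_b = crossover(sol_a, sol_b)       # one crossover creates a new candidate solution
print(decode(child_a), decode(child_b))          # {A, C} and {B, D}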