DBSCAN Algorithm
DBSCAN: Density-based spatial clustering of applications with noise
Clustering is an unsupervised learning technique that divides data points into a number of groups (clusters) such that points in the same group have similar features, while points in different groups are dissimilar to each other.
In simple words, we can say that a clustering algorithm works on feature similarity.
Example: Let's say we have ten different types of seeds. We can differentiate the seeds based on features like shape, color, size, taste, etc. Based on these parameters, we put the seeds into different groups.
Topics covered in this article:
- Types of Clustering Algorithms
- Introduction to DBSCAN
- Why DBSCAN?
- Algorithm steps for DBSCAN
- Advantages and Disadvantages of DBSCAN
- Comparison between DBSCAN and K-Means
Types of Clustering Algorithms
I. Connectivity-based Clustering -> Hierarchical Clustering
II. Distribution-based Clustering -> Gaussian Mixture Models (clusters modeled as normal/Gaussian distributions)
III. Density-based Clustering -> DBSCAN and OPTICS
IV. Centroid-based Clustering -> K-Means Clustering, K-Medoids Clustering
Why DBSCAN?
K-means clustering and hierarchical clustering give better output when we have spherical-shaped or convex-shaped clusters. Moreover, they are affected by the presence of noise and outliers in the data, i.e., they pull outliers into the clusters.
We know that real-life data may contain irregularities: clusters can be of arbitrary shape, and the data may contain noise, as shown in the figure below.
The above figure shows a dataset containing non-convex clusters and outliers. For such data, the k-means algorithm has difficulty identifying clusters of arbitrary shape and separating out outliers. That is where DBSCAN comes into the picture.
Convex shape: if we take any two data points belonging to the same cluster, then all the points on the line segment connecting them must also lie in the cluster.
From the below figure, you can easily differentiate between convex and non-convex shapes.
Introduction to DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is the most well-known density-based clustering algorithm. It can discover clusters of different shapes and sizes from a large amount of data that contains noise and outliers.
The algorithm has two parameters:
- Ɛ or eps: The radius of the neighborhood around a data point, i.e., if the distance between two points is less than or equal to 'eps', they are considered neighbors.
Choosing the value of 'eps' is a bit difficult. If we choose 'eps' too small, a large part of the data will be considered outliers; if we choose it very large, clusters will merge and the majority of the data points will end up in the same cluster.
- MinPts: The minimum number of data points (neighbors) required within the eps radius. For larger datasets, a larger MinPts value should be chosen.
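To see how these two parameters behave in practice, here is a minimal sketch assuming scikit-learn is available; its `eps` and `min_samples` arguments correspond to the 'eps' and 'MinPts' parameters described above (the toy data is made up for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one far-away point that should become noise.
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
              [8.0, 8.0], [8.1, 8.0], [7.9, 8.1],
              [50.0, 50.0]])

# eps = neighborhood radius, min_samples = MinPts (counting the point itself).
db = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(db.labels_)  # noise points get the label -1
```

Increasing `eps` to 10 would merge the two blobs into one cluster; shrinking it below 0.15 would turn every point into noise, which illustrates why choosing 'eps' matters.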
In this algorithm, we have 3 types of data points:
Core point: A data point is a core point if it has at least MinPts points (including itself) within eps.
Border point: A data point that has fewer than MinPts points within eps but lies in the neighborhood of a core point.
Outlier (noise point): A point that is neither a core point nor a border point.
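The three definitions above can be checked directly with a small NumPy sketch (the function name `classify_points` and the toy data are this sketch's own, not from any library):

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point 'core', 'border', or 'outlier' per the definitions above."""
    n = len(X)
    # Pairwise Euclidean distances.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]
    # Core: at least min_pts points (including itself) within eps.
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbors[i]):
            labels.append("border")  # not core, but next to a core point
        else:
            labels.append("outlier")
    return labels

# Three tight points (core), one point hanging off the edge (border),
# and one isolated point (outlier).
X = np.array([[0.0, 0.0], [0.3, 0.0], [0.0, 0.3], [0.7, 0.0], [5.0, 5.0]])
print(classify_points(X, eps=0.5, min_pts=3))
```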
Algorithm steps for DBSCAN
- First, choose appropriate values of 'eps' and 'MinPts', then pick a random point that has not been visited.
- Calculate the neighborhood of this point using eps (all points within distance eps form its neighborhood).
- If there are enough points (at least MinPts) in the neighborhood of the chosen point, the clustering process starts and the point is marked as visited; otherwise, the point is labeled as noise (later on, this point can still become part of a cluster as a border point).
- If a point is found to be part of the cluster, its eps-neighborhood is also part of the cluster; repeat the procedure from step 2 for all points in that neighborhood. The process is repeated until all points in the cluster are determined.
- Then a new unvisited point is chosen and processed, leading to the discovery of a further cluster or of noise.
- This process continues until all points are marked as visited.
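The steps above can be sketched from scratch in plain Python/NumPy; the function name `dbscan` and the variable names here are this sketch's own, and real implementations (e.g., scikit-learn) are more efficient:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = [None] * n  # None = not visited, -1 = noise, otherwise a cluster id
    # Precompute neighborhoods (step 2): all points within eps of each point.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [list(np.where(dist[i] <= eps)[0]) for i in range(n)]
    cluster = 0
    for i in range(n):                   # steps 1, 5, 6: visit every point once
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:  # step 3: too few neighbors -> noise, for now
            labels[i] = -1
            continue
        labels[i] = cluster              # i is a core point: start a new cluster
        queue = list(neighbors[i])
        while queue:                     # step 4: expand through the neighborhood
            j = queue.pop()
            if labels[j] == -1:          # noise point reclaimed as a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:   # expand only through core points
                queue.extend(neighbors[j])
        cluster += 1
    return labels

X = np.array([[1.0, 1.0], [1.2, 1.0], [1.0, 1.2],
              [4.0, 4.0], [4.2, 4.0], [4.0, 4.2],
              [9.0, 9.0]])
print(dbscan(X, eps=0.5, min_pts=3))
```

Note how the isolated last point is first labeled noise and stays noise because no core point ever reaches it, matching step 3's "later on this point can be part of the cluster" caveat for points that do get reached.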
Advantages and Disadvantages of DBSCAN
Advantages:
- Does not require prior knowledge of the number of clusters.
- Performs well with arbitrarily shaped clusters.
- Robust to outliers and able to detect them.
Disadvantages:
- In some cases, determining an appropriate value of 'eps' is not easy.
- The algorithm fails on "neck"-type datasets, where two clusters are connected by a thin bridge of points.
- It does not work well in the case of high-dimensional data.
- The algorithm fails in the case of clusters with varying density.
Comparison between DBSCAN and K-Means
| DBSCAN | K-Means |
| --- | --- |
| Does not require prior knowledge of the number of clusters. | The number of clusters needs to be specified beforehand. |
| Able to handle outliers and noisy datasets. | Does not work well with outliers and noisy datasets. |
| Does not work very well for clusters with varying density. | Handles data points with varying density comparatively better. |

