Parameters are an important aspect of any data mining task because they directly shape an algorithm’s behavior. DBSCAN (Density-Based Spatial Clustering of Applications with Noise), a popular unsupervised machine learning technique for detecting clusters of varying shapes in a dataset, requires the user to specify two crucial parameters: epsilon (ε) and MinPts. Choosing these values well is essential for the algorithm to work effectively.
This post focuses on how to estimate these two parameters.
Ideally, the value of ε is given by the problem to solve (e.g. a physical distance), and MinPts is then the desired minimum cluster size. The corresponding arguments for these parameters in the Scikit-Learn DBSCAN interface are eps and min_samples, respectively. Here is a general overview of selecting these parameters:
In a clustering with MinPts = k, we expect the k-distances of core points and border points to fall within a certain range, while noise points can have much larger k-distances, so we usually observe a knee point in the k-distance plot. However, sometimes there is no obvious knee, or there are multiple knees, which makes the choice difficult.
The method proposed here consists of computing the k-nearest-neighbor distances in a matrix of points. The idea is to calculate the average distance of every point to its k nearest neighbors, where the value of k is specified by the user and corresponds to MinPts. Next, these k-distances are plotted in ascending order. The aim is to determine the “knee”, which corresponds to the optimal epsilon parameter: a threshold where a sharp change occurs along the k-distance curve.
It can be seen that the optimal eps value is around a distance of 0.15.
OPTICS is a more general version of DBSCAN that replaces the ε parameter with an upper bound on the radius, which mainly affects performance. The MinPts parameter then denotes the minimum cluster size to be identified. Although the algorithm is much easier to parameterize than DBSCAN, its results are a bit more complex to use, as it typically generates a hierarchical clustering instead of the simple data partitioning produced by DBSCAN. More recently, one of the original authors of DBSCAN revisited DBSCAN and OPTICS and introduced a revised hierarchical version, HDBSCAN*, which eliminates the concept of border points: only the core points constitute a cluster.
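As a rough sketch (not the authors’ reference implementation), HDBSCAN* is available in the third-party hdbscan package (and as sklearn.cluster.HDBSCAN in recent scikit-learn releases) and needs essentially one parameter; here X stands for an assumed data matrix and the value 10 is only illustrative.
import hdbscan
# min_cluster_size plays the role of the minimum cluster size; no epsilon is required
clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
labels = clusterer.fit_predict(X)   # the label -1 marks noise points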
Because OPTICS produces hierarchical clusters, we can extract significant flat clusters from them via visual inspection. An implementation of OPTICS is available in pyclustering. Additionally, one of the original authors of DBSCAN and OPTICS has proposed an automatic method for extracting flat clusters that requires no human intervention; for more information, please refer to this paper.
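Scikit-learn also ships an OPTICS implementation; the sketch below (with X again a placeholder data matrix and the parameter values purely illustrative) shows how the reachability plot can be inspected and how a flat, DBSCAN-like clustering can be extracted for a chosen eps.
from matplotlib import pyplot as plt
from sklearn.cluster import OPTICS, cluster_optics_dbscan
optics_model = OPTICS(min_samples=20)   # MinPts; no eps needs to be specified
optics_model.fit(X)
# Reachability plot: reachability distances in the cluster ordering
plt.plot(optics_model.reachability_[optics_model.ordering_])
# Flat clusters equivalent to running DBSCAN with a given eps
labels = cluster_optics_dbscan(
    reachability=optics_model.reachability_,
    core_distances=optics_model.core_distances_,
    ordering=optics_model.ordering_,
    eps=0.15,
)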
In essence, our aim is to select a radius that assigns as many truly regular points as possible to clusters (i.e., points that are similar to other points) while still flagging outlier points as noise. To achieve this, we can plot the percentage of regular points (i.e., points belonging to a cluster) against epsilon: different epsilon values go on the x-axis and the corresponding percentage of regular points on the y-axis. By analyzing the graph, we can identify a segment where the percentage of regular points is most sensitive to changes in epsilon and select the upper bound of that segment as our optimal parameter.
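A minimal sketch of this analysis, assuming X is the data matrix, a MinPts value of 20 chosen only for illustration, and a purely illustrative range of epsilon values:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.cluster import DBSCAN
eps_values = np.linspace(0.05, 1.0, 20)            # candidate epsilon values (assumed range)
regular_fraction = []
for eps in eps_values:
    labels = DBSCAN(eps=eps, min_samples=20).fit_predict(X)
    regular_fraction.append(np.mean(labels != -1)) # share of points assigned to a cluster
plt.plot(eps_values, regular_fraction)
plt.xlabel("epsilon")
plt.ylabel("fraction of regular points")
plt.show()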
It is best to set the MinPts value using domain knowledge and familiarity with the data set. A few common rules of thumb: MinPts should be at least the dimensionality of the data set; larger or noisier data sets generally call for larger MinPts values; and MinPts = 2*dim is often suggested in the literature.
After selecting the MinPts value, we can move on to determining ε. One technique to automatically determine the optimal ε value is described in this paper. This technique calculates the average distance between each point and its k nearest neighbors, where k is the MinPts value you selected. These average k-distances are then plotted in ascending order on a k-distance graph. The optimal value for ε is found at the point of maximum curvature, i.e., where the curve bends most sharply.
Suppose that we are working with a data set with 10 dimensions. Following the rules above, the min_samples value should be at least 10. We can go with 20 (2*dim), as suggested in the literature.
After selecting the MinPts value, we can use NearestNeighbors from scikit-learn (see its documentation) to calculate the average distance between each point and its n_neighbors nearest neighbors. The one parameter we need to define is n_neighbors, which in this case is the value we chose for MinPts.
Step 1: Calculate the average distance between each point in the data set and its 20 nearest neighbors (my selected MinPts value).
import numpy as np
from sklearn.neighbors import NearestNeighbors
from matplotlib import pyplot as plt
# Fit a nearest-neighbor model and query the 20 nearest neighbors of every point
neighbors = NearestNeighbors(n_neighbors=20)
neighbors_fit = neighbors.fit(dataset)
distances, indices = neighbors_fit.kneighbors(dataset)
Step 2: Sort distance values by ascending value and plot.
# Average distance from each point to its 20 nearest neighbors
# (column 0 is each point's zero distance to itself, so it is excluded)
distances = np.mean(distances[:, 1:], axis=1)
distances = np.sort(distances)
plt.plot(distances)
For this data set, the sorted distances produce a k-distance elbow plot that looks like this:
The ideal value for ε will be equal to the distance value at the “crook of the elbow”, or the point of maximum curvature. This point represents the optimization point where diminishing returns are no longer worth the additional cost. This concept of diminishing returns applies here because while increasing the number of clusters will always improve the fit of the model, it also increases the risk that overfitting will occur.
Zooming in on my k-distance plot, it looks like the optimal value for ε is around 225. We ended up looping through combinations of MinPts and ε values slightly above and below the values estimated here to find the model of best fit.
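A rough sketch of that final search might look like the following; the silhouette coefficient on the non-noise points is only an assumed scoring criterion here, not necessarily the “best fit” measure used in the original analysis, and dataset is assumed to be a NumPy array.
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
best = None
for eps in [200, 225, 250]:                  # values around the estimated eps of ~225
    for min_samples in [15, 20, 25]:         # values around the chosen MinPts of 20
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(dataset)
        mask = labels != -1                  # score only the points assigned to a cluster
        if mask.sum() > 2 and len(set(labels[mask])) > 1:
            score = silhouette_score(dataset[mask], labels[mask])
            if best is None or score > best[0]:
                best = (score, eps, min_samples)
print(best)   # (score, eps, min_samples) of the best combination found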
In layman’s terms, we find a suitable value for epsilon by calculating the distance to the nearest n points for each point, sorting, and plotting the results. Then we look to see where the change is most pronounced (think of the angle between your arm and forearm) and select that as epsilon.
We can calculate the distance from each point to its closest neighbor using NearestNeighbors. Note that the point itself is included in n_neighbors. The kneighbors method returns two arrays: one containing the distance to the closest n_neighbors points, and the other containing the index of each of those points.
# Ask for two neighbors: the point itself and its single closest neighbor
nn_model = NearestNeighbors(n_neighbors=2)
nn_model.fit(X)
distances, indices = nn_model.kneighbors(X)
Next, we sort and plot the results.
# Sort each column, then keep column 1: the distance to each point's closest
# other neighbor (column 0 is the zero distance of each point to itself)
distances = np.sort(distances, axis=0)
distances = distances[:, 1]
plt.plot(distances)
The optimal value for epsilon will be found at the point of maximum curvature.
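If reading the knee off the plot is inconvenient, it can also be located programmatically. The sketch below uses the third-party kneed package on the sorted distances computed above; the curve/direction settings reflect the fact that the sorted k-distance curve is convex and increasing, and may need adjusting for other data.
import numpy as np
from kneed import KneeLocator
kneedle = KneeLocator(
    x=np.arange(len(distances)),
    y=distances,
    curve="convex",
    direction="increasing",
)
print(kneedle.knee_y)   # estimated epsilon: the distance value at the knee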
Comments
Hi, and thanks for sharing your work.
I have a question about min_samples. I have 2 attributes in my dataset, but these attributes are strings, so I vectorized them and obtained 1300 dimensions.
Should min_samples be set to 2+1 or 1300+1?
Thanks for your support.
Hi Marcella,
You should set min_samples based on the final feature representation that goes into the model, so 1300+1 in your case. However, you should generally be careful about the curse of dimensionality when transforming your features into such a high-dimensional space.
Thank you very much for your explanation. I have a question: how does the DBSCAN algorithm work on 3 dimensions in the R software? Would you have any idea?
Greetings to you, Sefidian =)