Finding the eps value in the DBSCAN algorithm
I am implementing DBSCAN on a dataset. First I sorted the data, then computed the distance from each point to its nearest neighbors, and plotted those minimum distances in ascending order. This should give the elbow curve used to choose the density threshold (eps). But the curve I get looks different from the sample examples I found on the internet. Please tell me whether my curve is OK or not. I have taken eps = 0.7; is this value correct?
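(Editor's note: for reference, a minimal sketch of the usual k-distance procedure with scikit-learn, assuming the data is already in a NumPy array X; the values of k and min_samples are placeholders. eps is read off as the y-value at the elbow, so 0.7 is only right if that is where your curve bends.)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 4  # rule of thumb: k = min_samples - 1
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own 0-th neighbor
distances, _ = nbrs.kneighbors(X)
k_dist = np.sort(distances[:, k])  # distance to the k-th nearest neighbor, ascending
plt.plot(k_dist)
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to {k}-th nearest neighbor")
plt.show()  # pick eps at the elbow of this curve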
See also questions close to this topic
-
Why do KMedoids and hierarchical clustering return different results?
I have a huge dataframe that contains only 0s and 1s. I used scipy.cluster.hierarchy to get the dendrogram and then sch.fcluster to extract the clusters at a specific cutoff (the metric for the distance matrix is Jaccard; the linkage method is "centroid"). However, when I wanted to find the optimal number of clusters for my dataframe, I noticed that KMedoids combined with the elbow method could help. Then, once I knew the best number of clusters, such as 2, I tried to use
KMedoids(n_clusters=2,metric='jaccard').fit(dataset)
to get the clusters, but the result is different from the hierarchical method. (The reason I don't use KMeans is that it is too slow for my dataframe.) Therefore, I did a test (the rows with index 0, 1, 2, 3 will be grouped):
import pandas as pd
import numpy as np
from scipy.spatial.distance import pdist

label1 = np.random.choice([0, 1], size=20)
label2 = np.random.choice([0, 1], size=20)
label3 = np.random.choice([0, 1], size=20)
label4 = np.random.choice([0, 1], size=20)
dataset = pd.DataFrame([label1, label2, label3, label4])
dataset
Method KMedoids:
Since there are only 4 rows, the number of clusters was set to 2.
from sklearn_extra.cluster import KMedoids

cobj = KMedoids(n_clusters=2, metric='jaccard').fit(dataset)
labels = cobj.labels_
labels
The clustering result is shown below:
Method Hierarchical:
import scipy.cluster.hierarchy as sch

# calculate distance matrix
disMat = sch.distance.pdist(dataset, metric='jaccard')
disMat1 = sch.distance.squareform(disMat)
# cluster:
Z2 = sch.linkage(disMat1, method='centroid')
sch.fcluster(Z2, t=1, criterion='distance')
To get the same number of clusters, I tried several cutoffs; the number of clusters was 2 when the cutoff was set to 1. Here is the result:
From what I found online, the dataframe passed to KMedoids should be the original dataframe, not the distance matrix. But it seems that KMedoids converts the original dataframe to a new one for some reason I don't understand, because I got this data conversion warning:
DataConversionWarning: Data was converted to boolean for metric jaccard warnings.warn(msg, DataConversionWarning)
I also got a warning when I performed the hierarchical method:
ClusterWarning: scipy.cluster: The symmetric non-negative hollow observation matrix looks suspiciously like an uncondensed distance matrix
Purpose:
What I want is a method that returns the clusters once I know the optimal number of clusters. The hierarchical method needs me to try different cutoffs, while KMedoids doesn't, but it returns a different result.
Can anybody explain this to me? And are there better ways to perform clustering?
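(Editor's note: one plausible source of the discrepancy is visible in the warnings above. sch.linkage expects the condensed distance vector returned by pdist, not its squareform — hence the ClusterWarning — and 'centroid' linkage is only well-defined for Euclidean input, not Jaccard. A sketch that feeds both methods the same precomputed Jaccard distances and requests 2 clusters directly, with 'average' linkage swapped in for 'centroid':)

from scipy.spatial.distance import pdist, squareform
import scipy.cluster.hierarchy as sch
from sklearn_extra.cluster import KMedoids

condensed = pdist(dataset, metric='jaccard')  # what sch.linkage actually expects
square = squareform(condensed)                # n x n matrix for KMedoids

# hierarchical: ask for exactly 2 clusters instead of hunting for a cutoff
Z = sch.linkage(condensed, method='average')
hier_labels = sch.fcluster(Z, t=2, criterion='maxclust')

# k-medoids on the same precomputed distances
kmed_labels = KMedoids(n_clusters=2, metric='precomputed').fit(square).labels_

print(hier_labels, kmed_labels)

Even on identical distances the two algorithms optimize different objectives, so identical partitions are not guaranteed; agreement can be quantified with, e.g., sklearn.metrics.adjusted_rand_score.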
-
R: Double Clustering of Standard Errors in Panel Regression
So I am analysing fund data. I use a fixed effects model and want to double-cluster my standard errors along "ISIN" and "Date" with plm().
The output of dput(nd[1:100, ]) is:
> dput(nd[1:100, ]) structure(list(Date = structure(c(1517356800, 1519776000, 1522454400, 1525046400, 1527724800, 1530316800, 1532995200, 1535673600, 1538265600, 1540944000, 1543536000, 1546214400, 1548892800, 1551312000, 1553990400, 1556582400, 1559260800, 1561852800, 1564531200, 1567209600, 1569801600, 1572480000, 1575072000, 1577750400, 1580428800, 1582934400, 1585612800, 1588204800, 1590883200, 1593475200, 1596153600, 1598832000, 1601424000, 1604102400, 1606694400, 1609372800, 1612051200, 1614470400, 1617148800, 1619740800, 1622419200, 1625011200, 1627689600, 1630368000, 1632960000, 1635638400, 1638230400, 1640908800, 1517356800, 1519776000, 1522454400, 1525046400, 1527724800, 1530316800, 1532995200, 1535673600, 1538265600, 1540944000, 1543536000, 1546214400, 1548892800, 1551312000, 1553990400, 1556582400, 1559260800, 1561852800, 1564531200, 1567209600, 1569801600, 1572480000, 1575072000, 1577750400, 1580428800, 1582934400, 1585612800, 1588204800, 1590883200, 1593475200, 1596153600, 1598832000, 1601424000, 1604102400, 1606694400, 1609372800, 1612051200, 1614470400, 1617148800, 1619740800, 1622419200, 1625011200, 1627689600, 1630368000, 1632960000, 1635638400, 1638230400, 1640908800, 1517356800, 1519776000, 1522454400, 1525046400), tzone = "UTC", class = c("POSIXct", "POSIXt")), Dummy = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0), ISIN = c("LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "DE0008474008", "DE0008474008", "DE0008474008", "DE0008474008"), Returns = c(-0.12401, -4.15496, -1.39621, 4.46431, -2.28814, -0.58213, 3.61322, -3.56401, 0.6093, -4.73124, 0.88597, -5.55014, 5.12313, 2.65441, 1.3072, 2.99972, -5.1075, 3.51965, 0.24626, -2.21961, 4.48332, -0.03193, 2.19313, 1.81355, -2.2836, -8.3185, -14.58921, 4.47981, 4.52948, 5.51294, -2.16857, 2.56992, -2.04736, -6.17825, 
14.71218, 1.24079, -1.33888, 3.5197, 8.09674, 1.43074, 3.79434, 0.47398, 1.57474, 2.48837, -3.08439, 3.68851, -2.93803, 6.43656, 2.67598, -3.39767, -5.27997, 4.76756, 4.89914, -0.95931, 2.22484, 3.01478, 1.63997, -6.64158, 3.46497, -8.54853, 7.40113, 5.68973, 1.64367, 4.35256, -5.09351, 3.43618, 2.16774, -0.77703, 3.16832, 1.65626, 4.91897, 1.76163, 1.49508, -5.16847, -9.53639, 12.74246, 3.08746, 3.4028, 0.09515, 5.66077, -2.85661, -2.58972, 9.53565, 2.93138, 0.32556, 2.92393, 5.02059, 0.98137, 0.58733, 4.91219, 2.21603, 2.52087, -3.87762, 7.66159, -0.04559, 4.48257, 2.83511, -6.27841, -3.98683, 4.99554), Flows = c(-0.312598458, -37.228563578, -119.065088084, -85.601069424, -46.613436838, -20.996760878, -12.075112555, -40.571568112, -16.210315254, -54.785115578, -55.93565336, -25.073939479, -16.513305702, -111.112262813, -17.260252326, -44.287088276, -84.358676293, -12.73665543, -14.846322594, -30.353217826, -43.002634628, -31.293725624, -32.291532262, -21.145334594, -33.460150254, -22.458849454, -34.690817528, -34.088358344, -4.069613214, -7.841523244, -6.883674001, -11.99060429, -19.155102931, -20.274682083, -33.509645025, -25.764368282, -22.451403457, -39.075362392, -9.772306537, -7.214728071, -10.462230506, -12.550102699, -0.439609898, -16.527865041, -15.938402293, -10.916678964, -11.041205907, -11.627537098, -13.797947969, -18.096144272, 29.879529566, -51.895196556, -3.192064966, -1.469562773, 9.739671656, -35.108549922, -19.490401121, 36.459406559, -66.213269625, 8.105824198, -17.078089399, -59.408458411, 1.227033593, -42.501421101, -15.275983037, 19.425363714, -23.165013159, -19.68599313, -20.478530269, -19.566890333, -19.63229278, -59.274372862, -37.128708445, 5.129404763, -2.650978954, -0.566245645, -14.80700799, 4.891308881, -18.16286654, -17.570559084, -2.726629634, -14.482219321, -35.795673521, -10.119935801, -14.37900783, -20.385053784, -4.550848701, -17.672355509, -14.270420088, 1.440911458, -8.924636198, -5.749771862, -12.284920947, -23.093834986, -13.553880939, -31.572182943, -22.977082191, -8.076560195, -11.825577374, -9.263872938), TNA = c(2474.657473412, 2327.75517961, 2171.146502197, 2175.433117247, 2082.147188171, 2042.121760963, 2031.311390907, 1918.904748403, 1914.140451001, 1765.867322561, 1724.972362171, 1600.059421422, 1605.009162592, 1539.205393073, 1540.8291693, 1538.550310809, 1370.631945404, 1404.091772234, 1351.60138448, 1290.98574898, 1309.942298579, 1280.634128059, 1278.146819041, 1281.50075434, 1189.563983023, 1062.001168646, 859.735053702, 868.096185968, 894.397805491, 933.614731653, 885.975121845, 897.018097461, 854.196359787, 781.178047528, 863.00585297, 846.859512502, 796.10866733, 784.290994645, 838.747509395, 841.511540715, 863.678978862, 854.663205271, 856.363306246, 859.460891875, 816.275861034, 836.347760358, 800.867957871, 842.657752288, 2742.709413, 2629.70296, 2518.690562, 2516.902480001, 2635.037923, 2606.124805, 2672.082125, 2715.556617, 2738.845915, 2591.318371, 2613.260789, 2396.060545001, 2554.437804, 2638.160519, 2680.990319, 2753.467368, 2533.347075001, 2637.887076, 2670.127393, 2628.138778001, 2688.643794, 2711.56785, 2823.634535001, 2811.983963001, 2835.218976, 2672.765021, 2413.332814, 2718.586512, 2727.69596, 2823.040628, 2805.482839, 2944.602701, 2855.870812, 2765.189256, 2990.804719, 3066.36598, 3059.603769, 3126.458368, 3276.612153, 3289.257788, 3291.864476, 3397.759970999, 3461.462599, 3540.518638, 3388.702548, 3622.641661, 3604.82519, 3732.115875999, 4129.617979, 3857.780349, 3687.848268001, 3858.323607), Age = c(2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 62, 62, 62, 62)), row.names = c(NA, -100L), class = c("tbl_df", "tbl", "data.frame"))
My code initially yielded a result; I didn't change anything, but all of a sudden it won't execute the last line of code.
library(plm)
attach(nd)
library(lmtest)
library(stargazer)
library(sandwich)
library(etable)
library(pacman)
library(fixest)
library(multiwayvcov)
library(foreign)

# cleaning: adjust units of TNA and Flows
nd <- nd %>% mutate(TNA = TNA / 1000000, Flows = Flows / 1000000)  # 1mio and 1mio
# drop na's
# nd <- nd %>%
#   drop_na()

# variable creation for model
Y <- cbind(nd$Flows)
X <- cbind(nd$Dummy, lag(nd$Returns), lag(nd$TNA), nd$Age)

# descriptive statistics
summary(Y)
summary(X)

# random effects
random2 <- plm(Y ~ X, nd, model = 'random', index = c('ISIN', 'Date'))
summary(random2)

# fixed effects model
fixed2 <- plm(Y ~ X, nd, model = 'within', index = c('ISIN', 'Date'))

# Breusch-Pagan test
bptest(fixed2)

# Test which model to use: fixed effects or random effects
# Hausman test
phtest(random2, fixed2)
# we take fixed effects

## Double-clustering formula (Thompson, 2011)
vcovDC <- function(x, ...){
  vcovHC(x, cluster = "ISIN", ...) +
    vcovHC(x, cluster = "Date", ...) -
    vcovHC(x, method = "white1", ...)
}

# visualize SEs
coeftest(fixed2, vcov = function(x) vcovDC(x, type = "HC1"))
stargazer(coeftest(fixed2, vcov = function(x) vcovDC(x, type = "HC1")), type = "text")
Now, when I try to run:
coeftest(fixed2, vcov=function(x) vcovDC(x, type="HC1"))
I get this error (which didn't happen before): Error in match.arg(cluster) : 'arg' should be one of “group”, “time”
I highly appreciate any answer. I'd also like to know if the formula I used for the double-clustered standard errors is correct. I followed the approach from: Double clustered standard errors for panel data - the comment from Iandorin.
Edit: I rewrote the code and now it works:
library(plm)
attach(nd)
library(lmtest)
library(stargazer)
library(sandwich)
library(etable)
library(pacman)
library(fixest)
library(multiwayvcov)
library(foreign)

# cleaning: adjust units of TNA and Flows
# nd <- nd %>%
#   mutate(TNA = TNA / 1000000, Flows = Flows / 1000000)  # 1mio and 1mio
# drop na's
# nd <- nd %>%
#   drop_na()

# variable creation for model
Y <- cbind(nd$Flows)
X <- cbind(nd$Dummy, lag(nd$Returns), lag(nd$TNA), nd$Age)

# descriptive statistics
summary(Y)
summary(X)

# random effects
random2 <- plm(Y ~ X, nd, model = 'random', index = c('ISIN', 'Date'))
summary(random2)

# fixed effects model
fixed2 <- plm(Y ~ X, nd, model = 'within', index = c('ISIN', 'Date'))

# Breusch-Pagan test
bptest(fixed2)

# Test which model to use: fixed effects or random effects
# Hausman test
phtest(random2, fixed2)
# we take fixed effects

## Double-clustering formula (Thompson, 2011)
vcovDC <- function(x, ...){
  vcovHC(x, cluster = "ISIN", ...) +
    vcovHC(x, cluster = "Date", ...) -
    vcovHC(x, method = "white1", ...)
}

testamk <- plm(Y ~ X, nd, model = 'within', index = c('ISIN', 'Date'))
summary(testamk)
coeftest(testamk, vcov = function(x) vcovHC(x, cluster = "group", type = "HC1"))
Many thanks in advance! Joe
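(Editor's note: the error arises because vcovHC() for plm objects only accepts cluster = "group" or cluster = "time", not the index names "ISIN"/"Date", and the rewritten call above therefore clusters along one dimension only. The plm package also ships its own vcovDC() implementing Thompson (2011) double clustering. A minimal sketch, assuming the model above; it is namespaced here because the custom vcovDC defined above shadows the name:)

library(plm)
library(lmtest)

fixed2 <- plm(Y ~ X, data = nd, model = "within", index = c("ISIN", "Date"))
# standard errors double-clustered along both index dimensions (fund and time)
coeftest(fixed2, vcov = function(x) plm::vcovDC(x, type = "HC1"))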
-
Seurat - cannot plot the same DimPlot again
I am trying to rewrite the code of this paper: https://doi.org/10.1038/s42003-020-0837-0
I have written the code step by step based on the instructions in the methods section. But after clustering, when I plot the clusters with DimPlot, I get a plot that differs from the one in the paper.
I wonder what the problem is. I have tuned every parameter to reproduce the plot, but it hasn't worked yet.
Graph of the paper
My graph
Please help me to solve this issue.
-
Applying weights to KNN dimensions
When doing KNN searches in Elasticsearch/OpenSearch, it seems to be recommended to normalize the data in the KNN vectors to prevent single dimensions from overpowering the final scoring.
In my current example I have a 3-dimensional vector where all values are normalized to the range 0 to 1:
[0.2, 0.3, 0.2]
From the perspective of Euclidean-distance-based scoring, this seems to give equal weight to all dimensions.
In my particular example I am using an l2 vector:
"method": { "name": "hnsw", "space_type": "l2", "engine": "nmslib", }
However, if I want to give more weight to one of my dimensions (say by a factor of 2), would it be acceptable to single out that dimension and normalize it to the range 0-2 instead of the base range of 0-1?
Example:
[0.2, 0.3, 1.2] // third dimension is now in the range 0-2
The distance contribution of this term would now be

(2 * (xi - yi))^2 = 4 * (xi - yi)^2

and lead to bigger differences compared to the rest. As a result, the overall score would be more sensitive to differences in this particular dimension. In OpenSearch the score is calculated as
1 / (1 + Distance Function)
so the higher the value returned by the distance function, the lower the score will be. Is there a method for deciding what the weighting range should be? Setting the range too high would likely make the dimension too dominant.
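(Editor's note: a standalone sketch of the math, not OpenSearch code, using the 1 / (1 + distance) formula quoted above with a squared-L2 distance; x, y, and the weights are made-up values. Scaling a dimension by a factor w multiplies its squared-distance contribution by w^2, so a 0-2 range gives that dimension 4x weight; to weight it by exactly 2, scale by sqrt(2).)

import numpy as np

x = np.array([0.2, 0.3, 0.2])
y = np.array([0.6, 0.1, 0.7])

def os_score(a, b, w=np.ones(3)):
    # 1 / (1 + distance), with per-dimension scale factors w
    d = np.sum((w * (a - b)) ** 2)
    return 1.0 / (1.0 + d)

print(os_score(x, y))                                # unweighted
print(os_score(x, y, np.array([1, 1, 2])))           # 3rd dim scaled 2x -> 4x weight
print(os_score(x, y, np.array([1, 1, np.sqrt(2)])))  # 3rd dim weighted exactly 2x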
-
How can I make a "Euclidean distance function" in PostgreSQL?
I am creating a face recognition system, but the search response is slow. It takes about 0.8 seconds for 100,000 data items.
I thought it would be faster if I made a function, but I don't know how to make it. Can you help me, please?
One record holds one face's data: 128 facial features are stored per record. Over the 100,000 records, I want to create a function that searches the face data and returns the record with the closest Euclidean distance, along with that distance.
- PostgreSQL
test=# SELECT version();
                                                           version
-----------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 14.2 (Debian 14.2-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
(1 row)
- Table
test=# \d face_feature
               Table "public.face_feature"
    Column    |        Type        | Collation | Nullable | Default
--------------+--------------------+-----------+----------+---------
 id           | bigint             |           | not null |
 face_feature | double precision[] |           | not null |
Indexes:
    "face_feature_pkey" PRIMARY KEY, btree (id)
- Data
test=# SELECT count(*) FROM face_feature;
 count
--------
 100013

test=# SELECT * FROM face_feature LIMIT 1;
 1 | {-0.07603023,0.13605964,0.06847742,-0.03398858,-0.00734358,-0.00842991,-0.10306944,-0.07324794,0.17355075,-0.14330758,0.28240225,0.08052156,-0.2336603,-0.13322374,0.07830653,0.16286661,-0.24032585,-0.07958832,-0.10359511,-0.07131331,-0.00063275,-0.0075661,0.07669175,0.05093278,-0.07491664,-0.38648498,-0.06326434,-0.13660973,-0.00129792,-0.178278,-0.1043617,-0.04163877,-0.17744723,-0.10434192,-0.01320369,-0.02023632,-0.01660283,-0.03433997,0.17991692,0.03811514,-0.13161938,0.0699086,-0.01464873,0.21853428,0.23839769,0.10686319,0.02119838,-0.07234459,0.11722616,-0.21573897,0.04390875,0.16936384,0.08311173,0.01917882,0.09323658,-0.2072413,-0.01748681,0.08320624,-0.09542595,0.0458608,0.04892007,-0.07144583,0.02571599,0.04505579,0.20255798,0.04992293,-0.10868648,-0.05127542,0.11837331,-0.02669029,-0.00874132,-0.01055394,-0.18174604,-0.24011543,-0.25346312,0.06233353,0.33776185,0.19125421,-0.20479012,0.00082084,-0.17169486,0.01196041,0.07014117,0.07075555,-0.04521162,-0.08565544,-0.09207604,0.04942492,0.10082003,0.04185115,-0.02510541,0.21267371,-0.0340629,0.05106198,0.00818989,0.00951286,-0.14769551,-0.01800098,-0.16166028,-0.05558256,-0.00861979,-0.03454661,-0.00584124,0.12463668,-0.18756914,0.05945413,0.00501454,-0.02449622,-0.01090427,0.10673723,-0.08141828,-0.06253327,0.06625147,-0.22139846,0.23956184,0.28763503,0.03987202,0.16106789,0.07659496,0.05857954,-0.02265359,-0.03277328,-0.16266093,-0.08388556,0.03029961,0.07270813,0.09608927,0.00082345}
- SQL
SELECT *
FROM face_feature
ORDER BY sqrt(
    power(face_feature[0] - (-0.09077361), 2) +
    power(face_feature[1] - (0.10373443), 2) +
    ...
    power(face_feature[126] - (0.0778369), 2) +
    power(face_feature[127] - (0.00951046), 2)
)
LIMIT 1;
- Result
 1 | {-0.07603023,0.13605964, ... 0.09608927,0.00082345}
Time: 835.313 ms
I need both the closest record and its Euclidean distance, but right now only the closest record is shown.
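(Editor's note: a minimal sketch of one way to get both, assuming the schema above; euclidean_distance is a made-up name. Note that PostgreSQL arrays are 1-based by default, so the face_feature[0] in the query above reads outside the default bounds unless the arrays were stored with custom bounds.)

-- sketch: SQL function returning the Euclidean distance between two arrays
CREATE OR REPLACE FUNCTION euclidean_distance(a double precision[], b double precision[])
RETURNS double precision
LANGUAGE sql IMMUTABLE
AS $$
    SELECT sqrt(sum((a[i] - b[i]) ^ 2))
    FROM generate_subscripts(a, 1) AS i;
$$;

-- usage: q stands for the full 128-element query array
-- (ARRAY[-0.09077361, 0.10373443, ..., 0.0778369, 0.00951046])
SELECT id,
       euclidean_distance(face_feature, q) AS distance
FROM face_feature
ORDER BY distance
LIMIT 1;

On its own this still scans the whole table, so it will not cut the 0.8 s by much; indexed nearest-neighbor extensions such as pgvector are the usual route to real speedups.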
-
Export Detected Objects (Point Cloud Data)
I want to save the objects I detected using RANSAC and DBSCAN separately, so I export the labels produced for the objects. I want to save all of these labels (however many there are in total, according to the number of objects detected in the point cloud data) using a for loop, but I couldn't get any results from my attempts. Below is the code I am running.
import open3d as o3d
import numpy as np
import matplotlib.pyplot as plt
import time

# RANSAC
pcd = o3d.io.read_point_cloud("D:\\Bitirme_Veri\\dene.pcd")
start = time.time()
plane_model, inliers = pcd.segment_plane(distance_threshold=0.04, ransac_n=3, num_iterations=1000)
inlier_cloud = pcd.select_by_index(inliers)
outlier_cloud = pcd.select_by_index(inliers, invert=True)
inlier_color = plt.get_cmap("summer")(0)
inlier_cloud.paint_uniform_color(list(inlier_color)[:3])
# o3d.io.write_point_cloud("D:\\Bitirme_Veri\\aa.pcd", outlier_cloud, write_ascii=True, compressed=True, print_progress=False)

# DBSCAN
labels = np.array(outlier_cloud.cluster_dbscan(eps=0.05, min_points=5))
max_label = labels.max()
colors = plt.get_cmap("tab20")(labels / (max_label if max_label > 0 else 1))
colors[labels < 0] = 0
inlier_cloud.colors = o3d.utility.Vector3dVector(colors[:, :3])
colors = plt.get_cmap("tab10")(labels / (max_label if max_label > 0 else 1))
colors[labels < 0] = 0
outlier_cloud.colors = o3d.utility.Vector3dVector(colors[:, :3])
end = time.time()
print(f"İşlem Süresi: {end - start:.3f}")  # processing time

print(labels)

# Preview
o3d.visualization.draw_geometries([inlier_cloud] + [outlier_cloud])
o3d.visualization.draw_geometries([outlier_cloud])

# Export objects
obj_points = np.asarray(outlier_cloud.points)[labels == 0]
new_pcd = o3d.geometry.PointCloud()
new_pcd.points = o3d.utility.Vector3dVector(obj_points)
o3d.io.write_point_cloud("D:\\Bitirme_Veri\\gg.pcd", new_pcd, write_ascii=True, compressed=True, print_progress=False)
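(Editor's note: a sketch of the loop the question seems to be after, continuing from the script above with its labels, max_label, and outlier_cloud; the output path is a placeholder. cluster_dbscan labels clusters 0..max_label and noise as -1, so looping over that range writes one file per detected object.)

# sketch: write each detected cluster to its own .pcd file
points = np.asarray(outlier_cloud.points)
for label in range(max_label + 1):  # -1 (noise) is skipped on purpose
    cluster_points = points[labels == label]
    cluster = o3d.geometry.PointCloud()
    cluster.points = o3d.utility.Vector3dVector(cluster_points)
    o3d.io.write_point_cloud(
        f"D:\\Bitirme_Veri\\object_{label}.pcd",  # placeholder path
        cluster, write_ascii=True, compressed=True, print_progress=False)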
-
How to compute the number of clusters in spatial datasets in R?
I have two datasets, each with a length of about 10,000, shown truncated below. Essentially, each has the x- and y-coordinate of every object in a 2D map.
rep  x-pos  y-pos
1    0.5    0.7
1    0.1    0.0
1    4.6    2.5
2    5.6    5.0
2    0.2    1.0
2    0.4    2.0
I want to measure whether the two datasets have similar levels of clustering between the objects. Visually, the 2D maps make it look like one dataset has a higher number of clusters. Is there a method like mclust or dbscan with which I can quantify clustering differences in spatial datasets in R? Thanks.
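(Editor's note: a minimal sketch of one way to quantify this, assuming the dbscan package and that each dataset is a data frame shaped like the sample above; eps and minPts are placeholder values that need tuning, e.g. via a k-distance plot.)

library(dbscan)

count_clusters <- function(df, eps = 0.5, minPts = 5) {
  xy <- as.matrix(df[, c("x-pos", "y-pos")])
  res <- dbscan(xy, eps = eps, minPts = minPts)
  length(unique(res$cluster[res$cluster > 0]))  # cluster 0 is noise
}

count_clusters(dataset1)  # dataset1/dataset2 stand for the two data frames
count_clusters(dataset2)

For a scale-aware comparison of spatial clustering rather than a single cluster count, point-pattern statistics such as Ripley's K function (available in the spatstat packages) are another common route.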