Distance between the rows in dataset B, based on dataset A
I have two datasets, A and B.
I am interested in how far each row of B is from each row of A (both have the same columns).
Due to the size of B, computing dist() or parDist() on the stacked dataset of A and B and taking a subset isn't feasible.
More concretely: suppose A is 50000 rows, B is 250000. I want 250000 rows x 50000 columns to detail these distances.
Any solution I'm overlooking?
1 answer

This worked for me with a smaller dataset and should work on yours. It splits A into chunks and calculates summary stats for each row of A compared to all rows of B. It still performs an all-to-all comparison in the end, since it iterates through all rows of A. (If this is not what you're looking for, it's important to provide a reproducible example and expected output to avoid situations like this.)
    library(purrr)

    set.seed(1)
    A <- as.data.frame(matrix(runif(500*2)*10, nrow=500))        # change 500 to 50000
    B <- as.data.frame(matrix(runif(250000*2)*10, nrow=250000))

    myfun <- function(rowsofA, B) {
      # pairwise coordinate differences: one row per row of the A chunk, one column per row of B
      Dx <- outer(rowsofA[,1], B[,1], "-")**2   # ** is same as ^
      Dy <- outer(rowsofA[,2], B[,2], "-")**2
      Dist <- sqrt(Dx+Dy)                       # Dist = sqrt((x1 - x2)^2 + (y1 - y2)^2)
      # add summary stat below
      Summ <- data.frame(
        mean = apply(Dist, 1, mean),
        sd   = apply(Dist, 1, sd),
        min  = apply(Dist, 1, min),
        max  = apply(Dist, 1, max))
      return(Summ)
    }

    # split() recycles 1:5 over the rows, so chunk k holds rows k, k+5, k+10, ...
    # and the output rows follow chunk order rather than the original row order of A
    map_df(split(A, 1:5), ~myfun(.x, B))
With the 500-row dataset, split(..., 1:5) will split the data frame into 5 100-row data frames. With a 50,000-row dataset, use something like split(..., 1:100) or split(..., 1:1000), depending on your memory.

Output with the 500-row dataset. Each row of the output provides the mean, sd, min, and max distance for each row of A vs all rows of B.

    #       mean       sd          min       max
    # 1 4.332120 1.922412 0.0104518694  9.179429
    # 2 6.841677 2.798114 0.0044511643 13.195127
    # 3 5.708658 2.601969 0.0131417242 11.788345
    # 4 4.670345 2.139370 0.0104878996  9.521932
    # 5 6.249670 2.716091 0.0069813098 12.473525
    # 6 5.497154 2.476391 0.0127143548 11.108188
    # 7 3.928659 1.551248 0.0077266976  7.954166
    # etc
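
The question says A and B share the same columns but does not say how many, and the code above is written for exactly two. Here is a rough sketch of the same chunked idea for any number of numeric columns. It assumes Euclidean distance is what's wanted, and the names chunk_dist, summarise_chunks and chunk_size are made up for illustration. It uses contiguous chunks so the output rows stay in the same order as A (a grouping vector like 1:5 gets recycled by split(), which interleaves the rows). For sizing: a chunk of 100 rows against 250,000 rows of B is 25 million doubles, about 200 MB per intermediate matrix, whereas the full 250,000 x 50,000 matrix would be about 100 GB.

```r
# Hypothetical helper: Euclidean distance from every row of a chunk of A
# to every row of B, for any number of numeric columns.
chunk_dist <- function(a_chunk, B) {
  a <- as.matrix(a_chunk)
  b <- as.matrix(B)
  # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b)
  sq <- outer(rowSums(a^2), rowSums(b^2), "+") - 2 * tcrossprod(a, b)
  sqrt(pmax(sq, 0))  # clamp tiny negative values caused by rounding
}

# Hypothetical driver: summary stats per row of A, computed chunk by chunk.
summarise_chunks <- function(A, B, chunk_size = 100) {
  # contiguous chunk ids (1,1,...,2,2,...) keep the output in A's row order
  ids <- ceiling(seq_len(nrow(A)) / chunk_size)
  out <- lapply(split(A, ids), function(a_chunk) {
    d <- chunk_dist(a_chunk, B)
    data.frame(
      mean = rowMeans(d),
      sd   = apply(d, 1, sd),
      min  = apply(d, 1, min),
      max  = apply(d, 1, max),
      row.names = NULL)
  })
  res <- do.call(rbind, out)
  rownames(res) <- NULL
  res
}

# small demo data (made up), three columns
set.seed(1)
A_demo <- as.data.frame(matrix(runif(20 * 3) * 10, nrow = 20))
B_demo <- as.data.frame(matrix(runif(50 * 3) * 10, nrow = 50))
res_demo <- summarise_chunks(A_demo, B_demo, chunk_size = 6)
```

The expanded-square form does the heavy lifting in one matrix product, but it can lose a little precision for points that are almost identical, so if very small distances matter the explicit outer(..., "-") approach per column may be the safer choice.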