Distance between the rows in dataset B, based on dataset A

I have two datasets, A and B.

I am interested in how far each row of B is from each row of A (both have the same columns).

Due to the size of B, computing dist() or parDist() on the stacked dataset of A and B and taking a subset isn't feasible.

More concretely: suppose A has 50,000 rows and B has 250,000. I want a 250,000 x 50,000 matrix holding these distances.

Any solution I'm overlooking?

1 answer

  • answered 2017-10-11 12:23 CPak

    This worked for me with a smaller dataset and should scale to yours. It splits A into chunks and, for each row of A, computes summary statistics of its Euclidean distances to all rows of B, so the full distance matrix never has to sit in memory at once. It is still an all-to-all comparison, since it iterates over every row of A. (If this is not what you're looking for, please provide a reproducible example and the expected output.)

    set.seed(1)
    A <- as.data.frame(matrix(runif(500*2)*10, nrow=500))  # change 500 to 50000
    B <- as.data.frame(matrix(runif(250000*2)*10, nrow=250000))
    
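    # Summarise distances from each row of a chunk of A to every row of B (two columns only)
    myfun <- function(rowsofA, B) {
        Dx <- outer(rowsofA[,1], B[,1], "-")^2  # squared differences in column 1
        Dy <- outer(rowsofA[,2], B[,2], "-")^2  # squared differences in column 2
        Dist <- sqrt(Dx+Dy)  # Dist = sqrt((x1-x2)^2 + (y1-y2)^2)
        # one row of summary stats per row of the chunk; add more stats below
        Summ <- data.frame( mean = apply(Dist, 1, mean), 
                    sd = apply(Dist, 1, sd), 
                    min = apply(Dist, 1, min), 
                    max = apply(Dist, 1, max))
        return(Summ)
    }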
    
    library(purrr)
    map_df(split(A, 1:5), ~myfun(.x, B))
    

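    With the 500-row dataset, split(A, 1:5) recycles the labels 1:5 down the rows, giving five 100-row data frames (rows 1, 6, 11, ... go to the first one, and so on). The combined output is therefore ordered by chunk, not by the original row order of A. With a 50,000-row dataset, use something like split(A, 1:100) or split(A, 1:1000), depending on your memory.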
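
    As a rough guide to picking the number of chunks, the sketch below estimates how much memory one chunk needs and builds contiguous chunks, so the results come back in A's row order instead of the round-robin order that split(A, 1:5) produces. The helper names chunk_gb and contiguous_chunks are made up for illustration, and the factor of 3 assumes Dx, Dy and Dist are all alive at once inside myfun; apply() and R's own copies can push the real peak higher.

    ```r
    # Approximate GB held by the distance matrices for one chunk.
    # Assumes doubles (8 bytes) and three live matrices (Dx, Dy, Dist).
    chunk_gb <- function(rows_per_chunk, n_B, n_matrices = 3) {
        rows_per_chunk * n_B * 8 * n_matrices / 1e9
    }

    chunk_gb(500, 250000)  # 100 chunks of a 50,000-row A: 3 GB
    chunk_gb(50, 250000)   # 1000 chunks of a 50,000-row A: 0.3 GB

    # Hypothetical helper: split df into n_chunks blocks of consecutive rows
    # (rows 1..k in chunk 1, the next k in chunk 2, and so on).
    contiguous_chunks <- function(df, n_chunks) {
        n <- nrow(df)
        grp <- ceiling(seq_len(n) * n_chunks / n)
        split(df, grp)
    }
    ```

    With these, map_df(contiguous_chunks(A, 5), ~myfun(.x, B)) should return row i of the output for row i of A.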

    Output with 500-row dataset. Each row of the output provides the mean, sd, min, and max distance for each-row-of-A vs all-rows-of-B.

    #         mean       sd          min       max
    # 1   4.332120 1.922412 0.0104518694  9.179429
    # 2   6.841677 2.798114 0.0044511643 13.195127
    # 3   5.708658 2.601969 0.0131417242 11.788345
    # 4   4.670345 2.139370 0.0104878996  9.521932
    # 5   6.249670 2.716091 0.0069813098 12.473525
    # 6   5.497154 2.476391 0.0127143548 11.108188
    # 7   3.928659 1.551248 0.0077266976  7.954166
    # etc
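
    myfun above is written for exactly two columns. If A and B have more, one way to get the same chunk of Euclidean distances is the identity |a-b|^2 = |a|^2 + |b|^2 - 2*a.b, which needs only one matrix product per chunk. This is a minimal sketch, assuming every column is numeric; dist_chunk is a hypothetical name, not part of any package.

    ```r
    # Euclidean distances between every row of a chunk of A and every row of B,
    # for any number of numeric columns. Returns an nrow(A_chunk) x nrow(B) matrix.
    dist_chunk <- function(A_chunk, B) {
        a <- as.matrix(A_chunk)
        b <- as.matrix(B)
        # squared distance = |a|^2 + |b|^2 - 2 * a.b
        d2 <- outer(rowSums(a^2), rowSums(b^2), "+") - 2 * tcrossprod(a, b)
        # rounding error can leave tiny negative values; clamp them before sqrt
        sqrt(pmax(d2, 0))
    }

    # Compare with stats::dist() on a small stacked example
    set.seed(1)
    a <- matrix(runif(12), nrow = 4)
    b <- matrix(runif(15), nrow = 5)
    ref <- as.matrix(dist(rbind(a, b)))[1:4, 5:9]
    all.equal(dist_chunk(a, b), ref, check.attributes = FALSE)
    ```

    Inside myfun, Dist <- dist_chunk(rowsofA, B) could then stand in for the Dx, Dy and Dist lines. The expanded form loses some precision for points that are almost identical, so the two-column version may be preferable when very small distances matter.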