need an algorithm for continuity check: select a list of integer number to have best "coverage"

this is an hardcore algorithm problem that i'm facing right now:

let's say there is a sorted list of integers L1:
1. the total length of the list is known which is N (e.g. N could be 1e7)
2. all the elements are between two known boundaries, A and B , ( A <<< B )
e.g. L1 = [ 2,5,10,15,18,19,21...]

Right now, I need to select a subset of the elements from the list L1 to form a new list L2 with the total length of M (M < N)
(e.g. M could equal N /10 )

to satisfy a condition: the new list L2 needs to have best "coverage";
by "coverage", it means that all the elements integers in the L2 need to be distributed in the L1's range, [A,B], as equally as possible. (a.k.a an unbiased sub-sampling method )

any help is deeply appreciated.

thanks for everyone's help and idea. I try to simplify the problem so that everyone (without the background knowledge can understand the problem). To define a rule for goodness of the coverage:

the ultimate goal is to achieve:

  1. in the list L2, for any two neighbor element J and K, there are | J - K | , and the sum of this difference needs to be minimized

  2. apply a given window with the total length of Q ( Q < M ) to list L2, and the number of elements within the window needs to be either equal (ideal situation) or almost equal

thanks

1 answer

  • answered 2018-07-11 05:33 ujhuyz0110

    My idea is to utilize bucket sort of which bucket size is (A - B) / M. After mapping each element in l1 to its corresponding bucket, pick element randomly from each bucket to from the new list. If the new list is shorter than m, then I repeat the process. The following is my implementation in Python:

    import bisect
    import random
    import collections
    
    def form_new_list(l1, m, a, b, res):
        if m <= 0:
            return
    
        bucket_size = int((b - a) / m)
        bucket_list = collections.defaultdict(list)
        for i, num in enumerate(l1):
            bucket_num = int(num / bucket_size)
            bucket_list[bucket_num].append(num)
    
        for _, nums in bucket_list.items():
            selected = random.choice(nums)
            position = bisect.bisect(res, selected)
            bisect.insort(res, selected)
            l1.remove(selected)
    
        form_new_list(l1, m - len(res), a, b, res)
    
        return res