Can I sort a bunch of values without retaining the actual content of the string?

What do I want to do

I want to sort a bunch of strings, simple enough.

What are my constraints

The original text that I want to sort is stored on-premises. The cloud holds some other "columns" of data that are not on-premises, and for security reasons I cannot move the original text from on-premises to the cloud. The real constraint is that I cannot have all the data in one place, which makes sorting and paging on values across on-premises and cloud data difficult.

What I thought of (and where I need help)

Maybe I can take a hash, or extract certain data from the string in some other way, such that the original string cannot be reproduced (which takes care of the security concern) but the extracted value is still enough to sort on.

Example

on-premises data:

[ { "id": 1, "name": "abcd" } ]

cloud data:

[ { "id": 1, "price": "20" } ]

I need to sort on both price and name in the above example (imagine 100,000 rows of such data).

1 answer

  • answered 2018-05-16 07:46 Yunnosch

    What you need to do is to store pairs of a string and the corresponding id, e.g. in two lists/arrays (whatever the programming language of your choice offers).
    Then start sorting the strings, but each time you move a string, move the id the same way.
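
    A minimal sketch of this two-array approach in C, assuming the strings fit in memory on-premises. The function name sort_parallel and the use of insertion sort are illustrative choices, not something the approach prescribes; any sorting algorithm works as long as every move of a string is mirrored on the id array.

```c
#include <stddef.h>
#include <string.h>

/* Insertion sort over two parallel arrays (hypothetical helper).
   names[] holds the sort keys; ids[] is moved in lockstep so that
   names[i] and ids[i] always belong to the same row. */
static void sort_parallel(const char *names[], int ids[], size_t n)
{
    for (size_t i = 1; i < n; i++) {
        const char *name = names[i];
        int id = ids[i];
        size_t j = i;

        /* Shift larger strings one slot to the right ... */
        while (j > 0 && strcmp(names[j - 1], name) > 0) {
            names[j] = names[j - 1];
            ids[j] = ids[j - 1];   /* ... and move the id the same way. */
            j--;
        }
        names[j] = name;
        ids[j] = id;
    }
}
```

    Called with the names { "mango", "abcd", "kiwi" } and the ids { 7, 1, 4 } (made-up sample rows), the arrays end up as { "abcd", "kiwi", "mango" } and { 1, 4, 7 }.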

    Alternatively, most programming languages offer constructs that let you make pairs; you then sort those pairs by their strings, which automatically moves the ids along.

    Either way, after sorting you can still find the id for each string, and with that id you can access the corresponding cloud data as usual.

    As an example, the programming language C offers the compound data type construct struct:

    struct IdStringPair
    {
        int id;
        char* string;
        /* actually just the address of where the full string is stored,
           but basically what you probably want to use */
    };
    

    Hardly any programming language exists which does not offer something similar.
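
    As a sketch of how such pairs could be sorted in C, the standard library function qsort can be given a comparator that looks only at the string. The names compare_by_string and sort_pairs are made up for illustration, and the string member is declared const here (unlike in the struct above) so that string literals can be used safely.

```c
#include <stdlib.h>
#include <string.h>

struct IdStringPair
{
    int id;
    const char *string;   /* address of the on-premises string */
};

/* qsort comparator: orders pairs by their string only. */
static int compare_by_string(const void *a, const void *b)
{
    const struct IdStringPair *pa = a;
    const struct IdStringPair *pb = b;
    return strcmp(pa->string, pb->string);
}

/* Sorts the pairs in place; each id travels with its string
   because qsort moves whole structs. */
static void sort_pairs(struct IdStringPair *pairs, size_t count)
{
    qsort(pairs, count, sizeof pairs[0], compare_by_string);
}
```

    After sort_pairs returns, walking the array front to back yields the ids in the order of their strings, and those ids can then be used to fetch the matching cloud rows.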

    If, conversely, the data to sort by is in the cloud, then sorting has to take place in the cloud, i.e. by something there that can execute the sorting algorithm. Make sure that you sort the id along with the key. Finding the non-cloud string then works the same as before: whatever you previously did to find the string for an id, do it with the ids you got from the cloud-sorted data.
    This is the same as the first situation/solution, just mirrored.
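
    The mirrored case might look roughly like this on the on-premises side, assuming the cloud has already sorted its rows (for example by price) and sent back only the ids in that order. The helpers find_string and strings_in_cloud_order are hypothetical, and the linear lookup is only a placeholder for whatever id lookup is already in use.

```c
#include <stddef.h>

struct IdStringPair
{
    int id;
    const char *string;
};

/* Looks up the on-premises string for an id; returns NULL if the
   id is unknown. A linear scan keeps the sketch short. */
static const char *find_string(const struct IdStringPair *local,
                               size_t count, int id)
{
    for (size_t i = 0; i < count; i++) {
        if (local[i].id == id)
            return local[i].string;
    }
    return NULL;
}

/* Fills out[] with the local strings in the order given by
   sorted_ids[], i.e. the order the cloud produced. */
static void strings_in_cloud_order(const int *sorted_ids, size_t n,
                                   const struct IdStringPair *local,
                                   size_t count, const char **out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = find_string(local, count, sorted_ids[i]);
}
```

    Only ids cross the gap in this direction, so the strings themselves never leave the premises.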

    The core concept is to always sort the ids along with the key (and other data), which removes the need to have the data from the other side of the gap between cloud and premises. That applies to all versions of sorting separated data.