How to get the keys with the top 5 maximum values in a PySpark RDD


I have an RDD of (key, value) pairs in PySpark and I want the 5 keys with the maximum values.

2 answers

  • answered 2018-07-11 04:43 pissall

    If your data is a DataFrame rather than an RDD, you can sort by the value column and take the first 5 rows:

    df.sort('column_name', ascending=False).take(5)

    Hope this helps

  • answered 2018-07-11 09:21 Oli

    If you're working with RDDs, you can sort your data and take the first 5 elements.

    >>> rdd.sortBy(lambda x: -x[1]).take(5)
    [('e', 10), ('g', 9), ('d', 7), ('f', 5), ('b', 5)]
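The descending sort-by-value logic above can be checked in plain Python without a Spark cluster. This is a minimal sketch using hypothetical sample pairs chosen to match the answer's output; `sortBy` with a negated value key behaves like `sorted` with the same key:

```python
# Hypothetical (key, value) pairs, chosen to match the output shown above.
pairs = [('a', 1), ('b', 5), ('c', 3), ('d', 7), ('e', 10),
         ('f', 5), ('g', 9)]

# Sort by value descending (negated key, as in rdd.sortBy) and keep 5.
top5 = sorted(pairs, key=lambda x: -x[1])[:5]
print(top5)  # [('e', 10), ('g', 9), ('d', 7), ('b', 5), ('f', 5)]
```

Note that ties (here the two 5s) may come back in a different relative order than in the answer's output, since only the values determine the ranking.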

    Yet, this may not be very efficient, especially on large RDDs, because it fully sorts the data just to keep 5 elements. You could instead use a simple reduce:

    >>> rdd.map(lambda x: [x])\
    ...     .reduce(lambda a, b: sorted(a + b, key=lambda x: -x[1])[:5])

    This is still not optimal, because it creates a lot of intermediate lists, but it is already much better than a full sort.
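The map/reduce idea can also be sketched in plain Python with `functools.reduce` (same hypothetical sample pairs as before). Each element becomes a singleton list, and every reduce step merges two candidate lists and keeps only the 5 largest by value, so no intermediate list ever grows beyond 10 elements; Spark applies the same pairwise merge across partitions:

```python
from functools import reduce

# Hypothetical (key, value) pairs, as in the earlier example.
pairs = [('a', 1), ('b', 5), ('c', 3), ('d', 7), ('e', 10),
         ('f', 5), ('g', 9)]

# map(lambda x: [x]) turns each pair into a singleton list...
singletons = [[x] for x in pairs]

# ...then each reduce step merges two candidate lists and keeps the
# 5 pairs with the largest values.
top5 = reduce(lambda a, b: sorted(a + b, key=lambda x: -x[1])[:5],
              singletons)
print(top5)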