How to get the top 5 maximum-value keys in a PySpark RDD using filter

Rdd = sc.parallelize([('a', 1), ('b', 5), ('c', 3), ('d', 7), ('e', 10), ('f', 5), ('g', 9)])

I have this kind of RDD in PySpark and I want the 5 keys with the largest values.

2 answers

  • answered 2018-07-11 04:43 pissall

    This syntax belongs to the DataFrame API; a plain RDD has no sort method that takes a column name. After converting the RDD to a DataFrame, sorting descending on the value column works (a self-contained sketch follows below):

    df.sort('value', ascending=False).take(5)

    Hope this helps
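
    A minimal, self-contained sketch of that route, assuming Spark 2.x (toDF needs an active SparkSession, and the column names 'key' and 'value' are made up for the example):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    rdd = spark.sparkContext.parallelize(
        [('a', 1), ('b', 5), ('c', 3), ('d', 7), ('e', 10), ('f', 5), ('g', 9)])

    # Name the tuple fields so the value column can be referenced by name.
    df = rdd.toDF(['key', 'value'])

    # Sort descending on the value column and keep the first 5 rows.
    print(df.sort('value', ascending=False).take(5))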

  • answered 2018-07-11 09:21 Oli

    If you're working with RDDs, you can sort your data and take the first 5 elements.

    >>> Rdd.sortBy(lambda x: -x[1]).take(5)
    [('e', 10), ('g', 9), ('d', 7), ('f', 5), ('b', 5)]
    

    Yet this is not very efficient on a large RDD, since it fully sorts the whole dataset just to keep 5 elements. You could use a simple reduce instead.

    # Wrap each record in a singleton list, then merge the lists pairwise,
    # keeping only the 5 largest values at each merge step.
    Rdd.map(lambda x: [x]) \
       .reduce(lambda a, b: sorted(a + b, key=lambda x: -x[1])[:5])
    

    This is still not optimal, because it creates a lot of intermediate list objects, but it is already much better than a full sort.
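
    Note that the RDD API also ships built-ins for exactly this pattern: takeOrdered and top keep only a small candidate list per partition and merge those lists on the driver, so no full sort or shuffle is needed.

    # takeOrdered sorts ascending by key, so negate the value to get the largest first.
    Rdd.takeOrdered(5, key=lambda x: -x[1])
    # Equivalent, reading more directly as "largest by value":
    Rdd.top(5, key=lambda x: x[1])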