Sorting after Repartitioning PySpark Dataframe

I'm fairly new to PySpark, so I'm sorry if this is a dumb question. We have a giant file that we repartitioned by one column, say STATE. After repartitioning, it seems the data can no longer be sorted completely. We are trying to save our final file as a txt file, but instead of Alabama being the first state listed, California now shows up first. OrderBy doesn't seem to have any effect after running the repartition.

df = df.repartition(100, ['STATE_NAME'])\
    .sortWithinPartitions(['STATE_NAME', 'CUSTOMER_ID', 'ROW_ID'])

1 answer

  • answered 2021-10-22 16:57 Armali

    I can't find a clear statement in the documentation about this, only this hint for pyspark.sql.DataFrame.repartition:

    The resulting DataFrame is hash partitioned.

    repartition doesn't put the rows in any specific (namely alphabetical) order, not even if they were ordered previously; it only groups rows with the same key into the same partition. That .sortWithinPartitions imposes no global order is no surprise given its name, which implies that sorting occurs only within each partition, not across partitions. You can try .sort instead.
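
To make the difference concrete, here is a toy model in plain Python. It is not Spark itself, only a sketch of the idea. The sample rows, the partition count of 4, and the CRC32-based `partition_index` helper are all assumptions made for illustration; Spark uses its own hash function, and the question uses 100 partitions.

```python
import zlib

# Hypothetical sample rows: (STATE_NAME, CUSTOMER_ID)
rows = [
    ("Texas", 3), ("Alabama", 2), ("California", 1),
    ("Alabama", 1), ("Texas", 1), ("California", 2),
]
NUM_PARTITIONS = 4  # illustrative only


def partition_index(state, num_partitions):
    # Deterministic stand-in for a hash partitioner (an assumption:
    # Spark does not use CRC32, but the principle is the same).
    return zlib.crc32(state.encode("utf-8")) % num_partitions


def repartition(rows, num_partitions):
    # Rows with the same key always land in the same partition,
    # but which partition a key lands in is unrelated to alphabetical order.
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[partition_index(row[0], num_partitions)].append(row)
    return partitions


def sort_within_partitions(partitions):
    # Each partition is sorted on its own; partitions are never compared.
    return [sorted(partition) for partition in partitions]


def read_in_partition_order(partitions):
    # Reading partition 0, then 1, ... mimics concatenating the output files.
    return [row for partition in partitions for row in partition]


locally_sorted = read_in_partition_order(
    sort_within_partitions(repartition(rows, NUM_PARTITIONS))
)
globally_sorted = sorted(rows)  # what a global sort would produce

print("sorted within partitions:", locally_sorted)
print("globally sorted:         ", globally_sorted)
```

Both lists contain the same rows, and each state's rows are contiguous and ordered in the first list, but the order of the states follows the hash, not the alphabet. In PySpark, the global ordering corresponds to calling `df.sort('STATE_NAME', 'CUSTOMER_ID', 'ROW_ID')` rather than `sortWithinPartitions`.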
