Remove Unnecessary Characters while keeping CSV format using shell

I have a CSV file in the following format:

902610747280285697, possible future hurricaneirma analog 1995\xC2\xA02003\xC2\xA02004\xC2\xA02008 2010 til east leeward doubtful afterward invest93l
902611695239094277, midlevel ridge push invest93l future hurricaneirma wsw ridge east leave som
902642953373577216, midlevel ridge push invest93l future hurricaneirma wsw ridge east leave som
902711459561525248, midlevel ridge push invest93l future hurricaneirma wsw ridge east leave som
902755305158782976, 12z ecmwf setup support major strike east coast high east deep uul west hurricaneirma
902772740507275265, possible future hurricaneirma analog 1995\xC2\xA02003\xC2\xA02004\xC2\xA02008 2010 til east leeward doubtful
902777486186086400, future hurricaneirma satellite look impressive tropicaldepression10 24 hour
903355611810852867, hurricaneirma think f ***
903355689455804416, hurricaneirma tropics weather
903411347337162752, hurricaneirma shiiiiiiitty t *** im possibly fuck
903411365607591936, hurricaneirma 3000 mile cat 3 hurricane watch closely
903989185845088257, 

How do I remove characters like * and \xC2\xA02003\xC2\xA02004\xC2\xA0, as well as empty rows like the last one, which might throw errors in the Scala processing later on? I need to keep the CSV structure exactly as before, with only these removed.

Could you please help me achieve this in a shell script? Thank you once again; I am a newbie in shell scripting.

Edit:

Could you please tell me how to correct corrupted rows (those with no ','), like the last two below:

902755305158782976, 12z ecmwf setup support major strike east coast high east deep uul west hurricaneirma
902777486186086400, future hurricaneirma satellite look impressive tropicaldepression10 24 hour
903355611810852867 hurricaneirma think
903355611810852868 hurricagggneirma think

1 answer

  • answered 2018-01-14 03:53 sjsam

    You could use sed for this, but I am pretty sure you will not get 100% correct results. You should use a tool native to the file format you're processing to get the results in the desired format. Anyway, below is my attempt:

    $ sed -E '/^[^,]*$/d;/^[0-9]+, *$/d;s/ \*+ */ /;s/\\[xX][^\ ,]*//g' case_file_48246326
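
The one-liner above chains four sed expressions with semicolons. Written one expression per -e with a comment on each, it can be read as the sketch below. This is only a restatement for readability; the wrapper name clean_csv is made up for illustration and is not part of the original command.

```shell
#!/bin/sh
# clean_csv (hypothetical name): same four expressions as the one-liner.
#   1. delete rows that contain no comma at all
#   2. delete rows that are only "<digits>," plus optional spaces
#   3. replace the first " ***"-style run of asterisks on a line with a
#      single space (no g flag, so later runs on the same line are kept)
#   4. remove every \x... run up to the next space or comma; digits glued
#      to the escape (such as 2003 in \xA02003) are removed along with it
# Reads the named files, or standard input when no file is given.
clean_csv() {
    sed -E \
        -e '/^[^,]*$/d' \
        -e '/^[0-9]+, *$/d' \
        -e 's/ \*+ */ /' \
        -e 's/\\[xX][^\ ,]*//g' \
        "$@"
}
```

It could then be called as clean_csv case_file_48246326 > cleaned.csv, where cleaned.csv is just an example output name.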
    

    Output

    902610747280285697, possible future hurricaneirma analog 1995 2010 til east leeward doubtful afterward invest93l
    902611695239094277, midlevel ridge push invest93l future hurricaneirma wsw ridge east leave som
    902642953373577216, midlevel ridge push invest93l future hurricaneirma wsw ridge east leave som
    902711459561525248, midlevel ridge push invest93l future hurricaneirma wsw ridge east leave som
    902755305158782976, 12z ecmwf setup support major strike east coast high east deep uul west hurricaneirma
    902772740507275265, possible future hurricaneirma analog 1995 2010 til east leeward doubtful
    902777486186086400, future hurricaneirma satellite look impressive tropicaldepression10 24 hour
    903355611810852867, hurricaneirma think f 
    903355689455804416, hurricaneirma tropics weather
    903411347337162752, hurricaneirma shiiiiiiitty t im possibly fuck
    903411365607591936, hurricaneirma 3000 mile cat 3 hurricane watch closely
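
Note that this output has lost the years 2003, 2004 and 2008 from the first row, and the command deletes rows without a comma instead of repairing them. If the years should be kept and comma-less rows fixed, a variant along the following lines might work. It is a sketch under assumptions: the \xC2\xA0 sequences are literal backslash text exactly as shown in the question, every row starts with a numeric ID, and the name repair_csv is hypothetical. It does not try to fix misspelled words such as hurricagggneirma.

```shell
#!/bin/sh
# repair_csv (hypothetical name): alternative cleanup that keeps digits
# following an escape and restores a missing comma after the leading ID.
#   1. turn each literal \xHH escape (exactly two hex digits) into a space
#   2. drop all asterisks
#   3. normalise the separator after the leading numeric ID to ", "
#   4. squeeze runs of spaces into one space
#   5. strip trailing spaces
#   6. delete rows that are only an ID (with or without a comma)
#   7. delete empty rows
# Reads the named files, or standard input when no file is given.
repair_csv() {
    sed -E \
        -e 's/\\[xX][0-9A-Fa-f]{2}/ /g' \
        -e 's/\*+//g' \
        -e 's/^([0-9]+)[ ,]+/\1, /' \
        -e 's/  +/ /g' \
        -e 's/ +$//' \
        -e '/^[0-9]+,?$/d' \
        -e '/^$/d' \
        "$@"
}
```

With this variant a row such as 903355611810852867 hurricaneirma think would come out as 903355611810852867, hurricaneirma think rather than being dropped. Whether guessing the comma position like this is safe depends on the data, so it is worth checking the result before feeding it to the Scala job.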