Adjust ARFF file format in C++ for WEKA

I want to preprocess a Weka ARFF file that contains 2000 lines for an NLP project (sentiment analysis).

I want code that simply adds a single quotation mark at the start and end of each sentence. For example, this is a sample from my dataset:

The Da Vinci Code is one of the most beautiful movies ive ever seen.,1
The Da Vinci Code is an * amazing * book, do not get me wrong.,1
then I turn on the light and the radio and enjoy my Da Vinci Code.,1
The Da Vinci Code was REALLY good.,1
i love da vinci code....,1

I want it to be like this:

'The Da Vinci Code is one of the most beautiful movies ive ever seen.',1
'The Da Vinci Code is an * amazing * book, do not get me wrong.',1
'then I turn on the light and the radio and enjoy my Da Vinci Code.',1
'The Da Vinci Code was REALLY good.',1
'i love da vinci code....',1

I just want to add a single quotation mark at the beginning and end of each sentence (before the ,1).

I would really appreciate any help with this.

Is there any tool I can use instead of writing code?

1 answer

  • answered 2018-01-04 16:11 KompjoeFriek

    You could use regular expressions with any of a large number of tools to achieve this.

    Regular expression:

    /([^.]+)(\.+)(,1)(\s+|$)/g
    
    • Group 1: Match one or more characters that are not a literal dot.
    • Group 2: Match one or more literal dots.
    • Group 3: Match a literal comma followed by a literal 1.
    • Group 4: Match one or more whitespace characters, or the end of the line, so that a record with nothing after the 1 is matched too.
    • Regex flag g (global): replace every match, not only the first one.

    Substitution:

    '$1$2'$3$4
    

    Enclose groups 1 and 2 in single quotes, followed by groups 3 and 4.

    Put that to work in your favorite tool, for example sed (as written, this assumes GNU sed; -i edits the file in place):

    sed -i -E "s/([^.]+)(\.+)(,1)(\s+|$)/'\1\2'\3\4/g" yourfile.txt
    

    Other tools might use a different syntax. The expression provided can probably be optimized further.