I am trying to remove Japanese stopwords from a Twitter text corpus.
Unfortunately the frequently used NLTK does not include Japanese stopwords, so I had to figure out a different way.
This is my MWE:
import urllib
from urllib.request import urlopen
import MeCab
import re
# slothlib
slothlib_path = "http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt"
sloth_file = urllib.request.urlopen(slothlib_path)
# stopwordsiso
iso_path = "https://raw.githubusercontent.com/stopwords-iso/stopwords-ja/master/stopwords-ja.txt"
iso_file = urllib.request.urlopen(iso_path)
stopwords = [line.decode("utf-8").strip() for line in iso_file]
stopwords = [ss for ss in stopwords if not ss==u'']
stopwords = list(set(stopwords))
text = '日本語の自然言語処理は本当にしんどい、と彼は十回言った。'
tagger = MeCab.Tagger("-Owakati") # -Owakati = space-separated tokenization
tok_text = tagger.parse(text)
ws = re.compile(" ")
words = [word for word in ws.split(tok_text)]
if words[-1] == u"\n":
    words = words[:-1]
ws = [w for w in words if w not in stopwords]
print(words)
print(ws)
This runs successfully: it prints both the original tokenized text and the version without stopwords:
['日本語', 'の', '自然', '言語', '処理', 'は', '本当に', 'しんどい', '、', 'と', '彼', 'は', '十', '回', '言っ', 'た', '。']
['日本語', '自然', '言語', '処理', '本当に', 'しんどい', '、', '十', '回', '言っ', '。']
There are still two issues I am facing though:
a) Is it possible to take both stopword lists into account, namely iso_file and sloth_file, so that a word is removed if it appears in either of them? (I tried changing line 14 (the stopwords comprehension) to
stopwords = [line.decode("utf-8").strip() for line in zip('iso_file','sloth_file')]
but received an error because a tuple has no decode attribute.)
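Conceptually I imagine something like the following set union, reusing slothlib_path and iso_path from above (just an untested sketch; load_stopwords is a helper name I made up):
import urllib.request
def load_stopwords(url):
    # download the list and decode each line, dropping empty lines
    with urllib.request.urlopen(url) as f:
        return {line.decode("utf-8").strip() for line in f if line.strip()}
# a word counts as a stopword if it appears in either list
stopwords = load_stopwords(slothlib_path) | load_stopwords(iso_path)
But I am not sure whether this is the right approach.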
b) The ultimate goal is to generate a new text file in which all stopwords are removed.
I created this MWE:
### first clean twitter csv
import pandas as pd
import re
import emoji
df = pd.read_csv("input.csv")
def cleaner(tweet):
    tweet = re.sub(r"@[^\s]+","",tweet) #Remove @username
    tweet = re.sub(r"(?:\@|http?\://|https?\://|www)\S+|\\n","", tweet) #Remove http links & \n
    tweet = " ".join(tweet.split())
    tweet = ''.join(c for c in tweet if c not in emoji.UNICODE_EMOJI) #Remove emojis
    tweet = tweet.replace("#", "").replace("_", " ") #Remove hashtag sign but keep the text
    return tweet
df['text'] = df['text'].map(lambda x: cleaner(x))
df['text'].to_csv(r'cleaned.txt', header=None, index=None, sep='\t', mode='a')
### remove stopwords
import urllib
from urllib.request import urlopen
import MeCab
import re
# slothlib
slothlib_path = "http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt"
sloth_file = urllib.request.urlopen(slothlib_path)
#stopwordsiso
iso_path = "https://raw.githubusercontent.com/stopwords-iso/stopwords-ja/master/stopwords-ja.txt"
iso_file = urllib.request.urlopen(iso_path)
stopwords = [line.decode("utf-8").strip() for line in iso_file]
stopwords = [ss for ss in stopwords if not ss==u'']
stopwords = list(set(stopwords))
with open("cleaned.txt",encoding='utf8') as f:
cleanedlist = f.readlines()
cleanedlist = list(set(cleanedlist))
tagger = MeCab.Tagger("-Owakati")
tok_text = tagger.parse(cleanedlist)
ws = re.compile(" ")
words = [word for word in ws.split(tok_text)]
if words[-1] == u"\n":
    words = words[:-1]
ws = [w for w in words if w not in stopwords]
print(words)
print(ws)
While this works for the simple input text in the first MWE, the second MWE raises the error
in method 'Tagger_parse', argument 2 of type 'char const *'
Additional information:
Wrong number or type of arguments for overloaded function 'Tagger_parse'.
Possible C/C++ prototypes are:
MeCab::Tagger::parse(MeCab::Lattice *) const
MeCab::Tagger::parse(char const *)
for this line: tok_text = tagger.parse(cleanedlist)
So I assume I will need to make amendments to cleanedlist?
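From the error I gather that Tagger.parse expects a single string ("char const *") rather than a list, so I suppose I would have to tokenize line by line, roughly like this (untested sketch; filtered_lines is just a name I chose):
filtered_lines = []
for line in cleanedlist:
    # parse() takes one string at a time; split() also drops the trailing "\n"
    tokens = tagger.parse(line).split()
    filtered_lines.append([t for t in tokens if t not in stopwords])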
I have uploaded cleaned.txt to GitHub for reproducing the issue:
[txt on github][1]
Also: how would I be able to get the tokenized list that excludes stopwords back into a text format like cleaned.txt? Would it be possible to create a DataFrame of ws for this purpose?
Or might there even be a simpler way?
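What I picture is joining the remaining tokens of each tweet with spaces and writing one tweet per line, something like this sketch (the output filename is made up, and it builds on the hypothetical filtered_lines from above):
with open("cleaned_no_stopwords.txt", "w", encoding="utf8") as out:
    for tokens in filtered_lines:
        # one tweet per line, tokens separated by spaces
        out.write(" ".join(tokens) + "\n")
But maybe going through a DataFrame is cleaner; I am not sure.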
Sorry for the long question; I tried a lot of things and tried to make it as easy as possible to understand what I'm driving at :-)
Thank you very much!
[1]: https://gist.github.com/yin-ori/1756f6236944e458fdbc4a4aa8f85a2c