Calculating term frequency without TfidfVectorizer()
I am trying to calculate term frequency without scikit-learn or NLTK. I have a corpus with four documents. My code somehow calculates TF only for unique values; the TF of words repeated across the corpus is not calculated. That is,
corpus=['this is my first python code', 'this is my second line of code', 'and this contains third', 'is this my last line']
What I expect is a dictionary with each word and its TF, but my output does not recalculate the TF of repeating words. 'this', 'is', and 'my' appear in both the first and second documents, and each word should have a different TF in each document. But my code calculates the TF of 'this', 'is', and 'my' from the first document and then does not calculate it again for the second document, and so on.
from collections import Counter

TFdict = {}
for sentence in corpus:
    Countofeachword = dict(Counter(sentence.split()))
    for key, value in Countofeachword.items():
        TFdict[key] = value / sum(Countofeachword.values())
Is there a major gap in my understanding? I am not able to proceed. Can someone please provide a small hint on where I am going wrong? Thanks.
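For comparison, here is a minimal per-document sketch (assuming the goal is one TF dictionary per document rather than a single shared dict, which is what overwrites the values for repeated words):

```python
from collections import Counter

corpus = ['this is my first python code', 'this is my second line of code',
          'and this contains third', 'is this my last line']

# Keep one TF dict per document, so a word repeated across documents
# retains a separate score in each one
tf_per_doc = []
for sentence in corpus:
    counts = Counter(sentence.split())
    total = sum(counts.values())
    tf_per_doc.append({word: count / total for word, count in counts.items()})

print(tf_per_doc[0]['this'])  # 1/6: 'this' once among 6 words in doc 1
print(tf_per_doc[1]['this'])  # 1/7: 'this' once among 7 words in doc 2
```

With a single shared dict, the second loop iteration silently overwrites the first document's entry for 'this'; a list of per-document dicts keeps both.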
See also questions close to this topic
Module with globals or Class with attributes?
Currently I'm working with a lot of modules where the original developers used global variables to control state and to exchange important information between functions, like so:
STATE_VAR = 0

def do_something(arg1):
    global STATE_VAR
    if arg1:
        STATE_VAR = 1

def say_hello():
    if STATE_VAR:
        print("Hello!")
I have to create new libraries that communicate with these modules and, once I use pylint to check my code, I get a lot of complaints about using the global statement.
In my head, the structure should be something like this:
class MyClass:
    STATE_VAR = 0

    @classmethod
    def do_something(cls, arg1):
        if arg1:
            cls.STATE_VAR = 1

    @classmethod
    def say_hello(cls):
        if cls.STATE_VAR:
            print("Hello!")
This structure makes pylint happy by avoiding the global statement, but at the same time it rubs me the wrong way: I need clauses such as from mymodule import MyClass, or have to contend with the ugly mymodule.MyClass.do_something() type of call.
I want my code to be both pythonic and consistent with what is already in place (I might be overthinking this as well).
I've also stumbled upon this other related question that got no definitive answer to it.
So my question is: what is the best practice in this situation? Do I keep writing modules that use global variables to define state (keeping consistency but leaving pylint mad), or should I follow the road of classes and OOP (and effectively go against the code already in place)?
Breaking subprocess loop from parent process
What is missing here to break the loop in tok2.py from tok1.py?
I try to send a string containing 'exit', read the sent value into my_input, and break the loop in tok2.py. Right now tok2 runs forever.
Using Debian 10 Buster with Python 3.7.
# tok1.py
import sys
import time
import subprocess

command = [sys.executable, 'tok2.py']
proc = subprocess.Popen(command, stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
i = 0
while proc.poll() is None:
    if i > 5:  # Send 'exit' after 5th iteration
        proc.stdin.write(b'exit')
    print('tok1: ' + str(i))
    time.sleep(0.5)
    i = i + 1
# tok2.py
import sys
import time

ii = 0
my_input = ''
while True:
    my_input = sys.stdin.read()
    if my_input == b'exit':
        print('tok2: exiting')
        sys.stdout.flush()
        break
    print('tok2: ' + str(ii))
    sys.stdout.flush()
    ii = ii + 1
    time.sleep(0.5)
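Two things stand out: sys.stdin.read() blocks until EOF rather than returning after each message, and my_input is a str while b'exit' is bytes, so the comparison can never be true (and the parent never flushes the pipe). A minimal runnable sketch of one fix, using newline-delimited messages, text mode, and line-by-line reads (the child is inlined via -c purely for illustration):

```python
import subprocess
import sys

# Child: iterate stdin line by line instead of a blocking read();
# lines arrive as str, so compare against a str, not bytes.
child_code = r'''
import sys
for line in sys.stdin:
    if line.strip() == "exit":
        print("child: exiting")
        break
'''

proc = subprocess.Popen([sys.executable, "-c", child_code],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                        text=True)
proc.stdin.write("exit\n")  # newline-terminated so the child's line read returns
proc.stdin.flush()          # flush, or the message may sit in the pipe buffer
out, _ = proc.communicate() # closes stdin and collects the child's output
print(out.strip())
```

The same three changes (terminate messages with a newline, flush after writing, read line by line and compare like types) apply directly to tok1.py and tok2.py.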
TypeError: '>=' not supported between instances of 'builtin_function_or_method' and 'int'
When I run the code below:
input("请输入1—100之间的数字：")  # "Please enter a number between 1 and 100:"
n = input
if n >= 1 and n <= 100:
    print("你妹好漂亮！")  # "Your sister is so pretty!"
else:
    print("你大爷好丑")  # "Your uncle is so ugly"
print("游戏结束啦！不和你玩了")  # "Game over! I'm not playing with you anymore"
I get the following error:
TypeError: '>=' not supported between instances of 'builtin_function_or_method' and 'int'
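The error comes from n = input, which binds the built-in function itself: input() was called on the first line, but its return value was never stored. A minimal sketch of the fix, with the comparison moved into a hypothetical check() helper (and the string converted to int, since input() returns a str):

```python
def check(raw):
    # input() returns a str; `n = input` would bind the built-in function,
    # which is what triggers the TypeError on `>=`
    n = int(raw)
    return 1 <= n <= 100

print(check("42"))   # True
print(check("101"))  # False
```

In the original script this would be n = int(input("...")) on a single line.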
How can I give more priority to an exact sentence match using tf-idf?
I am trying to create a chatbot with a tf-idf vectorizer. I am not getting enough accuracy, because tf-idf tends to rank longer sentences containing a word above an exact match on the word itself. Can anyone help me give more priority to precise matches? Thanks.
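One common workaround (an assumption about the setup, since no code is shown): score candidates with a length-normalized similarity and add an explicit exact-match bonus, so a short precise sentence can outrank a longer one that merely contains the query words. A pure-Python sketch with a hypothetical best_match() helper:

```python
from collections import Counter
import math

responses = ["hello there friend", "hello"]

def cosine(a, b):
    # Cosine similarity over raw term counts; the norm in the denominator
    # penalizes long sentences that merely contain the query words
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(query, exact_bonus=0.5):
    # Hypothetical exact-match bonus: a verbatim match gets a fixed boost
    scores = [cosine(query, r) + (exact_bonus if r == query else 0.0)
              for r in responses]
    return responses[max(range(len(responses)), key=scores.__getitem__)]

print(best_match("hello"))
```

The same bonus idea can be layered on top of scikit-learn's TfidfVectorizer scores; the helper above just keeps the sketch dependency-free.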
How to conditionally create a new variable containing one or more observations?
I'm trying to match each post (from table dataEnArbeit) with its relevant keyword(s) and their tf-idf value(s) (from table arbeit). How do I copy all relevant words and their tf-idf scores into wArbeit and iArbeit respectively?
dataEnArbeit <- mutate(dataEnArbeit,
                       wArbeit = ifelse(str_count(dataEnArbeit$Text, arbeit$feature) >= 1,
                                        arbeit$feature, NA),
                       iArbeit = ifelse(str_count(dataEnArbeit$Text, arbeit$feature) >= 1,
                                        arbeit$tfidf, NA))
All I get is one word from (arbeit), although there are more.
How can I resolve "ValueError: empty vocabulary"?
This error occurred when trying to sort nouns whose tf-idf score was 0.03 or higher after morphological analysis of tweets acquired in real time. Also, I can't remove retweets and emoticons in the tweets I get.
Can you tell me what is happening inside the code and how to fix it?
File "final.py", line 97, in <module>
    stream.sample()
File "/Users/macuser/Workspaces/jxpress/trendword/.direnv/python-3.7.3/lib/python3.7/site-packages/tweepy/streaming.py", line 449, in sample
    self._start(is_async)
File "/Users/macuser/Workspaces/jxpress/trendword/.direnv/python-3.7.3/lib/python3.7/site-packages/tweepy/streaming.py", line 389, in _start
    self._run()
File "/Users/macuser/Workspaces/jxpress/trendword/.direnv/python-3.7.3/lib/python3.7/site-packages/tweepy/streaming.py", line 320, in _run
    six.reraise(*exc_info)
File "/Users/macuser/Workspaces/jxpress/trendword/.direnv/python-3.7.3/lib/python3.7/site-packages/six.py", line 693, in reraise
    raise value
File "/Users/macuser/Workspaces/jxpress/trendword/.direnv/python-3.7.3/lib/python3.7/site-packages/tweepy/streaming.py", line 289, in _run
    self._read_loop(resp)
File "/Users/macuser/Workspaces/jxpress/trendword/.direnv/python-3.7.3/lib/python3.7/site-packages/tweepy/streaming.py", line 351, in _read_loop
    self._data(next_status_obj)
File "/Users/macuser/Workspaces/jxpress/trendword/.direnv/python-3.7.3/lib/python3.7/site-packages/tweepy/streaming.py", line 323, in _data
    if self.listener.on_data(data) is False:
File "/Users/macuser/Workspaces/jxpress/trendword/.direnv/python-3.7.3/lib/python3.7/site-packages/tweepy/streaming.py", line 54, in on_data
    if self.on_status(status) is False:
File "final.py", line 78, in on_status
    tfidf = vectorizer.fit_transform(corpus)
File "/Users/macuser/Workspaces/jxpress/trendword/.direnv/python-3.7.3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py", line 1652, in fit_transform
    X = super().fit_transform(raw_documents)
File "/Users/macuser/Workspaces/jxpress/trendword/.direnv/python-3.7.3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py", line 1058, in fit_transform
    self.fixed_vocabulary_)
File "/Users/macuser/Workspaces/jxpress/trendword/.direnv/python-3.7.3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py", line 989, in _count_vocab
    raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words
import os
import tweepy
import redis
import math
from collections import Counter
import re
from natto import MeCab
import codecs
import sys
from sklearn.feature_extraction.text import TfidfVectorizer
import glob
import numpy as np

#r = redis.Redis(host='localhost', port=6379, db=0)
TWITTER_CLIENT_ID = os.environ['TWITTER_CLIENT_ID']
TWITTER_CLIENT_SECRET = os.environ['TWITTER_CLIENT_SECRET']
TWITTER_OAUTH_TOKEN = os.environ['TWITTER_OAUTH_TOKEN']
TWITTER_OAUTH_TOKEN_SECRET = os.environ['TWITTER_OAUTH_TOKEN_SECRET']

auth = tweepy.OAuthHandler(TWITTER_CLIENT_ID, TWITTER_CLIENT_SECRET)
auth.set_access_token(TWITTER_OAUTH_TOKEN, TWITTER_OAUTH_TOKEN_SECRET)

class StreamListener(tweepy.StreamListener):
    def __init__(self):
        super().__init__()
        self.count = 0  # Number of tweets acquired

    def on_status(self, status):
        text = str(status.text)
        text2 = re.sub(r"http\S+", "", text)
        text3 = re.sub(r"@(\w+) ", "", text2)
        text4 = re.sub(r"#(\w+)", "", text3)
        text5 = re.sub(r"RT(\w+)", "", text4)  # Unable to erase retweet
        emoji_pattern = re.compile("["
                                   u"\U0001F600-\U0001F64F"
                                   u"\U0001F300-\U0001F5FF"
                                   u"\U0001F680-\U0001F6FF"
                                   u"\U0001F1E0-\U0001F1FF"
                                   "]+", flags=re.UNICODE)
        text6 = emoji_pattern.sub("", text5)  # Unable to erase emoji

        # Writing Japanese tweets to a file + displaying the number of tweets
        if status.lang == "ja":
            self.count += 1
            print(self.count, text6)
            with open("test37.txt", "a", encoding="utf-8") as f:
                f.write(text6)
            with codecs.open("test37.txt", "r", "utf-8") as f:
                corpus = f.read().split("\n")

            mecab = MeCab('-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd')
            #if tagger.lang == 'ja':
            rm_list = ["RT", "https", "co", "@", "__"]
            docs = []
            for txt in corpus:
                # Morphological analysis using MeCab
                words = mecab.parse(txt, as_nodes=True)
                doc = []
                for w in words:
                    # Compare only the first feature field (the part of speech);
                    # comparing the whole list to a string is always False
                    if w.feature.split(",")[0] == "名詞":  # 名詞 = noun
                        if len(w.surface) >= 3:
                            if not any(rm in w.surface for rm in rm_list):
                                doc.append(str(w.surface))
                doc = ' '.join(doc)
                docs.append(doc)
            corpus = docs

            # tf-idf calculation
            vectorizer = TfidfVectorizer(min_df=0.03)
            tfidf = vectorizer.fit_transform(corpus)

            # Sort words by score
            feature_names = np.array(vectorizer.get_feature_names())
            for vec in tfidf:
                index = np.argsort(vec.toarray(), axis=1)[:, ::-1]
                feature_words = feature_names[index]
                #print(corpus)
                print(feature_words[:, :10])

    def on_error(self, status_code):
        return False

stream = tweepy.Stream(auth=auth, listener=StreamListener())
stream.sample()
macOS 10.12.6, Python 3.7.3, Atom
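Two things worth checking, sketched below with a hypothetical clean_tweet() helper (this is not the asker's pipeline): retweets are easier to drop wholesale than to strip with re.sub(r"RT(\w+)", ...), emoji outside the listed ranges can be covered by removing all astral-plane characters, and empty documents should be filtered out before vectorizing, since an all-empty corpus is exactly what raises "empty vocabulary":

```python
import re

def clean_tweet(text):
    # Drop whole retweets instead of trying to strip the "RT" prefix
    if text.startswith("RT "):
        return ""
    text = re.sub(r"http\S+", "", text)        # URLs
    text = re.sub(r"[@#](\w+)", "", text)      # mentions and hashtags
    # Remove all characters above the Basic Multilingual Plane,
    # which covers the emoji blocks in one sweep
    text = re.sub("[\U00010000-\U0010FFFF]", "", text)
    return text.strip()

docs = [clean_tweet(t) for t in ["RT @user: hello", "天気 がいい 😀"]]
docs = [d for d in docs if d]  # guard: drop empty docs before TfidfVectorizer
print(docs)
```

With the empty strings filtered out, vectorizer.fit_transform(docs) has at least one non-empty document to build a vocabulary from.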