How to find overall model coefficients in Gensim's latent dirichlet allocation?
I am using Latent Dirichlet Allocation (LDA) to find the topics that occur in my corpus, and I was wondering whether there is a way to get the overall model coefficients in Gensim. Here each coefficient would represent the weight of a topic in the overall model. I want to get an idea of which topics are the most relevant through these weights, in addition to looking at the topic vocabularies. Something like this is possible in R, since the model has a coefs attribute (one coefficient for each of the six topics):
lda_model$coefs
[1] -0.56446635 -0.52604353 -0.43325116 -1.88352347 0.20560172 -0.26902856
One idea I currently have is to use Gensim's get_document_topics(bow) function, where bow represents a single document converted into a bag-of-words vector. Here is the documentation: https://radimrehurek.com/gensim/models/ldamodel.html. Although this is not the same as the model coefficients I described above, it may capture a somewhat similar idea. I could call this function on every document to get each document's topic distribution, and then average the distributions across documents to get a sense of which topics are the most prevalent. However, this would not be my first choice. Please let me know if there is something along the lines of coefs
in Gensim's LDA to obtain model coefficients. Thanks!
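For reference, here is a minimal sketch of the averaging idea described above, assuming a trained LdaModel named lda and the bag-of-words corpus (as a list) it was trained on; lda and corpus are placeholder names:

import numpy as np

# Average each document's topic distribution to approximate overall topic weights.
# get_document_topics returns (topic_id, probability) pairs for one bow vector.
topic_weights = np.zeros(lda.num_topics)
for bow in corpus:
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        topic_weights[topic_id] += prob
topic_weights /= len(corpus)  # per-topic average across documents

print(topic_weights)  # one weight per topic; larger means more prevalent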
See also questions close to this topic
-
Why is my if statement always triggering in Python?
I'm new to Python and I decided to do a project by myself. In my project there is an if statement that always triggers. Also, I'm still learning PEP 8, so tell me if I violated it.
yn = input('Do you need me to explain the rules. Y/N: ').lower()
if yn == 'y' or 'yes':
    print('I will think of a number between 1 - 100.')
    print('You will guess a number and I will tell you if it is higher or lower than my number.')
    print('This repeats until you guess my number.')
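For reference, the usual reason an if like this always triggers is that the non-empty string 'yes' is truthy on its own, so yn == 'y' or 'yes' is always True. A minimal sketch of the membership test that is normally intended:

yn = input('Do you need me to explain the rules. Y/N: ').lower()
if yn in ('y', 'yes'):  # compare yn against both accepted answers
    print('I will think of a number between 1 - 100.')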
-
Dialogflow python client versioning
I am using python client for accessing dialogflow's functionality.
My question is: is there any difference between import dialogflow and import dialogflow_v2 as dialogflow?
In my experience, all the methods are the same. In the samples given by Google, import dialogflow_v2 as dialogflow is used, and I could not see any difference between the two. Please note that here I am talking about version v2 of the Python client, not the Dialogflow API version.
-
Which device do .in_waiting and .out_waiting refer to in pySerial?
I have a computer that is connected to a serial device at /dev/ttyUSB0 via a cable with USB2 and microUSB2 connectors. My script is:
ser = serial.Serial('/dev/ttyUSB0')
in_buffer = ser.in_waiting
in_data = ser.read(in_buffer)
out_buffer = ser.out_waiting
out_data = ser.read(out_buffer)
Output:
ser = {'is_open': True, 'portstr': '/dev/ttyUSB0', 'name': '/dev/ttyUSB0', '_port': '/dev/ttyUSB0', '_baudrate': 9600, '_bytesize': 8, '_parity': 'N', '_stopbits': 1, '_timeout': None, '_write_timeout': None, '_xonxoff': False, '_rtscts': False, '_dsrdtr': False, '_inter_byte_timeout': None, '_rs485_mode': None, '_rts_state': True, '_dtr_state': True, '_break_state': False, '_exclusive': None, 'fd': 6, 'pipe_abort_read_r': 7, 'pipe_abort_read_w': 8, 'pipe_abort_write_r': 9, 'pipe_abort_write_w': 10}
in_buffer = 0 <class 'int'>
in_data = b'' <class 'bytes'>
out_buffer = 0 <class 'int'>
out_data = b'' <class 'bytes'>
Do in_buffer and out_buffer refer to the number of bytes in the buffer of the computer's UART chip and of the device /dev/ttyUSB0, respectively? Why do they have a size of zero bytes?
-
Doc2Vec: get text of the label
I've trained a Doc2Vec model and I'm trying to get predictions. I use:

test_data = word_tokenize("Филип Моррис Продактс С.А.".lower())
model = Doc2Vec.load(model_path)
v1 = model.infer_vector(test_data)
sims = model.docvecs.most_similar([v1])
print(sims)
This returns:
[('624319', 0.7534812092781067), ('566511', 0.7333904504776001), ('517382', 0.7264763116836548), ('523368', 0.7254455089569092), ('494248', 0.7212602496147156), ('382920', 0.7092794179916382), ('530910', 0.7086726427078247), ('513421', 0.6893941760063171), ('196931', 0.6776881814002991), ('196947', 0.6705600023269653)]
Next, I tried to find out what the text of this label is:
model.docvecs['624319']
But it only returns the vector representation:
array([ 0.36298314, -0.8048847 , -1.4890883 , -0.3737898 , -0.00292279, -0.6606688 , -0.12611026, -0.14547637, 0.78830665, 0.6172428 , -0.04928801, 0.36754376, -0.54034036, 0.04631123, 0.24066721, 0.22503968, 0.02870891, 0.28329515, 0.05591608, 0.00457001], dtype=float32)
So, is there any way to get the text for this label from the model? Loading the training dataset takes a lot of time, so I am trying to find another way.
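For reference: gensim's Doc2Vec keeps only the tags and their learned vectors, not the original documents, so the text has to come from a lookup stored alongside the model. A minimal sketch, assuming the tags were stringified indices into a list of training texts (train_texts and tag_to_text are placeholder names):

import json

# At training time, persist a tag -> text mapping next to the model.
tag_to_text = {str(i): text for i, text in enumerate(train_texts)}
with open('tag_to_text.json', 'w', encoding='utf-8') as f:
    json.dump(tag_to_text, f, ensure_ascii=False)

# At prediction time, resolve the tags returned by most_similar.
with open('tag_to_text.json', encoding='utf-8') as f:
    tag_to_text = json.load(f)
for tag, score in sims:
    print(tag, score, tag_to_text[tag])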
-
How to assess word embeddings model accuracy and performance?
I'm new to exploring the world of word embeddings and vectors. I've trained a Skip-Gram FastText word representation model, but I'm not sure how to assess the accuracy and performance of the model given that it's unsupervised. I usually look at the loss and watch for when it begins to plateau, but other than that I have no idea how to measure its accuracy. Please help.
P.S. I've also used Gensim word2vec, but still the same issue.
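One common, if partial, check is gensim's built-in word-analogy evaluation. A minimal sketch, assuming the trained vectors are reachable as model.wv (as with gensim's Word2Vec and FastText classes) and a reasonably recent gensim version:

from gensim.test.utils import datapath

# Score the vectors on the bundled "questions-words" analogy set
# (king - man + woman ~ queen). Out-of-vocabulary questions are skipped,
# so coverage matters as well as the accuracy number.
score, sections = model.wv.evaluate_word_analogies(datapath('questions-words.txt'))
print('analogy accuracy:', score)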
-
How to handle words that are not in word2vec's vocab optimally
I have a list of ~10 million sentences, where each of them contains up to 70 words.
I'm running gensim word2vec on every word, and then taking the simple average of each sentence. The problem is that I use min_count=1000, so a lot of words are not in the vocab.
To solve that, I intersect the vocab array (which contains about 10,000 words) with every sentence, and if there's at least one element left in that intersection, it returns the simple average; otherwise, it returns a vector of zeros.
The issue is that calculating every average takes a very long time when I run it on the whole dataset, even when splitting into multiple threads, and I would like to get a better solution that could run faster.
I'm running this on an EC2 r4.4xlarge instance.
I already tried switching to doc2vec, which was way faster, but the results were not as good as word2vec's simple average.
word2vec_aug_32x = Word2Vec(sentences=sentences, min_count=1000, size=32, window=2, workers=16, sg=0)
vocab_arr = np.array(list(word2vec_aug_32x.wv.vocab.keys()))

def get_embedded_average(sentence):
    sentence = np.intersect1d(sentence, vocab_arr)
    if sentence.shape[0] > 0:
        return np.mean(word2vec_aug_32x[sentence], axis=0).tolist()
    else:
        return np.zeros(32).tolist()

pool = multiprocessing.Pool(processes=16)
w2v_averages = np.asarray(pool.map(get_embedded_average, np.asarray(sentences)))
pool.close()
If you have any suggestions of different algorithms or techniques that have the same purpose of sentence embedding and could solve my problem, I would love to read about it.
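For comparison, a minimal sketch of the same per-sentence average done with a plain set lookup instead of np.intersect1d (the sort inside intersect1d is typically the slow part; names mirror the snippet above):

vocab_set = set(word2vec_aug_32x.wv.vocab)  # O(1) membership checks

def get_embedded_average_fast(sentence):
    in_vocab = [w for w in sentence if w in vocab_set]
    if in_vocab:
        return np.mean(word2vec_aug_32x.wv[in_vocab], axis=0).tolist()
    return np.zeros(32).tolist()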
-
saving an lda model for prediction
How can I save an lda model and call it for future prediction in another function?
data = pd.DataFrame(ivector_list, index=label)
data.to_csv("ivector/Ivector.csv")
Y = data.index
X = data.reset_index(drop=True)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=50)
lda = LinearDiscriminantAnalysis(n_components=50)
model = lda.fit_transform(X_train, y_train)
predictions = lda.predict(X_test)
saved_model = pickle.dumps(model)
but when I try using saved_model to predict I get this error
pre = saved_model.predict(d1)
AttributeError: 'numpy.ndarray' object has no attribute 'predict'
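For reference, a minimal sketch of pickling the fitted estimator itself rather than the array returned by fit_transform (pickle.dumps returns bytes, so the object has to be restored with pickle.loads before predict can be called):

import pickle

lda.fit(X_train, y_train)                 # fit the estimator
saved_model = pickle.dumps(lda)           # serialise the estimator, not the transformed array

lda_restored = pickle.loads(saved_model)  # restore it in another function/process
predictions = lda_restored.predict(X_test)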
-
Issue with topic word distributions after malletmodel2ldamodel in gensim
After training an LDA model with gensim's Mallet wrapper, I converted it to a gensim LDA model via the malletmodel2ldamodel function provided with the wrapper. Before and after the conversion, the topic word distributions are quite different; the converted model returns very rare topic word distributions.

ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=13, id2word=dictionary)
model = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldamallet)
model.save('ldamallet.gensim')
dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')
corpus = pickle.load(open('corpus.pkl', 'rb'))
lda_mallet = gensim.models.wrappers.LdaMallet.load('ldamallet.gensim')

import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(lda_mallet, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)
Here is the output from gensim original implementation:
I can see there was a bug around this issue which was reportedly fixed in previous versions of gensim. I am using gensim 3.7.1.
-
How to make wordcloud from guided lda output
I have created a topic model with some initial seeds using the guidedlda package (https://github.com/vi3k6i5/GuidedLDA). It looks good. However, I now want to see the frequency distribution and a word cloud for each topic. How do I do that?
I am accessing the top 10 words in each topic like this:
>>> n_top_words = 10
>>> topic_word = model.topic_word_
>>> for i, topic_dist in enumerate(topic_word):
>>>     topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
>>>     print('Topic {}: {}'.format(i, ' '.join(topic_words)))
Topic 0: game play team win season player second point start victory
Topic 1: company percent market price business sell executive pay plan sale
Topic 2: play life man music place write turn woman old book
Topic 3: official government state political leader states issue case member country
Topic 4: school child city program problem student state study family group
However, how do I find the number of times each word appears in a topic and produce a word cloud from that? I am not sure whether this model captures the frequency of words.
Thanks in advance.
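For reference, a minimal sketch of turning each row of topic_word_ into a word cloud, assuming the wordcloud package is available; note that topic_word_ holds probabilities rather than raw counts, so the clouds reflect relative weight, not true frequency:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

for i, topic_dist in enumerate(topic_word):
    # Map every vocabulary word to its weight in this topic.
    freqs = dict(zip(vocab, topic_dist))
    wc = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(freqs)
    plt.figure()
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title('Topic {}'.format(i))
plt.show()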
-
NLP on Chatbot data
I am trying to analyse conversational chatbot data and I'd like to do topic modelling. The chatbot data comes in the form of a sequence of session IDs, the question (what is being asked of the chatbot) and the response (the chatbot's reply).
I would like to know whether there is a specific way I should process and clean the data. For example, should I combine all the questions and responses and run topic modelling on the combined text, or should I separate the questions from the responses and run topic modelling on each separately?
Has anyone done similar work, or is there a specific guide or suggestion on the best approach for dealing with such data?
Thanks in advance
-
GET topic names for each document
I am trying to do topic modelling for my documents using the example at this link: https://www.w3cschool.cn/doc_scikit_learn/scikit_learn-auto_examples-applications-topics_extraction_with_nmf_lda.html
My question: how can I know which documents correspond to which topic?
So far this is what I have done:
n_features = 1000
n_topics = 8
n_top_words = 20

with open('dataset.txt', 'r') as data_file:
    input_lines = [line.strip() for line in data_file.readlines()]
    mydata = [line for line in input_lines]

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, token_pattern='\\b\\w{2,}\\w+\\b', max_features=n_features, stop_words='english')
tf = tf_vectorizer.fit_transform(mydata)

lda = LatentDirichletAllocation(n_topics=3, max_iter=5, learning_method='online', learning_offset=50., random_state=0)
lda.fit(tf)

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)

# And to find the top topic related to each document
doc_topic = lda.transform(tf)
for n in range(doc_topic.shape[0]):
    topic_most_pr = doc_topic[n].argmax()
    print("doc: {} topic: {}\n".format(n, topic_most_pr))
The expected output would be
Doc | Assigned Topic | Words_in_assigned_topic
1   | 2              | science,humanbody,bones
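For reference, a minimal sketch of printing such a table by combining the per-document argmax with the top words of the assigned topic (names reuse the snippet above; n_words_shown is a placeholder):

n_words_shown = 3
for n in range(doc_topic.shape[0]):
    topic_most_pr = doc_topic[n].argmax()
    top_idx = lda.components_[topic_most_pr].argsort()[:-n_words_shown - 1:-1]
    words = ",".join(tf_feature_names[i] for i in top_idx)
    print("{} | {} | {}".format(n + 1, topic_most_pr, words))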
-
issues installing bigartm on windows machine
I want to build a topic model using the python package, BigARTM. Why can't I install the package on my windows machine? I am using Anaconda 3 and Python 3.7.
A colleague downloaded the master branch of BigARTM from the GitHub repository (https://github.com/bigartm/bigartm) and recompiled it for me. I have manually set the environment variables based on what is recommended by the developer so that it works with Anaconda 3 and Jupyter notebook (http://docs.bigartm.org/en/latest/installation/windows.html).
I still get the following error when I run the command "import artm".
import artm
I receive this traceback error:
import artm
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-4-513ce4935e65> in <module>
----> 1 import artm

~\AppData\Local\Continuum\anaconda3\lib\site-packages\bigartm-0.10.0-py3.7.egg\artm\__init__.py in <module>
      1 # Copyright 2017, Additive Regularization of Topic Models.
      2
----> 3 from .artm_model import ARTM, version, load_artm_model
      4 from .lda_model import LDA
      5 from .hierarchy_utils import hARTM

~\AppData\Local\Continuum\anaconda3\lib\site-packages\bigartm-0.10.0-py3.7.egg\artm\artm_model.py in <module>
     19 import tqdm
     20
---> 21 from . import wrapper
     22 from .wrapper import constants as const
     23 from .wrapper import messages_pb2 as messages

~\AppData\Local\Continuum\anaconda3\lib\site-packages\bigartm-0.10.0-py3.7.egg\artm\wrapper\__init__.py in <module>
      5 from . import exceptions
      6 from . import constants
----> 7 from . import messages_pb2 as messages
      8
      9 from .api import LibArtm

ImportError: cannot import name 'messages_pb2' from 'artm.wrapper' (C:\Users\dt297676\AppData\Local\Continuum\anaconda3\lib\site-packages\bigartm-0.10.0-py3.7.egg\artm\wrapper\__init__.py)
-
Get coefficients of a given polynomial in python
I need to find the roots of a given polynomial in python, but I'm stuck, because I need to calculate them from the polynomial, which is
-1.0*x^3 - 1.01260229801594*x^2 - 0.102692265748562*x - 0.00141373294267021
I know it is quite easy to type the coefficients in directly and find the roots for x; the issue is that this polynomial is calculated by my program, and I'd like the program to be able to get these coefficients from the polynomial after calculating it. Here is some code.
F = x*(1-x/k1)-p*x*z/(1+a*x+c*h*y)
G = y*(1-y/k2)-q*y*z/(1+a*x+c*h*y)
H = e*(p*x+c*q*y)/(1+a*x+c*h*y)-d*z

##### equilibrium point
def equations(j):
    x, y, z = j
    f1 = x*(1-x/k1)-p*x*z/(1+a*x+c*h*y)
    f2 = y*(1-y/k2)-q*y*z/(1+a*x+c*h*y)
    f3 = e*(p*x+c*q*y)/(1+a*x+c*h*y)-d*z
    return (f1, f2, f3)

aa, bb, cc = fsolve(equations, (0.08, 2.25, 3.3))
aa = "%.4f" % aa
bb = "%.4f" % bb
cc = "%.4f" % cc

##### jacobian matrix
M = sym.Matrix([F, G, H])
M.jacobian([x, y, z])
J = M.jacobian([x, y, z]).subs([(x, aa), (y, bb), (z, cc)])
J = np.array(J)
I = np.identity(3)
I_J = J - L*I
det = I_J[0,0]*(I_J[1,1]*I_J[2,2]-I_J[1,2]*I_J[2,1]) - \
      I_J[0,1]*(I_J[1,0]*I_J[2,2]-I_J[1,2]*I_J[2,0]) + \
      I_J[0,2]*(I_J[1,0]*I_J[2,1]-I_J[1,1]*I_J[2,1])
det = simplify(det)
print det
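For reference, a minimal sketch of extracting the coefficients from such a symbolic polynomial and solving for its roots, assuming det is the sympy expression built above and L is the sympy symbol it is a polynomial in:

import numpy as np
import sympy as sym

# Poly treats the expression as a polynomial in L; all_coeffs() returns the
# coefficients from the highest power down to the constant term.
coeffs = [float(c) for c in sym.Poly(det, L).all_coeffs()]

roots = np.roots(coeffs)  # numerical roots of the characteristic polynomial
print(coeffs)
print(roots)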
-
How to format line on scatter plot
I need the line to reach the end corners of the graph, and also need to change the color of the line to red. My code:
plt.figure()
plt.subplot(211)
plt.scatter(df.NOX, df.PTRATIO)
b, m = polyfit(x, y, 1)
plt.plot(x, y, '.')
plt.plot(x, b + m * x, '-')
plt.show()
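For reference, a minimal sketch of drawing the fitted line in red across the full x-range of the plot (this assumes x and y are the arrays passed to polyfit; 'r-' sets a solid red line):

import numpy as np

x_line = np.linspace(x.min(), x.max(), 100)  # span the whole data range
plt.plot(x_line, b + m * x_line, 'r-')       # red fitted line
plt.xlim(x.min(), x.max())                   # make the line reach the plot edges
plt.show()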
I also need to do the same for this scatter plot (Scatterplot 2). The code I used:
X_prime = np.linspace(df.PTRATIO.min(), df.PTRATIO.max(), 100)
X_prime = sm.add_constant(X_prime)
y_hat = lr_model1.predict(X_prime)

plt.figure()
plt.subplot(211)
plt.scatter(df.NOX, df.PTRATIO)
plt.subplot(212)
plt.scatter(X.NOX, y)
plt.plot(X_prime[:,1], y_hat, 'red', alpha=0.9)
The output is inaccurate for Scatterplot 2 (see the Scatterplot 2 output screenshot).
Sample data: see the Sample Data screenshot.
-
How to call individual coefficients when they are given as an array?