How to find overall model coefficients in Gensim's latent dirichlet allocation?
I am using Latent Dirichlet Allocation (LDA) to find topics that occur in my corpus, and I was wondering if there is a way to get the overall model coefficients in Gensim. Here each coefficient would represent the weight of a topic in the overall model. I want to get an idea of which topics are the most relevant through these weights, in addition to looking at topic vocabularies. Something like this is possible in R, since the model has a coefs
attribute (a coefficient for each of the six topics):
lda_model$coefs
[1] 0.56446635 0.52604353 0.43325116 1.88352347 0.20560172 0.26902856
One idea I currently have is to use Gensim's get_document_topics(bow)
function, where bow
represents a single document converted into a bag-of-words vector. Here is the documentation: https://radimrehurek.com/gensim/models/ldamodel.html. Although this is not the same as the model coefficients I was talking about above, it may capture a somewhat similar idea. I can call this function on every document to get each document's topic distribution, and then average the topic distributions across documents to get a sense of which topics are the most prevalent. However, this would not be my first choice. Please let me know if there is something along the lines of coefs
in Gensim's LDA to obtain model coefficients. Thanks!
See also questions close to this topic

Numpy view contiguous part of noncontiguous array as dtype of bigger size
I was trying to generate an array of trigrams (i.e. continuous three-letter combinations) from a super long char array:
import numpy as np
# data is actually loaded from a source file
a = np.random.randint(0, 256, 2**28, 'B').view('c')
Since making copy is not efficient (and it creates problems like cache miss), I directly generated the trigram using stride tricks:
tri = np.lib.stride_tricks.as_strided(a, (len(a)-2, 3), a.strides*2)
This generates a trigram list with shape
(2**28-2, 3)
where each row is a trigram. Now I want to convert the trigrams to a list of strings (i.e. S3
) so that numpy displays them more "reasonably" (instead of individual chars):
tri = tri.view('S3')
It gives the exception:
ValueError: To change to a dtype of a different size, the array must be C-contiguous
I understand that in general the data should be contiguous in order to create a meaningful view, but this data is contiguous "where it should be": the elements of each trigram are contiguous.
So I'm wondering: how can I view a contiguous part of a non-contiguous np.ndarray
as a dtype of bigger size? A more "standard" way would be better, while hackish ways are also welcome. It seems that I can set shape
and strides
freely with np.lib.stride_tricks.as_strided
, but I can't force the dtype
to be something, which is the problem here.
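One possible workaround, sketched on a tiny stand-in array rather than the full 2**28-byte buffer: the np.ndarray constructor accepts dtype and strides together, so the overlapping 3-byte windows can be built in one step without a separate view call:

```python
import numpy as np

a = np.frombuffer(b"abcdefgh", dtype='c')  # small stand-in for the long char array
# build overlapping 3-byte views directly with the ndarray constructor:
# same underlying buffer, itemsize 3, but a stride of just 1 byte
tri = np.ndarray(shape=(len(a) - 2,), dtype='S3', buffer=a, strides=(1,))
print(tri)  # [b'abc' b'bcd' b'cde' b'def' b'efg' b'fgh']
```

No data is copied; each element is a 3-byte window into the original buffer, which is exactly what as_strided produces row-wise but with the S3 dtype applied up front.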
I am passing a list from Python to JavaScript in Flask. I want the output as an Object in JavaScript
My python function :
def somefunction(self):
    x1 = ['reduced', 'fully automatic', 'years']
    return x1
In Flask app.py file
keyword = somefunction()
JavaScript in html:
<script>
var javaword = '{{ keyword }}';
somefunction {
    alert(typeof(javaword));
    alert(javaword);
}
</script>
Output:
String
['reduced', 'fully automatic', 'years']
I want the output as
Object ['reduced','fully automatic','years']
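A minimal sketch of one common fix, assuming a Flask/Jinja2 template: the tojson filter serializes the list server-side, so the template emits a JavaScript array literal instead of a quoted string (the route and variable names here are made up for illustration):

```python
from flask import Flask, render_template_string

app = Flask(__name__)

@app.route("/")
def index():
    keyword = ['reduced', 'fully automatic', 'years']
    # tojson emits a JSON literal; with no surrounding quotes in the
    # template, the browser parses it as a JavaScript array (Object)
    return render_template_string(
        "<script>var javaword = {{ keyword | tojson }};</script>",
        keyword=keyword,
    )
```

The key detail is dropping the quotes around {{ ... }} in the template: quoting the expression always yields a string, no matter what Python object was passed in.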

Fast RBF kernel calculations in python?
I would like to compute the similarity (Gram) matrix of dimensions 20,000 x 20,000 with an RBF kernel function over 4,800-dimensional feature vectors. It takes too long (more than 1 day) on my PC with Python. Is there any way to speed up this calculation?
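One way to speed this up substantially (a sketch on a smaller random matrix): compute all pairwise squared distances with a single matrix product using ||x-y||^2 = ||x||^2 + ||y||^2 - 2*x.y, then exponentiate, avoiding any Python-level double loop:

```python
import numpy as np

def rbf_gram(X, gamma):
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, via one matrix product
    sq = np.einsum('ij,ij->i', X, X)          # squared norms of the rows
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.maximum(d2, 0, out=d2)                 # clip tiny negatives from rounding
    return np.exp(-gamma * d2)

X = np.random.rand(500, 48)                   # stand-in for 20,000 x 4,800
K = rbf_gram(X, gamma=1.0 / X.shape[1])
```

sklearn.metrics.pairwise.rbf_kernel implements the same vectorized computation; for 20,000 x 4,800 input, doing the matrix product in float32 (or in blocks) cuts memory and time further.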

Is there a way to get the relationship from 'GloVe' word2vec?
I am using GloVe vectors through Gensim's word2vec module, and I can use it to return the similarity score between entities: for example, 'man'
and 'woman'
will return 0.89034
. But is there a way to return the semantic relationship between two entities? For example, given the word 'people'
and a 'location'
, the result should be the relationship 'lives_in'
. I can do something like:
print(model.most_similar(positive=['king', 'woman'], negative=['man']))
Output is:
[('queen', 0.775162398815155), ('prince', 0.6123066544532776), ('princess', 0.6016970872879028), ('kings', 0.5996100902557373), ('queens', 0.565579891204834), ('royal', 0.5646308660507202), ('throne', 0.5580971240997314), ('Queen', 0.5569202899932861), ('monarch', 0.5499411821365356), ('empress', 0.5295248627662659)]
Desired output:
[('is_a', 0.3223), ('same_as', 0.349230), ('people', 0.302432), ...]
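GloVe/word2vec models do not store labeled relations such as 'lives_in'; the closest built-in notion is the offset between word vectors, which is exactly what the most_similar analogy above exploits. Two word pairs share a relation when their offsets point the same way, as this sketch with hypothetical toy vectors (the values below are invented for illustration) shows:

```python
import numpy as np

# toy embeddings, chosen so that king - man equals queen - woman
vec = {
    "king":  np.array([1.0, 1.0]),
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([0.0, 0.0]),
    "queen": np.array([0.0, 1.0]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# a relation exists only implicitly, as an offset between word vectors
royal = vec["king"] - vec["man"]
print(cosine(royal, vec["queen"] - vec["woman"]))  # 1.0
```

Recovering a named relation like 'lives_in' requires a relation-extraction model trained on labeled pairs; the embeddings alone only provide these unlabeled offset directions.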

Getting error: AttributeError: 'list' object has no attribute 'lower' while vectorizing with sklearn
I am trying to perform topic modeling using sklearn and gensim: first vectorizing the documents with CountVectorizer, then converting the vocabulary into an id2word mapping and running Gensim's LDA, something like this:
vect = CountVectorizer(min_df=5, max_features=25000)
corpus_vect = vect.fit_transform(docs)
vocab = vect.get_feature_names()
id2word_dic = dict([(i, s) for i, s in enumerate(vocab)])
corpus_vect_gensim = gensim.matutils.Sparse2Corpus(corpus_vect, documents_columns=False)
model = models.LdaMulticore(corpus=corpus_vect_gensim, id2word=id2word_dic,
                            num_topics=n_topics, chunksize=10000,
                            passes=lda_no_passes, iterations=lda_n_iter)
model.fit(pmi_cal.d2w_)
but I am getting the following error:
return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'
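The error arises because CountVectorizer expects raw strings and calls .lower() on each document; if docs is a list of token lists instead, one possible fix (sketched here on made-up documents) is to pass a pass-through analyzer so the tokens are used as-is:

```python
from sklearn.feature_extraction.text import CountVectorizer

# already-tokenized documents (invented for illustration)
docs_tokenized = [["topic", "modeling", "rocks"], ["gensim", "and", "sklearn"]]

# a callable analyzer bypasses the built-in string preprocessing entirely,
# so CountVectorizer never tries to call .lower() on a list
vect = CountVectorizer(analyzer=lambda tokens: tokens)
X = vect.fit_transform(docs_tokenized)
print(X.shape)
```

The alternative is to join each token list back into one string (" ".join(tokens)) before vectorizing; either way, the min_df/max_features filtering still applies.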

GloVe most similar to multiple words
I am supposed to do some exercises with Python and GloVe; most of it doesn't give me any problems, but now I am supposed to find the 5 most similar words to "norway - war + peace" using the "glove-wiki-gigaword-100" package. But when I run my code it just says that the 'word' is not in the vocabulary. I'm guessing that this is some kind of formatting issue, but I don't know how to fix it.
import gensim.downloader as api
model = api.load("glove-wiki-gigaword-100")  # download the model and return as object ready for use
bests = model.most_similar("norway - war + peace", topn=5)
print("5 most similar words to 'norway - war + peace':")
for best in bests:
    print(best)

How to extract word terms from LDA model in topic modeling?
Below are the codes I have tried to extract and generate the word terms without their probabilities:
Lda = gensim.models.ldamodel.LdaModel
ldamodel = Lda(num_topics=topic_num, id2word=dictionary, passes=20)
for i in range(0, ldamodel.num_topics):
    with open('output_file.txt', 'w') as outfile:
        outfile.write('{}\n'.format('Topic #' + str(i + 1) + ': '))
        for word, prob in ldamodel.show_topic(i, topn=10):
            outfile.write('{}\n'.format(word.encode('utf8')))
        outfile.write('\n')
The output of the outfile.write is:
Topic #5: b'barrier' b'parenteral' b'short' b'salt' b'influenza' b'resume' b'vital' b'taken' b'turning' b'decrease'
The desired output I would like to achieve is:
Topic #5: 'barrier' 'parenteral' 'short' 'salt' 'influenza' 'resume' 'vital' 'taken' 'turning' 'decrease'
The issue here is I don't know why there is a "b" in front of each word when I save the word terms to the text file... Please help take a look at the codes!
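The b prefix is simply the repr of a bytes object: formatting word.encode('utf8') into a string renders it as b'...'. Writing the unencoded string into the text-mode file avoids it, as this small sketch shows:

```python
word = "barrier"

# str.format() on a bytes object produces its repr, including the b prefix
as_bytes = '{}'.format(word.encode('utf8'))
# formatting the string itself (the file is opened in text mode) avoids it
as_text = '{}'.format(word)

print(as_bytes)  # b'barrier'
print(as_text)   # barrier
```

So in the loop above, outfile.write('{}\n'.format(word)) gives the desired output; if a specific encoding is needed, open the file with open('output_file.txt', 'w', encoding='utf8') instead of encoding each word.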

how to get the topic-document matrix from cvb0_local output in mahout?
I know that we can get the topic-term matrix using
vectordump
as follows:
mahout vectordump -i /lda-topics/part-m-00000 --dictionary /tf/dictionary.file-0 --vectorSize 5 -dt sequencefile
But is there a way to get the topic-document matrix in a text format? I know that the output of
cvb0_local
has an option -do
which outputs p(t|d)
, but I do not seem to find a way to read the output in a text format. Any insights appreciated.
TSNE color each cluster
At first, I use the LDA model to separate my data into ten topics:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=10, max_iter=10, n_jobs=1, random_state=95865)
lda_trans = lda.fit_transform(X_tfidf)
In order to visualize the topic, I use TSNE to make my data into 2D,
topic_proportions = lda.transform(X_tfidf)[:1000]
from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, learning_rate=800, angle=.99, init='pca')
topic_tsne_lda = tsne_model.fit_transform(topic_proportions)
plt.scatter(topic_tsne_lda[:, 0], topic_tsne_lda[:, 1])
And the scatter plot looks like this:
My question is: how can I color each cluster differently? Thanks!!
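One common approach, sketched here with random stand-ins for the LDA output and the t-SNE embedding: color each point by its dominant topic, i.e. the argmax of its topic-proportion row, passed to scatter's c= argument:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for the sketch
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
topic_proportions = rng.dirichlet(np.ones(10), size=200)  # stand-in for lda.transform(...)
points_2d = rng.normal(size=(200, 2))                     # stand-in for the t-SNE output

labels = topic_proportions.argmax(axis=1)  # dominant topic per document
plt.scatter(points_2d[:, 0], points_2d[:, 1], c=labels, cmap="tab10")
plt.colorbar(label="dominant topic")
plt.savefig("tsne_topics.png")
```

With the real data, labels = topic_proportions.argmax(axis=1) on the lda.transform output and c=labels in the existing plt.scatter call is all that changes; a qualitative colormap like tab10 keeps the ten topics visually distinct.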

How do I extract only the topic terms in topic modeling?
Below are the codes I have tried for topic modeling. Please focus on the lines marked with the comment "please focus on the codes here". Thanks!!
df = pd.read_csv(output_cat6.txt, sep='')
df_content = df[['content']]
df_content['cat_id'] = df['cat_id'].drop_duplicates()
documents = df_content

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(token)
    return result

doc_sample = documents[documents.index == 23].values[0][0]
# print('original document: ')
words = []
for word in doc_sample.split(' '):
    words.append(word)
processed_docs = documents['content'].map(preprocess)
dictionary = gensim.corpora.Dictionary(processed_docs)
# for i in dictionary.iteritems():
#     print(i)
count = 0
dictionary.filter_extremes(no_below=2, no_above=0.5, keep_n=100000)
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_doc = bow_corpus[23]
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

# please focus on the codes here
topic_num = 5
word_num = 5
Lda = gensim.models.ldamodel.LdaModel
ldamodel = Lda(num_topics=topic_num, id2word=dictionary, passes=20)
# ldamodel_tfidf = Lda(corpus_tfidf, num_topics=topic_num, id2word=dictionary, passes=20)
ldamodel_tfidf = Lda(num_topics=topic_num, id2word=dictionary)
for index, topic in ldamodel_tfidf.print_topics(-1):
    print(topic)
The output of print(topic) is:
0.003*"rate" + 0.003*"impairment" + 0.003*"social" + 0.003*"depth" + 0.003*"sputum" + 0.003*"obvious" + 0.003*"cholesterol" + 0.003*"awake" + 0.003*"breathing" + 0.003*"surgical"
0.003*"situ" + 0.003*"orientated" + 0.003*"haematemesis" + 0.003*"weight" + 0.003*"sugar" + 0.003*"slight" + 0.003*"water" + 0.003*"malnutrition" + 0.003*"according" + 0.003*"open"
0.003*"additional" + 0.003*"tracheostomy" + 0.003*"effects" + 0.003*"asleep" + 0.003*"bladder" + 0.003*"urgency" + 0.003*"absence" + 0.003*"dependent" + 0.003*"median" + 0.003*"need"
0.003*"alzheimer" + 0.003*"walk" + 0.003*"clear" + 0.003*"tong" + 0.003*"abnormal" + 0.003*"reathing" + 0.003*"melena" + 0.003*"ufeme" + 0.003*"rate" + 0.003*"notes"
0.003*"haematemesis" + 0.003*"vitals" + 0.003*"need" + 0.003*"pulse" + 0.003*"culture" + 0.003*"slowly" + 0.003*"drowsiness" + 0.003*"wound" + 0.003*"swallowing" + 0.003*"situ"
The desired output I would like to achieve is:
rate impairment social depth sputum obvious cholesterol awake breathing surgical situ orientated haematemesis weight sugar slight water malnutrition according open...
The objective is to extract only the topic terms in topic modeling, without the probability of each word.

View html file in github repo?
I have a topic-modeling visualization, created with a Python package, saved to an HTML file in my GitHub repository. I tried to open it using:
1. The http://htmlpreview.github.io/ website:
http://htmlpreview.github.io/?https://github.com/parvathysarat/wordpressblogtextmining/blob/master/topicmodeling_vis.html doesn't work, no display.
2. https://rawgit.com: 403 Forbidden. Not opening new repositories.
Is there an alternative way? It's a public repo.
URL: https://github.com/parvathysarat/wordpressblogtextmining/blob/master/index.html

Pairwise similarity coefficient
I would like to find a similarity coefficient among several variables. After long reading I could not find the answer I wanted. Coefficients like Jaccard's or Cramer's V are not suitable for my data, which is made of variables of 100 observations each that can take one single value (like the value 0.5 repeated 100 times), two values (like 40 observations with the value 0.1 and 60 with the value 0.2), or multiple values. What I am seeking is a pairwise correlation matrix giving a similarity coefficient of the face values for each pair. For example, case 1: two variables, the value 0.1 repeated 100 times in X1 versus the value 0.2 repeated 100 times in X2, should return the weakest possible similarity coefficient; case 2: the value 0.1 repeated 100 times in both X1 and X2 should return the strongest possible similarity coefficient; and so on. The problem is that my variables can take one single value each, or two values, or multiple values. Is there any similarity coefficient, and possibly an R package, that could get the job done? Thanks for reading and possibly for answering. Al

Plot coefficients from multiple models using plot_model
I would like to plot regression coefficients with a dot-and-whisker plot using the plot_model function. Is it possible to plot multiple models (overlapping, not side-by-side) using this function? I couldn't get either of the following to work:
plot_model(c(m1, m2))
plot_model(list(m1, m2))
where m1 and m2 are mixed effects model summaries.
I know that I can use dwplot() to do this, but I find the function very finicky and much prefer to use plot_model. Perhaps someone has a quick fix.

How should I interpret the path coefficients obtained from running plspm() in R (PLS Path Modeling)?
I am conducting my first PLS Path Modeling analysis. I am using the plspm package in R and following the guidelines provided in Sanchez, G. (2013), PLS Path Modeling with R, Trowchez Editions, Berkeley.
The data I am working with is customer satisfaction data on a 5-point Likert scale.
Now, I wish to know if I can interpret a path coefficient of, say, 0.25 in the following manner: If the latent variable X increases by 1 point, the Customer Satisfaction is expected to increase by 0.25 points on average. In other words, can I make the usual linear regression interpretation?
My concern is that the path coefficients are actually standardized (automatically) and that I need to conduct some rescaling prior to interpretation.
Thank you in advance, Ronja