How to find overall model coefficients in Gensim's Latent Dirichlet Allocation?
I am using Latent Dirichlet Allocation (LDA) to find topics that occur in my corpus, and I was wondering whether there is a way to get the overall model coefficients in Gensim. Here each coefficient would represent the weight of a topic in the overall model. I want to get an idea of which topics are the most relevant through these weights, in addition to looking at topic vocabularies. Something like this is possible in R, since the model has a coefs attribute (a coefficient for each of the six topics):
lda_model$coefs
-0.56446635 -0.52604353 -0.43325116 -1.88352347  0.20560172 -0.26902856
One idea I have currently is to use Gensim's get_document_topics(bow) function, where bow represents a single document converted into a bag-of-words vector. Here is the documentation: https://radimrehurek.com/gensim/models/ldamodel.html. Although this is not the same as the model coefficients I was talking about above, it may capture a somewhat similar idea. I could call this function on every document to get each document's topic distribution, then average the distributions across documents, per topic, to get a sense of which topics are the most prevalent. However, this would not be my first go-to option. Please let me know if there is something along the lines of coefs in Gensim's LDA to obtain model coefficients. Thanks!
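For what it's worth, a minimal sketch of the averaging idea described above, assuming a trained Gensim model lda and a list of bag-of-words vectors corpus already exist (both names are placeholders):

import numpy as np

# Accumulate each topic's probability over all documents, then average.
topic_weights = np.zeros(lda.num_topics)
for bow in corpus:
    # minimum_probability=0.0 makes Gensim return every topic for each document
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        topic_weights[topic_id] += prob
topic_weights /= len(corpus)  # average topic prevalence across the corpus
print(topic_weights)

Each entry then estimates the overall prevalence of one topic, which is roughly comparable to the kind of per-topic weight the R coefs attribute provides.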
See also questions close to this topic
Python Kafka Streaming API - Binning
I am using the Python Kafka streaming binning example given in this: Python Kafka Streaming API.
I am able to generate the data using the generator.py file given under winton-kafka-streams/examples/binning/, but when I run the binning.py file from the same folder, I get the issue below. Could someone help me resolve this?
Change color of missing values in Seaborn heatmap
Consider the example of missing values in the Seaborn documentation:
corr = np.corrcoef(np.random.randn(10, 200))
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, mask=mask, vmax=.3, square=True)
How do I change the color of the missing values to, for example, black? The color of the missing values should be specified independently of the heatmap's color scheme, since it may not be present in that scheme.

I tried adding facecolor = 'black', but that didn't work. The color can be affected by e.g. sns.axes_style("white"), but it isn't clear to me how that can be used to set an arbitrary color.
Xpath + Scrapy + Python : data point couldn't be scraped
This is the XML structure:
<tr>
  <td>
    <font size="3">
      <strong>Location:</strong>
      Hiranandani Gardens, Powai
    </font>
  </td>
</tr>
I want to extract : Hiranandani Gardens, Powai
I tried with these:
Both returned an empty list.
Note: we must use the text of the strong tag, i.e., "Location:". Otherwise, there are many other places on the site where the same XML structure is used, so the query would fetch many unnecessary values apart from the desired one if the text of the strong tag is not used.
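A hedged sketch of one way to do this in Scrapy, anchoring on the "Location:" label and selecting the text node that follows the strong element; response is the usual Scrapy response object, and the selector may need adjusting to the real page:

# Select the text node immediately following the <strong>Location:</strong> tag
location = response.xpath(
    '//strong[contains(text(), "Location:")]/following-sibling::text()'
).extract_first()
if location:
    location = location.strip()  # "Hiranandani Gardens, Powai"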
load a file with only its extension name
I would like to load a file by only its extension name in gensim.
A normal code would be this:
model = gensim.models.word2vec.Word2Vec.load("news.bin")
But I would like it to automatically open any file with the ".bin" extension:
model = gensim.models.word2vec.Word2Vec.load(***I would like to change this part to only load any .bin***)
It can be "news.bin", "file.bin" or "guess.bin". As long as it load only the extension. Thank you.
Gensim: How to extract words co-occurrence?
I am trying to use a text corpus file (one sentence per line) to extract word co-occurrences from it, in order to use them in later processing. How can I extract (statistical) word co-occurrences from a large corpus file using gensim, and how can I use them later?
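As far as I know, gensim does not expose a direct co-occurrence-counting API (its Phrases model detects collocations, which is related but different), so here is a plain-Python sketch that counts, for a one-sentence-per-line file, how often each pair of words appears in the same sentence; the file name is an assumption:

from collections import Counter
from itertools import combinations

cooc = Counter()
with open("corpus.txt", encoding="utf-8") as f:  # hypothetical file name
    for line in f:
        # Deduplicate and sort so each unordered pair is counted once per sentence
        tokens = sorted(set(line.strip().lower().split()))
        for pair in combinations(tokens, 2):
            cooc[pair] += 1

print(cooc.most_common(10))  # the ten most frequent co-occurring word pairs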
Gensim Doc2Vec Usage
I have what I suspect is a bit of a naive question about Gensim's Doc2Vec usage.
In all the tutorials I have worked through, such as this one, we always end up with a scenario where we compare a test document with the existing corpus to find its similarity with a document in that corpus. Like so:
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)
model.build_vocab(train_corpus)
inferred_vector = model.infer_vector(test_corpus[doc_id])
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
But surely the most common problem (and certainly mine) is where you have two test documents and you want to determine how similar they are to each other using a pretrained model. At first I thought I could just work around this by updating the model with one of the unseen documents (they are both user-generated, so I can't preload them), but this appears to be an open issue.
So my question is, how can I do:
model.similarity(unseen_doc1, unseen_doc2) # --> some score
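A hedged sketch of one way to get such a score without retraining: infer a vector for each unseen document with the already-trained model, then compute cosine similarity directly (model is the trained Doc2Vec; the token lists are hypothetical):

import numpy as np

# infer_vector expects a list of tokens, not a raw string
v1 = model.infer_vector(unseen_doc1_tokens)
v2 = model.infer_vector(unseen_doc2_tokens)

# Cosine similarity between the two inferred vectors
score = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(score)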
How can I write the resulting new representation of data, using LDA, from WEKA to an arff file?
Linear Discriminant Analysis (LDA) is a supervised dimensionality reduction algorithm, so from d-dimensional input data we want to obtain p new dimensions, where d >> p. It is the same principle as PrincipalComponents (PCA), except that the latter is unsupervised. I wanted to store the result of LDA (meaning the new p-dimensional data after performing LDA) in an arff file, but I don't know how. With PCA, in contrast, I can do it: I can store the new representation of the data (I used it as a filter). Can anyone tell me how to do this, please?
Thank you in advance.
How to get all the keywords based on topic using topic modeling?
I'm trying to segregate topics using LDA topic modeling.
Here, I'm able to fetch the top 10 keywords for each topic. Instead of getting only the top 10 keywords, I'm trying to fetch all the keywords for each topic.
Can anyone please advise?
from gensim.models import ldamodel
import gensim.corpora
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation
import warnings

warnings.filterwarnings("ignore", category=DeprecationWarning)

def load_data(filename):
    reviews = list()
    labels = list()
    with open(filename, encoding='utf-8') as file:
        file.readline()  # skip the header line
        for line in file:
            # split into label and review text; CountVectorizer expects strings
            label, review = line.strip().split(' ', 1)
            labels.append(label)
            reviews.append(review)
    return reviews

data = load_data('/Users/abc/dataset.txt')
#print("Data:", data)

def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

no_features = 1000
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
tf = tf_vectorizer.fit_transform(data)
tf_feature_names = tf_vectorizer.get_feature_names()

no_topics = 5
lda = LatentDirichletAllocation(n_topics=no_topics, max_iter=5, learning_method='online', learning_offset=50., random_state=0).fit(tf)

no_top_words = 10
display_topics(lda, tf_feature_names, no_top_words)
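A minimal sketch of one way to adapt the display_topics function above so it prints every vocabulary term per topic, sorted by descending weight, rather than only the top 10:

def display_all_topic_terms(model, feature_names):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % topic_idx)
        # argsort ascending, then reverse to list all terms by descending weight
        print(" ".join(feature_names[i] for i in topic.argsort()[::-1]))

display_all_topic_terms(lda, tf_feature_names)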
Can we give a large dataset to LDA?
Actually, I have a large dataset of 1.6 GB. It contains generic data, so I thought of segregating the data based on topics.
So, I have used LDA for this, but it is taking a very long time to produce the output.
For smaller data, the performance is good; for larger data, the performance suffers.
Can anyone suggest how to handle this? Sample data (one sentence per line):
I love food. I like cricket. I like reading books. Her samsung mobile is not great.
I do not like tennis.
My code:
from gensim.models import ldamodel
import gensim.corpora
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation
import warnings

warnings.filterwarnings("ignore", category=DeprecationWarning)

def load_data(filename):
    reviews = list()
    with open(filename, encoding='utf-8') as file:
        file.readline()  # skip the header line
        for line in file:
            # keep each review as one string; CountVectorizer expects strings
            reviews.append(line.strip())
    return reviews

data = load_data('/Users/abc/dataset.txt')
#print("Data:", data)

def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

no_features = 100
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
tf = tf_vectorizer.fit_transform(data)
tf_feature_names = tf_vectorizer.get_feature_names()

no_topics = 2
lda = LatentDirichletAllocation(n_topics=no_topics, max_iter=5, learning_method='online', learning_offset=50., random_state=0).fit(tf)

no_top_words = 10
display_topics(lda, tf_feature_names, no_top_words)
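For a corpus this large, one hedged option is to avoid loading everything into memory and use Gensim's LdaMulticore with a streamed corpus. This sketch assumes the same one-sentence-per-line file; the workers count is an arbitrary choice:

import gensim
from gensim import corpora

class StreamedCorpus:
    """Yield bag-of-words vectors one line at a time instead of loading the file."""
    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        with open(self.path, encoding='utf-8') as f:
            for line in f:
                yield self.dictionary.doc2bow(line.strip().split())

path = '/Users/abc/dataset.txt'
dictionary = corpora.Dictionary(
    line.strip().split() for line in open(path, encoding='utf-8')
)
corpus = StreamedCorpus(path, dictionary)
lda = gensim.models.LdaMulticore(corpus, id2word=dictionary, num_topics=2, workers=3)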
cluster topic modelling distribution by date
I have the following topic distribution over time:
What I'm trying to achieve is to create time clusters according to the topic distribution. E.g., intuitively one could say there's a cluster within 2012-2013, when topic_005 dominates, and another between 2017-2018, when the distribution has settled.
My data is already presented as a distribution (%) over time, like this (the third filter is there to simplify the output for this question, and so is using years as the time unit):
library(tidyverse)  # for the pipe operator and other functions used in this code
library(lubridate)  # for floor_date()

topic_scores_tidy %>%
  filter( !is.na(fecha) ) %>%
  filter( score > 0.3 ) %>%
  filter( topic %in% c("topic_001","topic_004","topic_009") ) %>%
  group_by( date = floor_date(fecha, unit = "year"), topic ) %>%
  summarize( n = n() ) %>%
  mutate( percFreq = n / sum(n) ) %>%
  spread( topic, percFreq )

# A tibble: 10 x 5
# Groups: date
   date           n topic_001 topic_004 topic_009
   <date>     <int>     <dbl>     <dbl>     <dbl>
 1 2011-01-01     1     1        NA        NA
 2 2014-01-01     2     1        NA        NA
 3 2015-01-01     2     0.5       0.5      NA
 4 2016-01-01     4    NA         0.308     0.308
 5 2016-01-01     5     0.385    NA        NA
 6 2017-01-01    11    NA        NA         0.297
 7 2017-01-01    13     0.351     0.351    NA
 8 2018-01-01     6     0.182    NA        NA
 9 2018-01-01     8    NA         0.242    NA
10 2018-01-01    19    NA        NA         0.576
I'm not sure what the output should look like, but I guess it would be something like date ranges.
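Not an R answer, but a hedged Python sketch of the general idea: treat each year's topic distribution as a vector and cluster the years with k-means. The matrix below is a simplified, hypothetical reshaping of the tibble above (NA replaced by 0, duplicate year rows merged):

import numpy as np
from sklearn.cluster import KMeans

years = [2011, 2014, 2015, 2016, 2017, 2018]
# rows: years; columns: topic_001, topic_004, topic_009 (illustrative values)
dist = np.array([
    [1.0,   0.0,   0.0  ],
    [1.0,   0.0,   0.0  ],
    [0.5,   0.5,   0.0  ],
    [0.385, 0.308, 0.308],
    [0.351, 0.351, 0.297],
    [0.182, 0.242, 0.576],
])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(dist)
for year, label in zip(years, labels):
    print(year, "-> cluster", label)

Contiguous years sharing a label would then form the date ranges.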
Python: fitting a curve with coefficient errors
I need to fit a curve to a set of data and also need the uncertainties (errors) of the coefficients. For example:
Fitting ax^2 + bx + c, I need the values a ± da, b ± db, and c ± dc, where da, db, and dc are the uncertainties.
I already tried polyfit and optimize.curve_fit, but neither of them gives me the uncertainty the way I want. Does anyone know how to do that?
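A minimal sketch, for what it's worth: scipy.optimize.curve_fit returns the covariance matrix of the fitted parameters, and the square roots of its diagonal are commonly taken as the one-sigma uncertainties:

import numpy as np
from scipy.optimize import curve_fit

def parabola(x, a, b, c):
    return a * x**2 + b * x + c

# hypothetical noisy data
x = np.linspace(-5, 5, 50)
y = 2.0 * x**2 - 1.0 * x + 0.5 + np.random.normal(scale=0.5, size=x.size)

popt, pcov = curve_fit(parabola, x, y)
perr = np.sqrt(np.diag(pcov))  # da, db, dc
for name, value, error in zip("abc", popt, perr):
    print("%s = %.4f +/- %.4f" % (name, value, error))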
Correlation coefficients for spatial polygons data frame
I have a spatial polygons data frame and I am interested in a matrix of correlation coefficients for my variables. The command
returns the following error:
Error in cor(MergedData) : supply both 'x' and 'y' or a matrix-like 'x'
I can get pairwise coefficients if I run the following command
However, since I have 15 variables, I would need to run over 200 commands. Is there a way to do this faster, i.e., return a matrix of correlation coefficients all in one table?
Thanks in advance!
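Not an R answer, but a hedged Python analogue with geopandas: drop the geometry column and let pandas compute the whole correlation matrix in one call (the file name is hypothetical):

import geopandas as gpd

gdf = gpd.read_file("merged_data.shp")  # hypothetical shapefile
# Keep only the numeric attribute columns, then correlate them all at once
corr_matrix = gdf.drop(columns="geometry").select_dtypes("number").corr()
print(corr_matrix)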
How to interpret coefficients and intercepts of logistic regression
I am running the Iris data set... I have four features and one target variable. I am getting only three intercepts instead of four; please also explain coef_ in this case.
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(train_x, train_y)

lr.coef_
# output:
array([[ 0.37158254,  1.35098324, -2.09936396, -0.93263471],
       [ 0.46758048, -1.57259888,  0.39692171, -1.0678223 ],
       [-1.52865509, -1.43245908,  2.30484329,  2.08586834]])

lr.intercept_
# output:
array([ 0.23818179,  1.0298293 , -1.04654308])
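For reference, a brief sketch of why three intercepts appear: Iris has three classes, and scikit-learn fits one set of weights per class (one-vs-rest or multinomial, depending on settings), so coef_ has shape (n_classes, n_features) and intercept_ has shape (n_classes,):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # 4 features, 3 classes
clf = LogisticRegression(max_iter=200).fit(X, y)

print(clf.coef_.shape)       # (3, 4): one row of four feature weights per class
print(clf.intercept_.shape)  # (3,): one intercept per class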