Why does Doc2vec give 2 different vectors for the same texts?
I am using Doc2vec to get vectors for words. Please see my code below:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

f = open('test.txt', 'r')
trainings = [TaggedDocument(words=data.strip().split(","), tags=[i])
             for i, data in enumerate(f)]

model = Doc2Vec(vector_size=5, epochs=55, seed=1, dm_concat=1)
model.build_vocab(trainings)
model.train(trainings, total_examples=model.corpus_count, epochs=model.epochs)
model.save("doc2vec.model")

model = Doc2Vec.load('doc2vec.model')
for i in range(len(model.docvecs)):
    print(i, model.docvecs[i])
I have a test.txt file whose content is 2 lines, and the contents of those 2 lines are the same (both are "a").
I trained with doc2vec and got the model, but the problem is that although the contents of the 2 lines are the same, doc2vec gave me 2 different vectors:
0 [ 0.02730868  0.00393569 -0.08150548 -0.04009786 -0.01400406]
1 [ 0.03916578 -0.06423566 -0.05350181 -0.00726833 -0.08292392]
I don't know why this happened. I thought these vectors would be the same. Can you explain this? And if I want to get the same vectors for the same words, what should I do in this case?
There is inherent randomness in the Doc2Vec (and Word2Vec) algorithm: for example, the initial vectors are random already and differ even for identical sentences. You can comment out the model.train call and see this for yourself.
The details, if you're interested: the vectors are initialized right after the vocabulary is built. In your case that's the model.build_vocab(...) call, which in turn calls the model.reset_doc_weights() method (see the source code in gensim/models/doc2vec.py). That method initializes all vectors randomly, no matter what the sentences are. If at this point you reset the initialization and assign equal vectors, they shouldn't diverge anymore.
In theory, if you train on identical sentences for long enough, the algorithm should converge to the same vector even with different initializations. But practically it's not going to happen, and I don't think you should be worried about that.
@Maxim's answer is correct about the inherent randomness used by the algorithm, but you have additional problems with this example:
Doc2Vec doesn't give meaningful results on tiny, toy-sized examples. The vectors only acquire good relative meanings when they're the result of a large, diverse set of contrasting training-contexts. Your 2-line dataset, run through 55 training cycles, is really just providing the model with 1 unique text, 110 times.
Even though you've wisely reduced the vector size to a tiny number (5) to reflect the tiny data, it's still too large a model for just 2 examples, and prone to complete 'overfitting'. The model could randomly assign line #1 the vector [1.0, 0.0, 0.0, 0.0, 0.0] and line #2 [0.0, 1.0, 0.0, 0.0, 0.0], then through all its training update only its internal weights (never the doc-vectors themselves), and still achieve internal word-predictions as good as or better than in the real scenario where everything is incrementally updated. That's possible because there's enough free state in the model that there's never any essential competition, compression, or tradeoff forcing the two sentences to converge where similar. (There are many solutions, and most don't involve any useful, generalized 'learning'. Only large datasets, which force the model into a tug-of-war between modeling multiple examples as well as possible, create that learning.)
dm_concat=1 is a non-default, experimental mode that requires even more data to train, and results in larger, slower models. Avoid using it unless you're sure – and can prove with results – that it helps for your use.
Even when these are fixed, complete determinism isn't automatic in Doc2Vec – and you shouldn't really try to eliminate it. (The small jitter between runs is a useful signal of, and reminder about, the essential variances in this algorithm – and if your training/evaluation remains stable across such small variances, that's an extra indicator that it's functional.)