Why does Doc2vec give 2 different vectors for the same texts?
I am using Doc2vec to get vectors from words. Please see my code below:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

with open('test.txt', 'r') as f:
    trainings = [TaggedDocument(words=data.strip().split(","), tags=[i])
                 for i, data in enumerate(f)]

model = Doc2Vec(vector_size=5, epochs=55, seed=1, dm_concat=1)
model.build_vocab(trainings)
model.train(trainings, total_examples=model.corpus_count, epochs=model.epochs)
model.save("doc2vec.model")

model = Doc2Vec.load('doc2vec.model')
for i in range(len(model.docvecs)):
    print(i, model.docvecs[i])
I have a test.txt file whose content is 2 lines, and those 2 lines are identical (each is just "a"). I trained a Doc2Vec model on it, but the problem is that although the contents of the 2 lines are the same, Doc2Vec gave me 2 different vectors:
0 [ 0.02730868 0.00393569 0.08150548 0.04009786 0.01400406]
1 [ 0.03916578 0.06423566 0.05350181 0.00726833 0.08292392]
I don't know why this happened. I thought these vectors would be the same. Can you explain that? And if I want to get the same vectors for the same words, what should I do?
2 answers

There is inherent randomness in the Doc2Vec (and Word2Vec) algorithm: the initial vectors are random, and are already different even for identical sentences. You can comment out the model.train call and see this for yourself.

The details, if you're interested: the vectors are initialized right after the vocab is built. In your case that's the model.build_vocab(...) call, which in turn calls the model.reset_doc_weights() method (see the source code in gensim/models/doc2vec.py), which initializes all vectors randomly, no matter what the sentences are. If at this point you reset the initialization and assigned equal vectors, they shouldn't diverge anymore.

In theory, if you trained on identical sentences long enough, the algorithm should converge to the same vector even with different initializations. But practically that's not going to happen, and I don't think you should be worried about it.

@Maxim's answer is correct about the inherent randomness used by the algorithm, but this example has additional problems:
Doc2Vec doesn't give meaningful results on tiny, toy-sized examples. The vectors only acquire good relative meanings when they're the result of a large, diverse set of contrasting training contexts. Your 2-line dataset, run through 55 training cycles, is really just providing the model with 1 unique text, 110 times.

Even though you've wisely reduced the vector size to a tiny number (5) to reflect the tiny data, it's still a too-large model for just 2 examples, prone to complete 'overfitting'. The model could randomly assign line #1 the vector [1.0, 0.0, 0.0, 0.0, 0.0] and line #2 the vector [0.0, 1.0, 0.0, 0.0, 0.0], and then, through all its training, only update its internal weights (never the doc-vectors themselves), yet still achieve internal word-predictions just as good or better than in the real scenario, where everything is incrementally updated. There's enough free state in the model that there's never any essential competition/compression/trade-offs forcing the two sentences to converge where similar. (There are many solutions, and most don't involve any useful generalized 'learning'. Only large datasets, forcing the model into a tug-of-war between modeling multiple examples as well as possible, with trade-offs, create the learning.)
dm_concat=1 is a non-default, experimental mode that requires even more data to train and results in larger/slower models. Avoid using it unless you're sure – and can prove with results – that it helps for your use.
Even when these are fixed, complete determinism isn't automatic in Doc2Vec – and you shouldn't really try to eliminate it. (The small jitter between runs is a useful signal/reminder of the essential variance in this algorithm – and if your training/evaluation remains stable across such small variances, that's an extra indicator that it's functional.)