How to get word embeddings back from Keras?
Say you create your own custom word embeddings in the course of some arbitrary task, say text classification. How do you get a dictionary-like structure of {word: vector} back from Keras?
embeddings_layer.get_weights()
gives you the raw embeddings... but it's unclear which word corresponds to which vector.
1 answer
-
answered 2021-02-23 05:47
Andrey
This dictionary is not part of the Keras model. It should be kept separately as a normal Python dictionary. It is already in your code: you use it to convert text to integer indices before feeding them to the Embedding layer.
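To make this concrete, here is a minimal sketch of rebuilding a {word: vector} dict from that mapping plus get_weights(). The word_index dict is a hand-made stand-in for whatever word-to-index mapping (for example a Tokenizer's word_index) was used in the original code, and the untrained Embedding layer stands in for the trained one:

    import numpy as np
    from tensorflow.keras.layers import Embedding

    # Stand-in for the asker's setup: word_index is the mapping used to turn
    # words into the integer ids fed to the Embedding layer; the layer itself
    # only stores the vectors, row i belonging to the word with index i.
    word_index = {"the": 1, "cat": 2, "sat": 3, "dog": 4, "barked": 5}

    vocab_size = len(word_index) + 1                 # index 0 reserved for padding
    embeddings_layer = Embedding(input_dim=vocab_size, output_dim=8)
    _ = embeddings_layer(np.array([[1, 2]]))         # calling the layer once creates its weights
                                                     # (in a real model they come from training)

    weights = embeddings_layer.get_weights()[0]      # shape: (vocab_size, embedding_dim)
    word_to_vector = {word: weights[idx] for word, idx in word_index.items()}
    print(word_to_vector["cat"].shape)               # (8,)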
See also questions close to this topic
-
How can I flatten a json in pandas with wildcard?
Given this json:
"World_Regions": { "Americas": { "0": { "Name": "North America", "Category_Average": "54.53", "Stocks_%": "55.44", "Benchmark": "59.02" }, "1": { "Name": "Latin America", "Category_Average": "0.87", "Stocks_%": "1.14", "Benchmark": "0.93" } }, "Greater Asia": { "0": { "Name": "Japan", "Category_Average": "6.58", "Stocks_%": "3.74", "Benchmark": "7.76" }, "1": { "Name": "Australasia", "Category_Average": "1.79", "Stocks_%": "7.45", "Benchmark": "2.17" }, "2": { "Name": "Asia Developed", "Category_Average": "5.56", "Stocks_%": "7.27", "Benchmark": "4.57" }, "3": { "Name": "Asia Emerging", "Category_Average": "6.63", "Stocks_%": "2.96", "Benchmark": "6.58" } },
I want to get this result:
         Name            Category_Average  Stocks_%  Benchmark
    0    North America   54.53             55.44     59.02
    1    Latin America   0.87              1.14      0.93
    2    Japan           6.58              3.74      7.76
    3    Australasia     1.79              7.45      2.17
    4    Asia Developed  5.56              7.27      4.57
    6    Asia Emerging   6.63              2.96      6.58
but unfortunately the different region names (Americas/Greater Asia) are causing a problem. I am trying to do this cleanly in one command. Right now I can get the result by doing this:
    pd.DataFrame.from_dict(jsonFile['World_Regions']['Greater Asia']).transpose()

         Name            Category_Average  Stocks_%  Benchmark
    0    Japan           6.58              3.74      7.76
    1    Australasia     1.79              7.45      2.17
    2    Asia Developed  5.56              7.27      4.57
    3    Asia Emerging   6.63              2.96      6.58
then do the same for Americas and merge the dataframes. Is there a way to do it that's more direct (i.e. one command)?
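Not a single json_normalize call, but one expression that works regardless of the region names is to concatenate one frame per region; a sketch, assuming jsonFile is the parsed dict shown above:

    import pandas as pd

    # One DataFrame per region (whatever the region keys happen to be),
    # concatenated with a fresh 0..n-1 index.
    df = pd.concat(
        [pd.DataFrame.from_dict(region, orient="index")
         for region in jsonFile["World_Regions"].values()],
        ignore_index=True,
    )
    print(df)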
-
Python Splinter: Opening Chrome Dev Tools fixes timeout
One of the weekly scrapes that I run has been timing out. Instead of search results, their error page comes up with a message that translates to "The webserver is unavailable"
The program will reload the page and try again. Sometimes it gets past it on its own, but that can take a while and is not guaranteed. The browser.reload() will time out and try again when the exception is caught. But if I just open dev tools at any point while it's stuck, the page immediately starts functioning normally: the reload stops hanging, the page loads, and the search results are there. The next few (anywhere from 3 to 20+) searches will be fine, and then it happens again, with dev tools getting things back on track.
Can anyone offer any insight as to why this may be happening?
They don't use captcha or anything but could this be under the umbrella of limiting traffic?
I looked into trying to have splinter/selenium open dev tools but was unable to make it happen.
I'm running searches for a list of locations and a list of dates for each location. My next thought is to build a function to kill the browser instance, create a new one, and handle a few things that are necessary upon initially visiting the site, so it can pick up where it left off within the loops. Before that, is there anything else I should look into? Google gave me nothing on this specific scenario.
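No answer on the dev-tools behaviour itself, but as a sketch of the "kill the browser instance and create a new one" fallback mentioned above (the function name and start URL are made up for illustration; Browser, visit and quit are splinter's actual calls):

    from splinter import Browser

    def restart_browser(old_browser, start_url):
        """Tear down a stuck browser instance and return a fresh one on the start page."""
        try:
            old_browser.quit()               # close the hung Chrome session
        except Exception:
            pass                             # it may already be dead
        new_browser = Browser("chrome", headless=True)
        new_browser.visit(start_url)         # redo the initial-visit setup steps here
        return new_browser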
-
Does np.save remove duplicates in saving content?
    wiki_knowledge = []
    for k in knowledge:
        wiki_knowledge.append(k)
        num += 1
        if num % 500 == 0:  # save periodically
            np.save(save_path + "wiki_knowledge.npy", wiki_knowledge)
    np.save(save_path + "wiki_knowledge.npy", wiki_knowledge)
I am new to the np.save() function; the code above is someone else's that I am reading. The wiki_knowledge list is accumulated inside the loop, so how does np.save avoid saving the same elements multiple times?
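For what it's worth, np.save does not deduplicate anything: each call simply rewrites the target file with the array it is given, so the final file reflects whatever was passed in the last call, duplicates included. A small self-contained check (the path is just a placeholder):

    import numpy as np

    data = []
    for k in ["a", "b", "b", "c"]:
        data.append(k)
        np.save("/tmp/wiki_knowledge.npy", data)   # overwrites the same file each time

    loaded = np.load("/tmp/wiki_knowledge.npy")
    print(loaded)   # ['a' 'b' 'b' 'c'] -- one entry per append, duplicates kept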
-
What else to try in order to increase test accuracy of a deep learning model?
Please, I need help.
I have been struggling with this problem for a year: I want to train a deep neural network on a kinematic dataset named JIGSAWS, which is a publicly available surgical dataset. Data samples represent the recorded kinematic motion of surgeons who have performed some surgical tasks, and they are divided into three classes: Good performance [1,0,0], Average performance [0,1,0] and Bad performance [0,0,1]. The goal is to classify surgeon performance and, of course, to achieve maximum test accuracy. I have read several scientific papers reporting more than 90% test accuracy on the same dataset, but I have never been able to reach that accuracy with my deep learning model. I'm using Keras and here is what I have tried so far:
- Neural network types tested: Feed-Forward Deep Neural Network, Multi-Layer Perceptron, CNN (1D and 2D), RNN, LSTM, Bidirectional LSTM.
- Adding/removing layers. Adding/removing neurons. Adding/removing LSTM units. Varying the number of filters and the filter sizes for the CNNs.
- Mixing LSTMs with DNNs. Mixing CNNs with DNNs.
- Activation functions tested: sigmoid, ReLU, linear.
- Loss function: categorical crossentropy.
- Data augmentation (using the method from this paper: https://arxiv.org/abs/1806.05796).
- Regularization: Dropout (25%, 50%, 75%), L2 regularization (several values), adding noise to the input and hidden layers.
- Several learning rates tried: 1e-1, 1e-2, 1e-3 and 1e-4, with learning rate decay.
- Balanced training data, validation data and test data.
- Different batch sizes: 8, 16, 32, 64, 128...
- Several numbers of epochs: 100, 200, ..., 5000.
- Early stopping when the validation loss reaches its minimum and then starts to increase (see the sketch after this list).
- A data generator for batchwise training, in order to provide different batches of data at each epoch.
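For reference, a minimal sketch of how the early-stopping item above is typically wired up with the standard tf.keras callback (the model and data variables are placeholders, so the fit call is left commented):

    import tensorflow as tf

    # Stop once val_loss stops improving and roll back to the best weights seen.
    callbacks = [
        tf.keras.callbacks.EarlyStopping(
            monitor="val_loss",
            patience=20,                  # epochs to wait after the last improvement
            restore_best_weights=True,
        ),
    ]
    # model.fit(x_train, y_train,
    #           validation_data=(x_val, y_val),
    #           epochs=5000, batch_size=32,
    #           callbacks=callbacks)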
I will edit my post if I notice that I have forgotten to mention any other item. The best result I have got so far is 0.04 for validation loss and 73% test accuracy. I'm fairly happy with the validation loss, which suggests the overfitting problem is solved, but not with the test accuracy, since other papers report over 90%. I have emailed the authors and they did not answer me.
So, what else can I try to do to increase test accuracy ?
Thank you for your time.
-
Confused about predict() output from Huggingface Transformers Sequence Classification
Most of the code below is taken from this huggingface doc page, using the tensorflow code selections. What confuses me is that after fine-tuning a pretrained model on a few new sentences and running predict() on two test-set sentences, I get predict() output that is a 16x2 array. The x2 makes sense as I have two classes (0, 1), but why length 16 when I passed a test set of 2 (not 16) sequences to a 'SequenceClassification' model? How do I get the predicted classes for the two test-set sequences? (PS: I have no problem converting from logits to predicted probabilities; I'm just confused about the shape of the output.)
Reproducible code example below. Also feel free to step through code in google colab environment here
    from transformers import DistilBertTokenizerFast
    from transformers import TFDistilBertForSequenceClassification
    import tensorflow as tf

    # set up arbitrary example data
    train_txt = ['this sentence is about dinosaurs', 'this also mentions dinosaurs', 'this does not']
    test_txt = ['the land before time was cool', 'alligators are basically dinosaurs']
    train_labels = [1, 1, 0]
    test_labels = [1, 1]

    # convert sentence lists to Distilbert Encodings and then TF Datasets
    tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
    train_encodings = tokenizer([str(s) for s in train_txt], truncation=True, padding=True)
    test_encodings = tokenizer([str(s) for s in test_txt], truncation=True, padding=True)
    train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings), train_labels))
    test_dataset = tf.data.Dataset.from_tensor_slices((dict(test_encodings), test_labels))

    # Fine-tune pretrained Distilbert Classifier on our data
    model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
    optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
    model.compile(optimizer=optimizer, loss=model.compute_loss)  # can also use any keras loss fn
    model.fit(train_dataset.shuffle(1000).batch(3), epochs=3, batch_size=3)

    # Generate test-set predictions
    test_preds = model.predict(test_dataset)
test_preds output:

    >test_preds
    TFSequenceClassifierOutput([('logits', array([[ 0.1527334 ,  0.17010647],
           [ 0.10007463,  0.15664947],
           [-0.10294056,  0.18813357],
           [-0.05231615,  0.1587314 ],
           [-0.11520502,  0.16303074],
           [ 0.00855697,  0.13974288],
           [-0.17962483,  0.12381783],
           [ 0.05765227,  0.04970012],
           [ 0.1527334 ,  0.17010647],
           [-0.12754977,  0.11164709],
           [-0.00847345,  0.12885672],
           [-0.01731028,  0.13520113],
           [-0.08433925,  0.16828224],
           [-0.20086896,  0.08963215],
           [ 0.05765227,  0.04970012],
           [ 0.02467203,  0.15794128]], dtype=float32))])
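For what it's worth, 16 is exactly 2 sentences x 8 padded tokens, which suggests the test dataset is being fed unbatched, so each token row is treated as its own example. A sketch of the likely fix, assuming that diagnosis is right and reusing model/test_dataset from the code above:

    # Batch the test set so predict() sees (batch, seq_len) inputs and returns
    # one logit row per sentence rather than one per token.
    test_preds = model.predict(test_dataset.batch(2))
    print(test_preds.logits.shape)               # expected (2, 2)
    print(test_preds.logits.argmax(axis=-1))     # predicted class per test sentence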
-
Save and load weights and optimiser state for retraining
I want to save the model and load it, together with the optimizer state, for retraining. I was able to save the model weights as a .h5 file but have had no luck with the optimizer state. Please help me.
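A minimal sketch of the standard Keras route, assuming the whole compiled model (not just its weights) is saved so the optimizer state travels with it; model here stands for the compiled model from your own code:

    import tensorflow as tf

    # save_weights() stores only the layer weights; model.save() on a compiled
    # model also stores the architecture, compile settings and optimizer state.
    model.save("model_with_optimizer.h5")

    # Later, for retraining: load_model restores weights and optimizer state,
    # so fit() continues where it left off.
    restored = tf.keras.models.load_model("model_with_optimizer.h5")
    # restored.fit(x_train, y_train, epochs=more_epochs)
-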
python for-in -> if not in -> if -> del process stops working for no reason
I tried to delete words from a list using the conditions below, but for some reason, in the middle of the second condition, the logic stops working.
The logic is as follows:
1. iterate over the words in the list
2. if the word has '-' and is not a member of the exception group
3. count how many '-' characters the word has
4. if the word has only one '-'
- split the word into two parts, one before the '-' and the other after it
- then replace the hyphenated word with the two new words ("before hyphen" and "after hyphen")
5. if the word has two '-'
- do the same as in step 4, but split into three parts and replace the word containing two hyphens with the three parts: "before first hyphen", "between the two hyphens", "after the second hyphen"
I finally came up with the logic and wrote the code below:

    z = ['blue-ray', 'red-ray-something', 'blue', 'yellow', 'blue', 'red']
    exception = ['olive-oil', 'canola-oil']

    # for entity in z:
    #     print(entity)
    # z[1] in save_dash

    for entity in z:
        print(entity)
        if '-' in entity and entity not in exception:
            print(entity)
            letter_count = Counter(entity)
            if letter_count['-'] == 2:
                print('hi')
                first_hyphen = entity.index('-')
                second_hyphen = find_nth(entity, '-', 2)
                firstword = entity[:first_hyphen]
                secondword = entity[first_hyphen+1:second_hyphen]
                thirdword = entity[second_hyphen+1:]
                delete_index = z.index(entity)
                del z[delete_index]
                z.append(firstword)
                z.append(secondword)
                z.append(thirdword)
            elif letter_count['-'] == 1:
                print('no')
                hyphenindex = entity.index('-')
                firstword = entity[:hyphenindex]
                secondword = entity[hyphenindex+1:]
                # delete_index = z.index(entity)
                # del z[delete_index]
                # z.append(firstword)
                # z.append(secondword)
There is no error, but as soon as I uncomment the lines starting from "delete_index = z.index(entity)", the words with two hyphens stop being split out.
Why is that?
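Most likely z is being modified while it is iterated over, which makes the loop skip elements. A sketch of a rewrite that avoids that by building a new list instead (it only uses the z and exception lists from above):

    # Never mutate z during iteration: split on every hyphen unless the word
    # is in the exception list, collecting the results in a new list.
    z = ['blue-ray', 'red-ray-something', 'blue', 'yellow', 'blue', 'red']
    exception = ['olive-oil', 'canola-oil']

    result = []
    for entity in z:
        if '-' in entity and entity not in exception:
            result.extend(entity.split('-'))   # handles one, two or more hyphens alike
        else:
            result.append(entity)

    z = result
    print(z)   # ['blue', 'ray', 'red', 'ray', 'something', 'blue', 'yellow', 'blue', 'red']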
-
Using spaCy, how do I make a pattern for splitting words that have a dash inside them?
I'm trying to split words like 'olive-oil', 'high-fat', 'all-purpose', which are tokenized as one chunk.
The desired tokenization is
['olive','-','oil','high','-','fat','all','-','purpose']
I looked into retokenizer and the usage was like below.
doc = nlp("I live in NewYork") with doc.retokenize() as retokenizer: heads = [(doc[3], 1), doc[2]] attrs = {"POS": ["PROPN", "PROPN"], "DEP": ["pobj", "compound"]} retokenizer.split(doc[3], ["New", "York"], heads=heads, attrs=attrs)
As you can see in the last line, to retokenize a chunk into pieces I have to provide what the result should be. I don't think this is an efficient way of processing words, because providing all the results means manually typing out every possibility, which is not a feasible plan. Even if I knew all the cases and provided the end result one by one, it would hardly be more efficient than just finding the words to be replaced and replacing them manually.
I believe there must be a way to generalize them.
If anyone knows a way to tokenize the words I listed at the top, can you help me?
Thank you
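One general route (not the retokenizer) is to make sure the tokenizer's own infix rules treat a hyphen between letters as a split point, so every letter-hyphen-letter word is split without listing the results by hand; a minimal sketch, assuming a plain English pipeline:

    import spacy
    from spacy.util import compile_infix_regex

    nlp = spacy.blank("en")   # a blank English pipeline; a loaded model works the same way

    # Ensure a hyphen between two letters counts as an infix (recent spaCy
    # defaults may already include such a rule), so the tokenizer itself
    # splits 'olive-oil' into ['olive', '-', 'oil'].
    infixes = list(nlp.Defaults.infixes) + [r"(?<=[A-Za-z])-(?=[A-Za-z])"]
    nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

    doc = nlp("olive-oil high-fat all-purpose")
    print([t.text for t in doc])
    # expected: ['olive', '-', 'oil', 'high', '-', 'fat', 'all', '-', 'purpose']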
-
Foma extracting relations
Say I have D, which contains relations such as
<DOG,dog>, <DOG,doggie>, <CAT,cat>, <BIRD, bird>, <BIRD, birdie>.
What regex would I have to write in foma so that I get <DOG,dog>, <DOG,doggie>, <BIRD, bird>, <BIRD, birdie>, that is, only those relations that have multiple lower forms?
-
how to control the distance metric in gensim "similar_by_vector" method
I am using the ".wv.similar_by_vector" method, I want to compute both by Euclidean distance and cosine distance separately but can't find a flag to do so.
This is what I did
list_nn = model.wv.similar_by_vector(vec, topn=10, restrict_vocab=None)
How can I change it to Euclidean distance?
*Judging by the results I conclude that it computes the cosine distance.
Thanks!
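As far as I know, similar_by_vector has no metric flag (it ranks by cosine similarity), so one option is to compute Euclidean distances directly over the stored vectors; a sketch assuming gensim 4.x attribute names (wv.vectors, wv.index_to_key):

    import numpy as np

    def nearest_by_euclidean(wv, vec, topn=10):
        """Return the topn (word, distance) pairs closest to vec by Euclidean distance."""
        dists = np.linalg.norm(wv.vectors - vec, axis=1)   # distance to every word vector
        order = np.argsort(dists)[:topn]
        return [(wv.index_to_key[i], float(dists[i])) for i in order]

    # usage with the model from the question:
    # list_nn = nearest_by_euclidean(model.wv, vec, topn=10)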
-
How to use keras transformer to build word embeddings
I'm looking for any details on how I can build word embeddings using keras-transformer.
In the Git repo below, I see the embedding-size argument, but once I execute this, how do I extract the trained embeddings?
https://github.com/kpot/keras-transformer/tree/master/example
The example in question is located at the link below. The embeddings are specified using arg "word_embedding_size". https://github.com/kpot/keras-transformer/blob/master/example/run_bert.py
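Independent of that specific repo, the generic Keras route is to locate the embedding layer in the trained model and read its weight matrix; a sketch where model stands for the trained model, and looking the layer up by the standard Embedding class is an assumption about how that example model is built:

    from tensorflow.keras.layers import Embedding

    # Find the first Embedding layer in a trained Keras model and pull out
    # its (vocab_size, word_embedding_size) weight matrix.
    embedding_layer = next(layer for layer in model.layers
                           if isinstance(layer, Embedding))
    embedding_matrix = embedding_layer.get_weights()[0]
    print(embedding_matrix.shape)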
-
How to print out the short texts ordered by their distance to the center point of each cluster? (NLP clustering in Python)
NLP/K-MEANS/PYTHON
Hi all,
I'm currently working on a short-text clustering task in NLP. I'm trying to cluster the short texts with K-means.
I have embedded the sentences (using GloVe), fed them to a CNN, and then used K-means for clustering.
I find that most (or maybe all) online tutorials only show how to plot the clustering results; none of them explain how to print out the sentences/documents in the clusters. I have figured out how to print out the sentences in each cluster (I'm using Python).
My question is :
- how do I print out the sentence/document of the center point?
- how can I print out the sentences ordered by their distance to the center point of the cluster?
Can anyone help me on this issue?
Many thanks in advance!
My code:
    # print centers of the clusters
    centers = kmeans.cluster_centers_
    centroidpoint = pca.transform(centers)
    print("Centers- Kmeans")
    print(centers)
The output is like this:

    Centers- Kmeans
    [[0.0752584  0.08675878 0.03207847 ... 0.10317419 0.07130289 0.0322413 ]
     [0.06198343 0.07327988 0.05582789 ... 0.10588244 0.0630549  0.03647455]
     ...
How can I find out the sentences closest to the center point of each cluster, instead of just outputting the vector value of the cluster center?
    # print out the sentences in each cluster
    centroid_list = kmeans.cluster_centers_
    labels = kmeans.labels_
    n_clusters_ = len(centroid_list)
    # print "cluster centroids:",centroid_list
    print(labels)

    cluster_menmbers_list = []
    for i in range(0, n_clusters_):
        menmbers_list = []
        for j in range(0, len(labels)):
            if labels[j] == i:
                menmbers_list.append(j)
        cluster_menmbers_list.append(menmbers_list)
    # print cluster_menmbers_list

    for i in range(0, len(cluster_menmbers_list)):
        print("CLUSTER" + " " + str(i) + ':')
        for j in range(0, len(cluster_menmbers_list[i])):
            a = cluster_menmbers_list[i][j]
            print(data1[a])
The output is like:

    cluster 0:
    sentence1
    sentence2
    sentence3
    ...
    cluster 1:
    sentence1
    sentence2
    sentence3
but these sentences are not ordered by their distance to the center of the cluster, so they look very dispersed...
How can I print out, say, the top 20 or top 30 sentences that are nearest to the center of each cluster?
Many thanks in advance!
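A sketch of the ranking step, reusing kmeans and data1 from the code above; X is assumed to be the embedding matrix that kmeans was fitted on, aligned row-for-row with the sentences in data1:

    import numpy as np

    # For each cluster, sort its members by distance to the cluster center
    # and print the top_n closest sentences. The first sentence printed per
    # cluster is also the one nearest to the center point.
    top_n = 20
    for i, center in enumerate(kmeans.cluster_centers_):
        member_idx = np.where(kmeans.labels_ == i)[0]
        dists = np.linalg.norm(X[member_idx] - center, axis=1)
        closest = member_idx[np.argsort(dists)][:top_n]
        print("CLUSTER %d (closest %d sentences):" % (i, len(closest)))
        for j in closest:
            print(data1[j])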