Seq2Seq training with already tokenized ID files
I am training a 4-tuple event-to-event model (for story generation). My training data are two separate text files, one for the encoder and one for the decoder; each line is a 4-tuple of integer IDs, e.g. 57, 59, 1, 3.
I have a separate dictionary (vocabulary) file for these IDs.
My question is how do I apply batching in this case?
I am following the architecture in this notebook: https://colab.research.google.com/github/bentrevett/pytorch-seq2seq/blob/master/1%20-%20Sequence%20to%20Sequence%20Learning%20with%20Neural%20Networks.ipynb#scrollTo=qdvhfatmcV83
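To make the data format concrete, here is a minimal sketch of how I picture loading and batching the ID files (the file names and batch size are placeholders, not my actual setup; since every line has exactly four IDs, no padding should be needed):
import torch
from torch.utils.data import Dataset, DataLoader

class EventTupleDataset(Dataset):
    """Each line of src_file / trg_file holds one 4-tuple of integer IDs, e.g. "57, 59, 1, 3"."""
    def __init__(self, src_file, trg_file):
        self.src = self._read(src_file)
        self.trg = self._read(trg_file)
        assert len(self.src) == len(self.trg), "encoder and decoder files must align line by line"

    @staticmethod
    def _read(path):
        with open(path) as f:
            return [torch.tensor([int(tok) for tok in line.split(",")], dtype=torch.long)
                    for line in f if line.strip()]

    def __len__(self):
        return len(self.src)

    def __getitem__(self, idx):
        return self.src[idx], self.trg[idx]

def collate(batch):
    # every sequence has length 4, so no padding is needed; stack to [batch, 4]
    # and transpose to [seq_len, batch], which is what the notebook's encoder/decoder expect
    src = torch.stack([s for s, _ in batch]).t()
    trg = torch.stack([t for _, t in batch]).t()
    return src, trg

# placeholder file names
loader = DataLoader(EventTupleDataset("train_src.txt", "train_trg.txt"),
                    batch_size=64, shuffle=True, collate_fn=collate)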
Thank you so much!
See also questions close to this topic
-
Print accuracy for YOLOv5 model?
I want to print the accuracy of my trained model. I understand that when I run the model on the camera feed it shows the accuracy on each bounding box, but I want to see the accuracy of my whole model rather than of a single instance in my image.
cap = cv2.VideoCapture(0)
# loop through labels
for label in labels:
    print('Collecting images for {}'.format(label))  # so that we can see when transitioning through images
    time.sleep(5)  # sleep or wait for 5 seconds when transitioning between images for the other class
    # Loop through image range
    for img_num in range(number_imgs):
        print('Collecting images for {}, image number {}'.format(label, img_num))
        # webcam feed
        ret, frame = cap.read()  # read the feed from the webcam and store it in given vars
        # Naming image path
        imgname = os.path.join(IMAGES_PATH, label+'.'+str(uuid.uuid1())+'.jpg')
        # Writes out image to file
        cv2.imwrite(imgname, frame)
        # Render to screen
        cv2.imshow('Image Collection', frame)
        # 2 sec delay between diff captures
        time.sleep(2)
        if cv2.waitKey(10) & 0xFF == ord('q'):
            break
cap.release()
cv2.destroyAllWindows()
!cd yolov5 && python train.py --img 320 --batch 16 --epochs 500 --data dataset.yml --weights yolov5s.pt
model = torch.hub.load('ultralytics/yolov5', 'custom', path='yolov5/runs/train/exp/weights/last.pt', force_reload=True)
while cap.isOpened():
    ret, frame = cap.read()
    # Make detections
    results = model(frame)
    cv2.imshow('YOLO', np.squeeze(results.render()))
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
cv2.waitKey(1)
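One hedged note (not from the question itself): recent versions of the yolov5 repository also ship a val.py script that evaluates a weights file against the whole validation set and reports precision, recall, and mAP, which is closer to a model-level accuracy than the per-box confidences shown while rendering detections. A sketch, assuming the same repo layout as the training command above:
!cd yolov5 && python val.py --weights runs/train/exp/weights/last.pt --data dataset.yml --img 320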
-
KeyError: 0 during CPU training
I am not sure what the reason is; probably my index exceeds the limit of the input? But how can I solve this error? Here is the code I used: https://github.com/wanyu-lin/ICML2021-Gem
-
How can I modify the Dataset class to make Mask R-CNN work with multiple object classes?
I am currently working on instance segmentation. I follow these two tutorials:
However, these two tutorials work perfectly with one class (person) plus background. In my case, I have two classes, person and car, plus background. I couldn't find any resources about making Mask R-CNN work with multiple object classes.
Notice that:
I am using PyTorch ( torchvision ), torch==1.10.0+cu111 torchvision==0.11.0+cu111 torchaudio==0.10.0
I am using Pascal VOC annotations
I used the segmentation class masks (not the XML files) + the images
and this is my dataset class
class PennFudanDataset(torch.utils.data.Dataset):
    def __init__(self, root, transforms=None):
        self.root = root
        self.transforms = transforms
        # load all image files, sorting them to
        # ensure that they are aligned
        self.imgs = list(sorted(os.listdir(os.path.join(root, "img"))))
        self.masks = list(sorted(os.listdir(os.path.join(root, "imgMask"))))

    def __getitem__(self, idx):
        # load images and masks
        img_path = os.path.join(self.root, "img", self.imgs[idx])
        mask_path = os.path.join(self.root, "imgMask", self.masks[idx])
        img = Image.open(img_path).convert("RGB")
        # note that we haven't converted the mask to RGB,
        # because each color corresponds to a different instance
        # with 0 being background
        mask = Image.open(mask_path)
        mask = np.array(mask)
        # instances are encoded as different colors
        obj_ids = np.unique(mask)
        # first id is the background, so remove it
        obj_ids = obj_ids[1:]
        # split the color-encoded mask into a set
        # of binary masks
        masks = mask == obj_ids[:, None, None]
        # get bounding box coordinates for each mask
        num_objs = len(obj_ids)
        boxes = []
        for i in range(num_objs):
            pos = np.where(masks[i])
            xmin = np.min(pos[1])
            xmax = np.max(pos[1])
            ymin = np.min(pos[0])
            ymax = np.max(pos[0])
            boxes.append([xmin, ymin, xmax, ymax])
        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        # there is only one class
        labels = torch.ones((num_objs,), dtype=torch.int64)
        masks = torch.as_tensor(masks, dtype=torch.uint8)
        image_id = torch.tensor([idx])
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # suppose all instances are not crowd
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)

        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["masks"] = masks
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd

        if self.transforms is not None:
            img, target = self.transforms(img, target)

        return img, target

    def __len__(self):
        return len(self.imgs)
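I suspect the part that hard-codes a single class is the labels = torch.ones(...) line. A minimal sketch of what I imagine it would look like with per-instance classes (obj_id_to_class is a hypothetical mapping from instance id to class index, e.g. person = 1, car = 2, that I would still have to build from my annotations):
# hypothetical lookup from instance id in the mask to class index (1 = person, 2 = car)
labels = torch.as_tensor([obj_id_to_class[int(i)] for i in obj_ids], dtype=torch.int64)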
Can anyone help me?
-
Shop name classification
I have a list of merchant names and their corresponding Merchant Category Codes (MCC). It seems that only about 80 percent of the MCC labels are correct. There are about 300 distinct MCCs in total. A merchant name may contain one, two, or three words. I need to predict the MCC from the merchant name. How can I do that?
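For context, the kind of baseline I have been picturing is something like the sketch below, using scikit-learn with character n-gram TF-IDF and a linear classifier (the merchant names and MCC values here are made-up placeholder data, not my real dataset):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# placeholder data: raw merchant names and their MCC labels
merchant_names = ["STARBUCKS STORE 123", "SHELL OIL 5574", "AMZN MKTP US"]
mcc_labels = ["5814", "5541", "5942"]

# character n-grams cope well with short, abbreviated, noisy names
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(merchant_names, mcc_labels)
print(clf.predict(["STARBUCKS 99"]))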
-
R: How can I add titles based on grouping variable in word_associate?
I am using the word_associate package in R Markdown to create word clouds across a grouping variable with multiple categories. I would like the titles of each word cloud to be drawn from the character values of the grouping variable.
I have added trans_cloud(title=TRUE) to my code, but have not been able to resolve my problem. Here's my code, which runs but doesn't produce graphs with titles:
library(qdap)
word_associate(df$text, match.string=c("cat"), grouping.var=c(df$id), text.unit="sentence",
               stopwords=c(Top200Words), wordcloud=TRUE, cloud.colors=c("#0000ff","#FF0000"),
               trans_cloud(title=TRUE))
I have also tried the following, which does not run:
library(qdap)
word_associate(df$text, match.string=c("cat"), grouping.var=c(df$id), text.unit="sentence",
               stopwords=c(Top200Words), wordcloud=TRUE, cloud.colors=c("#0000ff","#FF0000"),
               title=TRUE)
Can anyone help me figure this out? I can't find any guidance in the documentation, and there are hardly any examples of or discussions about word_associate on the web.
Here's an example data frame that reproduces the problem:
id          text
question1   I love cats even though I'm allergic to them.
question1   I hate cats because I'm allergic to them.
question1   Cats are funny, cute, and sweet.
question1   Cats are mean and they scratch me.
question2   I only pet cats when I have my epipen nearby.
question2   I avoid petting cats at all cost.
question2   I visit a cat cafe every week. They have 100 cats.
question2   I tried to pet my friend's cat and it attacked me.
Note that if I run this in R (instead of Markdown), the figures automatically print the "question1_list1" and "question2_list1" in bright blue at the top of the figure file. This doesn't work for me because I need the titles to exclude "_list1" and be written in black. These automatically generated titles do not respond to changes in my trans_cloud specifications. Ex:
library(qdap)
word_associate(df$text, match.string=c("cat"), grouping.var=c(df$id), text.unit="sentence",
               stopwords=c(Top200Words), wordcloud=TRUE, cloud.colors=c("#0000ff","#FF0000"),
               trans_cloud(title=TRUE, title.color="#000000"))
In addition, I'm locked in to using this package (as opposed to other options for creating word clouds in R) because I'm using it to create network plots, too.
-
kwic() function returns fewer rows than it should
I'm currently trying to perform a sentiment analysis on a kwic object, but I'm afraid that the kwic() function does not return all the rows it should. I'm not quite sure what exactly the issue is, which makes it hard to post a reproducible example, so I hope that a detailed explanation of what I'm trying to do will suffice.
I subsetted the original dataset containing speeches I want to analyze to a new data frame that only includes speeches mentioning certain keywords. I used the following code to create this subset:
ostalgie_cluster <- full_data %>% filter(grepl('Schwester Agnes|Intershop|Interflug|Trabant|Trabi|Ostalgie', speechContent, ignore.case = TRUE))
The resulting data frame consists of 201 observations. When I perform kwic() on the same initial dataset using the following code, however, it returns a data frame with only 82 observations. Does anyone know what might cause this? Again, I'm sorry I can't provide a reproducible example, but when I try to create a reprex from scratch it just... works...
# create quanteda corpus object
qtd_speeches_corp <- corpus(full_data, docid_field = "id", text_field = "speechContent")

# tokenize speeches
qtd_tokens <- tokens(qtd_speeches_corp,
                     remove_punct = TRUE,
                     remove_numbers = TRUE,
                     remove_symbols = TRUE,
                     padding = FALSE) %>%
  tokens_remove(stopwords("de"), padding = FALSE) %>%
  tokens_compound(pattern = phrase(c("Schwester Agnes")), concatenator = " ")

ostalgie_words <- c("Schwester Agnes", "Intershop", "Interflug", "Trabant", "Trabi", "Ostalgie")

test_kwic <- kwic(qtd_tokens, pattern = ostalgie_words, window = 5)
-
How to translate my own sentence using Attention mechanism?
For every language translation repo I have looked at, I can see the existing sentences being translated. But if I give my own sentence (even though the words of my sentence exist in the dataset), the translation I get back is an existing sentence from the dataset.
Please help me resolve this issue.
Reference link for the code https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
The reference sentence I tried to translate is shown in the screenshot attached to the original post.
-
In a Seq2Seq model, can I use BERT's last hidden state to initialize the decoder hidden state?
I built a Seq2Seq model: the encoder is a BERT model that outputs word embeddings, and the decoder is an LSTM language model that takes the word embeddings from the encoder as input and outputs a probability distribution over words. When the encoder is an LSTM, we usually take the encoder's last output to initialize the decoder. But now I don't know how to initialize the decoder hidden state. Can I take BERT's last output to initialize my decoder? Is it reasonable to do so?
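To make the question concrete, this is roughly the kind of bridge I have in mind (a sketch, not my actual code: it mean-pools BERT's last hidden state and projects it to the LSTM decoder's hidden size; the class name and sizes are made up):
import torch
import torch.nn as nn
from transformers import BertModel

class BertEncoderBridge(nn.Module):
    """Pools BERT's last hidden state and projects it to (h0, c0) for an LSTM decoder."""
    def __init__(self, dec_hidden_size, bert_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.to_h0 = nn.Linear(self.bert.config.hidden_size, dec_hidden_size)
        self.to_c0 = nn.Linear(self.bert.config.hidden_size, dec_hidden_size)

    def forward(self, input_ids, attention_mask):
        enc = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # mean-pool over the non-padding tokens only
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (enc * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        h0 = torch.tanh(self.to_h0(pooled)).unsqueeze(0)  # [1, batch, dec_hidden]
        c0 = torch.tanh(self.to_c0(pooled)).unsqueeze(0)
        # enc can feed the decoder inputs, (h0, c0) initialize its hidden state
        return enc, (h0, c0)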
-
Use 1 tokenizer or 2 tokenizers for translation task?
I've seen several tutorials about seq2seq tasks like translation. They usually use two tokenizers trained on the corpus, one for the source language and the other for the target language. However, in Hugging Face's translation task example, they use just one tokenizer for both languages. I wonder which is the better way: one tokenizer or two? If I use two tokenizers, the output vocabulary would be smaller, and it might eliminate some tokens that the target language doesn't have, thus improving the result; or is it okay to use one tokenizer and the performance stays the same? Please help me, thanks in advance!
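For what it's worth, the way I pictured comparing the two options is something like the sketch below, using the Hugging Face tokenizers library (the file names train.en / train.de are placeholders): train one joint BPE tokenizer on both languages versus one per language, then compare the resulting vocabulary sizes.
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

def train_bpe(files, vocab_size=16000):
    # train a simple whitespace-pre-tokenized BPE model on the given text files
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=vocab_size,
                                  special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"])
    tok.train(files, trainer)
    return tok

joint_tok = train_bpe(["train.en", "train.de"])  # one shared tokenizer for both languages
src_tok = train_bpe(["train.en"])                # separate source-language tokenizer
tgt_tok = train_bpe(["train.de"])                # separate target-language tokenizer
print(joint_tok.get_vocab_size(), src_tok.get_vocab_size(), tgt_tok.get_vocab_size())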