Evaluate neural network accuracy over several inputs (x) that share the same label (y)

I am using a CNN model to predict the speaker of a piece of text. The model's input is a single word, but each sample is really a whole sentence spoken by one person, so every word in a sentence shares the same label. I want the model's accuracy to be calculated as follows: predict the speaker for each word, find the speaker that is predicted most often across the sentence's words, and use that as the prediction for the sentence.
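To make the intended metric concrete, here is a minimal sketch of the manual version in Python with NumPy. Everything in it is an assumption for illustration: the function name `sentence_accuracy`, the array layout (one row of per-speaker scores per word, e.g. the output of the model's predict call), and the `sentence_ids` array that maps each word to its sentence are hypothetical and not tied to any particular framework.

```python
import numpy as np

def sentence_accuracy(word_probs, sentence_ids, sentence_labels):
    """Majority-vote accuracy over sentences (hypothetical helper).

    word_probs:      (n_words, n_speakers) per-word scores from the model
    sentence_ids:    (n_words,) index of the sentence each word belongs to
    sentence_labels: (n_sentences,) true speaker of each sentence
    """
    word_probs = np.asarray(word_probs)
    sentence_ids = np.asarray(sentence_ids)
    sentence_labels = np.asarray(sentence_labels)

    # Step 1: predicted speaker for every word.
    word_preds = word_probs.argmax(axis=1)

    # Step 2: count votes per (sentence, speaker) pair.
    votes = np.zeros((len(sentence_labels), word_probs.shape[1]), dtype=int)
    np.add.at(votes, (sentence_ids, word_preds), 1)

    # Step 3: most frequent speaker per sentence
    # (ties resolve to the lowest speaker index).
    sentence_preds = votes.argmax(axis=1)

    # Step 4: fraction of sentences whose voted speaker matches the label.
    return float(np.mean(sentence_preds == sentence_labels))


# Toy data: 5 words, 2 speakers, 2 sentences (made up for illustration).
word_probs = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.2], [0.3, 0.7], [0.2, 0.8]]
sentence_ids = [0, 0, 0, 1, 1]
acc = sentence_accuracy(word_probs, sentence_ids, [0, 1])
```

With this toy data, sentence 0 gets two votes for speaker 0 and one for speaker 1, and sentence 1 gets two votes for speaker 1, so both sentences are classified correctly.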

Is there a good way to do this, rather than running the predictions manually and calculating the accuracy myself?