What can be the cause of the validation loss increasing and the accuracy remaining constant to zero while the train loss decreases?

I am trying to solve a multiclass text classification problem. Due to specific project requirements I am using skorch (https://skorch.readthedocs.io/en/stable/index.html) to wrap pytorch for the sklearn pipeline. What I am trying to do is fine-tune a pretrained BERT from Huggingface (https://huggingface.co) on my dataset. I have tried, to the best of my knowledge, to follow skorch's instructions on how to format my input data, structure the model, and so on. Still, during training the train loss decreases until the 8th epoch, where it starts fluctuating, while the validation loss increases from the very beginning and the validation accuracy remains constant at zero. My pipeline setup is

    from sklearn.pipeline import Pipeline

    pipeline = Pipeline([
        ("tokenizer", Tokenizer()),
        ("classifier", _get_new_transformer()),
    ])
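(Note that `Pipeline` expects a *list* of `(name, estimator)` tuples. A minimal sklearn-only sketch with stub steps, just to show the layout I am using:

```python
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.dummy import DummyClassifier
import numpy as np

class IdentityTransformer(BaseEstimator, TransformerMixin):
    """Stand-in for the Tokenizer step: passes features through unchanged."""
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return X

# Pipeline takes a list of (name, estimator) tuples
pipe = Pipeline([
    ("tokenizer", IdentityTransformer()),
    ("classifier", DummyClassifier(strategy="most_frequent")),
])
X = np.arange(8).reshape(4, 2)
y = np.array([0, 1, 1, 1])
pipe.fit(X, y)
print(pipe.predict(X))  # most_frequent -> predicts 1 for every sample
```

`IdentityTransformer` and `DummyClassifier` are stand-ins here, not part of my actual code.)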

The `"tokenizer"` step is a custom class that preprocesses my dataset, tokenizing it for BERT and creating the attention masks. It looks like this

    import torch
    from transformers import AutoTokenizer, AutoModel
    from torch import nn
    import torch.nn.functional as F
    from sklearn.base import BaseEstimator, TransformerMixin
    from tqdm import tqdm
    import numpy as np

    class Tokenizer(BaseEstimator, TransformerMixin):
        def __init__(self):
            super(Tokenizer, self).__init__()

            self.tokenizer = AutoTokenizer.from_pretrained("/path/to/model")

        def _tokenize(self, X, y=None):
            tokenized = self.tokenizer.encode_plus(X, max_length=20, add_special_tokens=True, pad_to_max_length=True)
            tokenized_text = tokenized['input_ids']
            attention_mask = tokenized['attention_mask']
            return np.array(tokenized_text), np.array(attention_mask)

        def fit(self, X, y=None):
            return self

        def transform(self, X, y=None):
            # tokenize each string once, then split the ids and masks into two arrays
            encoded = [self._tokenize(string) for string in tqdm(X)]
            word_tokens = np.array([ids for ids, _ in encoded])
            attention_tokens = np.array([mask for _, mask in encoded])
            return word_tokens, attention_tokens

        def fit_transform(self, X, y=None, **fit_params):
            return self.fit(X, y).transform(X, y)
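To illustrate the data layout this transformer hands to skorch, here is a self-contained sketch with a fake `encode_plus` stand-in (no transformers download needed; `fake_encode_plus` is hypothetical, mimicking the padded ids/mask dict the real tokenizer returns):

```python
import numpy as np

# Hypothetical stand-in for tokenizer.encode_plus: truncates a whitespace-split
# string, adds fake [CLS]/[SEP] ids, pads to max_length, and builds the mask.
def fake_encode_plus(text, max_length=20):
    ids = [hash(w) % 30000 for w in text.split()][:max_length - 2]
    ids = [101] + ids + [102]
    mask = [1] * len(ids)
    pad = max_length - len(ids)
    return {"input_ids": ids + [0] * pad, "attention_mask": mask + [0] * pad}

texts = ["hello world", "skorch wraps pytorch for sklearn"]
word_tokens = np.array([fake_encode_plus(t)["input_ids"] for t in texts])
attention = np.array([fake_encode_plus(t)["attention_mask"] for t in texts])
X = word_tokens, attention  # the tuple transform() returns
print(word_tokens.shape, attention.shape)  # (2, 20) (2, 20)
```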

then I initialize the model I want to fine-tune as

    class Transformer(nn.Module):
        def __init__(self, num_labels=213, dropout_proba=.1):
            super(Transformer, self).__init__()

            self.num_labels = num_labels
            self.model = AutoModel.from_pretrained("/path/to/model")
            self.dropout = torch.nn.Dropout(dropout_proba)
            self.classifier = torch.nn.Linear(768, num_labels)

        def forward(self, X, **kwargs):
            # X is the (input_ids, attention_masks) tuple produced by the Tokenizer
            X_tokenized, attention_mask = torch.stack([x.unsqueeze(0) for x in X[0]]), \
                                          torch.stack([x.unsqueeze(0) for x in X[1]])
            # the second return value of the model is the pooled [CLS] output
            _, X = self.model(X_tokenized.squeeze(), attention_mask.squeeze())
            X = F.relu(X)
            X = self.dropout(X)
            X = self.classifier(X)
            return X
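As a sanity check on the reshaping in `forward`, the unsqueeze/stack/squeeze round trip should be equivalent to just stacking the per-sample vectors into a batch. A NumPy analogue (the shapes are what matter, so torch isn't needed to see it):

```python
import numpy as np

# Two fake "token id" vectors of length 20, standing in for one batch
rows = [np.arange(20), np.arange(20) + 1]
stacked = np.stack([r[None, :] for r in rows])  # shape (2, 1, 20)
batch = stacked.squeeze()                       # shape (2, 20)
assert np.array_equal(batch, np.stack(rows))    # same as stacking directly
print(batch.shape)  # (2, 20)
```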

I initialize the model and create the classifier with skorch as follows

    from skorch import NeuralNetClassifier
    from skorch.dataset import CVSplit
    from skorch.callbacks import ProgressBar
    import torch
    from transformers import AdamW

    def _get_new_transformer() -> NeuralNetClassifier:
        transformer = Transformer()
        net = NeuralNetClassifier(
            transformer,
            callbacks=[ProgressBar(postfix_keys=['train_loss', 'valid_loss'])],
            train_split=CVSplit(cv=2, random_state=0),
        )
        return net

and I use fit like that

    pipeline.fit(X=dataset.training_samples, y=dataset.training_labels)

in which my training samples are lists of strings and my labels are an array containing the index of each class, as pytorch requires.
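For concreteness, this is how I build the label array from string labels (the labels shown are made-up examples; `LabelEncoder` assigns indices in sorted order of the class names):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels -> integer class indices, as pytorch expects
labels = ["sports", "politics", "sports", "tech"]
y = LabelEncoder().fit_transform(labels).astype(np.int64)
print(y)        # [1 0 1 2]  (politics=0, sports=1, tech=2)
print(y.dtype)  # int64
```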

This is a sample of what happens

[screenshot: training history]

I have tried training only the fully connected layer while keeping BERT frozen, but I had the same issue. I also tested the train accuracy after the training process, and it was only 0.16%. I would be grateful for any advice or insight on how to solve my problem! I am pretty new to skorch and not so comfortable with pytorch yet, and I believe I am missing something really simple. Thank you very much in advance!