Calculating the gradients of the last state with respect to the initial state of a GRU, and understanding gradient tensor shapes in TensorFlow

I have the following model in TensorFlow:

def output_layer(input_layer, num_labels):
    """
    :param input_layer: 2D tensor
    :param num_labels: int. How many output labels in total? (10 for cifar10 and 100 for cifar100)
    :return: output layer Y = WX + B
    """
    input_dim = input_layer.get_shape().as_list()[-1]
    fc_w = create_variables(name='fc_weights', shape=[input_dim, num_labels])  # initializer argument not shown
    fc_b = create_variables(name='fc_bias', shape=[num_labels], initializer=tf.zeros_initializer())

    fc_h = tf.matmul(input_layer, fc_w) + fc_b
    return fc_h

def model(input_features):

    with tf.variable_scope("GRU_Layer1"):
        cell1 = tf.nn.rnn_cell.GRUCell(gru1_cell_size)
        # shape=(?, 64) ... gru1_cell_size=64
        initial_state1 = tf.placeholder(shape=[None, gru1_cell_size], dtype=tf.float32, name="initial_state1")
        output1, new_state1 = tf.nn.dynamic_rnn(cell1, input_features, dtype=tf.float32, initial_state=initial_state1)

    with tf.variable_scope("GRU_Layer2"):
        cell2 = tf.nn.rnn_cell.GRUCell(gru2_cell_size)
        # shape=(?, 32)...gru2_cell_size=32
        initial_state2 = tf.placeholder(shape=[None, gru2_cell_size], dtype=tf.float32, name="initial_state2")
        output2, new_state2 = tf.nn.dynamic_rnn(cell2, output1, dtype=tf.float32, initial_state=initial_state2)

    with tf.variable_scope("output2_reshaped"):
        # before, shape: (34, 100, 32), after, shape: (34 * 100, 32)
        output2 = tf.reshape(output2, shape=[-1, gru2_cell_size])

    with tf.variable_scope("output_layer"):
        # shape: (34 * 100, 3), num_labels=3
        predictions = output_layer(output2, num_labels)
        predictions = tf.reshape(predictions, shape=[-1, 100, 3])
    return predictions, initial_state1, initial_state2, new_state1, new_state2
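
To make the tensor sizes easier to follow, here is a minimal shape walk-through of the model above. It is plain Python and does not run TensorFlow. It only assumes the sizes mentioned in this question (batch size 34, 100 time steps, 200 input features), and the variable `logits` is a name introduced here for the output of output_layer before the final reshape.

```python
# Shape walk-through for the model above (no TensorFlow needed).
# Sizes are the ones quoted in this question.
batch_size, time_steps, input_size = 34, 100, 200
gru1_cell_size, gru2_cell_size, num_labels = 64, 32, 3

input_features = (batch_size, time_steps, input_size)
output1 = (batch_size, time_steps, gru1_cell_size)   # every hidden state of GRU 1
new_state1 = (batch_size, gru1_cell_size)            # last hidden state of GRU 1
output2 = (batch_size, time_steps, gru2_cell_size)   # every hidden state of GRU 2
new_state2 = (batch_size, gru2_cell_size)            # last hidden state of GRU 2
output2_reshaped = (batch_size * time_steps, gru2_cell_size)
logits = (batch_size * time_steps, num_labels)       # hypothetical name, before reshape
predictions = (batch_size, time_steps, num_labels)

for name, shape in [
    ("input_features", input_features),
    ("output1", output1),
    ("new_state1", new_state1),
    ("output2", output2),
    ("new_state2", new_state2),
    ("output2_reshaped", output2_reshaped),
    ("logits", logits),
    ("predictions", predictions),
]:
    print(name, shape)
```

Note that the initial and final states have the same shape per layer, (34, 64) and (34, 32), which matters for any gradient passed between them.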

As the code shows, the cell size of the first GRU is 64 and the cell size of the second GRU is 32. The batch size is 34 (not important for now), and each input has 200 features. I tried computing the gradients of the loss with respect to the trainable variables with:

local_grads_and_vars = optimizer.compute_gradients(loss, tf.trainable_variables())
# Keep only the gradients, so they can later be added to the gradients back-propagated from the previous batch.
local_grads = [grad for grad, var in local_grads_and_vars]

for v in local_grads:
    print("v", v)

After printing out the grads I got the following:

v Tensor("Optimizer/gradients/GRU_Layer1/rnn/while/gru_cell/MatMul/Enter_grad/b_acc_3:0", shape=(264, 128), dtype=float32)
v Tensor("Optimizer/gradients/GRU_Layer1/rnn/while/gru_cell/BiasAdd/Enter_grad/b_acc_3:0", shape=(128,), dtype=float32)
v Tensor("Optimizer/gradients/GRU_Layer1/rnn/while/gru_cell/MatMul_1/Enter_grad/b_acc_3:0", shape=(264, 64), dtype=float32)
v Tensor("Optimizer/gradients/GRU_Layer1/rnn/while/gru_cell/BiasAdd_1/Enter_grad/b_acc_3:0", shape=(64,), dtype=float32)
v Tensor("Optimizer/gradients/GRU_Layer2/rnn/while/gru_cell/MatMul/Enter_grad/b_acc_3:0", shape=(96, 64), dtype=float32)
v Tensor("Optimizer/gradients/GRU_Layer2/rnn/while/gru_cell/BiasAdd/Enter_grad/b_acc_3:0", shape=(64,), dtype=float32)
v Tensor("Optimizer/gradients/GRU_Layer2/rnn/while/gru_cell/MatMul_1/Enter_grad/b_acc_3:0", shape=(96, 32), dtype=float32)
v Tensor("Optimizer/gradients/GRU_Layer2/rnn/while/gru_cell/BiasAdd_1/Enter_grad/b_acc_3:0", shape=(32,), dtype=float32)
v Tensor("Optimizer/gradients/output_layer/MatMul_grad/tuple/control_dependency_1:0", shape=(32, 3), dtype=float32)
v Tensor("Optimizer/gradients/output_layer/add_grad/tuple/control_dependency_1:0", shape=(3,), dtype=float32)

I was trying to understand the shapes of the gradient tensors printed from local_grads above. Here is the GRU cell from the book "Mastering TensorFlow 1.x: Advanced machine learning and deep learning concepts using TensorFlow 1.x and Keras":

[Image: GRU cell diagram]

From the GRU cell shown above I assumed that:

1- The gradient tensor of shape (264, 128) belongs to the weights that compute the activations fed into r() and u(). Since r() and u() each output 64 values, there is also a tensor of shape (128).

2- Since the output size of the first GRU is 64, I assumed that the input to the second GRU layer has size 64 + 32 (the cell size of the second GRU), which gives 96. Hence, as in point 1, the gradient tensors have shapes (96, 64) and (64).

3- Since a dense layer with output size 3 follows the second GRU layer, there are gradient tensors of shape (32, 3) and (3) for its weights and bias.

My question is: why do we also have tensors of shape (264, 64), (64) and (96, 32), (32)?
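
One way to tabulate the shapes is with a little arithmetic. The sketch below assumes the layout that, as far as I know, tf.nn.rnn_cell.GRUCell uses in TensorFlow 1.x: one kernel and bias shared by the two gates r() and u(), plus a separate kernel and bias for the candidate state, each kernel acting on the concatenation of the input and the previous state. The function name and dictionary keys are made up for illustration and are not TensorFlow identifiers.

```python
def gru_variable_shapes(input_size, num_units):
    """Expected variable shapes for one GRU cell under the assumed layout."""
    rows = input_size + num_units  # input and previous state are concatenated
    return {
        "gates_kernel": (rows, 2 * num_units),  # r() and u() computed together
        "gates_bias": (2 * num_units,),
        "candidate_kernel": (rows, num_units),  # candidate hidden state
        "candidate_bias": (num_units,),
    }

# Layer 1 sees the 200 input features; layer 2 sees the 64 outputs of layer 1.
layer1 = gru_variable_shapes(input_size=200, num_units=64)
layer2 = gru_variable_shapes(input_size=64, num_units=32)

for name, shapes in [("GRU_Layer1", layer1), ("GRU_Layer2", layer2)]:
    for key, shape in shapes.items():
        print(name, key, shape)
```

Under that assumption, the four shapes per layer line up with the four gradient tensors printed for each GRU scope, including (264, 64), (64) and (96, 32), (32).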

Second, assume that I saved the gradients after training the model on the first batch, that is, after feeding a tensor of shape (34, 100, 200) as input_features (the argument of the model function) and getting an output of shape (34 * 100, 3). How do I back-propagate these gradients on the second mini-batch?

I would like to fix the gap shown in the following image:

[Image: diagram illustrating the gap]

Instead of using synthetic gradients, I would like to back-propagate the gradients from the previous time step. So I tried something like:

prev_grads_val__ = tf.gradients([new_state1, new_state2], [initial_state1, initial_state2], grad_ys=previous_gradients)

but this fails with the following error:

ValueError: Passed 10 grad_ys for 2 ys

prev_grads_val__ should then be added to local_grads before performing the back-propagation.
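
For context on the error: tf.gradients expects grad_ys to contain one tensor per entry in ys, each shaped like the corresponding y. Here ys has two entries, so the message suggests that previous_gradients holds ten tensors, presumably one per trainable variable. To illustrate the kind of computation I am after, here is a toy sketch in plain Python, not TensorFlow. It uses a scalar tanh RNN (a stand-in, not a GRU) and hypothetical names throughout. A gradient received for the last state is pushed back to the initial state with the chain rule, then checked against a finite difference.

```python
import math

def run_rnn(h0, xs, w, u):
    """Toy scalar RNN: h_{t+1} = tanh(w * h_t + u * x_t). Returns all states."""
    states = [h0]
    for x in xs:
        states.append(math.tanh(w * states[-1] + u * x))
    return states

def backprop_to_initial_state(states, w, grad_last_state):
    """Chain rule through time: turn dL/dh_T into dL/dh_0."""
    grad = grad_last_state
    for h_next in reversed(states[1:]):
        # d h_{t+1} / d h_t = (1 - h_{t+1}^2) * w for the tanh update above
        grad = grad * (1.0 - h_next ** 2) * w
    return grad

xs = [0.5, -0.3, 0.8]        # made-up inputs for one short sequence
w, u, h0 = 0.7, 0.4, 0.1     # made-up weights and initial state
states = run_rnn(h0, xs, w, u)

upstream = 2.0               # hypothetical dL/dh_T received for the last state
grad_h0 = backprop_to_initial_state(states, w, upstream)

# Central finite difference of the last state with respect to h0, scaled by upstream.
eps = 1e-6
numeric = upstream * (
    run_rnn(h0 + eps, xs, w, u)[-1] - run_rnn(h0 - eps, xs, w, u)[-1]
) / (2 * eps)
print(grad_h0, numeric)
```

In this sketch the incoming gradient has the same shape as the state it belongs to (a scalar here). By analogy, grad_ys for [new_state1, new_state2] would need two tensors of shapes (?, 64) and (?, 32).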

Any help is much appreciated!!!