Dropout

Srivastava et al. (2014) applied dropout to feedforward neural networks and RBMs, and noted that a dropout probability of around 0.5 for hidden units and 0.2 for inputs worked well across a variety of tasks.

Reference: A review of Dropout as applied to RNNs

In my experience, 0.5 for hidden units and 0.2 for inputs works well, but this does not hold for the decoder. In the decoder, I would suggest not using dropout at all.
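
A minimal PyTorch sketch of this setup, with 0.2 dropout on the inputs and 0.5 on the hidden units of the encoder, and no dropout in the decoder. The architecture and dimensions here are assumptions for illustration, not taken from the papers above:

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Hypothetical encoder-decoder; dropout only on the encoder side."""
    def __init__(self, input_dim=128, hidden_dim=256, output_dim=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Dropout(p=0.2),                  # input dropout
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=0.5),                  # hidden-unit dropout
            nn.Linear(hidden_dim, hidden_dim),
        )
        # The decoder deliberately has no dropout layers.
        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```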

Batch Size

I suggest a batch size of 32 or less; see Efficient Mini-batch Training for Stochastic Optimization and this RNN study.
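
In PyTorch this is just the `batch_size` argument of the `DataLoader`. A toy example with made-up data shapes:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical dataset; the point is simply keeping batch_size <= 32.
dataset = TensorDataset(torch.randn(1000, 128), torch.randint(0, 10, (1000,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)
```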

Accumulation_steps

Gradient accumulation serves the same function as a larger batch size, but it can be used when GPU memory is not enough: gradients are accumulated over several small batches, and the optimizer steps only once per `accumulation_steps` batches.
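
A minimal sketch of a training loop with gradient accumulation, reusing the `EncoderDecoder` model and `loader` from the sketches above (the model, loss, optimizer, and `accumulation_steps=4` are all assumptions for illustration):

```python
import torch
import torch.nn as nn

model = EncoderDecoder(input_dim=128, hidden_dim=256, output_dim=10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

accumulation_steps = 4  # effective batch size = 32 * 4 = 128

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = criterion(model(x), y)
    # Scale the loss so the accumulated gradient matches one large batch.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()      # update once every accumulation_steps batches
        optimizer.zero_grad()
```

This trades extra forward/backward passes for memory: each small batch fits on the GPU, while the weight update behaves like one large batch.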