1. Introduction

When different fields come together, confusion often arises around the terms used. The same concept may go by different names, or a term may be imported along with a process that no longer fully fits its original definition. The development of deep reinforcement learning has been no different and comes with similar terminology problems.

In this tutorial, we discuss the difference between episodes and epochs, a confusion that also extends to the related concepts of steps and batches. Although these differences are minor, they play an important role in understanding the training process and in comparing deep reinforcement learning algorithms.

2. Training a Deep Neural Network: Supervised Learning

In deep supervised learning, the training process typically involves dividing the available data into batches. Each batch consists of a fixed number of input-output pairs, and the training algorithm iterates over these batches, processing one at a time.

An epoch in deep supervised learning refers to a complete pass through the entire dataset, where the algorithm has iterated over all the batches once. After each epoch, the model parameters may be saved, and the training process can continue with the next epoch.

During each iteration, the algorithm computes the error, or loss, between the model’s predicted output and the true output for each input in a batch. It then uses an optimizer, such as stochastic gradient descent, to update the model’s parameters and reduce the loss. This is often called an update step, and optimizers commonly expose a step function for it.
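
To make these terms concrete, here is a minimal sketch of such a training loop in PyTorch. The dataset, model, and hyper-parameters are placeholders chosen purely for illustration:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: 1,000 input-output pairs (purely illustrative)
inputs = torch.randn(1000, 10)
targets = torch.randn(1000, 1)
loader = DataLoader(TensorDataset(inputs, targets), batch_size=32, shuffle=True)

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(5):                          # one epoch = one full pass over the dataset
    for batch_inputs, batch_targets in loader:  # one iteration = one batch
        loss = loss_fn(model(batch_inputs), batch_targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                        # one update step of the parameters
    # After each epoch, the parameters may be saved (checkpointed)
    torch.save(model.state_dict(), f"checkpoint_epoch_{epoch}.pt")
```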

3. Training a Deep Neural Network: Reinforcement Learning

Training a deep reinforcement learning agent involves having it interact with its environment by taking actions based on its current state and receiving rewards from the environment. The agent’s interactions with the environment are organized into episodes, which consist of a sequence of steps.

During each step, the agent receives an observation of the environment’s current state and takes an action based on that observation. The environment then transitions to a new state, and the agent receives a reward for its action.
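
The following is a minimal sketch of this interaction loop using the Gymnasium API; the environment and the random action selection are stand-ins for a real task and a learned policy:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")   # placeholder environment

observation, info = env.reset()
episode_return = 0.0
done = False

while not done:                          # one episode = a sequence of steps
    action = env.action_space.sample()   # stand-in for a learned policy
    # One step: act, observe the new state, and receive a reward
    observation, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    done = terminated or truncated

env.close()
print(f"Episode return: {episode_return}")
```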

Training aims to learn a policy that maximizes the cumulative reward. To do so, multiple episodes are played and grouped together to update the model. Depending on the type of task, an episode may end at a terminal state or after a fixed number of steps in the environment. The collected interactions form a dataset, but unlike a traditional dataset, it is dynamic: it changes throughout the learning process as the policy changes.

When training an on-policy policy gradient algorithm, the sampled data can only be used once, since it is only valid for the policy that generated it. In this case, an epoch is one pass through the freshly generated data, often called a policy iteration. The amount of data generated stays roughly the same size from one sampling period to the next.
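
To illustrate this sample-then-update cycle, here is a minimal REINFORCE-style sketch. This is our own simplified example: the environment, network size, episode count, and hyper-parameters are all chosen arbitrarily:

```python
import torch
from torch import nn
from torch.distributions import Categorical
import gymnasium as gym

env = gym.make("CartPole-v1")   # placeholder environment
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for iteration in range(50):     # one policy iteration
    log_probs, returns = [], []

    # 1. Sample fresh data with the *current* policy
    for _ in range(5):          # a handful of episodes per iteration
        obs, _ = env.reset()
        ep_log_probs, ep_rewards, done = [], [], False
        while not done:
            dist = Categorical(logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
            action = dist.sample()
            obs, reward, terminated, truncated, _ = env.step(action.item())
            ep_log_probs.append(dist.log_prob(action))
            ep_rewards.append(reward)
            done = terminated or truncated

        # Discounted returns-to-go for this episode
        ep_returns, g = [], 0.0
        for r in reversed(ep_rewards):
            g = r + gamma * g
            ep_returns.append(g)
        returns.extend(reversed(ep_returns))
        log_probs.extend(ep_log_probs)

    # 2. One pass over the generated data: a single policy-gradient update
    loss = -(torch.stack(log_probs) * torch.tensor(returns)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # 3. The sampled data is now stale and is discarded before the next iteration
```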

When training an off-policy, value-function-based method, we might sample a batch from replay memory at a fixed update period set as a hyper-parameter. To maintain a perspective similar to that of the supervised learning setting, an epoch can be considered to have passed after a fixed number of batched update steps to the model. This is what was done in the well-known paper “Playing Atari with Deep Reinforcement Learning”.
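
The following sketch shows only this accounting; the transition contents and the Q-network update are stubbed out, and the numbers are chosen arbitrarily rather than taken from the paper:

```python
import random
from collections import deque

# Bookkeeping sketch only: the point here is how steps and "epochs" are counted
replay_memory = deque(maxlen=100_000)   # stores (s, a, r, s', done) tuples
batch_size = 32
update_period = 4            # environment steps between update steps (assumed)
updates_per_epoch = 10_000   # update steps counted as one "epoch" (assumed)

env_steps = 0
update_steps = 0

def q_update(batch):
    """Placeholder for one gradient step on the Q-network."""
    pass

while update_steps < 3 * updates_per_epoch:       # train for three "epochs"
    transition = (None, None, 0.0, None, False)   # placeholder transition
    replay_memory.append(transition)
    env_steps += 1

    # Every fixed update period, sample a batch from replay memory and update
    if env_steps % update_period == 0 and len(replay_memory) >= batch_size:
        q_update(random.sample(replay_memory, batch_size))
        update_steps += 1

        # A fixed number of batched update steps counts as one epoch
        if update_steps % updates_per_epoch == 0:
            print(f"epoch {update_steps // updates_per_epoch} complete")
```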

4. Highlighting the Differences

To summarize, an episode is a sequence of interactions, called steps, between an agent and its environment, while an epoch is a complete pass over the training dataset. In reinforcement learning, an epoch typically corresponds to a fixed number of episodes played with the current policy, or to a fixed number of update steps when using a value-based method. Each episode, in turn, consists of steps.


5. Conclusion

The confusion in terminology has developed from the merging of two fields that are very similar but not the same. Reusing familiar terms improves cross-domain appeal but also causes some confusion. Usually, authors who use these terms state how they have defined them, whether in the main text, a footnote, or an appendix.

Here, we have tried to make the general differences clear by highlighting where the terms come from, their original meanings, and how they have evolved. If in doubt, however, you can always consult the original source.
