Performance evaluation in reinforcement learning: taking the mean over the last 10 testing epochs

Reference:

https://www.cnblogs.com/devilmaycry812839668/p/17813337.html

The Actor-Mimic and expert DQN training curves for 100 training epochs for each of the 8 games. A training epoch is 250,000 frames and for each training epoch we evaluate the networks with a testing epoch that lasts 125,000 frames. We report AMN and expert DQN test reward for each testing epoch and the mean and max of DQN performance. The max is calculated over all testing epochs that the DQN experienced until convergence while the mean is calculated over the last ten epochs before the DQN training was stopped.

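Performance evaluation in reinforcement learning differs somewhat from evaluation in other AI methods. Other methods typically run the performance test only after training has finished, so training and testing are two separate, independent processes. Reinforcement learning usually does not work this way: the testing process is interleaved with the training process. Concretely: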

Suppose a reinforcement learning run trains for 100 epochs and each epoch contains 250,000 frames. If the batch size is 100, then one epoch contains 2,500 batches, which means 2,500 parameter updates per epoch.
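The arithmetic above can be checked with a few lines of Python. This is only a sketch: the constant and function names are made up for illustration, and it assumes every frame is consumed exactly once in a batch, which is the simplification used in the text (agents that sample from a replay buffer may update on a different schedule).

```python
# Illustrative numbers taken from the text; all names here are hypothetical.
FRAMES_PER_TRAIN_EPOCH = 250_000
BATCH_SIZE = 100
NUM_EPOCHS = 100


def updates_per_epoch(frames_per_epoch: int, batch_size: int) -> int:
    """Parameter updates per epoch, assuming each frame is used once in one batch."""
    return frames_per_epoch // batch_size


per_epoch = updates_per_epoch(FRAMES_PER_TRAIN_EPOCH, BATCH_SIZE)
total_updates = per_epoch * NUM_EPOCHS

print(per_epoch)      # 2500
print(total_updates)  # 250000
```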

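Because testing and training are interleaved, a test is run after each training epoch. Each test lasts 125,000 frames, and the sum of the rewards collected over those 125,000 frames is taken as the test result. The sum can also be divided by 125,000 to normalize it.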
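The interleaving of training epochs and testing epochs can be sketched as a simple loop. This is a minimal illustration, not the procedure from any particular codebase: `train_one_epoch`, `policy` and `env_step` are hypothetical stand-ins supplied by the caller, episode resets are ignored, and the demo uses tiny frame counts so it runs instantly (the text uses 100 training epochs and 125,000 test frames).

```python
def run_testing_epoch(policy, env_step, num_frames):
    """Run the policy for num_frames frames.

    Returns (reward_sum, reward_per_frame); the second value is the
    optional normalization mentioned in the text.
    """
    reward_sum = 0.0
    for _ in range(num_frames):
        action = policy()
        reward_sum += env_step(action)
    return reward_sum, reward_sum / num_frames


def train_with_interleaved_tests(train_one_epoch, policy, env_step,
                                 num_epochs, test_frames):
    """Run one testing epoch after every training epoch and collect the results."""
    test_results = []
    for epoch in range(num_epochs):
        train_one_epoch(epoch)  # placeholder for the real training epoch
        reward_sum, _ = run_testing_epoch(policy, env_step, test_frames)
        test_results.append(reward_sum)
    return test_results


# Toy stand-ins (assumptions for the demo only).
def toy_train_one_epoch(epoch):
    pass  # a real agent would update its parameters here


def toy_policy():
    return 1  # always choose action 1


def toy_env_step(action):
    return 1.0 if action == 1 else 0.0  # reward 1 per frame for action 1


results = train_with_interleaved_tests(
    toy_train_one_epoch, toy_policy, toy_env_step,
    num_epochs=3, test_frames=1000,
)
print(results)  # [1000.0, 1000.0, 1000.0]
```

Each entry of `results` is one testing-epoch value, so the list grows by one element per training epoch.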

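The key question is how to compute the max and mean from the test results gathered during training. The approach here takes the largest testing-epoch value seen during training (each value being the sum of rewards over the 125,000 frames of one testing epoch) as the max. The max is easy to obtain, but there is no single agreed way to define the mean. The main contribution here is a fairly objective way to compute it: average the last 10 test results of the run. In other words, the mean for the whole training run is the average of the last 10 testing-epoch values, each being the sum of rewards over 125,000 frames.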
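Given the list of testing-epoch values, the max and the last-10 mean described above take only a few lines to compute. The function name `summarize_test_results` and the sample data are made up for illustration; the sketch also assumes that when fewer than 10 results exist, the mean is taken over whatever is available.

```python
def summarize_test_results(test_results, last_k=10):
    """Return (max over all testing epochs, mean over the last `last_k` testing epochs)."""
    if not test_results:
        raise ValueError("test_results must contain at least one value")
    tail = test_results[-last_k:]  # the last k values (or fewer if the run is short)
    return max(test_results), sum(tail) / len(tail)


# Made-up data: 100 testing epochs with values 1.0, 2.0, ..., 100.0.
fake_results = [float(i) for i in range(1, 101)]
best, last10_mean = summarize_test_results(fake_results)
print(best)         # 100.0
print(last10_mean)  # 95.5
```

Averaging only the tail of the run reports performance near the end of training, while the max records the single best testing epoch seen at any point.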