This post assumes the reader already understands basic state and action representations, and gives a brief comparison of several reinforcement learning methods.

References include Lilian Weng's A (Long) Peek into Reinforcement Learning, as well as the course slides from CentraleSupelec, where the author studies.

Policy definition

A policy maps a given state to an action.
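In standard notation, a stochastic policy is a conditional distribution over actions, and a deterministic policy is a mapping from states to actions:

$$
\pi(a \mid s)=\mathbb{P}\left[A_t=a \mid S_t=s\right], \qquad a=\pi(s)
$$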

DP methods

Dynamic Programming: the environment is fully known, i.e. we know exactly what return each state and action yields.

Finding the policy is split into two steps:

  • Policy evaluation: estimate $v_\pi$
  • Policy improvement: generate $\pi' \geq \pi$ (e.g. greedy with respect to $v_\pi$)

The GPI (Generalised Policy Iteration) algorithm can be written as:

For step in [K1, K2, K3, ...]:
    Policy evaluation:
        stop when V_{t+1}(s) = V_t(s)
    Policy improvement:
        make the policy greedy with respect to the current V

[Figure: policy_iteration]
As shown in the figure, the left side is policy iteration: run policy evaluation for K steps or until convergence (i.e. until the true value of the current policy is reached), then perform policy improvement.

The right-hand side is value iteration, which improves the policy at every step. Sarsa and Q-learning are similar in spirit.
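As a minimal sketch of the value-iteration side, assuming a small tabular MDP described by a hypothetical transition table `P[s][a]` of `(prob, next_state, reward)` tuples (this interface and the hyper-parameters are illustrative, not from the original post):

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, tol=1e-6):
    """Tabular value iteration on a fully known MDP.

    P[s][a] is assumed to be a list of (prob, next_state, reward) tuples.
    """
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            # Bellman optimality backup: one-step lookahead over all actions.
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            V_new[s] = max(q)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    # Greedy policy with respect to the converged values.
    policy = [int(np.argmax([sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                             for a in range(n_actions)]))
              for s in range(n_states)]
    return V, policy
```

Policy iteration would instead run the evaluation loop to convergence for the current policy before each improvement step.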

Sample Backup Methods

Traditional DP solution: full-width backups.
New approach: sample backups, which use averages over sampled transitions instead of the model's expected return.
Advantages: model-free; breaks the curse of dimensionality.

Monte-Carlo

Use sampled trajectories to approximate the expectation above: estimate the value as the average of the sampled returns.
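Concretely, the Monte-Carlo estimate can be written as an incremental update towards the sampled return $G_t$:

$$
V\left(S_t\right) \leftarrow V\left(S_t\right)+\alpha\left(G_t-V\left(S_t\right)\right)
$$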

Temporal-Difference Learning:

Replacing the Monte-Carlo target $G_t$ with the bootstrapped estimate $R_{t+1}+\gamma V\left(S_{t+1}\right)$ gives the TD(0) update:

$$
V\left(S_t\right) \leftarrow V\left(S_t\right)+\alpha\left(R_{t+1}+\gamma V\left(S_{t+1}\right)-V\left(S_t\right)\right)
$$

Comparison of the three methods

Method   bootstrap   bias   variance   backup
DP       yes         -      -          full-width, shallow
MC       no          no     high       with samples, deep
TD       yes         yes    small      with samples, shallow

On/Off-policy

Both Sarsa and Q-learning are extensions of TD learning:

On-policy

The sampled trajectories are all generated by the latest (current) policy.

Off-policy

off-policy learning means learning about one policy from experience generated according to a different policy

The sampled trajectories are not generated by the latest policy.
The target policy is learned from data collected by earlier behaviour policies (for example, corrected with importance sampling).

Importance sampling correction
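A brief sketch of the standard correction (ordinary importance sampling, not spelled out in the original): returns sampled from a behaviour policy $b$ are re-weighted by the likelihood ratio between the target policy $\pi$ and $b$ along the trajectory,

$$
\rho_{t:T-1}=\prod_{k=t}^{T-1} \frac{\pi\left(A_k \mid S_k\right)}{b\left(A_k \mid S_k\right)}, \qquad V\left(S_t\right) \leftarrow V\left(S_t\right)+\alpha\left(\rho_{t:T-1} G_t-V\left(S_t\right)\right)
$$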

Q-Learning: Off-policy TD control

Off-policy means the behaviour policy and the evaluation (target) policy are different. In Q-learning the evaluation policy is greedy, which is easy to see by comparing the update formulas:

$$
Q\left(S_t, A_t\right) \leftarrow Q\left(S_t, A_t\right)+\alpha\left(R_{t+1}+\gamma \max_{a \in \mathcal{A}} Q\left(S_{t+1}, a\right)-Q\left(S_t, A_t\right)\right)
$$
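A minimal tabular sketch of this update, assuming a Gymnasium-style environment `env` with discrete state and action spaces (the environment interface and hyper-parameters are illustrative assumptions):

```python
import numpy as np

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: epsilon-greedy behaviour policy, greedy target policy."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy with respect to the current Q.
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # Target policy: greedy (max over next actions), hence off-policy.
            target = r + gamma * np.max(Q[s_next]) * (not terminated)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```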

Double Q-learning

Introduced to address the overestimation problem of Q-learning.
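The core idea, sketched here from the standard Double Q-learning formulation (not spelled out in the original note), is to keep two estimates $Q_1$ and $Q_2$, selecting the action with one and evaluating it with the other:

$$
Q_1\left(S_t, A_t\right) \leftarrow Q_1\left(S_t, A_t\right)+\alpha\left(R_{t+1}+\gamma Q_2\left(S_{t+1}, \arg\max_a Q_1\left(S_{t+1}, a\right)\right)-Q_1\left(S_t, A_t\right)\right)
$$

with the roles of $Q_1$ and $Q_2$ swapped at random on each step.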

To be continued.

SARSA: On-Policy TD control

Sarsa is a method analogous to policy iteration.
The next action $a'$ is obtained with an $\epsilon$-greedy policy (the behaviour policy and the evaluation policy are the same).
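For comparison with the Q-learning rule above, the Sarsa update replaces the max with the action $A_{t+1}$ actually taken by the same $\epsilon$-greedy policy:

$$
Q\left(S_t, A_t\right) \leftarrow Q\left(S_t, A_t\right)+\alpha\left(R_{t+1}+\gamma Q\left(S_{t+1}, A_{t+1}\right)-Q\left(S_t, A_t\right)\right)
$$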

DL-based RL

Drawback of table-based methods: hard to scale to large state spaces.

The value function $\hat{q}(s, a; \mathbf{w})$ is a parametrised function (e.g. a neural network with weights $\mathbf{w}$).

DQN

A parametrised function (network) can represent:

  • Value, Q function
  • Policy
  • Model

s -[w]-> $\hat{q}\left(s, a_1; \mathbf{w}\right), \ldots, \hat{q}\left(s, a_m; \mathbf{w}\right)$ (state in, one Q-value per action out, as in DQN)

s, a -[w]-> $\hat{q}(s, a; \mathbf{w})$ (state-action pair in, a single Q-value out)

Off-policy: the greedy action is used to compute the target.

Value Function Approximation

Q-learning is guaranteed to converge in the tabular case, but not when combined with a neural network.

Problems:

  • Correlations between samples (especially consecutive states)
  • Non-stationary targets

Solutions (a code sketch follows below):

  • Experience replay (sample from stored historical transitions)
  • Fixed target network
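A minimal PyTorch-style sketch of how the two fixes enter a single training step (the network sizes, buffer, and variable names are illustrative assumptions, not the original DQN code):

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Illustrative online and target networks (4-dimensional states, 2 actions).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())  # fixed target: a frozen copy

optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay_buffer = deque(maxlen=100_000)  # experience replay: stored transitions
gamma = 0.99

def train_step(batch_size=32):
    # Sampling uniformly from the buffer breaks correlations between
    # consecutive transitions.
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s_next, done = zip(*batch)
    s = torch.tensor(s, dtype=torch.float32)
    s_next = torch.tensor(s_next, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64)
    r = torch.tensor(r, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # The frozen target network keeps the regression target (nearly)
        # stationary between periodic synchronisations.
        target = r + gamma * target_net(s_next).max(1).values * (1 - done)

    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Periodically: target_net.load_state_dict(q_net.state_dict())
```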

Experience replay is not the only solution to the correlation problem:

  • online algorithms: eligibility traces
  • planning with learned models: Dyna-Q
  • changing the problem setting: parallel environments

Deadly triad?

Combining experience replay + bootstrapping + deep learning is, in theory, supposed to diverge, but in practice it usually does not diverge unboundedly: the values may blow up temporarily and then recover. This is called soft divergence.

Problem: soft divergence can leave the value estimates poor for an extended period.

Solution: multiple target networks.

Policy Gradients

Comparison

  • Model-based RL: computing the policy from the learned model is hard and expensive

  • Value-based RL: not the true objective; a small value error can lead to a large policy error

  • Policy-based RL: the right objective, but it can get stuck in local optima; the learned knowledge can be task-specific and does not always generalise well

Stochastic policy

Unlike an MDP, for which a deterministic optimal policy always exists, most problems are not fully observable. Stochastic policies add a small amount of randomness and give a smoother search space.

Contextual Bandits

A one-step problem: the score-function (likelihood-ratio) trick makes the objective differentiable with respect to the policy parameters:
$$
\nabla_{\boldsymbol{\theta}} \mathbb{E}_{\pi_{\boldsymbol{\theta}}}[R(S, A)]=\mathbb{E}_{\pi_{\boldsymbol{\theta}}}\left[R(S, A) \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(A \mid S)\right]
$$

Policy Gradient Theorem

$$
\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})=\mathbb{E}_{\pi_{\boldsymbol{\theta}}}\left[\sum_{t=0}^T \gamma^t q_{\pi_{\boldsymbol{\theta}}}\left(S_t, A_t\right) \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}\left(A_t \mid S_t\right) \mid S_0 \sim d_0\right]
$$

where

$$
\begin{aligned}
q_\pi(s, a) & =\mathbb{E}_\pi\left[G_t \mid S_t=s, A_t=a\right] \\
& =\mathbb{E}_\pi\left[R_{t+1}+\gamma q_\pi\left(S_{t+1}, A_{t+1}\right) \mid S_t=s, A_t=a\right]
\end{aligned}
$$

Drawback: high-variance gradient estimates, and an overly large update step can collapse the policy.

Policy gradient variations

Solution: Kullback-Leibler divergence. Constrain the update so that the KL divergence between the current policy and the newly learned policy stays small,

then maximise the objective subject to that constraint (the idea behind TRPO and PPO below).

REINFORCE

Use Monte-Carlo rollouts to collect trajectories, then use the sampled returns $G_t$ for the gradient update.
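A minimal PyTorch sketch of one REINFORCE update (the network size, the Gymnasium-style environment interface, and the absence of a baseline are illustrative choices, not from the post):

```python
import torch
import torch.nn as nn

# Illustrative policy network for a small discrete-action environment.
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

def reinforce_episode(env):
    """Collect one trajectory with the current policy and apply one gradient update."""
    log_probs, rewards = [], []
    s, _ = env.reset()
    done = False
    while not done:
        logits = policy(torch.as_tensor(s, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        a = dist.sample()
        log_probs.append(dist.log_prob(a))  # log pi_theta(A_t | S_t)
        s, r, terminated, truncated, _ = env.step(a.item())
        rewards.append(float(r))
        done = terminated or truncated

    # Monte-Carlo returns G_t, computed backwards along the trajectory.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # Gradient ascent on E[G_t * log pi(A_t | S_t)] == descent on its negative.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```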

Actor-Critic: combining the two parts

It updates both the policy (the actor, via the policy-gradient expectation) and the value function (the critic).
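A minimal one-step (TD) actor-critic sketch, under the same illustrative assumptions as the REINFORCE example; using a separate small critic network is a design choice here, not something prescribed by the post:

```python
import torch
import torch.nn as nn

# Illustrative actor (policy) and critic (state-value) networks.
actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # logits of pi(a|s)
critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))  # V(s)
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)
gamma = 0.99

def actor_critic_step(s, a, r, s_next, done):
    """One TD(0) actor-critic update from a single transition (s, s_next: float tensors)."""
    v = critic(s).squeeze(-1)
    with torch.no_grad():
        target = r + gamma * critic(s_next).squeeze(-1) * (1.0 - done)
    td_error = target - v  # doubles as a one-step advantage estimate

    # Critic: regress V(s) towards the bootstrapped target.
    critic_loss = td_error.pow(2)
    # Actor: policy gradient weighted by the (detached) advantage estimate.
    dist = torch.distributions.Categorical(logits=actor(s))
    actor_loss = -td_error.detach() * dist.log_prob(a)

    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()
```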


Asynchronous Advantage Actor-Critic (A3C)

Asynchronous: multiple threads. Advantage: the updates use the advantage function.
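For reference, the advantage is defined as $A(s, a)=q(s, a)-v(s)$; a common one-step estimate is

$$
A\left(S_t, A_t\right) \approx R_{t+1}+\gamma V\left(S_{t+1}\right)-V\left(S_t\right)
$$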



Gradients are computed asynchronously in each thread, summed, and then applied to the shared parameters in a synchronised update.

Many variants

TRPO: Trust Region Policy Optimization

PPO: Proximal Policy Optimization

DDPG: Deep Deterministic Policy Gradient

  • can only be used for environments with continuous action spaces

A2C: Advantage Actor-Critic

  • uses advantage estimates rather than bootstrapped value estimates in the update

A3C: Asynchronous Advantage Actor-Critic

  • faster convergence, thanks to many asynchronous workers collecting experience in parallel

SAC: Soft Actor-Critic

  • optimizes a stochastic policy in an off-policy way
  • incorporates the clipped double-Q trick
  • uses entropy regularization (encouraging a noisier, more diverse policy)