Temporal difference learning

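Temporal difference (TD) learning is a class of model-free reinforcement learning methods that learn by bootstrapping from the current estimate of the value function. These methods sample from the environment, as Monte Carlo methods do, and update the value function based on current estimates, as dynamic programming methods do.^[1]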

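Monte Carlo methods can adjust their estimates only once the final outcome is known. TD methods, by contrast, keep adjusting their parameters before the final outcome is available, so that each prediction moves toward later, more accurate predictions.^[2] This is a form of bootstrapping, as the following example illustrates: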

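Suppose you need to predict Saturday's weather and you have a model for doing so. In the standard approach, you can adjust the model only on Saturday, once the outcome is known. By Friday, however, you should already have a good idea of what Saturday's weather will be, so you can adjust the model before Saturday arrives.^[2]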
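
The contrast in the weather example can be sketched numerically. The forecast values, the outcome, and the function names below are hypothetical and chosen only for illustration. A TD-style error compares each forecast with the next one, so it is available a day later, while a Monte Carlo-style error compares each forecast with the final outcome, so it is available only on Saturday.

```python
def td_errors(predictions, outcome):
    """Differences between successive predictions; the last one uses the outcome."""
    targets = predictions[1:] + [outcome]
    return [target - p for p, target in zip(predictions, targets)]


def mc_errors(predictions, outcome):
    """Differences between each prediction and the final outcome."""
    return [outcome - p for p in predictions]


# Hypothetical forecasts of "rain on Saturday", made Monday through Friday.
forecasts = [0.50, 0.55, 0.60, 0.80, 0.90]
rained = 1.0  # assumed outcome observed on Saturday

td = td_errors(forecasts, rained)
mc = mc_errors(forecasts, rained)

for day, (d_td, d_mc) in enumerate(zip(td, mc)):
    print(f"day {day}: TD error {d_td:+.2f}, MC error {d_mc:+.2f}")
```

The TD errors telescope: summing them from any day onward gives that day's Monte Carlo error, so the two views carry the same total information but the TD errors arrive earlier.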

Temporal difference learning is also related to models of animal learning and cognition.^[3]^[4]^[5]^[6]^[7]

Mathematical formulation

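Tabular $TD(0)$ is one of the simplest TD methods and a special case of stochastic approximation. It estimates the state value function of a finite-state Markov decision process (MDP) under a policy $\pi$. Let $V^{\pi}$ denote the state value function of the MDP with states $(s_{t})_{t\in \mathbb{N}}$, rewards $(r_{t})_{t\in \mathbb{N}}$, and discount rate $\gamma$ under the policy $\pi$:^[8]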

$$V^{\pi}(s)=E_{a\sim \pi}\left\{\sum_{t=0}^{\infty}\gamma^{t}r_{t}(a_{t})\;{\Bigg |}\;s_{0}=s\right\}.$$

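For convenience, we drop the action from the notation. The resulting $V^{\pi}$ satisfies the Bellman equation (the discrete-time counterpart of the Hamilton-Jacobi-Bellman equation):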

$$V^{\pi}(s)=E_{\pi}\{r_{0}+\gamma V^{\pi}(s_{1})\mid s_{0}=s\},$$

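so $r_{0}+\gamma V^{\pi}(s_{1})$ is an unbiased estimate of $V^{\pi}(s)$. This observation motivates the following algorithm for estimating $V^{\pi}$. The algorithm starts by initializing a table $V(s)$ with arbitrary values, one for each state of the MDP, and choosing a positive learning rate $\alpha$. It then repeatedly evaluates the policy $\pi$, obtains a reward $r$, and updates the value of the old state according to the rule:^[9]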

$$V(s)\leftarrow V(s)+\alpha\,(\overbrace{r+\gamma V(s')}^{\text{The TD target}}-V(s))$$

where $s$ and $s'$ are the old and new states, respectively, and $r+\gamma V(s')$ is known as the TD target.
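
The update rule can be sketched in a few lines of Python. The environment here is an assumption made for illustration: a five-state random walk with a terminal state at each end, a reward of 1 for exiting on the right and 0 otherwise, and a fixed policy that steps left or right with equal probability. The function name `td0_random_walk` and all parameter values are hypothetical.

```python
import random


def td0_random_walk(episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    """Tabular TD(0) policy evaluation on a 5-state random walk (illustrative)."""
    rng = random.Random(seed)
    n_states = 5
    right_terminal = n_states + 1
    # Indices 0 and 6 are terminal states whose value stays 0.
    # Non-terminal values are initialized arbitrarily (here 0.5).
    V = [0.0] + [0.5] * n_states + [0.0]
    for _ in range(episodes):
        s = 3  # every episode starts in the middle state
        while s not in (0, right_terminal):
            s_next = s + rng.choice((-1, 1))           # follow the fixed policy
            r = 1.0 if s_next == right_terminal else 0.0
            td_target = r + gamma * V[s_next]          # r + gamma * V(s')
            V[s] += alpha * (td_target - V[s])         # move V(s) toward the target
            s = s_next
    return V[1:-1]


values = td0_random_walk()
print([round(v, 2) for v in values])
```

For this particular walk with $\gamma = 1$, the true values are $1/6, 2/6, \dots, 5/6$, so the printed estimates should lie close to those numbers, with some residual noise because the learning rate is held constant.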

TD-λ

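TD-λ is an algorithm created by Richard S. Sutton, building on earlier work on temporal difference learning by Arthur Lee Samuel. Its best-known application is TD-Gammon, a program developed by Gerald Tesauro that learned to play backgammon and reached the level of expert human players.^[10] The parameter $\lambda$ is the trace decay parameter, with $0\leq \lambda \leq 1$. The larger $\lambda$ is, the longer the traces last, so a larger share of the credit from a reward is assigned to more distant states and actions. With $\lambda =1$, the algorithm learns in a way that parallels Monte Carlo reinforcement learning algorithms.^[11]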
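
The role of the trace decay parameter can be sketched with a tabular, backward-view variant that uses accumulating eligibility traces. This is only one of several formulations, and it is not taken from the sources cited above. The environment is the same assumed five-state random walk used for illustration elsewhere (reward 1 for exiting on the right), and the function name and parameter values are hypothetical.

```python
import random


def td_lambda_random_walk(lam, episodes=5000, alpha=0.01, gamma=1.0, seed=0):
    """Tabular TD(lambda) with accumulating traces on a 5-state random walk."""
    rng = random.Random(seed)
    n_states = 5
    right_terminal = n_states + 1
    V = [0.0] * (n_states + 2)          # indices 0 and 6 are terminal (value 0)
    for _ in range(episodes):
        e = [0.0] * (n_states + 2)      # eligibility traces, reset every episode
        s = 3
        while s not in (0, right_terminal):
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == right_terminal else 0.0
            delta = r + gamma * V[s_next] - V[s]   # one-step TD error
            e[s] += 1.0                            # mark the visited state
            for i in range(1, n_states + 1):
                V[i] += alpha * delta * e[i]       # credit recently visited states
                e[i] *= gamma * lam                # traces fade at rate gamma * lambda
            s = s_next
    return V[1:-1]


estimates = {lam: td_lambda_random_walk(lam) for lam in (0.0, 0.5, 1.0)}
for lam, vals in estimates.items():
    print(lam, [round(v, 2) for v in vals])
```

With `lam=0.0` only the state just left is updated, which reduces to the one-step TD(0) rule. With `lam=1.0` the traces decay only through `gamma`, so every state visited in the episode shares in each TD error, which is the sense in which the behaviour approaches a Monte Carlo method.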

In neuroscience

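The TD algorithm has also received attention in neuroscience. Researchers found that the firing rate of dopamine neurons in the ventral tegmental area and the substantia nigra resembles the error function of the TD algorithm.^[3]^[4]^[5]^[6]^[7] The error function reports the difference between the estimated reward at any given state or time step and the reward actually received. The larger the error function, the larger the difference between the expected and the actual reward.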

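The behaviour of dopamine cells also resembles TD learning. In one experiment, researchers trained a monkey to associate a stimulus with a juice reward and measured the activity of its dopamine cells.^[12] Initially, the firing rate of the dopamine cells increased when the monkey received juice, indicating a difference between the expected and the actual reward. As training progressed, the expected reward changed, and the firing rate of the dopamine cells no longer increased significantly when the reward was delivered. When an expected reward was not delivered, the firing rate of the dopamine cells decreased. This pattern resembles the error function in TD learning.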
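
The qualitative pattern in this experiment can be reproduced with a toy TD model. The sketch below is not the model used in the cited studies. It assumes that each time step after the stimulus is a separate state, that the reward arrives at the last step, and that the period before the stimulus has a fixed value of 0 because the stimulus arrives unpredictably. The function name `run_trial` and all parameter values are hypothetical.

```python
def run_trial(V, reward, alpha=0.1, gamma=1.0, learn=True):
    """Run one conditioning trial and return the TD error at each event.

    The first entry is the error at stimulus onset, the last entry is the
    error at the time the reward is (or is not) delivered.
    """
    n = len(V)
    # Stimulus onset: transition from the baseline (value 0) into state 0.
    # The baseline value is held fixed, so nothing is learned here.
    deltas = [gamma * V[0] - 0.0]
    for t in range(n):
        v_next = V[t + 1] if t + 1 < n else 0.0   # the trial ends after the last step
        r = reward if t == n - 1 else 0.0         # reward only at the last step
        delta = r + gamma * v_next - V[t]
        if learn:
            V[t] += alpha * delta
        deltas.append(delta)
    return deltas


V = [0.0] * 5                                   # five time steps between stimulus and juice
first = run_trial(V, reward=1.0)                # untrained animal
for _ in range(500):
    run_trial(V, reward=1.0)                    # training trials
trained = run_trial(V, reward=1.0, learn=False)
omitted = run_trial(V, reward=0.0, learn=False) # expected juice withheld

print("first trial, error at reward:", round(first[-1], 3))
print("after training, error at reward:", round(trained[-1], 3))
print("reward omitted, error at reward:", round(omitted[-1], 3))
```

In this sketch the error at reward time starts out positive, shrinks toward zero as the reward becomes predicted, and turns negative when the predicted reward is withheld, mirroring the three observations described above. In this toy model the positive error also moves to stimulus onset (`trained[0]`) after training.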

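Much current research on neural function builds on TD learning.^[13]^[14] The approach has also been used to study schizophrenia and the pharmacological effects of dopamine.^[15]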

References

1. Sutton & Barto (2018), p. 133.
2. Sutton, Richard S. Learning to predict by the methods of temporal differences. Machine Learning. 1988-08-01, 3 (1): 9–44 [2023-04-04]. ISSN 1573-0565. doi:10.1007/BF00115009. (Archived from the original on 2023-03-31.)
3. Schultz, W.; Dayan, P.; Montague, P. R. A neural substrate of prediction and reward. Science. 1997, 275 (5306): 1593–1599. CiteSeerX 10.1.1.133.6176. PMID 9054347. S2CID 220093382. doi:10.1126/science.275.5306.1593.
4. Montague, P. R.; Dayan, P.; Sejnowski, T. J. A framework for mesencephalic dopamine systems based on predictive Hebbian learning (PDF). The Journal of Neuroscience. 1996-03-01, 16 (5): 1936–1947 [2023-04-04]. ISSN 0270-6474. PMC 6578666. PMID 8774460. doi:10.1523/JNEUROSCI.16-05-01936.1996. (Archived (PDF) from the original on 2018-07-21.)
5. Montague, P. R.; Dayan, P.; Nowlan, S. J.; Pouget, A.; Sejnowski, T. J. Using aperiodic reinforcement for directed self-organization (PDF). Advances in Neural Information Processing Systems. 1993, 5: 969–976 [2023-04-04]. (Archived (PDF) from the original on 2006-03-12.)
6. Montague, P. R.; Sejnowski, T. J. The predictive brain: temporal coincidence and temporal order in synaptic learning mechanisms. Learning & Memory. 1994, 1 (1): 1–33. ISSN 1072-0502. PMID 10467583. S2CID 44560099. doi:10.1101/lm.1.1.1.
7. Sejnowski, T. J.; Dayan, P.; Montague, P. R. Predictive hebbian learning. Proceedings of Eighth ACM Conference on Computational Learning Theory. 1995: 15–18. ISBN 0897917235. S2CID 1709691. doi:10.1145/225298.225300.
8. Sutton & Barto (2018), p. 134.
9. Sutton & Barto (2018), p. 135.
10. Tesauro, Gerald. Temporal difference learning and TD-Gammon. Communications of the ACM. 1995-03-01, 38 (3): 58–68 [2023-04-06]. ISSN 0001-0782. doi:10.1145/203330.203343. (Archived from the original on 2023-04-06.)
11. Sutton & Barto (2018), p. 175.
12. Schultz, W. Predictive reward signal of dopamine neurons. Journal of Neurophysiology. 1998, 80 (1): 1–27. CiteSeerX 10.1.1.408.5994. PMID 9658025. S2CID 52857162. doi:10.1152/jn.1998.80.1.1.
13. Dayan, P. Motivated reinforcement learning (PDF). Advances in Neural Information Processing Systems (MIT Press). 2001, 14: 11–18 [2023-04-11]. (Archived from the original (PDF) on 2012-05-25.)
14. Tobia, M. J.; et al. Altered behavioral and neural responsiveness to counterfactual gains in the elderly. Cognitive, Affective, & Behavioral Neuroscience. 2016, 16 (3): 457–472. PMID 26864879. S2CID 11299945. doi:10.3758/s13415-016-0406-7.
15. Smith, A.; Li, M.; Becker, S.; Kapur, S. Dopamine, prediction error, and associative learning: a model-based account. Network: Computation in Neural Systems. 2006, 17 (1): 61–84. PMID 16613795. S2CID 991839. doi:10.1080/09548980500361624.

Works cited

Sutton, Richard S.; Barto, Andrew G. Reinforcement Learning: An Introduction (2nd ed.). Cambridge, MA: MIT Press. 2018 [2023-04-04]. (Archived from the original on 2023-04-26.)

Further reading

Meyn, S. P. Control Techniques for Complex Networks. Cambridge University Press. 2007. ISBN 978-0521884419. See final chapter and appendix.
Sutton, R. S.; Barto, A. G. Time Derivative Models of Pavlovian Reinforcement (PDF). Learning and Computational Neuroscience: Foundations of Adaptive Networks. 1990: 497–537 [2023-04-06]. (Archived (PDF) from the original on 2017-03-30.)

External links

Connect Four TDGravity Applet (archived at the Internet Archive) (+ mobile phone version): self-learned using the TD-Leaf method (a combination of TD-Lambda with shallow tree search)
Self Learning Meta-Tic-Tac-Toe (archived at the Internet Archive): example web app showing how temporal difference learning can be used to learn state evaluation constants for a minimax AI playing a simple board game
Reinforcement Learning Problem: document explaining how temporal difference learning can be used to speed up Q-learning
TD-Simulator (archived at the Internet Archive): temporal difference simulator for classical conditioning
