Frontiers of Computer Science
Front. Comput. Sci.    2023, Vol. 17 Issue (6) : 176349    https://doi.org/10.1007/s11704-023-2689-5
Excellent Young Computer Scientists Forum
Large sequence models for sequential decision-making: a survey
Muning WEN1,2, Runji LIN3,4, Hanjing WANG1,2, Yaodong YANG5, Ying WEN1, Luo MAI6, Jun WANG2,7, Haifeng ZHANG3,4, Weinan ZHANG1
1. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200241, China
2. Digital Brain Lab, Shanghai 201306, China
3. Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
4. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
5. Institute for Artificial Intelligence, Peking University, Beijing 100091, China
6. School of Informatics, The University of Edinburgh, Edinburgh EH8 9JU, UK
7. Department of Computer Science, University College London, London WC1E 6BT, UK
Abstract

Transformer architectures have facilitated the development of large-scale and general-purpose sequence models for prediction tasks in natural language processing and computer vision, e.g., GPT-3 and Swin Transformer. Although originally designed for prediction problems, it is natural to ask whether such models also suit sequential decision-making and reinforcement learning (RL) problems, which are typically beset by long-standing issues such as sample inefficiency, credit assignment, and partial observability. In recent years, sequence models, especially the Transformer, have attracted increasing interest in the RL community, spawning numerous approaches with notable effectiveness and generalizability. This survey presents a comprehensive overview of recent works that solve sequential decision-making tasks with sequence models such as the Transformer: it discusses the connection between sequential decision-making and sequence modeling, and categorizes existing methods by the way they utilize the Transformer. Moreover, this paper puts forth various potential avenues for future research intended to improve the effectiveness of large sequence models for sequential decision-making, encompassing theoretical foundations, network architectures, algorithms, and efficient training systems.

Keywords: sequential decision-making; sequence modeling; the Transformer; training system
Corresponding Author(s): Weinan ZHANG   
Just Accepted Date: 05 May 2023   Issue Date: 04 August 2023
 Cite this article:   
Muning WEN, Runji LIN, Hanjing WANG, et al. Large sequence models for sequential decision-making: a survey[J]. Front. Comput. Sci., 2023, 17(6): 176349.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-023-2689-5
https://academic.hep.com.cn/fcs/EN/Y2023/V17/I6/176349
Fig.1  The difference between sequential decision-making tasks and prediction tasks such as those in CV and NLP. (a) A sequential decision-making task is a cycle of agent, task, and world, connected by interactions; (b) in prediction tasks, tasks form a hierarchical structure
Fig.2  Paradigm comparison of conventional RL, IL, UDRL, DT, and TT. (a) is a representative method of conventional RL, where $R_t$ denotes the estimated discounted cumulative reward starting from $s_t$. (b) is a classic method in IL, i.e., Behavioral Cloning. In (c) and (d), $\hat{R}_t$ is the desired cumulative reward without discount. In (e), $r_t$ denotes the instant reward received after executing $a_t$.
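As a concrete illustration of the return-conditioned paradigms in Fig. 2(c)-(e), the sketch below is a minimal PyTorch example of our own, not code from any surveyed paper: `ReturnConditionedPolicy` and `returns_to_go` are illustrative names, and a small MLP stands in for the causal Transformer. It shows how UDRL/DT-style methods replace value estimation with supervised prediction of $a_t$ conditioned on the desired return $\hat{R}_t$ and state $s_t$.

```python
import torch
import torch.nn as nn

class ReturnConditionedPolicy(nn.Module):
    """Toy stand-in for the causal Transformer used by DT-style methods."""
    def __init__(self, state_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        # Input at step t: (desired return-to-go R̂_t, state s_t).
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, rtg, state):
        return self.net(torch.cat([rtg, state], dim=-1))

def returns_to_go(rewards):
    # R̂_t = sum of future rewards without discount, as in Fig. 2(c)-(d).
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return list(reversed(out))

# One behavioral-cloning-style update on a logged (dummy) trajectory.
policy = ReturnConditionedPolicy(state_dim=4, act_dim=2)
states = torch.randn(10, 4)                       # s_1..s_10
actions = torch.randint(0, 2, (10,))              # a_1..a_10
rtg = torch.tensor(returns_to_go([1.0] * 10)).unsqueeze(-1)
loss = nn.functional.cross_entropy(policy(rtg, states), actions)
loss.backward()
```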
Method | Sequence | Prediction | Discretized tokens | Benefit | Notes
UPDeT [52] | s | a | No | Multi-task; few-shot learning; interpretability | Model-free; online; multi-agent
PIT [53] | s | Q values | No | Multi-task; few-shot learning; credit assignment | Model-free; online; multi-agent
DT [44] | rtg-s-a | a | No | Long sequence; POMDP; credit assignment | Model-free; offline
TT [45] | s-a-r(-rtg) | s-a-r | Yes | Long sequence; POMDP; sparse-reward | Model-based; offline
GDT [59] | ψ(s,a)-s-a | a | No | HIM problems | Model-free; offline
PDT [46] | s-a | a | No | Few-shot learning | Model-free; pre-train
MADT [50] | s-a | a | No | Multi-task; long sequence | Model-free; offline; multi-agent
ODT [49] | rtg-s-a | a | No | Few-shot learning | Model-free; online
MAT [54] | s | a | No | Monotonic improvement; multi-task; few-shot learning | Model-free; online; multi-agent
MGDT [55] | s-a-r-rtg | a-r-rtg | Yes | Multi-task; few-shot learning | Model-free; offline
TrMRL [60] | s | a | No | Multi-task; few-shot learning | Model-free; online; meta-learning
PG-AR [61] | s | a | No | Monotonic improvement | Model-free; online; multi-agent
Prompt-DT [56] | rtg-s-a | a | No | Multi-task; few-shot learning | Model-free; offline
BooT [62] | s-a-r-rtg | s-a-r-rtg | Yes | Data augmentation | Model-based; offline
Tab.1  Detailed comparison between different transformer-based methods for sequential decision-making
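To make the "Sequence" and "Discretized tokens" columns of Tab. 1 concrete, here is a minimal sketch with illustrative function names not taken from any cited codebase: DT-style methods interleave (rtg, s, a) triples into one token stream for a causal Transformer, while TT-style methods additionally discretize each continuous value into a vocabulary id via uniform binning.

```python
from typing import List, Sequence

def interleave_dt(rtgs: Sequence[float],
                  states: Sequence[list],
                  actions: Sequence[list]) -> List[tuple]:
    """DT ordering from Tab. 1: rtg_1, s_1, a_1, rtg_2, s_2, a_2, ..."""
    tokens = []
    for g, s, a in zip(rtgs, states, actions):
        tokens += [("rtg", g), ("state", s), ("action", a)]
    return tokens

def discretize_tt(values: Sequence[float], low: float, high: float,
                  n_bins: int = 100) -> List[int]:
    """TT-style uniform binning: each scalar becomes a discrete token id."""
    width = (high - low) / n_bins
    return [min(n_bins - 1, max(0, int((v - low) / width))) for v in values]

# usage: a 2-step trajectory flattened into 6 tokens, then discretized rewards
stream = interleave_dt([2.0, 1.0], [[0.1], [0.2]], [[1.0], [0.0]])
ids = discretize_tt([0.05, 0.73], low=0.0, high=1.0)
```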
Methods | Knowledge domain | Downstream task indicator | What to pre-train | How to pre-train | How to use pre-trained model
XLand [91] | Online tasks | Predicates | Policy | RL | Zero-shot; finetune
MIA [42] | Offline human demo | Text | Policy | BC | Zero-shot; finetune
Gato [8] | Offline expert demo; multi-modal data | Prompt | Policy | BC | Zero-shot; finetune
SayCan [92] | Pre-trained LM | Text | Perception | SL; RL | Zero-shot
MineDojo [51] | Internet video; pre-trained LVM | Text | Reward | SL | Online RL
VPT [9] | Internet video; manual annotation | ? | Policy; world model | BC | Finetune
LM-Nav [93] | Pre-trained LVM; pre-trained LM | Text | Perception | SL | Search method
Inner Mono. [94] | Pre-trained LM; pre-trained VM | Text | Perception | SL; BC | Zero-shot
Tab.2  Detailed comparison between different pre-trained decision models, with abbreviations: language model (LM), language and vision model (LVM)
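To make the last column of Tab. 2 concrete, the hedged sketch below contrasts the two main downstream usage modes: zero-shot, where the frozen model is steered only through a task indicator (text, prompt, predicate), and fine-tuning, where the weights keep being updated on downstream data. `PretrainedPolicy` and its linear scoring are hypothetical placeholders, not any cited system's API.

```python
class PretrainedPolicy:
    def __init__(self, w: float = 1.0):
        self.w = w  # stands in for pre-trained Transformer weights

    def act(self, obs: float, task_indicator: str) -> float:
        # condition behavior on a downstream task indicator
        bias = 1.0 if "stack" in task_indicator else 0.0
        return self.w * obs + bias

policy = PretrainedPolicy()

# zero-shot (e.g., the SayCan / Inner Mono. rows): weights stay frozen,
# only the task indicator changes
a = policy.act(obs=0.5, task_indicator="stack the red block")

# finetune (e.g., the XLand / MIA / Gato rows): continue updating the
# weights on a small downstream dataset (one SGD step on squared error)
for obs, expert_a in [(0.5, 1.2), (1.0, 1.8)]:
    policy.w -= 0.1 * (policy.act(obs, "stack") - expert_a) * obs
```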
Fig.3  (a) A data-parallel three-layer model with a parallel size of 2. Data Parallelism (DP) creates replicas of the entire model across the cluster, with each device holding one (or more) of these replicas. (b) The same three-layer model assigned to 4 physical devices under Model Parallelism (MP), with a layer-wise (vertical) slicing schema and a horizontal slicing schema on the second layer (the 2nd layer is internally sliced and assigned to worker-1 and worker-2). MP splits the model either horizontally (inside a layer, where Tensor Parallelism is often involved since parameters such as weight matrices are sliced, e.g., a matrix multiplication is split into operations on sub-matrices) or vertically (layer-level slicing). (c) GPipe [114]: a 4-layer model assigned to 4 physical devices (the vertical axis) with a pipeline parallelism schema. Pipeline Parallelism (PP) combines DP and MP by slicing the model vertically into chunks, mapping them to different devices, and splitting the mini-batch input into micro-batches that are fed into the pipeline sequentially to reduce bubbles (periods in which devices sit idle). Hybrid Parallelism: although PP is already a hybrid of DP and MP, it can be further integrated with DP by serving multiple homogeneous pipelines (whose parameters may differ depending on the synchronization schema). A hybrid parallelism schema is typically a combination of DP, MP, and PP that derives fine-grained placement and execution plans from the diverse IO, memory, and computing characteristics of the different parallelism methods, with an overall optimization goal of efficiency
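The following framework-free sketch, assuming a toy model whose stages are plain Python callables, illustrates only the micro-batch scheduling idea behind GPipe in Fig. 3(c); in a real system the stages run on separate devices and overlap in time, which is what shrinks the bubbles.

```python
def pipeline_forward(stages, minibatch, n_micro: int):
    """stages: one callable per device; returns the stitched outputs.

    The mini-batch is cut into n_micro micro-batches, and each micro-batch
    flows through every stage in order. Here we only expose the data
    dependency; a real pipeline executes micro-batch j on stage i while
    micro-batch j+1 occupies stage i-1.
    """
    size = len(minibatch) // n_micro
    micro = [minibatch[i * size:(i + 1) * size] for i in range(n_micro)]
    outputs = []
    for mb in micro:
        for stage in stages:
            mb = stage(mb)
        outputs.extend(mb)
    return outputs

# usage: a 4-stage "model" over 4 devices, an 8-sample mini-batch,
# and 4 micro-batches; each toy stage just adds its index
stages = [lambda xs, k=k: [x + k for x in xs] for k in range(4)]
print(pipeline_forward(stages, list(range(8)), n_micro=4))
```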
Fig.4  The data-flow comparison between the paradigms of offline RL and online RL: offline pre-training relies on large datasets, while online fine-tuning requires parallelizing massive numbers of environments to accelerate interaction and data collection. Moreover, the online fine-tuning phase imposes more communication pressure due to the strict parameter-synchronization requirements between inference and training servers
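As a rough illustration of the synchronization loop implied by Fig. 4, the sketch below uses invented `Learner`/`Actor` classes (not any existing RL system's API) to show why online fine-tuning couples data collection to frequent parameter pulls from the training server: without the `sync()` call, the rollouts become stale relative to the policy being optimized.

```python
import copy

class Learner:
    """Training server holding the authoritative parameters."""
    def __init__(self):
        self.params = {"w": 0.0}
        self.version = 0

    def train_step(self, batch):
        self.params["w"] += 0.01 * len(batch)   # placeholder update
        self.version += 1

class Actor:
    """Inference server collecting data with a copy of the parameters."""
    def __init__(self, learner: Learner):
        self.learner = learner
        self.params, self.version = None, -1
        self.sync()

    def sync(self):
        # strict synchronization: pull the latest learner parameters;
        # for large models this transfer dominates communication cost
        self.params = copy.deepcopy(self.learner.params)
        self.version = self.learner.version

    def collect(self, n: int):
        return [("obs", self.params["w"])] * n   # placeholder rollout

learner, actor = Learner(), Actor(learner)
for _ in range(3):                 # online fine-tuning loop
    batch = actor.collect(32)
    learner.train_step(batch)
    actor.sync()
```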
1 Liu X, Zhang F, Hou Z, Mian L, Wang Z, Zhang J, Tang J. Self-supervised learning: generative or contrastive. IEEE Transactions on Knowledge and Data Engineering, 2023, 35(1): 857–876
2 Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks. In: Proceedings of the 27th International Conference on Neural Information Processing Systems. 2014, 3104−3112
3 Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 2017, 60(6): 84–90
4 Qin C, Zhang A, Zhang Z, Chen J, Yasunaga M, Yang D. Is ChatGPT a general-purpose natural language processing task solver? 2023, arXiv preprint arXiv: 2302.06476
5 Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B. Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, 9992−10002
6 Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł, Polosukhin I. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, 6000−6010
7 Sutton R S, Barto A G. Reinforcement Learning: An Introduction. 2nd ed. Cambridge: MIT Press, 2018
8 Reed S, Zolna K, Parisotto E, Colmenarejo S G, Novikov A, Barth-Maron G, Gimenez M, Sulsky Y, Kay J, Springenberg J T, Eccles T, Bruce J, Razavi A, Edwards A, Heess N, Chen Y, Hadsell R, Vinyals O, Bordbar M, de Freitas N. A generalist agent. 2022, arXiv preprint arXiv: 2205.06175
9 Baker B, Akkaya I, Zhokhov P, Huizinga J, Tang J, Ecoffet A, Houghton B, Sampedro R, Clune J. Video PreTraining (VPT): learning to act by watching unlabeled online videos. In: Proceedings of the 36th Conference on Neural Information Processing Systems. 2022
10 Yang S, Nachum O, Du Y, Wei J, Abbeel P, Schuurmans D. Foundation models for decision making: problems, methods, and opportunities. 2023, arXiv preprint arXiv: 2303.04129
11 Kruse R, Mostaghim S, Borgelt C, Braune C, Steinbrecher M. Multi-layer perceptrons. In: Kruse R, Mostaghim S, Borgelt C, Braune C, Steinbrecher M, eds. Computational Intelligence: A Methodological Introduction. 3rd ed. Cham: Springer, 2022, 53−124
12 LeCun Y, Boser B, Denker J S, Henderson D, Howard R E, Hubbard W, Jackel L D. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989, 1(4): 541–551
13 Sarker I H. Deep learning: a comprehensive overview on techniques, taxonomy, applications and research directions. SN Computer Science, 2021, 2(6): 420
14 Goodfellow I, Bengio Y, Courville A. Deep Learning. Cambridge: MIT Press, 2016
15 Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9(8): 1735–1780
16 Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing. 2014, 1724−1734
17 Devlin J, Chang M W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019, 4171−4186
18 Brown T B, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler D M, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I, Amodei D. Language models are few-shot learners. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020, 1877−1901
19 Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N. An image is worth 16x16 words: transformers for image recognition at scale. In: Proceedings of the 9th International Conference on Learning Representations. 2021
20 Silver D, Huang A, Maddison C J, Guez A, Sifre L, van den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M, Dieleman S, Grewe D, Nham J, Kalchbrenner N, Sutskever I, Lillicrap T, Leach M, Kavukcuoglu K, Graepel T, Hassabis D. Mastering the game of Go with deep neural networks and tree search. Nature, 2016, 529(7587): 484–489
21 Vinyals O, Babuschkin I, Czarnecki W M, Mathieu M, Dudzik A, Chung J, Choi D H, Powell R, Ewalds T, Georgiev P, Oh J, Horgan D, Kroiss M, Danihelka I, Huang A, Sifre L, Cai T, Agapiou J P, Jaderberg M, Vezhnevets A S, Leblond R, Pohlen T, Dalibard V, Budden D, Sulsky Y, Molloy J, Paine T L, Gulcehre C, Wang Z Y, Pfaff T, Wu Y H, Ring R, Yogatama D, Wünsch D, Mckinney K, Smith O, Schaul T, Lillicrap T, Kavukcuoglu K, Hassabis D, Apps C, Silver D. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 2019, 575(7782): 350–354
22 Sutton R S. Learning to predict by the methods of temporal differences. Machine Learning, 1988, 3(1): 9–44
23 Williams R J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992, 8(3): 229–256
24 Mnih V, Kavukcuoglu K, Silver D, Rusu A A, Veness J, Bellemare M G, Graves A, Riedmiller M, Fidjeland A K, Ostrovski G, Petersen S, Beattie C, Sadik A, Antonoglou I, King H, Kumaran D, Wierstra D, Legg S, Hassabis D. Human-level control through deep reinforcement learning. Nature, 2015, 518(7540): 529–533
25 Konda V R, Tsitsiklis J N. Actor-critic algorithms. In: Proceedings of the 13th Conference on Neural Information Processing Systems. 1999
26 Camacho E F, Alba C B. Model Predictive Control. Advanced Textbooks in Control and Signal Processing. London: Springer, 2013
27 Peng B, Li X, Gao J, Liu J, Wong K F, Su S Y. Deep Dyna-Q: integrating planning for task-completion dialogue policy learning. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 2018, 2182−2192
28 Botvinick M, Ritter S, Wang J X, Kurth-Nelson Z, Blundell C, Hassabis D. Reinforcement learning, fast and slow. Trends in Cognitive Sciences, 2019, 23(5): 408–422
29 Sutton R S. Temporal credit assignment in reinforcement learning. Dissertation, University of Massachusetts Amherst, 1984
30 Hausknecht M J, Stone P. Deep recurrent Q-learning for partially observable MDPs. In: Proceedings of 2015 AAAI Fall Symposium Series. 2015, 29−37
31 McFarlane R. A survey of exploration strategies in reinforcement learning. McGill University, 2018
32 Hao J, Yang T, Tang H, Bai C, Liu J, Meng Z, Liu P, Wang Z. Exploration in deep reinforcement learning: from single-agent to multiagent domain. 2021, arXiv preprint arXiv: 2109.06668
33 Zhou M, Luo J, Villella J, Yang Y, Rusu D, Miao J, Zhang W, Alban M, Fadakar I, Chen Z, Huang A C, Wen Y, Hassanzadeh K, Graves D, Chen D, Zhu Z, Nguyen N, Elsayed M, Shao K, Ahilan S, Zhang B, Wu J, Fu Z, Rezaee K, Yadmellat P, Rohani M, Nieves N P, Ni Y, Banijamali S, Rivers A C, Tian Z, Palenicek D, bou Ammar H, Zhang H, Liu W, Hao J, Wang J. SMARTS: scalable multi-agent reinforcement learning training school for autonomous driving. In: Proceedings of the Conference on Robot Learning. 2020
34 Qin R J, Zhang X, Gao S, Chen X H, Li Z, Zhang W, Yu Y. NeoRL: a near real-world benchmark for offline reinforcement learning. In: Proceedings of the 36th Conference on Neural Information Processing Systems. 2022
35 Jakobi N, Husbands P, Harvey I. Noise and the reality gap: the use of simulation in evolutionary robotics. In: Proceedings of the 3rd European Conference on Artificial Life. 1995, 704−720
36 Harutyunyan A, Dabney W, Mesnard T, Heess N, Azar M G, Piot B, van Hasselt H, Singh S, Wayne G, Precup D, Munos R. Hindsight credit assignment. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 1120
37 Schulman J, Moritz P, Levine S, Jordan M, Abbeel P. High-dimensional continuous control using generalized advantage estimation. 2015, arXiv preprint arXiv: 1506.02438
38 Oliehoek F A, Amato C. A Concise Introduction to Decentralized POMDPs. Cham: Springer, 2016
39 Torabi F, Warnell G, Stone P. Behavioral cloning from observation. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence. 2018, 4950−4957
40 Ho J, Ermon S. Generative adversarial imitation learning. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. 2016, 4572−4580
41 Jang E, Irpan A, Khansari M, Kappler D, Ebert F, Lynch C, Levine S, Finn C. BC-Z: zero-shot task generalization with robotic imitation learning. In: Proceedings of the Conference on Robot Learning. 2021, 991−1002
42 Interactive Agents Team. Creating multimodal interactive agents with imitation and self-supervised learning. 2021, arXiv preprint arXiv: 2112.03763
43 Srivastava R K, Shyam P, Mutz F, Jaśkowski W, Schmidhuber J. Training agents using upside-down reinforcement learning. 2019, arXiv preprint arXiv: 1912.02877
44 Chen L, Lu K, Rajeswaran A, Lee K, Grover A, Laskin M, Abbeel P, Srinivas A, Mordatch I. Decision transformer: reinforcement learning via sequence modeling. In: Proceedings of the 35th Conference on Neural Information Processing Systems. 2021, 15084−15097
45 Janner M, Li Q, Levine S. Offline reinforcement learning as one big sequence modeling problem. In: Proceedings of the 35th Conference on Neural Information Processing Systems. 2021, 1273−1286
46 Cang C, Hakhamaneshi K, Rudes R, Mordatch I, Rajeswaran A, Abbeel P, Laskin M. Semi-supervised offline reinforcement learning with pre-trained decision transformers. In: Proceedings of the International Conference on Learning Representations. 2022
47 Wang Z, Chen C, Dong D. Lifelong incremental reinforcement learning with online Bayesian inference. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(8): 4003–4016
48 Wang Z, Chen C, Dong D. A Dirichlet process mixture of robust task models for scalable lifelong reinforcement learning. IEEE Transactions on Cybernetics, 2022
49 Zheng Q, Zhang A, Grover A. Online decision transformer. In: Proceedings of the 39th International Conference on Machine Learning. 2022, 27042−27059
50 Meng L, Wen M, Yang Y, Le C, Li X, Zhang W, Wen Y, Zhang H, Wang J, Xu B. Offline pre-trained multi-agent decision transformer: one big sequence model tackles all SMAC tasks. 2021, arXiv preprint arXiv: 2112.02845
51 Fan L, Wang G, Jiang Y, Mandlekar A, Yang Y, Zhu H, Tang A, Huang D A, Zhu Y, Anandkumar A. MineDojo: building open-ended embodied agents with internet-scale knowledge. In: Proceedings of the 36th Conference on Neural Information Processing Systems. 2022
52 Hu S, Zhu F, Chang X, Liang X. UPDeT: universal multi-agent reinforcement learning via policy decoupling with transformers. 2021, arXiv preprint arXiv: 2101.08001
53 Zhou T, Zhang F, Shao K, Li K, Huang W, Luo J, Wang W, Yang Y, Mao H, Wang B, Li D, Liu W, Hao J. Cooperative multi-agent transfer learning with level-adaptive credit assignment. 2021, arXiv preprint arXiv: 2106.00517
54 Wen M, Kuba J G, Lin R, Zhang W, Wen Y, Wang J, Yang Y. Multi-agent reinforcement learning is a sequence modeling problem. In: Proceedings of the 36th Conference on Neural Information Processing Systems. 2022, 16509−16521
55 Lee K H, Nachum O, Yang M, Lee L, Freeman D, Xu W, Guadarrama S, Fischer I, Jang E, Michalewski H, Mordatch I. Multi-game decision transformers. In: Proceedings of the 36th Conference on Neural Information Processing Systems. 2022
56 Xu M, Shen Y, Zhang S, Lu Y, Zhao D, Tenenbaum J B, Gan C. Prompting decision transformer for few-shot policy generalization. In: Proceedings of the International Conference on Machine Learning. 2022, 24631−24645
57 Ferret J, Marinier R, Geist M, Pietquin O. Self-attentional credit assignment for transfer in reinforcement learning. In: Proceedings of the 29th International Joint Conference on Artificial Intelligence. 2021, 368
58 Mesnard T, Weber T, Viola F, Thakoor S, Saade A, Harutyunyan A, Dabney W, Stepleton T S, Heess N, Guez A, Moulines E, Hutter M, Buesing L, Munos R. Counterfactual credit assignment in model-free reinforcement learning. In: Proceedings of the 38th International Conference on Machine Learning. 2021, 7654−7664
59 Furuta H, Matsuo Y, Gu S S. Generalized decision transformer for offline hindsight information matching. In: Proceedings of the 10th International Conference on Learning Representations. 2022
60 Melo L C. Transformers are meta-reinforcement learners. In: Proceedings of the International Conference on Machine Learning. 2022, 15340−15359
61 Fu W, Yu C, Xu Z, Yang J, Wu Y. Revisiting some common practices in cooperative multi-agent reinforcement learning. In: Proceedings of the 39th International Conference on Machine Learning. 2022, 6863−6877
62 Wang K, Zhao H, Luo X, Ren K, Zhang W, Li D. Bootstrapped transformer for offline reinforcement learning. In: Proceedings of the 36th Conference on Neural Information Processing Systems. 2022
63 Zhai X, Kolesnikov A, Houlsby N, Beyer L. Scaling vision transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, 1204−1213
64 Goyal P, Caron M, Lefaudeux B, Xu M, Wang P, Pai V, Singh M, Liptchinsky V, Misra I, Joulin A, Bojanowski P. Self-supervised pretraining of visual features in the wild. 2021, arXiv preprint arXiv: 2103.01988
65 Radford A, Kim J W, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G, Sutskever I. Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning. 2021, 8748−8763
66 Ramesh A, Pavlov M, Goh G, Gray S, Voss C, Radford A, Chen M, Sutskever I. Zero-shot text-to-image generation. In: Proceedings of the 38th International Conference on Machine Learning. 2021, 8821−8831
67 Dehghani M, Gouws S, Vinyals O, Uszkoreit J, Kaiser Ł. Universal transformers. In: Proceedings of the 7th International Conference on Learning Representations. 2019
68 Wang W, Bao H, Dong L, Bjorck J, Peng Z, Liu Q, Aggarwal K, Mohammed O K, Singhal S, Som S, Wei F. Image as a foreign language: BEiT pretraining for all vision and vision-language tasks. 2022, arXiv preprint arXiv: 2208.10442
69 Fedus W, Zoph B, Shazeer N. Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. 2021, arXiv preprint arXiv: 2101.03961
70 Kolesnikov A, Beyer L, Zhai X, Puigcerver J, Yung J, Gelly S, Houlsby N. Big transfer (BiT): general visual representation learning. In: Proceedings of the 16th European Conference on Computer Vision. 2020, 491−507
71 Kaplan J, McCandlish S, Henighan T, Brown T B, Chess B, Child R, Gray S, Radford A, Wu J, Amodei D. Scaling laws for neural language models. 2020, arXiv preprint arXiv: 2001.08361
72 Kharitonov E, Chaabouni R. What they do when in doubt: a study of inductive biases in seq2seq learners. 2020, arXiv preprint arXiv: 2006.14953
73 Edelman B L, Goel S, Kakade S, Zhang C. Inductive biases and variable creation in self-attention mechanisms. In: Proceedings of the 39th International Conference on Machine Learning. 2022, 5793−5831
74 Ghorbani B, Firat O, Freitag M, Bapna A, Krikun M, Garcia X, Chelba C, Cherry C. Scaling laws for neural machine translation. In: Proceedings of the 10th International Conference on Learning Representations. 2022
75 Shen H. Mutual information scaling and expressive power of sequence models. 2019, arXiv preprint arXiv: 1905.04271
76 Pascanu R, Mikolov T, Bengio Y. On the difficulty of training recurrent neural networks. In: Proceedings of the 30th International Conference on Machine Learning. 2013, 1310−1318
77 Olsson C, Elhage N, Nanda N, Joseph N, DasSarma N, Henighan T, Mann B, Askell A, Bai Y, Chen A, Conerly T, Drain D, Ganguli D, Hatfield-Dodds Z, Hernandez D, Johnston S, Jones A, Kernion J, Lovitt L, Ndousse K, Amodei D, Brown T, Clark J, Kaplan J, McCandlish S, Olah C. In-context learning and induction heads. 2022, arXiv preprint arXiv: 2209.11895
78 Wei C, Chen Y, Ma T. Statistically meaningful approximation: a case study on approximating Turing machines with transformers. In: Proceedings of the 36th Conference on Neural Information Processing Systems. 2022
79 Pérez J, Marinković J, Barceló P. On the Turing completeness of modern neural network architectures. In: Proceedings of the 7th International Conference on Learning Representations. 2019
80 Levine S, Kumar A, Tucker G, Fu J. Offline reinforcement learning: tutorial, review, and perspectives on open problems. 2020, arXiv preprint arXiv: 2005.01643
81 Li L. A perspective on off-policy evaluation in reinforcement learning. Frontiers of Computer Science, 2019, 13(5): 911–912
82 Moerland T M, Broekens J, Plaat A, Jonker C M. Model-based reinforcement learning: a survey. Foundations and Trends® in Machine Learning, 2023, 16(1): 1–118
83 Chen C, Wu Y F, Yoon J, Ahn S. TransDreamer: reinforcement learning with transformer world models. 2022, arXiv preprint arXiv: 2202.09481
84 Zeng C, Docter J, Amini A, Gilitschenski I, Hasani R, Rus D. Dreaming with transformers. In: Proceedings of the AAAI Workshop on Reinforcement Learning in Games. 2022
85 Hafner D, Lillicrap T P, Ba J, Norouzi M. Dream to control: learning behaviors by latent imagination. In: Proceedings of the 8th International Conference on Learning Representations. 2020
86 Kaelbling L P. Learning to achieve goals. In: Proceedings of the 13th International Joint Conference on Artificial Intelligence. 1993, 1094−1099
87 Rudner T G J, Pong V H, McAllister R, Gal Y, Levine S. Outcome-driven reinforcement learning via variational inference. In: Proceedings of the 35th Conference on Neural Information Processing Systems. 2021, 13045−13058
88 Liu M, Zhu M, Zhang W. Goal-conditioned reinforcement learning: problems and solutions. In: Proceedings of the 31st International Joint Conference on Artificial Intelligence. 2022, 5502−5511
89 Carroll M, Lin J, Paradise O, Georgescu R, Sun M, Bignell D, Milani S, Hofmann K, Hausknecht M, Dragan A, Devlin S. Towards flexible inference in sequential decision problems via bidirectional transformers. 2022, arXiv preprint arXiv: 2204.13326
90 Putterman A L, Lu K, Mordatch I, Abbeel P. Pretraining for language-conditioned imitation with transformers. In: Proceedings of the 35th Conference on Neural Information Processing Systems. 2021
91 Open Ended Learning Team, Stooke A, Mahajan A, Barros C, Deck C, Bauer J, Sygnowski J, Trebacz M, Jaderberg M, Mathieu M, McAleese N, Bradley-Schmieg N, Wong N, Porcel N, Raileanu R, Hughes-Fitt S, Dalibard V, Czarnecki W M. Open-ended learning leads to generally capable agents. 2021, arXiv preprint arXiv: 2107.12808
92 Ahn M, Brohan A, Brown N, Chebotar Y, Cortes O, David B, Finn C, Fu C, Gopalakrishnan K, Hausman K, Herzog A, Ho D, Hsu J, Ibarz J, Ichter B, Irpan A, Jang E, Ruano R J, Jeffrey K, Jesmonth S, Joshi N J, Julian R, Kalashnikov D, Kuang Y, Lee K H, Levine S, Lu Y, Luu L, Parada C, Pastor P, Quiambao J, Rao K, Rettinghouse J, Reyes D, Sermanet P, Sievers N, Tan C, Toshev A, Vanhoucke V, Xia F, Xiao T, Xu P, Xu S, Yan M, Zeng A. Do as I can, not as I say: grounding language in robotic affordances. 2022, arXiv preprint arXiv: 2204.01691
93 Shah D, Osiński B, Ichter B, Levine S. LM-Nav: robotic navigation with large pre-trained models of language, vision, and action. In: Proceedings of the 6th Conference on Robot Learning. 2023, 492−504
94 Huang W, Xia F, Xiao T, Chan H, Liang J, Florence P, Zeng A, Tompson J, Mordatch I, Chebotar Y, Sermanet P, Jackson T, Brown N, Luu L, Levine S, Hausman K, Ichter B. Inner monologue: embodied reasoning through planning with language models. In: Proceedings of the Conference on Robot Learning. 2022, 1769−1782
95 Chen T, Kornblith S, Norouzi M, Hinton G. A simple framework for contrastive learning of visual representations. In: Proceedings of the 37th International Conference on Machine Learning. 2020, 149
96 He K, Fan H, Wu Y, Xie S, Girshick R. Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, 9726−9735
97 Levine S. Understanding the world through action. In: Proceedings of the 5th Conference on Robot Learning. 2022, 1752−1757
98 Krueger D, Maharaj T, Leike J. Hidden incentives for auto-induced distributional shift. 2020, arXiv preprint arXiv: 2009.09153
99 Kumar A, Fu J, Tucker G, Levine S. Stabilizing off-policy Q-learning via bootstrapping error reduction. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 1055
100 Kaspar M, Osorio J D M, Bock J. Sim2Real transfer for reinforcement learning without dynamics randomization. In: Proceedings of 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 2020, 4383−4388
101 Tancik M, Casser V, Yan X, Pradhan S, Mildenhall B P, Srinivasan P, Barron J T, Kretzschmar H. Block-NeRF: scalable large scene neural view synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, 8238−8248
102 Nair A, Gupta A, Dalal M, Levine S. AWAC: accelerating online reinforcement learning with offline datasets. 2020, arXiv preprint arXiv: 2006.09359
103 Mao Y, Wang C, Wang B, Zhang C. MOORe: model-based offline-to-online reinforcement learning. 2022, arXiv preprint arXiv: 2201.10070
104 Zhou Z H. Rehearsal: learning from prediction to decision. Frontiers of Computer Science, 2022, 16(4): 164352
105 Huang W, Abbeel P, Pathak D, Mordatch I. Language models as zero-shot planners: extracting actionable knowledge for embodied agents. In: Proceedings of the International Conference on Machine Learning. 2022, 9118−9147
106 Bai S, Kolter J Z, Koltun V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. 2018, arXiv preprint arXiv: 1803.01271
107 Tolstikhin I, Houlsby N, Kolesnikov A, Beyer L, Zhai X, Unterthiner T, Yung J, Steiner A, Keysers D, Uszkoreit J, Lucic M, Dosovitskiy A. MLP-Mixer: an all-MLP architecture for vision. In: Proceedings of the 35th Conference on Neural Information Processing Systems. 2021, 24261−24272
108 Jaegle A, Borgeaud S, Alayrac J B, Doersch C, Ionescu C, Ding D, Koppula S, Zoran D, Brock A, Shelhamer E, Hénaff O J, Botvinick M M, Zisserman A, Vinyals O, Carreira J. Perceiver IO: a general architecture for structured inputs & outputs. In: Proceedings of the 10th International Conference on Learning Representations. 2022
109 Shazeer N, Mirhoseini A, Maziarz K, Davis A, Le Q V, Hinton G E, Dean J. Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. In: Proceedings of the 5th International Conference on Learning Representations. 2017
110 Yang R, Xu H, Wu Y, Wang X. Multi-task reinforcement learning with soft modularization. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020, 400
111 Fernando C, Banarse D, Blundell C, Zwols Y, Ha D, Rusu A A, Pritzel A, Wierstra D. PathNet: evolution channels gradient descent in super neural networks. 2017, arXiv preprint arXiv: 1701.08734
112 Lepikhin D, Lee H, Xu Y, Chen D, Firat O, Huang Y, Krikun M, Shazeer N, Chen Z. GShard: scaling giant models with conditional computation and automatic sharding. In: Proceedings of the 9th International Conference on Learning Representations. 2021
113 Rajbhandari S, Li C, Yao Z, Zhang M, Aminabadi R Y, Awan A A, Rasley J, He Y. DeepSpeed-MoE: advancing mixture-of-experts inference and training to power next-generation AI scale. In: Proceedings of the International Conference on Machine Learning. 2022, 18332−18346
114 Huang Y, Cheng Y, Bapna A, Firat O, Chen M X, Chen D, Lee H, Ngiam J, Le Q V, Wu Y, Chen Z F. GPipe: efficient training of giant neural networks using pipeline parallelism. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 10
115 Li S, Fang J, Bian Z, Liu H, Liu Y, Huang H, Wang B, You Y. Colossal-AI: a unified deep learning system for large-scale parallel training. 2021, arXiv preprint arXiv: 2110.14883
116 Espeholt L, Soyer H, Munos R, Simonyan K, Mnih V, Ward T, Doron Y, Firoiu V, Harley T, Dunning I, Legg S, Kavukcuoglu K. IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In: Proceedings of the 35th International Conference on Machine Learning. 2018, 1406−1415
117 Espeholt L, Marinier R, Stanczyk P, Wang K, Michalski M. SEED RL: scalable and efficient deep-RL with accelerated central inference. In: Proceedings of the 8th International Conference on Learning Representations. 2020
118 Ozbulak U, Lee H J, Boga B, Anzaku E T, Park H, Van Messem A, De Neve W, Vankerschaver J. Know your self-supervised learning: A survey on image-based generative and discriminative training. 2023, arXiv preprint arXiv: 2305.13689
119 Carroll M, Paradise O, Lin J, Georgescu R, Sun M, Bignell D, Milani S, Hofmann K, Hausknecht M, Dragan A, Devlin S. UniMASK: unified inference in sequential decision problems. 2022, arXiv preprint arXiv: 2211.10869
120 Wei J, Wang X, Schuurmans D, Bosma M, Ichter B, Xia F, Chi E H, Le Q V, Zhou D. Chain-of-thought prompting elicits reasoning in large language models. In: Proceedings of the 36th Conference on Neural Information Processing Systems. 2022