Reinforcement Learning And Approximate Dynamic Programming For Feedback Control by Derong Liu and Frank L. Lewis

Author Derong Liu and Frank L. Lewis
ISBN 978-1118104200
File size 45.1 MB
Year 2013
Pages 648
Language English
File format PDF
Category programming


IEEE Press
445 Hoes Lane
Piscataway, NJ 08854

IEEE Press Editorial Board 2012
John Anderson, Editor in Chief
Ramesh Abhari, Bernhard M. Haemmerli, Saeid Nahavandi, George W. Arnold, David Jacobson, Tariq Samad, Flavio Canavero, Mary Lanzerotti, George Zobrist, Dmitry Goldgof, Om P. Malik
Kenneth Moore, Director of IEEE Book and Information Services (BIS)

REINFORCEMENT LEARNING AND APPROXIMATE DYNAMIC PROGRAMMING FOR FEEDBACK CONTROL

Edited by
Frank L. Lewis, UTA Automation and Robotics Research Institute, Fort Worth, TX
Derong Liu, University of Illinois, Chicago, IL

IEEE PRESS
WILEY
A JOHN WILEY & SONS, INC., PUBLICATION

Cover Illustration: Courtesy of Frank L. Lewis and Derong Liu
Cover Design: John Wiley & Sons, Inc.

Copyright © 2013 by The Institute of Electrical and Electronics Engineers, Inc.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey. All rights reserved.
Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site.

Library of Congress Cataloging-in-Publication Data:
Reinforcement learning and approximate dynamic programming for feedback control / edited by Frank L. Lewis, Derong Liu.
p. cm.
ISBN 978-1-118-10420-0 (hardback)
1. Reinforcement learning. 2. Feedback control systems. I. Lewis, Frank L. II. Liu, Derong, 1963-
Q325.6.R464 2012
003'.5-dc23
2012019014

Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

CONTENTS

PREFACE
CONTRIBUTORS

PART I FEEDBACK CONTROL USING RL AND ADP

1. Reinforcement Learning and Approximate Dynamic Programming (RLADP) - Foundations, Common Misconceptions, and the Challenges Ahead
   Paul J. Werbos
    1.1 Introduction
    1.2 What is RLADP?
        1.2.1 Definition of RLADP and the Task it Addresses
        1.2.2 Basic Tools - Bellman Equation, and Value and Policy Functions
        1.2.3 Optimization Over Time Without Value Functions
    1.3 Some Basic Challenges in Implementing ADP
        1.3.1 Accounting for Unseen Variables
        1.3.2 Offline Controller Design Versus Real-Time Learning
        1.3.3 "Model-Based" Versus "Model Free" Designs
        1.3.4 How to Approximate the Value Function Better
        1.3.5 How to Choose u(t) Based on a Value Function
        1.3.6 How to Build Cooperative Multiagent Systems with RLADP
    References

2. Stable Adaptive Neural Control of Partially Observable Dynamic Systems
   J. Nate Knight and Charles W. Anderson
    2.1 Introduction
    2.2 Background
    2.3 Stability Bias
    2.4 Example Application
        2.4.1 The Simulated System
        2.4.2 An Uncertain Linear Plant Model
        2.4.3 The Closed Loop Control System
        2.4.4 Determining RNN Weight Updates by Reinforcement Learning
        2.4.5 Results
        2.4.6 Conclusions
    References

3. Optimal Control of Unknown Nonlinear Discrete-Time Systems Using the Iterative Globalized Dual Heuristic Programming Algorithm
   Derong Liu and Ding Wang
    3.1 Background Material
    3.2 Neuro-Optimal Control Scheme Based on the Iterative ADP Algorithm
        3.2.1 Identification of the Unknown Nonlinear System
        3.2.2 Derivation of the Iterative ADP Algorithm
        3.2.3 Convergence Analysis of the Iterative ADP Algorithm
        3.2.4 Design Procedure of the Iterative ADP Algorithm
        3.2.5 NN Implementation of the Iterative ADP Algorithm Using GDHP Technique
    3.3 Generalization
    3.4 Simulation Studies
    3.5 Summary
    References

4. Learning and Optimization in Hierarchical Adaptive Critic Design
   Haibo He, Zhen Ni, and Dongbin Zhao
    4.1 Introduction
    4.2 Hierarchical ADP Architecture with Multiple-Goal Representation
        4.2.1 System Level Structure
        4.2.2 Architecture Design and Implementation
        4.2.3 Learning and Adaptation in Hierarchical ADP
    4.3 Case Study: The Ball-and-Beam System
        4.3.1 Problem Formulation
        4.3.2 Experiment Configuration and Parameters Setup
        4.3.3 Simulation Results and Analysis
    4.4 Conclusions and Future Work
    References

5. Single Network Adaptive Critics Networks - Development, Analysis, and Applications
   Jie Ding, Ali Heydari, and S. N. Balakrishnan
    5.1 Introduction
    5.2 Approximate Dynamic Programming
    5.3 SNAC
        5.3.1 State Generation for Neural Network Training
        5.3.2 Neural Network Training
        5.3.3 Convergence Condition
    5.4 J-SNAC
        5.4.1 Neural Network Training
        5.4.2 Numerical Analysis
    5.5 Finite-SNAC
        5.5.1 Neural Network Training
        5.5.2 Convergence Theorems
        5.5.3 Numerical Analysis
    5.6 Conclusions
    References

6. Linearly Solvable Optimal Control
   K. Dvijotham and E. Todorov
    6.1 Introduction
        6.1.1 Notation
        6.1.2 Markov Decision Processes
    6.2 Linearly Solvable Optimal Control Problems
        6.2.1 Probability Shift: An Alternate View of Control
        6.2.2 Linearly Solvable Markov Decision Processes (LMDPs)
        6.2.3 An Alternate View of LMDPs
        6.2.4 Other Problem Formulations
        6.2.5 Applications
        6.2.6 Linearly Solvable Controlled Diffusions (LDs)
        6.2.7 Relationship Between Discrete and Continuous-Time Problems
        6.2.8 Historical Perspective
    6.3 Extension to Risk-Sensitive Control and Game Theory
        6.3.1 Game Theoretic Control: Competitive Games
        6.3.2 Rényi Divergence
        6.3.3 Linearly Solvable Markov Games
        6.3.4 Linearly Solvable Differential Games
        6.3.5 Relationships Among the Different Formulations
    6.4 Properties and Algorithms
        6.4.1 Sampling Approximations and Path-Integral Control
        6.4.2 Residual Minimization via Function Approximation
        6.4.3 Natural Policy Gradient
        6.4.4 Compositionality of Optimal Control Laws
        6.4.5 Stochastic Maximum Principle
        6.4.6 Inverse Optimal Control
    6.5 Conclusions and Future Work
    References

7. Approximating Optimal Control with Value Gradient Learning
   Michael Fairbank, Danil Prokhorov, and Eduardo Alonso
    7.1 Introduction
    7.2 Value Gradient Learning and BPTT Algorithms
        7.2.1 Preliminary Definitions
        7.2.2 VGL(λ) Algorithm
        7.2.3 BPTT Algorithm
    7.3 A Convergence Proof for VGL(1) for Control with Function Approximation
        7.3.1 Using a Greedy Policy with a Critic Function
        7.3.2 The Equivalence of VGL(1) to BPTT
        7.3.3 Convergence Conditions
        7.3.4 Notes on the Ω_t Matrix
    7.4 Vertical Lander Experiment
        7.4.1 Problem Definition
        7.4.2 Efficient Evaluation of the Greedy Policy
        7.4.3 Observations on the Purpose of Ω_t
        7.4.4 Experimental Results for Vertical Lander Problem
    7.5 Conclusions
    References

8. A Constrained Backpropagation Approach to Function Approximation and Approximate Dynamic Programming
   Silvia Ferrari, Keith Rudd, and Gianluca Di Muro
    8.1 Background
    8.2 Constrained Backpropagation (CPROP) Approach
        8.2.1 Neural Network Architecture and Procedural Memories
        8.2.2 Derivation of LTM Equality Constraints and Adjoined Error Gradient
        8.2.3 Example: Incremental Function Approximation
    8.3 Solution of Partial Differential Equations in Nonstationary Environments
        8.3.1 CPROP Solution of Boundary Value Problems
        8.3.2 Example: PDE Solution on a Unit Circle
        8.3.3 CPROP Solution to Parabolic PDEs
    8.4 Preserving Prior Knowledge in Exploratory Adaptive Critic Designs
        8.4.1 Derivation of LTM Constraints for Feedback Control
        8.4.2 Constrained Adaptive Critic Design
    8.5 Summary
    Appendix: Algebraic ANN Control Matrices
    References

9. Toward Design of Nonlinear ADP Learning Controllers with Performance Assurance
   Jennie Si, Lei Yang, Chao Lu, Kostas S. Tsakalis, and Armando A. Rodriguez
    9.1 Introduction
    9.2 Direct Heuristic Dynamic Programming
    9.3 A Control Theoretic View on the Direct HDP
        9.3.1 Problem Setup
        9.3.2 Frequency Domain Analysis of Direct HDP
        9.3.3 Insight from Comparing Direct HDP to LQR
    9.4 Direct HDP Design with Improved Performance Case 1 - Design Guided by a Priori LQR Information
        9.4.1 Direct HDP Design Guided by a Priori LQR Information
        9.4.2 Performance of the Direct HDP Beyond Linearization
    9.5 Direct HDP Design with Improved Performance Case 2 - Direct HDP for Coordinated Damping Control of Low-Frequency Oscillation
    9.6 Summary
    References

10. Reinforcement Learning Control with Time-Dependent Agent Dynamics
    Kenton Kirkpatrick and John Valasek
    10.1 Introduction
    10.2 Q-Learning
        10.2.1 Q-Learning Algorithm
        10.2.2 ε-Greedy
        10.2.3 Function Approximation
    10.3 Sampled Data Q-Learning
        10.3.1 Sampled Data Q-Learning Algorithm
        10.3.2 Example
    10.4 System Dynamics Approximation
        10.4.1 First-Order Dynamics Learning
        10.4.2 Multiagent System Thought Experiment
    10.5 Closing Remarks
    References

11. Online Optimal Control of Nonaffine Nonlinear Discrete-Time Systems without Using Value and Policy Iterations
    Hassan Zargarzadeh, Qinmin Yang, and S. Jagannathan
    11.1 Introduction
    11.2 Background
    11.3 Reinforcement Learning Based Control
        11.3.1 Affine-Like Dynamics
        11.3.2 Online Reinforcement Learning Controller Design
        11.3.3 The Action NN Design
        11.3.4 The Critic NN Design
        11.3.5 Weight Updating Laws for the NNs
        11.3.6 Main Theoretic Results
    11.4 Time-Based Adaptive Dynamic Programming-Based Optimal Control
        11.4.1 Online NN-Based Identifier
        11.4.2 Neural Network-Based Optimal Controller Design
        11.4.3 Cost Function Approximation for Optimal Regulator Design
        11.4.4 Estimation of the Optimal Feedback Control Signal
        11.4.5 Convergence Proof
        11.4.6 Robustness
    11.5 Simulation Result
        11.5.1 Reinforcement-Learning-Based Control of a Nonlinear System
        11.5.2 The Drawback of HDP Policy Iteration Approach
        11.5.3 OLA-Based Optimal Control Applied to HCCI Engine
    References

12. An Actor-Critic-Identifier Architecture for Adaptive Approximate Optimal Control
    S. Bhasin, R. Kamalapurkar, M. Johnson, K. G. Vamvoudakis, F. L. Lewis, and W. E. Dixon
    12.1 Introduction
    12.2 Actor-Critic-Identifier Architecture for HJB Approximation
    12.3 Actor-Critic Design
    12.4 Identifier Design
    12.5 Convergence and Stability Analysis
    12.6 Simulation
    12.7 Conclusion
    References

13. Robust Adaptive Dynamic Programming
    Yu Jiang and Zhong-Ping Jiang
    13.1 Introduction
    13.2 Optimality Versus Robustness
        13.2.1 Systems with Matched Disturbance Input
        13.2.2 Adding One Integrator
        13.2.3 Systems in Lower-Triangular Form
    13.3 Robust-ADP Design for Disturbance Attenuation
        13.3.1 Horizontal Learning
        13.3.2 Vertical Learning
        13.3.3 Robust-ADP Algorithm for Disturbance Attenuation
    13.4 Robust-ADP for Partial-State Feedback Control
        13.4.1 The ISS Property
        13.4.2 Online Learning Strategy
    13.5 Applications
        13.5.1 Load-Frequency Control for a Power System
        13.5.2 Machine Tool Power Drive System
    13.6 Summary
    References

PART II LEARNING AND CONTROL IN MULTIAGENT GAMES

14. Hybrid Learning in Stochastic Games and Its Application in Network Security
    Quanyan Zhu, Hamidou Tembine, and Tamer Başar
    14.1 Introduction
        14.1.1 Related Work
        14.1.2 Contribution
        14.1.3 Organization of the Chapter
    14.2 Two-Person Game
    14.3 Learning in NZSGs
        14.3.1 Learning Procedures
        14.3.2 Learning Schemes
    14.4 Main Results
        14.4.1 Stochastic Approximation of the Pure Learning Schemes
        14.4.2 Stochastic Approximation of the Hybrid Learning Scheme
        14.4.3 Connection with Equilibria of the Expected Game
    14.5 Security Application
    14.6 Conclusions and Future Works
    Appendix: Assumptions for Stochastic Approximation
    References

15. Integral Reinforcement Learning for Online Computation of Nash Strategies of Nonzero-Sum Differential Games
    Draguna Vrabie and F. L. Lewis
    15.1 Introduction
    15.2 Two-Player Games and Integral Reinforcement Learning
        15.2.1 Two-Player Nonzero-Sum Games and Nash Equilibrium
        15.2.2 Integral Reinforcement Learning for Two-Player Nonzero-Sum Games
    15.3 Continuous-Time Value Iteration to Solve the Riccati Equation
    15.4 Online Algorithm to Solve Nonzero-Sum Games
        15.4.1 Finding Stabilizing Gains to Initialize the Online Algorithm
        15.4.2 Online Partially Model-Free Algorithm for Solving the Nonzero-Sum Differential Game
        15.4.3 Adaptive Critic Structure for Solving the Two-Player Nash Differential Game
    15.5 Analysis of the Online Learning Algorithm for NZS Games
        15.5.1 Mathematical Formulation of the Online Algorithm
    15.6 Simulation Result for the Online Game Algorithm
    15.7 Conclusion
    References

16. Online Learning Algorithms for Optimal Control and Dynamic Games
    Kyriakos G. Vamvoudakis and Frank L. Lewis
    16.1 Introduction
    16.2 Optimal Control and the Continuous Time Hamilton-Jacobi-Bellman Equation
        16.2.1 Optimal Control and Hamilton-Jacobi-Bellman Equation
        16.2.2 Policy Iteration for Optimal Control
        16.2.3 Online Synchronous Policy Iteration
        16.2.4 Simulation
    16.3 Online Solution of Nonlinear Two-Player Zero-Sum Games and Hamilton-Jacobi-Isaacs Equation
        16.3.1 Zero-Sum Games and Hamilton-Jacobi-Isaacs Equation
        16.3.2 Policy Iteration for Two-Player Zero-Sum Differential Games
        16.3.3 Online Solution for Two-Player Zero-Sum Differential Games
        16.3.4 Simulation
    16.4 Online Solution of Nonlinear Nonzero-Sum Games and Coupled Hamilton-Jacobi Equations
        16.4.1 Nonzero Sum Games and Coupled Hamilton-Jacobi Equations
        16.4.2 Policy Iteration for Nonzero Sum Differential Games
        16.4.3 Online Solution for Two-Player Nonzero Sum Differential Games
        16.4.4 Simulation
    References

PART III FOUNDATIONS IN MDP AND RL

17. Lambda-Policy Iteration: A Review and a New Implementation
    Dimitri P. Bertsekas
    17.1 Introduction
    17.2 Lambda-Policy Iteration without Cost Function Approximation
    17.3 Approximate Policy Evaluation Using Projected Equations
        17.3.1 Exploration-Contraction Trade-off
        17.3.2 Bias
        17.3.3 Bias-Variance Trade-off
        17.3.4 TD Methods
        17.3.5 Comparison of LSTD(λ) and LSPE(λ)
    17.4 Lambda-Policy Iteration with Cost Function Approximation
        17.4.1 The LSPE(λ) Implementation
        17.4.2 λ-PI(0) - An Implementation Based on a Discounted MDP
        17.4.3 λ-PI(1) - An Implementation Based on a Stopping Problem
        17.4.4 Comparison with Alternative Approximate PI Methods
        17.4.5 Exploration-Enhanced LSTD(λ) with Geometric Sampling
    17.5 Conclusions
    References

18. Optimal Learning and Approximate Dynamic Programming
    Warren B. Powell and Ilya O. Ryzhov
    18.1 Introduction
    18.2 Modeling
    18.3 The Four Classes of Policies
        18.3.1 Myopic Cost Function Approximation
        18.3.2 Lookahead Policies
        18.3.3 Policy Function Approximation
        18.3.4 Policies Based on Value Function Approximations
        18.3.5 Learning Policies
    18.4 Basic Learning Policies for Policy Search
        18.4.1 The Belief Model
        18.4.2 Objective Functions for Offline and Online Learning
        18.4.3 Some Heuristic Policies
    18.5 Optimal Learning Policies for Policy Search
        18.5.1 The Knowledge Gradient for Offline Learning
        18.5.2 The Knowledge Gradient for Correlated Beliefs
        18.5.3 The Knowledge Gradient for Online Learning
        18.5.4 The Knowledge Gradient for a Parametric Belief Model
        18.5.5 Discussion
    18.6 Learning with a Physical State
        18.6.1 Heuristic Policies
        18.6.2 The Knowledge Gradient with a Physical State
    References

19. An Introduction to Event-Based Optimization: Theory and Applications
    Xi-Ren Cao, Yanjia Zhao, Qing-Shan Jia, and Qianchuan Zhao
    19.1 Introduction
    19.2 Literature Review
    19.3 Problem Formulation
    19.4 Policy Iteration for EBO
        19.4.1 Performance Difference and Derivative Formulas
        19.4.2 Policy Iteration for EBO
    19.5 Example: Material Handling Problem
        19.5.1 Problem Formulation
        19.5.2 Event-Based Optimization for the Material Handling Problem
        19.5.3 Numerical Results
    19.6 Conclusions
    References

20. Bounds for Markov Decision Processes
    Vijay V. Desai, Vivek F. Farias, and Ciamac C. Moallemi
    20.1 Introduction
        20.1.1 Related Literature
    20.2 Problem Formulation
    20.3 The Linear Programming Approach
        20.3.1 The Exact Linear Program
        20.3.2 Cost-to-Go Function Approximation
        20.3.3 The Approximate Linear Program
    20.4 The Martingale Duality Approach
    20.5 The Pathwise Optimization Method
    20.6 Applications
        20.6.1 Optimal Stopping
        20.6.2 Linear Convex Control
    20.7 Conclusion
    References

21. Approximate Dynamic Programming and Backpropagation on Timescales
    John Seiffertt and Donald Wunsch
    21.1 Introduction: Timescales Fundamentals
        21.1.1 Single-Variable Calculus
        21.1.2 Calculus of Multiple Variables
        21.1.3 Extension of the Chain Rule
        21.1.4 Induction on Timescales
    21.2 Dynamic Programming
        21.2.1 Dynamic Programming Overview
        21.2.2 Dynamic Programming Algorithm on Timescales
        21.2.3 HJB Equation on Timescales
    21.3 Backpropagation
        21.3.1 Ordered Derivatives
        21.3.2 The Backpropagation Algorithm on Timescales
    21.4 Conclusions
    References

22. A Survey of Optimistic Planning in Markov Decision Processes
    Lucian Buşoniu, Rémi Munos, and Robert Babuška
    22.1 Introduction
    22.2 Optimistic Online Optimization
        22.2.1 Bandit Problems
        22.2.2 Lipschitz Functions and Deterministic Samples
        22.2.3 Lipschitz Functions and Random Samples
    22.3 Optimistic Planning Algorithms
        22.3.1 Optimistic Planning for Deterministic Systems
        22.3.2 Open-Loop Optimistic Planning
        22.3.3 Optimistic Planning for Sparsely Stochastic Systems
        22.3.4 Theoretical Guarantees
    22.4 Related Planning Algorithms
    22.5 Numerical Example
    References

23. Adaptive Feature Pursuit: Online Adaptation of Features in Reinforcement Learning
    Shalabh Bhatnagar, Vivek S. Borkar, and L. A. Prashanth
    23.1 Introduction
    23.2 The Framework
        23.2.1 The TD(0) Learning Algorithm
    23.3 The Feature Adaptation Scheme
        23.3.1 The Feature Adaptation Scheme
    23.4 Convergence Analysis
    23.5 Application to Traffic Signal Control
    23.6 Conclusions
    References

24. Feature Selection for Neuro-Dynamic Programming
    Dayu Huang, W. Chen, P. Mehta, S. Meyn, and A. Surana
    24.1 Introduction
    24.2 Optimality Equations
        24.2.1 Deterministic Model
        24.2.2 Diffusion Model
        24.2.3 Models in Discrete Time
        24.2.4 Approximations
    24.3 Neuro-Dynamic Algorithms
        24.3.1 MDP Model
        24.3.2 TD-Learning
        24.3.3 SARSA
        24.3.4 Q-Learning
        24.3.5 Architecture
    24.4 Fluid Models
        24.4.1 The CRW Queue
        24.4.2 Speed-Scaling Model
    24.5 Diffusion Models
        24.5.1 The CRW Queue
        24.5.2 Speed-Scaling Model
    24.6 Mean Field Games
    24.7 Conclusions
    References

25. Approximate Dynamic Programming for Optimizing Oil Production
    Zheng Wen, Louis J. Durlofsky, Benjamin Van Roy, and Khalid Aziz
    25.1 Introduction
    25.2 Petroleum Reservoir Production Optimization Problem
    25.3 Review of Dynamic Programming and Approximate Dynamic Programming
    25.4 Approximate Dynamic Programming Algorithm for Reservoir Production Optimization
        25.4.1 Basis Function Construction
        25.4.2 Computation of Coefficients
        25.4.3 Solving Subproblems
        25.4.4 Adaptive Basis Function Selection and Bootstrapping
        25.4.5 Computational Requirements
    25.5 Simulation Results
    25.6 Concluding Remarks
    References

26. A Learning Strategy for Source Tracking in Unstructured Environments
    Titus Appel, Rafael Fierro, Brandon Rohrer, Ron Lumia, and John Wood
    26.1 Introduction
    26.2 Reinforcement Learning
        26.2.1 Q-Learning
        26.2.2 Q-Learning and Robotics
    26.3 Light-Following Robot
    26.4 Simulation Results
    26.5 Experimental Results
        26.5.1 Hardware
        26.5.2 Problems in Hardware Implementation
        26.5.3 Results
    26.6 Conclusions and Future Work
    References

INDEX

PREFACE

Modern day society relies on the operation of complex systems including aircraft, automobiles, electric power systems, economic entities, business organizations, banking and finance systems, computer networks, manufacturing systems, and industrial processes. Decision and control are responsible for ensuring that these systems perform properly and meet prescribed performance objectives. The safe, reliable, and efficient control of these systems is essential for our society. Therefore, automatic decision and control systems are ubiquitous in human engineered systems and have had an enormous impact on our lives. As modern systems become more complex and performance requirements more stringent, improved methods of decision and control are required that deliver guaranteed performance and the satisfaction of prescribed goals.
Feedback control works on the principle of observing the actual outputs of a system, comparing them to desired trajectories, and computing a control signal based on that error, which is used to modify the performance of the system to make the actual output follow the desired trajectory. The optimization of sequential decisions or controls that are repeated over time arises in many fields, including artificial intelligence, automatic control systems, power systems, economics, medicine, operations research, resource allocation, collaboration and coalitions, business and finance, and games including chess and backgammon. Optimal control theory provides methods for computing feedback control systems that deliver optimal performance. Optimal controllers optimize user-prescribed performance functions and are normally designed offline by solving Hamilton-Jacobi-Bellman (HJB) design equations. This requires knowledge of the full system dynamics model. However, it is often difficult to determine an accurate dynamical model of practical systems. Moreover, determining optimal control policies for nonlinear systems requires the offline solution of nonlinear HJB equations, which are often difficult or impossible to solve.

Dynamic programming (DP) is a sequential algorithmic method for finding optimal solutions in sequential decision problems. DP was developed beginning in the 1950s with the work of Bellman and Pontryagin. DP is fundamentally a backwards-in-time procedure that does not offer methods for solving optimal decision problems in a forward manner in real time.

The real-time adaptive learning of optimal controllers for complex unknown systems has been solved in nature. Every agent or system is concerned with acting on its environment in such a way as to achieve its goals. Agents seek to learn how to collaborate to improve their chances of survival and increase.
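
To make the "backwards-in-time" character of DP mentioned above concrete, the following is a minimal sketch of finite-horizon backward induction. It is not taken from the book: the function name `backward_dp`, the five-position toy system, its costs, and the horizon are all hypothetical illustration values.

```python
# Finite-horizon dynamic programming by backward induction on a toy problem.
# Everything here (states, costs, horizon) is a hypothetical illustration.

def backward_dp(n_states, actions, step, stage_cost, terminal_cost, horizon):
    """Return (value, policy) tables computed backwards in time.

    value[k][x]  : optimal cost-to-go from state x at stage k
    policy[k][x] : an action attaining that cost
    """
    value = [[0.0] * n_states for _ in range(horizon + 1)]
    policy = [[None] * n_states for _ in range(horizon)]
    for x in range(n_states):
        value[horizon][x] = terminal_cost(x)
    # Sweep from the final stage back to stage 0: each stage needs the
    # cost-to-go of the stage after it, so the sweep cannot run forward.
    for k in range(horizon - 1, -1, -1):
        for x in range(n_states):
            best_u, best_cost = None, float("inf")
            for u in actions:
                cost = stage_cost(x, u) + value[k + 1][step(x, u)]
                if cost < best_cost:
                    best_u, best_cost = u, cost
            value[k][x] = best_cost
            policy[k][x] = best_u
    return value, policy


# Toy deterministic system: positions 0..4, move left / stay / right.
N = 5
ACTIONS = (-1, 0, 1)

def step(x, u):
    # Next position, clipped to the ends of the line.
    return min(max(x + u, 0), N - 1)

def stage_cost(x, u):
    # Distance from the target position 2, plus a unit cost for moving.
    return abs(x - 2) + abs(u)

def terminal_cost(x):
    # Heavy penalty for ending away from the target.
    return 10.0 * abs(x - 2)

value, policy = backward_dp(N, ACTIONS, step, stage_cost, terminal_cost, horizon=3)
```

The tables are filled from the last stage toward the first, which is why the whole model (`step`, `stage_cost`) must be known before any control is applied. This is the offline limitation the preface contrasts with real-time learning.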
The idea that there is a cause and effect relation between actions and rewards is inherent in animal learning. Most organisms in nature act in an optimal fashion to conserve resources while achieving their goals. It is possible to study natural methods of learning and use them to develop computerized machine learning methods that solve sequential decision problems.

Reinforcement learning (RL) describes a family of machine learning systems that operate based on principles used in animals, social groups, and naturally occurring systems. RL methods were used by Ivan Pavlov in the 1890s to train his dogs. RL refers to an actor or agent that interacts with its environment and modifies its actions, or control policies, based on stimuli received in response to its actions. RL computational methods have been developed by the Computational Intelligence Community that solve optimal decision problems in real time and do not require the availability of analytical system models. The RL algorithms are constructed on the idea that successful control decisions should be remembered, by means of a reinforcement signal, such that they become more likely to be used another time. Successful collaborating groups should be reinforced. Although the idea originates from experimental animal learning, it has also been observed that RL has strong support from neurobiology, where it has been noted that the dopamine neurotransmitter in the basal ganglia acts as a reinforcement informational signal, which favors learning at the level of the neurons in the brain.

RL techniques were first developed for Markov decision processes having finite state spaces. They have been extended for the control of dynamical systems with infinite state spaces. One class of RL methods is based on the actor-critic structure, where an actor component applies an action or a control policy to the environment, whereas a critic component assesses the value of that action.
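
As a rough illustration of the actor-critic structure just described, the sketch below trains a softmax actor and a scalar critic on a one-state environment with two actions. It is an assumption-laden toy rather than any method from the book: the function names, the reward values, the learning rates, and the single-state setting are all hypothetical.

```python
import math
import random

def softmax(prefs):
    """Turn action preferences into selection probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def train_actor_critic(rewards, steps=500, actor_lr=0.5, critic_lr=0.1, seed=0):
    """One-state actor-critic; rewards[a] is the (hypothetical) payoff of action a."""
    rng = random.Random(seed)
    prefs = [0.0] * len(rewards)   # actor: one preference per action
    value = 0.0                    # critic: estimated value of the single state
    for _ in range(steps):
        probs = softmax(prefs)
        # Actor applies an action to the environment.
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        # Critic assesses the action: positive if it did better than expected.
        td_error = reward - value
        value += critic_lr * td_error
        # Reinforcement signal shifts the policy toward successful actions
        # (gradient of the log-softmax probability of the chosen action).
        for a in range(len(prefs)):
            indicator = 1.0 if a == action else 0.0
            prefs[a] += actor_lr * td_error * (indicator - probs[a])
    return prefs, value

prefs, value = train_actor_critic(rewards=[0.0, 1.0])
probs = softmax(prefs)
```

The critic's error `td_error` plays the role of the reinforcement signal: actions that do better than the critic expected become more likely to be chosen again. No model of the environment is used, only the observed rewards.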
Actor-critic structures are particularly well adapted for solving optimal decision problems in real time through reinforcement learning techniques. Approximate dynamic programming (ADP) refers to a family of practical actor-critic methods for finding optimal solutions in real time. These techniques use computational enhancements such as function approximation to develop practical algorithms for complex systems with disturbances and uncertain dynamics. Now, the ADP approach has become a key direction for future research in understanding brain intelligence and building intelligent systems.

The purpose of this book is to give an exposition of recently developed RL and ADP techniques for decision and control in human engineered systems. Included are both single-player decision and control and multiplayer games. RL is strongly connected from a theoretical point of view with both adaptive learning control and optimal control methods. There has been a great deal of interest in RL, and recent work has shown that ideas based on ADP can be used to design a family of adaptive learning algorithms that converge in real time to optimal control solutions by measuring data along the system trajectories. The study of RL and ADP requires methods from many fields, including computational intelligence, automatic control systems, Markov decision processes, stochastic games, psychology, operations research, cybernetics, neural networks, and neurobiology. Therefore, this book brings together ideas from many communities.

This book has three parts. Part I develops methods for feedback control of systems based on RL and ADP. Part II treats learning and control in multiagent games. Part III presents some ideas of fundamental importance in understanding and implementing decision algorithms in Markov processes.

F. L. Lewis, Fort Worth, TX
Derong Liu, Chicago, IL

Book Description: Reinforcement learning (RL) and adaptive dynamic programming (ADP) have been among the most critical research fields in science and engineering for modern complex systems. This book describes the latest RL and ADP techniques for decision and control in human engineered systems, covering both single-player decision and control and multiplayer games. Edited by the pioneers of RL and ADP research, the book brings together ideas and methods from many fields and provides important and timely guidance on controlling a wide variety of systems, such as robots, industrial processes, and economic decision-making.
