Author | Derong Liu and Frank L. Lewis
ISBN | 978-1118104200
File size | 45.1 MB
Year | 2013
Pages | 648
Language | English
File format | PDF
Category | Programming

IEEE Press
445 Hoes Lane
Piscataway, NJ 08854

IEEE Press Editorial Board 2012
John Anderson, Editor in Chief

Ramesh Abhari        Bernhard M. Haemmerli    Saeid Nahavandi
George W. Arnold     David Jacobson           Tariq Samad
Flavio Canavero      Mary Lanzerotti          George Zobrist
Dmitry Goldgof       Om P. Malik

Kenneth Moore, Director of IEEE Book and Information Services (BIS)

REINFORCEMENT LEARNING AND APPROXIMATE DYNAMIC PROGRAMMING FOR FEEDBACK CONTROL

Edited by

Frank L. Lewis
UTA Automation and Robotics Research Institute
Fort Worth, TX

Derong Liu
University of Illinois
Chicago, IL

IEEE PRESS

WILEY
A JOHN WILEY & SONS, INC., PUBLICATION

Cover Illustration: Courtesy of Frank L. Lewis and Derong Liu
Cover Design: John Wiley & Sons, Inc.

Copyright © 2013 by The Institute of Electrical and Electronics Engineers, Inc.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. All rights reserved.
Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Reinforcement learning and approximate dynamic programming for feedback control / edited by Frank L. Lewis, Derong Liu.
  p. cm.
  ISBN 978-1-118-10420-0 (hardback)
  1. Reinforcement learning. 2. Feedback control systems. I. Lewis, Frank L. II. Liu, Derong, 1963-
  Q325.6.R464 2012
  003'.5-dc23
  2012019014

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1

CONTENTS

PREFACE xix

CONTRIBUTORS xxiii

PART I  FEEDBACK CONTROL USING RL AND ADP

1. Reinforcement Learning and Approximate Dynamic Programming (RLADP)-Foundations, Common Misconceptions, and the Challenges Ahead 3
   Paul J. Werbos
   1.1 Introduction 3
   1.2 What is RLADP? 4
       1.2.1 Definition of RLADP and the Task it Addresses 4
       1.2.2 Basic Tools-Bellman Equation, and Value and Policy Functions 9
       1.2.3 Optimization Over Time Without Value Functions 13
   1.3 Some Basic Challenges in Implementing ADP 14
       1.3.1 Accounting for Unseen Variables 15
       1.3.2 Offline Controller Design Versus Real-Time Learning 17
       1.3.3 "Model-Based" Versus "Model Free" Designs 18
       1.3.4 How to Approximate the Value Function Better 19
       1.3.5 How to Choose u(t) Based on a Value Function 22
       1.3.6 How to Build Cooperative Multiagent Systems with RLADP 25
   References 26

2. Stable Adaptive Neural Control of Partially Observable Dynamic Systems 31
   J. Nate Knight and Charles W. Anderson
   2.1 Introduction 31
   2.2 Background 32
   2.3 Stability Bias 35
   2.4 Example Application 38
       2.4.1 The Simulated System 38
       2.4.2 An Uncertain Linear Plant Model 40
       2.4.3 The Closed Loop Control System 41
       2.4.4 Determining RNN Weight Updates by Reinforcement Learning 44
       2.4.5 Results 46
       2.4.6 Conclusions 50
   References 50

3. Optimal Control of Unknown Nonlinear Discrete-Time Systems Using the Iterative Globalized Dual Heuristic Programming Algorithm 52
   Derong Liu and Ding Wang
   3.1 Background Material 53
   3.2 Neuro-Optimal Control Scheme Based on the Iterative ADP Algorithm 55
       3.2.1 Identification of the Unknown Nonlinear System 55
       3.2.2 Derivation of the Iterative ADP Algorithm 59
       3.2.3 Convergence Analysis of the Iterative ADP Algorithm 59
       3.2.4 Design Procedure of the Iterative ADP Algorithm 64
       3.2.5 NN Implementation of the Iterative ADP Algorithm Using GDHP Technique 64
   3.3 Generalization 67
   3.4 Simulation Studies 68
   3.5 Summary 74
   References 74

4. Learning and Optimization in Hierarchical Adaptive Critic Design 78
   Haibo He, Zhen Ni, and Dongbin Zhao
   4.1 Introduction 78
   4.2 Hierarchical ADP Architecture with Multiple-Goal Representation 80
       4.2.1 System Level Structure 80
       4.2.2 Architecture Design and Implementation 81
       4.2.3 Learning and Adaptation in Hierarchical ADP 83
   4.3 Case Study: The Ball-and-Beam System 87
       4.3.1 Problem Formulation 88
       4.3.2 Experiment Configuration and Parameters Setup 89
       4.3.3 Simulation Results and Analysis 90
   4.4 Conclusions and Future Work 94
   References 95

5. Single Network Adaptive Critics Networks-Development, Analysis, and Applications 98
   Jie Ding, Ali Heydari, and S.N. Balakrishnan
   5.1 Introduction 98
   5.2 Approximate Dynamic Programming 100
   5.3 SNAC 102
       5.3.1 State Generation for Neural Network Training 103
       5.3.2 Neural Network Training 103
       5.3.3 Convergence Condition 104
   5.4 J-SNAC 104
       5.4.1 Neural Network Training 105
       5.4.2 Numerical Analysis 105
   5.5 Finite-SNAC 108
       5.5.1 Neural Network Training 109
       5.5.2 Convergence Theorems 111
       5.5.3 Numerical Analysis 112
   5.6 Conclusions 116
   References 116

6. Linearly Solvable Optimal Control 119
   K. Dvijotham and E. Todorov
   6.1 Introduction 119
       6.1.1 Notation 121
       6.1.2 Markov Decision Processes 122
   6.2 Linearly Solvable Optimal Control Problems 123
       6.2.1 Probability Shift: An Alternate View of Control 123
       6.2.2 Linearly Solvable Markov Decision Processes (LMDPs) 124
       6.2.3 An Alternate View of LMDPs 124
       6.2.4 Other Problem Formulations 126
       6.2.5 Applications 126
       6.2.6 Linearly Solvable Controlled Diffusions (LDs) 127
       6.2.7 Relationship Between Discrete and Continuous-Time Problems 128
       6.2.8 Historical Perspective 129
   6.3 Extension to Risk-Sensitive Control and Game Theory 130
       6.3.1 Game Theoretic Control: Competitive Games 130
       6.3.2 Rényi Divergence 130
       6.3.3 Linearly Solvable Markov Games 130
       6.3.4 Linearly Solvable Differential Games 133
       6.3.5 Relationships Among the Different Formulations 134
   6.4 Properties and Algorithms 134
       6.4.1 Sampling Approximations and Path-Integral Control 134
       6.4.2 Residual Minimization via Function Approximation 135
       6.4.3 Natural Policy Gradient 136
       6.4.4 Compositionality of Optimal Control Laws 136
       6.4.5 Stochastic Maximum Principle 137
       6.4.6 Inverse Optimal Control 138
   6.5 Conclusions and Future Work 139
   References 139

7. Approximating Optimal Control with Value Gradient Learning 142
   Michael Fairbank, Danil Prokhorov, and Eduardo Alonso
   7.1 Introduction 142
   7.2 Value Gradient Learning and BPTT Algorithms 144
       7.2.1 Preliminary Definitions 144
       7.2.2 VGL(λ) Algorithm 145
       7.2.3 BPTT Algorithm 147
   7.3 A Convergence Proof for VGL(1) for Control with Function Approximation 148
       7.3.1 Using a Greedy Policy with a Critic Function 149
       7.3.2 The Equivalence of VGL(1) to BPTT 151
       7.3.3 Convergence Conditions 152
       7.3.4 Notes on the Ω_t Matrix 152
   7.4 Vertical Lander Experiment 154
       7.4.1 Problem Definition 154
       7.4.2 Efficient Evaluation of the Greedy Policy 155
       7.4.3 Observations on the Purpose of Ω_t 157
       7.4.4 Experimental Results for Vertical Lander Problem 158
   7.5 Conclusions 159
   References 160

8. A Constrained Backpropagation Approach to Function Approximation and Approximate Dynamic Programming 162
   Silvia Ferrari, Keith Rudd, and Gianluca Di Muro
   8.1 Background 163
   8.2 Constrained Backpropagation (CPROP) Approach 163
       8.2.1 Neural Network Architecture and Procedural Memories 165
       8.2.2 Derivation of LTM Equality Constraints and Adjoined Error Gradient 165
       8.2.3 Example: Incremental Function Approximation 168
   8.3 Solution of Partial Differential Equations in Nonstationary Environments 170
       8.3.1 CPROP Solution of Boundary Value Problems 170
       8.3.2 Example: PDE Solution on a Unit Circle 171
       8.3.3 CPROP Solution to Parabolic PDEs 174
   8.4 Preserving Prior Knowledge in Exploratory Adaptive Critic Designs 174
       8.4.1 Derivation of LTM Constraints for Feedback Control 175
       8.4.2 Constrained Adaptive Critic Design 177
   8.5 Summary 179
   Appendix: Algebraic ANN Control Matrices 180
   References 180

9. Toward Design of Nonlinear ADP Learning Controllers with Performance Assurance 182
   Jennie Si, Lei Yang, Chao Lu, Kostas S. Tsakalis, and Armando A. Rodriguez
   9.1 Introduction 183
   9.2 Direct Heuristic Dynamic Programming 184
   9.3 A Control Theoretic View on the Direct HDP 186
       9.3.1 Problem Setup 187
       9.3.2 Frequency Domain Analysis of Direct HDP 189
       9.3.3 Insight from Comparing Direct HDP to LQR 192
   9.4 Direct HDP Design with Improved Performance Case 1-Design Guided by a Priori LQR Information 193
       9.4.1 Direct HDP Design Guided by a Priori LQR Information 193
       9.4.2 Performance of the Direct HDP Beyond Linearization 195
   9.5 Direct HDP Design with Improved Performance Case 2-Direct HDP for Coordinated Damping Control of Low-Frequency Oscillation 198
   9.6 Summary 201
   References 202

10. Reinforcement Learning Control with Time-Dependent Agent Dynamics 203
    Kenton Kirkpatrick and John Valasek
    10.1 Introduction 203
    10.2 Q-Learning 205
         10.2.1 Q-Learning Algorithm 205
         10.2.2 ε-Greedy 207
         10.2.3 Function Approximation 208
    10.3 Sampled Data Q-Learning 209
         10.3.1 Sampled Data Q-Learning Algorithm 209
         10.3.2 Example 210
    10.4 System Dynamics Approximation 213
         10.4.1 First-Order Dynamics Learning 214
         10.4.2 Multiagent System Thought Experiment 216
    10.5 Closing Remarks 218
    References 219

11. Online Optimal Control of Nonaffine Nonlinear Discrete-Time Systems without Using Value and Policy Iterations 221
    Hassan Zargarzadeh, Qinmin Yang, and S. Jagannathan
    11.1 Introduction 221
    11.2 Background 224
    11.3 Reinforcement Learning Based Control 225
         11.3.1 Affine-Like Dynamics 225
         11.3.2 Online Reinforcement Learning Controller Design 229
         11.3.3 The Action NN Design 229
         11.3.4 The Critic NN Design 230
         11.3.5 Weight Updating Laws for the NNs 231
         11.3.6 Main Theoretic Results 232
    11.4 Time-Based Adaptive Dynamic Programming-Based Optimal Control 234
         11.4.1 Online NN-Based Identifier 235
         11.4.2 Neural Network-Based Optimal Controller Design 237
         11.4.3 Cost Function Approximation for Optimal Regulator Design 238
         11.4.4 Estimation of the Optimal Feedback Control Signal 240
         11.4.5 Convergence Proof 242
         11.4.6 Robustness 244
    11.5 Simulation Result 247
         11.5.1 Reinforcement-Learning-Based Control of a Nonlinear System 247
         11.5.2 The Drawback of HDP Policy Iteration Approach 250
         11.5.3 OLA-Based Optimal Control Applied to HCCI Engine 251
    References 255

12. An Actor-Critic-Identifier Architecture for Adaptive Approximate Optimal Control 258
    S. Bhasin, R. Kamalapurkar, M. Johnson, K.G. Vamvoudakis, F.L. Lewis, and W.E. Dixon
    12.1 Introduction 259
    12.2 Actor-Critic-Identifier Architecture for HJB Approximation 260
    12.3 Actor-Critic Design 263
    12.4 Identifier Design 264
    12.5 Convergence and Stability Analysis 270
    12.6 Simulation 274
    12.7 Conclusion 275
    References 278

13. Robust Adaptive Dynamic Programming 281
    Yu Jiang and Zhong-Ping Jiang
    13.1 Introduction 281
    13.2 Optimality Versus Robustness 283
         13.2.1 Systems with Matched Disturbance Input 283
         13.2.2 Adding One Integrator 284
         13.2.3 Systems in Lower-Triangular Form 286
    13.3 Robust-ADP Design for Disturbance Attenuation 288
         13.3.1 Horizontal Learning 288
         13.3.2 Vertical Learning 290
         13.3.3 Robust-ADP Algorithm for Disturbance Attenuation 291
    13.4 Robust-ADP for Partial-State Feedback Control 292
         13.4.1 The ISS Property 293
         13.4.2 Online Learning Strategy 295
    13.5 Applications 296
         13.5.1 Load-Frequency Control for a Power System 296
         13.5.2 Machine Tool Power Drive System 298
    13.6 Summary 300
    References 301

PART II  LEARNING AND CONTROL IN MULTIAGENT GAMES

14. Hybrid Learning in Stochastic Games and Its Application in Network Security 305
    Quanyan Zhu, Hamidou Tembine, and Tamer Başar
    14.1 Introduction 305
         14.1.1 Related Work 306
         14.1.2 Contribution 307
         14.1.3 Organization of the Chapter 308
    14.2 Two-Person Game 308
    14.3 Learning in NZSGs 310
         14.3.1 Learning Procedures 310
         14.3.2 Learning Schemes 311
    14.4 Main Results 314
         14.4.1 Stochastic Approximation of the Pure Learning Schemes 314
         14.4.2 Stochastic Approximation of the Hybrid Learning Scheme 315
         14.4.3 Connection with Equilibria of the Expected Game 317
    14.5 Security Application 322
    14.6 Conclusions and Future Works 326
    Appendix: Assumptions for Stochastic Approximation 327
    References 328

15. Integral Reinforcement Learning for Online Computation of Nash Strategies of Nonzero-Sum Differential Games 330
    Draguna Vrabie and F.L. Lewis
    15.1 Introduction 331
    15.2 Two-Player Games and Integral Reinforcement Learning 333
         15.2.1 Two-Player Nonzero-Sum Games and Nash Equilibrium 333
         15.2.2 Integral Reinforcement Learning for Two-Player Nonzero-Sum Games 335
    15.3 Continuous-Time Value Iteration to Solve the Riccati Equation 337
    15.4 Online Algorithm to Solve Nonzero-Sum Games 339
         15.4.1 Finding Stabilizing Gains to Initialize the Online Algorithm 339
         15.4.2 Online Partially Model-Free Algorithm for Solving the Nonzero-Sum Differential Game 339
         15.4.3 Adaptive Critic Structure for Solving the Two-Player Nash Differential Game 340
    15.5 Analysis of the Online Learning Algorithm for NZS Games 342
         15.5.1 Mathematical Formulation of the Online Algorithm 342
    15.6 Simulation Result for the Online Game Algorithm 345
    15.7 Conclusion 347
    References 348

16. Online Learning Algorithms for Optimal Control and Dynamic Games 350
    Kyriakos G. Vamvoudakis and Frank L. Lewis
    16.1 Introduction 350
    16.2 Optimal Control and the Continuous Time Hamilton-Jacobi-Bellman Equation 352
         16.2.1 Optimal Control and Hamilton-Jacobi-Bellman Equation 352
         16.2.2 Policy Iteration for Optimal Control 354
         16.2.3 Online Synchronous Policy Iteration 355
         16.2.4 Simulation 357
    16.3 Online Solution of Nonlinear Two-Player Zero-Sum Games and Hamilton-Jacobi-Isaacs Equation 360
         16.3.1 Zero-Sum Games and Hamilton-Jacobi-Isaacs Equation 360
         16.3.2 Policy Iteration for Two-Player Zero-Sum Differential Games 361
         16.3.3 Online Solution for Two-Player Zero-Sum Differential Games 362
         16.3.4 Simulation 364
    16.4 Online Solution of Nonlinear Nonzero-Sum Games and Coupled Hamilton-Jacobi Equations 366
         16.4.1 Nonzero Sum Games and Coupled Hamilton-Jacobi Equations 367
         16.4.2 Policy Iteration for Nonzero Sum Differential Games 369
         16.4.3 Online Solution for Two-Player Nonzero Sum Differential Games 370
         16.4.4 Simulation 372
    References 376

PART III  FOUNDATIONS IN MDP AND RL

17. Lambda-Policy Iteration: A Review and a New Implementation 381
    Dimitri P. Bertsekas
    17.1 Introduction 381
    17.2 Lambda-Policy Iteration without Cost Function Approximation 386
    17.3 Approximate Policy Evaluation Using Projected Equations 388
         17.3.1 Exploration-Contraction Trade-off 389
         17.3.2 Bias 390
         17.3.3 Bias-Variance Trade-off 390
         17.3.4 TD Methods 391
         17.3.5 Comparison of LSTD(λ) and LSPE(λ) 394
    17.4 Lambda-Policy Iteration with Cost Function Approximation 395
         17.4.1 The LSPE(λ) Implementation 396
         17.4.2 λ-PI(0)-An Implementation Based on a Discounted MDP 397
         17.4.3 λ-PI(1)-An Implementation Based on a Stopping Problem 398
         17.4.4 Comparison with Alternative Approximate PI Methods 404
         17.4.5 Exploration-Enhanced LSTD(λ) with Geometric Sampling 404
    17.5 Conclusions 406
    References 406

18. Optimal Learning and Approximate Dynamic Programming 410
    Warren B. Powell and Ilya O. Ryzhov
    18.1 Introduction 410
    18.2 Modeling 411
    18.3 The Four Classes of Policies 412
         18.3.1 Myopic Cost Function Approximation 412
         18.3.2 Lookahead Policies 413
         18.3.3 Policy Function Approximation 414
         18.3.4 Policies Based on Value Function Approximations 414
         18.3.5 Learning Policies 415
    18.4 Basic Learning Policies for Policy Search 416
         18.4.1 The Belief Model 417
         18.4.2 Objective Functions for Offline and Online Learning 418
         18.4.3 Some Heuristic Policies 419
    18.5 Optimal Learning Policies for Policy Search 421
         18.5.1 The Knowledge Gradient for Offline Learning 421
         18.5.2 The Knowledge Gradient for Correlated Beliefs 423
         18.5.3 The Knowledge Gradient for Online Learning 425
         18.5.4 The Knowledge Gradient for a Parametric Belief Model 425
         18.5.5 Discussion 426
    18.6 Learning with a Physical State 427
         18.6.1 Heuristic Policies 428
         18.6.2 The Knowledge Gradient with a Physical State 428
    References 429

19. An Introduction to Event-Based Optimization: Theory and Applications 432
    Xi-Ren Cao, Yanjia Zhao, Qing-Shan Jia, and Qianchuan Zhao
    19.1 Introduction 432
    19.2 Literature Review 433
    19.3 Problem Formulation 434
    19.4 Policy Iteration for EBO 435
         19.4.1 Performance Difference and Derivative Formulas 435
         19.4.2 Policy Iteration for EBO 440
    19.5 Example: Material Handling Problem 441
         19.5.1 Problem Formulation 441
         19.5.2 Event-Based Optimization for the Material Handling Problem 444
         19.5.3 Numerical Results 446
    19.6 Conclusions 448
    References 449

20. Bounds for Markov Decision Processes 452
    Vijay V. Desai, Vivek F. Farias, and Ciamac C. Moallemi
    20.1 Introduction 452
         20.1.1 Related Literature 454
    20.2 Problem Formulation 455
    20.3 The Linear Programming Approach 456
         20.3.1 The Exact Linear Program 456
         20.3.2 Cost-to-Go Function Approximation 457
         20.3.3 The Approximate Linear Program 457
    20.4 The Martingale Duality Approach 458
    20.5 The Pathwise Optimization Method 461
    20.6 Applications 463
         20.6.1 Optimal Stopping 464
         20.6.2 Linear Convex Control 467
    20.7 Conclusion 470
    References 471

21. Approximate Dynamic Programming and Backpropagation on Timescales 474
    John Seiffertt and Donald Wunsch
    21.1 Introduction: Timescales Fundamentals 474
         21.1.1 Single-Variable Calculus 475
         21.1.2 Calculus of Multiple Variables 476
         21.1.3 Extension of the Chain Rule 477
         21.1.4 Induction on Timescales 479
    21.2 Dynamic Programming 479
         21.2.1 Dynamic Programming Overview 480
         21.2.2 Dynamic Programming Algorithm on Timescales 481
         21.2.3 HJB Equation on Timescales 483
    21.3 Backpropagation 485
         21.3.1 Ordered Derivatives 486
         21.3.2 The Backpropagation Algorithm on Timescales 490
    21.4 Conclusions 492
    References 492

22. A Survey of Optimistic Planning in Markov Decision Processes 494
    Lucian Buşoniu, Rémi Munos, and Robert Babuška
    22.1 Introduction 494
    22.2 Optimistic Online Optimization 497
         22.2.1 Bandit Problems 497
         22.2.2 Lipschitz Functions and Deterministic Samples 498
         22.2.3 Lipschitz Functions and Random Samples 499
    22.3 Optimistic Planning Algorithms 500
         22.3.1 Optimistic Planning for Deterministic Systems 502
         22.3.2 Open-Loop Optimistic Planning 504
         22.3.3 Optimistic Planning for Sparsely Stochastic Systems 505
         22.3.4 Theoretical Guarantees 509
    22.4 Related Planning Algorithms 509
    22.5 Numerical Example 510
    References 515

23. Adaptive Feature Pursuit: Online Adaptation of Features in Reinforcement Learning 517
    Shalabh Bhatnagar, Vivek S. Borkar, and L.A. Prashanth
    23.1 Introduction 517
    23.2 The Framework 520
         23.2.1 The TD(0) Learning Algorithm 521
    23.3 The Feature Adaptation Scheme 522
         23.3.1 The Feature Adaptation Scheme 522
    23.4 Convergence Analysis 525
    23.5 Application to Traffic Signal Control 527
    23.6 Conclusions 532
    References 533

24. Feature Selection for Neuro-Dynamic Programming 535
    Dayu Huang, W. Chen, P. Mehta, S. Meyn, and A. Surana
    24.1 Introduction 535
    24.2 Optimality Equations 536
         24.2.1 Deterministic Model 537
         24.2.2 Diffusion Model 538
         24.2.3 Models in Discrete Time 539
         24.2.4 Approximations 539
    24.3 Neuro-Dynamic Algorithms 542
         24.3.1 MDP Model 542
         24.3.2 TD-Learning 543
         24.3.3 SARSA 546
         24.3.4 Q-Learning 547
         24.3.5 Architecture 550
    24.4 Fluid Models 551
         24.4.1 The CRW Queue 551
         24.4.2 Speed-Scaling Model 552
    24.5 Diffusion Models 554
         24.5.1 The CRW Queue 555
         24.5.2 Speed-Scaling Model 556
    24.6 Mean Field Games 556
    24.7 Conclusions 557
    References 558

25. Approximate Dynamic Programming for Optimizing Oil Production 560
    Zheng Wen, Louis J. Durlofsky, Benjamin Van Roy, and Khalid Aziz
    25.1 Introduction 560
    25.2 Petroleum Reservoir Production Optimization Problem 562
    25.3 Review of Dynamic Programming and Approximate Dynamic Programming 564
    25.4 Approximate Dynamic Programming Algorithm for Reservoir Production Optimization 566
         25.4.1 Basis Function Construction 566
         25.4.2 Computation of Coefficients 568
         25.4.3 Solving Subproblems 570
         25.4.4 Adaptive Basis Function Selection and Bootstrapping 571
         25.4.5 Computational Requirements 572
    25.5 Simulation Results 573
    25.6 Concluding Remarks 578
    References 580

26. A Learning Strategy for Source Tracking in Unstructured Environments 582
    Titus Appel, Rafael Fierro, Brandon Rohrer, Ron Lumia, and John Wood
    26.1 Introduction 582
    26.2 Reinforcement Learning 583
         26.2.1 Q-Learning 584
         26.2.2 Q-Learning and Robotics 589
    26.3 Light-Following Robot 589
    26.4 Simulation Results 592
    26.5 Experimental Results 595
         26.5.1 Hardware 596
         26.5.2 Problems in Hardware Implementation 597
         26.5.3 Results 598
    26.6 Conclusions and Future Work 599
    References 599

INDEX 601

PREFACE

Modern day society relies on the operation of complex systems including aircraft, automobiles, electric power systems, economic entities, business organizations, banking and finance systems, computer networks, manufacturing systems, and industrial processes. Decision and control are responsible for ensuring that these systems perform properly and meet prescribed performance objectives. The safe, reliable, and efficient control of these systems is essential for our society. Therefore, automatic decision and control systems are ubiquitous in human engineered systems and have had an enormous impact on our lives. As modern systems become more complex and performance requirements more stringent, improved methods of decision and control are required that deliver guaranteed performance and the satisfaction of prescribed goals.

Feedback control works on the principle of observing the actual outputs of a system, comparing them to desired trajectories, and computing a control signal based on that error, which is used to modify the behavior of the system so that the actual output follows the desired trajectory. The optimization of sequential decisions or controls that are repeated over time arises in many fields, including artificial intelligence, automatic control systems, power systems, economics, medicine, operations research, resource allocation, collaboration and coalitions, business and finance, and games including chess and backgammon. Optimal control theory provides methods for computing feedback control systems that deliver optimal performance. Optimal controllers optimize user-prescribed performance functions and are normally designed offline by solving Hamilton-Jacobi-Bellman (HJB) design equations. This requires knowledge of the full system dynamics model. However, it is often difficult to determine an accurate dynamical model of practical systems. Moreover, determining optimal control policies for nonlinear systems requires the offline solution of nonlinear HJB equations, which are often difficult or impossible to solve. Dynamic programming (DP) is a sequential algorithmic method for finding optimal solutions in sequential decision problems. DP was developed beginning in the 1950s with the work of Bellman and Pontryagin. DP is fundamentally a backwards-in-time procedure that does not offer methods for solving optimal decision problems in a forward manner in real time.

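To make the offline design step concrete, the HJB equation can be stated in one standard continuous-time form; the notation below is chosen here for illustration and is not fixed by this preface. For a system $\dot{x} = f(x) + g(x)u$ with cost $V(x(0)) = \int_0^\infty \left( Q(x) + u^\top R u \right) dt$, the optimal value function $V^*$ satisfies

$$0 = \min_u \left[ Q(x) + u^\top R u + (\nabla V^*)^\top \left( f(x) + g(x)u \right) \right],$$

and carrying out the minimization gives the optimal feedback law $u^*(x) = -\tfrac{1}{2} R^{-1} g^\top(x) \nabla V^*(x)$. Solving this partial differential equation requires $f$ and $g$, and for general nonlinear dynamics it admits no closed-form solution, which is precisely the difficulty described above.
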
The real-time adaptive learning of optimal controllers for complex unknown systems has been solved in nature. Every agent or system is concerned with acting on its environment in such a way as to achieve its goals. Agents seek to learn how to collaborate to improve their chances of survival. The idea that there is a cause-and-effect relation between actions and rewards is inherent in animal learning. Most organisms in nature act in an optimal fashion to conserve resources while achieving their goals. It is possible to study natural methods of learning and use them to develop computerized machine learning methods that solve sequential decision problems.

Reinforcement learning (RL) describes a family of machine learning systems that operate based on principles observed in animals, social groups, and naturally occurring systems. RL methods were used by Ivan Pavlov in the 1890s to train his dogs. RL refers to an actor or agent that interacts with its environment and modifies its actions, or control policies, based on stimuli received in response to its actions. RL computational methods have been developed by the computational intelligence community that solve optimal decision problems in real time and do not require the availability of analytical system models. RL algorithms are constructed on the idea that successful control decisions should be remembered, by means of a reinforcement signal, such that they become more likely to be used another time. Successful collaborating groups should likewise be reinforced. Although the idea originates from experimental animal learning, RL also has strong support from neurobiology, where it has been noted that the dopamine neurotransmitter in the basal ganglia acts as a reinforcement informational signal that favors learning at the level of the neurons in the brain. RL techniques were first developed for Markov decision processes having finite state spaces. They have since been extended to the control of dynamical systems with infinite state spaces.

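As a concrete instance of such a reinforcement signal, consider the classical Q-learning rule for finite-state Markov decision processes, one well-known RL method used here purely as an illustration. After taking action $a_t$ in state $s_t$ and observing reward $r_{t+1}$ and next state $s_{t+1}$, the agent updates

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right],$$

where $\alpha$ is a learning rate and $\gamma$ a discount factor. No model of the environment appears anywhere in the update; value estimates improve purely from measured stimuli, which is what allows such methods to operate when analytical system models are unavailable.
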
One class of RL methods is based on the actor-critic structure, where an actor component applies an action or a control policy to the environment, whereas a critic component assesses the value of that action. Actor-critic structures are particularly well adapted for solving optimal decision problems in real time through reinforcement learning techniques. Approximate dynamic programming (ADP) refers to a family of practical actor-critic methods for finding optimal solutions in real time. These techniques use computational enhancements such as function approximation to develop practical algorithms for complex systems with disturbances and uncertain dynamics. The ADP approach has now become a key direction for future research in understanding brain intelligence and building intelligent systems.

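The division of labor between actor and critic can be sketched in a few lines of code. The following is a minimal illustrative sketch on a hypothetical two-state, two-action problem, not an algorithm taken from the chapters: the critic learns state values from a temporal-difference error, and the actor shifts a softmax policy toward actions the critic rates well.

# Minimal actor-critic sketch on a hypothetical two-state, two-action
# problem (illustrative only; not an example from this book).
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 2, 2
gamma = 0.9            # discount factor
alpha_critic = 0.10    # critic (value) learning rate
alpha_actor = 0.05     # actor (policy) learning rate

def step(s, a):
    """Hypothetical dynamics: action 0 stays put, action 1 switches state;
    reward is 1 whenever the next state is state 1."""
    s_next = s if a == 0 else 1 - s
    return s_next, float(s_next == 1)

V = np.zeros(n_states)                   # critic: state-value estimates
prefs = np.zeros((n_states, n_actions))  # actor: action preferences

def policy(s):
    """Softmax over the actor's preferences for state s."""
    p = np.exp(prefs[s] - prefs[s].max())
    p /= p.sum()
    return rng.choice(n_actions, p=p), p

s = 0
for _ in range(5000):
    a, p = policy(s)
    s_next, r = step(s, a)
    # Critic assesses the action through the temporal-difference error.
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha_critic * delta
    # Actor reinforces actions the critic scored well (softmax gradient).
    grad = -p
    grad[a] += 1.0
    prefs[s] += alpha_actor * delta * grad
    s = s_next

print("Learned state values:", V)  # state 1 should be valued higher

With the tables replaced by function approximators such as neural networks, this same interaction pattern underlies the practical ADP algorithms developed in Part I.
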
The purpose of this book is to give an exposition of recently developed RL and ADP techniques for decision and control in human engineered systems. Included are both single-player decision and control and multiplayer games. RL is strongly connected from a theoretical point of view with both adaptive learning control and optimal control methods. There has been a great deal of interest in RL, and recent work has shown that ideas based on ADP can be used to design a family of adaptive learning algorithms that converge in real time to optimal control solutions by measuring data along the system trajectories. The study of RL and ADP requires methods from many fields, including computational intelligence, automatic control systems, Markov decision processes, stochastic games, psychology, operations research, cybernetics, neural networks, and neurobiology. Therefore, this book brings together ideas from many communities.

This book has three parts. Part I develops methods for feedback control of systems based on RL and ADP. Part II treats learning and control in multiagent games. Part III presents ideas of fundamental importance for understanding and implementing decision algorithms in Markov processes.

F.L. LEWIS
DERONG LIU

Fort Worth, TX
Chicago, IL

Book Description:

Reinforcement learning (RL) and adaptive dynamic programming (ADP) have been among the most critical research fields in science and engineering for modern complex systems. This book describes the latest RL and ADP techniques for decision and control in human engineered systems, covering both single-player decision and control and multi-player games. Edited by pioneers of RL and ADP research, the book brings together ideas and methods from many fields and provides important and timely guidance on controlling a wide variety of systems, such as robots, industrial processes, and economic decision-making.