{"title": "Parallel Optimization of Motion Controllers via Policy Iteration", "book": "Advances in Neural Information Processing Systems", "page_first": 996, "page_last": 1002, "abstract": null, "full_text": "Parallel Optimization of Motion \nControllers via Policy Iteration \n\nJ. A. Coelho Jr., R. Sitaraman, and R. A. Grupen \n\nDepartment of Computer Science \n\nUniversity of Massachusetts, Amherst, 01003 \n\nAbstract \n\nThis paper describes a policy iteration algorithm for optimizing the \nperformance of a harmonic function-based controller with respect \nto a user-defined index. Value functions are represented as poten(cid:173)\ntial distributions over the problem domain, being control policies \nrepresented as gradient fields over the same domain. All interme(cid:173)\ndiate policies are intrinsically safe, i.e. collisions are not promoted \nduring the adaptation process. The algorithm has efficient imple(cid:173)\nmentation in parallel SIMD architectures. One potential applica(cid:173)\ntion - travel distance minimization - illustrates its usefulness. \n\n1 \n\nINTRODUCTION \n\nHarmonic functions have been proposed as a uniform framework for the solu(cid:173)\ntion of several versions of the motion planning problem. Connolly and Gru(cid:173)\npen [Connolly and Grupen, 1993] have demonstrated how harmonic functions \ncan be used to construct smooth, complete artificial potentials with no lo(cid:173)\ncal minima. \n[Rimon and Koditschek, 1990] for navigation functions. This implies that the gra(cid:173)\ndient of harmonic functions yields smooth (\"realizable\") motion controllers. \n\nthese potentials meet the criteria established in \n\nIn addition, \n\nBy construction, harmonic function-based motion controllers will always command \nthe robot from any initial configuration to a goal configuration. The intermediate \nconfigurations adopted by the robot are determined by the boundary constraints \nand conductance properties set for the domain. 
Therefore, it is possible to tune both factors so as to extremize user-specified performance indices (e.g., travel time or energy) without affecting controller completeness.\n\nBased on this idea, Singh et al. [Singh et al., 1994] devised a policy iteration method for combining two harmonic function-based control policies into a controller that minimized travel time in a given environment. The two initial control policies were derived from solutions to two distinct boundary constraints (Neumann and Dirichlet constraints). The policy space spanned by the two control policies was parameterized by a mixing coefficient that ultimately determined the obstacle avoidance behavior adopted by the robot. The resulting controller preserved obstacle avoidance, ensuring safety at every iteration of the learning procedure.\n\nThis paper addresses the question of how to adjust the conductance properties associated with the problem domain Ω so as to extremize a user-specified performance index. Initially, conductance properties are homogeneous across Ω, and the resulting controller is optimal in the sense that it minimizes collision probabilities at every step [Connolly, 1994]¹. The method proposed is a policy iteration algorithm, in which the policy space is parameterized by the set of node conductances.\n\n2 PROBLEM CHARACTERIZATION\n\nThe problem consists of constructing a path controller π₀ that maximizes an integral performance index P defined over the set of all possible paths on a lattice for a closed domain Ω ⊂ R^n, subject to boundary constraints. The controller π₀ is responsible for generating the sequence of configurations from an initial configuration q0 on the lattice to the goal configuration qG, therefore determining the performance index P. 
In formal terms, the performance index P can be defined as follows:\n\nDef. 1 (Performance index P):\n\nP_{q0,qG} = Σ_{q=q0}^{qG} f(q), for all q ∈ L(Ω),\n\nwhere L(Ω) is a lattice over the domain Ω, q0 denotes an arbitrary configuration on L(Ω), qG is the goal configuration, and f(q) is a function of the configuration q.\n\nFor example, one can define f(q) to be the available joint range associated with the configuration q of a manipulator; in this case, P would be measuring the available joint range associated with all paths generated within a given domain.\n\n2.1 DERIVATION OF REFERENCE CONTROLLER\n\nThe derivation of π₀ is very laborious, requiring the exploration of the set of all possible paths. Out of this set, one is primarily interested in the subset of smooth paths. We propose to solve a simpler problem, in which the derived controller π is a numerical approximation to the optimal controller π₀, and (1) generates smooth paths, (2) is admissible, and (3) locally maximizes P. To guarantee (1) and (2), it is assumed that the control actions of π are proportional to the gradient of a harmonic function φ, represented as the voltage distribution across a resistive lattice that tessellates the domain Ω. Condition (3) is achieved through incremental changes in the set G of internodal conductances; such changes maximize P locally.\n\nNecessary condition for optimality: Note that P_{q0,qG} defines a scalar field over L(Ω). It is assumed that there exists a well-defined neighborhood N(q) for every node q; in fact, it is assumed that every node q has two neighbors across each dimension. Therefore, it is possible to compute the gradient of the scalar field P_{q0,qG} by locally approximating its rate of change across all dimensions. 
The gradient ∇P_q defines a reference controller; in the optimal situation, the actions of the controller π will parallel the actions of the reference controller. One can now formulate a policy iteration algorithm for the synthesis of the reference controller:\n\n1. Compute π = -∇φ, given conductances G;\n2. Evaluate ∇P_q: for each cell, compute P_{q,qG}; for each cell, compute ∇P_q;\n3. Change G incrementally, minimizing the approximation error ε = f(π, ∇P_q);\n4. If ε is below a threshold ε0, stop. Otherwise, return to (1).\n\n¹ This is exactly the control policy derived by the TD(0) reinforcement learning method, for the particular case of an agent travelling in a grid world with absorbing obstacle and goal states, where the agent is rewarded only for reaching the goal states (see [Connolly, 1994]).\n\nOn convergence, the policy iteration algorithm will have derived a control policy that maximizes P globally, and is capable of generating smooth paths to the goal configuration. The key step in the algorithm is step (3): how to reduce the current approximation error by changing the conductances G.\n\n3 APPROXIMATION ALGORITHM\n\nGiven a set of internodal conductances, the approximation error ε is defined as\n\nε = - Σ_{q ∈ L(Ω)} cos(π_q, ∇P_q),   (1)\n\nor the negative sum over L(Ω) of the cosine of the angle between the vectors π and ∇P. The approximation error ε is therefore a function of the set G of internodal conductances. There exist O(n d^n) conductances in an n-dimensional grid, where d is the discretization adopted for each dimension. Discrete search methods for the set of conductance values that minimizes ε are ruled out by the cardinality of the search space: O(k^{n d^n}), if k is the number of distinct values each conductance can assume. 
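The policy iteration algorithm of Section 2.1 can be sketched as the following outer loop. This is a minimal illustrative sketch, not the authors' implementation: the harmonic-function solver, the performance-field evaluation, and the conductance update are passed in as stand-in callables, and all names are ours.

```python
import numpy as np

def policy_iteration(compute_policy, evaluate_grad_P, update_G, G0,
                     eps0=1e-3, max_iters=100):
    """Outer loop of the policy iteration algorithm (steps 1-4).

    compute_policy(G)      -> pi:    control policy pi = -grad(phi) for conductances G
    evaluate_grad_P(pi)    -> gradP: gradient of the performance field, per cell
    update_G(G, pi, gradP) -> (G, eps): incremental conductance change and the
                                        resulting approximation error
    """
    G, eps = G0, float("inf")
    for _ in range(max_iters):
        pi = compute_policy(G)            # step 1: derive policy from conductances
        gradP = evaluate_grad_P(pi)       # step 2: evaluate the reference field
        G, eps = update_G(G, pi, gradP)   # step 3: change G, reducing the error
        if eps < eps0:                    # step 4: stop below the threshold
            break
    return G, eps

# Toy stand-ins (purely illustrative): the "policy" is G itself, the reference
# field is all ones, and the error is the mean squared mismatch, reduced by a
# fixed gradient step at each iteration.
target = np.ones(4)
def toy_update(G, pi, gradP):
    G = G - 0.5 * (pi - gradP)
    return G, float(np.mean((G - gradP) ** 2))

G, eps = policy_iteration(lambda G: G, lambda pi: target, toy_update,
                          G0=np.zeros(4))
```

With these stand-ins the loop converges geometrically toward the reference field; in the paper, step 3 is the gradient descent on conductances developed in Section 3.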
We will represent conductances as real values and use gradient descent to minimize ε, according to the approximation algorithm below:\n\n1. Evaluate the approximation error ε;\n2. Compute the gradient ∇ε = ∂ε/∂G;\n3. Update conductances, making G = G - α∇ε;\n4. Normalize conductances, such that the minimum conductance g_min = 1.\n\nStep (4) guarantees that every conductance g ∈ G will be strictly positive. The conductances in a resistive grid can be normalized without constraining the voltage distribution across it, due to the linear nature of the underlying circuit. The complexity of the approximation algorithm is dominated by the computation of the gradient ∇ε(G). Each component of the vector ∇ε(G) can be expressed as\n\n∂ε/∂g_i = - Σ_{q ∈ L(Ω)} ∂cos(π_q, ∇P_q)/∂g_i.   (2)\n\nBy assumption, π is itself the gradient of a harmonic function φ that describes the voltage distribution across a resistive lattice. Therefore, the calculation of ∂ε/∂g_i involves the evaluation of ∂φ/∂g_i over the whole domain L(Ω), i.e., how the voltage φ_q is affected by changes in a certain conductance g_i.\n\nFor n-dimensional grids, ∂φ/∂G is a matrix with d^n rows and O(n d^n) columns. We posit that the computation of every element of it is unnecessary: the effects of changing g_i will be more pronounced in a certain grid neighborhood of i, and essentially negligible for nodes beyond that neighborhood. Furthermore, this simplification allows for breaking up the original problem into smaller, independent sub-problems suitable for simultaneous solution on parallel architectures. 
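Concretely, Eq. (1) and steps (3)-(4) of the approximation algorithm might be written as below. This is a minimal numpy sketch under our own naming; the gradient ∇ε is assumed to be supplied by the Thévenin-based procedure of Section 3.2.

```python
import numpy as np

def approx_error(pi, gradP):
    """Eq. (1): eps = -sum over cells of cos(angle between pi_q and gradP_q).

    pi, gradP: arrays of shape (num_cells, dim), one row per lattice cell.
    """
    dots = np.sum(pi * gradP, axis=1)
    norms = np.linalg.norm(pi, axis=1) * np.linalg.norm(gradP, axis=1)
    return -float(np.sum(dots / norms))

def conductance_step(G, grad_eps, alpha=0.1):
    """Steps (3)-(4): gradient step on the conductances, then rescale so the
    minimum conductance equals 1. Because the circuit is linear, a uniform
    rescaling leaves the voltage distribution unchanged (this sketch assumes
    the step leaves all conductances positive)."""
    G = G - alpha * grad_eps
    return G / G.min()
```

For two cells with perfectly aligned policy and reference fields, `approx_error` attains its minimum of -(number of cells), which is the target of the descent.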
\n\n3.1 THE LOCALITY ASSUMPTION\n\nThe first simplifying assumption considered in this work establishes bounds on the neighborhood affected by changes in the conductances at node i; specifically, we will assume that changes in the elements of g_i affect only the voltages at nodes in N(i), where N(i) is the set composed of node i and its direct neighbors. See [Coelho Jr. et al., 1995] for a discussion on the validity of this assumption. In particular, it is demonstrated there that the effects of changing one conductance decay exponentially with grid distance, for infinite 2D grids. Local changes in resistive grids with higher dimensionality will be confined to even smaller neighborhoods.\n\nThe locality assumption simplifies the calculation of ∂ε/∂g_i to a sum over N(i):\n\n∂ε/∂g_i = - Σ_{q ∈ N(i)} ∂cos(π_q, ∇P_q)/∂g_i.\n\nBut\n\n∂/∂g_i [ (π · ∇P) / (|π| |∇P|) ] = (1 / (|π| |∇P|)) [ ∂π/∂g_i · ∇P - ((π · ∇P) / |π|²) (∂π/∂g_i · π) ].\n\nNote that in the derivation above it is assumed that changes in G affect primarily the control policy π, leaving ∇P relatively unaffected, at least in a first-order approximation.\n\nGiven that π = -∇φ, it follows that the component π_q^j at node q can be approximated by the change of potential across dimension j, as measured by the potentials at the corresponding neighboring nodes:\n\nπ_q^j = (φ_{q-} - φ_{q+}) / (2Δ), and ∂π_q^j/∂g_i = (1 / (2Δ)) [ ∂φ_{q-}/∂g_i - ∂φ_{q+}/∂g_i ],\n\nwhere Δ is the internodal distance on the lattice L(Ω).\n\n3.2 DERIVATION OF ∂φ/∂g_i\n\nThe derivation of ∂φ/∂g_i involves computing the Thévenin equivalent circuit for the resistive lattice, when every conductance g connected to node i is removed. For clarity, a 2D resistive grid was chosen to illustrate the procedure. Figure 1 depicts the equivalence warranted by Thévenin's theorem [Chua et al., 1987] and the relevant variables for the derivation of ∂φ/∂g_i. 
As shown, the equivalent circuit for the resistive grid consists of a four-port resistor, driven by four independent voltage sources. The relation between the voltage vector Φ = [φ_1 ... φ_4]^T and the current vector I = [i_1 ... i_4]^T is expressed as\n\nΦ = R I + w,   (3)\n\nwhere R is the impedance matrix for the grid equivalent circuit and w is the vector of open-circuit voltage sources. The grid equivalent circuit behaves exactly like the whole resistive grid; there is no approximation error.\n\nFigure 1: Equivalence established by Thévenin's theorem.\n\nThe derivation of the 20 parameters (the elements of R and w) of the equivalent circuit is detailed in [Coelho Jr. et al., 1995]; it involves a series of relaxation operations that can be efficiently implemented on SIMD architectures. The total number of relaxations for a grid with n² nodes is exactly 6n - 12, or an average of 1/2n relaxations per link. In the context of this paper, it is assumed that R and w are known. Our primary interest is to compute how changes in the conductances g_k affect the voltage vector Φ, i.e., the matrix\n\n∂Φ/∂g = [ ∂φ_j/∂g_k ], for j = 1, ..., 4 and k = 1, ..., 4.\n\nThe elements of ∂Φ/∂g can be computed by differentiating each of the four equality relations in Equation 3 with respect to g_k, resulting in a system of 16 linear equations in 16 variables (the elements of ∂Φ/∂g). Notice that each element of I can be expressed as a linear function of the potentials Φ, by applying Kirchhoff's laws [Chua et al., 1987].\n\n4 APPLICATION EXAMPLE\n\nA robot moves repeatedly toward a goal configuration. Its initial configuration is not known in advance, and every configuration is equally likely to be the initial configuration. 
The problem is to construct a motion controller that minimizes the overall travel distance for the whole configuration space. If the configuration space Ω is discretized into a number of cells, define the combined travel distance D(π) as\n\nD(π) = Σ_{q ∈ L(Ω)} d_{q,π},   (4)\n\nwhere d_{q,π} is the travel distance from cell q to the goal configuration qG, and robot displacements are determined by the controller π. Figure 2 depicts an instance of the travel distance minimization problem, and the paths corresponding to its optimal solution, given the obstacle distribution and the goal configuration shown.\n\nA resistive grid with 17 x 17 nodes was chosen to represent the control policies generated by our algorithm. Initially, the resistive grid is homogeneous, with all internodal resistances set to 10. Figure 3 indicates the paths the robot takes when commanded by π⁰, the initial control policy derived from a homogeneous resistive grid.\n\nFigure 2: Paths for the optimal solution of the travel distance minimization problem. Figure 3: Paths for the initial solution of the same problem.\n\nThe conductances in the resistive grid were then adjusted over 400 steps of the policy iteration algorithm, and Figure 4 is a plot of the overall travel distance as a function of the number of steps. It also shows the optimal travel distance (horizontal line), corresponding to the optimal solution depicted in Figure 2. The plot shows that convergence is initially fast; in fact, the first 140 iterations are responsible for 90% of the overall improvement. After 400 iterations, the travel distance is within 2.8% of its optimal value. 
This residual error may be explained by the approximation incurred in using a discrete resistive grid to represent the potential distribution.\n\nFigure 5 shows the paths taken by the robot after convergence. The final paths are straightened versions of the paths in Figure 3. Notice also that some of the final paths originating on the left of the I-shaped obstacle take the robot south of the obstacle, resembling the optimal paths depicted in Figure 2.\n\n5 CONCLUSION\n\nThis paper presented a policy iteration algorithm for the synthesis of provably correct navigation functions that also extremize user-specified performance indices. The proposed algorithm solves the optimal feedback control problem, in which the final control policy optimizes the performance index over the whole domain, assuming that every state in the domain is as likely to be the initial state as any other state.\n\nThe algorithm modifies an existing harmonic function-based path controller by incrementally changing the conductances in a resistive grid. Starting from a homogeneous grid, the algorithm transforms an optimal controller (i.e., a controller that minimizes collision probabilities) into another optimal controller that locally extremizes the performance index of interest. The tradeoff may require reducing the safety margin between the robot and obstacles, but collision avoidance is preserved at each step of the algorithm.\n\nOther Applications: The algorithm presented can be used (1) in the synthesis of time-optimal velocity controllers, and (2) in the optimization of non-holonomic path controllers. The algorithm can also be a component technology for Intelligent Vehicle Highway Systems (IVHS), by combining (1) and (2).\n\nFigure 4: Overall travel distance, as a function of iteration steps. Figure 5: Final paths, after 800 policy iteration steps.\n\nPerformance on Parallel Architectures: The proposed algorithm is computationally demanding; however, it is suitable for implementation on parallel architectures. Its sequential implementation on a SPARC 10 workstation requires approximately 30 sec. per iteration, for the example presented. We estimate that a parallel implementation of the proposed example would require approximately 4.3 ms per iteration, or 1.7 seconds for 400 iterations, given conservative speedups available on parallel architectures [Coelho Jr. et al., 1995].\n\nAcknowledgements\n\nThis work was supported in part by grants NSF CCR-9410077, IRI-9116297, IRI-9208920, and CNPq 202107/90.6.\n\nReferences\n\n[Chua et al., 1987] Chua, L., Desoer, C., and Kuh, E. (1987). Linear and Nonlinear Circuits. McGraw-Hill, Inc., New York, NY.\n\n[Coelho Jr. et al., 1995] Coelho Jr., J., Sitaraman, R., and Grupen, R. (1995). Control-oriented tuning of harmonic functions. CMPSCI Technical Report 95-112, Dept. of Computer Science, University of Massachusetts.\n\n[Connolly, 1994] Connolly, C. I. (1994). Harmonic functions and collision probabilities. In Proc. 1994 IEEE Int. Conf. Robotics Automat., pages 3015-3019. IEEE.\n\n[Connolly and Grupen, 1993] Connolly, C. I. and Grupen, R. (1993). The applications of harmonic functions to robotics. Journal of Robotic Systems, 10(7):931-946.\n\n[Rimon and Koditschek, 1990] Rimon, E. and Koditschek, D. (1990). Exact robot navigation in geometrically complicated but topologically simple spaces. In Proc. 1990 IEEE Int. Conf. Robotics Automat., volume 3, pages 1937-1942, Cincinnati, OH.\n\n[Singh et al., 1994] Singh, S., Barto, A., Grupen, R., and Connolly, C. (1994). Robust reinforcement learning in motion planning. In Advances in Neural Information Processing Systems 6, pages 655-662, San Francisco, CA. Morgan Kaufmann Publishers.\n", "award": [], "sourceid": 1117, "authors": [{"given_name": "Jefferson", "family_name": "Coelho", "institution": null}, {"given_name": "R.", "family_name": "Sitaraman", "institution": null}, {"given_name": "Roderic", "family_name": "Grupen", "institution": null}]}