The Army’s future operating concept, multi-domain operations (MDO), includes autonomous agents with learning components working alongside the warfighter. New Army research reduces the unpredictability of current reinforcement learning training techniques so that they are more practical for physical systems, particularly ground robots.
These learning components would enable autonomous agents to reason and adapt to changing circumstances on the battlefield, said Army researcher Dr. Alec Koppel of the U.S. Army Combat Capabilities Development Command, now known as DEVCOM, Army Research Laboratory.
The underlying mechanism for adaptation and replanning consists of policies based on reinforcement learning. Obtaining these policies efficiently is crucial to making the MDO operating concept a reality, he said.
According to Koppel, policy gradient methods in reinforcement learning are the foundation for scalable algorithms in continuous spaces, but existing techniques cannot incorporate broader decision-making goals such as risk sensitivity, safety constraints, exploration, and divergence from a prior.
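For readers unfamiliar with the term, the following is a minimal sketch of a policy gradient update, not the method from the research described here. It is a toy assumption throughout: a two-armed bandit with discrete actions rather than a continuous space, a softmax policy, and a running-mean baseline. The names `softmax` and `train_bandit` and all numeric settings are hypothetical choices made for illustration.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(mean_rewards, steps=2000, lr=0.1, seed=0):
    """Run a REINFORCE-style update on a toy bandit; return the final policy."""
    rng = random.Random(seed)
    prefs = [0.0] * len(mean_rewards)   # policy parameters (hypothetical setup)
    baseline = 0.0                      # running mean of observed rewards
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        action = rng.choices(range(len(prefs)), weights=probs)[0]
        reward = mean_rewards[action] + rng.gauss(0.0, 0.1)
        baseline += (reward - baseline) / t
        advantage = reward - baseline
        for a in range(len(prefs)):
            # Gradient of the log-probability of the sampled action
            # with respect to preference a, for a softmax policy.
            grad_log = (1.0 if a == action else 0.0) - probs[a]
            prefs[a] += lr * advantage * grad_log
    return softmax(prefs)


policy = train_bandit([0.2, 1.0])
print(policy)  # probability mass should shift toward the higher-reward arm
```

Each update depends on a randomly sampled action and a noisy reward, so the learning signal itself is noisy. That is the kind of sampling cost the article refers to when it describes current practice as data hungry.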
Reinforcement learning can address the design of autonomous behaviors when the relationship between dynamics and objectives is complex, Koppel said. It has recently gained attention for solving previously intractable tasks, including strategy games such as Go and chess and video games such as Atari and Starcraft II.
Unfortunately, the prevailing practice demands astronomical sample complexity, such as thousands of years of simulated gameplay, he said.
This sample complexity makes many common training mechanisms unsuitable for the data-poor settings that MDO requires for the Next-Generation Combat Vehicle (NGCV).
“To enable reinforcement learning for MDO and NGCV, training mechanisms must improve sample efficiency and reliability in continuous spaces,” Koppel said. “We are taking a step towards breaking the existing sample efficiency barriers of the prevailing practice in reinforcement learning by generalizing existing policy search schemes to general utilities.”
Koppel and his research team developed new policy search schemes for general utilities, whose sample complexity is also established.
They noted that the resulting policy search schemes reduce the volatility of reward accumulation, enable efficient exploration of unknown domains, and provide a mechanism for incorporating prior experience.
“This research contributes to an extension of the classical policy gradient theorem in reinforcement learning,” Koppel said. It introduces new policy search schemes for general utilities, whose sample complexity is also established. These developments are important to the U.S. Army because they allow reinforcement learning goals beyond the standard cumulative return, such as risk sensitivity, safety constraints, exploration, and divergence from a prior.
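To make the idea of goals “beyond the standard cumulative return” concrete, here is a hedged sketch under one common formalization, which is an assumption of this example rather than something stated in the article: a utility is treated as a function of the discounted state-action occupancy (how often the agent visits each state-action pair). The cumulative return is then the linear case, and an entropy utility stands in for an exploration goal. The two-state environment, the function names, and the reward values are all hypothetical.

```python
import math


def occupancy(P, policy, start, gamma=0.9, horizon=200):
    """Approximate the normalized discounted state-action occupancy."""
    n_states, n_actions = len(P), len(P[0])
    state_dist = list(start)
    occ = [[0.0] * n_actions for _ in range(n_states)]
    weight = 1.0 - gamma
    for _ in range(horizon):
        next_dist = [0.0] * n_states
        for s in range(n_states):
            for a in range(n_actions):
                mass = state_dist[s] * policy[s][a]
                occ[s][a] += weight * mass
                for s2 in range(n_states):
                    next_dist[s2] += mass * P[s][a][s2]
        state_dist = next_dist
        weight *= gamma
    return occ


def linear_utility(occ, reward):
    """Standard cumulative return, written as a linear function of occupancy."""
    return sum(occ[s][a] * reward[s][a]
               for s in range(len(occ)) for a in range(len(occ[0])))


def entropy_utility(occ):
    """Entropy of the occupancy, used here as a stand-in exploration goal."""
    return -sum(x * math.log(x) for row in occ for x in row if x > 0.0)


# Hypothetical two-state world: action 0 stays put, action 1 switches state.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
reward = [[1.0, 0.0], [0.0, 0.0]]          # reward only for staying in state 0
stay = [[1.0, 0.0], [1.0, 0.0]]            # always choose "stay"
uniform = [[0.5, 0.5], [0.5, 0.5]]         # choose either action equally

occ_stay = occupancy(P, stay, [1.0, 0.0])
occ_uniform = occupancy(P, uniform, [1.0, 0.0])
print(linear_utility(occ_stay, reward), linear_utility(occ_uniform, reward))
print(entropy_utility(occ_stay), entropy_utility(occ_uniform))
```

In this toy setting the two utilities rank the policies differently: the return favors staying put, while the entropy utility favors the policy that spreads its visits. A scheme limited to cumulative return cannot express that second kind of preference.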
Data is very expensive to acquire, he said, especially in the context of ground robots.
“Reducing the volatility of reward accumulation, ensuring that one explores an unknown domain in an efficient manner, or incorporating prior experience all contribute to breaking the existing sample efficiency barriers of the prevailing practice of reinforcement learning by reducing the amount of random sampling one needs to complete policy optimization,” Koppel said.
The future of this research is very bright, and Koppel has focused his efforts on making his findings applicable to innovative technologies for Soldiers on the battlefield.
“I am optimistic that autonomous robots equipped with reinforcement learning will be able to assist the warfighter in reconnaissance, intelligence and risk assessment on the future battlefield,” Koppel said. “Seeing this vision become a reality is a key motivation for what research problems I am pursuing.”
The next step for this research is to incorporate the broader decision-making goals enabled by general utilities in reinforcement learning into multi-agent settings, and to investigate how interaction between reinforcement learning agents gives rise to synergistic and antagonistic reasoning among teams.
The technology resulting from this study, according to Koppel, would be capable of reasoning in team situations under uncertainty.
Reference: “Variational Policy Gradient Method for Reinforcement Learning with General Utilities”