Pyquaticus Software for Training and Deploying Decision Agents
Problem Statement
Team-wide coordination and strategy execution are difficult in operational environments. Developing and refining algorithms to perform these functions takes time, and testing is very slow when it relies on end-user hardware or full high-fidelity simulators.
Technology Solution
A parallelizable simulation platform with full control over the environment, developed to evaluate algorithm performance as a low-fidelity stand-in for higher-complexity hardware systems.
Value Proposition
Allows high-frequency, iterative algorithm development on the simulator to reduce development time.
Overview
This is a PettingZoo environment for maritime Capture the Flag with uncrewed surface vehicles (USVs). It is a lightweight environment for developing algorithms to play multi-agent Capture the Flag with surface vehicle dynamics.
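As a quick orientation, the sketch below constructs the environment and runs one episode of random actions through the standard PettingZoo parallel API. The import path, class name, and keyword arguments are assumptions for illustration (they are not specified in this overview); consult the package documentation for the exact names.

```python
# Minimal sketch: build the environment and step it with random actions.
# The pyquaticus_v0 module, PyQuaticusEnv class, and team_size keyword are
# assumed names; the reset/step signatures follow the PettingZoo parallel API.
from pyquaticus import pyquaticus_v0

env = pyquaticus_v0.PyQuaticusEnv(team_size=1)

# Recent PettingZoo versions return (observations, infos) from reset().
observations, infos = env.reset(seed=0)

while env.agents:  # the agent list empties once the episode ends
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)

env.close()
```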
Motivation
The primary motivation is to enable quick iteration of algorithm development or training loops for reinforcement learning (RL). The implementation is pure-Python, supports faster-than-real-time execution, and can easily be parallelized on a cluster. This is critical for scaling up learning-based techniques before transitioning to higher-fidelity simulation and/or USV hardware.
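Because each environment instance is lightweight, many copies can run side by side. The sketch below collects episodes in parallel worker processes using Python's standard multiprocessing module on a single machine; the same pattern extends to a cluster scheduler. The environment constructor names are the same illustrative assumptions used above.

```python
# Sketch of parallel episode collection; each worker owns its own environment.
# Constructor names are assumed; any PettingZoo parallel env works the same way.
import multiprocessing as mp


def collect_episode(seed: int) -> dict:
    from pyquaticus import pyquaticus_v0              # assumed import path
    env = pyquaticus_v0.PyQuaticusEnv(team_size=2)    # assumed constructor
    observations, infos = env.reset(seed=seed)
    returns = {agent: 0.0 for agent in env.agents}
    while env.agents:
        actions = {a: env.action_space(a).sample() for a in env.agents}
        observations, rewards, terminations, truncations, infos = env.step(actions)
        for agent, reward in rewards.items():
            returns[agent] += reward
    env.close()
    return returns                                    # per-agent episode returns


if __name__ == "__main__":
    with mp.Pool(processes=8) as pool:                # one environment per worker
        episode_returns = pool.map(collect_episode, range(8))
        print(episode_returns)
```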
The vehicle dynamics are based on the MOOS-IvP uSimMarine dynamics. MOOS-IvP stands for Mission Oriented Operating Suite with Interval Programming. The IvP addition to the core MOOS software package is developed and maintained by the Laboratory for Autonomous Marine Sensing Systems at MIT. MOOS-IvP is a popular choice for maritime autonomy, filling a role similar to that of the Robot Operating System (ROS) in other robotics fields.
Key Capabilities
- Supports standard PettingZoo interface for multi-agent RL
- Pure-Python implementation with few dependencies
- Easy integration with standard learning frameworks (see the training sketch after this list)
- Implementation of MOOS-IvP vehicle dynamics such that algorithms and learning policies can easily be ported to MOOS-IvP and deployed on hardware
- Baseline policies to train against
- Parameterized number of agents
- Decentralized and agent-relative observation space
- Configurable reward function
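For the framework-integration point above, one common route is to wrap a PettingZoo parallel environment with SuperSuit vector wrappers and train a parameter-shared policy with Stable-Baselines3. The sketch below assumes that route, plus the same illustrative constructor names; it also assumes homogeneous observation and action spaces across agents and compatible PettingZoo/SuperSuit/Stable-Baselines3 versions, so treat it as a starting point rather than the project's prescribed workflow.

```python
# Sketch: parameter-shared PPO training via SuperSuit + Stable-Baselines3.
# Constructor names are assumed; the SuperSuit wrappers require homogeneous
# per-agent observation and action spaces.
import supersuit as ss
from stable_baselines3 import PPO

from pyquaticus import pyquaticus_v0                  # assumed import path

env = pyquaticus_v0.PyQuaticusEnv(team_size=1)        # assumed constructor
env = ss.pettingzoo_env_to_vec_env_v1(env)            # agents exposed as vec-env copies
env = ss.concat_vec_envs_v1(env, 4, num_cpus=1, base_class="stable_baselines3")

model = PPO("MlpPolicy", env, verbose=1)              # one policy shared by all agents
model.learn(total_timesteps=100_000)
model.save("pyquaticus_ppo_shared")                   # hypothetical output path
```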