joyyang.dev

SwarmSim — A mini distributed AI agents sandbox

Why this exists

We’re entering an era where many specialized AIs will need to coordinate in the open: share partial observations, align on plans, allocate work, and stay resilient under faults. SwarmSim is a hands-on learning lab to explore that future. It lets you watch 100–1,000+ lightweight agents collaborate in a shared world while you toggle real distributed patterns (gossip, majority vote, auction), inject failures (latency, drops, partitions, kill), and compare convergence and cost. It’s built to make the invisible visible: how information spreads, how groups align, how assignments stabilize, and how faults shape outcomes.

The vision

MVP: 100–1,000 agents in a grid world solving tasks (resource gathering, cooperative pathfinding, K‑of‑N “build a bridge,” predator–prey). Show coordination patterns (voting, gossip, leader election) and emergent behavior.
V2: Plug‑in “brains” (rule‑based, RL, LLM tool‑users). Agents form coalitions; swap roles when leaders fail; run on laptop + optional cloud worker pool.
V3: Research playground: swap message‑passing protocols, compare convergence time/cost, add adversarial agents.

Architecture

Frontend (Next.js + TS + Three.js/Canvas)
- Consumes WS frames streamed at 10–20 Hz; renders bees with role halos (worker/leader/scout), bridges with have/need labels and a progress wedge, and flowers sized by “nectar.”
- Left panel: scenarios, agent scale, protocol apply, LLM highlight/ratio, tasks, view/2D fallback, speed slider, resource churn.
- Right drawer (Docs): deep explanations, diagrams, observability tips.
Coordinator (FastAPI)
- Authoritative world state + fixed tick loop; runs protocol on_tick; broadcasts compact WS frames; exposes /metrics (Prometheus) and control APIs (/scenario, /protocol, /faults, /tasks, /agents/scale).
- Fault adapters (latency/drop/partition) for bus traffic; optional LLM proxy for natural‑language “gossip.”
Agents (Python, multiprocessing; Ray optional)
- Sense → plan → act loop; adapters for gossip/vote/auction; publish status to world.events; optional LLM planner (proxy call with backoff).
Messaging
- Redis pub/sub for broadcast channels; envelopes carry ts/from/type/payload.
Observability
- Prometheus + Grafana: WS frame rate, messages/sec, consensus rounds, tasks completed, LLM latency.
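
To make the Messaging entry concrete, here is a minimal sketch of an envelope carrying the four fields named there (ts/from/type/payload). The class name `Envelope`, the attribute name `sender`, the example `type` values, and the choice of JSON as the wire format are assumptions for illustration, not necessarily what SwarmSim uses.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class Envelope:
    """Hypothetical message wrapper with the fields ts, from, type, payload."""

    sender: str              # serialized as "from" ("from" is a Python keyword)
    type: str                # e.g. "gossip", "vote", "bid" (illustrative values)
    payload: dict[str, Any]
    ts: float = field(default_factory=time.time)

    def to_json(self) -> str:
        """Encode as compact JSON, renaming `sender` to the wire key "from"."""
        data = asdict(self)
        data["from"] = data.pop("sender")
        return json.dumps(data, separators=(",", ":"))

    @classmethod
    def from_json(cls, raw: str) -> "Envelope":
        """Decode a JSON string produced by `to_json`."""
        data = json.loads(raw)
        return cls(
            sender=data["from"],
            type=data["type"],
            payload=data["payload"],
            ts=data["ts"],
        )


env = Envelope(sender="bee-7", type="gossip", payload={"flower": [3, 4]})
assert Envelope.from_json(env.to_json()) == env
```

With redis-py, a string like `env.to_json()` could be passed to `Redis.publish(channel, message)` on a channel such as world.events; subscribers would decode it with `Envelope.from_json`.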

Protocols

Gossip: rumor‑mongering (fanout, TTL) + push‑sum if desired. Robust; eventually consistent; bandwidth scales with fanout×TTL.
Majority vote: periodic windows collect proposals → aligned decisions; great for mode flips; sensitive to partitions / window tuning.
Auction (second‑price flavored): tasks are items; bees bid (distance/energy); winners assigned; fast but can flap without cadence/hold.
(Stretch) Leader election (Bully/Raft‑lite): coordination roles only, not replicated logs.
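
The gossip entry above can be explored with a small stand-alone simulation. This is a sketch under stated assumptions, not SwarmSim's adapter: it models one push-style variant in which every agent that learns the rumor forwards it to `fanout` random peers per round for `ttl` rounds. The function name `spread_rumor` and this particular reading of TTL are hypothetical.

```python
import random


def spread_rumor(n_agents: int, fanout: int, ttl: int, seed: int = 0) -> tuple[int, int]:
    """Simulate push gossip starting at agent 0.

    Returns (agents_informed, messages_sent).
    """
    rng = random.Random(seed)
    remaining = {0: ttl}  # agent id -> rounds it will still forward the rumor
    messages = 0
    while any(r > 0 for r in remaining.values()):
        # Snapshot the active set so agents informed this round start next round.
        active = [a for a, r in remaining.items() if r > 0]
        for agent in active:
            remaining[agent] -= 1
            peers = [p for p in range(n_agents) if p != agent]
            for peer in rng.sample(peers, min(fanout, len(peers))):
                messages += 1
                # A peer hearing the rumor for the first time gets a full TTL.
                remaining.setdefault(peer, ttl)
    return len(remaining), messages


informed, messages = spread_rumor(n_agents=500, fanout=3, ttl=4)
print(f"informed={informed} messages={messages}")
```

In this variant each informed agent sends exactly fanout×TTL messages, so total traffic is the informed count times fanout×TTL. Raising either knob trades bandwidth for coverage.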

Fault & scale levers (what to watch)

Latency: wedges and message waves slow visibly.
Drop rate: gossip still covers; vote/auction may mis‑decide without retries.
Partition: sub‑clusters act independently; merge on heal (wedge jumps).
Kill %: remaining bees re‑assign; auctions stabilize quickly.
Scale: watch frame rate and messages/sec in Grafana; the UI drops old frames to stay real‑time.
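
The latency, drop, and partition levers above can be pictured as a filter applied to each message on the bus. The sketch below is an assumption about how such an adapter might look; the class `Faults`, its fields, and the method `delivery_delay` are hypothetical names, and the Kill % lever is left out because it acts on agents rather than on messages.

```python
import random
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Faults:
    """Hypothetical per-message fault settings."""

    latency_ms: float = 0.0   # fixed extra delay added to every delivered message
    drop_rate: float = 0.0    # probability in [0, 1] that a message is lost
    partition: dict[str, int] = field(default_factory=dict)  # agent id -> group id

    def delivery_delay(
        self, sender: str, receiver: str, rng: random.Random
    ) -> Optional[float]:
        """Return the delay in ms, or None when the message is not delivered."""
        # Agents missing from the partition map are treated as group 0.
        if self.partition.get(sender, 0) != self.partition.get(receiver, 0):
            return None  # the partition blocks cross-group traffic
        if rng.random() < self.drop_rate:
            return None  # random loss
        return self.latency_ms


rng = random.Random(42)
faults = Faults(latency_ms=80.0, drop_rate=0.2, partition={"bee-1": 0, "bee-2": 1})
print(faults.delivery_delay("bee-1", "bee-2", rng))  # None: different partitions
```

Keeping the check as a pure function of sender, receiver, and a seeded random source makes a fault scenario reproducible across runs.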

State & metrics

World state is authoritative in the Coordinator, which sends a WS snapshot every tick.
Prometheus: swarmsim_ws_frame_rate, swarmsim_messages_total, swarmsim_consensus_rounds_total, swarmsim_tasks_completed_total, swarmsim_llm_latency_ms.
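
The "snapshot every tick" behavior can be sketched as a fixed-rate asyncio loop. This is an illustrative outline rather than the Coordinator's actual code: `run_ticks`, `step`, and `broadcast` are hypothetical names, and the real loop presumably also runs the active protocol and updates metrics.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def run_ticks(
    step: Callable[[int], dict],
    broadcast: Callable[[dict], Awaitable[None]],
    hz: float,
    max_ticks: int,
) -> None:
    """Advance the world at a fixed rate and broadcast one snapshot per tick."""
    period = 1.0 / hz
    next_deadline = time.monotonic()
    for tick in range(max_ticks):
        snapshot = step(tick)  # advance world state and return a compact frame
        await broadcast(snapshot)
        next_deadline += period
        # Sleep until the next deadline; after a slow tick this sleeps 0 s
        # so the loop catches up instead of drifting.
        await asyncio.sleep(max(0.0, next_deadline - time.monotonic()))


async def print_frame(frame: dict) -> None:
    print(frame)


# Three ticks at 20 Hz, printing each frame instead of sending it over a WebSocket.
asyncio.run(run_ticks(lambda t: {"tick": t}, print_frame, hz=20.0, max_ticks=3))
```

Scheduling against a monotonic deadline, rather than sleeping a fixed period after each tick, keeps the average rate steady even when individual ticks vary in cost.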

Learning outcomes

Compare gossip vs quorum vs market mechanisms: message complexity vs convergence speed vs robustness.
See how partitions fundamentally change guarantees (why gossip “wins ugly” in degraded networks).
Appreciate planning/concurrency trade‑offs (multiprocessing/Ray vs GIL‑bounded asyncio).
Understand why real systems layer patterns (gossip for presence and soft‑state, auctions for local assignment, voting for global flips).
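
The last outcome mentions voting for global flips and auctions for local assignment. A minimal sketch of both decision rules follows, assuming a strict-majority threshold over all voters and a plain second-price rule. The function names, the tie-break by agent id, and treating a higher bid as better are assumptions; how SwarmSim turns distance/energy into a bid is not specified here.

```python
from collections import Counter
from typing import Hashable, Optional


def majority_decision(proposals: list[Hashable], n_voters: int) -> Optional[Hashable]:
    """Return the proposal backed by a strict majority of all voters, else None."""
    if not proposals:
        return None
    choice, count = Counter(proposals).most_common(1)[0]
    return choice if count * 2 > n_voters else None


def second_price_auction(bids: dict[str, float]) -> Optional[tuple[str, float]]:
    """Highest bidder wins and is charged the second-highest bid."""
    if not bids:
        return None
    # Sort by bid descending; ties break by agent id for determinism.
    ranked = sorted(bids.items(), key=lambda kv: (-kv[1], kv[0]))
    winner = ranked[0][0]
    # A lone bidder is charged its own bid.
    price = ranked[1][1] if len(ranked) > 1 else ranked[0][1]
    return winner, price


print(majority_decision(["gather", "gather", "build"], n_voters=3))
print(second_price_auction({"bee-1": 4.0, "bee-2": 7.5, "bee-3": 6.0}))
```

Because the vote counts against all voters rather than only those heard from, a minority partition that collects just its own proposals returns None instead of deciding alone. That is one way to see why voting is sensitive to partitions and window tuning.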

The project also ships with extensive documentation and animations, so users can learn as they experiment.

Repo and deployment

One‑command stack (Docker Compose) brings up frontend, coordinator, agents, Redis, Prometheus, and Grafana; .env toggles protocol, counts, latency, LLM.
GitHub repo: J0YY/SwarmSim

Motivation: this is a low‑friction playground for learning collaborative intelligence, that is, how many small agents, each with bounded capabilities, can do surprisingly big things when they communicate well and use the right protocol for the situation.