Architecture
godon is a distributed system for live system optimization. It coordinates metaheuristic search algorithms with real-world effectuation and observation, running continuously against production systems.
High-Level Overview
┌─────────────────────────────────────────────────────────────────────────┐
│ Control Plane │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Godon API │───▶│ Windmill │───▶│ Workers │ │
│ │ (extern) │ │ (scheduler) │ │ (execute) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Storage Layer │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Metadata DB │ │ Archive DB │ │
│ │ (PostgreSQL) │ │ (YugabyteDB) │ │
│ │ │ │ │ │
│ │ Component state │ │ Trial history │ │
│ │ Job tracking │ │ Cooperation │ │
│ └──────────────────┘ └──────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Execution Layer │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Breeder │────────▶│ Effectuator │────────▶│ Target │ │
│ │ (driver) │ │ (apply) │ │ System │ │
│ └──────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌──────────────┐ ┌──────────────┐ │
│ └─────────│Reconnaissance│◀────────│ Metrics │ │
│ │ (observe) │ │ Sources │ │
│ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
Components
Godon API
The external interface for managing optimization runs.
| Responsibility | Description |
|---|---|
| Breeder lifecycle | Create, start, stop, delete breeders |
| Status queries | Check breeder and trial status |
| Configuration | Submit optimization configs |
| Results | Retrieve best configurations |
The API is stateless — it delegates to Windmill for orchestration.
Windmill
Workflow orchestration engine that schedules and executes godon jobs.
| Responsibility | Description |
|---|---|
| Job scheduling | Queue and dispatch work to workers |
| Worker management | Maintain worker pools by group |
| Retry handling | Recover from transient failures |
| Dependency resolution | Coordinate multi-step workflows |
Windmill provides the execution backbone without godon needing to implement scheduling logic.
Worker Groups
Workers are organized by job type:
| Group | Replicas | Timeout | Purpose |
|---|---|---|---|
| controller | 3 | 2 minutes | Fast operations: preflight, breeder create, status checks |
| breeder | 5 | None | Long-running optimization loops |
| default | 2 | Default | General operations, dependency resolution |
Why separate groups:
- Controller jobs are fast but frequent — they need quick response
- Breeder jobs run continuously — no timeout; crash recovery via the Optuna DB
- Default handles everything else without blocking the specialized groups
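The routing implied by the table can be expressed as a small dispatch map. Group names and timeouts come from the table above; the job-type-to-group assignment is a simplified assumption:

```python
# Sketch: route jobs to worker groups per the table above.
# The job-type sets per group are a simplified assumption.

WORKER_GROUPS = {
    "controller": {"replicas": 3, "timeout_s": 120, "jobs": {"preflight", "breeder_create", "status"}},
    "breeder":    {"replicas": 5, "timeout_s": None, "jobs": {"optimize"}},  # no timeout: long-running loops
    "default":    {"replicas": 2, "timeout_s": None, "jobs": set()},         # catch-all, Windmill default timeout
}

def route(job_type: str) -> str:
    """Pick the worker group for a job type; unknown jobs go to 'default'."""
    for group, spec in WORKER_GROUPS.items():
        if job_type in spec["jobs"]:
            return group
    return "default"
```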
Metadata DB (PostgreSQL)
Stores godon's operational state.
| Data | Purpose |
|---|---|
| Breeder definitions | Configurations submitted via API |
| Job state | Windmill job tracking |
| Component metadata | Internal godon state |
PostgreSQL is sufficient here — moderate write volume, strong consistency needs.
Archive DB (YugabyteDB)
Stores trial history for optimization and cooperation.
| Data | Purpose |
|---|---|
| Trial records | Parameters, metrics, fitness |
| Pareto fronts | Best configurations found |
| Cooperation data | Shared trials between breeders |
Why YugabyteDB:
- Horizontal scalability — many concurrent breeders writing trials
- PostgreSQL compatibility — uses YSQL, same queries as Optuna expects
- Distribution — cooperative breeders need shared storage
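Because YugabyteDB speaks the PostgreSQL wire protocol through YSQL (default port 5433), Optuna can point its storage URL at the Archive DB unchanged. A sketch of building that URL, where host, database, and credentials are placeholders:

```python
# Sketch: build a PostgreSQL-style storage URL for the Archive DB.
# YSQL listens on port 5433 by default; host/db/credentials are placeholders.

def archive_url(host: str, db: str, user: str, password: str, port: int = 5433) -> str:
    return f"postgresql://{user}:{password}@{host}:{port}/{db}"

# Optuna accepts such a URL directly, e.g.:
#   optuna.create_study(study_name="breeder-1",
#                       storage=archive_url("archive-db", "godon", "u", "p"),
#                       load_if_exists=True)
```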
Metrics Exporter
Exposes godon metrics for observability.
| Metric Type | Examples |
|---|---|
| Trials | Total, successful, failed |
| Duration | Effectuation time, reconnaissance time |
| Breeder | Active count, worker utilization |
Pushes to Prometheus Push Gateway for aggregation.
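A sketch of the push path. The metric names are illustrative; the payload is Prometheus's text exposition format, which is what the Push Gateway accepts (in practice, prometheus_client's push_to_gateway helper does this formatting and the HTTP POST for you):

```python
# Sketch: render godon-style metrics in Prometheus text exposition format,
# as pushed to the Push Gateway. Metric names are illustrative assumptions.

def exposition(metrics: dict[str, float]) -> str:
    """Render a flat metric dict as text exposition lines."""
    return "".join(f"{name} {value}\n" for name, value in metrics.items())

payload = exposition({
    "godon_trials_total": 42,
    "godon_trials_failed": 3,
    "godon_effectuation_seconds": 1.7,
})
# POST this payload to http://pushgateway:9091/metrics/job/godon
```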
Optimization Loop
The core cycle that each breeder worker executes:
┌──────────────────────────────────────────────────────────────────────────┐
│ Breeder Worker Loop │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌──────────┐ │
│ │ Sample │───▶│ Effectuate │───▶│Reconnoiter │───▶│ Update │ │
│ │ (algorithm)│ │ (apply) │ │ (observe) │ │ (fitness) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └──────────┘ │
│ │ │ │ │ │
│ │ ▼ │ │ │
│ │ ┌──────────────────────────────────┐ │ │
│ │ │ Target System │ │ │
│ │ │ ┌────────┐ ┌────────────┐ │ │ │
│ └─────────▶│ SSH │ │ Kubernetes │─────▶ │ │
│ │ HTTP │ │ API │ │ │ │
│ └────────┘ └────────────┘ │ │ │
│ │ │ │
│ ▼ │ │
│ ┌──────────────────────────┐ │ │
│ │ Prometheus / Metrics │ │ │
│ └────────────┬─────────────┘ │ │
│ │ │ │
│ ▼ │ │
│ ┌──────────────────────────┐ │ │
│ │ Guardrails? Fitness? │ │ │
│ └────────────┬─────────────┘ │ │
│ │ │ │
│ ┌────────────┴─────────────┐ │ │
│ ▼ ▼ │ │
│ ┌──────────┐ ┌──────────┐ │ │
│ │ Share │ │ Next │ │ │
│ │ (opt) │ │ Sample │ │ │
│ └──────────┘ └──────────┘ │ │
│ │
└──────────────────────────────────────────────────────────────────────────┘
| Phase | Action | Duration |
|---|---|---|
| Sample | Algorithm suggests next parameters | Milliseconds |
| Effectuate | Apply config to target system | Seconds to minutes |
| Reconnoiter | Wait for steady state, collect metrics | Seconds |
| Update | Check guardrails, compute fitness, update algorithm | Milliseconds |
| Communicate (optional) | Publish trial to Archive DB for cooperation | Milliseconds |
Key properties:
- Effectuation is idempotent — safe to retry
- Reconnaissance waits for steady state before collecting
- Guardrail violations short-circuit the loop and mark the trial failed
- The Archive DB write is async and doesn't block the next sample
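The cycle above can be sketched end to end. A random sampler stands in for the metaheuristic and the target system is simulated; in godon the sampler is an Optuna study and effectuation/reconnaissance talk to real systems over SSH, HTTP, or the Kubernetes API:

```python
import random

# Sketch of the breeder loop: Sample -> Effectuate -> Reconnoiter -> Update.
# random.uniform stands in for the metaheuristic; the target is a plain dict.

def effectuate(target: dict, params: dict) -> None:
    """Apply the config. Idempotent: re-applying the same params is a no-op."""
    target.update(params)

def reconnoiter(target: dict) -> dict:
    """Observe the (simulated) steady-state metric."""
    return {"latency_ms": (target["cache_mb"] - 256) ** 2 / 100 + 5}

def breeder_loop(trials: int, guardrail_ms: float = 500.0) -> list[dict]:
    target, archive = {}, []
    for _ in range(trials):
        params = {"cache_mb": random.uniform(64, 1024)}    # Sample
        effectuate(target, params)                         # Effectuate
        metrics = reconnoiter(target)                      # Reconnoiter
        ok = metrics["latency_ms"] <= guardrail_ms         # Update: guardrail check
        archive.append({"params": params, "metrics": metrics,
                        "state": "complete" if ok else "failed"})
    return archive
```

Guardrail violations simply mark the trial failed and move on, matching the short-circuit behavior described above.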
Technology Choices
| Technology | Role | Why |
|---|---|---|
| Windmill | Workflow orchestration | Mature, high-performing open source workflow engine; abstracts Kubernetes complexity |
| PostgreSQL | Metadata storage | Reliable, well-understood, sufficient for component state |
| YugabyteDB | Trial archive | PostgreSQL-compatible, horizontally scalable, enables cooperation |
| Kubernetes | Deployment platform | Container orchestration, Helm for config, standard in cloud-native |
| Prometheus | Metrics | Industry standard, Push Gateway for batch job metrics |
Design principles:
- Open source stack — Built entirely on open source components, no vendor lock-in
- Separate concerns — Metadata (operational) vs Archive (optimization) have different scaling needs
- PostgreSQL ecosystem — Both databases speak PostgreSQL, reducing cognitive load
- Kubernetes-native — Helm charts, Pod Disruption Budgets, standard deployment patterns
Deployment
godon is deployed via Helm chart to Kubernetes.
┌─────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ godon-api │ │ windmill │ │ workers │ │
│ │ (pod) │ │ (pods) │ │ (pods) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ metadata-db │ │ archive-db │ │ pushgateway │ │
│ │ (postgres) │ │ (yugabyte) │ │ (prometheus)│ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Deployment characteristics:
- Stateless API — Can scale horizontally, rolling updates without downtime
- Stateful databases — YugabyteDB handles its own replication
- Worker pools — Scale independently based on load
- Helm-managed — Single chart installs the full stack
Failure Modes
| Failure | Impact | Recovery |
|---|---|---|
| API pod dies | No new requests | Kubernetes restarts, stateless |
| Worker dies | In-flight trial lost | Optuna DB enables resume, algorithm continues |
| Metadata DB down | No new breeders | Existing breeders continue (state already dispatched) |
| Archive DB down | No cooperation, no persistence | Breeders continue locally, no cross-learning |
| Target system unreachable | Trial fails | Marked failed, algorithm learns to avoid |
Crash safety:
- Breeder workers have no timeout — they run until completion or crash
- Optuna stores trial state in the Archive DB — a restart resumes from the last known state
- No half-applied configs — effectuation is idempotent
Scaling
| Component | Scale by | Limit |
|---|---|---|
| API | Replicas | Stateless, scale freely |
| Workers | Group replicas | More workers = more parallel trials |
| Metadata DB | Vertical | Single PostgreSQL instance |
| Archive DB | Horizontal | YugabyteDB distributes across nodes |
Cooperation scaling:
- Multiple breeders share the Archive DB
- Each learns from the others' trials
- Diminishing returns after ~10 cooperating breeders (search space coverage)
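Cooperation, sketched: before sampling, each breeder folds its peers' archived trials into its own view so search effort is not duplicated. The merge rule here (union keyed on the parameter tuple) is a simplification of how shared trials are actually consumed:

```python
# Sketch: a breeder merges peers' trials from the shared archive into its
# local view. Deduplication on the parameter tuple is a simplification.

def merged_view(local: list[dict], shared: list[dict]) -> list[dict]:
    seen, view = set(), []
    for trial in local + shared:
        key = tuple(sorted(trial["params"].items()))
        if key not in seen:
            seen.add(key)
            view.append(trial)
    return view

mine   = [{"params": {"cache_mb": 512}, "fitness": 0.8}]
theirs = [{"params": {"cache_mb": 512}, "fitness": 0.8},
          {"params": {"cache_mb": 256}, "fitness": 0.6}]
# The breeder now learns from two distinct trials instead of re-running one.
```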
See Also
- Core Concepts — Breeder, Effectuator, Reconnaissance, Guardrails
- Configuration Guide — How to configure optimization runs
- Setup — Installation instructions