Guardrails
Guardrails
Guardrails are safety limits that protect production systems during optimization. Unlike objectives which guide the search, guardrails enforce hard constraints — trials that violate them are rejected.
Guardrails vs Objectives
| Aspect | Objective | Guardrail |
|---|---|---|
| Purpose | What to optimize | What to avoid |
| Behavior | Continuous guidance | Binary pass/fail |
| Violation | Suboptimal result | Trial rejected |
| Example | Minimize latency | Latency must stay < 100ms |
Objectives say "make this better." Guardrails say "don't break this."
How Guardrails Work
┌─────────────────────────────────────────────────────────────┐
│ Trial Execution │
│ │
│ Parameters ──▶ Effectuate ──▶ Reconnoiter ──▶ Check │
│ │ │
│ ┌────────────┴───┐ │
│ │ Guardrails? │ │
│ └────────────┬───┘ │
│ │ │
│ ┌───────────────────────┼─────┐ │
│ ▼ ▼ │ │
│ Violated OK │ │
│ │ │ │ │
│ ▼ ▼ │ │
│ Mark FAILED Accept trial │ │
│ Trigger rollback Update best │ │
└─────────────────────────────────────────────────────────────┘
Guardrails check reconnaissance data after each trial.
Defining Guardrails
guardrails:
- name: max_latency
metric: latency_p99
condition: "< 100ms"
on_violation: fail_trial
- name: error_budget
metric: error_rate
condition: "< 0.01"
on_violation: fail_trial
- name: cpu_limit
metric: cpu_usage_percent
condition: "< 90"
on_violation: fail_trial
| Field | Meaning |
|---|---|
metric |
Which reconnaissance value to check |
condition |
Threshold expression |
on_violation |
What to do when violated |
Condition Syntax
| Operator | Meaning | Example |
|---|---|---|
< |
Less than | latency < 100 |
<= |
Less than or equal | errors <= 10 |
> |
Greater than | throughput > 1000 |
>= |
Greater than or equal | uptime >= 0.99 |
== |
Equals | status == "healthy" |
!= |
Not equals | state != "degraded" |
Guardrail States
┌─────────────────────────────────────────────────────────────┐
│ Guardrail Evaluation │
│ │
│ Metric value ──▶ Compare ──▶ [PASS] or [FAIL] │
│ │
│ latency_p99 = 45ms │
│ condition: < 100ms │
│ result: PASS │
│ │
│ error_rate = 0.05 │
│ condition: < 0.01 │
│ result: FAIL │
│ │
└─────────────────────────────────────────────────────────────┘
| State | Meaning |
|---|---|
| PASS | Metric within bounds, trial continues |
| FAIL | Metric exceeds limit, trial rejected |
Multiple Guardrails
All guardrails must pass for a trial to succeed:
┌─────────────────────────────────────────────────────────────┐
│ Guardrail Checks │
│ │
│ latency_p99: 45ms < 100ms ✓ PASS │
│ error_rate: 0.001 < 0.01 ✓ PASS │
│ cpu_usage: 85% < 90% ✓ PASS │
│ │
│ All pass → Trial accepted │
│ │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Guardrail Checks │
│ │
│ latency_p99: 45ms < 100ms ✓ PASS │
│ error_rate: 0.05 < 0.01 ✗ FAIL │
│ cpu_usage: 85% < 90% (skipped) │
│ │
│ Any fail → Trial rejected │
│ │
└─────────────────────────────────────────────────────────────┘
On Violation
When a guardrail fails, several actions are possible:
| Action | What happens |
|---|---|
fail_trial |
Mark trial failed, continue optimization |
fail_and_rollback |
Mark failed, restore previous config |
stop_breeder |
Halt optimization entirely |
guardrails:
- name: critical_error_rate
metric: error_rate
condition: "< 0.1"
on_violation: fail_and_rollback
- name: catastrophic_failure
metric: system_down
condition: "== false"
on_violation: stop_breeder
Rollback on Violation
Consecutive guardrail violations can trigger automatic rollback:
┌─────────────────────────────────────────────────────────────┐
│ Rollback Flow │
│ │
│ Trial 1: OK ──▶ Trial 2: VIOLATED ──▶ Trial 3: VIOLATED │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ last_good failures=1 failures=2 │
│ = params_1 │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Threshold hit? │ │
│ │ (e.g., 3) │ │
│ └────────┬─────────┘ │
│ │ │
│ ┌──────────────┴──────────┐ │
│ ▼ ▼ │
│ Yes No │
│ │ │ │
│ ▼ │ │
│ Apply last_good params │ │
│ Reset failure counter │ │
│ │ │ │
│ └────────────┬────────────┘ │
│ ▼ │
│ Continue trials │
└─────────────────────────────────────────────────────────────┘
rollback:
enabled: true
consecutive_failures: 3 # Trigger after N violations
target_state: previous # previous, best, or baseline
on_failure: stop # stop, continue, or skip_target
Rollback Strategies
| Strategy | What it restores | Use Case |
|---|---|---|
previous |
Last successful trial | Conservative, safe |
best |
Best Pareto-optimal trial | Aggressive, assumes best is stable |
baseline |
Original configuration | Fallback to known defaults |
Soft vs Hard Guardrails
Soft guardrails — Warning threshold, continue optimization
Hard guardrails — Hard limit, fail trial immediately
guardrails:
- name: latency_warning
metric: latency_p99
condition: "< 80ms"
severity: soft # Log warning, continue
- name: latency_limit
metric: latency_p99
condition: "< 100ms"
severity: hard # Fail trial
Soft guardrails inform without blocking. Hard guardrails protect.
Guardrail Design Guidelines
| Guideline | Why |
|---|---|
| Set guardrails before objectives | Safety first |
| Start conservative | Loosen after proven safe |
| Monitor guardrail hit rate | Too many hits = search space issue |
| Use soft before hard | Graduated response |
Too tight: Algorithm can't explore, all trials fail
Too loose: Risk of production incidents
Guardrails in Multi-Objective Optimization
Guardrails constrain the Pareto front:
Throughput
▲
│
│ · · · (outside guardrail)
│ · · ·
│ · X ← guardrail boundary
│ · ·
│ · · · (inside guardrail)
│
└─────────────────────▶
Latency
Only trials inside guardrail boundary are considered
The effective Pareto front excludes guardrail-violating trials.
Summary
| Aspect | What it means |
|---|---|
| Purpose | Protect systems during optimization |
| Behavior | Binary pass/fail on metric thresholds |
| Violation | Trial rejected, possible rollback |
| Types | Soft (warning) vs hard (fail) |
| Rollback | Auto-restore on consecutive failures |
See Also
- Reconnaissance — Provides metrics for guardrail checks
- Effectuator — May execute rollback
- Breeder — Configures and enforces guardrails