DC-Ops Operations Console
Select a scenario from the panel to begin a datacenter operations episode. Issue commands and monitor the facility in real-time.
Select a scenario from the panel to begin a datacenter operations episode. Issue commands and monitor the facility in real-time.
Nine verified demo walkthroughs across all scenarios and facility sizes. Each demo resolves in 8–10 steps following proper operational procedure: assess → diagnose → compensate → verify → resolve.
| # | Scenario | Facility | Steps | Reward | Key Skill |
|---|---|---|---|---|---|
| 1 | A1 Setpoint Optimization | Default | 9 | +0.324 | PUE optimization |
| 2 | A2 Thermal Event | Default | 8 | +0.873 | Single-failure response |
| 3 | A2 Thermal Event | Large | 8 | +0.831 | Multi-zone + H1 isolation |
| 4 | A4 CRAC Cascade | Default | 8 | +1.230 | Multi-failure triage |
| 5 | A4 CRAC Cascade | Large | 8 | +1.150 | Multi-zone cascade |
| 6 | B1 UPS Alarm | Default | 8 | +0.512 | Power chain audit |
| 7 | B3 Generator Test | Default | 10 | +0.567 | Protocol compliance |
| 8 | B4 Power Failure | Default | 8 | +0.934 | Battery + gen startup |
| 9 | B4 Power Failure | Small | 8 | +0.948 | Aggressive load shedding |
| # | Command | Reward | Cumul | PUE | Inlets | Reasoning |
|---|---|---|---|---|---|---|
| 1 | check_status | +0.131 | +0.131 | 1.87 | 17.1 / 17.1 | Procedure bonus (+0.2): must check before adjusting. Baseline — all CRACs at 15°C. |
| 2 | adjust_setpoint CRAC-1 24 | +0.047 | +0.178 | 1.80 | 17.7 / 17.1 | Raise from 15→24°C. Compressor works less → immediate PUE drop. |
| 3 | adjust_setpoint CRAC-2 24 | +0.039 | +0.217 | 1.76 | 19.2 / 17.0 | PUE continues falling. Zone A inlets rising — still safely below 27°C. |
| 4 | adjust_setpoint CRAC-3 24 | +0.038 | +0.255 | 1.71 | 20.7 / 17.6 | Zone B responding. Thermal mass (11.1 kJ/K per server) causes gradual warming. |
| 5 | adjust_setpoint CRAC-4 24 | +0.025 | +0.280 | 1.69 | 21.9 / 19.1 | All CRACs at 24°C. PUE 1.69 — still above 1.6 target. Fan reduction needed. |
| 6 | set_fan_speed CRAC-1 70 | −0.019 | +0.261 | 1.68 | 22.8 / 20.6 | Fan power follows cubic law: at 70%, power = 34% of rated (66% saving). |
| 7 | set_fan_speed CRAC-2 70 | −0.017 | +0.244 | 1.66 | 23.4 / 21.9 | PUE 1.66. Inlets 23.4°C — still 3.6°C below ASHRAE A2 max (27°C). |
| 8 | set_fan_speed CRAC-3 70 | −0.012 | +0.232 | 1.63 | 23.9 / 22.7 | Almost there. System approaching equilibrium. |
| 9 | set_fan_speed CRAC-4 70 | +0.092 | +0.324 | 1.60 | 24.3 / 23.3 | RESOLVED. PUE hits target (≤1.6). Speed bonus: (10−9)/10 = +0.1. |
| # | Command | Reward | Cumul | Inlets (A/B) | Reasoning |
|---|---|---|---|---|---|
| 1 | check_status | +0.204 | +0.204 | 19.8 / 20.0 | CRAC-3 shows "!! COMPRESSOR" — no supply temp, no airflow, 0 kW. |
| 2 | diagnose CRAC-3 | +0.054 | +0.258 | 19.8 / 20.2 | Unlocks resolution gate. Confirms "FAULT: compressor." +0.3 bonus on subsequent setpoint changes. |
| 3 | diagnose CRAC-1 | −0.021 | +0.237 | 19.8 / 20.4 | Verify remaining CRACs healthy. "No faults detected." |
| 4 | adjust_setpoint CRAC-1 16 | +0.034 | +0.270 | 19.6 / 20.5 | Lower setpoint 18→16°C. Increases cooling output. Earns procedure bonus. |
| 5 | adjust_setpoint CRAC-2 16 | +0.034 | +0.304 | 19.3 / 20.6 | Both zone A CRACs overcooling to compensate. |
| 6 | set_fan_speed CRAC-1 100 | +0.034 | +0.338 | 19.0 / 20.8 | Max airflow on surviving CRACs. |
| 7 | set_fan_speed CRAC-2 100 | +0.034 | +0.372 | 18.7 / 20.8 | Zone B stabilizing at ~20.8°C — within ASHRAE recommended range. |
| 8 | set_fan_speed CRAC-4 100 | +0.501 | +0.873 | 18.5 / 20.9 | RESOLVED. All zones stable for 2+ steps. Speed bonus: (15−8)/15 = +0.467. |
| # | Command | Reward | Cumul | Inlets (A/B/C/D) | Reasoning |
|---|---|---|---|---|---|
| 1 | check_status | +0.207 | +0.207 | 20.0 / 20.8 / 19.2 / 19.7 | 4-zone dashboard. Zone B 0.8°C warmer. Zone C (H1) at 19.2°C — 2.8°C below its 22°C max. |
| 2 | diagnose CRAC-3 | +0.057 | +0.263 | 20.0 / 21.6 / 19.2 / 19.7 | FAULT: compressor. Zone B rising +0.8°C/step. Zone C stable. |
| 3 | diagnose CRAC-1 | −0.019 | +0.245 | 20.0 / 22.3 / 19.2 / 19.6 | Confirm CRAC-1 healthy. |
| 4 | diagnose CRAC-2 | −0.019 | +0.225 | 20.0 / 23.0 / 19.2 / 19.6 | Confirm CRAC-2 healthy. Thorough zone B CRAC audit. |
| 5 | adjust_setpoint CRAC-2 16 | +0.035 | +0.261 | 19.9 / 23.6 / 19.2 / 19.6 | Lower surviving zone B CRAC. Procedure bonus earned. |
| 6 | adjust_setpoint CRAC-4 16 | +0.035 | +0.296 | 19.7 / 24.0 / 19.2 / 19.6 | Rate of rise slowing. H1 zone unaffected. |
| 7 | set_fan_speed CRAC-2 100 | +0.035 | +0.330 | 19.6 / 24.3 / 19.2 / 19.6 | Max airflow. Zone B 2.7°C below ASHRAE A2 max. |
| 8 | set_fan_speed CRAC-4 100 | +0.501 | +0.831 | 19.5 / 24.6 / 19.2 / 19.6 | RESOLVED. Zone B stabilized. H1 completely unaffected. |
| Metric | Default | Large |
|---|---|---|
| Max inlet temp | 20.9°C | 24.6°C |
| H1 zone impact | N/A | None (19.2°C) |
| Cumulative reward | +0.873 | +0.831 |
| # | Command | Reward | Cumul | Inlets (A/B) | Reasoning |
|---|---|---|---|---|---|
| 1 | check_status | +0.212 | +0.212 | 19.9 / 19.9 | Two red-flagged CRACs. 50% cooling capacity lost. |
| 2 | diagnose CRAC-1 | +0.062 | +0.274 | 20.0 / 20.0 | "FAULT: compressor." First of two required diagnoses. |
| 3 | diagnose CRAC-3 | +0.027 | +0.301 | 20.1 / 20.1 | "FAULT: fan." Both diagnosed → resolution gate unlocked. +0.2 procedure bonus. |
| 4 | adjust_setpoint CRAC-2 16 | +0.062 | +0.363 | 20.2 / 20.2 | Lower surviving CRAC setpoint. Procedure bonus (diagnosis first). |
| 5 | adjust_setpoint CRAC-4 16 | +0.062 | +0.425 | 20.2 / 20.3 | Both survivors at 16°C. Temps stabilizing. |
| 6 | set_fan_speed CRAC-2 100 | +0.062 | +0.487 | 20.1 / 20.2 | Max airflow confirmed. |
| 7 | set_fan_speed CRAC-4 100 | +0.062 | +0.549 | 20.1 / 20.2 | Temps flat at ~20.1°C. Stable consecutive steps. |
| 8 | set_rack_load B-05 4 | +0.681 | +1.230 | 20.0 / 20.1 | RESOLVED. Load shed for thermal margin. Speed bonus: (20−8)/20 = +0.600. |
| # | Command | Reward | Cumul | Inlets (A/B/C/D) | Reasoning |
|---|---|---|---|---|---|
| 1 | check_status | +0.210 | +0.210 | 20.4 / 20.4 / 19.2 / 19.7 | CRAC-1 and CRAC-3 down. Zones C/D have own CRACs. |
| 2 | diagnose CRAC-1 | +0.059 | +0.269 | 20.8 / 20.8 / 19.2 / 19.7 | FAULT: compressor. Zone A/B rising ~0.4°C/step. |
| 3 | diagnose CRAC-3 | +0.024 | +0.293 | 21.2 / 21.2 / 19.2 / 19.7 | FAULT: fan. Both diagnosed → gate unlocked. |
| 4 | diagnose CRAC-2 | +0.024 | +0.317 | 21.6 / 21.6 / 19.2 / 19.7 | Verify CRAC-2 healthy. Critical with 2 failures. |
| 5 | adjust_setpoint CRAC-2 16 | +0.059 | +0.376 | 21.9 / 22.0 / 19.2 / 19.6 | Compensate. Zone B at 22.0°C — 13°C below ASHRAE allowable max (35°C). |
| 6 | adjust_setpoint CRAC-4 16 | +0.058 | +0.434 | 22.2 / 22.3 / 19.2 / 19.6 | Rate of rise slowing. H1 zone unaffected. |
| 7 | set_fan_speed CRAC-2 100 | +0.058 | +0.492 | 22.5 / 22.5 / 19.2 / 19.6 | Max airflow. Zone B stabilizing. |
| 8 | set_fan_speed CRAC-4 100 | +0.658 | +1.150 | 22.7 / 22.8 / 19.2 / 19.6 | RESOLVED. All zones within allowable. Speed bonus: +0.600. |
| Metric | Default | Large |
|---|---|---|
| Final zone B inlet | 20.1°C | 22.8°C |
| H1 zone impact | N/A | None (19.2°C) |
| Cumulative reward | +1.230 | +1.150 |
| # | Command | Reward | Cumul | Reasoning |
|---|---|---|---|---|
| 1 | check_status | −0.007 | −0.007 | Baseline. Utility NORMAL, generator OFF, ATS on UTILITY. |
| 2 | diagnose UPS-1 | +0.143 | +0.137 | Key step. mode=double_conversion, SOC=86%. Resolution gate requires this. |
| 3 | diagnose UPS-2 | −0.007 | +0.130 | Verify redundant UPS. mode=double_conversion, SOC=87%. |
| 4 | diagnose GEN-1 | −0.007 | +0.123 | Generator in standby — confirming readiness. |
| 5 | diagnose PDU-A-01 | −0.007 | +0.117 | Verify power distribution intact. |
| 6 | check_status | −0.007 | +0.110 | Re-verify before closing incident. |
| 7 | diagnose PDU-B-01 | −0.007 | +0.103 | Complete the power chain audit. |
| 8 | acknowledge_alarm | +0.408 | +0.512 | RESOLVED. Alarm acknowledged after thorough investigation. Speed bonus: (10−8)/10 = +0.200. |
Steps 3–7 return −0.007 each because the delta-based progress metric doesn't change during investigation — only the final acknowledgment triggers progress completion. The cumulative reward is still positive (+0.512) thanks to the large resolution step reward.
| # | Command | Reward | Cumul | Gen State | Reasoning |
|---|---|---|---|---|---|
| 1 | check_status | −0.007 | −0.007 | OFF | Baseline. All systems normal. |
| 2 | diagnose GEN-1 | −0.007 | −0.013 | off | Pre-test inspection — verify before starting. |
| 3 | start_generator | +0.113 | +0.100 | CRANKING | Generator start sequence initiated. |
| 4 | wait | −0.022 | +0.078 | LOADED | Let generator complete warmup. |
| 5 | diagnose GEN-1 | +0.043 | +0.122 | ready | Critical verification. Confirms generator running properly. |
| 6 | check_status | −0.007 | +0.115 | LOADED | Full dashboard confirms generator loaded. |
| 7 | stop_generator | +0.113 | +0.228 | COOLDOWN | Initiate cooldown (300s for turbocharger). |
| 8 | wait | −0.022 | +0.207 | COOLDOWN | Allow cooldown to proceed. |
| 9 | diagnose GEN-1 | −0.032 | +0.175 | cooldown | Post-shutdown inspection. |
| 10 | acknowledge_alarm | +0.392 | +0.567 | COOLDOWN | RESOLVED. Protocol complete. Speed bonus: (15−10)/15 = +0.333. |
B3 tracks four internal flags that must be set in order:
start_generator issueddiagnose GEN-1 while generator is runningstop_generator (only if started + verified)acknowledge_alarm (only if stopped)The agent cannot skip steps — issuing stop_generator before diagnose GEN-1 won't set _stopped.
| # | Command | Reward | Cumul | Key Metrics | Reasoning |
|---|---|---|---|---|---|
| 1 | check_status | +0.108 | +0.108 | UPS battery, SOC ~97% | Utility LOST. ATS transferring. Generator auto-starting. |
| 2 | diagnose UPS-1 | +0.131 | +0.239 | on_battery, SOC=95% | Resolution gate unlocked. Battery draining ~2%/step at 160 kW. |
| 3 | diagnose UPS-2 | +0.078 | +0.317 | on_battery, SOC=90% | Redundant UPS also on battery. |
| 4 | start_generator | −0.007 | +0.310 | Gen: CRANKING | Gen already auto-starting — slight negative for redundant command. |
| 5 | set_rack_load A-05 4 | +0.062 | +0.371 | IT: 156 kW | Shed 4 kW to extend battery life. |
| 6 | set_rack_load B-05 4 | +0.054 | +0.425 | IT: 152 kW | Total shed: 8 kW (5% of IT load). |
| 7 | wait | −0.052 | +0.373 | Gen LOADED, ATS: GEN | Generator online. Battery recharging. |
| 8 | diagnose GEN-1 | +0.561 | +0.934 | state=loaded | RESOLVED. Gen loaded, temps OK, SOC >10%. Speed bonus: (20−8)/20 = +0.600. |
| Step | SOC (UPS-1) | Event |
|---|---|---|
| 0 | 97% | Utility lost |
| 2 | 95% | Diagnosed |
| 4 | ~90% | Gen starting |
| 6 | ~87% | Load shed |
| 7 | ~88% | Gen loaded, recharging begins |
| 8 | ~89% | Resolved |
| # | Command | Reward | Cumul | Key Metrics | Reasoning |
|---|---|---|---|---|---|
| 1 | check_status | +0.111 | +0.111 | 1 zone, UPS battery | 80 kW IT, 1 zone, 2 CRACs. Less redundancy. |
| 2 | diagnose UPS-1 | +0.133 | +0.244 | SOC=91% | Battery draining faster relative to capacity. Gate unlocked. |
| 3 | start_generator | +0.073 | +0.317 | Gen starting | Explicit start command. |
| 4 | set_rack_load A-05 4 | +0.066 | +0.383 | IT: 76 kW | 5% load reduction. |
| 5 | set_rack_load A-04 4 | +0.056 | +0.438 | IT: 72 kW | 10% load reduction. |
| 6 | set_rack_load A-03 4 | +0.042 | +0.481 | IT: 68 kW | 15% total load shed — more aggressive for smaller facility. |
| 7 | wait | −0.069 | +0.412 | Gen LOADED | Generator online. Battery recharging. |
| 8 | diagnose GEN-1 | +0.537 | +0.948 | state=loaded | RESOLVED. Speed bonus: +0.600. |
| Metric | Default (160 kW) | Small (80 kW) |
|---|---|---|
| Racks shed | 2 (8 kW, 5%) | 3 (12 kW, 15%) |
| Cumulative reward | +0.934 | +0.948 |
The small facility earns slightly higher reward due to more aggressive proportional load shedding, producing a stronger positive signal from the power safety component.
Each affected scenario requires the agent to actually do something before resolution. Without these gates, scenarios A2, A4, and B4 would auto-resolve within 2–3 steps of passive wait commands.
| Scenario | Diagnosis Gate | Min Steps |
|---|---|---|
| A2 | Must diagnose CRAC-3 | ≥ 8 steps |
| A4 | Must diagnose CRAC-1 AND diagnose CRAC-3 | ≥ 8 steps |
| B4 | Must diagnose UPS-* | ≥ 8 steps |
Reward ordering validates the design: fast diagnosis > late diagnosis > no diagnosis (never resolves).
A comprehensive reference for operating the physics-based datacenter simulation. Master thermal management, power systems, and incident response.
Pro tip: Always diagnose before making changes — the reward system gives a bonus for proper diagnostic procedures and penalizes blind interventions.
Six operational scenarios across two categories and three difficulty levels:
| Command | Description | Example |
|---|---|---|
diagnose <unit> |
Inspect a CRAC, UPS, Generator, or PDU for faults and status | diagnose CRAC-3 |
adjust_setpoint <crac> <°C> |
Change CRAC supply air setpoint (10–35°C). Supply temp converges over ~30s. | adjust_setpoint CRAC-1 22 |
set_fan_speed <crac> <%> |
Set CRAC fan speed (0–100%). Fan power follows cubic law. | set_fan_speed CRAC-2 100 |
set_rack_load <rack> <kW> |
Adjust rack IT load (0–30 kW) — simulates workload migration. | set_rack_load B-05 4 |
start_crac <crac> |
Start a standby CRAC unit. | start_crac CRAC-3 |
stop_crac <crac> |
Put a CRAC into standby mode. | stop_crac CRAC-4 |
start_generator |
Initiate diesel generator start sequence (OFF → CRANKING → WARMING → READY → LOADED). | start_generator |
stop_generator |
Initiate generator cooldown sequence (300s). | stop_generator |
set_ups_mode <ups> <mode> |
Set UPS mode: eco, double_conversion, line_interactive, or bypass. |
set_ups_mode UPS-1 eco |
refuel_generator [liters] |
Refuel the generator. Omit liters to fill tank. | refuel_generator 500 |
acknowledge_alarm |
Acknowledge the current alert — clears the alert banner. | acknowledge_alarm |
check_status |
Request full status report. Refreshes the dashboard. | check_status |
escalate |
Escalate to senior engineer. Ends the episode. | escalate |
wait |
Take no action — advances simulation time by one step. | wait |
The environment uses a 6-component, research-informed reward function. Each component is bounded to [−1, 1]. The total reward is a weighted sum, clamped to [−1, 1]. Weights auto-adjust based on scenario type.
Dual softplus barriers at ASHRAE recommended and allowable limits. Violations are penalized smoothly — the closer to the limit, the stronger the gradient. Returns +0.1 baseline when all zones are ≥3°C below recommended max (DCRL-Green).
Penalizes low UPS battery state-of-charge (SOC) via softplus barrier at 50% threshold. UPS fault adds a fixed penalty of 5.0. Compounds across multiple UPS units.
PUE-based energy efficiency. PUE 1.0 (ideal) → 0, PUE 2.0 → −0.46, PUE 3.0 → −0.76. Suppressed to 0 during power emergencies (UPS on battery or fault) so the agent isn't penalized for correct load shedding.
Delta-based: rewards the change in progress. This provides credit assignment — only the action that actually caused forward progress gets rewarded. Each scenario defines a normalized [0, 1] progress metric.
Scenario-defined procedural correctness rules. For example, diagnosing before adjusting setpoints earns a bonus (+0.2), while skipping diagnosis incurs a penalty (−0.1). Encourages proper operational procedures.
Context-aware assessment: −0.5 invalid command, −0.2 repeat (except wait/check_status), +0.3 diagnose/check_status, +0.2 interventions, +0.1 acknowledge, −0.1 escalate. Waiting during generator startup: +0.1.
Weights auto-select based on scenario type. Components sum to 1.0.
All safety thresholds follow ASHRAE TC 9.9, 5th Edition (2021). The recommended range is optimal for equipment longevity. The allowable range permits short-term operation during incidents.
| Class | Recommended | Allowable | Application |
|---|---|---|---|
| A1 | 18–27°C | 15–32°C | Enterprise servers |
| A2 | 18–27°C | 10–35°C | Volume servers (most common) |
| A3 | 18–27°C | 5–40°C | Extended temperature range |
| A4 | 18–27°C | 5–45°C | Maximum flexibility |
| H1 | 18–22°C | 5–25°C | High-density / AI / HPC (GPU servers) |
Key insight: The reward system uses softplus barriers at both recommended and allowable limits. Staying ≥3°C below recommended max yields a +0.1 thermal safety bonus. Exceeding allowable limits incurs 3× the per-degree penalty of recommended violations.
The simulation uses a lumped-capacitance RC thermal network — the standard approach for datacenter transient thermal analysis. Each zone's temperature evolves according to:
Important CRAC characteristics:
UPS quadratic loss model (APC White Paper 108):
Battery discharge: SOC depletes based on load, UPS efficiency, and temperature derating.
ATS (Automatic Transfer Switch) performs mechanical transfer in 100ms. Retransfer delay is 300 seconds to prevent rapid switching.