DC-Ops Operations Console

Select a scenario from the panel to begin a datacenter operations episode. Issue commands and monitor the facility in real-time.

Scenario Demos

Nine verified demo walkthroughs across all scenarios and facility sizes. Each demo resolves in 8–10 steps following proper operational procedure: assess → diagnose → compensate → verify → resolve.

| # | Scenario | Facility | Steps | Reward | Key Skill |
|---|---|---|---|---|---|
| 1 | A1 Setpoint Optimization | Default | 9 | +0.324 | PUE optimization |
| 2 | A2 Thermal Event | Default | 8 | +0.873 | Single-failure response |
| 3 | A2 Thermal Event | Large | 8 | +0.831 | Multi-zone + H1 isolation |
| 4 | A4 CRAC Cascade | Default | 8 | +1.230 | Multi-failure triage |
| 5 | A4 CRAC Cascade | Large | 8 | +1.150 | Multi-zone cascade |
| 6 | B1 UPS Alarm | Default | 8 | +0.512 | Power chain audit |
| 7 | B3 Generator Test | Default | 10 | +0.567 | Protocol compliance |
| 8 | B4 Power Failure | Default | 8 | +0.934 | Battery + gen startup |
| 9 | B4 Power Failure | Small | 8 | +0.948 | Aggressive load shedding |

A1

Cooling Setpoint Optimization

Default Facility · 9 steps · EASY · Cumulative reward: +0.324

All four CRACs are set to 15°C, far below what the servers need. This wastes energy: compressors run hard, fans blow at 100%, and PUE sits at 1.87. The ASHRAE recommended range for class A2 extends up to 27°C inlet temperature. The agent must raise setpoints and reduce fan speeds to approach the PUE target of 1.6 without overheating.

| # | Command | Reward | Cumul | PUE | Inlets A/B (°C) | Reasoning |
|---|---|---|---|---|---|---|
| 1 | check_status | +0.131 | +0.131 | 1.87 | 17.1 / 17.1 | Procedure bonus (+0.2): must check before adjusting. Baseline: all CRACs at 15°C. |
| 2 | adjust_setpoint CRAC-1 24 | +0.047 | +0.178 | 1.80 | 17.7 / 17.1 | Raise from 15→24°C. Compressor works less → immediate PUE drop. |
| 3 | adjust_setpoint CRAC-2 24 | +0.039 | +0.217 | 1.76 | 19.2 / 17.0 | PUE continues falling. Zone A inlets rising, still safely below 27°C. |
| 4 | adjust_setpoint CRAC-3 24 | +0.038 | +0.255 | 1.71 | 20.7 / 17.6 | Zone B responding. Thermal mass (11.1 kJ/K per server) causes gradual warming. |
| 5 | adjust_setpoint CRAC-4 24 | +0.025 | +0.280 | 1.69 | 21.9 / 19.1 | All CRACs at 24°C. PUE 1.69, still above the 1.6 target. Fan reduction needed. |
| 6 | set_fan_speed CRAC-1 70 | −0.019 | +0.261 | 1.68 | 22.8 / 20.6 | Fan power follows cubic law: at 70%, power = 34% of rated (66% saving). |
| 7 | set_fan_speed CRAC-2 70 | −0.017 | +0.244 | 1.66 | 23.4 / 21.9 | PUE 1.66. Inlets 23.4°C, still 3.6°C below the ASHRAE A2 recommended max (27°C). |
| 8 | set_fan_speed CRAC-3 70 | −0.012 | +0.232 | 1.63 | 23.9 / 22.7 | Almost there. System approaching equilibrium. |
| 9 | set_fan_speed CRAC-4 70 | +0.092 | +0.324 | 1.60 | 24.3 / 23.3 | RESOLVED. PUE hits target (≤1.6). Speed bonus: (10−9)/10 = +0.1. |

Why This Works

  • Phase 1 (steps 2–5): Setpoint adjustment. Raising 15→24°C reduces compressor load. PUE drops 1.87→1.69 (10% improvement).
  • Phase 2 (steps 6–9): Fan speed reduction. Cubic fan law means 100→70% cuts fan power by 66%. Pushes PUE from 1.69→1.60.
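
The 66% figure in Phase 2 follows directly from the cubic fan law. A minimal sketch of that arithmetic is shown here; the function name `fan_power_fraction` is illustrative and not taken from the simulator.

```python
def fan_power_fraction(speed_pct: float) -> float:
    """Fraction of rated fan power drawn at a given speed (cubic affinity law)."""
    if not 0.0 <= speed_pct <= 100.0:
        raise ValueError("fan speed must be between 0 and 100 percent")
    return (speed_pct / 100.0) ** 3


# Compare full speed with the 70% setting used in steps 6-9.
for speed in (100, 70):
    fraction = fan_power_fraction(speed)
    print(f"{speed}% speed -> {fraction:.1%} of rated power, {1 - fraction:.1%} saved")
```

At 70% speed the fraction is 0.7³ ≈ 0.343, which the walkthrough rounds to 34% of rated power.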
A2

Thermal Event Response — Default

Default Facility (160 kW) · 8 steps · MEDIUM · Cumulative reward: +0.873

CRAC-3 compressor has failed. With only 3 of 4 CRACs operational, cooling capacity is reduced. The default facility has N+1 redundancy, so temperatures won't spike catastrophically, but the agent must diagnose the fault and compensate to ensure long-term stability.
| # | Command | Reward | Cumul | Inlets A/B (°C) | Reasoning |
|---|---|---|---|---|---|
| 1 | check_status | +0.204 | +0.204 | 19.8 / 20.0 | CRAC-3 shows "!! COMPRESSOR": no supply temp, no airflow, 0 kW. |
| 2 | diagnose CRAC-3 | +0.054 | +0.258 | 19.8 / 20.2 | Unlocks resolution gate. Confirms "FAULT: compressor." +0.3 bonus on subsequent setpoint changes. |
| 3 | diagnose CRAC-1 | −0.021 | +0.237 | 19.8 / 20.4 | Verify remaining CRACs healthy. "No faults detected." |
| 4 | adjust_setpoint CRAC-1 16 | +0.034 | +0.270 | 19.6 / 20.5 | Lower setpoint 18→16°C. Increases cooling output. Earns procedure bonus. |
| 5 | adjust_setpoint CRAC-2 16 | +0.034 | +0.304 | 19.3 / 20.6 | Both zone A CRACs overcooling to compensate. |
| 6 | set_fan_speed CRAC-1 100 | +0.034 | +0.338 | 19.0 / 20.8 | Max airflow on surviving CRACs. |
| 7 | set_fan_speed CRAC-2 100 | +0.034 | +0.372 | 18.7 / 20.8 | Zone B stabilizing at ~20.8°C, within ASHRAE recommended range. |
| 8 | set_fan_speed CRAC-4 100 | +0.501 | +0.873 | 18.5 / 20.9 | RESOLVED. All zones stable for 2+ steps. Speed bonus: (15−8)/15 = +0.467. |
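
The speed bonus quoted in the final step of each walkthrough appears to be the unused fraction of the step budget. The sketch below reproduces that arithmetic; the function name and the clamp at zero are assumptions, and the budgets (10, 15, 20) are read off the bonus expressions in the walkthroughs rather than from the simulator's configuration.

```python
def speed_bonus(steps_used: int, step_budget: int) -> float:
    """Unused fraction of the step budget, as in (budget - steps) / budget."""
    if step_budget <= 0:
        raise ValueError("step budget must be positive")
    # Clamping at zero is an assumption for episodes that overrun the budget.
    return max(0.0, (step_budget - steps_used) / step_budget)


# Budgets and step counts as quoted in the A1, A2 and A4 walkthroughs.
for label, used, budget in (("A1", 9, 10), ("A2", 8, 15), ("A4", 8, 20)):
    print(f"{label}: {speed_bonus(used, budget):+.3f}")
```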
A2

Thermal Event Response — Large

Large Facility (600 kW, H1 zone) · 8 steps · MEDIUM · Cumulative reward: +0.831

Same CRAC-3 failure, but in a larger facility with 4 zones including an H1 (high-density GPU) zone. H1 has a tighter thermal envelope (recommended max 22°C vs 27°C for A2).
| # | Command | Reward | Cumul | Inlets A/B/C/D (°C) | Reasoning |
|---|---|---|---|---|---|
| 1 | check_status | +0.207 | +0.207 | 20.0 / 20.8 / 19.2 / 19.7 | 4-zone dashboard. Zone B 0.8°C warmer. Zone C (H1) at 19.2°C, 2.8°C below its 22°C max. |
| 2 | diagnose CRAC-3 | +0.057 | +0.263 | 20.0 / 21.6 / 19.2 / 19.7 | FAULT: compressor. Zone B rising +0.8°C/step. Zone C stable. |
| 3 | diagnose CRAC-1 | −0.019 | +0.245 | 20.0 / 22.3 / 19.2 / 19.6 | Confirm CRAC-1 healthy. |
| 4 | diagnose CRAC-2 | −0.019 | +0.225 | 20.0 / 23.0 / 19.2 / 19.6 | Confirm CRAC-2 healthy. Thorough zone B CRAC audit. |
| 5 | adjust_setpoint CRAC-2 16 | +0.035 | +0.261 | 19.9 / 23.6 / 19.2 / 19.6 | Lower surviving zone B CRAC. Procedure bonus earned. |
| 6 | adjust_setpoint CRAC-4 16 | +0.035 | +0.296 | 19.7 / 24.0 / 19.2 / 19.6 | Rate of rise slowing. H1 zone unaffected. |
| 7 | set_fan_speed CRAC-2 100 | +0.035 | +0.330 | 19.6 / 24.3 / 19.2 / 19.6 | Max airflow. Zone B 2.7°C below the ASHRAE A2 recommended max. |
| 8 | set_fan_speed CRAC-4 100 | +0.501 | +0.831 | 19.5 / 24.6 / 19.2 / 19.6 | RESOLVED. Zone B stabilized. H1 completely unaffected. |

Default vs Large Comparison

| Metric | Default | Large |
|---|---|---|
| Max inlet temp | 20.9°C | 24.6°C |
| H1 zone impact | N/A | None (19.2°C) |
| Cumulative reward | +0.873 | +0.831 |
A4

CRAC Failure Cascade — Default

Default Facility (160 kW) · 8 steps · HARD · Cumulative reward: +1.230

Two CRACs fail simultaneously: CRAC-1 (compressor) and CRAC-3 (fan). Only CRAC-2 and CRAC-4 remain — 50% cooling capacity lost. The agent must diagnose both failures, aggressively compensate, and consider load shedding.
| # | Command | Reward | Cumul | Inlets A/B (°C) | Reasoning |
|---|---|---|---|---|---|
| 1 | check_status | +0.212 | +0.212 | 19.9 / 19.9 | Two red-flagged CRACs. 50% cooling capacity lost. |
| 2 | diagnose CRAC-1 | +0.062 | +0.274 | 20.0 / 20.0 | "FAULT: compressor." First of two required diagnoses. |
| 3 | diagnose CRAC-3 | +0.027 | +0.301 | 20.1 / 20.1 | "FAULT: fan." Both diagnosed → resolution gate unlocked. +0.2 procedure bonus. |
| 4 | adjust_setpoint CRAC-2 16 | +0.062 | +0.363 | 20.2 / 20.2 | Lower surviving CRAC setpoint. Procedure bonus (diagnosis first). |
| 5 | adjust_setpoint CRAC-4 16 | +0.062 | +0.425 | 20.2 / 20.3 | Both survivors at 16°C. Temps stabilizing. |
| 6 | set_fan_speed CRAC-2 100 | +0.062 | +0.487 | 20.1 / 20.2 | Max airflow confirmed. |
| 7 | set_fan_speed CRAC-4 100 | +0.062 | +0.549 | 20.1 / 20.2 | Temps flat at ~20.1°C. Stable consecutive steps. |
| 8 | set_rack_load B-05 4 | +0.681 | +1.230 | 20.0 / 20.1 | RESOLVED. Load shed for thermal margin. Speed bonus: (20−8)/20 = +0.600. |

Why Load Shedding Matters

  • Reducing rack B-05 from 8 kW to 4 kW provides additional thermal margin
  • Demonstrates workload migration capability
  • Earns action quality bonus (+0.2 for interventions)
A4

CRAC Failure Cascade — Large

Large Facility (600 kW, H1 zone) · 8 steps · HARD · Cumulative reward: +1.150

CRAC-1 and CRAC-3 down out of 8 CRACs. Zones C/D (including H1) have dedicated CRACs and remain unaffected. Zones A/B share the affected CRACs.
| # | Command | Reward | Cumul | Inlets A/B/C/D (°C) | Reasoning |
|---|---|---|---|---|---|
| 1 | check_status | +0.210 | +0.210 | 20.4 / 20.4 / 19.2 / 19.7 | CRAC-1 and CRAC-3 down. Zones C/D have own CRACs. |
| 2 | diagnose CRAC-1 | +0.059 | +0.269 | 20.8 / 20.8 / 19.2 / 19.7 | FAULT: compressor. Zone A/B rising ~0.4°C/step. |
| 3 | diagnose CRAC-3 | +0.024 | +0.293 | 21.2 / 21.2 / 19.2 / 19.7 | FAULT: fan. Both diagnosed → gate unlocked. |
| 4 | diagnose CRAC-2 | +0.024 | +0.317 | 21.6 / 21.6 / 19.2 / 19.7 | Verify CRAC-2 healthy. Critical with 2 failures. |
| 5 | adjust_setpoint CRAC-2 16 | +0.059 | +0.376 | 21.9 / 22.0 / 19.2 / 19.6 | Compensate. Zone B at 22.0°C, 13°C below ASHRAE allowable max (35°C). |
| 6 | adjust_setpoint CRAC-4 16 | +0.058 | +0.434 | 22.2 / 22.3 / 19.2 / 19.6 | Rate of rise slowing. H1 zone unaffected. |
| 7 | set_fan_speed CRAC-2 100 | +0.058 | +0.492 | 22.5 / 22.5 / 19.2 / 19.6 | Max airflow. Zone B stabilizing. |
| 8 | set_fan_speed CRAC-4 100 | +0.658 | +1.150 | 22.7 / 22.8 / 19.2 / 19.6 | RESOLVED. All zones within allowable. Speed bonus: +0.600. |

Default vs Large Comparison

| Metric | Default | Large |
|---|---|---|
| Final zone B inlet | 20.1°C | 22.8°C |
| H1 zone impact | N/A | None (19.2°C) |
| Cumulative reward | +1.230 | +1.150 |
B1

UPS Alarm Response

Default Facility (160 kW) · 8 steps · MEDIUM · Cumulative reward: +0.512

A brief utility dip caused UPS-1 to transfer to battery. Utility has been restored and UPS switched back to double-conversion mode, but the alarm persists. The agent must investigate the entire power chain and acknowledge the alarm.
| # | Command | Reward | Cumul | Reasoning |
|---|---|---|---|---|
| 1 | check_status | −0.007 | −0.007 | Baseline. Utility NORMAL, generator OFF, ATS on UTILITY. |
| 2 | diagnose UPS-1 | +0.143 | +0.137 | Key step. mode=double_conversion, SOC=86%. Resolution gate requires this. |
| 3 | diagnose UPS-2 | −0.007 | +0.130 | Verify redundant UPS. mode=double_conversion, SOC=87%. |
| 4 | diagnose GEN-1 | −0.007 | +0.123 | Generator in standby; confirming readiness. |
| 5 | diagnose PDU-A-01 | −0.007 | +0.117 | Verify power distribution intact. |
| 6 | check_status | −0.007 | +0.110 | Re-verify before closing incident. |
| 7 | diagnose PDU-B-01 | −0.007 | +0.103 | Complete the power chain audit. |
| 8 | acknowledge_alarm | +0.408 | +0.512 | RESOLVED. Alarm acknowledged after thorough investigation. Speed bonus: (10−8)/10 = +0.200. |

Reward Structure Note

Steps 1 and 3–7 each return −0.007 because the delta-based progress metric does not change during investigation; only the final acknowledgment completes progress. The cumulative reward is still positive (+0.512) thanks to the large reward on the resolution step.

B3

Generator Test Protocol

Default Facility (160 kW) · 10 steps · EASY · Cumulative reward: +0.567

Routine monthly generator test. No fault, no emergency — the agent must follow the 5-step protocol: check → start → verify → stop → acknowledge.
| # | Command | Reward | Cumul | Gen State | Reasoning |
|---|---|---|---|---|---|
| 1 | check_status | −0.007 | −0.007 | OFF | Baseline. All systems normal. |
| 2 | diagnose GEN-1 | −0.007 | −0.013 | off | Pre-test inspection: verify before starting. |
| 3 | start_generator | +0.113 | +0.100 | CRANKING | Generator start sequence initiated. |
| 4 | wait | −0.022 | +0.078 | LOADED | Let generator complete warmup. |
| 5 | diagnose GEN-1 | +0.043 | +0.122 | ready | Critical verification. Confirms generator running properly. |
| 6 | check_status | −0.007 | +0.115 | LOADED | Full dashboard confirms generator loaded. |
| 7 | stop_generator | +0.113 | +0.228 | COOLDOWN | Initiate cooldown (300s for turbocharger). |
| 8 | wait | −0.022 | +0.207 | COOLDOWN | Allow cooldown to proceed. |
| 9 | diagnose GEN-1 | −0.032 | +0.175 | cooldown | Post-shutdown inspection. |
| 10 | acknowledge_alarm | +0.392 | +0.567 | COOLDOWN | RESOLVED. Protocol complete. Speed bonus: (15−10)/15 = +0.333. |

Protocol Enforcement

B3 tracks four internal flags that must be set in order:

  • _started: start_generator issued
  • _verified: diagnose GEN-1 while generator is running
  • _stopped: stop_generator (only if started + verified)
  • _completed: acknowledge_alarm (only if stopped)

The agent cannot skip steps — issuing stop_generator before diagnose GEN-1 won't set _stopped.
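
The ordering rule can be pictured as a small tracker in which each flag is only set when its predecessors are already set. The following is a hypothetical sketch, not the simulator's code: the class `GeneratorTestProtocol`, the method `observe`, and the `generator_running` argument are invented for illustration, and the field names mirror the four flags without the leading underscore.

```python
from dataclasses import dataclass


@dataclass
class GeneratorTestProtocol:
    """Ordered flags for the B3 generator test (illustrative only)."""

    started: bool = False
    verified: bool = False
    stopped: bool = False
    completed: bool = False

    def observe(self, command: str, generator_running: bool = False) -> None:
        """Update flags for one command; out-of-order commands change nothing."""
        if command == "start_generator":
            self.started = True
        elif command == "diagnose GEN-1" and self.started and generator_running:
            self.verified = True
        elif command == "stop_generator" and self.started and self.verified:
            self.stopped = True
        elif command == "acknowledge_alarm" and self.stopped:
            self.completed = True


# Skipping verification: stop_generator does not set the stopped flag.
skipped = GeneratorTestProtocol()
for cmd in ("start_generator", "stop_generator", "acknowledge_alarm"):
    skipped.observe(cmd)
print(skipped.stopped, skipped.completed)  # False False

# Full protocol in the required order.
full = GeneratorTestProtocol()
full.observe("start_generator")
full.observe("diagnose GEN-1", generator_running=True)
full.observe("stop_generator")
full.observe("acknowledge_alarm")
print(full.completed)  # True
```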

B4

Power Failure Cascade — Default

Default Facility (160 kW) · 8 steps · HARD · Cumulative reward: +0.934

Total utility power loss. UPS batteries bridging while generator starts. Generator warmup extended (15s vs default 8s). Agent must manage battery life, consider load shedding, and verify generator operation.
| # | Command | Reward | Cumul | Key Metrics | Reasoning |
|---|---|---|---|---|---|
| 1 | check_status | +0.108 | +0.108 | UPS battery, SOC ~97% | Utility LOST. ATS transferring. Generator auto-starting. |
| 2 | diagnose UPS-1 | +0.131 | +0.239 | on_battery, SOC=95% | Resolution gate unlocked. Battery draining ~2%/step at 160 kW. |
| 3 | diagnose UPS-2 | +0.078 | +0.317 | on_battery, SOC=90% | Redundant UPS also on battery. |
| 4 | start_generator | −0.007 | +0.310 | Gen: CRANKING | Gen already auto-starting; slight negative for redundant command. |
| 5 | set_rack_load A-05 4 | +0.062 | +0.371 | IT: 156 kW | Shed 4 kW to extend battery life. |
| 6 | set_rack_load B-05 4 | +0.054 | +0.425 | IT: 152 kW | Total shed: 8 kW (5% of IT load). |
| 7 | wait | −0.052 | +0.373 | Gen LOADED, ATS: GEN | Generator online. Battery recharging. |
| 8 | diagnose GEN-1 | +0.561 | +0.934 | state=loaded | RESOLVED. Gen loaded, temps OK, SOC >10%. Speed bonus: (20−8)/20 = +0.600. |

Battery SOC Timeline

| Step | SOC (UPS-1) | Event |
|---|---|---|
| 0 | 97% | Utility lost |
| 2 | 95% | Diagnosed |
| 4 | ~90% | Gen starting |
| 6 | ~87% | Load shed |
| 7 | ~88% | Gen loaded, recharging begins |
| 8 | ~89% | Resolved |
B4

Power Failure Cascade — Small

Small Facility (80 kW) · 8 steps · HARD · Cumulative reward: +0.948

Same power loss scenario in a smaller facility: 80 kW IT load, 1 zone, 2 CRACs. Less redundancy means more aggressive load shedding is needed.
| # | Command | Reward | Cumul | Key Metrics | Reasoning |
|---|---|---|---|---|---|
| 1 | check_status | +0.111 | +0.111 | 1 zone, UPS battery | 80 kW IT, 1 zone, 2 CRACs. Less redundancy. |
| 2 | diagnose UPS-1 | +0.133 | +0.244 | SOC=91% | Battery draining faster relative to capacity. Gate unlocked. |
| 3 | start_generator | +0.073 | +0.317 | Gen starting | Explicit start command. |
| 4 | set_rack_load A-05 4 | +0.066 | +0.383 | IT: 76 kW | 5% load reduction. |
| 5 | set_rack_load A-04 4 | +0.056 | +0.438 | IT: 72 kW | 10% load reduction. |
| 6 | set_rack_load A-03 4 | +0.042 | +0.481 | IT: 68 kW | 15% total load shed, more aggressive for the smaller facility. |
| 7 | wait | −0.069 | +0.412 | Gen LOADED | Generator online. Battery recharging. |
| 8 | diagnose GEN-1 | +0.537 | +0.948 | state=loaded | RESOLVED. Speed bonus: +0.600. |

Small vs Default Comparison

| Metric | Default (160 kW) | Small (80 kW) |
|---|---|---|
| Racks shed | 2 (8 kW, 5%) | 3 (12 kW, 15%) |
| Cumulative reward | +0.934 | +0.948 |

The small facility earns slightly higher reward due to more aggressive proportional load shedding, producing a stronger positive signal from the power safety component.

Resolution Gate Design

Each affected scenario requires the agent to actually do something before resolution. Without these gates, scenarios A2, A4, and B4 would auto-resolve within 2–3 steps of passive wait commands.

| Scenario | Diagnosis Gate | Min Steps |
|---|---|---|
| A2 | Must diagnose CRAC-3 | ≥ 8 steps |
| A4 | Must diagnose CRAC-1 AND diagnose CRAC-3 | ≥ 8 steps |
| B4 | Must diagnose UPS-* | ≥ 8 steps |

Reward ordering validates the design: fast diagnosis > late diagnosis > no diagnosis (never resolves).
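
A diagnosis gate of this kind can be expressed as a set of required targets checked against the units the agent has diagnosed so far. The sketch below is an assumption about how such a check might look: `DIAGNOSIS_GATES` and `gate_unlocked` are hypothetical names, `UPS-*` is interpreted as a wildcard matching any UPS unit, and the minimum-step requirement is not modeled.

```python
from fnmatch import fnmatchcase

# Required diagnosis targets per scenario, taken from the gate table.
# Treating "UPS-*" as a shell-style wildcard is an assumption.
DIAGNOSIS_GATES = {
    "A2": ["CRAC-3"],
    "A4": ["CRAC-1", "CRAC-3"],
    "B4": ["UPS-*"],
}


def gate_unlocked(scenario: str, diagnosed_units: set[str]) -> bool:
    """True once every required pattern matches at least one diagnosed unit."""
    patterns = DIAGNOSIS_GATES.get(scenario, [])
    return all(
        any(fnmatchcase(unit, pattern) for unit in diagnosed_units)
        for pattern in patterns
    )


print(gate_unlocked("A4", {"CRAC-1"}))            # False: CRAC-3 still undiagnosed
print(gate_unlocked("A4", {"CRAC-1", "CRAC-3"}))  # True
print(gate_unlocked("B4", {"UPS-2"}))             # True: any UPS satisfies UPS-*
```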

DC-Ops Operations Guide

A comprehensive reference for operating the physics-based datacenter simulation. Master thermal management, power systems, and incident response.

Getting Started

  1. Select a scenario from the sidebar — each presents a unique datacenter challenge.
  2. Choose a facility config (Default 160 kW, Small 80 kW, or Large 600 kW).
  3. Click Start to begin the episode. You'll see the NOC dashboard.
  4. Issue commands in the command bar — diagnose equipment, adjust setpoints, manage power.
  5. Each command advances simulation time. You have a limited step budget.
  6. Maximize your cumulative reward by resolving the scenario efficiently.

Pro tip: Always diagnose before making changes — the reward system gives a bonus for proper diagnostic procedures and penalizes blind interventions.

Scenarios

Six operational scenarios across two categories and three difficulty levels:

Thermal (Category A)

A1 Easy
Cooling Setpoint Optimization
CRACs are overcooling at 15°C, wasting energy. Optimize setpoints for efficiency while keeping all zones within ASHRAE recommended range (18–27°C).
Strategy: Raise setpoints to ~22°C. Monitor temps. Target PUE < 1.6. Check that all zones stay in recommended range for 2+ steps.

A2 Medium
Thermal Event Response
CRAC-3 compressor failure. Zone B temps are rising. Diagnose the fault and redistribute cooling to stabilize all zones.
Strategy: Diagnose CRAC-3 first. Lower setpoints on remaining CRACs. Boost fan speeds. Keep all zones in recommended range for 2+ steps.

A4 Hard
CRAC Failure Cascade
CRAC-1 compressor failure and CRAC-3 fan failure simultaneously. A cascading thermal event threatens multiple zones.
Strategy: Diagnose both CRACs. Aggressively lower setpoints on CRAC-2/4. Max fan speeds. Consider load shedding on hot racks. Keep zones in allowable range.

Power (Category B)

B1 Medium
UPS Alarm Response
UPS transferred to battery after a utility event (now restored). Diagnose the situation and acknowledge the alarm to resolve.
Strategy: Diagnose UPS-1 first. Verify utility is restored. Acknowledge the alarm. The UPS should return to normal operation.

B3 Easy
Generator Test Protocol
Routine monthly generator test. Follow the proper 5-step protocol: diagnose → start → verify → stop → confirm shutdown.
Strategy: 1. diagnose GEN-1 → 2. start_generator → 3. wait (let it warm) → 4. diagnose GEN-1 (verify running) → 5. stop_generator

B4 Hard
Power Failure Cascade
Utility power lost with extended generator warmup. UPS running on battery. Manage battery life and thermal conditions until generator loads.
Strategy: Start generator immediately. Shed non-critical rack loads to preserve battery. Monitor SOC. Once generator loads, restore loads. Keep temps stable.

Available Commands

| Command | Description | Example |
|---|---|---|
| diagnose <unit> | Inspect a CRAC, UPS, Generator, or PDU for faults and status | diagnose CRAC-3 |
| adjust_setpoint <crac> <°C> | Change CRAC supply air setpoint (10–35°C). Supply temp converges over ~30s. | adjust_setpoint CRAC-1 22 |
| set_fan_speed <crac> <%> | Set CRAC fan speed (0–100%). Fan power follows cubic law. | set_fan_speed CRAC-2 100 |
| set_rack_load <rack> <kW> | Adjust rack IT load (0–30 kW); simulates workload migration. | set_rack_load B-05 4 |
| start_crac <crac> | Start a standby CRAC unit. | start_crac CRAC-3 |
| stop_crac <crac> | Put a CRAC into standby mode. | stop_crac CRAC-4 |
| start_generator | Initiate diesel generator start sequence (OFF → CRANKING → WARMING → READY → LOADED). | start_generator |
| stop_generator | Initiate generator cooldown sequence (300s). | stop_generator |
| set_ups_mode <ups> <mode> | Set UPS mode: eco, double_conversion, line_interactive, or bypass. | set_ups_mode UPS-1 eco |
| refuel_generator [liters] | Refuel the generator. Omit liters to fill tank. | refuel_generator 500 |
| acknowledge_alarm | Acknowledge the current alert; clears the alert banner. | acknowledge_alarm |
| check_status | Request full status report. Refreshes the dashboard. | check_status |
| escalate | Escalate to senior engineer. Ends the episode. | escalate |
| wait | Take no action; advances simulation time by one step. | wait |
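
For scripting against the console, it can help to range-check the three bounded commands before sending them. The sketch below is a hypothetical client-side helper, not part of the console: `LIMITS` and `parse_command` are invented names, and only the numeric ranges are taken from the command table.

```python
# Numeric limits copied from the command table; everything else is assumed.
LIMITS = {
    "adjust_setpoint": (10.0, 35.0),  # supply air setpoint, °C
    "set_fan_speed": (0.0, 100.0),    # fan speed, %
    "set_rack_load": (0.0, 30.0),     # rack IT load, kW
}


def parse_command(line: str) -> tuple[str, list[str]]:
    """Split a command line and range-check the three bounded commands."""
    parts = line.split()
    if not parts:
        raise ValueError("empty command")
    name, args = parts[0], parts[1:]
    if name in LIMITS:
        if len(args) != 2:
            raise ValueError(f"{name} expects a target and a value")
        low, high = LIMITS[name]
        value = float(args[1])  # raises ValueError for non-numeric input
        if not low <= value <= high:
            raise ValueError(f"{name} value {value} is outside {low} to {high}")
    return name, args


print(parse_command("adjust_setpoint CRAC-1 22"))
print(parse_command("diagnose CRAC-3"))
```

Commands without a documented numeric range pass through unchanged, so the helper does not reject anything the table does not explicitly bound.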

Reward System

The environment uses a 6-component, research-informed reward function. Each component is bounded to [−1, 1]. The total reward is a weighted sum, clamped to [−1, 1]. Weights auto-adjust based on scenario type.

🌡️ Thermal Safety [−1, +0.1]

Dual softplus barriers at ASHRAE recommended and allowable limits. Violations are penalized smoothly — the closer to the limit, the stronger the gradient. Returns +0.1 baseline when all zones are ≥3°C below recommended max (DCRL-Green).

penalty = softplus((T − T_rec) / 2.0) + 3.0 · softplus((T − T_allow) / 1.5)
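
To get a feel for the shape of the dual barrier, the formula can be evaluated directly. This is a minimal sketch of the raw penalty only: how the simulator maps it into the [−1, +0.1] range is not shown here, the function names are illustrative, and the default limits are the class A2 values from the ASHRAE table in this guide.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def thermal_penalty(t_inlet: float, t_rec: float = 27.0, t_allow: float = 35.0) -> float:
    """Raw dual-barrier penalty for one inlet temperature in °C."""
    return softplus((t_inlet - t_rec) / 2.0) + 3.0 * softplus((t_inlet - t_allow) / 1.5)


# The penalty grows smoothly as the inlet approaches and passes each limit.
for temp in (20.0, 24.0, 27.0, 31.0, 35.0):
    print(f"{temp:.0f} °C -> penalty {thermal_penalty(temp):.3f}")
```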

⚡ Power Safety [−1, 0]

Penalizes low UPS battery state-of-charge (SOC) via softplus barrier at 50% threshold. UPS fault adds a fixed penalty of 5.0. Compounds across multiple UPS units.

penalty = softplus((0.5 − SOC) / 0.15) + 5.0 · [fault]

📊 Efficiency [−1, 0]

PUE-based energy efficiency. PUE 1.0 (ideal) → 0, PUE 2.0 → −0.46, PUE 3.0 → −0.76. Suppressed to 0 during power emergencies (UPS on battery or fault) so the agent isn't penalized for correct load shedding.

reward = −tanh((PUE − 1.0) / 2.0)
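
The quoted values can be checked by evaluating the formula. In this sketch the function name and the `power_emergency` flag are illustrative; the flag stands in for the "UPS on battery or fault" condition described above.

```python
import math


def efficiency_reward(pue: float, power_emergency: bool = False) -> float:
    """PUE-based efficiency term; zero while a power emergency is active."""
    if power_emergency:
        return 0.0
    return -math.tanh((pue - 1.0) / 2.0)


# Includes the A1 start (1.87) and target (1.6) alongside the reference points.
for pue in (1.6, 1.87, 2.0, 3.0):
    print(f"PUE {pue:.2f} -> {efficiency_reward(pue):+.3f}")
```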

🎯 Scenario Progress [−1, +1]

Delta-based: rewards the change in progress. This provides credit assignment — only the action that actually caused forward progress gets rewarded. Each scenario defines a normalized [0, 1] progress metric.

reward = progress_now − progress_prev
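
A short sketch of the delta rule, using a made-up progress trajectory, shows why steps that leave progress unchanged contribute nothing from this component. The function name and the trajectory are hypothetical.

```python
def progress_rewards(progress: list[float]) -> list[float]:
    """Per-step rewards as the change in a normalized [0, 1] progress metric."""
    return [now - prev for prev, now in zip(progress, progress[1:])]


# Hypothetical episode: two steps with no progress, then resolution.
trajectory = [0.0, 0.0, 0.0, 1.0]
print(progress_rewards(trajectory))  # [0.0, 0.0, 1.0]
```

Because the differences telescope, the component's total over an episode equals final progress minus initial progress, regardless of how many steps were taken.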

📋 Procedure [−1, +1]

Scenario-defined procedural correctness rules. For example, diagnosing before adjusting setpoints earns a bonus (+0.2), while skipping diagnosis incurs a penalty (−0.1). Encourages proper operational procedures.

reward = scenario.procedure_reward (clamped)

🎮 Action Quality [−1, +1]

Context-aware assessment: −0.5 invalid command, −0.2 repeat (except wait/check_status), +0.3 diagnose/check_status, +0.2 interventions, +0.1 acknowledge, −0.1 escalate. Waiting during generator startup: +0.1.

Heuristic scoring per action type + context

Weight Profiles

Weights auto-select based on scenario type. Components sum to 1.0.

🏛

ASHRAE Thermal Guidelines

All safety thresholds follow ASHRAE TC 9.9, 5th Edition (2021). The recommended range is optimal for equipment longevity. The allowable range permits short-term operation during incidents.

| Class | Recommended | Allowable | Application |
|---|---|---|---|
| A1 | 18–27°C | 15–32°C | Enterprise servers |
| A2 | 18–27°C | 10–35°C | Volume servers (most common) |
| A3 | 18–27°C | 5–40°C | Extended temperature range |
| A4 | 18–27°C | 5–45°C | Maximum flexibility |
| H1 | 18–22°C | 5–25°C | High-density / AI / HPC (GPU servers) |

Key insight: The reward system uses softplus barriers at both recommended and allowable limits. Staying ≥3°C below recommended max yields a +0.1 thermal safety bonus. Exceeding allowable limits incurs 3× the per-degree penalty of recommended violations.

Physics Engine

Thermal Model — RC Network

The simulation uses a lumped-capacitance RC thermal network — the standard approach for datacenter transient thermal analysis. Each zone's temperature evolves according to:

C_total · dT/dt = Q_IT − Q_cooling + Q_envelope + Q_internal

Where:
  C_total    = C_air + C_equipment (dominated by server thermal mass)
  Q_IT       = Σ rack IT loads [W] (all electrical power converts to heat)
  Q_cooling  = Σ CRAC outputs [W] (capacity varies with return air temp)
  Q_envelope = (T_outside − T_zone) / R_envelope [W]
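
One way to integrate this balance is a single explicit Euler step per simulation tick. The sketch below assumes that scheme; the simulator's actual integrator, the function name, and the numeric values in the example are all illustrative.

```python
def step_zone_temperature(
    t_zone: float,       # current zone temperature [°C]
    q_it: float,         # total rack IT load [W]
    q_cooling: float,    # total CRAC cooling output [W]
    t_outside: float,    # outside temperature [°C]
    r_envelope: float,   # envelope thermal resistance [K/W]
    c_total: float,      # lumped thermal capacitance [J/K]
    dt: float,           # time step [s]
    q_internal: float = 0.0,  # other internal gains [W]
) -> float:
    """Advance one zone by dt using explicit Euler on the RC balance."""
    q_envelope = (t_outside - t_zone) / r_envelope
    dT_dt = (q_it - q_cooling + q_envelope + q_internal) / c_total
    return t_zone + dT_dt * dt


# Illustrative numbers: 20 kW cooling shortfall, 2 MJ/K capacitance, 10 s step.
t_next = step_zone_temperature(
    t_zone=20.0, q_it=80_000.0, q_cooling=60_000.0,
    t_outside=20.0, r_envelope=0.01, c_total=2_000_000.0, dt=10.0,
)
print(f"{t_next:.2f} °C")
```

With these inputs the zone warms by 0.1°C per step, which illustrates how a large thermal capacitance turns a sizeable cooling shortfall into a gradual temperature rise.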

Important CRAC characteristics:

  • Capacity vs. return temp: Q_actual = Q_rated × [1 + 0.03 × (T_return − T_rated)], so capacity increases when a zone heats up
  • Fan power: Cubic law (affinity laws): P_fan = P_rated × (speed / 100)³, with speed in percent
  • Supply temp lag: 30-second time constant between setpoint change and actual supply temp
  • Recirculation: Hot air mixing caused by dominant airflow imbalance
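
The capacity and supply-lag relations in the list above can be sketched as two small functions. The names, the rated return temperature of 24°C in the example, and the exponential discretization of the 30-second lag are assumptions for illustration.

```python
import math


def crac_capacity(q_rated: float, t_return: float, t_rated: float) -> float:
    """Cooling capacity [W], rising 3% per °C of return air above the rated point."""
    return q_rated * (1.0 + 0.03 * (t_return - t_rated))


def supply_temp_step(t_supply: float, setpoint: float, dt: float, tau: float = 30.0) -> float:
    """First-order lag of supply temperature toward a constant setpoint."""
    alpha = 1.0 - math.exp(-dt / tau)
    return t_supply + alpha * (setpoint - t_supply)


# A 40 kW unit rated at 24°C return air, operating with 29°C return air.
print(f"{crac_capacity(40_000.0, 29.0, 24.0):.0f} W")

# Supply temperature 30 s after raising the setpoint from 15°C to 24°C.
print(f"{supply_temp_step(15.0, 24.0, dt=30.0):.2f} °C")
```

After one time constant the supply temperature has covered about 63% of the gap to the new setpoint, which is why setpoint changes in the walkthroughs take several steps to show up in inlet temperatures.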

Power Model

UPS quadratic loss model (APC White Paper 108):

η(x) = x / (x + 0.013 + 0.006x + 0.011x²)

  90.5% efficient at 25% load
  93.6% efficient at 50% load
  94.0% efficient at 75% load

Battery discharge: SOC depletes based on load, UPS efficiency, and temperature derating.

Generator State Machine

OFF ─→ START_DELAY (4s) ─→ CRANKING (5s) ─→ WARMING (8s) ─→ READY ─→ LOADED
                                                                       ↓
                                                               COOLDOWN (300s) ─→ OFF
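
The timed part of the start sequence can be sketched as a lookup over stage durations. This is an illustrative model only: the names are hypothetical, the READY to LOADED transition is treated as untimed, and the `warming_s` parameter is included so the extended 15-second warmup of scenario B4 can be compared with the default.

```python
# Timed stages between OFF and READY, in seconds (from the state diagram).
START_SEQUENCE = [("START_DELAY", 4.0), ("CRANKING", 5.0), ("WARMING", 8.0)]


def state_after_start(elapsed_s: float, warming_s: float = 8.0) -> str:
    """Generator state elapsed_s seconds after the start sequence begins."""
    remaining = elapsed_s
    for state, duration in START_SEQUENCE:
        if state == "WARMING":
            duration = warming_s
        if remaining < duration:
            return state
        remaining -= duration
    return "READY"  # READY -> LOADED is assumed to follow once load transfers


def time_to_ready(warming_s: float = 8.0) -> float:
    """Total seconds from the start command until READY."""
    return sum(warming_s if state == "WARMING" else d for state, d in START_SEQUENCE)


print(time_to_ready())                  # default warmup
print(time_to_ready(warming_s=15.0))    # extended warmup as in B4
print(state_after_start(10.0))
```

Under these durations the default sequence reaches READY after 17 seconds, and the extended B4 warmup after 24 seconds.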

The ATS (Automatic Transfer Switch) performs mechanical transfer in 100 ms. The retransfer delay is 300 seconds to prevent rapid switching.

📚

Research Foundation

  • Google/DeepMind (2017): Demonstrated 40% cooling energy reduction using RL with softplus barrier functions for safety constraints.
  • DCRL-Green (ICLR 2025): Multi-objective reward with softplus barriers and positive safe-state baseline for safe RL in datacenters.
  • ASHRAE TC 9.9, 5th Edition (2021): Industry-standard thermal guidelines used for all safety thresholds.
  • APC White Paper 108: UPS quadratic loss model with experimentally calibrated coefficients.
  • Process Reward Models: Delta-based progress rewards for improved credit assignment in multi-step reasoning.