From Power Grid to Inference

A team learning guide

AI is not magic. It is electricity, linear algebra, and scale — disciplined into prediction. This guide walks your team through every layer of the system, from a 130-megawatt facility to the softmax function that selects the next token.


What is AI, really?

When you ask ChatGPT a question and it answers, it feels like thinking. It isn't. What's actually happening is a very fast, very expensive math problem: the system has read billions of sentences and learned statistical patterns between words. Given a prompt, it predicts the most probable next word, then the next, then the next — hundreds of times per second.

That "prediction engine" is called a neural network. Think of it as a massive spreadsheet with billions of adjustable numbers (called parameters). During training, the system reads data, makes predictions, checks its mistakes, and tweaks those numbers to be slightly less wrong next time. Do this trillions of times and the spreadsheet gets eerily good at predicting language, images, code, or music.

Step 1 — Training

Feed the network terabytes of text. It reads, predicts, gets corrected, and adjusts billions of parameters over weeks or months.

Step 2 — The Model

After training, you have a giant file of tuned parameters. This is the AI: a frozen snapshot of everything it learned. GPT-4 is reported, though not officially confirmed, to have ~1.8 trillion parameters.

Step 3 — Inference

When you ask it a question, the model runs those parameters against your prompt to generate one token (word-piece) at a time. This is called inference.
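As a toy illustration of that last step, the sketch below turns a handful of raw scores (logits) into probabilities with softmax and picks the most probable token. The four-word vocabulary and the scores are made up for illustration, and greedy selection is only one of several decoding strategies; real models score tens of thousands of tokens and usually sample.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and made-up scores for the prompt "The sky is".
vocab = ["blue", "green", "falling", "the"]
logits = [4.0, 1.5, 0.5, -1.0]

probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]  # greedy choice: most probable token
print(next_token, [round(p, 3) for p in probs])
```

Inference repeats this step once per generated token, appending each choice to the prompt before predicting the next one.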

Why does AI need a building?

The math behind AI is simple — multiply matrices and add biases — but the scale is staggering. Training a frontier model means multiplying trillions of numbers together, billions of times. Your laptop's processor can't do this in a useful timeframe.

So companies pack thousands of specialised chips called GPUs into warehouse-sized buildings called data centers. These chips draw enormous amounts of electricity, and all that electricity becomes heat. Cooling that heat requires industrial-scale plumbing — chilled water loops, cooling towers, sometimes even outdoor ponds. A single training cluster can draw as much power as a small city.

Mental model: Imagine you need to hand-multiply a million spreadsheets per second, each one thousands of rows long. You'd need a lot of calculators (GPUs), a lot of desks (servers), a huge building (datacenter), and enough electricity and air conditioning to keep everyone working without overheating. That is the physical reality of AI.

The AI Factory

NVIDIA uses a deliberate term for these facilities: AI factory. Not "data center" but factory. A traditional data center stores and serves data. An AI factory manufactures intelligence. Raw data enters one end; trained models and inference tokens come out the other. The primary product is tokens, and throughput is the measure of output, just like any manufacturing operation.

This framing changes how you think about every layer of the stack. The facility isn't just housing for servers — it's an industrial production line that manages the entire AI lifecycle:

Input: Raw Data

Data pipelines ingest, clean, and structure trillions of tokens from unstructured text, images, code, and sensor data. Data quality directly determines model quality — garbage in, garbage out, but at a trillion-token scale.

Output: Intelligence (Tokens)

Trained models generate predictions, decisions, and content in real time. Inference outputs feed back into the system as a data flywheel — improving model accuracy over time. Every token generated creates new training signal.

Full-Stack Infrastructure

GPUs, NVLink/NVSwitch fabrics, InfiniBand networking, parallel storage, liquid cooling — plus the software stack: CUDA, TensorRT, NIM microservices. Hardware and software designed as a single integrated system.

Digital Twins

NVIDIA Omniverse lets teams design, simulate, and optimize the entire facility virtually — testing layout changes, modeling failure scenarios, and validating cooling before construction begins.

Continuous Optimization

Automation tools handle hyperparameter tuning, model deployment, and performance monitoring. The factory operates 24/7 — training, fine-tuning, and serving inference at scale with minimal human intervention.

NVIDIA positions AI infrastructure as national infrastructure — as fundamental as roads, power grids, and telecommunications. Sovereign nations are building their own AI factories to cultivate local language models, protect data sovereignty, and drive economic competitiveness. Every enterprise, every government will need access to one.

In the sections ahead, we'll examine every part of this system, from the power grid and cooling systems (Infrastructure) through chip architecture (Silicon), the underlying math (Mathematics), the transformer architecture that makes language models work (Architecture), training and inference at scale (Training & Inference), and real-world considerations (Considerations).


The Building Is the Computer

If you're new here

A data center is a large, climate-controlled building packed with thousands of computers. Most of the internet already runs from buildings like this. What makes an AI data center different is the sheer density of power and heat. Training a single frontier AI model can require thousands of specialised chips called GPUs, each drawing up to 1,000 watts, roughly the same as a microwave oven running flat out. Multiply that by 10,000+ chips and you need 10 MW or more for the chips alone, as much power as a small town.

All that electricity becomes heat. So data centers need industrial cooling — water loops, heat exchangers, or cooling towers — just to stop the chips from melting. The section below lets you explore exactly how much power a training cluster needs.

A frontier model is not trained on a laptop. It is trained inside a purpose-built industrial facility that converts megawatts of grid power into gradients. Site selection is dominated by three things: cheap reliable electricity, fiber backbones, and water or air cool enough to reject heat.

Interactive calculator: scale a training cluster. Sample readout: 10.8 MW · 94.7 GWh/yr · $757/hr · 102,849 gal/day

For reference: 100k Blackwell GPUs ≈ a mid-size city's continuous load.
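The arithmetic behind a cluster-scale estimate like this can be sketched in a few lines. The function name, the 1,000 W per GPU figure, the PUE of 1.2, and the $0.07/kWh price are illustrative assumptions, not the calculator's actual inputs.

```python
HOURS_PER_YEAR = 8760

def cluster_estimate(gpu_count, watts_per_gpu=1000.0, pue=1.2, usd_per_kwh=0.07):
    """Rough facility-level figures for a GPU cluster. All defaults are assumptions."""
    it_mw = gpu_count * watts_per_gpu / 1e6   # GPU load only, in MW
    facility_mw = it_mw * pue                 # add cooling and other overhead
    return {
        "facility_mw": facility_mw,
        "gwh_per_year": facility_mw * HOURS_PER_YEAR / 1000,
        "usd_per_hour": facility_mw * 1000 * usd_per_kwh,  # MW -> kW, times price
    }

estimate = cluster_estimate(gpu_count=10_000)
for name, value in estimate.items():
    print(f"{name}: {value:,.1f}")
```

Scaling `gpu_count` to 100,000 multiplies every figure by ten, which is where the comparison to a city's continuous load comes from.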

One rack, 72 GPUs, ~120 kW

  • Compute trays: 18× per rack, each holding 2 Grace CPUs + 4 Blackwell GPUs, linked by 5th-gen NVLink at ~1.8 TB/s per GPU.
  • NVSwitch trays: 9× per rack form a non-blocking all-to-all fabric so any GPU can reach any other GPU at full NVLink bandwidth.
  • Direct-to-chip liquid cooling. Air alone cannot evacuate ~120 kW from a single 0.6 m² footprint.
  • Power shelves: bus-bar delivery; redundant 415 V three-phase feeds backed by UPS and diesel/gas generators in 2N or N+1 topology.
  • Optics: 800 Gb/s InfiniBand or Ethernet links exit the rack toward a rail-optimized fat-tree spine.
72× BLACKWELL GPU + 9× NVSWITCH

Complete electrical path — utility to server

Every AI factory converts grid power into GPU compute through a carefully engineered chain. Each stage steps down voltage, adds protection, and provides monitoring. A 2N topology means two fully independent paths — either side can carry the full facility load alone.

UTILITY FEED
12.47 – 34.5 kV medium voltage
Dual utility feeds at Tier III+
MV SWITCHGEAR
Vacuum/SF₆ breakers · Protective relaying · Revenue metering
SEL / GE Multilin relays · ATS for utility/gen switchover
STANDBY GENERATORS
2 MW diesel/gas gensets · N+1 or 2N redundancy
Paralleling switchgear · 10s start · Load bank tested
MV → LV TRANSFORMER
Step-down: 12.47 kV → 480 V, 3-phase
Cast-coil dry type (indoor) · K-rated for harmonics · Δ-Y winding
LV SWITCHBOARD / SWBD
480 V distribution · Circuit breakers · SPD
Feeds split into A + B buses for 2N downstream
── A PATH ──
UPS-A
Rotary or Li-ion · 480 V input/output
10–15 min ride-through · VRLA or LFP battery
PDU-A (Floor)
Step-down: 480 V → 208/120 V
Panel boards · breaker distribution
RPP-A
Remote Power Panel
Row-level distribution · breakers per rack
RACK PDU-A
In-rack · metered + switched
Per-outlet V, A, kW, kWh · SNMP/Modbus
── B PATH ──
UPS-B
Rotary or Li-ion · 480 V input/output
10–15 min ride-through · VRLA or LFP battery
PDU-B (Floor)
Step-down: 480 V → 208/120 V
Panel boards · breaker distribution
RPP-B
Remote Power Panel
Row-level distribution · breakers per rack
RACK PDU-B
In-rack · metered + switched
Per-outlet V, A, kW, kWh · SNMP/Modbus
GPU SERVER / TRAY
Dual PSUs — one from A path, one from B path
Auto-failover: either PSU carries full load
MV Stage (12.47–34.5 kV)

Utility entrance through MV switchgear. Vacuum or SF₆ breakers provide fault isolation. Protective relays (SEL-751, GE Multilin 489) monitor overcurrent, differential, ground fault. Revenue-grade metering (CT/PT accuracy class 0.3) for utility billing. ATS (automatic transfer switch) or paralleling switchgear coordinates utility ↔ generator transitions with <10s open-transition or <100ms closed-transition transfer.

LV Stage (480 V → 208 V)

Cast-coil dry-type transformers step down to 480 V. They are K-rated (K-13 or K-20) to handle harmonic distortion from server power supplies. Battery UPS systems (VRLA or lithium-ion) provide 10–15 min ride-through for generator start; rotary flywheel units typically bridge only tens of seconds, which still covers a ~10 s generator start. Floor PDUs step down 480 → 208/120 V with integrated breakers and metering. RPPs (remote power panels) distribute at row level: each RPP feeds 4–8 racks with individually breakered circuits.

Rack Level

Intelligent rack PDUs (ServerTech, Raritan, APC) provide per-outlet monitoring: voltage, current, kW, kWh, power factor. Data is reported via SNMP, Modbus TCP, or REST API to the facility monitoring system. A typical GPU rack draws 40–120+ kW with >90% power factor. At 120 kW (GB200 NVL72), a rack needs roughly 333 A at 208 V three-phase (120,000 W ÷ (√3 × 208 V)), far more than two 100 A circuits can supply, so such racks take multiple high-amperage feeds or a direct 480 V bus bar feed that bypasses the 208 V transformation entirely.
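Rack feed sizing comes down to the balanced three-phase power formula, I = P ÷ (√3 × V × PF). The sketch below applies it to a 120 kW rack at the voltages mentioned in this section. The function name and the 0.95 power factor are assumptions for illustration, and real designs also derate breakers for continuous load.

```python
import math

def three_phase_current(power_w, line_voltage, power_factor=0.95):
    """Line current for a balanced three-phase load: I = P / (sqrt(3) * V * PF)."""
    return power_w / (math.sqrt(3) * line_voltage * power_factor)

# Compare the current a 120 kW rack draws at three distribution voltages.
for volts in (208, 415, 480):
    amps = three_phase_current(120_000, volts)
    print(f"{volts} V three-phase: {amps:.0f} A")
```

The current falls in proportion to the voltage, which is the main argument for higher-voltage distribution to dense racks.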

EPMS MONITORING AT EVERY STAGE
Stage | Metering | Key Measurements | Protocol
MV Switchgear | ION 9000 / SEL-735 | V, A, kW, kVAR, PF, THD, demand | Modbus TCP / DNP3
Generator | Controller (DSE, DEIF) | kW, fuel level, RPM, coolant temp, battery V | Modbus RTU/TCP
Transformer | Temp sensors (winding/oil) | Winding temp, oil temp, loading % | 4–20 mA / Modbus
UPS | Built-in controller | Input/output V, A, kW, battery SOC, temp, runtime | SNMP / Modbus TCP
Floor PDU / STS | ION 7650 / PM8000 | V, A, kW per panel, breaker status, transfer count | Modbus TCP
RPP | Branch circuit monitor | Per-breaker A, kW, alarm on trip | Modbus TCP / BACnet
Rack PDU | Intelligent PDU | Per-outlet V, A, kW, kWh, PF, inlet temp | SNMP / Modbus / REST

All metering data feeds into the EPMS for real-time PUE calculation, capacity planning, and alarm management. Total facility power (metered at MV) ÷ IT load (metered at rack PDU) = PUE.
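The ratio itself is simple to compute once both meter readings are available. In this sketch the MV meter reading and the four rack PDU readings are hypothetical values, and the function name is illustrative rather than part of any monitoring product.

```python
def pue(total_facility_kw, it_load_kw):
    """Power Usage Effectiveness: total facility power divided by IT load."""
    if it_load_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_load_kw

# Hypothetical readings: one MV meter for the facility, one reading per rack PDU.
mv_meter_kw = 560.0
rack_pdu_kw = [118.0, 121.5, 119.2, 120.3]

print(f"PUE = {pue(mv_meter_kw, sum(rack_pdu_kw)):.3f}")
```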

How many paths to the server?

N: Single path, no redundancy. Maintenance means downtime. Unacceptable for mission-critical.
N+1: One spare module. Can lose one UPS and still support full load. Minimum for production workloads.
2N: Fully mirrored, with two independent paths (A+B); either side carries full load alone. Standard for Tier III/IV. Required for AI training clusters where a power glitch kills a multi-day run.
2N+1: 2N with an additional spare module per side. Highest reliability; allows maintenance on one side while the other has N+1 protection.
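The four schemes differ only in how many modules are installed beyond what the load strictly needs. A minimal sketch, assuming identical UPS modules and using the definitions above (including "one spare per side" for 2N+1); the 2,400 kW load and 500 kW module size are made-up example values.

```python
import math

def modules_required(load_kw, module_kw, scheme):
    """UPS module count for a load under the redundancy schemes described above."""
    n = math.ceil(load_kw / module_kw)  # modules needed to carry the load once
    counts = {
        "N": n,               # no spare
        "N+1": n + 1,         # one spare module
        "2N": 2 * n,          # two full, independent paths
        "2N+1": 2 * (n + 1),  # one spare module on each path
    }
    return counts[scheme]

for scheme in ("N", "N+1", "2N", "2N+1"):
    print(scheme, modules_required(2400, 500, scheme))
```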

GPU racks draw 40–120+ kW each (GB200 NVL72 exceeds 120 kW). This drives unique requirements for power density, UPS sizing, and breaker coordination.

Traditional approaches

CRAC — Computer Room Air Conditioning. DX (direct expansion) refrigerant-based, self-contained units. Common in smaller facilities.
CRAH — Computer Room Air Handler. Chilled water coils with fans. More efficient at scale — the chiller plant is centralized.
Containment — hot aisle/cold aisle physical separation prevents mixing. Without containment, 30–40% of cooling capacity is wasted.
Economizer — free cooling when outside air temp is low enough. Raises chilled water setpoint, disables compressors. Saves 20–40% of cooling energy annually depending on climate.

Air cooling works up to ~15–20 kW/rack. Above 50 kW/rack, air alone is physically insufficient regardless of airflow volume.

Direct-to-chip: the new standard

Cold plates — mounted directly on GPU/CPU die. Warm water cooling (30–45°C supply) enables free cooling year-round in most climates. Removes 60–80% of server heat via liquid.
CDU — Coolant Distribution Unit. Heat exchanger between facility water loop and IT water loop. IT loop uses deionized water or propylene glycol in a closed circuit. CDU provides pumps, filtration, flow/temp/pressure monitoring.
Chilled water plant — water-cooled chillers (centrifugal/screw), primary/secondary pumping, cooling towers. Typical temps: 42°F supply / 55°F return. Variable-speed drives on pumps and fans are key efficiency levers.

The GB200 NVL72 is liquid-cooled at rack level. Robust leak detection and emergency drain procedures are mandatory — a CDU leak can destroy millions in hardware in minutes.

Inside the cooling loop — component by component

AI data centers reject tens of megawatts of heat continuously. Below is a detailed look at every major mechanical component in the liquid cooling chain, from the cold plate bolted to the GPU die to the cooling tower rejecting heat to the atmosphere. Understanding these components — and how they interact — is essential for anyone involved in design, commissioning, or operations.

END-TO-END LIQUID COOLING LOOP
IT WATER LOOP (closed, deionized)
FACILITY WATER LOOP (condenser / tower water)
COLD PLATE
on GPU die
Copper micro-channel
45°C supply → 55°C return
GPU DIE ~1000W TDP
hot →← cool
CDU
Coolant Distribution Unit
Heat Exchanger
Pump (redundant pair)
Filter / DI Resin
Reservoir
Sensors (flow/temp/press/leak)
IT loop: deionized water or propylene glycol
hot →← cool
CHILLER
Centrifugal or Screw
Evaporator
Condenser
Compressor
Expansion
R-134a / R-1234ze
COP: 5–8
hot CW →← cool CW
↑ heat to atmosphere
COOLING TOWER
Evaporative
Fan (VFD)
Fill Media
Basin
CHW PUMPS
Primary + Secondary
CW PUMPS
VFD controlled

Heat flows left-to-right: GPU → cold plate → CDU → chiller → cooling tower → atmosphere. Two isolated water loops prevent contamination of IT equipment.
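To get a feel for what the chiller stage costs electrically, a rough sketch can divide the heat load by the COP range listed in the diagram (5–8). The function name and the 10 MW example load are illustrative assumptions, not figures from a specific plant.

```python
def chiller_power_kw(heat_load_kw, cop):
    """Electrical input a chiller needs to move a given heat load: P = Q / COP."""
    if cop <= 0:
        raise ValueError("COP must be positive")
    return heat_load_kw / cop

# Hypothetical 10 MW heat load, evaluated at both ends of the COP 5-8 range.
for cop in (5, 8):
    print(f"COP {cop}: {chiller_power_kw(10_000, cop):.0f} kW of compressor power")
```

Any hour spent in economizer mode avoids this compressor power entirely, which is why warm-water loops matter for efficiency.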

The bridge between IT and facility

CDU INTERIOR — TYPICAL LAYOUT
HOT from racks →
← COOL to racks
PLATE & FRAME HEAT EXCHANGER
IT side ↔ Facility side
Brazed or gasketed stainless plates
plates
A
PUMP A — Primary
Mag-drive or canned-motor
B
PUMP B — Standby
Auto-failover on fault
RESERVOIR / EXPANSION TANK
Maintains system pressure & fluid level
FILTRATION
5μm particulate + DI resin
FLOW CONTROL
Balancing valves
RELIEF VALVE
Overpressure safety
INSTRUMENTATION & SENSORS
● Flow meter (GPM)
● Supply temp (°F/°C)
● Return temp (°F/°C)
● Diff. pressure (PSI)
● Conductivity (μS/cm)
● Leak detection
CDU CONTROLLER
PLC or embedded controller
Modbus TCP / BACnet IP upstream
LEAK CONTAINMENT
Drip pan + cable sensor
Emergency drain valve (N.O.)

Purpose: The CDU is the boundary between the clean, deionized IT water loop and the facility's chilled water loop. It ensures the two never mix — contamination of the IT loop with facility water (which contains corrosion inhibitors, biocides) would damage cold plates and clog micro-channels.

Scale: A typical row-level CDU serves 4–8 racks (200–600 kW). Rack-level CDUs serve a single rack (~40–120 kW). Large deployments use centralized CDUs serving entire rows or pods.

Redundancy: Dual pumps (lead/standby), dual facility-side connections, and N+1 CDU sparing per row. Automatic failover on pump fault or low-flow alarm. Typical response time: <5 seconds.

Where heat leaves the silicon

COLD PLATE — CROSS SECTION (top to bottom)
QD fitting
(quick-disconnect)
← dry-break, non-drip, 10k+ cycles →
QD fitting
(quick-disconnect)
IN (cool supply)
OUT (warm return)
COPPER COLD PLATE BODY
MICRO-CHANNELS (0.2–0.5mm width)
Coolant flows through channels → absorbs heat from copper
↕ spring-loaded
mounting bolt
↕ spring-loaded
mounting bolt
THERMAL INTERFACE MATERIAL (TIM)
Indium foil or high-performance paste · >5 W/m·K
INTEGRATED HEAT SPREADER (IHS)
GPU DIE
~800 mm² · 700–1000W TDP
PCB SUBSTRATE
THERMAL STACK TEMPERATURES
GPU junction (Tj): ~83°C max
IHS surface: ~75°C
TIM interface: ~70°C
Cold plate base: ~55°C
Coolant inlet: 40–45°C / outlet: 50–55°C

Micro-channels maximize surface area. Copper fins as thin as 0.2 mm create hundreds of parallel flow paths. Turbulent flow at the channel level dramatically increases heat transfer coefficient vs. smooth bore.

TIM (Thermal Interface Material) fills microscopic air gaps between the IHS and cold plate. Indium foil or high-performance paste with >5 W/m·K conductivity. Poor TIM application is the #1 cause of GPU thermal throttling.

Quick-disconnects (QD) allow hot-swap of servers without draining the loop. Non-drip / dry-break QDs are mandatory in IT environments. Rated for 10,000+ connect/disconnect cycles.
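The coolant inlet and outlet temperatures in the thermal stack imply a required flow rate through the sensible-heat relation Q = ṁ · c_p · ΔT. The sketch below assumes plain water (c_p ≈ 4,186 J/kg·K, density ≈ 1 kg/L), a 10 K rise, and an 80% liquid-capture fraction for the rack; glycol mixes would need different properties.

```python
def coolant_flow_lpm(heat_w, delta_t_k, cp_j_per_kg_k=4186.0, density_kg_per_l=1.0):
    """Coolant flow (liters/minute) needed to carry heat_w at a given temperature rise."""
    mass_flow_kg_s = heat_w / (cp_j_per_kg_k * delta_t_k)  # from Q = m_dot * cp * dT
    return mass_flow_kg_s / density_kg_per_l * 60

gpu_flow = coolant_flow_lpm(heat_w=1000, delta_t_k=10)             # one ~1000 W GPU
rack_flow = coolant_flow_lpm(heat_w=0.8 * 120_000, delta_t_k=10)   # 80% of a 120 kW rack

print(f"Per GPU: {gpu_flow:.2f} L/min, per rack: {rack_flow:.1f} L/min")
```

Halving the allowed temperature rise doubles the required flow, which is the basic trade-off between pump energy and chip temperature.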

Rejecting heat to atmosphere

COUNTERFLOW COOLING TOWER — CROSS SECTION
↑ ↑ ↑ ↑ ↑
Warm moist air exhaust to atmosphere
FAN
axial
6–30 ft ⌀
MOTOR + VFD
25–100 HP
gear-reduced drive
DRIFT ELIMINATORS
chevron baffles — prevent water droplet carryover into exhaust
HOT WATER DISTRIBUTION DECK
spray nozzles — distribute hot water evenly over fill
LOUVERS
AIR
FILL MEDIA
PVC / polypropylene corrugated sheets
maximizes water-to-air contact surface
Water trickles down · Air flows across · Heat transfers via evaporation
LOUVERS
AIR
COLD WATER BASIN
Strainer · Float valve · Bleed-off
Hot CW in →
85–95°F from chiller
Makeup water line
replaces evap + blowdown loss
← Cool CW out
75–85°F to chiller

Evaporative cooling exploits the latent heat of vaporization — water evaporating from the fill media absorbs ~1,000 BTU/lb, cooling the remaining water. Approach temp (basin temp minus wet-bulb) of 5–7°F is typical. Lower approach = larger, more expensive tower.

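Water consumption: ~1.8 L/kWh (roughly 0.5 gal/kWh) of heat rejected, counting evaporation plus blowdown; the ~1,000 BTU/lb latent heat above works out to about 0.4 gal/kWh for evaporation alone. A 100 MW AI facility can consume 300,000+ gallons/day, and about 1 million gallons/day if all of its heat is rejected evaporatively. Water treatment (biocide, scale inhibitor, pH control) is critical to prevent Legionella, scaling, and corrosion.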
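A quick energy-balance check on tower water use can start from the ~1,000 BTU/lb latent heat quoted above. This sketch covers evaporation only; it ignores blowdown and drift and assumes every kWh of heat is rejected evaporatively, with no economizer or dry-cooling hours. The constant names and the 8.34 lb/gal water density are assumptions added for the example.

```python
BTU_PER_KWH = 3412.14
LATENT_BTU_PER_LB = 1000.0  # latent heat figure quoted in the text
LB_PER_GALLON = 8.34        # approximate weight of a US gallon of water

def evaporation_gal_per_kwh():
    """Gallons of water evaporated to reject one kWh of heat."""
    return BTU_PER_KWH / LATENT_BTU_PER_LB / LB_PER_GALLON

def tower_water_gal_per_day(heat_mw, gal_per_kwh):
    """Daily water use for a continuous heat load at a given consumption rate."""
    return heat_mw * 1000 * 24 * gal_per_kwh

rate = evaporation_gal_per_kwh()
print(f"Evaporation: {rate:.3f} gal/kWh")
print(f"100 MW, evaporation only: {tower_water_gal_per_day(100, rate):,.0f} gal/day")
```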

Fan control: VFD modulates fan speed to maintain condenser water return temp setpoint. Multiple cells staged on/off as load changes. Fan energy is 2–5% of total cooling plant energy.

Moving heat between loops

PLATE & FRAME HEAT EXCHANGER
(most common in CDUs and free-cooling economizers)
Hot fluid IN →
← Cool fluid OUT
← Cold fluid IN
Warm fluid OUT →
Corrugated stainless plates with EPDM gaskets
Counter-flow pattern maximizes ΔT & effectiveness (ε > 0.90)
SHELL & TUBE (CHILLER CONDENSER)
CW in →
Shell: refrigerant (condenses)
Tubes: condenser water (absorbs heat)
→ CW out
Fouling factor maintenance: tube brushing or automatic ball cleaning system

Plate & frame are compact, high-effectiveness (ε > 0.90) and easily expandable — add plates to increase capacity. Used in CDUs, economizer bypass, and free-cooling heat exchangers. Brazed variants (BPHE) are smaller but not field-serviceable.
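Effectiveness ties directly to how much heat an exchanger moves: Q = ε · C_min · (T_hot,in − T_cold,in), where C_min is the smaller of the two streams' heat capacity rates. The sketch below uses hypothetical stream values for a row-level CDU; none of the numbers are taken from a vendor datasheet.

```python
def hx_heat_transfer_kw(effectiveness, c_hot_kw_per_k, c_cold_kw_per_k,
                        t_hot_in_c, t_cold_in_c):
    """Heat moved by an exchanger using the effectiveness method."""
    c_min = min(c_hot_kw_per_k, c_cold_kw_per_k)  # limiting heat capacity rate
    q_max = c_min * (t_hot_in_c - t_cold_in_c)    # ideal, infinitely large exchanger
    return effectiveness * q_max

# Hypothetical CDU: 55 C return from racks, 38 C facility water, 30 kW/K on each side.
q_kw = hx_heat_transfer_kw(0.90, 30.0, 30.0, 55.0, 38.0)
print(f"Heat transferred: {q_kw:.0f} kW")
```

Adding plates raises ε toward 1, but the inlet temperature difference still caps what the exchanger can move.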

Shell & tube are used inside chillers as the condenser and evaporator. Refrigerant flows shell-side (changes phase); water flows tube-side. Fouling reduces efficiency — condenser approach temp rises 1°F per year without cleaning, costing ~2% chiller efficiency per degree.

Pump types in cooling plants

CENTRIFUGAL PUMP — CUTAWAY VIEW
SUCTION
axial inlet
IMPELLER
cast iron
or bronze
VOLUTE (spiral casing)
DISCHARGE
tangential
high pressure
SHAFT
MECH SEAL
prevents leaks
rotating + stationary face
MOTOR
10–200 HP
TEFC · 1800 RPM
VFD
speed control
Power ∝ speed³
PUMP TYPES IN AI DC COOLING
CHW PUMPS
Primary: constant flow through chillers
Secondary: VFD, modulates to demand
50–300 HP · 1000–4000 GPM
CW PUMPS
Chiller condenser ↔ cooling tower
Constant or VFD speed
Must match chiller GPM/ton spec
CDU PUMPS
Mag-drive or canned-motor
No shaft seal (leak-free)
5–30 GPM · redundant pair per CDU
AFFINITY LAWS — THE KEY EFFICIENCY PRINCIPLE
Flow ∝ Speed
linear
Head ∝ Speed²
squared
Power ∝ Speed³
cubed
80% speed = ~51% power. VFDs are the single biggest efficiency lever in a cooling plant.
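The affinity laws are easy to tabulate. This sketch expresses flow, head, and power as fractions of their full-speed values for an idealized pump or fan; real systems with static head deviate from the pure cube law, so treat the results as an upper bound on savings.

```python
def affinity(speed_fraction):
    """Flow, head, and power relative to full speed for an ideal pump or fan."""
    return {
        "flow": speed_fraction,        # linear in speed
        "head": speed_fraction ** 2,   # squared
        "power": speed_fraction ** 3,  # cubed
    }

for speed in (1.0, 0.9, 0.8, 0.6):
    r = affinity(speed)
    print(f"{speed:.0%} speed: flow {r['flow']:.0%}, head {r['head']:.0%}, power {r['power']:.1%}")
```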

Moving air through towers & AHUs

FAN TYPES IN DATA CENTER COOLING
AXIAL FAN
(cooling towers, condensers)
High volume, low pressure
6–30 ft diameter · gear-reduced motor
CENTRIFUGAL FAN
(CRAHs, AHUs)
Medium volume, higher pressure
Scroll housing · plenum or housed
EC FANS (Electronically Commutated)
Used in: server chassis, rear-door heat exchangers, small AHUs
Brushless DC motor + integrated speed control · 90%+ efficiency · PWM driven
Cooling tower fans — large axial fans (6–30 ft diameter), typically driven by a gear-reduced motor. VFDs modulate speed to maintain condenser water temperature. Fan energy follows the same affinity laws as pumps: power ∝ speed³.
CRAH/AHU fans — centrifugal (plenum or housed) fans move air across chilled water coils. EC fan arrays (fan walls) are replacing single large fans — they provide N+1 redundancy, lower noise, and better efficiency at partial loads.
Server fans — counter-rotating 40–80 mm fans inside each server or tray. Speed controlled by BMC based on inlet/exhaust temp and component thermal sensors. In liquid-cooled systems, server fans handle only the 20–40% of heat not captured by cold plates (VRMs, DIMMs, NVMe drives).
Rear-door heat exchangers (RDHx) — a hybrid approach: chilled water coils with EC fans mounted on the rack rear door. Captures exhaust heat before it enters the room. Can handle 30–50 kW/rack without liquid touching servers. Popular as a retrofit for air-cooled facilities adding GPU density.

How it all connects — primary/secondary CHW + condenser loop

CHILLER PLANT — PROCESS FLOW
Read top → bottom to follow the heat rejection path
CW supply (cool)
CW return (warm)
CHW supply (cold)
IT loop
1
HEAT REJECTION — COOLING TOWERS
Cooling Towers (3 cells, N+1)
Evaporative heat rejection to atmosphere
Fan-assisted counterflow · VFD speed control
CW IN
85–95°F
warm return
CW OUT
75–85°F
cooled supply
▼ CW supply flows down to pumps
2
CONDENSER WATER PUMPS
P
CW Pump 1
Lead
Constant speed · 50–200 HP
P
CW Pump 2
Lag / standby
Constant speed · 50–200 HP
▼ CW supply → chiller condenser
3
CHILLERS — WHERE CW LOOP MEETS CHW LOOP
Chiller 1 · 500 ton
CONDENSER
CW absorbs heat
shell & tube
EVAPORATOR
CHW is produced
shell & tube
Compressor · refrigerant cycle · R-134a
Chiller 2 · 500 ton
CONDENSER
CW absorbs heat
shell & tube
EVAPORATOR
CHW is produced
shell & tube
Compressor · refrigerant cycle · R-1234ze
↑ CW return (warm) — 85–95°F back to towers
↓ CHW supply (cold) — 42–50°F to pumps below
▼ CHW supply flows down to primary pumps
4
PRIMARY CHW PUMPS
P
Primary 1
Constant flow through evaporator
Matches chiller min flow requirement
P
Primary 2
Constant flow through evaporator
Matches chiller min flow requirement
BYPASS / DECOUPLER
Excess primary flow recirculates · isolates primary from secondary
5
SECONDARY CHW PUMPS (VFD)
P
Secondary 1 (VFD)
Variable speed · matches IT load
ΔP sensor at most-remote CDU
P
Secondary 2 (VFD)
Variable speed · matches IT load
ΔP sensor at most-remote CDU
▼ CHW supply → CDU facility side (42–50°F)
6
CDUs → IT RACKS (FACILITY/IT BOUNDARY)
CDU-1
Plate HX
IT pump pair
Sensors
4–8 racks · 200–600 kW
CDU-2
Plate HX
IT pump pair
Sensors
4–8 racks · 200–600 kW
CDU-3
Plate HX
IT pump pair
Sensors
4–8 racks · 200–600 kW
IT supply: 40–45°C to cold plates
IT return: 50–55°C back to CDU
▲ RETURN PATH
Warm CHW (55–65°F) returns from CDUs → through secondary & primary pumps → back to chiller evaporator → cycle repeats
FREE COOLING / ECONOMIZER MODE
When outdoor wet-bulb temp < CHW supply setpoint: tower water routes through a plate & frame HX directly to the CHW loop, bypassing chillers entirely. Zero compressor energy. In warm-water systems (GPU supply >40°C), free cooling operates 8,000+ hours/year in northern climates — the largest single contributor to PUE < 1.10.
DESIGN TEMPERATURES AT EACH STAGE
Tower Water
Supply: 75–85°F
Return: 85–95°F
Chilled Water
Supply: 42–50°F
Return: 55–65°F
CDU Facility
Supply: 42–50°F
Return: 55–65°F
CDU IT Side
Supply: 40–45°C
Return: 50–55°C
GPU Cold Plate
Supply: 40–45°C
Return: 50–55°C (Tj~83°C)

PUE = Total Facility Power ÷ IT Equipment Power

PUE scale: 1.00 (theoretical minimum) · 1.03–1.10 · 1.20–1.40 · 1.50–2.00
Optimization levers
  • Raise chilled water supply temp → more economizer hours
  • Liquid cooling → cuts fan energy, enables free cooling
  • Variable speed drives on pumps and fans
  • Hot/cold aisle containment
  • Efficient UPS (ECO mode, lithium-ion batteries)
  • Higher voltage distribution (415V vs 208V) reduces I²R losses

A 0.1 improvement in PUE at a facility with 100 MW of IT load saves ~10 MW of cooling/overhead power, roughly $6M/year at industrial electricity rates (10 MW × 8,760 h × ~$0.07/kWh).
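The savings figure follows from simple arithmetic, sketched here under two assumptions: the 100 MW refers to IT load, and electricity costs about $0.07/kWh. Both the function name and the starting PUE of 1.3 are illustrative.

```python
HOURS_PER_YEAR = 8760

def pue_savings(it_load_mw, pue_before, pue_after, usd_per_kwh=0.07):
    """Overhead power saved (MW) and annual cost saved (USD) from a PUE improvement."""
    saved_mw = it_load_mw * (pue_before - pue_after)
    usd_per_year = saved_mw * 1000 * HOURS_PER_YEAR * usd_per_kwh
    return saved_mw, usd_per_year

saved_mw, usd_per_year = pue_savings(100, 1.3, 1.2)
print(f"Saved: {saved_mw:.1f} MW, about ${usd_per_year / 1e6:.1f}M per year")
```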

Why liquid cooling at scale is hard

Moving from air to liquid sounds straightforward — but at AI-scale densities, it introduces a set of engineering challenges that the data center industry is still actively solving. These are the real-world problems that make or break a liquid-cooled deployment.

1. Leak Risk in IT Spaces

Water inside a server room is inherently risky. A single fitting failure can destroy hundreds of thousands of dollars of GPU hardware in minutes. Every connection point — quick-disconnects, manifold joints, CDU internals — is a potential leak site. Mitigation: dry-break QD fittings, leak detection cables under every pipe run, drip pans beneath CDUs, automatic isolation valves that shut down a loop segment within seconds of detection. Despite this, many operators still consider liquid cooling "high anxiety" compared to air.

2. Serviceability & Hot-Swap

Air-cooled servers slide in and out of racks freely. Liquid-cooled servers are tethered to plumbing. Replacing a GPU node means disconnecting fluid lines, managing residual coolant, and reconnecting without introducing air bubbles. Blind-mate connectors (auto-connecting on rack insertion) help, but add cost and complexity. Service technicians need new training — HVAC pipe-fitting skills meet IT operations. Mean-time-to-repair (MTTR) is longer unless the facility is designed with maintenance access and drain points from day one.

3. Hybrid Cooling: Liquid + Air

Cold plates only capture 60–80% of server heat (GPUs, CPUs). The remaining 20–40% (VRMs, DIMMs, NVMe drives, NICs) still radiates into the room as hot air. You can't eliminate CRAHs or room-level cooling entirely — you need a parallel air system for the "residual" heat. Designing the interaction between these two systems (liquid capturing the bulk, air handling the remainder) requires careful airflow modeling and control coordination. Over-cooling with air wastes energy; under-cooling risks component damage on non-liquid-cooled parts.

4. Water Quality & Corrosion

The IT loop requires ultra-pure deionized water (<1 μS/cm conductivity) to prevent galvanic corrosion and micro-channel fouling. But DI water is aggressive — it leaches metal ions from fittings, especially if mixed metals (copper cold plates + aluminum manifolds) are present. Ongoing water chemistry monitoring (conductivity, pH, dissolved O₂, particulate count) is mandatory. Glycol-based coolants resist corrosion but reduce heat transfer capacity by 10–15% and complicate leak cleanup. There is no perfect fluid — every choice is a trade-off.

5. Standardization Gap

Unlike air cooling (standardized 19" racks, ASHRAE guidelines, universal CRAH compatibility), liquid cooling lacks industry-wide standards. Every server vendor has different cold plate designs, manifold connectors, flow requirements, and CDU interfaces. OCP (Open Compute Project) is working on standardized liquid cooling specifications, but adoption is still early. This makes multi-vendor deployments painful — a Dell cold plate won't mate with an HPE manifold. Facilities must commit to a vendor ecosystem or invest heavily in adapters and custom plumbing.

6. Retrofit vs. Greenfield

Existing air-cooled facilities weren't designed for liquid cooling. Retrofitting requires: raised-floor penetrations or overhead pipe routing, structural reinforcement (water-filled pipes are heavy), new chiller/tower capacity, CDU floor space, and leak containment infrastructure. Floor loading jumps from ~150 lbs/ft² (air-cooled) to 250+ lbs/ft² (liquid-cooled with dense GPU racks). Many buildings simply can't support it without structural modifications. Greenfield builds can design for liquid from the start — but the industry is converting existing facilities faster than it can build new ones.

Submerging servers in dielectric fluid

Immersion cooling eliminates air entirely — servers are submerged in a tank filled with electrically non-conductive (dielectric) fluid. The fluid makes direct contact with every component on the board, removing heat from GPUs, CPUs, VRMs, DIMMs, and NVMe drives simultaneously. No cold plates, no fans, no CRAHs. Two variants exist: single-phase (fluid stays liquid) and two-phase (fluid boils at the chip surface).

SINGLE-PHASE IMMERSION
SINGLE-PHASE IMMERSION TANK — CROSS SECTION
HEAT EXCHANGER (in-tank or external)
Plate & frame or coil — connected to facility CHW loop
CHW in
CHW out
DIELECTRIC FLUID (mineral oil or synthetic hydrocarbon)
SERVER 1
GPU
CPU
DIMM
VRM
SERVER 2
GPU
CPU
DIMM
VRM
SERVER 3
GPU
CPU
DIMM
VRM
SERVER 4
GPU
CPU
DIMM
VRM
Servers mounted vertically on rails — slide up for service
Hot fluid rises
Cool fluid sinks
CIRCULATION PUMP: forced convection through HX
FILTRATION: particulate removal
How it works: Servers are mounted vertically on rails inside a sealed tank. The dielectric fluid (typically mineral oil at $2–5/liter, or engineered synthetic hydrocarbons at $10–30/liter) circulates via natural convection and/or a pump. Hot fluid rises to the top where an in-tank or external heat exchanger transfers heat to the facility chilled water loop. Fluid temperature is maintained at 35–45°C — the fluid never changes phase.
Heat transfer: Single-phase immersion achieves heat transfer coefficients of 50–200 W/m²·K — better than air (5–25 W/m²·K) but far below two-phase (>10,000 W/m²·K). The advantage over air is total surface contact: every component is cooled simultaneously, eliminating hot spots from poorly positioned fans.
Key vendors: GRC (Green Revolution Cooling) — largest deployed base, oil-based; Submer — synthetic fluid SmartCoolant; Asetek — hybrid immersion + cold plate; LiquidCool Solutions — chassis-level sealed enclosures.
ADVANTAGES
  • ✓ Eliminates all server fans — 10–15% IT power savings
  • ✓ No CRAHs or raised floor required
  • ✓ Every component cooled equally — no hot spots
  • ✓ Operates at higher fluid temps → more free-cooling hours
  • ✓ PUE of 1.02–1.05 achievable
  • ✓ Quieter than air-cooled — no fan noise
  • ✓ Mineral oil is cheap and widely available
  • ✓ No leak risk to room — fluid stays in sealed tank
CHALLENGES
  • ✗ Messy serviceability — servers drip when removed
  • ✗ Increased MTTR — draining & handling required
  • ✗ Material compatibility: some connectors, labels, thermal pads dissolve
  • ✗ Weight: filled tanks reach 1,500–3,000 lbs — structural reinforcement needed
  • ✗ Fluid monitoring (viscosity, particulates, moisture) adds operational overhead
  • ✗ Server OEM warranty may be voided — most OEMs don't certify for immersion
  • ✗ Fire code compliance: mineral oil is combustible (Class IIIB)
  • ✗ Limited vendor ecosystem vs. cold-plate solutions
TWO-PHASE IMMERSION COOLING
TWO-PHASE IMMERSION TANK — CROSS SECTION
[Diagram, top to bottom: a condenser coil in the vapor space, fed by facility CW in (cool) and returning CW out (warm); the vapor space, where dielectric vapor rises from the boiling surfaces, contacts the cold coil, condenses back to liquid, and drips down; and the boiling dielectric bath (e.g., 3M Novec 7100, boiling point ~61°C) holding Servers 1–3, each with a GPU (boiling), a CPU (boiling), and VRM / DIMM. The fluid boils on the hottest surfaces (the GPU die), and bubbles carry heat upward as vapor.]
THE PHYSICS: WHY PHASE CHANGE IS SO POWERFUL

When a liquid boils, it absorbs the latent heat of vaporization — the energy required to break intermolecular bonds. For engineered dielectric fluids, this is typically 80–120 kJ/kg. This is in addition to the sensible heat the liquid absorbs as its temperature rises. The result: 10–100× higher heat transfer coefficients vs. single-phase convection. A boiling surface can reject >20 W/cm² with only a few degrees of superheat above the fluid's boiling point. For comparison, forced-air convection maxes out at ~0.5 W/cm².

Air (forced): 5–25 W/m²·K
Single-phase liquid: 200–5,000 W/m²·K
Two-phase (boiling): 10,000–100,000 W/m²·K
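
To make those coefficients concrete, Newton's law of cooling (q = h·ΔT) gives the surface-to-fluid temperature difference needed to carry a given heat flux. The sketch below is illustrative only: the coefficients are values picked from within the ranges above, the 20 W/cm² flux is the boiling example from the text, and the function name is an assumption.

```python
# Newton's law of cooling: q = h * dT, so dT = q / h.
# Coefficients are assumed values taken from within the ranges quoted above.

def required_delta_t(heat_flux_w_per_cm2: float, h_w_per_m2k: float) -> float:
    """Surface-to-fluid temperature difference (K) needed to carry the flux."""
    heat_flux_w_per_m2 = heat_flux_w_per_cm2 * 1e4  # 1 cm^2 = 1e-4 m^2
    return heat_flux_w_per_m2 / h_w_per_m2k

COEFFICIENTS = {
    "air (forced)": 25.0,
    "single-phase liquid": 5_000.0,
    "two-phase (boiling)": 50_000.0,
}

for name, h in COEFFICIENTS.items():
    print(f"{name:>20}: dT = {required_delta_t(20.0, h):,.0f} K")
```

At 20 W/cm², air would need a temperature difference of thousands of kelvin, which is physically impossible for silicon. Boiling needs only a few kelvin, which is what "a few degrees of superheat" means in practice.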
How it works: Servers are submerged in a low-boiling-point dielectric fluid. The fluid boils at the surface of hot components (GPU die at ~80°C causes vigorous nucleate boiling in a fluid with a 49–61°C boiling point). The generated vapor rises into a vapor space above the liquid level where it contacts a condenser coil cooled by facility water. Vapor condenses back to liquid and drips into the tank — a self-sustaining cycle with no pumps in the primary loop.
Dielectric fluids: The most common are fluorocarbon-based engineered fluids — 3M™ Novec™ 7100 (bp 61°C), Novec 649 (bp 49°C), and Solvay Galden HT fluids. These are non-conductive, non-flammable, non-toxic, and chemically inert. However, many contain PFAS (per- and polyfluoroalkyl substances) and face increasing regulatory scrutiny in the EU and US. 3M exited PFAS manufacturing entirely by end of 2025, creating supply uncertainty. Non-PFAS alternatives (hydrofluoroolefins, synthetic esters) are emerging but not yet proven at scale.
Fluid cost: Engineered fluorocarbon fluids cost $50–150/liter. A typical 48U immersion tank holds 200–500 liters. Fluid loss from vapor escape (even with condensers and freeboard) runs 2–5% annually. At scale, this is a significant operating expense compared to single-phase mineral oil ($2–5/liter) or direct-to-chip water (near-zero fluid cost).
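
A quick sanity check on those numbers, as a minimal sketch. It only multiplies the price, volume, and loss ranges quoted above; the function names are illustrative and no other costs (labor, disposal, filtration) are modeled.

```python
# Fill cost and yearly top-up cost for one two-phase immersion tank.
# Prices, volumes and loss rates are the ranges quoted in the text above.

def fill_cost(volume_l: float, price_per_l: float) -> float:
    """Cost in dollars to fill a tank once."""
    return volume_l * price_per_l

def annual_loss_cost(volume_l: float, price_per_l: float, loss_fraction: float) -> float:
    """Cost in dollars of replacing fluid lost to vapor escape each year."""
    return fill_cost(volume_l, price_per_l) * loss_fraction

low_fill = fill_cost(200, 50)                   # small tank, cheapest fluid
high_fill = fill_cost(500, 150)                 # large tank, priciest fluid
low_loss = annual_loss_cost(200, 50, 0.02)
high_loss = annual_loss_cost(500, 150, 0.05)
oil_fill = fill_cost(500, 5)                    # same large tank, mineral oil at $5/L

print(f"Initial fill:  ${low_fill:,.0f} to ${high_fill:,.0f} per tank")
print(f"Annual top-up: ${low_loss:,.0f} to ${high_loss:,.0f} per tank")
print(f"Mineral oil fill for comparison: ${oil_fill:,.0f}")
```

Multiply by the number of tanks in a hall to see why fluid becomes a line item of its own.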
Key vendors: LiquidCool Solutions — sealed chassis-level two-phase; ZutaCore — HyperCool two-phase, built around sealed on-chip evaporators rather than an open bath (deployed by Equinix, Aligned); GRC — primarily single-phase but exploring two-phase; Iceotope — precision immersion with chassis-level sealed units.
ADVANTAGES
  • ✓ Highest heat transfer of any cooling method — handles >1,000W TDP
  • ✓ No pumps in primary IT loop — self-circulating phase change
  • ✓ Uniform chip temperature regardless of load (boiling is isothermal)
  • ✓ Can handle 200+ kW/rack densities
  • ✓ Near-silent operation — no fans, no pump vibration
  • ✓ PUE of 1.01–1.03 theoretically achievable
  • ✓ Fluid is non-flammable (unlike mineral oil)
CHALLENGES
  • ✗ Extremely high fluid cost ($50–150/liter × 200–500L per tank)
  • ✗ PFAS regulatory risk — EU REACH restrictions, 3M exit
  • ✗ Vapor management — tank must be sealed; fugitive emissions are GWP concern
  • ✗ Material compatibility — aggressive solvents attack some elastomers, labels, TIMs
  • ✗ Limited production-scale deployments (still early-adopter stage)
  • ✗ Service complexity — servers must drain before removal
  • ✗ Fluid replenishment logistics and environmental disposal
  • ✗ Condenser sizing critical — undersized = vapor loss + capacity limit
COOLING TECHNOLOGY COMPARISON
Attribute | Air Cooling | Direct-to-Chip (Cold Plate) | Single-Phase Immersion | Two-Phase Immersion
Max rack density | 15–25 kW | 120–200 kW | 100–200 kW | 200+ kW
Heat transfer coeff. | 5–25 W/m²·K | 5,000–10,000 W/m²·K | 50–200 W/m²·K | 10,000–100,000 W/m²·K
PUE achievable | 1.3–1.6 | 1.03–1.15 | 1.02–1.05 | 1.01–1.03
Fluid cost | N/A (air) | ~$0 (water) | $2–30/L | $50–150/L
Residual heat path | All via air | 20–40% via air | 100% via fluid | 100% via fluid
Serviceability | Excellent | Moderate (QD fittings) | Challenging (drip/drain) | Challenging (drain + vapor)
Maturity | Decades (standard) | Production (NVIDIA std) | Early production | Pilot / early adopter
Regulatory risk | None | None | Low (oil fire codes) | High (PFAS)
CONTROLS & MONITORING — IMMERSION-SPECIFIC

Immersion cooling introduces a different set of monitoring requirements compared to cold-plate systems. The fluid itself becomes a critical asset to monitor and maintain.

FLUID HEALTH
  • ● Fluid temperature (bulk & stratified)
  • ● Viscosity (degradation indicator)
  • ● Dielectric breakdown voltage
  • ● Moisture content (ppm)
  • ● Particulate count (cleanliness)
  • ● Acid number (oxidation products)
  • ● Fluid level (leak/loss detection)
THERMAL MONITORING
  • ● Tank inlet / outlet ΔT
  • ● Per-server inlet temp (immersed sensor)
  • ● Condenser coil in/out temps
  • ● Vapor space temperature (two-phase)
  • ● Condenser approach temperature
  • ● Ambient air above tank
SAFETY SYSTEMS
  • ● High-temp shutdown (fluid overheat)
  • ● Low-level alarm (fluid loss)
  • ● Vapor pressure monitoring (two-phase)
  • ● Leak detection under/around tanks
  • ● Emergency drain valve (gravity-fed)
  • ● Fire suppression integration

Vaporization at the cold plate — boiling without immersion

This is the “best of both worlds” approach you may have seen: the extreme heat transfer of phase-change boiling, but contained inside a sealed cold plate bolted to the GPU — no immersion bath, no dripping servers. The dielectric fluid boils at the chip surface inside the cold plate. Vapor travels through tubing to a remote condenser where it releases heat to the facility loop and condenses back to liquid, which returns to the cold plate. This is fundamentally different from single-phase direct-to-chip (where water stays liquid and just gets warmer).

2P-DTC LOOP — HOW IT WORKS
[Diagram: pumped / thermosiphon two-phase loop. A 2P cold plate (a sealed evaporator with a liquid pool or wick) sits on a ~1,000 W GPU die, where the liquid boils at 30–60°C by nucleate boiling on the hot surface. Low-pressure vapor travels through an insulated line to a remote condenser (in the CDU or at the rack manifold), where it condenses and gives up its heat to facility water (CHW in / CHW out). Liquid returns by gravity or a small pump. The loop is either self-circulating with no large mechanical pump (thermosiphon) or assisted by a small low-pressure pump (pumped 2-phase).]
THERMOSIPHON (PASSIVE)
No pump in the IT loop

Vapor rises naturally because it's less dense than liquid. Liquid returns by gravity. Requires the condenser to be physically above the cold plate — typically at the top of the rack or in an overhead manifold. Zero pumping energy in the IT loop. Limited by the height differential and pressure drop.

Pros: No moving parts in IT loop, near-silent, ultra-reliable
Cons: Layout constraints, capacity limited by gravity head
PUMPED 2-PHASE (ACTIVE)
Small low-pressure pump assists return

A small magnetically-coupled pump moves liquid back to the evaporator, freeing the design from gravity constraints. The pump only handles liquid return (low flow, low pressure) — the phase change still does the heavy lifting of heat transport. Condenser can be placed anywhere convenient.

Pros: Flexible layout, scales to higher densities, faster startup
Cons: Pump = failure point, requires redundancy (N+1)
THE PHYSICS — WHY VAPORIZATION AT THE CHIP CHANGES EVERYTHING
Isothermal heat removal

A boiling fluid stays at its saturation temperature regardless of how much heat you dump into it. A 700W GPU and a 1,500W GPU on the same loop both see coolant at the fluid's boiling point (say, 55°C). Each junction still sits above that by its own power-dependent thermal resistance, but neither is penalized by coolant that warmed up on its way through upstream chips. This narrows chip-to-chip variation and gives a far more predictable Tjunction across the entire cluster.

Massive ΔH per kg

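Single-phase water carries ~4.2 kJ per kg per °C of temperature rise. A two-phase dielectric carries 80–120 kJ/kg of latent heat on phase change alone, roughly 20–30× what water carries per degree. How much flow that saves depends on the water loop's temperature rise: against a 10°C rise (~42 kJ/kg) the advantage is about 2–3× per kilogram, and it approaches 10× only against tight 2–3°C designs. Either way the result is less coolant mass flow for the same heat: smaller pipes, smaller pumps, less plumbing complexity.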
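
The flow comparison can be worked through directly. This is a minimal sketch under stated assumptions: a 1,000 W chip, water at 4.2 kJ/kg·K, a mid-range latent heat of 100 kJ/kg, and complete vaporization at the cold-plate exit (`exit_quality=1.0`). Real loops typically vaporize only part of the flow to stay clear of dryout, which narrows the advantage. Function names are illustrative.

```python
# Coolant mass flow needed to remove a fixed heat load.
# Single-phase: Q = m * cp * dT.  Two-phase: Q = m * h_fg * exit_quality.

WATER_CP_J_PER_KG_K = 4_200.0
LATENT_HEAT_J_PER_KG = 100_000.0   # assumed mid-point of the 80-120 kJ/kg range
HEAT_LOAD_W = 1_000.0              # assumed chip power

def single_phase_flow(heat_w: float, cp_j_per_kg_k: float, delta_t_k: float) -> float:
    """Mass flow (kg/s) for a liquid that only warms up."""
    return heat_w / (cp_j_per_kg_k * delta_t_k)

def two_phase_flow(heat_w: float, latent_j_per_kg: float, exit_quality: float = 1.0) -> float:
    """Mass flow (kg/s) when a fraction `exit_quality` of the liquid boils off."""
    return heat_w / (latent_j_per_kg * exit_quality)

boiling = two_phase_flow(HEAT_LOAD_W, LATENT_HEAT_J_PER_KG)
for delta_t in (2.0, 5.0, 10.0):
    water = single_phase_flow(HEAT_LOAD_W, WATER_CP_J_PER_KG_K, delta_t)
    print(f"water dT={delta_t:>4.0f} K: {water:.4f} kg/s, "
          f"two-phase: {boiling:.4f} kg/s, ratio {water / boiling:.1f}x")
```

The ratio shrinks as the water-side ΔT grows, so any claimed flow reduction should always be read together with the ΔT it was compared against.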

Higher density possible

Conventional single-phase cold plates start running out of margin at ~1,000W GPUs because the inlet-to-outlet ΔT widens and the chip sees inconsistent cooling. 2P-DTC handles 1,500W+ TDPs with the same supply temperature and flat thermal profile. This makes it a leading candidate for NVIDIA Rubin and beyond.

Higher supply temps OK

Because phase change happens at a fixed temperature (set by fluid choice), 2P-DTC can run with facility water at 35–45°C while still keeping GPU Tj below throttle limits. This means year-round free cooling in almost any climate — no compressors, no chillers, just a dry cooler or cooling tower.

KEY VENDORS — 2P-DTC LANDSCAPE
ZutaCore — HyperCool

Pumped 2-phase using a proprietary dielectric (HC-1). Sealed evaporators (Enhanced Nucleation Evaporators) bolt directly onto GPU/CPU. Deployed by Equinix, Aligned. Among the most commercially mature 2P-DTC offerings.

Accelsius — NeuCool

Pumped 2-phase using a non-PFAS, low-GWP fluorocarbon refrigerant. Spinoff from Nokia Bell Labs technology. Focuses on drop-in cold-plate replacement for existing direct-to-chip racks.

JetCool — SmartPlate

Microconvective single-phase today, with two-phase variants in development. Uses high-velocity impinging jets inside the cold plate to maximize heat transfer coefficient. HPE OEM partnership.

Chilldyne — Cold Plate w/ Negative Pressure

Operates the IT loop at sub-atmospheric pressure so leaks pull air in rather than push fluid out. Pairs naturally with 2-phase designs where vapor pressure management is critical.

2P-DTC VS. OTHER LIQUID COOLING APPROACHES
Attribute | Single-Phase DTC (water) | Two-Phase DTC | Two-Phase Immersion
Where boiling occurs | N/A (no phase change) | Inside sealed cold plate | Open bath around servers
Fluid | Treated water or PG mix | Dielectric refrigerant (small volume) | Dielectric refrigerant (large volume)
Fluid volume per rack | 5–20 L | 2–10 L | 200–500 L
Server form factor | Standard rack-mount | Standard rack-mount | Vertical in tank
Serviceability | QD fittings, hot-swap | QD fittings, hot-swap | Drain & lift from tank
Max chip TDP | ~1,000–1,200 W | 1,500 W+ | 1,500 W+
Facility water supply temp | 25–35°C (warm water) | 35–45°C (free cooling) | 25–40°C
Residual air cooling needed | Yes (~20–40% of heat) | Yes (~20–30% of heat) | No
Leak consequence | Water on electronics = damage | Dielectric: non-conductive, evaporates | Dielectric: non-conductive
Maturity (2026) | Production standard (NVL72) | Early production / pilot | Pilot / early adopter
CONTROLS & TELEMETRY UNIQUE TO 2P-DTC
Saturation pressure monitoring

The fluid's saturation temperature is a direct function of system pressure. A pressure drift indicates fluid loss, non-condensable gas ingress, or condenser fouling. Pressure transducers on the vapor line are mandatory.

Dryout / CHF protection

If heat flux exceeds the critical heat flux (CHF) of the boiling surface, the cold plate can “dry out” — a vapor film forms between liquid and the hot surface, collapsing heat transfer. Tj will spike in seconds. PLC must trip the GPU before this happens.
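
One way to picture the trip logic is a guard that watches both the absolute junction temperature and its rate of rise, since dryout shows up as a sudden slope change before the absolute limit is reached. This is a hedged sketch, not a vendor algorithm: the class name and both thresholds are hypothetical, and a real implementation would live in the PLC or GPU firmware rather than in Python.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DryoutGuard:
    """Trips on high Tj or on a fast Tj rise (both thresholds are hypothetical)."""
    tj_limit_c: float = 95.0         # assumed absolute trip point
    max_rise_c_per_s: float = 5.0    # assumed rate-of-rise trip point
    _last_tj_c: Optional[float] = None

    def update(self, tj_c: float, dt_s: float) -> bool:
        """Feed one Tj sample; return True if the GPU should be tripped."""
        if self._last_tj_c is None:
            rate = 0.0               # no slope available on the first sample
        else:
            rate = (tj_c - self._last_tj_c) / dt_s
        self._last_tj_c = tj_c
        return tj_c >= self.tj_limit_c or rate >= self.max_rise_c_per_s

guard = DryoutGuard()
samples_c = [60.0, 60.5, 61.0, 68.0]   # last sample jumps 7 °C in one second
trips = [guard.update(t, dt_s=1.0) for t in samples_c]
print(trips)
```

The final sample trips on slope alone, well below the absolute limit, which is the point of adding a rate-of-rise check.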

Non-condensable gas (NCG) detection

Air leaking into a sub-atmospheric loop creates NCG pockets in the condenser, reducing effective area. Most 2P-DTC systems have an automatic vent or NCG purge cycle, with monitoring for purge events.

Charge / fluid inventory

Unlike water systems, you can't just top off from a city tap. Fluid loss = expensive refrigerant replacement. Continuous mass monitoring (via accumulator level or pressure-temperature correlation) catches slow leaks early.

WHERE 2P-DTC FITS IN THE ROADMAP

Today, single-phase direct-to-chip (water) is the production standard for GB200 NVL72 and equivalent 120 kW racks. 2P-DTC is the leading contender for the next density jump — 200+ kW racks with 1,500W+ GPUs (NVIDIA Rubin generation). It preserves the serviceability and familiar form factors of single-phase DTC while delivering the heat-flux handling of immersion. The main barriers are fluid cost, supply-chain constraints around non-PFAS refrigerants, and the relative newness of the vendor ecosystem compared to mature water-based cold plates.

What's next — emerging technologies pushing cooling forward

With GPU TDP heading toward 1,500W+ (NVIDIA Rubin) and rack densities exceeding 200 kW, even current liquid cooling approaches face limits. Here's where the industry and academia are pushing boundaries.

MICROFLUIDIC ON-CHIP COOLING
DARPA ICECool, Georgia Tech, EPFL

Instead of bolting a cold plate on top of the chip, etch micro-channels directly into the silicon die or interposer. Coolant flows microns away from the transistors. DARPA's ICECool program demonstrated >1 kW/cm² heat removal — 10× what conventional cold plates achieve. This eliminates TIM, IHS, and the thermal resistance stack entirely. Challenges: fabrication complexity, integrating fluid connections into chip packaging, reliability over billions of thermal cycles. Timeline: 5–10 years from production GPU integration.

THERMOELECTRIC (PELTIER) COOLING
Phononic, Laird Thermal, Intel Research

Solid-state heat pumps with no moving parts — apply voltage and one side gets cold. Current TECs are inefficient (COP ~0.5–1.5 vs mechanical chiller COP of 5–8), but new bismuth telluride nanostructures and thin-film designs are closing the gap. Use case: spot cooling for the hottest chip regions (hot-spot management) rather than bulk heat rejection. Intel has demonstrated embedded TEC layers that reduce Tj by 15°C at targeted hot spots. Could complement liquid cooling rather than replace it.

NANOFLUIDS & ADVANCED COOLANTS
MIT, Purdue, various national labs

Suspending nanoparticles (copper oxide, aluminum oxide, carbon nanotubes) in base fluids can improve thermal conductivity by 10–40%. Lab results show significant heat transfer improvements at low concentrations (1–5% by volume). Challenges: particle settling over time, abrasion of micro-channels, long-term stability, and cost. No production deployments yet in data centers, but active research funded by ARPA-E and DOE. Dielectric nanofluids could make immersion cooling significantly more effective.

WARM-WATER & WASTE HEAT REUSE
Lenovo Neptune, Meta, Nordic DCs

Running GPU supply water at 45–55°C (instead of the traditional 15–20°C) enables year-round free cooling — no chillers needed at all. The "waste" heat at 55–65°C return is hot enough for district heating. Nordic data centers (Meta Luleå, Google Hamina) already feed waste heat to municipal heating networks. Lenovo's Neptune platform runs at 50°C supply, achieving PUE <1.03. The efficiency gain is dramatic: eliminating chiller compressors removes 30–40% of total cooling plant energy. Challenge: GPU silicon must tolerate higher junction temperatures, and not all workloads perform identically at elevated temps.

REAR-DOOR HEAT EXCHANGERS (RDHx)
CoolIT, Motivair, ZutaCore

A middle-ground approach: chilled water coils with EC fans mounted on the rack rear door. Captures 60–100% of rack exhaust heat before it enters the room. Handles 30–50 kW/rack without liquid inside the server. Popular as a retrofit for air-cooled facilities adding GPU density. No cold plates, no QD fittings, no IT-loop plumbing — just chilled water to the door. Limitation: can't match direct-to-chip efficiency at 100+ kW/rack densities, and doesn't address chip-level hot spots.

THE TRAJECTORY

The industry consensus: direct-to-chip liquid cooling is the default for any new AI data center build. Air cooling is being relegated to edge, enterprise, and legacy workloads. The open questions are whether immersion or cold-plate wins for the densest deployments, whether on-chip microfluidics can reach production in time for the next GPU generation, and whether the industry can standardize fast enough to avoid vendor lock-in. Meanwhile, every 100W increase in GPU TDP makes the case for liquid cooling stronger — and the gap between air's ceiling and GPU demand wider.

When the data center leaves Earth — cooling, comms & controls in vacuum

Starcloud (formerly Lumen Orbit), Axiom Space, Lonestar Data Holdings, and several Chinese ventures are seriously proposing multi-megawatt compute clusters in low Earth orbit (LEO). The pitch: unlimited solar power, no water, no land, no NIMBY. The reality: the two hardest problems on Earth — power and cooling — get even harder in vacuum, and a third problem (communications) becomes load-bearing. Here's how each of those works without an atmosphere underneath you.

COOLING // RADIATION IS THE ONLY OPTION

Common misconception: “space is cold, so cooling is easy.” The opposite is true. With no air, there is no convection. With nothing in contact, there is no conduction. The only way to reject heat into space is by infrared radiation, and radiation is the weakest of the three heat-transfer modes by orders of magnitude.

THE STEFAN–BOLTZMANN LIMIT

Radiative heat flux: q = ε·σ·(T⁴ − T_space⁴)

Where ε ≈ 0.85–0.92 for good radiator coatings, σ is the Stefan-Boltzmann constant, and T_space ≈ 3 K (cosmic background) is effectively zero. The catch: T⁴ means radiator capacity drops fast as you try to operate cooler. A radiator at 50°C rejects roughly 600 W/m². At 20°C it rejects only ~420 W/m². (Both figures are per radiating side and take ε ≈ 1; real coatings deliver roughly 10% less.) That's why orbital data center designs always want to run radiators hot.
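
The formula is easy to evaluate yourself. The sketch below is a direct transcription of the equation above for one radiating side; it ignores absorbed sunlight, Earth infrared, and view-factor losses, all of which reduce the net figure in a real orbit. The function name and default arguments are illustrative.

```python
# Net radiated flux from one side of a radiator: q = eps * sigma * (T^4 - T_space^4)

SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant, W/(m^2 K^4)

def radiated_flux(temp_c: float, emissivity: float = 1.0, t_space_k: float = 3.0) -> float:
    """Radiated heat flux in W/m^2 for a surface at temp_c degrees Celsius."""
    temp_k = temp_c + 273.15
    return emissivity * SIGMA * (temp_k ** 4 - t_space_k ** 4)

for temp_c in (20.0, 50.0, 80.0):
    ideal = radiated_flux(temp_c)
    coated = radiated_flux(temp_c, emissivity=0.9)
    print(f"{temp_c:>4.0f} °C: ideal {ideal:6.0f} W/m^2, eps=0.9 {coated:6.0f} W/m^2")
```

With ε = 1 this gives about 419 W/m² at 20°C and about 618 W/m² at 50°C. The T_space term is negligible, which is why it is usually dropped.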

REAL NUMBERS
  • ● ISS: 14 ammonia radiator panels, ~75 kW total rejection
  • ● ISS radiator area: ~1,560 m²
  • ● Effective rejection: ~48 W/m² (avg over orbit)
  • ● A 5 MW orbital DC would need ~100,000 m² of radiator at the same efficiency
  • ● That's a square ~315 m × 315 m — bigger than 14 football fields
  • ● Solar panels for the same 5 MW: ~25,000 m² (4× smaller than the radiators)
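
The scaling in the list above can be reproduced in a few lines. This sketch uses only the figures already quoted (ISS rejection and area, the 5 MW target, and the 25,000 m² solar array) and assumes, as the list does, that a larger radiator would perform at the same per-area efficiency as the ISS.

```python
import math

# Figures quoted in the list above.
ISS_REJECTION_W = 75_000
ISS_RADIATOR_AREA_M2 = 1_560
TARGET_LOAD_W = 5_000_000
SOLAR_AREA_M2 = 25_000

# Orbit-averaged rejection per square metre of ISS radiator.
rejection_density = ISS_REJECTION_W / ISS_RADIATOR_AREA_M2

# Area needed for 5 MW at that same density, and the side of an equivalent square.
radiator_area = TARGET_LOAD_W / rejection_density
side = math.sqrt(radiator_area)
ratio = radiator_area / SOLAR_AREA_M2

print(f"Rejection density: {rejection_density:.1f} W/m^2")
print(f"Radiator area for 5 MW: {radiator_area:,.0f} m^2 (square of {side:.0f} m)")
print(f"Radiator area is {ratio:.1f}x the solar array area")
```

The unrounded result is 104,000 m² and a side of about 322 m, so the rounded figures in the list are slightly conservative.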
THE INSIGHT

In orbit, radiator area, not solar panel area, is the binding constraint. This flips terrestrial design intuition: on Earth power is expensive and cooling is cheap-ish. In orbit, power from the sun is essentially free, but every watt you generate must eventually be radiated, and radiator mass-to-orbit is the dominant cost driver.

HEAT TRANSPORT // GPU → RADIATOR
ORBITAL DATA CENTER — THERMAL ARCHITECTURE
[Diagram, heat path in order: GPU chassis (cold plates on dies, Tj ≤ 85°C target) → internal fluid loop (pumped water or PG inside the pressurized module) → heat exchanger (liquid–liquid IFHX) → external loop (ammonia or 2-phase CO₂) → deployable radiators (reject to 3 K space).]
Two loops because the working fluid for radiators (ammonia, NH₃) is toxic to crew / damaging to electronics, but has excellent thermal properties at −30°C to +40°C. Internal water loop stays safe; external ammonia loop does the heavy lifting.
Heat pipes & LHPs

Loop heat pipes (LHPs) move heat passively from chip → radiator using capillary-driven phase change. No pump, no moving parts, indefinite lifetime. The workhorse of spacecraft thermal control.

Two-phase ammonia

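NH₃ boils around −33°C at 1 atm but is operated at higher pressure in spacecraft to boil at 5–40°C. Its latent heat (~1,370 kJ/kg) is below water's (~2,260 kJ/kg) but more than 10× that of the fluorinated dielectrics used in terrestrial two-phase cooling (80–120 kJ/kg), which minimizes pump power and tubing mass.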

Radiator orientation

Radiators must face deep space, not the sun or Earth. Attitude control system (ACS) constantly slews the spacecraft to keep solar arrays sun-pointed and radiators edge-on to the sun — a coupled optimization problem unique to orbital DCs.

DATA // GETTING BITS UP AND DOWN

An orbital DC is useless if you can't move data to it and results back. There are three communication tiers, each with different bandwidth, latency, and standards.

Link Type | Tech | Bandwidth | Latency | Use Case
Ground ↔ Sat (RF) | Ka / Ku band | 10–40 Gbps | 5–20 ms (LEO) | Bulk uplink/downlink
Ground ↔ Sat (Optical) | Laser comm | 100 Gbps – 1 Tbps | 5–10 ms (LEO) | High-volume training data transfer
Sat ↔ Sat (Optical ISL) | Inter-satellite laser | 100–200 Gbps | <5 ms | Distributed compute mesh between DCs
TT&C (control) | S-band RF | kbps – Mbps | 10–50 ms | Telemetry, commands (separate from payload)
CCSDS — the OPC-UA of space

The Consultative Committee for Space Data Systems defines the standards every space agency and most commercial operators follow. Key protocols: Space Packet Protocol (datagram format), AOS/TM/TC (telemetry & telecommand framing), CFDP (file transfer with delay tolerance), DTN/Bundle Protocol (store-and-forward for intermittent links).

Ground station networks

A LEO satellite sees any one ground station for only ~5–10 minutes per pass. To get continuous downlink you need a global network: AWS Ground Station, Azure Orbital, KSAT, Viasat RTE. Optical-link networks (Starlink-style) provide always-on backhaul by routing through inter-satellite links to a sat currently over a station.

CONTROLS // YOU CAN'T SEND A TECHNICIAN

On Earth, the worst case for a misbehaving server is a remote-hands ticket. In orbit, there are no remote hands. Everything from thermal control to GPU error recovery must be autonomous, with ground operators reduced to high-level commanding and forensic analysis after the fact.

FDIR — Fault Detection, Isolation & Recovery

Hierarchical autonomy: every subsystem (power, thermal, comms, GPU cluster) runs local detection of out-of-limit conditions, isolates the fault, executes a recovery action (switch to redundant unit, safe-mode the affected zone, throttle GPUs to reduce heat), and reports up. Ground intervenes only when on-board recovery exhausts its playbook. Think of it as self-healing infrastructure as a hard requirement, not a nice-to-have.
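
The detect, recover, escalate pattern can be sketched as a small function. Everything specific here is hypothetical: the 85°C limit, the two recovery actions, and the temperature drop each one produces are placeholders chosen to show the control flow, not real flight software behavior.

```python
from typing import Callable, List, Tuple

RecoveryAction = Tuple[str, Callable[[float], float]]

def fdir_step(reading: float, limit: float,
              playbook: List[RecoveryAction]) -> Tuple[float, List[str]]:
    """Detect an out-of-limit reading, try each recovery action in order,
    and escalate to ground only when the playbook is exhausted."""
    log: List[str] = []
    if reading <= limit:
        return reading, log                      # detection: nothing to do
    log.append(f"fault detected: {reading:.1f} > {limit:.1f}")
    for name, action in playbook:
        reading = action(reading)                # apply one recovery action
        log.append(f"recovery '{name}' -> {reading:.1f}")
        if reading <= limit:
            log.append("recovered on board")
            return reading, log
    log.append("playbook exhausted: escalate to ground")
    return reading, log

# Hypothetical thermal playbook: each action lowers Tj by a fixed amount.
thermal_playbook: List[RecoveryAction] = [
    ("throttle GPUs", lambda tj: tj - 10.0),
    ("switch to redundant pump", lambda tj: tj - 15.0),
]

final_tj, events = fdir_step(100.0, 85.0, thermal_playbook)
print("\n".join(events))
```

A hotter starting point (say 120°C) runs through both actions without recovering and ends with the escalation entry, which is the only case where ground gets involved.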

Rad-hard flight computer

The control computer (BMS equivalent) runs on radiation-hardened silicon: BAE RAD750, Cobham GR740, or newer commercial-off-the-shelf + ECC + TMR (triple modular redundancy) designs. Flight software is typically built on a framework such as NASA cFS (Core Flight System) running on an RTOS like VxWorks or RTEMS. The GPUs themselves are commercial — but every command they receive routes through the rad-hard supervisor.
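
TMR is simple enough to show in one line of logic: keep three copies of a value and take a bitwise majority vote, so a bit flip in any single copy is outvoted by the other two. The sketch below illustrates the voting rule in software. In flight hardware the voter is implemented in logic gates, and the function name here is illustrative.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant copies: each output bit is 1
    when at least two of the three inputs have that bit set."""
    return (a & b) | (a & c) | (b & c)

good = 0b1010
flipped = good ^ 0b1000          # simulate a single-event upset in one copy

print(bin(tmr_vote(good, good, flipped)))
```

The vote masks one faulty copy per bit position. Two copies corrupted in the same bit would defeat it, which is why TMR is paired with periodic scrubbing to repair the bad copy before a second upset arrives.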

SEU / SEL handling

Cosmic rays cause single-event upsets (SEUs — bit flips) and single-event latch-ups (SELs — parasitic short circuits that draw excess current until power is cycled) in commercial silicon. Mitigations: ECC memory everywhere, periodic scrubbing of DRAM, latch-up current sensors that power-cycle affected blocks within microseconds, and software-level checkpoint/restart for training jobs.

Internal buses

Inside the spacecraft, the control plane runs on SpaceWire (200 Mbps) or SpaceFibre (multi-Gbps) for high-speed; MIL-STD-1553 (1 Mbps, dual-redundant) for safety-critical commands; CAN bus for low-rate sensor polling. Payload (the actual GPU cluster) uses standard 400G/800G Ethernet or InfiniBand, just shielded and qualified for vibration.

CRITICAL TELEMETRY POINTS
Thermal
  • ● Cold plate inlet/outlet T
  • ● Radiator panel T (each)
  • ● Heat pipe evaporator T
  • ● Ammonia loop pressure
  • ● GPU Tj (per die)
  • ● Sun/shade angle
Power
  • ● Solar array current/voltage (per string)
  • ● Battery state of charge
  • ● Battery cell voltages
  • ● Bus voltages (28V, 100V HVDC)
  • ● Eclipse prediction (orbit-driven)
  • ● Charge regulator status
Health & environment
  • ● SEU/SEL event counter
  • ● Radiation dose accumulator
  • ● Attitude (quaternion, rates)
  • ● Reaction wheel speeds
  • ● Propellant remaining
  • ● Comm link SNR / BER
WHO'S BUILDING THIS
Starcloud (formerly Lumen Orbit)

Proposing 5 GW orbital data centers powered entirely by solar. Pitched as the only way to scale compute past Earth's power-grid limits. First demonstrator: a small GPU cluster aboard a SpaceX rideshare.

Axiom Space

Building a commercial space station; orbital compute as a planned payload. Has an MOU with several hyperscalers for in-orbit data processing of Earth observation feeds.

Lonestar Data Holdings

Lunar data centers — leverages the Moon's vacuum + regolith shielding + cold permanently-shadowed craters for radiative cooling. Successfully booted data payloads on Intuitive Machines IM-1/IM-2 missions.

Thales Alenia Space / ESA ASCEND

EU-funded feasibility study (ASCEND — Advanced Space Cloud for European Net-zero emission and Data sovereignty) concluded orbital DCs are technically feasible and could be net-positive on CO₂ by 2050 if launch emissions drop.

REALITY CHECK

Orbital DCs solve power and (theoretically) cooling but introduce three first-principles problems: launch cost (still ~$1,500/kg even on Starship — and radiators are heavy), servicing impossibility (no swap-a-bad-DIMM in LEO), and orbital debris exposure (large radiator panels are meteoroid/debris bullseyes). The economics work only if launch drops below ~$200/kg and compute density per kg jumps another 5–10×. For an AI controls engineer this is a far-horizon problem — but the thermal physics and FDIR principles are directly applicable to edge / remote / unmanned terrestrial sites today.

CONTROLS, PROTOCOLS & AUTOMATION

PLC programming, SCADA/Ignition architecture, OPC-UA, Modbus, MQTT/Sparkplug B, BMS control loops, PID tuning, VFD sequencing, chiller staging logic, and safety interlocks are covered in depth in the Controls & Automation module below.

Proving the facility works — before IT load arrives

Commissioning is the systematic verification that every system performs per its specified requirements and design intent. In mission-critical facilities, it follows five levels — each building on the last. Skipping levels means discovering failures under live load, which can cost millions per hour.

L1
Factory Witness Testing — verify equipment at the manufacturer before shipping. Switchgear trip tests, UPS load bank, generator rated-load run, chiller performance curves. Catches defects before they're installed in concrete.
L2
Installation Verification — confirm equipment is installed per design: torque checks on bus connections, pipe pressure tests, insulation resistance (Megger), continuity, valve tag-to-P&ID match, controller I/O point-to-point checkout. Every sensor and actuator confirmed wired to the correct controller point.
L3
Startup & Vendor Testing — the installing contractor or OEM vendor energizes and starts each system individually. Motor rotation check, VFD programming, breaker coordination study verification, UPS static bypass test, chiller startup per OEM procedure, BMS point verification (sensor reads match field measurement within tolerance). Vendor provides startup reports and certifies equipment operational.
L4
Functional Performance Testing (CxA) — after vendor startup in L3, the independent commissioning agent (CxA) verifies each system performs per the Sequence of Operations (SOO). Tests control sequences, setpoints, alarm responses, failover logic, and edge cases the vendor wouldn't test. CxA writes test scripts, witnesses execution, and documents deficiencies. Every SOO sequence gets a pass/fail.
L5
Integrated Systems Testing (IST) — the big one. Multi-system failure scenarios under simulated or real load (load banks). Utility loss → ATS transfer → generator start → UPS carry-through. Chiller trip → backup chiller staging → temp recovery. BMS alarm propagation. IST scripts prove that mechanical, electrical, and controls work together under stress. Typical: 2–4 weeks for a large facility.

The commissioning agent (CxA) is independent from the installing contractor — they represent the owner's interest and verify the contractor's work. All deficiencies are tracked in an issues log with severity, responsible party, and resolution deadline.

Single pane of glass — from rack to C-suite

Data sources
  • BMS — HVAC status, temps, setpoints, alarms
  • EPMS — power at every distribution stage
  • CMDB — asset inventory, serial numbers, rack positions
  • Network management — switch port mapping, bandwidth
  • Rack PDU — per-server power, environmental sensors
  • CDU / liquid cooling — flow, temp, pressure, leak status
DCIM capabilities
  • Capacity planning — power, cooling, space available per rack/row/hall
  • Real-time PUE — computed from EPMS data, trended over time
  • What-if modeling — "can I add 20 racks to Hall B?"
  • Alarm correlation — root cause across mechanical + electrical
  • Change management — track every rack move/add/change
  • Compliance — SLA uptime reporting, environmental audit trail
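
The real-time PUE item in the list above is a simple ratio: total facility power divided by IT power. A minimal sketch follows, assuming the EPMS exposes both readings in kW; the sample values and names are hypothetical.

```python
from statistics import mean

def pue(total_facility_kw: float, it_load_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    if it_load_kw <= 0:
        raise ValueError("IT load must be positive to compute PUE")
    return total_facility_kw / it_load_kw

# Hypothetical EPMS samples: (total facility kW, IT load kW).
samples = [(1_150.0, 1_000.0), (1_120.0, 1_000.0), (1_210.0, 1_100.0)]

readings = [pue(total, it) for total, it in samples]
print([f"{r:.3f}" for r in readings])
print(f"Trended average PUE: {mean(readings):.3f}")
```

Trending the ratio over time matters more than any single reading, because PUE moves with outdoor temperature and IT load.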

In AI factories, DCIM is evolving toward real-time digital twins and ML-driven optimization — adjusting cooling setpoints and power distribution automatically based on predicted GPU workload patterns.

Why training is bandwidth-bound

Every training step, gradients computed on each GPU must be summed across the entire cluster — an all-reduce over billions of parameters. If the network can't keep up, GPUs sit idle. That's why hyperscalers build dedicated rail-optimized fabrics with one 800 Gb/s NIC per GPU.
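
The cluster-wide gradient sum is usually implemented as an all-reduce. A back-of-envelope estimate shows why link speed matters. The sketch below assumes a bandwidth-optimal ring all-reduce, in which each GPU sends 2·(N−1)/N times the gradient size. The 70-billion-parameter model, 2-byte gradients, and 1,024 GPUs are hypothetical, and latency, sharding, and compute/communication overlap are ignored.

```python
def ring_allreduce_seconds(num_params: float, bytes_per_param: int,
                           num_gpus: int, link_gbps: float) -> float:
    """Ideal time for one ring all-reduce over a full gradient buffer.

    Each GPU sends 2 * (N - 1) / N times the buffer size over its own link.
    """
    payload_bytes = num_params * bytes_per_param
    traffic_bytes = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    link_bytes_per_s = link_gbps * 1e9 / 8       # Gb/s -> bytes/s
    return traffic_bytes / link_bytes_per_s

# Hypothetical job: 70e9 parameters, 2-byte gradients, 1,024 GPUs.
for gbps in (100.0, 400.0, 800.0):
    t = ring_allreduce_seconds(70e9, 2, 1024, gbps)
    print(f"{gbps:>5.0f} Gb/s per GPU: {t:.2f} s per all-reduce")
```

Even at 800 Gb/s this idealized exchange takes a few seconds per step if nothing is overlapped, which is why real training stacks shard gradients and overlap communication with the backward pass.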

The data pipeline

Trillions of pre-tokenized tokens are sharded across a parallel filesystem (, or vendor-specific). Data loaders stream shards into GPU memory while the previous batch is still being processed — overlap is everything.
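
The overlap idea can be sketched with a background thread and a bounded queue: the loader fetches the next shards while the consumer is still working on the current one. This is an illustration of the pattern only. Production data loaders use multiple worker processes and pinned memory, and the names below are hypothetical.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator, TypeVar

S = TypeVar("S")
B = TypeVar("B")

def prefetching_loader(shards: Iterable[S], load: Callable[[S], B],
                       depth: int = 2) -> Iterator[B]:
    """Yield loaded batches while a background thread loads up to `depth` ahead."""
    buffer: queue.Queue = queue.Queue(maxsize=depth)
    done = object()                      # sentinel marking the end of the stream

    def worker() -> None:
        for shard in shards:
            buffer.put(load(shard))      # blocks when the buffer is full
        buffer.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buffer.get()              # blocks only if loading fell behind
        if item is done:
            return
        yield item

# Stand-in for reading a shard from the parallel filesystem.
batches = list(prefetching_loader(["shard-0", "shard-1", "shard-2"],
                                  lambda name: name.upper()))
print(batches)
```

The `depth` parameter is the knob: too small and the GPU waits on storage, too large and host memory fills with batches that are not needed yet.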

Fire, leak, EPO

VESDA (laser-based air sampling) for earliest smoke detection. Clean-agent suppression (FM-200, Novec 1230) for IT rooms. Pre-action sprinklers (double interlock) for high-value spaces. EPO (Emergency Power Off) kills entire zones — controversial due to nuisance activation risk but code-required in many jurisdictions. Leak detection cables along every pipe route and under raised floors.

You Can't Manage What You Can't Measure

Why industrial monitoring matters

An AI datacenter isn't just a building with servers — it's an industrial plant running at the thermal and electrical edge of what physics allows. A single 120 kW GPU rack operates at power densities that would be classified as heavy industrial in any other context. When a chiller valve fails at 2 AM and rack inlet temperatures start climbing, you have minutes — not hours — before thermal throttling degrades a multi-million-dollar training run.

This is why every serious datacenter operator invests as heavily in monitoring and control systems as they do in the compute hardware itself. The monitoring infrastructure is the operational infrastructure. Without it, you're flying blind through a thunderstorm.

The control stack in a mission-critical facility is itself a layered architecture — field devices at the bottom, supervisory systems in the middle, and DCIM analytics at the top. Each layer serves a different time scale: PLCs react in milliseconds, BMS loops in seconds, DCIM analytics in minutes to hours.

From field device to operations center

Level 4/5 — Enterprise Network

IT network: DCIM, enterprise analytics, cloud dashboards, ERP. Ignition Cloud Edition, Cirrus Link cloud injectors (AWS/Azure/GCP), Sepasoft ERP connectors, Omniverse digital twins. Directory services, DNS, mail servers.

Level 3 — Operations Systems

Central Ignition Gateway (on-prem), SQL historian, MES, Sepasoft modules (SPC, batch, track & trace, OEE). Plant-wide operations, scheduling, and reporting.

Level 2 — Process (SCADA / HMI)

Local servers, clients (Perspective/Vision), Ignition Edge gateways (IIoT + Panel), Cirrus Link modules. Live one-line diagrams, alarming, trending, reporting. Operator commands flow down from here.

Level 1 — Control (Intelligent Devices)

PLCs, RTUs, IEDs, power meters (Schneider ION/PM), VFDs, motor starters — any device with onboard logic or a processor. Executing closed-loop control: valve modulation, speed control, generator paralleling, UPS bypass. Communicating via Modbus TCP, OPC-UA, and other industrial protocols.

Level 0 — Field Devices (Unintelligent / Physical)

Dumb field instruments with no onboard processing: temperature sensors (RTDs, thermistors), pressure transducers, DP sensors, flow switches, limit switches, contact closures, leak detection cables, smoke detectors, control valves, damper actuators. Output is a raw signal (4–20 mA, 0–10 V, dry contact).

How many alarms does a datacenter generate?

[Interactive alarm counter. Values as displayed, labels not recoverable: 500, 20.8, 5,600, 6,200]

Without alarm rationalization, operators drown in noise. Best practice: <1 actionable alarm per operator per 10 minutes (ISA-18.2).
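
To compare a site against that benchmark, convert a daily alarm count into alarms per operator per 10-minute window. The sketch below uses a hypothetical 500 alarms per day; the function name and staffing levels are illustrative.

```python
TEN_MINUTE_WINDOWS_PER_DAY = 24 * 6   # 144 windows in a day
BENCHMARK_PER_WINDOW = 1.0            # benchmark quoted above

def alarms_per_operator_window(alarms_per_day: float, operators: int = 1) -> float:
    """Average alarms each operator sees per 10-minute window."""
    return alarms_per_day / operators / TEN_MINUTE_WINDOWS_PER_DAY

# Hypothetical site: 500 alarms per day.
for staff in (1, 2, 4):
    rate = alarms_per_operator_window(500, operators=staff)
    status = "within" if rate < BENCHMARK_PER_WINDOW else "above"
    print(f"{staff} operator(s): {rate:.2f} per 10 min ({status} benchmark)")
```

Adding operators is the expensive fix. Rationalization (removing duplicate, chattering, and non-actionable alarms) attacks the numerator instead.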

The workhorse of industrial automation

A PLC (programmable logic controller) is an industrial digital computer purpose-built for real-time control. Where a server runs an OS and applications, a PLC runs a deterministic scan cycle: (1) read all inputs, (2) execute the control program, (3) update all outputs, (4) handle communications. Typical scan time: 1–20 ms — fast enough to catch a pump cavitation event before damage occurs.
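
The four-step scan cycle can be mimicked in a few lines. This is a teaching sketch, not how a PLC runtime is built: the callbacks, the dictionary-based I/O image, and the 10 ms period are assumptions. Python offers no real-time guarantees, which is exactly why this logic runs on a PLC in practice.

```python
import time
from typing import Callable, Dict

IOImage = Dict[str, bool]

def run_scans(read_inputs: Callable[[], IOImage],
              program: Callable[[IOImage], IOImage],
              write_outputs: Callable[[IOImage], None],
              scans: int, period_s: float = 0.01) -> None:
    """Run a fixed number of scan cycles at a target period."""
    for _ in range(scans):
        start = time.monotonic()
        image = dict(read_inputs())      # 1. snapshot all inputs once
        outputs = program(image)         # 2. execute logic on the snapshot
        write_outputs(outputs)           # 3. update all outputs together
        # 4. communications would be serviced here
        remaining = period_s - (time.monotonic() - start)
        if remaining > 0:
            time.sleep(remaining)        # idle until the next scan is due

history = []
run_scans(
    read_inputs=lambda: {"level_high": True},
    program=lambda image: {"drain_pump": image["level_high"]},
    write_outputs=history.append,
    scans=3,
)
print(history)
```

The key property is the input snapshot in step 1: the program sees one consistent picture of the plant for the whole scan, even if a sensor changes halfway through.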

CPU — executes the control program (ladder logic, structured text, function blocks)
I/O Modules — digital inputs (24 VDC switches), digital outputs (relays, solenoids), analog I/O (4–20 mA, 0–10 V)
Power Supply — converts AC to 24 VDC for the rack and field devices
Backplane — high-speed internal bus connecting CPU to I/O cards
Comms Modules — Ethernet/IP, PROFINET, EtherCAT, serial interfaces

For NVIDIA data centers, expect Beckhoff TwinCAT or Siemens TIA Portal for mechanical controls, and Schneider EcoStruxure for power/BMS.

How PLCs are programmed

PLC programming uses IEC 61131-3 languages. The two most common:

Ladder Logic

Visual language resembling electrical relay circuits. Series contacts = AND, parallel contacts = OR. Core elements: NO/NC contacts, output coils, timers (TON/TOF), counters (CTU/CTD), latch/unlatch.

  Start pump when conditions met:
  ─┤Start_PB├──┤NOT E_Stop├──┤Level_OK├──(Pump_Run)─
       NO          NC            NO         COIL

  Seal-in circuit:
  ─┬─┤Start_PB├─┬─┤NOT E_Stop├──(Pump_Run)─
   └─┤Pump_Run├─┘
        (seal)
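
To make the seal-in rung concrete, here is a minimal Python sketch that evaluates the same Boolean logic once per simulated scan. It is not PLC code; the function name and the input sequence are assumptions chosen for illustration.

```python
def pump_rung(start_pb, e_stop, pump_run_prev):
    """Seal-in rung: (Start_PB OR Pump_Run) AND NOT E_Stop."""
    return (start_pb or pump_run_prev) and not e_stop


# Each tuple is one scan's inputs: (Start_PB, E_Stop)
scans = [
    (False, False),  # idle
    (True, False),   # operator presses Start
    (False, False),  # button released: the seal contact holds the pump on
    (False, True),   # E-Stop pressed: the rung drops out
    (False, False),  # E-Stop released: pump stays off until Start is pressed
]

pump_run = False
history = []
for start_pb, e_stop in scans:
    pump_run = pump_rung(start_pb, e_stop, pump_run)
    history.append(pump_run)

print(history)
```

The third scan is the interesting one: the start button is released, yet the output stays on because its own previous state is fed back in parallel with the button.
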
Structured Text

High-level language similar to Pascal. Preferred for complex math, state machines, and data manipulation.

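(* Call the timer every scan: a TON only advances when it is executed *)
Stage_Timer(IN := ChW_Supply > Setpoint + Deadband, PT := T#300s);

IF ChW_Supply > Setpoint + Deadband THEN
  IF NOT Chiller_1_Running THEN
    Chiller_1_Start := TRUE;   (* stage 1: start the lead chiller *)
  ELSIF Stage_Timer.Q THEN
    Chiller_2_Start := TRUE;   (* stage 2: still warm after 300 s *)
  END_IF;
END_IF;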

The operator's window into the plant

SCADA (Supervisory Control and Data Acquisition) provides a centralized interface to monitor and control industrial processes. The architecture:

Gateway/Server — central hub communicating with PLCs via OPC-UA or native drivers. Manages tags, stores config, runs scripts.
Clients/HMI — operator screens showing process schematics, alarm lists, trend charts. Web-based or desktop-launched.
Historian — time-series database logging tag values at configurable intervals. Powers trending, reporting, and analytics.
Alarm Pipeline — escalation via email, SMS, voice. Alarm journaling to database for audit and analysis.
Tag Types
OPC Tag — reads/writes directly from a PLC register
Memory Tag — internal to SCADA, no device binding
Expression Tag — computed from other tags via formula
Derived Tag — rolling avg, rate-of-change, aggregates

The platform NVIDIA uses

Ignition is a modern, Java-based SCADA platform with a web gateway model. Key differentiators: unlimited licensing (one server price, unlimited clients and tags), cross-platform (Windows/Linux/macOS), modular architecture, and Python scripting via Jython.

Perspective: HTML5 mobile-responsive HMI — the modern standard
Vision: Classic Java desktop client with rich component library
Historian: Logs tag values to SQL database with deadband and scan rates
Alarming: Escalation pipelines: email, SMS, voice + journal to DB
Reporting: Automated PDF/Excel reports on schedule or event
MQTT: Sparkplug B edge-to-cloud data publishing
# Ignition scripting (Jython)
temp = system.tag.readBlocking(
  ["[default]Chiller_1/ChW_Supply_Temp"]
)[0].value

system.tag.writeBlocking(
  ["[default]Chiller_1/Setpoint"], [42.0]
)
Gateway Architecture

Ignition runs as a single gateway service — a web server hosting a designer IDE, client sessions, device connections, and module runtime all from one process. The gateway exposes a web UI on port 8088 (HTTP) or 8043 (HTTPS) for admin, and serves Perspective sessions to any browser. Unlike legacy SCADA, there are no seat licenses: you buy one gateway, attach unlimited screens, clients, and tags.

Gateway Network — multiple Ignition gateways can mesh together across sites. Tag data, alarm states, and historian records flow between gateways automatically. A central gateway can pull data from edge gateways at each building or campus.
Edge Edition — lightweight version for embedded PCs and edge hardware. Runs the OPC-UA server, local historian, and MQTT transmission — syncs upstream via Store & Forward if the WAN link drops.
Redundancy — active/standby gateway pairs with automatic failover. The standby mirrors tags, history, and alarm state in real time. Failover is typically <30 seconds.
Tag Model & UDTs

Tags are Ignition's core abstraction — every data point in the system is a tag. Tags live in a hierarchical folder structure and can be organized by system, building, or equipment.

UDT (User Defined Type) — the real power of Ignition at scale. A UDT is a tag template: define once (e.g., “Chiller” with 40 member tags for temps, pressures, alarms, status), then stamp out instances. Change the template → every instance updates. In a datacenter with 50 identical CRAHs, this turns 2,000 tags into 50 UDT instances with consistent naming and alarm config.
Tag History — any tag can be historized with a click. Configure scan class (100 ms to 1 hr), deadband (absolute or percentage), and storage destination (SQL, InfluxDB, or Ignition's internal DB). Partitioned tables roll automatically for long-term retention.
Expression & Script Tags — computed values (PUE = Total_kW / IT_kW) or Python-driven logic that fires on change. Useful for derived metrics, unit conversions, and cross-system calculations.
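
As an illustration of the kind of calculation an expression or script tag might hold, the sketch below computes PUE in plain Python. PUE is total facility power divided by IT power, so it is always at least 1.0. The function name and the sample readings are assumptions; in Ignition the inputs would come from tag reads rather than literals.

```python
def pue(total_facility_kw, it_kw):
    """Power Usage Effectiveness: total facility power divided by IT power."""
    if it_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_kw


# Hypothetical readings: 5,200 kW at the utility meter, 4,000 kW of IT load
value = pue(total_facility_kw=5200.0, it_kw=4000.0)
print(round(value, 2))
```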
Perspective HMI

Perspective is Ignition's HTML5 visualization module — the successor to Vision. Operators open a browser tab (or a native Perspective Workstation app), authenticate via SAML/LDAP/AD, and see live plant graphics. Key concepts:

Views — individual screens (one-line, chiller plant, alarm summary)
Components — drag-and-drop: gauges, charts, power bars, SVG overlays
Bindings — wire a component property to a tag value (live update)
Styles / Themes — dark mode, responsive layouts, role-based views
Sessions — each browser tab is a session; identity-aware with RBAC
Embedding — iFrame Perspective inside DCIM or NOC dashboards

How equipment talks to each other

A single datacenter may contain equipment from 15+ vendors. Getting them to communicate reliably is one of the hardest integration challenges in the industry. These are the core protocols:

OPC-UA — The Modern Standard

OPC-UA (OPC Unified Architecture) replaces legacy COM/DCOM-based OPC with a platform-independent, secure protocol. The server exposes a hierarchical node tree (Objects → Variables → Methods). Clients browse, read/write, and subscribe to changes — subscriptions push data on change, eliminating polling overhead. Built-in TLS encryption, certificate-based auth.

Primary standard for PLC-to-SCADA and SCADA-to-IT integration. Ignition's OPC-UA server is a key integration point.

Address Space — OPC-UA organizes all data into a unified address space of typed nodes. Each node has a NodeId (namespace + identifier), a BrowseName, and a set of attributes. Variables hold values (temperature, status), Objects group related variables (a “Chiller” object containing supply temp, return temp, status, runtime hours), and Methods expose callable actions (start, stop, reset). This rich data model means a client can discover the full structure of a PLC program just by browsing — no register map spreadsheets needed.
Subscriptions & Monitored Items — instead of polling every register every second, the client tells the server: “notify me when Chiller_1/Supply_Temp changes by more than 0.5°F.” The server batches change notifications and publishes them at a configurable interval (e.g., 500 ms). This drastically reduces network traffic — in a 10,000-point system only the ~5% of values that actually changed get transmitted each cycle.
Security Model — three layers: Transport (TLS 1.2+), Session (username/password or X.509 certificate), and Application (certificate trust lists). The server and client exchange certificates on first connection. Security policies: None (lab only), Basic256Sha256, and Aes128_Sha256_RsaOaep. In production data centers, mutual certificate auth is required — no anonymous access.
Ignition as OPC-UA Server — Ignition exposes its entire tag tree as an OPC-UA server (default port 62541). This means any OPC-UA client — a DCIM platform, a Python analytics script, a digital twin — can connect and read every tag in Ignition without any Ignition-specific SDK. It's also an OPC-UA client, connecting upstream to PLCs (Siemens, Beckhoff, Allen-Bradley) that expose their own OPC-UA servers. This dual role makes Ignition the integration hub between OT and IT.
Companion Specifications — industry groups define standard OPC-UA information models for specific equipment: PLCopen for motion, FDI for field devices, and emerging specs for data center infrastructure. When a chiller vendor implements the companion spec, their OPC-UA server exposes data with standardized names and types — true plug-and-play interoperability instead of custom register mapping per vendor.
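
The deadband idea behind monitored items can be sketched without any OPC-UA library. The class below is a conceptual stand-in, not an OPC-UA API: it reports a sample only when it differs from the last reported value by more than the deadband. The class name and the sample values are assumptions for illustration.

```python
class DeadbandFilter:
    """Report a value only when it moves more than `deadband` from the last report."""

    def __init__(self, deadband):
        self.deadband = deadband
        self.last_reported = None

    def update(self, value):
        """Return True if this sample should be sent to the subscriber."""
        if self.last_reported is None or abs(value - self.last_reported) > self.deadband:
            self.last_reported = value
            return True
        return False


# Supply temperature samples in °F, filtered with a 0.5 °F deadband
monitored = DeadbandFilter(deadband=0.5)
samples = [42.0, 42.2, 42.4, 42.6, 42.7, 43.2]
reported = [v for v in samples if monitored.update(v)]
print(reported)
```

Only three of the six samples cross the deadband, which is where the reduction in network traffic comes from.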
Modbus TCP/RTU — The Universal Fallback

The simplest and most widely deployed industrial protocol (since 1979). Modbus TCP runs over Ethernet port 502; Modbus RTU runs over RS-485 serial. Data organized as registers: Coils (R/W bits), Discrete Inputs (read-only bits), Input Registers (read-only 16-bit), Holding Registers (R/W 16-bit). Common function codes: FC03 (read holding), FC04 (read input), FC06 (write single register), FC16 (write multiple).

Every power meter, VFD, and simple sensor speaks Modbus. No security features — relies on network segmentation.
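
To show how little there is to a Modbus TCP exchange, the sketch below builds an FC03 request frame with the standard library and decodes two registers into a float. The helper names are hypothetical, and the big-endian word order in `registers_to_float` is an assumption: word order varies by device, so check the meter's register map.

```python
import struct


def build_read_holding_request(transaction_id, unit_id, start_address, quantity):
    """Build a Modbus TCP FC03 (read holding registers) request frame."""
    # PDU: function code, starting address, number of registers
    pdu = struct.pack(">BHH", 0x03, start_address, quantity)
    # MBAP header: transaction id, protocol id (always 0), length, unit id.
    # The length field counts the unit id byte plus the PDU.
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu


def registers_to_float(high_word, low_word):
    """Combine two 16-bit registers into a float32 (big-endian word order assumed)."""
    return struct.unpack(">f", struct.pack(">HH", high_word, low_word))[0]


frame = build_read_holding_request(transaction_id=1, unit_id=1, start_address=0, quantity=2)
print(frame.hex())
print(registers_to_float(0x4228, 0x0000))
```

Nothing in the frame carries credentials or a signature, which is why the protocol depends on network segmentation for protection.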

MQTT — Message Queuing Telemetry Transport

MQTT is a lightweight publish/subscribe messaging protocol designed for constrained networks. Unlike request/response protocols (HTTP, Modbus), MQTT decouples producers from consumers — publishers don't need to know who's listening, and subscribers don't need to know who's sending. This makes it ideal for IIoT and data center monitoring where thousands of sensors feed data to multiple consumers (SCADA, DCIM, analytics, digital twins).

Core Architecture
Broker

Central message router. All clients connect to the broker — never directly to each other. The broker receives published messages, filters by topic, and delivers to matching subscribers. Examples: HiveMQ (enterprise), Mosquitto (open-source), EMQX (high-scale).

Topics

Hierarchical UTF-8 strings using / as delimiter. Example: site/bldg-A/mech/chiller/CH-1/chwst. Wildcards: + matches one level, # matches all remaining levels. site/bldg-A/mech/+/+/chwst gets all chiller supply temps.

Publish / Subscribe

A client publishes a message to a topic. Any client subscribed to that topic (or a matching wildcard) receives it. One publisher can feed many subscribers. Many publishers can feed one subscriber. No coupling.

Quality of Service (QoS) Levels
Level | Guarantee | Handshake | DC Use Case
QoS 0 | At most once (fire & forget) | None — send and move on | High-frequency sensor telemetry (temp every 1s). Losing one reading is fine — next one is 1s away.
QoS 1 | At least once | PUBACK from broker | Alarm notifications, setpoint changes. Must arrive but duplicates are tolerable. Most common in IIoT.
QoS 2 | Exactly once | 4-step handshake (PUBREC/PUBREL/PUBCOMP) | Billing/metering data, command acknowledgments. High overhead — use sparingly.
Key Mechanisms
Retained Messages

Broker stores the last message published to a topic with the “retain” flag. When a new subscriber connects, it immediately gets the retained message — no waiting for the next publish cycle. Critical for DCIM: a new dashboard instance instantly shows current values instead of blank until the next sensor poll.

Last Will & Testament (LWT)

On connect, a client registers a “will” message with the broker. If the client disconnects unexpectedly (network drop, crash), the broker publishes the will message on its behalf. Used to signal device offline status — e.g., site/bldg-A/edge-gw-1/status → OFFLINE. Sparkplug B formalizes this as “death certificates.”

Persistent Sessions

With cleanSession=false, the broker stores a client's subscriptions and queues messages while it's offline. When the client reconnects, it receives all missed messages. Essential for edge gateways that may lose WAN connectivity — combined with Ignition's Store & Forward for zero data loss.

Topic Design — Data Center Pattern

Good topic design enables flexible subscriptions. A well-designed hierarchy lets DCIM subscribe to everything, while a cooling engineer subscribes only to their building's mechanical data.

# Topic hierarchy pattern:
{site}/{building}/{system}/{subsystem}/{equipment}/{point}

# Examples:
site-sv1/bldg-A/mech/chiller/CH-1/chwst        → 42.1°F
site-sv1/bldg-A/mech/chiller/CH-1/status        → RUNNING
site-sv1/bldg-A/mech/ct/CT-2/fan-speed          → 72%
site-sv1/bldg-A/elec/swgr/MSB-1/kw              → 4250
site-sv1/bldg-A/elec/pdu/PDU-R3-A/phase-a-amps  → 82.4
site-sv1/bldg-A/env/row-12/rack-4/inlet-temp     → 74.8°F

# Subscription examples:
site-sv1/bldg-A/mech/#           → ALL mech data, bldg A
site-sv1/+/elec/swgr/+/kw        → ALL switchgear kW, all bldgs
site-sv1/bldg-A/mech/chiller/+/chwst  → ALL chiller supply temps
#                                → EVERYTHING (DCIM firehose)
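
The wildcard rules behind those subscriptions can be checked with a small matcher. This is a simplified sketch of MQTT topic-filter matching, not a broker implementation: the function name is an assumption, and it ignores special cases such as topics beginning with `$`.

```python
def topic_matches(subscription, topic):
    """Return True if an MQTT topic filter matches a concrete topic name."""
    sub_levels = subscription.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(sub_levels):
        if level == "#":
            # Multi-level wildcard: only valid as the last level,
            # and it matches everything from here down
            return i == len(sub_levels) - 1
        if i >= len(topic_levels):
            return False  # filter is longer than the topic
        if level != "+" and level != topic_levels[i]:
            return False  # literal level does not match
    return len(sub_levels) == len(topic_levels)


chwst = "site-sv1/bldg-A/mech/chiller/CH-1/chwst"
print(topic_matches("site-sv1/bldg-A/mech/#", chwst))
print(topic_matches("site-sv1/+/elec/swgr/+/kw", chwst))
```

The first call matches because `#` absorbs the three remaining levels; the second does not, because the literal level `elec` differs from `mech`.
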
Why MQTT Beats Polling for IIoT
Aspect | Polling (Modbus/OPC-UA) | Pub/Sub (MQTT)
Data flow | SCADA asks each device on schedule | Devices publish when data changes or on interval
Bandwidth | Constant — polls even when nothing changes | Proportional to actual change rate
Latency | Worst-case = poll interval | Near-instant on change
Scale | More devices = slower cycle | Broker handles 100k+ connections
WAN friendly | Fragile over unreliable links | Built for constrained/intermittent networks
Adding consumers | New connection per consumer | Just subscribe — no impact on publisher

Note: OPC-UA also supports subscriptions (server pushes on change) — but MQTT is purpose-built for the edge-to-cloud segment where OPC-UA's rich data model isn't needed.

Sparkplug B — The IIoT Standard on MQTT

Raw MQTT is payload-agnostic — it delivers bytes without caring what they mean. Sparkplug B (maintained by the Eclipse Foundation) adds an application-layer specification that standardizes how industrial data is structured, encoded, and managed over MQTT. It turns MQTT from a transport protocol into a complete IIoT data infrastructure.

Standardized Topic Namespace
spBv1.0/{group_id}/{msg_type}/{edge_node_id}/{device_id}

# Message types:
  NBIRTH  — Edge node comes online (all metrics)
  NDEATH  — Edge node goes offline
  NDATA   — Edge node metric updates
  DBIRTH  — Device comes online
  DDEATH  — Device goes offline
  DDATA   — Device metric updates
  NCMD    — Command to edge node
  DCMD    — Command to device

# Example:
spBv1.0/NVIDIA-SV1/DDATA/EDGE-GW-BLDG-A/CH-1
→ Chiller 1 data from Building A edge gateway
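
A consumer usually needs to split such a topic back into its parts. The parser below is a minimal sketch covering only the eight message types listed above; the class and function names are assumptions, and a production client would rely on a Sparkplug library instead.

```python
from typing import NamedTuple, Optional

NODE_TYPES = {"NBIRTH", "NDEATH", "NDATA", "NCMD"}
DEVICE_TYPES = {"DBIRTH", "DDEATH", "DDATA", "DCMD"}


class SparkplugTopic(NamedTuple):
    group_id: str
    msg_type: str
    edge_node_id: str
    device_id: Optional[str]  # None for node-level messages


def parse_sparkplug_topic(topic):
    """Split a Sparkplug B topic into its namespace elements."""
    parts = topic.split("/")
    if len(parts) not in (4, 5) or parts[0] != "spBv1.0":
        raise ValueError("not a Sparkplug B topic: " + topic)
    group_id, msg_type, edge_node_id = parts[1:4]
    device_id = parts[4] if len(parts) == 5 else None
    if msg_type in DEVICE_TYPES and device_id is None:
        raise ValueError("device message needs a device_id")
    if msg_type in NODE_TYPES and device_id is not None:
        raise ValueError("node message must not carry a device_id")
    if msg_type not in NODE_TYPES | DEVICE_TYPES:
        raise ValueError("unknown message type: " + msg_type)
    return SparkplugTopic(group_id, msg_type, edge_node_id, device_id)


parsed = parse_sparkplug_topic("spBv1.0/NVIDIA-SV1/DDATA/EDGE-GW-BLDG-A/CH-1")
print(parsed)
```
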
Birth & Death Certificates

When an edge node connects, it publishes an NBIRTH message containing ALL of its metrics with metadata (name, datatype, engineering units, alias). This acts as a self-describing schema — any subscriber immediately knows what data this node provides.

The node also registers an LWT with the broker: an NDEATH message. If the node drops, the broker publishes NDEATH. Every consumer knows instantly that this node's data is stale.

Why it matters: In raw MQTT, if a device just stops publishing, consumers don't know if the device is offline or if the value just hasn't changed. Sparkplug B eliminates this ambiguity.

Protobuf Encoding

Sparkplug B uses Google Protocol Buffers (Protobuf) for payload encoding instead of JSON or XML. Result: 3–10× smaller payloads, faster serialization/deserialization. Each metric carries: name, alias (numeric shorthand), timestamp, datatype, and value. After BIRTH, DDATA messages send only changed metrics using aliases — extremely efficient.

Ignition MQTT Modules

Ignition implements Sparkplug B via Cirrus Link modules:

  • MQTT Engine — On the central gateway. Subscribes to the broker and auto-creates tags from BIRTH messages.
  • MQTT Transmission — On edge gateways. Publishes OPC-UA tag data as Sparkplug B messages to the broker.
  • MQTT Distributor — Optional built-in MQTT broker within Ignition (for simpler deployments without a standalone broker).
MQTT Security — Data Center Requirements
Transport Security
  • TLS 1.2+ — Encrypts all traffic between clients and broker. Port 8883 (MQTTS) instead of 1883.
  • Mutual TLS (mTLS) — Both client and broker present certificates. Standard for IIoT.
  • Certificate rotation — Edge devices need automated cert renewal (SCEP, EST, or custom CA).
Access Control
  • Username/password — Basic auth (always over TLS).
  • ACLs — Topic-level read/write permissions per client. E.g., an edge gateway can only publish to its own subtree.
  • Client ID validation — Broker enforces unique client IDs; duplicate connection = kick the old one.
  • DMZ placement — Broker sits in the IT/OT DMZ. OT side publishes in; IT side subscribes out. No inbound connections from IT to OT.
Broker High Availability

A single broker is a single point of failure. Production deployments use clustered brokers (HiveMQ cluster, EMQX cluster) or active/standby pairs with shared persistent storage. Clients configure multiple broker endpoints and reconnect automatically on failure. Sparkplug B's birth/death mechanism ensures state is rebuilt after any failover.

BACnet IP — Building Automation Crossover

BACnet/IP runs over UDP port 47808 (0xBAC0). Data organized as objects: Analog Input/Output/Value, Binary I/O/V, Multi-State, Trend Log, Schedule. Each object has properties (Present-Value, Status-Flags, Description). Supports COV (Change of Value) subscriptions. Many data centers still use BACnet for HVAC and BMS. Gateways translate BACnet ↔ Modbus or BACnet ↔ OPC-UA at system boundaries.

Feature | OPC-UA | Modbus TCP | MQTT | BACnet IP
Model | Client/Server | Master/Slave | Pub/Sub | Client/Server
Security | Built-in TLS | None native | TLS optional | Limited
Data Model | Rich (typed nodes) | Flat registers | Payload agnostic | Object-based
Best For | PLC↔SCADA | Meters, simple I/O | IIoT, edge, cloud | Building HVAC
Scalability | Excellent | Limited (247 dev) | Excellent | Good
The protocol zoo in practice
Power meters → Modbus TCP
UPS systems → Modbus
Cooling plant → BACnet IP
Generators → Modbus RTU / CAN
Fire alarm → Proprietary serial
Leak detection → Dry contacts / Modbus
PDUs → REST API
Edge/cloud → MQTT Sparkplug B

Electrical Power Monitoring

Every branch circuit and every bus section has a power meter reporting voltage, current, kW, kWh, power factor, and THD in real time. This data feeds PUE calculations, capacity forecasting, and, critically, fault detection. A 2% phase imbalance caught by the EPMS (electrical power monitoring system) today can prevent a busbar failure next month.

Common platforms: Schneider ION meters, SEL, Dranetz. Communication via Modbus TCP/RTU. Energy dashboards aggregate for billing, allocation, and sub-metering by tenant or department.

Cooling Plant Control

The BMS (building management system) controls chiller staging, cooling tower fan speed, pump VFDs, CRAH/CRAC units, and economizer dampers. In an AI datacenter running direct-to-chip liquid cooling, the BMS must maintain CDU supply temperature within ±1°C while dynamically responding to GPU workload changes that can swing rack power by 40% in seconds.

PID loops everywhere: supply air temp, chilled water ΔP, condenser water temperature. Tuning parameters (Kp, Ki, Kd) directly impact stability and energy efficiency.
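
For readers who have not seen a PID loop written out, here is a minimal sketch in Python. It assumes a direct-acting loop (output rises when the measurement is below setpoint, as with a pump VFD holding differential pressure), a fixed time step, and simple conditional-integration anti-windup. Real controllers add features such as bumpless transfer and derivative filtering; the class name and gains here are illustrative.

```python
class PID:
    """Minimal PID with output clamping and conditional-integration anti-windup."""

    def __init__(self, kp, ki, kd, out_min, out_max):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_min, self.out_max = out_min, out_max
        self.integral = 0.0
        self.prev_error = None

    def update(self, setpoint, measurement, dt):
        error = setpoint - measurement
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error

        candidate = self.integral + error * dt
        output = self.kp * error + self.ki * candidate + self.kd * derivative

        # Stop integrating only when the output is saturated AND the error
        # would push it further into saturation (prevents integral windup).
        winding_up = (output > self.out_max and error > 0) or (
            output < self.out_min and error < 0
        )
        if not winding_up:
            self.integral = candidate

        return min(max(output, self.out_min), self.out_max)


# Hypothetical DP loop: setpoint 12 PSID, measured 10 PSID, pump speed limited to 30-100%
pid = PID(kp=2.0, ki=0.5, kd=0.0, out_min=30.0, out_max=100.0)
speed = pid.update(setpoint=12.0, measurement=10.0, dt=1.0)
print(speed)
```

The first output is clamped to the 30% minimum speed while the integral term keeps accumulating, so the output climbs above the minimum if the error persists.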

Non-negotiable systems

VESDA (laser-based air sampling) for earliest smoke detection. Clean-agent suppression (FM-200, Novec 1230) for IT rooms. Pre-action sprinklers (double interlock) for high-value spaces. EPO (Emergency Power Off) kills power to entire zones. These systems have hard interlocks with BMS and EPMS — a fire alarm can shut down HVAC and trip power to a fire zone. Governed by NFPA 75/76.

Leak detection cables run under raised floors and along every pipe route — critical for liquid-cooled environments where a CDU leak can damage millions in hardware.

The bridge between design and code

A Sequence of Operations (SOO) is the definitive document describing how a system operates under all conditions. It's the spec the PLC programmer codes from, the commissioning agent tests against, and the operations team references for troubleshooting.

System Desc.: Equipment list, design conditions, capacity
Operating Modes: Auto, Manual, Off, Emergency, Standby, Test
Startup Seq.: Step-by-step with prerequisites and time delays
Shutdown Seq.: Normal and emergency shutdown procedures
Modulating: PID loops, setpoints, control ranges, reset schedules
Staging Logic: Lead/lag, load-based staging, rotation schedules
Alarm Matrix: Every alarm condition, trip point, action, reset requirement
Failure Modes: Sensor failure, comms loss, equipment trip fallbacks

SOO excerpt: lead/lag logic

Lead Chiller Start:
  IF ChW_Return > Setpoint + 2°F
  AND Cooling_Demand > 20%
  AND No_Active_Alarms
  THEN Start Lead_Chiller
  WAIT 300s (anti-recycle timer)

Lag Chiller Stage-On:
  IF Load% > 75% on running chillers
     (sustained 10 min)
  OR ChW_Supply > Setpoint + 3°F
     (sustained 5 min)
  THEN Start next chiller in rotation

DP Setpoint Reset:
  Target: most-open valve at 90%
  DP Setpoint: 12 PSID (range 5-25)
  PID: Kp=2.0, Ki=0.5, Kd=0.0
  Output: Pump VFD 30% min - 100% max

Every setpoint, every timer, every PID gain in the SOO becomes a configurable parameter in the PLC program. The commissioning agent verifies each one during L3/L4 testing.

Complete SOO with explanations

Below is a complete, annotated Sequence of Operations for a chilled water plant typical of a hyperscale AI data center. Each section includes an explanation of why it exists and what reviewers/programmers should focus on.

§1 SYSTEM DESCRIPTION
✦ Why this matters: Sets scope. Identifies every piece of equipment under this SOO's control so there is zero ambiguity about what's included. Also defines design conditions — the “rated” environment the control logic must satisfy.
SYSTEM: Central Chilled Water Plant — Data Hall A
CAPACITY: 3 × 1,000-ton centrifugal water-cooled chillers (N+1)
DESIGN CONDITIONS:
  ChW Supply Temp (ChWST):    42 °F
  ChW Return Temp (ChWRT):    55 °F
  Design ΔT:                  13 °F
  CW Supply Temp (CWST):      85 °F (summer design)
  CW Return Temp (CWRT):      95 °F

EQUIPMENT:
  CH-1, CH-2, CH-3       Chillers (York YK, VFD compressors)
  CHWP-1, -2, -3         Primary CHW Pumps (VFD, 1,500 GPM ea)
  CWP-1, -2, -3          Condenser Water Pumps (constant, 2,200 GPM)
  CT-1, CT-2, CT-3       Cooling Towers (variable-speed fans)
§2 OPERATING MODES
✦ Why this matters: Prevents ambiguity — “Auto” means different things to different people unless spelled out. HOA (Hand/Off/Auto) at the local switch must always be described because it overrides BMS commands.
MODE        DESCRIPTION
────        ──────────────────────────────────────────────
AUTO        BMS/SCADA controls all staging and sequencing.
            All equipment HOA switches must be in AUTO.
MANUAL      Operator commands via HMI. Interlocks active.
            Staging logic disabled.
OFF         All equipment commanded off. Safeties monitored.
EMERGENCY   Fire alarm, EPO, or critical leak detection.
            Immediate shutdown. See §8.
STANDBY     Ready to start, waiting for cooling demand
            signal from DCIM (IT load > 50 kW threshold).
§3 STARTUP SEQUENCE
✦ Why this matters: Step-by-step with prerequisites that must be TRUE before each step proceeds. Prevents starting a chiller without flow (which would damage the evaporator). Time delays protect equipment from rapid cycling. Cx tip: During L3 FPT, verify every prerequisite actually blocks startup when forced FALSE.
PREREQUISITES (all must be TRUE):
  □ System mode = AUTO or STANDBY
  □ No active critical alarms (Level 1 or 2)
  □ ChW loop pressure > 15 PSIG (system filled)
  □ All HOA switches = AUTO at MCC
  □ CW isolation valves OPEN (limit switch FB)
  □ No active fire suppression signal

SEQUENCE:
  STEP 1: Start lead CW Pump
          WAIT: Flow switch TRUE within 30s
          FAIL: Alarm "CW Pump Fail", halt

  STEP 2: Start lead CT fan(s) at 30% minimum
          WAIT: 15s for fan to reach speed

  STEP 3: Open chiller CW isolation valves
          WAIT: End-switch OPEN within 45s

  STEP 4: Start lead CHW Pump at 40% speed
          WAIT: Flow > 800 GPM within 30s

  STEP 5: Command lead Chiller to START
          WAIT: Chiller RUNNING within 120s
          NOTE: Chiller has internal start seq
                (oil pump, guide vanes). Don't bypass.

  STEP 6: Release PID control:
          - CHWP → DP setpoint (see §5)
          - CT fan → CW return temp (see §6)
          - Chiller → ChWST setpoint (42°F)

  ANTI-RECYCLE: 300s min between starts
                per chiller (compressor protection)
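
The prerequisite list above translates naturally into a permissive check that names whatever is blocking startup, which also helps during functional testing. This is a sketch in Python rather than PLC code; the dictionary keys and the sample state are hypothetical.

```python
def blocked_prerequisites(state):
    """Return the names of startup prerequisites that are not satisfied."""
    checks = {
        "mode is AUTO or STANDBY": state["mode"] in ("AUTO", "STANDBY"),
        "no critical alarms": not state["critical_alarm"],
        "loop pressure > 15 PSIG": state["loop_pressure_psig"] > 15.0,
        "all HOA switches in AUTO": all(s == "AUTO" for s in state["hoa_switches"]),
        "CW isolation valves open": state["cw_valves_open"],
        "no fire suppression signal": not state["fire_suppression"],
    }
    return [name for name, ok in checks.items() if not ok]


# Hypothetical plant state: loop not fully filled, one pump left in HAND
plant = {
    "mode": "AUTO",
    "critical_alarm": False,
    "loop_pressure_psig": 12.0,
    "hoa_switches": ["AUTO", "HAND", "AUTO"],
    "cw_valves_open": True,
    "fire_suppression": False,
}
blocked = blocked_prerequisites(plant)
print(blocked)
```

An empty list means the sequence may proceed to Step 1; anything else tells the operator exactly which permissive to clear.
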
§4 SHUTDOWN SEQUENCE
✦ Why this matters: Shutdown is the reverse of startup but with critical differences: the chiller must stop before pumps to allow the refrigerant cycle to wind down. Stopping pumps first would freeze the evaporator. The post-circulation timer ensures residual heat is removed from the chiller barrel.
NORMAL SHUTDOWN:
  1. Command chiller STOP (unload guide vanes)
     WAIT: Chiller STOPPED within 180s
  2. Run CHWP at 40% for 120s (post-circ)
  3. Stop CHW Pump
  4. Close CW isolation valves
  5. Stop CW Pump
  6. Ramp down CT fans over 30s

EMERGENCY SHUTDOWN (see §8 triggers):
  ALL equipment → IMMEDIATE STOP
  Exception: CHWP runs 30s for chiller
  protection if safe to do so
§5 DIFFERENTIAL PRESSURE CONTROL
✦ Why this matters: DP control drives pump speed. The DP sensor is placed at the most hydraulically remote coil — if DP is adequate there, it's adequate everywhere. DP reset saves additional energy by lowering the setpoint when valves aren't demanding full flow.
DP SETPOINT:  12 PSID (range: 5–25 PSID)
DP SENSOR:    Most remote CRAH / coil header
OUTPUT:       VFD speed → CHWP-1/2/3

PID TUNING:
  Kp = 2.0  Ki = 0.5  Kd = 0.0
  Output range: 30% min – 100% max

DP RESET (energy optimization):
  Target: most-open CRAH valve = 85–95%
  IF all valves < 70%: DP setpoint −0.5 PSI
  IF any valve > 95%:  DP setpoint +0.5 PSI
  Rate limit: 1 PSI per 5 minutes
  Floor: never below 6 PSID
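
The DP reset rule can be sketched as a single function that is evaluated periodically. The function name is an assumption, and the rate limit is not enforced inside it: calling it no more often than every 2.5 minutes keeps a 0.5 PSI step within the stated 1 PSI per 5 minutes.

```python
def reset_dp_setpoint(current_sp, valve_positions, step=0.5, floor=6.0, ceiling=25.0):
    """One evaluation of the DP reset rule (valve positions in % open)."""
    if not valve_positions:
        return current_sp  # no valve feedback: leave the setpoint alone
    most_open = max(valve_positions)
    if most_open > 95.0:
        new_sp = current_sp + step   # a valve is nearly wide open: raise DP
    elif most_open < 70.0:
        new_sp = current_sp - step   # every valve is throttled: lower DP
    else:
        new_sp = current_sp          # inside the target band: hold
    return min(max(new_sp, floor), ceiling)


print(reset_dp_setpoint(12.0, [40.0, 55.0, 62.0]))
print(reset_dp_setpoint(12.0, [80.0, 97.0]))
print(reset_dp_setpoint(6.0, [10.0, 20.0]))
```

The last call shows the floor at work: even with every valve nearly closed, the setpoint does not drop below 6 PSID.
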
§6 CW TEMPERATURE CONTROL
Lower CW temps improve chiller efficiency (lower lift), but tower fans have diminishing returns. “Approach” (CW temp − wet-bulb) is a key metric: 7°F = good, >12°F = fouled fill.
CW RETURN SETPOINT: 78°F (range 65–85°F)
OUTPUT: CT fan VFD speed (all parallel)
PID: Kp=3.0  Ki=0.8  Kd=0.0
     Output: 20%–100%

OPTIMAL RESET:
  CW SP = Wet_Bulb + 7°F (min 65°F)

FREE COOLING / WATERSIDE ECON:
  IF CW_Supply < ChW_Return − 3°F
     (sustained 10 min):
  → Enable plate HX bypass
  → Modulate valve for ChWST SP
  → Stage down chillers if HX
    handles full load
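
The optimal reset and the free-cooling test reduce to two small calculations, sketched below. The function names are assumptions, the upper clamp of 85°F is taken from the setpoint range stated above, and the 10-minute "sustained" timer is deliberately left out to keep the arithmetic visible.

```python
def cw_setpoint(wet_bulb_f, approach_f=7.0, sp_min=65.0, sp_max=85.0):
    """Condenser water setpoint = wet-bulb + approach, clamped to the allowed range."""
    return min(max(wet_bulb_f + approach_f, sp_min), sp_max)


def free_cooling_available(cw_supply_f, chw_return_f, margin_f=3.0):
    """True when condenser water is cold enough to pre-cool chilled water return."""
    return cw_supply_f < chw_return_f - margin_f


print(cw_setpoint(50.0))   # cool day: clamped to the 65°F minimum
print(cw_setpoint(68.0))   # typical day: wet-bulb + 7°F
print(free_cooling_available(cw_supply_f=50.0, chw_return_f=55.0))
```
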
§7 CHILLER STAGING (LEAD/LAG)
Staging must balance responsiveness (don't let temps rise) against efficiency (don't short-cycle). “Sustained” timers prevent hunting from temporary load spikes.
STAGE-ON (add chiller):
  Load > 80% sustained 10 min
  OR ChWST > SP+3°F sustained 5 min
  → Start next in rotation
  POST-STAGE LOCKOUT: 15 min

STAGE-OFF (remove chiller):
  Load < 30% per unit sustained 15 min
  AND >1 chiller running
  → Stop lag chiller (lowest hours)
  POST-STAGE LOCKOUT: 20 min

ROTATION:
  Lead rotates weekly (Sun 02:00)
  OR runtime delta > 500 hrs
  Sequence: CH-1→CH-2→CH-3→CH-1
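
The "sustained" timers are the heart of this section, so here is a hedged Python sketch of how they gate a staging decision. It uses the thresholds above but omits the post-stage lockout and lead rotation; the class, function, and dictionary key names are assumptions.

```python
class Sustained:
    """True once a condition has held continuously for `hold_s` seconds."""

    def __init__(self, hold_s):
        self.hold_s = hold_s
        self.elapsed = 0.0

    def update(self, condition, dt):
        # Any scan where the condition is false resets the timer
        self.elapsed = self.elapsed + dt if condition else 0.0
        return self.elapsed >= self.hold_s


def staging_decision(load_pct, chwst_f, setpoint_f, running, timers, dt):
    """Return 'STAGE_ON', 'STAGE_OFF', or 'HOLD' for one evaluation."""
    on_load = timers["on_load"].update(load_pct > 80.0, dt)
    on_temp = timers["on_temp"].update(chwst_f > setpoint_f + 3.0, dt)
    off_load = timers["off_load"].update(load_pct < 30.0 and running > 1, dt)
    if on_load or on_temp:
        return "STAGE_ON"
    if off_load:
        return "STAGE_OFF"
    return "HOLD"


timers = {
    "on_load": Sustained(600),   # 10 min
    "on_temp": Sustained(300),   # 5 min
    "off_load": Sustained(900),  # 15 min
}
# Evaluate once a minute with load steady at 85% and supply temperature on setpoint
decisions = [staging_decision(85.0, 42.0, 42.0, 1, timers, dt=60.0) for _ in range(10)]
print(decisions)
```

The first nine evaluations hold; only the tenth, after a full 10 minutes above 80% load, requests another chiller. A single dip below the threshold would restart the count.
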
§8 ALARM & INTERLOCK MATRIX
✦ Why this matters: The most safety-critical part of an SOO. “Advisory” = notification only. “Warning” = operator attention. “Critical” = automatic protective action. Deadbands prevent alarm chatter. Cx tip: Every alarm must be tested during L3 FPT by simulating the condition.
Alarm | Condition | SP | Delay | Severity | Action
CHW_HI_TEMP | ChWST > SP | +5°F | 60s | Warning | Notify, stage on lag
CHW_CRIT_TEMP | ChWST > critical | +8°F | 30s | Critical | All chillers ON, page on-call
DP_LOW | DP < minimum | 4 PSID | 30s | Warning | Pump speed → 80%
CHILLER_FAULT | Controller fault | n/a | 0s | Critical | Start standby, page
PUMP_FAIL | Flow FALSE while ON | n/a | 30s | Critical | Start standby pump
LEAK_DETECT | Leak sensor active | n/a | 5s | Critical | Isolate zone, E-shutdown
VFD_FAULT | VFD controller fault | n/a | 0s | Critical | Start standby pump
CHW_LO_PRESS | Loop pressure low | 10 PSI | 60s | Warning | Notify — possible leak
CW_HI_TEMP | CWRT > limit | 98°F | 120s | Warning | CT fans → 100%
CT_FAN_FAULT | Fan VFD / vibration | n/a | 0s | Warning | Redistribute to other CTs
§9 FAILURE / FALLBACK MODES
Every sensor failure must have a defined fallback — otherwise the PID loop goes haywire. BMS comms loss: system must “fail safe” (keep running) because losing monitoring is less dangerous than losing cooling.
SENSOR FAILURE:
  ChWST fail → Use ChWRT − design ΔT
               Lock pump speed, alarm
  DP fail    → Lock pump at 70%, alarm
               Switch to backup sensor
  CW fail    → CT fans to 80% fixed

BMS/SCADA COMMS LOSS:
  PLC continues last known sequence
  Setpoints hold at last value
  No staging changes
  Alarm on comms restored
  Operator must re-enable AUTO

POWER FAILURE:
  Chillers trip (need clean power)
  Pumps on UPS: auto-restart
  Wait 30s for stable generator
  Then execute normal startup §3
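
The ChWST fallback above is simple enough to sketch directly. The function name and return shape are assumptions; the default design ΔT of 13°F comes from §1.

```python
def chwst_for_control(chwst_f, chwrt_f, sensor_ok, design_delta_t_f=13.0):
    """Return (value, alarm): measured ChWST, or ChWRT minus design ΔT on failure."""
    if sensor_ok:
        return chwst_f, False
    # Sensor failed: estimate supply temperature from return temperature
    return chwrt_f - design_delta_t_f, True


print(chwst_for_control(42.3, 55.0, sensor_ok=True))
print(chwst_for_control(None, 55.0, sensor_ok=False))
```

The estimate is only as good as the assumption that the plant is running near its design ΔT, which is why the sequence also locks pump speed and raises an alarm.
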
§10 SETPOINTS & PARAMETERS
The commissioning “cheat sheet.” During L3, verify each parameter matches design. Always include adjustable ranges — without them, operators may set dangerous values.
Parameter | Default | Range
ChWST Setpoint | 42°F | 38–50°F
DP Setpoint | 12 PSID | 5–25 PSID
CW Return SP | 78°F | 65–85°F
Stage-On Load | 80% | 60–95%
Stage-Off Load | 30% | 15–45%
Anti-Recycle | 300s | 180–600s
Post-Stage Lock | 15 min | 10–30 min
CHWP Min Speed | 30% | 20–40%
CT Fan Min | 20% | 15–30%
Lead Rotation | 7 days | 1–30 days
§11 INTERFACE POINTS
✦ Why this matters: No system operates in isolation. Missing an interface point is one of the most common commissioning defects. Cx tip: Verify each interface during L3 by having both sides confirm read/write.
TO DCIM / EPMS:
  → Plant kW total
  → Cooling capacity (tons delivered)
  → PUE contribution data
  → ChWST, ChWRT, ΔT, flow rate

FROM DCIM / EPMS:
  ← IT load (kW) — standby→auto trigger
  ← Cooling demand request
TO/FROM FIRE ALARM:
  → Plant running status
  ← Suppression signal → E-shutdown

TO/FROM LEAK DETECTION:
  ← Zone leak → isolate branch
  ← System leak → E-shutdown

TO/FROM ELECTRICAL:
  ← Generator / utility status
  → Plant load (gen load mgmt)

The platform that doesn't exist yet (but should)

Today, the handoff between design, controls engineering, and commissioning is largely manual. A designer creates P&IDs, a controls engineer manually re-interprets them into PLC/SCADA configuration, and a Cx agent manually writes test scripts from the SOO. Each handoff introduces errors and delay. The industry is converging toward a model-driven approach where a single source of truth drives everything downstream.

DESIGN MODEL: P&IDs · Equipment schedules · IO lists · Control narratives · BIM / Revit / EPLAN · AutomationML (IEC 62714)
  ↓ auto-parse equipment + connections
TAG DATABASE: Auto-generated from equipment model. UDTs per equipment class · hierarchy · alarm config
  ↓ map to templates
CONTROL LOGIC: PLC programs from SOO templates. Structured text / FBD auto-generated
HMI / GRAPHICS: ISA-101 compliant screens. Faceplates, trends, alarms from UDTs
Cx SCRIPTS: Test procedures from SOO + tag DB. Auto-generated acceptance criteria
  ↓ deploy + verify
LIVE SYSTEM + DIGITAL TWIN: Runtime verified against design intent. Omniverse 3D twin · live data overlay · auto-Cx verification

WHAT WORKS TODAY
Template-driven HMI

Define one faceplate per equipment type (valve, pump, VFD), bind to UDTs. Every instance auto-generates its own screen. Ignition Perspective, Siemens WinCC, AVEVA System Platform all support this.

EPLAN → TIA Portal

Electrical design in EPLAN exports hardware config to Siemens TIA Portal via AutomationML. IO mapping, rack layout, and network config auto-transfer.

ISA-101 high-performance HMI

Strict design rules (gray backgrounds, no 3D pipes, color = abnormal state) are codifiable. An AI system could generate compliant screens more consistently than most human designers.

WHAT'S STILL MANUAL
SOO → Control logic

A controls engineer reads the SOO and manually programs PLC logic. Same intent, re-interpreted. No standard for machine-readable SOOs.

SOO → Cx scripts

Cx agents manually create test procedures from the same SOO the programmer used. Three humans reading one document = three interpretations = defects.

P&ID → Tag database

Equipment on a P&ID becomes tags manually. BIM semantic standards (Haystack, Brick Schema) are trying to bridge this, but adoption is slow.

NVIDIA POSITIONING

NVIDIA is uniquely positioned to close this gap. Omniverse provides the 3D digital twin — the “HMI” is the model itself with live data overlay. Agentic AI can parse design documents and generate control logic, tag databases, and Cx scripts from a single source of truth. The platform that unifies design → engineering → commissioning into one model-driven pipeline will fundamentally change how fast AI factories deploy. Instead of flat 2D SCADA screens, you get a photorealistic twin where you click a chiller and its faceplate appears with live data — no one needs to manually draw pump graphics when the 3D model already exists.

PLATFORM LANDSCAPE — WHO'S CLOSEST?
Platform | Design | Engineering | Auto HMI | Auto Cx
Siemens TIA + EPLAN | ✓ EPLAN → AML export | ✓ PLC + HMI in one IDE | ✓ Faceplate templates | ✗ Manual
Rockwell FT Design Hub | ◐ Cloud-based config | ✓ Studio 5000 | ◐ Global objects | ✗ Manual
Ignition (Inductive) | ✗ No design tool | ✓ Excellent SCADA | ✓ UDT templates | ✗ Manual
Schneider EcoStruxure | ◐ Separate products | ✓ Multiple platforms | ◐ Cx wizards | ◐ Partial
NVIDIA Omniverse | ✓ 3D from BIM/CAD | ◐ Data connectors | ✓ Twin IS the HMI | ◐ Emerging (AI)

No single platform does all four today. The convergence of digital twins + semantic data models + agentic AI will close the gap within 2–5 years.

Trust, but verify — systematically

Commissioning (Cx) is the process that proves every system works as designed: not just individually, but together, under realistic load conditions and failure scenarios.

L1 Factory: Witness testing at the manufacturer. Verify capacity, features, nameplate data. Critical for custom switchgear, UPS, and generators with 20–40+ week lead times.
L2 Install: Verify installation meets design: mounting, alignment, wiring terminations, labeling, meggering cables, hydrostatic testing on piping. Trace I/O wiring from sensor → panel → PLC module.
L3 Functional: Individual system testing against the SOO: verify startup/shutdown sequences, test every alarm, verify PID tuning, run at 25/50/75/100% loads, confirm historian logging.
L4 Integrated: Multi-system failure testing with live or simulated IT load: utility loss, generator failure, UPS bypass, chiller trip, CRAH failure, CDU leak, BMS comms loss, cascade failures.
L5 Seasonal: Ongoing: economizer transitions, seasonal setpoints, re-test after modifications, continuous monitoring and trend analysis for degradation.

An un-commissioned datacenter is a datacenter that will fail under load. The question is when, not if.

Where data center Cx gets unique

L4/IST is performed with live IT load (or load banks). These are the failure scenarios that must be proven before a facility goes live:

Utility loss (one feed, both feeds)
Generator failure to start
Generator failure during operation
UPS bypass / UPS failure
Chiller trip under load
CRAH/CDU fan or pump failure
Cooling tower pump trip
Fire suppression activation (EPO)
BMS/SCADA communication loss
Cascade: chiller trip → temp rise → load shed

Key documentation: pre-written test scripts with expected results, pass/fail criteria, and space for actual results. Every deviation is a deficiency tracked to resolution.
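
A pre-written script line with an expected result, a pass/fail criterion, and space for the actual result can be represented as a small record. The sketch below is one possible shape, not an industry format: the class name `CxStep`, the time-limit criterion, and all numeric limits are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CxStep:
    """One line of a pre-written Cx script with a time-based pass criterion."""
    action: str
    limit_seconds: float                      # pass if measured time does not exceed this
    measured_seconds: Optional[float] = None  # left empty until the test is run

    def status(self):
        if self.measured_seconds is None:
            return "NOT RUN"
        return "PASS" if self.measured_seconds <= self.limit_seconds else "DEFICIENCY"

# Illustrative steps and limits (hypothetical values, not design requirements).
script = [
    CxStep("Open utility breaker; generator accepts load", 15.0, 11.2),
    CxStep("Trip lead chiller; standby chiller starts", 120.0, 150.0),
    CxStep("Drop BMS comms; controllers hold last setpoints", 5.0),
]
deficiencies = [step.action for step in script if step.status() == "DEFICIENCY"]
```

Keeping the criterion in the record, rather than in the tester's head, is what lets every deviation be logged as a deficiency and tracked to resolution.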

The unified operational view

DCIM (Data Center Infrastructure Management) provides a single pane of glass across all physical infrastructure. It sits at the top of the control pyramid, consuming data from BMS, EPMS, and IT systems:

Asset Management

Every device tracked: location (building/floor/row/rack/U-position), model, serial, power connections, network connections. Change management for installs, moves, decommissions.

Capacity Planning

Power capacity per rack/row/room, cooling capacity, physical space, network ports. Power chain visualization: trace the path from utility feed to individual rack.

Integration Data Flows

EPMS → BMS: power readings for PUE calculation, load-based staging. BMS → DCIM: environmental data for capacity planning. EPMS → DCIM: actual power per rack for utilization tracking. All connected via API, SQL, or OPC-UA gateways.

Platforms: Nlyte, Sunbird dcTrack, Schneider EcoStruxure IT. Hyperscalers (including NVIDIA) often build custom DCIM in-house.

ARCHITECTURE // SENSOR TO DCIM DATA FLOW

End-to-end data architecture from physical sensors through the OT control stack (PLCs → OPC-UA → Ignition SCADA) to IT systems via MQTT, feeding DCIM, analytics, and cloud platforms.

Level 4/5, Enterprise Network: Data Lake / Time-Series (InfluxDB) · Digital Twin (Omniverse) · AI / ML Analytics (Anomaly Det.) · Business Dashboards (Grafana) · DCIM PLATFORM (Asset Mgmt, Capacity Planning, PUE, Alarms, Work Orders). Interfaces: REST API · WebSocket · SQL
Level 3.5, IT / OT DMZ: MQTT BROKER (HiveMQ / Mosquitto) · MQTT / Sparkplug B (TLS) · Sparkplug B birth/death · QoS 1 · Retained
Level 3, Operations: CENTRAL IGNITION GATEWAY (On-Prem) with Tag Historian (SQL DB) · Alarm Notification (Pipeline) · MQTT Transmission (Sparkplug B) · Sepasoft MES (SPC / OEE). Functions: MES · Scheduling · OEE · Quality Management. Downlink: OPC-UA Server (port 62541) · Gateway Network
Level 2, Process (SCADA / HMI): Historical DB (Local SQL) · Local Ignition Server (SCADA / HMI) · Local Client (Perspective) · Cirrus Link Modules (Distributor / Engine). Edge gateways at remote sites: Ignition Edge IIoT (Bldg A Cooling) · Ignition Edge Panel (Bldg B) · Ignition Edge (Power). Functions: SCADA · HMI · Alarming · Reporting · Trending. Downlink: OPC-UA · Modbus TCP
Level 1, Control (Intelligent Devices): PLC (Beckhoff / Siemens, onboard logic) · RTU (remote sites, onboard logic) · Power Meters / IEDs (Schneider ION/PM, processor + comms) · VFDs / Motor Starters (ABB / Danfoss, processor + comms). Downlink: Hardwired · 4-20mA · RS-485 · Dry contacts
Level 0, Field Devices (Unintelligent): Temp Sensors (RTD · Thermistor · 4-20mA) · Pressure / DP (Transducers · 4-20mA) · Actuators / Valves (Control valves · Dampers) · Contacts / Switches (Leak det · Smoke · Door)

OT Side (Levels 0–2)

Sensors & actuators (L0, unintelligent) → PLCs/RTUs/IEDs/meters (L1, intelligent) via hardwired I/O. L1 controllers → local Ignition SCADA/HMI & Edge (L2) via OPC-UA. Central Ignition Gateway (L3) aggregates via Gateway Network. Deterministic, real-time control.

IT/OT Bridge (DMZ)

Ignition's MQTT Transmission module publishes to a broker in the DMZ using Sparkplug B. Data flows OT→IT only. TLS encrypted, certificate auth, no inbound connections from IT to OT. The broker is the single point of data egress.
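
For readers new to Sparkplug B, the sketch below builds topic strings in the specification's `spBv1.0/group_id/message_type/edge_node_id[/device_id]` namespace. The group, node, and device names are hypothetical, and the helper covers only node-level and device-level messages (not the host STATE topic).

```python
NODE_TYPES = {"NBIRTH", "NDEATH", "NDATA", "NCMD"}
DEVICE_TYPES = {"DBIRTH", "DDEATH", "DDATA", "DCMD"}

def sparkplug_topic(group_id, message_type, edge_node_id, device_id=None):
    """Build a Sparkplug B topic: spBv1.0/group/type/edge_node[/device]."""
    if message_type in NODE_TYPES:
        if device_id is not None:
            raise ValueError("node-level messages take no device_id")
        return f"spBv1.0/{group_id}/{message_type}/{edge_node_id}"
    if message_type in DEVICE_TYPES:
        if device_id is None:
            raise ValueError("device-level messages require a device_id")
        return f"spBv1.0/{group_id}/{message_type}/{edge_node_id}/{device_id}"
    raise ValueError(f"unknown message type: {message_type}")

# Hypothetical names: site group "DC1", an edge node for Building A cooling, one CDU.
birth = sparkplug_topic("DC1", "NBIRTH", "BldgA-Cooling")
data = sparkplug_topic("DC1", "DDATA", "BldgA-Cooling", "CDU-07")
```

An IT-side subscriber would typically subscribe to a wildcard such as `spBv1.0/DC1/#` on the DMZ broker, so it never needs a connection into the OT network.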

IT Side (Level 4)

DCIM platform subscribes to MQTT broker or pulls from Ignition's historian via SQL/API. Data lake stores raw telemetry for ML. Omniverse digital twin consumes real-time feeds. Grafana/custom dashboards for NOC displays.

Network segmentation for industrial systems

OT (PLCs, SCADA, sensors) prioritizes availability and safety. IT (business systems, cloud) prioritizes confidentiality. The Purdue Model (ISA-95) defines the separation:

Level 4/5: Enterprise Network (ERP, cloud, DCIM, analytics)
Level 3.5: IT / OT DMZ
Level 3: Operations (Central Ignition, Historian, MES)
Level 2: Process (Local SCADA/HMI, Ignition Edge)
Level 1: Control (PLCs, RTUs, IEDs, VFDs; intelligent devices)
Level 0: Field Devices (sensors, actuators; unintelligent)

OT networks must be segmented from IT with a DMZ. Data flows OT → IT via historians, OPC-UA gateways, or MQTT brokers in the DMZ. Never expose PLCs directly to the IT network. Security standard: IEC 62443.

Virtual replica, real-time data

A digital twin is a virtual replica of a physical system continuously updated with real-time sensor data. NVIDIA Omniverse for data centers provides:

3D visualization of entire facility — racks, cooling, power
Real-time sensor data overlay (temperatures, power, airflow)
CFD simulation for airflow optimization
What-if scenarios: "Add 10 racks to Row C — what happens to cooling?"
Predictive maintenance via anomaly detection

Data pipeline: Sensor → PLC → OPC-UA → Historian → API → Omniverse connector → real-time 3D updates. Uses USD (Universal Scene Description) format.

Why AI workloads break traditional monitoring

Power volatility

Traditional servers draw steady power. GPU clusters swing from idle (~30% TDP) to full load (~100% TDP) in milliseconds when a training batch launches. This creates transient power spikes that stress UPS systems, trip breakers if margins are thin, and confuse capacity planning models built for static loads.

Thermal density

At 80–120 kW per rack, the thermal time constant collapses. A cooling failure that gives you 15 minutes of runway in a 10 kW/rack enterprise hall gives you <3 minutes in an AI hall. Monitoring latency that was "fine" at 60-second poll intervals becomes dangerous — you need sub-second environmental telemetry.

Workload correlation

In an AI cluster, all GPUs in a training job start and stop together. This means power and thermal loads are highly correlated across hundreds of racks — the opposite of the statistical diversity that traditional datacenter designs depend on. Your cooling plant must handle 0-to-100% step changes.

Sample electrical device types — make, model, protocol, key data

A controls engineer must know what's on the other end of the wire. Below are representative devices found in a modern AI data center electrical distribution system — the equipment your SCADA/EPMS will monitor and control.

Device Type | Example Make / Model | Protocol | Key Data Points | Location
Revenue Meter | Schneider ION 9000 | Modbus TCP / DNP3 | V (L-L, L-N), A, kW, kVAR, kVA, PF, THD per harmonic, demand (15-min), kWh, frequency | MV switchgear
Branch Circuit Monitor | Schneider PM8000 / ION 7650 | Modbus TCP | Per-circuit V, A, kW, kWh, PF, breaker status, alarm thresholds | Floor PDU / RPP
Protective Relay | SEL-751 / GE Multilin 489 | Modbus RTU / DNP3 / IEC 61850 | Fault current, trip status, fault type (OC/GF/diff), event log, waveform capture | MV / LV switchgear
UPS | Vertiv Liebert EXL S1 / Eaton 93PM | SNMP v3 / Modbus TCP | Input/output V, A, kW, load %, battery SOC, runtime remaining, bypass status, temp, alarm state | UPS room
Automatic Transfer Switch | ASCO 7000 Series / Russelectric | Modbus TCP / BACnet | Source 1/2 status, active source, transfer count, transfer time (ms), V per source | MV switchgear room
Generator Controller | DEIF AGC-4 / DSE 8610 | Modbus RTU/TCP | kW output, RPM, coolant temp, oil pressure, fuel level, battery V, run hours, start count | Generator yard
Intelligent Rack PDU | ServerTech PRO3X / Raritan PX3 | SNMP v3 / Modbus / REST API | Per-outlet V, A, kW, kWh, PF, inlet temp/humidity, outlet switching, alarm thresholds | In-rack
Static Transfer Switch | Schneider Galaxy VS STS / Eaton STS | Modbus TCP / SNMP | Active source, preferred source, transfer count, transfer time (<4ms), source quality | Critical distribution
Power Quality Analyzer | Dranetz HDPQ / Fluke 1760 | Modbus TCP / proprietary | THD (V&I), voltage sags/swells, transients, flicker, unbalance, EN 50160 compliance | MV bus / critical loads

Sample mechanical device types — make, model, protocol, key data

The mechanical (cooling) side has its own ecosystem of intelligent devices and dumb field instruments. The BMS/SCADA must integrate all of them into a unified monitoring and control platform.

Device Type | Example Make / Model | Protocol / Signal | Key Data Points | Location
Centrifugal Chiller | York YK / Trane CVHF / Carrier 19XR | BACnet IP / Modbus TCP | ChWST, ChWRT, ΔT, loading %, kW, COP, compressor RPM, refrigerant press/temp, oil temp, alarm codes | Chiller plant
Cooling Tower | Evapco AT / BAC Series 3000 | BACnet / Modbus | Fan speed %, CW supply/return temp, basin temp, vibration, fan status, cell enable/disable | Roof / yard
VFD (Variable Freq Drive) | ABB ACS880 / Danfoss VLT / Yaskawa GA800 | Modbus TCP / PROFINET / EtherNet/IP | Speed (Hz/RPM), motor A, kW, torque %, run status, fault code, PID setpoint/feedback, drive temp | Pump/fan motors
CDU (Coolant Dist Unit) | CoolIT DCLC / Vertiv XDU / Motivair ChilledDoor | Modbus TCP / BACnet IP | IT supply/return temp, facility supply/return temp, flow (GPM), ΔP, leak detect, pump status, conductivity | Row-end / in-row
CRAH / AHU | Schneider Uniflair / Vertiv Liebert CW | BACnet IP / Modbus | Supply/return air temp, fan speed, valve position, filter ΔP, coil temp, humidity, unit status | Data hall perimeter
Control Valve (2-way) | Belimo / Siemens ACVATIX | 4–20 mA / 0–10 V / BACnet | Position feedback (%), command %, actuator status (open/closed/fault), torque | Chiller/CRAH coils
Temp Sensor (RTD/Thermistor) | Siemens QAM2120 / TE Connectivity | 4–20 mA / RTD (Pt100/Pt1000) | Temperature °F/°C (pipe, duct, ambient, immersion) | Pipes, ducts, racks, ambient
Differential Pressure Sensor | Setra 231 / Dwyer MS-111 | 4–20 mA / 0–10 V | ΔP across filter, coil, pump, or chiller (PSID / Pa) | Filter, coils, headers
Flow Meter | Badger Meter ModMAG / Siemens SITRANS FM | 4–20 mA / Modbus / HART | Flow rate (GPM), totalizer (gallons), flow velocity, direction | CHW/CW mains, CDU loops
Leak Detection | RLE Technologies / TraceTek TT-FFS | Dry contact / Modbus RTU | Leak presence (Y/N), leak location (distance along cable), zone ID | Under floor, pipe routes, CDU
Smoke / Fire Detection | Xtralis VESDA-E VEA / Honeywell FSL100 | Proprietary / relay / Modbus | Smoke level (obscuration), alert/action/fire thresholds, sampling pipe zone, flow status | Above/below rack, ceiling

What DCIM needs at the telemetry level — and why

A DCIM platform aggregates thousands of data points into actionable intelligence. But not all data is equal. The points below are the most critical telemetry a DCIM consumes: the data that drives capacity decisions, triggers alarms, calculates efficiency metrics, and keeps the facility running.

POWER
ELECTRICAL TELEMETRY
FACILITY-LEVEL (MV METERING)
  • Total facility kW — real-time total site power; denominator for PUE
  • kWh (cumulative) — energy billing, carbon footprint, trend analysis
  • Power factor — utility penalty if PF < 0.95; indicates harmonic issues
  • Peak demand (15-min avg) — drives utility demand charges; capacity trigger
  • THD (voltage & current) — harmonic distortion from switch-mode PSUs; transformer derating
  • Frequency (Hz) — grid stability indicator; triggers generator start
IT LOAD (RACK / PDU LEVEL)
  • Total IT kW — numerator for PUE; sum of all rack PDU readings
  • Per-rack kW — capacity utilization per position; stranding detection
  • Per-outlet amps — breaker trip risk; load balancing across phases
  • UPS load % — headroom for transients; triggers capacity alarm at 80%
  • UPS battery SOC & runtime — ride-through availability; replacement scheduling
  • Generator run hours & fuel % — maintenance scheduling; fuel delivery trigger
PUE CALCULATION (REAL-TIME)
PUE = Total Facility Power (MV meter) ÷ IT Load Power (sum of rack PDU kW)
Trended at 15-min intervals. Dashboard target: 1.10–1.20 for liquid-cooled AI facilities.
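
The PUE formula above can be sketched in a few lines of Python. The meter readings are made-up illustrative values, and the function name is an assumption for the example.

```python
def pue(total_facility_kw, rack_pdu_kw):
    """PUE = total facility power / IT load power (sum of rack PDU readings)."""
    it_kw = sum(rack_pdu_kw)
    if it_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_kw

# Illustrative readings: four racks at 100 kW each, 460 kW at the MV meter.
value = pue(460.0, [100.0, 100.0, 100.0, 100.0])
in_target = 1.10 <= value <= 1.20  # dashboard target band quoted above
```

Here 460 kW of facility power over 400 kW of IT load gives a PUE of 1.15, inside the 1.10–1.20 target band.
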
THERMAL
COOLING & ENVIRONMENTAL TELEMETRY
ENVIRONMENTAL (DATA HALL)
  • Rack inlet temp (per-rack) — ASHRAE A1 limit: 64–80°F; alarms at 82°F; GPU throttling risk above 85°F
  • Rack exhaust temp — ΔT across rack = proxy for load; abnormal ΔT = airflow issue
  • Supply air temp (CRAH output) — control variable for CRAH PID loop
  • Return air temp (CRAH intake) — determines cooling demand and valve modulation
  • Humidity (%RH) — ASHRAE recommends 8–60% RH; too low = ESD risk; too high = condensation
  • Dew point — condensation risk for cold pipes; critical for liquid cooling environments
COOLING PLANT
  • ChW supply/return temp — primary control target; ±0.5°F stability critical for AI loads
  • ChW ΔT — design ΔT (13°F typ.); low ΔT syndrome wastes energy and reduces chiller capacity
  • ChW ΔP (header) — secondary pump VFD control variable; reset based on valve positions
  • Chiller loading % & COP — staging trigger; efficiency optimization
  • CW supply/return temp — cooling tower fan speed control variable
  • Wet-bulb temp (outdoor) — economizer enable/disable; cooling tower approach calculation
CDU / LIQUID COOLING
  • IT loop supply/return temp — GPU thermal margin; alarm at >50°C supply
  • IT loop flow rate (GPM) — low flow = insufficient heat removal; pump fault indicator
  • IT loop ΔP — filter clogging indicator; pump curve verification
  • Coolant conductivity (μS/cm) — DI water quality; >5 μS/cm = resin exhausted
  • Leak detection status — zone/distance; highest-priority alarm in liquid-cooled DCs
MECHANICAL EQUIPMENT STATUS
  • Pump/fan VFD speed (%) — efficiency tracking; affinity law verification
  • Motor current (A) — overload detection; bearing failure early warning
  • Valve position (%) — hunting detection; DP reset input (most-open valve algorithm)
  • Filter ΔP — replacement scheduling; airflow restriction alarm
  • Vibration (mm/s) — rotating equipment health; bearing/impeller failure prediction
SAFETY
LIFE SAFETY & SECURITY TELEMETRY
FIRE / SMOKE
  • VESDA smoke level (alert/action/fire)
  • Fire panel zone status
  • Suppression system armed/discharged
  • Pre-action valve status
  • EPO button status (armed/tripped)
WATER / LEAK
  • Leak detection cable alarm + location
  • Drip pan water-sense contacts
  • CDU reservoir level
  • Makeup water flow (unexpected = leak)
  • Under-floor flood sensors
PHYSICAL SECURITY
  • Door contacts (open/closed/forced)
  • Access control events (badge in/out)
  • Camera feed status (online/offline)
  • Motion detection zones
  • Mantrap interlock status
TELEMETRY ARCHITECTURE — DATA RATES & STORAGE
Data Category | Typical Poll Rate | Historian Deadband | Retention | Why It Matters
Power (kW, A, V) | 1–5 sec | 1–2% | 3+ years | PUE, billing, capacity trending, fault detection
Temperature | 5–15 sec | 0.5°F / 0.3°C | 1–3 years | Thermal compliance, SLA, cooling optimization
Flow / Pressure | 5–30 sec | 2–5% | 1–2 years | Pump performance, filter health, balancing
Equipment status (on/off) | On change | N/A (digital) | 5+ years | Runtime tracking, PM scheduling, failure analysis
Alarms & events | On change | N/A | 7+ years | Root cause analysis, compliance audit trail
Security / access | On event | N/A | 1–7 years | Compliance (SOC 2, ISO 27001), incident response

A 1,000-rack AI facility with ~30 points per rack + plant-level instrumentation generates 50,000–100,000 monitored points. At 5-second scan rates, that's 10,000–20,000 writes/second to the historian. Proper deadband configuration reduces actual storage volume by 80–90% without losing operationally significant changes.
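
The arithmetic above, and the effect of a deadband, can be checked with a short sketch. The temperature samples are made up, and this simple "store only when the value moves more than the deadband from the last stored value" rule is one common form of deadband, assumed here for illustration; real historians offer several compression modes.

```python
def writes_per_second(points, scan_seconds):
    """Raw historian write rate if every point is stored on every scan."""
    return points / scan_seconds

def deadband_filter(samples, deadband):
    """Keep a sample only when it differs from the last stored value by more than the deadband."""
    stored = []
    for value in samples:
        if not stored or abs(value - stored[-1]) > deadband:
            stored.append(value)
    return stored

low = writes_per_second(50_000, 5)    # lower bound of the point count above
high = writes_per_second(100_000, 5)  # upper bound of the point count above

# Illustrative supply-air temperatures (°F) filtered with a 0.5 °F deadband.
raw = [68.0, 68.1, 68.2, 68.1, 68.7, 68.8, 68.6, 69.3, 69.2, 69.3]
kept = deadband_filter(raw, 0.5)
```

In this toy series only 3 of 10 samples are stored. The reduction on real data depends on how noisy each signal is relative to its deadband.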

From Transistor to Tensor Core

Zoom in far enough and a GPU is just billions of switches — transistors patterned at TSMC's 4–3 nm class nodes.[src] Zoom out and they form a hierarchy purpose-built for one operation: matrix multiplication.

A GPU is a memory machine

  1. Registers: per-thread, ~tens of KB, fastest path.
  2. SRAM (shared memory / L1): per-SM scratchpad, ~hundreds of KB, where FlashAttention tiles its work.[src]
  3. HBM: on-package stacked DRAM. ~3–8 TB/s, 80–192 GB per GPU on Hopper / Blackwell class parts.[src][src]
  4. CPU DRAM: DDR5, ~hundreds of GB/s, used as overflow.
  5. NVMe + network: datasets, checkpoints, inter-node traffic.

For most inference workloads, throughput is bounded not by FLOPS but by how fast weights can be streamed from HBM into the SRAM-resident kernel.
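
A back-of-envelope bound makes the point concrete: if every weight must be read from memory once per generated token, single-stream decode speed cannot exceed memory bandwidth divided by model size. The model size, weight precision, and bandwidth below are hypothetical round numbers, and the estimate ignores KV-cache reads and batching.

```python
def decode_tokens_per_second(params, bytes_per_param, hbm_bytes_per_second):
    """Upper bound for batch-1 decode: every weight is read from memory once per token."""
    weight_bytes = params * bytes_per_param
    return hbm_bytes_per_second / weight_bytes

# Hypothetical: 70B parameters, 16-bit weights, 3 TB/s of memory bandwidth.
bound = decode_tokens_per_second(70e9, 2, 3e12)
```

With these assumptions the ceiling is about 21 tokens per second per sequence, regardless of how many FLOPS the chip offers. Batching raises aggregate throughput because one pass over the weights serves many sequences.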

CPU · GPU · TPU · NPU

Class | Style | Strength | Limit
CPU | Latency, scalar | Branchy code | Few ops/cycle
GPU | SIMT, throughput | Dense matmul | Memory bandwidth
TPU | Systolic array | Big matmul, low overhead | Less flexible
NPU | On-device, INT8/4 | Power-efficient inference | Limited memory

All four converge on the same insight: matrix multiplication is ~90% of the work, so dedicate the silicon to it.

Fewer bits, more throughput, more risk

[Interactive precision selector, shown at 16 bits. Displayed values: 16× · Training-grade · 2.0× smaller]

BF16: same exponent range as FP32; the modern training default.

FP8 (E4M3/E5M2) is natively supported by Hopper / Blackwell Tensor Cores and is increasingly used in training, typically E4M3 for forward-pass tensors and E5M2 for gradients.[src][src]

Matrix Multiplication, All The Way Down

A neural network is, mechanically, a long composition of linear maps separated by simple non-linearities. During training, calculus tells us how to nudge each weight to reduce a scalar loss. That's it. The intelligence emerges from composition and scale.

y = σ(W · x + b)

For each layer, multiply the input vector x by a learned weight matrix W, add a bias b, then apply a non-linearity σ, usually GELU or a SwiGLU-style gated variant in modern transformers.[src]

[Figure: xW = y, ≈ 90% of all FLOPs in a transformer]
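
The layer equation y = σ(W · x + b) can be written out in plain Python. GELU is used for σ here as one common choice (an assumption for the example), and the matrix, input, and bias values are made up.

```python
import math

def gelu(z):
    """Exact GELU: z * Phi(z), with Phi the standard normal CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def layer(W, x, b, activation=gelu):
    """Compute y = activation(W · x + b) for one dense layer."""
    pre = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
           for row, b_i in zip(W, b)]
    return [activation(z) for z in pre]

W = [[1.0, 0.0], [0.0, -1.0]]  # 2x2 weight matrix
x = [2.0, 3.0]                 # input vector
b = [0.0, 3.0]                 # bias
y = layer(W, x, b)             # pre-activations are [2.0, 0.0]
```

The nested sum is the matrix multiply that dominates the FLOP count; the activation is a cheap elementwise pass afterwards.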

Backpropagation in one breath

  1. Compute loss L between prediction and target.
  2. Apply the chain rule from output back to input, accumulating ∂L/∂W for every weight.
  3. Update each weight: W ← W − η · ∂L/∂W.
  4. Repeat for trillions of tokens.

Backpropagation is just the chain rule, executed efficiently as a reverse-mode automatic differentiation pass over the computational graph.
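
The four steps can be traced by hand on the smallest possible model, a single neuron y = w·x + b with squared-error loss. The learning rate, input, and target below are arbitrary illustrative values; real training uses automatic differentiation rather than hand-written gradients.

```python
def train_step(w, b, x, target, lr):
    """One gradient-descent step for y = w*x + b with loss L = (y - target)^2."""
    y = w * x + b                 # forward pass
    loss = (y - target) ** 2      # step 1: compute loss
    dL_dy = 2.0 * (y - target)    # step 2: chain rule, from output back to input
    dL_dw = dL_dy * x
    dL_db = dL_dy
    w -= lr * dL_dw               # step 3: W <- W - eta * dL/dW
    b -= lr * dL_db
    return w, b, loss

w, b = 0.0, 0.0
losses = []
for _ in range(50):               # step 4: repeat
    w, b, loss = train_step(w, b, x=2.0, target=4.0, lr=0.05)
    losses.append(loss)
```

With these values each step halves the error, so the loss falls by a factor of four per iteration starting from 16.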

Loss Landscape Visualization

The vector field is a synthetic 2D loss; modern training uses AdamW with cosine learning-rate schedules and warmup. The real loss landscape lives in billions of dimensions, where most local minima behave similarly well.

Attention Is The Engine

Every modern frontier model (GPT, Claude, Gemini, Llama) is a stack of identical transformer blocks.[src] Each block does two things: lets every token look at every other token (attention), then transforms each token independently (MLP). Stack 60–120 of those, train on trillions of tokens, and you get an LLM.

Type to see the merges fire

·the·strawberry·weighs·30·grams.

▁ marks a word boundary (sentencepiece convention). Real BPE learns ~50k merges from corpus statistics.[src] Notice how rare words like “strawberry” fragment — that's why some models miscount its letters.
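
The merge-learning loop behind BPE fits in a few lines. This is a minimal character-level sketch on a toy word list, assuming ties are broken by first occurrence; production tokenizers work on bytes, weight words by frequency, and learn tens of thousands of merges.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent symbol pair."""
    corpus = [list(word) for word in words]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        for symbols in corpus:
            i = 0
            while i < len(symbols) - 1:
                if symbols[i] == a and symbols[i + 1] == b:
                    symbols[i:i + 2] = [a + b]  # fuse the pair in place
                i += 1
    return merges, corpus

merges, corpus = learn_bpe(["low", "low", "lower", "lowest"], num_merges=3)
```

After three merges the frequent stem "low" is a single symbol while the rarer endings remain split into pieces, which is the same fragmentation effect described above for rare words.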

Each token decides who to listen to

[Attention heatmap, "recency" pattern: query tokens (rows) × key tokens (columns) for "The cat sat on the mat because it was tired"]

Each row = a query token; brightness = how much it attends to each key token. Real heads in trained models specialize spontaneously into induction heads, name-mover heads, etc. Pattern shown is heuristic for clarity.[src]
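
Each row of such a heatmap is a softmax over query-key dot products. The sketch below computes scaled dot-product attention in plain Python with a causal mask (each token sees only itself and earlier tokens, as in decoder-only LLMs). The Q, K, V values are made up, and a real head would first project token embeddings through learned matrices.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Scaled dot-product attention; token i attends only to tokens 0..i."""
    d = len(Q[0])
    weights, outputs = [], []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        row = softmax(scores)
        weights.append(row)
        outputs.append([sum(w * V[j][c] for j, w in enumerate(row))
                        for c in range(len(V[0]))])
    return weights, outputs

# Three tokens with 2-dimensional queries, keys, and values (made-up numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, outputs = causal_attention(Q, K, V)
```

`weights` is the heatmap: row i holds how much token i attends to each earlier token, and every row sums to 1.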

One transformer block, repeated N times

Input tokens → RMSNorm → Multi-head Attn (RoPE, GQA) → + Residual → RMSNorm → SwiGLU MLP → + Residual, ×N (typically 32–120)
  • RMSNorm instead of LayerNorm: fewer params, same stability.[src]
  • RoPE (rotary position embeddings) injects position into Q/K via rotation and extends cleanly to long contexts.[src]
  • GQA shares K/V heads across multiple Q heads to shrink the KV cache during inference.[src]
  • SwiGLU MLP is the de facto feed-forward in modern LLMs.[src]
  • MoE swaps the dense MLP for a router + many experts; only k experts fire per token.[src]
  • FlashAttention tiles attention into SRAM-resident blocks: same math, far less HBM traffic.[src]
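
To see why sharing K/V heads matters at inference time, the sketch below sizes the key/value cache for one sequence. The layer count, head counts, head dimension, and context length are hypothetical values chosen for round numbers, not the configuration of any specific model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Per-sequence KV cache size: K and V for every layer, KV head, and position."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical 80-layer model, head_dim 128, 8k context, 16-bit cache.
mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=8192)  # one K/V head per Q head
gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=8192)   # 8 Q heads share each K/V head
ratio = mha / gqa
```

Under these assumptions the cache drops from 20 GiB to 2.5 GiB per sequence, an 8× saving that translates directly into larger batches or longer contexts in the same memory.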

Sampling: turning a probability vector into text

The model outputs a probability distribution over the entire vocabulary for the next token. Temperature sharpens or flattens it; top-p truncates the long tail. Then we draw one token, append it, and feed the new sequence back in.

prompt: "The system processes the input data, and"
[Interactive sampler output: candidate next tokens shaded by probability]

Brighter = higher probability; magenta = the token actually sampled. At temperature 0 the same prompt always picks the argmax.
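
Temperature and top-p can be combined in a short sampling routine. This is a minimal sketch with made-up logits for a four-token vocabulary; the function name and the `rng` parameter are assumptions for the example, and serving stacks implement the same idea on GPU tensors.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick a token index using temperature scaling and top-p (nucleus) truncation."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy argmax
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most-probable tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Draw from the kept tokens in proportion to their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
greedy = sample_next(logits, temperature=0)
nucleus = sample_next(logits, temperature=0.8, top_p=0.5, rng=random.Random(0))
```

With these logits the top token alone holds more than half the probability mass at temperature 0.8, so `top_p=0.5` keeps only that token and the draw is deterministic. Raising `top_p` or the temperature lets the tail back in.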

Trillions of Tokens, Months of Wall Time

Pretraining is conceptually simple: predict the next token, average the loss across a batch, backprop, update. The hard parts are data quality, distributed orchestration, and not crashing for 90 days.[src]

C ≈ 6 · N · D

[Interactive compute calculator. Example output: 6.3e+24 FLOPs · 5.9 days · 2,333 MWh · $2.8M]

Chinchilla-optimal D ≈ 20 × N. GPT-4 class is in the 1e25–1e26 FLOP regime.[src][src] Llama 3 405B used ~3.8e25 FLOPs.[src]
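
The rule C ≈ 6 · N · D can be checked against the Llama 3 figure. The sketch assumes about 15.6T training tokens for Llama 3 405B (a figure not stated above), and the cluster size, per-GPU peak, and MFU are hypothetical round numbers.

```python
def training_flops(params, tokens):
    """Approximate training compute: C ≈ 6 · N · D."""
    return 6 * params * tokens

def chinchilla_tokens(params):
    """Chinchilla-style rule of thumb: D ≈ 20 × N."""
    return 20 * params

def training_days(flops, gpus, peak_flops_per_gpu, mfu):
    """Wall-clock days at a given sustained utilization (MFU)."""
    return flops / (gpus * peak_flops_per_gpu * mfu) / 86_400

# Llama 3 405B with an assumed 15.6T training tokens.
llama_flops = training_flops(405e9, 15.6e12)
# Hypothetical cluster: 16,000 GPUs at 1e15 peak FLOP/s each, 40% MFU.
days = training_days(llama_flops, gpus=16_000, peak_flops_per_gpu=1e15, mfu=0.40)
```

The estimate lands at roughly 3.8e25 FLOPs, matching the figure quoted above, and at about 69 days of wall time on the hypothetical cluster.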

How a forward pass shards across 16k GPUs

  • Data parallel (DP): each GPU holds the full model, processes a different micro-batch, and all-reduces gradients at the end.
  • Tensor parallel (TP): slice individual matmuls across GPUs along the hidden dim (Megatron-style).[src]
  • Pipeline parallel (PP): assign different layer ranges to different GPUs; micro-batches flow through like an assembly line.
  • FSDP: shard parameters, gradients, and optimizer states across the data-parallel group; gather just-in-time.[src][src]
  • Expert parallel: for MoE, route tokens to expert shards living on different GPUs.[src]

Real frontier runs combine all five along orthogonal axes ("3D" or "4D" parallelism) to keep every GPU saturated.

Crawl → dedup → filter → mix

Web crawls (Common Crawl), code (GitHub), books, math, multilingual corpora. Deduplicated at document and substring level, quality-filtered by classifiers, then mixed by domain. Quality > quantity past a point.[src]

SFT → RM → PPO / DPO

Supervised fine-tuning on demonstrations, then preference data shapes a reward model, then PPO or — increasingly — Direct Preference Optimization aligns the policy.[src][src] Anthropic's Constitutional AI automates the preference signal.[src]

Benchmarks vs. reality

MMLU, GPQA, SWE-bench, HumanEval, ARC-AGI.[src][src][src] Watch for contamination: if eval data leaked into pretraining, the score is meaningless. Real-world capability often lags benchmarks.

Serving the Trained Mind

Inference has two phases. Prefill processes your prompt in parallel and is compute-bound. Decode generates one token at a time, streaming weights from HBM on every step, and is memory-bandwidth-bound. Modern serving stacks (vLLM, TensorRT-LLM, SGLang) exist to keep both phases saturated through continuous batching and paged KV cache management.[src]

Configure a deployment

[Interactive deployment calculator; inputs: Precision, GPU. Example output: 2 · 1.42 s · 459.4 tok/s · 6.6 GB · $8.35]

Specs sourced from NVIDIA H100 / Blackwell briefs[src][src] and HBM3e standard.[src] FLOPs assume 40% MFU; HBM at 60% effective.

  • Continuous batching: new requests join an in-flight batch each decode step.[src]
  • Paged KV cache: K/V stored in fixed-size pages, like OS virtual memory.
  • Speculative decoding: a tiny draft model proposes K tokens; the big model verifies them in one pass.[src]
  • Prefix caching: reuse KV across requests sharing a system prompt.

Post-training quantization to 4-bit (e.g. GPTQ, AWQ) shrinks weights 4× relative to 16-bit and can raise memory-bound decode throughput by up to roughly the same factor, with single-digit % accuracy loss on most tasks.[src][src]
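
The core idea of weight quantization is to store small integers plus a scale factor. The sketch below uses plain round-to-nearest with one scale per tensor, which is the simplest possible scheme and not the calibrated methods used in production; the weight values are made up.

```python
def quantize_int4(weights):
    """Symmetric per-tensor 4-bit quantization: integers in [-8, 7] plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the 4-bit integers."""
    return [v * scale for v in q]

weights = [0.70, -0.30, 0.12, 0.0]  # illustrative values
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now needs 4 bits instead of 16, and the round-trip error is bounded by half a quantization step. Production methods reduce the accuracy cost further with per-group scales and calibration data.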

NPUs in modern phones and laptops run 3–8B models at INT4/INT8. Local means private, low-latency, and free per query, at the cost of capability. Apple Intelligence, Phi-class, Gemma 2B all live here.

Loops, tools, retrieval

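def run_agent(model, run, tool_result, messages, tools):
    """Call the model in a loop until it answers without requesting tools."""
    while True:
        response = model(messages, tools=tools)  # e.g. tools=[search, code, ...]
        if not response.tool_calls:
            return response.text  # no tool calls: this is the final answer
        for call in response.tool_calls:
            result = run(call)  # execute the requested tool
            messages += [tool_result(result)]  # feed the result back into context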

Agents are LLMs in a control loop, calling functions and reading the results back into context. RAG is the same shape with one tool: vector search over your documents. MCP (Model Context Protocol) standardizes how tools are exposed to models.

Considerations & Open Questions

Modern AI is genuinely useful and genuinely fragile. The failures are not bugs to be patched — they are direct consequences of the training objective and the architecture.

  • Hallucination. The model is trained to maximize next-token likelihood, not truth. When a prompt falls where the training data offered little evidence, it interpolates plausibly. There is no internal "I don't know" signal unless one was explicitly trained in.
  • Prompt injection. The model cannot distinguish instructions from data. Anything that reaches its context — a webpage, an email, a tool result — can hijack behavior.[src]
  • Jailbreaks. Safety training is a thin shell over a much larger base model. Adversarial prompts find the seams.
  • Distribution shift. Performance degrades on inputs unlike the training distribution — long context, niche domains, low-resource languages.
  • Reward hacking. RLHF optimizes for human-rater approval, which is correlated with — but not identical to — being correct or helpful.
  • Energy. A single frontier training run consumes ~10–50 GWh — comparable to a small town for a year. Aggregate inference is now larger than training for major providers.[src]
  • Water. Evaporative cooling can use 1–2 L per kWh of IT load; closed-loop direct-to-chip designs use far less. Site choice matters more than model choice.
  • Grid impact. Hyperscalers are now signing multi-gigawatt PPAs and reviving nuclear capacity to keep up.[src]
  • Synthetic data feedback. If models train on outputs of earlier models, distributions narrow and rare phenomena vanish — "model collapse."[src]

What we still don't know

Interpretability.

We can read individual activations and trace small circuits, but we cannot, for any frontier model, fully explain why a given output was produced.

Alignment.

Specifying what we want a powerful optimizer to do — precisely enough that it won't satisfy the letter while violating the spirit — remains unsolved.

Scaling limits.

Loss continues to fall predictably with compute, but capability jumps are discontinuous and hard to forecast. Data may bind before compute does.

Generalization vs. memorization.

How much of model behavior is learned algorithm vs. retrieved training data is an active research question with real legal and scientific stakes.

Quick Reference

Key acronyms and critical concepts across infrastructure, controls, silicon, and AI systems — organized for rapid review and team learning.

Every abbreviation you need to know

Abbr | Full Name | Quick Note
PUE | Power Usage Effectiveness | Total facility power ÷ IT power. Target: 1.10–1.20 for hyperscale.
TDP | Thermal Design Power | Max sustained chip power draw (watts). Sets cooling requirement.
UPS | Uninterruptible Power Supply | Battery backup bridging 10–15 s gap until generators reach speed.
PDU | Power Distribution Unit | Distributes power from facility feed to rack-level outlets.
ATS | Automatic Transfer Switch | Switches load from utility to generator on failure (100–500 ms).
CDU | Coolant Distribution Unit | Heat exchanger between facility water loop and IT liquid loop.
CRAC | Computer Room Air Conditioning | DX refrigerant-based cooling unit. Good up to ~15 kW/rack.
CRAH | Computer Room Air Handler | Chilled-water based. More efficient at scale than CRAC.
EPO | Emergency Power Off | Kills power to entire zone. Code-required, controversial (nuisance trips).
VFD | Variable Frequency Drive | Controls motor speed (pumps, fans). Key PUE optimization lever.
EPMS | Electrical Power Monitoring System | Network of power meters: V, I, kW, PF, THD at every distribution point.
BMS | Building Management System | Supervisory control for HVAC, cooling plant, environmental monitoring.
SCADA | Supervisory Control and Data Acquisition | Industrial control + HMI layer. Single pane of glass for operators.
DCIM | Data Center Infrastructure Management | Unified view of assets, capacity, power chains, and environment.
PLC | Programmable Logic Controller | Deterministic real-time controller. Scan cycle <10 ms.
DDC | Direct Digital Control | Microprocessor-based HVAC loop control (replaces pneumatic).
HMI | Human-Machine Interface | Operator screens: one-line diagrams, alarm dashboards, schematics.
SOO | Sequence of Operations | The spec document PLC programmers code from and Cx agents test against.
Cx | Commissioning | Systematic verification: L1 factory → L2 install → L3 functional → L4 integrated → L5 seasonal.
IST | Integrated Systems Testing | L4 commissioning: multi-system failure testing with live IT load.
OPC-UA | OPC Unified Architecture | Modern, secure, platform-independent PLC-to-SCADA protocol.
BACnet | Building Automation and Control Networks | ASHRAE/ISO standard for building automation interoperability.
MQTT | Message Queuing Telemetry Transport | Lightweight pub/sub for IIoT. Sparkplug B adds standardized namespace.
SNMP | Simple Network Management Protocol | UDP-based monitoring for UPS, PDU, CRAC. v3 adds encryption.
GPU | Graphics Processing Unit | Massively parallel processor. Primary AI compute engine.
HBM | High Bandwidth Memory | 3D-stacked DRAM via TSVs. 3–8 TB/s bandwidth per GPU.
SM | Streaming Multiprocessor | GPU execution unit containing CUDA cores + Tensor Cores.
FLOPS | Floating Point Operations/Second | Compute throughput. B200: ~4.5 PFLOPS dense at FP8 (~2.25 PFLOPS at BF16).
MFU | Model FLOPs Utilization | Actual vs. peak utilization. Good training: 30–45%.
BF16 | Brain Floating Point 16 | 16-bit, same exponent range as FP32. Default training precision.
FP8 | 8-bit Floating Point | E4M3/E5M2 variants. 2× Tensor Core throughput vs. BF16.
NIC | Network Interface Card | 400/800 Gb/s per GPU in AI clusters.
SFT | Supervised Fine-Tuning | Post-training on curated (prompt, response) pairs.
RLHF | Reinforcement Learning from Human Feedback | Reward model + PPO to align outputs with human preference.
DPO | Direct Preference Optimization | Simpler RLHF alternative; no separate reward model needed.
FSDP | Fully Sharded Data Parallel | Shards model across GPUs. Each gathers params just-in-time.
DP | Data Parallel | Full model copy per GPU, AllReduce gradients.
TP | Tensor Parallel | Splits matmuls across GPUs along hidden dim. Needs NVLink.
PP | Pipeline Parallel | Different layers on different GPUs. Assembly line approach.
RAG | Retrieval-Augmented Generation | Inject retrieved docs into prompt. Reduces hallucination.
KV Cache | Key-Value Cache | Cached K/V from prior tokens. Grows linearly with sequence length.
TTFT | Time to First Token | Prefill latency. Users perceive this as responsiveness.
GQA | Grouped Query Attention | Multiple Q heads share K/V heads → smaller KV cache.
MoE | Mixture of Experts | Router selects top-k of N expert MLPs per token. More params, same FLOPs.
BPE | Byte Pair Encoding | Subword tokenizer. Iteratively merges frequent adjacent pairs.

Core definitions

AI Factory

NVIDIA's term: a purpose-built facility that manufactures intelligence (tokens), not just stores data. Manages entire AI lifecycle: data ingestion → training → fine-tuning → high-volume inference. Positioned as national-scale critical infrastructure.

2N Redundancy

Two independent power paths (A+B), each carrying 100% load. Standard for Tier III/IV mission-critical facilities.

Purdue Model

ISA-95 network segmentation: Level 0 (physical) → Level 5 (enterprise). IT/OT DMZ between Level 3 and 4.

PID Loop

Proportional-Integral-Derivative control. Kp (reacts to error), Ki (eliminates steady-state error), Kd (dampens oscillation).
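
To make the three terms concrete, here is a minimal sketch of a textbook discrete PID loop. The first-order "room" process, the gains, and names such as `simulate`, `ambient`, and `tau` are illustrative assumptions, not values from any real controller, and the sketch omits anti-windup and derivative filtering.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook discrete PID (no anti-windup, no derivative filter)."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, setpoint: float, measured: float, dt: float) -> float:
        error = setpoint - measured
        self.integral += error * dt                  # Ki: accumulates past error
        derivative = (error - self.prev_error) / dt  # Kd: reacts to the error slope
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(kp: float, ki: float, kd: float, setpoint: float = 18.0,
             ambient: float = 25.0, tau: float = 10.0,
             dt: float = 0.1, steps: int = 2000) -> float:
    """Hypothetical first-order process: temperature relaxes toward ambient
    and is pushed by the controller output u (negative u means cooling)."""
    pid = PID(kp, ki, kd)
    temp = ambient
    for _ in range(steps):
        u = pid.update(setpoint, temp, dt)
        temp += dt * ((ambient - temp) + u) / tau
    return temp


p_only = simulate(kp=2.0, ki=0.0, kd=0.0)  # settles above the setpoint
full_pid = simulate(kp=2.0, ki=0.5, kd=0.1)  # integral term removes the offset
```

In this toy process, proportional-only control settles with a permanent offset from the setpoint, while adding the integral term drives the offset to zero.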

Scan Cycle

PLC execution: read inputs → execute program → write outputs → comms. Repeats every 1–20 ms.

IEC 61131-3

PLC programming standard: Ladder Diagram, Structured Text, Function Block Diagram, Instruction List, SFC.

Alarm Rationalization

ISA-18.2 best practice: <1 actionable alarm per operator per 10 min. Prevents alarm fatigue.

Hot/Cold Aisle Containment

Physical separation preventing hot exhaust from mixing with cold supply. Without it, 30–40% cooling wasted.

Free Cooling (Economizer)

Using outside air or raised chiller setpoints when ambient is cool enough. Saves 20–40% cooling energy.

Direct-to-Chip Liquid Cooling

Cold plates on CPU/GPU die. Warm water (30–45°C) enables year-round free cooling. Required above ~50 kW/rack.

ASHRAE TC 9.9

Thermal guidelines: A1 class recommends 18–27°C dry-bulb at rack inlet. Outside range = accelerated failures.

AllReduce

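Collective op: every GPU contributes a tensor, all are summed, every GPU gets the result. A ring schedule is bandwidth-efficient: each GPU sends about 2(P−1)/P times the tensor size, nearly independent of the GPU count P.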
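
The ring variant can be simulated in a few lines. This is a single-process sketch that uses plain Python lists in place of GPU buffers; `ring_allreduce` is an illustrative name, and real libraries overlap these steps with computation. It runs a reduce-scatter phase followed by an all-gather phase, each taking P−1 steps.

```python
def ring_allreduce(tensors):
    """Simulate ring AllReduce (sum) over P equally sized lists."""
    p = len(tensors)
    n = len(tensors[0])
    assert n % p == 0, "sketch assumes the tensor splits evenly into P chunks"
    chunk = n // p
    data = [list(t) for t in tensors]

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: in each step every rank sends one chunk to its
    # right neighbour, which adds it into its own copy of that chunk.
    for step in range(p - 1):
        sends = [((r - step) % p, data[r][span((r - step) % p)]) for r in range(p)]
        for r in range(p):
            c, payload = sends[(r - 1) % p]
            data[r][span(c)] = [a + b for a, b in zip(data[r][span(c)], payload)]

    # Rank r now holds the fully summed chunk (r + 1) % p.
    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(p - 1):
        sends = [((r + 1 - step) % p, data[r][span((r + 1 - step) % p)]) for r in range(p)]
        for r in range(p):
            c, payload = sends[(r - 1) % p]
            data[r][span(c)] = payload

    return data


result = ring_allreduce([[1, 2, 3, 4], [10, 20, 30, 40]])
```

Each rank sends 2(P−1) chunks of size n/P in total, which is where the near-constant per-GPU traffic comes from.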

FlashAttention

IO-aware attention: tiles QKV into SRAM-sized blocks. Avoids N² HBM materialization. 2–4× speedup.

Chinchilla Scaling

Compute-optimal: D ≈ 20 × N tokens. 70B model → 1.4T tokens. FLOPs ≈ 6ND.
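
As a worked example of the two rules of thumb, the sketch below computes the token budget and training FLOPs for a 70B model, then converts FLOPs to wall-clock days. The cluster size, per-GPU peak, and MFU passed to `training_days` are hypothetical inputs chosen only for illustration.

```python
def chinchilla_budget(params: float, tokens_per_param: float = 20.0):
    """Return (tokens, training FLOPs) using D ≈ 20·N and FLOPs ≈ 6·N·D."""
    tokens = tokens_per_param * params
    flops = 6.0 * params * tokens
    return tokens, flops


def training_days(flops: float, gpus: int, peak_flops_per_gpu: float, mfu: float) -> float:
    """Wall-clock days at a sustained rate of gpus × peak × MFU."""
    sustained = gpus * peak_flops_per_gpu * mfu
    return flops / sustained / 86_400


tokens, flops = chinchilla_budget(70e9)  # 1.4e12 tokens, about 5.9e23 FLOPs
days = training_days(flops, gpus=1024, peak_flops_per_gpu=1e15, mfu=0.40)  # hypothetical cluster
```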

Continuous Batching

New requests join in-flight batch at each decode step. Keeps GPU utilization high.

Speculative Decoding

Small draft model proposes K tokens, large model verifies in one pass. ~2–3× speedup, same quality.

Residual Connection

output = layer(x) + x. Gradient highway enabling 100+ layer training. Without it, gradients vanish.

Softmax Temperature

τ applied to logits: softmax(logits/τ). Low τ → deterministic. High τ → creative. τ→0 = argmax.
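
A minimal sketch of temperature scaling in pure Python. The function name and the example logits are illustrative; the max-subtraction is a standard numerical-stability step that does not change the result.

```python
import math


def softmax_with_temperature(logits, tau):
    """softmax(logits / tau), computed stably by subtracting the max."""
    if tau <= 0:
        raise ValueError("tau must be positive; the limit tau → 0 is argmax")
    scaled = [x / tau for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, 0.1)    # nearly all mass on the top logit
flat = softmax_with_temperature(logits, 100.0)   # close to uniform
```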

VESDA

Very Early Smoke Detection Apparatus. Laser-based air sampling — detects smoke before visible particles form.

Clean Agent Suppression

FM-200, Novec 1230 — gaseous fire suppression for IT rooms. Leaves no residue.

Digital Twin

Virtual replica with real-time sensor data. NVIDIA Omniverse for CFD, what-if scenarios, predictive maintenance.

Key numbers to know

Power Chain
  • Utility: 12.47–34.5 kV (medium voltage)
  • Step-down: 480V (US) / 415V (intl)
  • GPU rack (NVL72): ~120 kW
  • Single B200 GPU: ~1,000W TDP
  • 10k GPU cluster: ~12–15 MW facility
Cooling
  • Air cooling limit: ~15–20 kW/rack
  • Liquid cooling: 50–120+ kW/rack
  • ASHRAE A1 inlet: 18–27°C
  • Typical ChW supply: 42°F / 5.5°C
  • PUE 0.1 improvement at 100 MW ≈ $6M/yr saved
AI Training
  • Chinchilla: D ≈ 20N tokens
  • FLOPs ≈ 6 × N × D
  • Good MFU: 30–45%
  • NVLink 5: ~1.8 TB/s per GPU
  • HBM3e: 3–8 TB/s bandwidth

Common questions — Infrastructure

"Walk me through the power distribution from utility to server."

Utility power arrives at 12.47–34.5 kV medium voltage. Main switchgear meters and routes it through an ATS (automatic transfer switch) that can flip to generator feed. Step-down transformers bring it to 480V (US) or 415V (international). From there it splits into redundant A/B paths through UPS systems (battery backup for the 10–15 second generator start gap). UPS output feeds floor PDUs (power distribution units) with static transfer switches. Floor PDUs step down to rack-level bus bars or rack PDUs, which distribute to individual servers via dual power supplies — so each server draws from both the A and B path simultaneously. Every link in this chain has EPMS power meters reporting voltage, current, kW, power factor, and THD in real time.

"What's the difference between 2N and N+1 redundancy?"

N+1: You have N modules needed to carry full load, plus 1 spare. If one fails, the spare picks up the slack. Cheaper, but a second failure means downtime. Example: 4 UPS modules where you only need 3, so you can lose one.

2N: Two completely independent, fully redundant power paths (A and B). Each path alone can carry 100% of the load. There is no shared component: separate utility feeds, separate transformers, separate UPS, separate PDUs. If the entire A side goes down, B carries everything. This is the typical design for Tier IV facilities and is common in Tier III, and it is often specified for AI training clusters, where a power glitch can kill a multi-day training run. 2N+1 adds an extra spare module per side for even higher reliability.

"How does commissioning L4/IST differ from L3?"

L3 (Functional) tests individual systems in isolation against the Sequence of Operations. You verify one chiller starts, ramps, alarms, and shuts down correctly. You test one UPS on bypass. One generator on load. Each system is proven independently.

L4/IST (Integrated Systems Testing) tests multi-system failure scenarios with live IT load (or load banks). You simulate utility loss and verify the entire chain responds: ATS transfers, generators start and sync, UPS bridges the gap, BMS adjusts cooling. Then you cascade failures — chiller trip under load → temperature rise → BMS starts lag chiller → if it fails, load shedding kicks in. L4 proves the systems work together under stress, not just individually. Every deviation is a tracked deficiency requiring resolution before the facility goes live.

"What happens when a chiller trips under full load?"

Immediate: chilled water supply temperature begins rising because remaining capacity can't match the heat load. The BMS detects the trip and initiates the lag chiller start sequence (anti-recycle timer permitting — typically 300 seconds). During this gap, CHW supply temp may rise 3–5°F. CRAH/CDU units see reduced delta-T and increase fan/pump speeds to compensate. If the lag chiller also fails or takes too long, rack inlet temps cross the high-temp alarm threshold (typically 95°F/35°C), triggering a warning alarm. If temps continue rising, the critical alarm (100°F/38°C) triggers IT load shedding — the BMS or DCIM shuts down non-essential compute to reduce heat load. In a liquid-cooled AI hall at 120 kW/rack, you have about 2–3 minutes of thermal runway before throttling begins, versus 10–15 minutes in a traditional 10 kW/rack enterprise hall. This is why proper chiller staging logic, anti-recycle bypass, and N+1 cooling capacity are non-negotiable.

"How do you optimize PUE in a liquid-cooled facility?"

Six primary levers: (1) Raise chilled water supply temperature — warm-water cooling (30–45°C) enables free cooling via dry coolers year-round in most climates, eliminating compressor energy. (2) VFDs on all pumps and fans — variable speed drives match motor speed to actual load instead of running at 100% constant. (3) Eliminate CRAH/CRAC units — liquid cooling removes 60–80% of heat at the chip, so air-side cooling can be minimal or eliminated. (4) Economizer modes — use outside air or raised condenser water setpoints when ambient conditions allow. (5) Higher voltage distribution (415V vs 208V) — reduces I²R distribution losses. (6) Efficient UPS — ECO mode, lithium-ion batteries, and right-sizing UPS capacity to avoid running at low load factors. A well-designed liquid-cooled AI facility can achieve PUE 1.03–1.10; each 0.1 improvement at 100 MW saves ~$6M/year.
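
The savings figure can be sanity-checked with one line of arithmetic. The sketch below assumes the 100 MW refers to IT load, that the load runs all 8,760 hours of the year, and an electricity price of $0.07/kWh; the price and the example PUE values are assumptions, not figures from this guide.

```python
HOURS_PER_YEAR = 8760


def annual_savings_usd(it_load_mw: float, pue_before: float,
                       pue_after: float, usd_per_kwh: float) -> float:
    """Energy cost saved per year when PUE drops at constant IT load."""
    saved_kw = it_load_mw * 1000.0 * (pue_before - pue_after)
    return saved_kw * HOURS_PER_YEAR * usd_per_kwh


savings = annual_savings_usd(100.0, 1.30, 1.20, usd_per_kwh=0.07)  # about $6.1M per year
```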

"Explain the Purdue Model and why OT networks must be segmented."

The Purdue Model (ISA-95) defines 6 levels: Level 0 (physical process: pipes, wires, air), Level 1 (field I/O: sensors, drives, VFDs), Level 2 (control: PLCs, DDC, BMS controllers), Level 3 (site operations: SCADA, historian), Level 4 (business: DCIM, MES, IT systems), Level 5 (enterprise: ERP, email, cloud). A strict IT/OT DMZ with firewalls sits between Level 3 and Level 4. OT networks prioritize availability and safety: a PLC controlling a fire suppression interlock must never be disrupted by a software update, a vulnerability scan, or an IT policy change. IT networks prioritize confidentiality. Mixing them means an IT compromise (phishing, ransomware) could reach PLCs that control physical safety systems. Data flows OT→IT only, via historians, OPC-UA gateways, or MQTT brokers in the DMZ. Security standard: IEC 62443. Never expose PLCs directly to the IT network or the internet.

"What protocols do power meters use? How about HVAC?"

Power meters: Modbus TCP (Ethernet, port 502) or Modbus RTU (RS-485 serial). Some advanced meters also expose data via SNMP or REST APIs. Modbus uses a simple register-based addressing scheme — FC03 reads holding registers, FC04 reads input registers. No native security, so it relies on network segmentation.
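
To show how simple the register-based scheme is, here is a sketch that builds and parses raw Modbus TCP frames for FC03/FC04 using only the standard library. The function names, register address, and sample response bytes are illustrative; a real integration would normally use an established Modbus library and the meter vendor's register map.

```python
import struct

READ_HOLDING_REGISTERS = 0x03
READ_INPUT_REGISTERS = 0x04


def build_read_request(transaction_id: int, unit_id: int, function_code: int,
                       start_address: int, quantity: int) -> bytes:
    """Build a Modbus TCP request: 7-byte MBAP header + 5-byte PDU."""
    if function_code not in (READ_HOLDING_REGISTERS, READ_INPUT_REGISTERS):
        raise ValueError("sketch only covers FC03 and FC04")
    if not 1 <= quantity <= 125:
        raise ValueError("a single read may request 1 to 125 registers")
    pdu = struct.pack(">BHH", function_code, start_address, quantity)
    # MBAP: transaction id, protocol id (always 0), byte count that follows, unit id
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu


def parse_read_response(frame: bytes) -> list:
    """Return the 16-bit register values from a read response frame."""
    function_code = frame[7]
    if function_code & 0x80:  # high bit set means the device returned an exception
        raise RuntimeError(f"Modbus exception code {frame[8]}")
    byte_count = frame[8]
    return list(struct.unpack(f">{byte_count // 2}H", frame[9:9 + byte_count]))


request = build_read_request(1, 1, READ_HOLDING_REGISTERS, 0, 2)
sample_response = bytes.fromhex("000100000007010304" "00e60001")  # made-up reply
registers = parse_read_response(sample_response)
```

Everything travels as unauthenticated big-endian integers, which is why the protocol depends on network segmentation for protection.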

HVAC / BMS: BACnet IP (UDP port 47808) is the ASHRAE/ISO standard for building automation. Data is organized as objects (Analog Input, Analog Output, Binary Value, Schedule, Trend Log) with properties (Present-Value, Status-Flags). Supports COV (Change of Value) subscriptions. Some legacy systems use LON or proprietary serial. Modern integration uses OPC-UA as the unifying layer, with gateways translating BACnet↔OPC-UA and Modbus↔OPC-UA at system boundaries.

Common questions — AI/ML

"Why can't you air-cool a GPU training cluster?"

Physics. A single GB200 NVL72 rack dissipates ~120 kW in a 0.6 m² footprint. Air cooling works by blowing cold air across heat sinks and exhausting hot air — but air has very low thermal capacity (specific heat ~1 kJ/kg·K vs water at ~4.2 kJ/kg·K). To remove 120 kW with air, you'd need airflow volumes that are physically impossible to route through a rack — the fans alone would consume massive power and create deafening noise. The practical ceiling for air cooling is ~15–20 kW/rack. Above 50 kW/rack, direct-to-chip liquid cooling is mandatory: cold plates mounted on each GPU/CPU transfer heat to a water loop via a CDU (Coolant Distribution Unit). The CDU rejects heat to the building chilled water plant. Liquid removes 60–80% of the server heat, leaving only residual air cooling for memory, storage, and fans.
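
The heat balance Q = ṁ · c_p · ΔT makes the gap concrete. The sketch below uses the specific heats quoted above and assumes an air density of 1.2 kg/m³, a water density of 1000 kg/m³, a 15 K air temperature rise, and a 10 K water temperature rise; those densities and temperature rises are illustrative assumptions.

```python
def volumetric_flow_m3_s(heat_kw: float, cp_kj_per_kg_k: float,
                         density_kg_m3: float, delta_t_k: float) -> float:
    """Coolant volume flow needed to carry heat_kw at a given temperature rise."""
    mass_flow_kg_s = heat_kw / (cp_kj_per_kg_k * delta_t_k)  # from Q = m_dot * cp * dT
    return mass_flow_kg_s / density_kg_m3


air = volumetric_flow_m3_s(120.0, cp_kj_per_kg_k=1.0, density_kg_m3=1.2, delta_t_k=15.0)
water = volumetric_flow_m3_s(120.0, cp_kj_per_kg_k=4.2, density_kg_m3=1000.0, delta_t_k=10.0)

air_m3_per_s = air                 # roughly 6.7 cubic metres of air every second
water_litres_per_s = water * 1000  # roughly 2.9 litres of water every second
```

Under these assumptions one rack needs several cubic metres of air per second but only a few litres of water per second.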

"What's the difference between prefill and decode?"

LLM inference has two distinct phases. Prefill processes the entire input prompt in parallel — one big forward pass through all layers, populating the KV cache for every token in the prompt. Prefill is compute-bound (lots of matmuls on many tokens at once). It determines TTFT (time to first token).

Decode generates output tokens one at a time. Each step reads the full model weights from HBM but processes only one new token, using the KV cache from all previous tokens. Decode is memory-bandwidth bound: the GPU spends most of its time loading weights, not computing. Time per decode step ≈ model_size_bytes / HBM_bandwidth, so single-sequence throughput is roughly HBM_bandwidth / model_size_bytes tokens per second. This is why techniques like continuous batching (amortize weight loading across many concurrent requests) and speculative decoding (verify K draft tokens in one pass) matter so much for serving efficiency.
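
A rough bandwidth roofline for decode: each step streams the weights once, so step time is about weight bytes divided by HBM bandwidth, and the token rate is its reciprocal. The sketch ignores KV cache reads and compute time, and the 70B FP16 model on a single 8 TB/s device is a hypothetical configuration (it would not actually fit on one GPU without sharding).

```python
def decode_step_seconds(param_count: float, bytes_per_param: float,
                        hbm_bytes_per_s: float) -> float:
    """Lower bound on one decode step: time to stream all weights from HBM."""
    return param_count * bytes_per_param / hbm_bytes_per_s


def decode_tokens_per_second(param_count: float, bytes_per_param: float,
                             hbm_bytes_per_s: float, batch_size: int = 1) -> float:
    """Upper bound on token rate; a batch shares one pass over the weights."""
    return batch_size / decode_step_seconds(param_count, bytes_per_param, hbm_bytes_per_s)


step = decode_step_seconds(70e9, 2, 8e12)                        # 17.5 ms per step
single = decode_tokens_per_second(70e9, 2, 8e12)                 # about 57 tokens/s
batched = decode_tokens_per_second(70e9, 2, 8e12, batch_size=32) # 32x the aggregate rate
```

The batched figure shows why continuous batching helps: the same weight traffic serves every request in the batch.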

"How does tensor parallelism differ from data parallelism?"

Data parallelism (DP): Every GPU holds a complete copy of the model. Each GPU processes a different mini-batch of data. After computing local gradients, all GPUs AllReduce them to stay synchronized. Simple, but every GPU needs enough memory for the full model + optimizer states. Works across nodes with moderate bandwidth.

Tensor parallelism (TP): A single matmul is split across GPUs along the hidden dimension (Megatron-style). For the first MLP weight matrix W of shape [h, 4h], GPU 0 gets W[:, :2h] and GPU 1 gets W[:, 2h:]. Each GPU computes its column slice with no communication, then multiplies by the matching row slice of the second MLP matrix (shape [4h, h]); the resulting partial sums are combined with an AllReduce. This requires high-bandwidth links (NVLink at 1.8 TB/s) because activations are communicated every layer. TP divides per-GPU weight memory by roughly the number of TP ranks. Typically used within a node (2–8 GPUs on NVLink), while DP is used across nodes (over InfiniBand/Ethernet).
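
The Megatron-style MLP split can be checked on a toy example: shard the first weight matrix by columns and the second by rows, let each "GPU" compute independently, and sum the partial outputs. This is a single-process sketch with nested lists; ReLU stands in for the usual GeLU, and all names and matrices are illustrative.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


def split_columns(m, parts):
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials):
    """Element-wise sum of same-shaped matrices (stand-in for AllReduce)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def tensor_parallel_mlp(x, w1, w2, ranks):
    w1_shards = split_columns(w1, ranks)  # each rank owns a column slice of W1
    w2_shards = split_rows(w2, ranks)     # and the matching row slice of W2
    partials = [matmul(relu(matmul(x, a)), b) for a, b in zip(w1_shards, w2_shards)]
    return all_reduce_sum(partials)       # the only communication in the block


x = [[1, 2], [3, -1]]
w1 = [[1, -1, 2, 0], [0, 1, -1, 3]]
w2 = [[1, 0], [2, 1], [-1, 1], [0, 2]]
dense = matmul(relu(matmul(x, w1)), w2)
sharded = tensor_parallel_mlp(x, w1, w2, ranks=2)
```

Because the activation is applied element-wise, each rank can apply it to its own column slice without seeing the other slices.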

"What is FlashAttention and why does it matter?"

Standard attention computes softmax(QKᵀ/√d)·V, which requires materializing the full N×N attention matrix in HBM (GPU main memory). For a 128K context, that's 128K² = 16 billion elements, a massive memory and bandwidth cost.

FlashAttention (Dao et al., 2022) tiles the computation into small blocks that fit entirely in GPU SRAM (fast on-chip memory, ~20 MB). It computes exact attention, with no approximation, but never materializes the full N×N matrix. This reduces HBM reads/writes from O(N²) to O(N²d²/M), where d is the head dimension and M is the SRAM size. Result: 2–4× wallclock speedup, dramatically lower memory usage, and the ability to train with much longer contexts without running out of memory. It's now the default in virtually every major training and inference framework.
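
The key trick that lets tiling stay exact is the online softmax: keep a running max, a running normalizer, and a running weighted sum, and rescale them whenever a new block raises the max. The sketch below shows this for a single query row in pure Python; it illustrates the numerics only, not the GPU kernel, and the function names and data are made up.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention_row(q, keys, values):
    """Reference: materialize every score, then softmax, then weight the values."""
    scores = [dot(q, k) / math.sqrt(len(q)) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def tiled_attention_row(q, keys, values, block):
    """Same result, processing keys/values one block at a time."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [dot(q, k) / math.sqrt(len(q)) for k in k_blk]
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)  # 0.0 on the first block
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * scale + sum(weights)
        acc = [a * scale + sum(w * v[j] for w, v in zip(weights, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0, 2.0, 0.1]
keys = [[0.1 * i, -0.2 * i, 0.05 * i * i, 1.0] for i in range(10)]
values = [[float(i), float(10 - i)] for i in range(10)]
reference = naive_attention_row(q, keys, values)
tiled = tiled_attention_row(q, keys, values, block=3)
```

Only one block of scores exists at any time, which is what removes the N×N intermediate.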

"How do you reduce KV cache memory usage?"

The KV cache stores Key and Value tensors for all previous tokens across all layers. For a 70B model with 80 layers, 8 KV heads, 128-dim heads, at FP16, each token costs 2 × 80 × 8 × 128 × 2 bytes ≈ 328 KB, so a single sequence of 4K tokens uses ~1.3 GB. At batch size 64, that's ~86 GB, more than half the size of the 140 GB of FP16 model weights. Three main strategies:

(1) GQA (Grouped Query Attention): share KV heads across multiple Q heads. If 32 Q heads share 8 KV heads, the KV cache is 4× smaller. Llama 2 70B, Mistral, and most modern models use GQA. (2) KV cache quantization: store cached K/V in FP8 or INT8 instead of FP16, halving memory (4-bit formats quarter it). (3) PagedAttention (vLLM): manage KV cache in fixed-size pages like OS virtual memory. Eliminates fragmentation that previously wasted 60–80% of allocated KV space. Pages can be shared (prefix caching) or freed independently.
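
KV cache size follows directly from the tensor shapes: 2 (K and V) × layers × KV heads × head dimension × tokens × bytes per element. The sketch below applies that formula to the 80-layer, 8-KV-head, 128-dim configuration described above; the function name is illustrative.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, seq_len: int,
                   batch: int = 1, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache K and V for every layer, head, and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


one_seq = kv_cache_bytes(80, 8, 128, 4096)              # 1,342,177,280 bytes (1.25 GiB)
batch_64 = kv_cache_bytes(80, 8, 128, 4096, batch=64)   # 85,899,345,920 bytes (80 GiB)
fp8 = kv_cache_bytes(80, 8, 128, 4096, bytes_per_elem=1)
gqa_saving = kv_cache_bytes(80, 32, 128, 4096) / one_seq  # 32 vs 8 KV heads
```

The same function shows the effect of each lever: fewer KV heads and fewer bytes per element both scale the total linearly.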

"What makes AI datacenter power loads different from enterprise?"

Three fundamental differences: (1) Power volatility — enterprise servers draw relatively steady power. GPU clusters swing from ~30% TDP at idle to 100% TDP in milliseconds when a training batch launches. This creates transient spikes that stress UPS systems and confuse capacity planning models built for static loads. (2) Thermal density — at 80–120 kW/rack vs enterprise 5–15 kW/rack, the thermal time constant collapses. A cooling failure that gives you 15 minutes in an enterprise hall gives you <3 minutes in an AI hall. Monitoring latency that was fine at 60-second intervals becomes dangerous. (3) Workload correlation — in an AI cluster, all GPUs in a training job start and stop together, so power and thermal loads are highly correlated across hundreds of racks. Traditional designs assume statistical diversity (some racks busy, some idle, averaging out). AI training breaks that assumption — your cooling plant must handle near-instantaneous 0-to-100% step changes.

"What is MoE and how does it affect inference?"

Mixture of Experts (MoE) replaces the dense MLP (feed-forward) block in each transformer layer with N parallel "expert" MLPs plus a learned router. For each token, the router selects the top-k experts (typically k=2 of N=8–64). Total model parameters are much larger (since all N experts exist), but per-token FLOPs match those of a much smaller dense model with only the active parameters (only k experts fire per token).

Impact on inference: The full model must be in memory (all experts loaded), so MoE models need more GPU memory than a dense model of equivalent quality — a 47B-active-parameter MoE might have 140B total parameters. However, per-token compute cost equals the active parameter count, so inference is faster than a dense 140B model. The challenge is expert load balancing — if the router sends most tokens to the same few experts, you get hotspots on some GPUs and idle capacity on others. Auxiliary load-balancing losses during training and expert parallelism (spreading experts across GPUs) mitigate this.
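
A toy version of top-k routing for a single token. Scalar functions stand in for expert MLPs, and the gate weights are renormalized with a softmax over the selected experts only, which is one common convention assumed here; all names and numbers are illustrative.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax their logits into gates."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of the k selected experts; the rest never run."""
    output = 0.0
    for index, gate in top_k_route(router_logits, k):
        output += gate * experts[index](x)
    return output


# Four hypothetical experts: expert i multiplies its input by (i + 1).
experts = [lambda x, scale=i + 1: x * scale for i in range(4)]
router_logits = [0.1, 2.0, -1.0, 2.0]
routes = top_k_route(router_logits, k=2)
y = moe_forward(1.0, router_logits, experts, k=2)
```

All four experts must exist in memory, but only two are evaluated for this token, which is the memory-versus-compute trade described above.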

Further Reading

Every technical claim in this guide traces to a primary source. The field continues to evolve rapidly — interpretability, mechanistic analysis, training dynamics, and hardware co-design all remain active areas of research.

Source Material

End of guide.
