AI DATACENTER ENGINEERING

From Power Grid to Inference

A team learning guide

AI is not magic. It is electricity, linear algebra, and scale — disciplined into prediction. This guide walks your team through every layer of the system, from a 130-megawatt facility to the softmax function that selects the next token.

PRIMER

What is AI, really?

When you ask ChatGPT a question and it answers, it feels like thinking. It isn't. What's actually happening is a very fast, very expensive math problem: the system has read billions of sentences and learned statistical patterns between words. Given a prompt, it predicts the most probable next word, then the next, then the next — hundreds of times per second.

That "prediction engine" is called a neural network. Think of it as a massive spreadsheet with billions of adjustable numbers (called parameters). During training, the system reads data, makes predictions, checks its mistakes, and tweaks those numbers to be slightly less wrong next time. Do this trillions of times and the spreadsheet gets eerily good at predicting language, images, code, or music.

Step 1 — Training

Feed the network terabytes of text. It reads, predicts, gets corrected, and adjusts billions of parameters over weeks or months.

Step 2 — The Model

After training, you have a giant file of tuned parameters. This is the AI: a frozen snapshot of everything it learned. GPT-4 is reported, though not officially confirmed, to have roughly 1.8 trillion parameters.

Step 3 — Inference

When you ask it a question, the model runs those parameters against your prompt to generate one token (word-piece) at a time. This is called inference.
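The token-by-token loop in Step 3 can be sketched in a few lines. This is a toy illustration only: the four-word vocabulary and the score table are hypothetical stand-ins for a trained network, and greedy selection (always taking the most probable token) is used for simplicity, whereas production systems usually sample from the probabilities.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and score table (illustrative, not from a real model).
VOCAB = ["the", "cat", "sat", "<end>"]
TOY_LOGITS = {
    "the": [0.1, 2.0, 0.3, 0.0],
    "cat": [0.2, 0.1, 2.5, 0.4],
    "sat": [0.3, 0.1, 0.2, 3.0],
}

def generate(prompt_token, max_tokens=5):
    """Greedy decoding: repeatedly append the most probable next token."""
    tokens = [prompt_token]
    while len(tokens) < max_tokens and tokens[-1] in TOY_LOGITS:
        probs = softmax(TOY_LOGITS[tokens[-1]])
        next_index = probs.index(max(probs))
        tokens.append(VOCAB[next_index])
    return tokens

print(generate("the"))
```

A real model replaces the lookup table with billions of parameters and a vocabulary of tens of thousands of tokens, but the outer loop has the same shape.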

Why does AI need a building?

The math behind AI is simple — multiply matrices and add biases — but the scale is staggering. Training a frontier model means multiplying trillions of numbers together, billions of times. Your laptop's processor can't do this in a useful timeframe.

So companies pack thousands of specialised chips called GPUs into warehouse-sized buildings called data centers. These chips draw enormous amounts of electricity, and all that electricity becomes heat. Cooling that heat requires industrial-scale plumbing — chilled water loops, cooling towers, sometimes even outdoor ponds. A single training cluster can draw as much power as a small city.

Mental model: Imagine you need to hand-multiply a million spreadsheets per second, each one thousands of rows long. You'd need a lot of calculators (GPUs), a lot of desks (servers), a huge building (datacenter), and enough electricity and air conditioning to keep everyone working without overheating. That is the physical reality of AI.

The AI Factory

NVIDIA uses a deliberate term for these facilities: the AI factory. Not "data center" but factory. A traditional data center stores and serves data. An AI factory manufactures intelligence. Raw data enters one end; trained models and inference tokens come out the other. The primary product is tokens, and throughput is the measure of output, just like any manufacturing operation.[src]

This framing changes how you think about every layer of the stack. The facility isn't just housing for servers — it's an industrial production line that manages the entire AI lifecycle:

Input: Raw Data

Data pipelines ingest, clean, and structure trillions of tokens from unstructured text, images, code, and sensor data. Data quality directly determines model quality — garbage in, garbage out, but at a trillion-token scale.

Output: Intelligence (Tokens)

Trained models generate predictions, decisions, and content in real time. Inference outputs feed back into the system as a data flywheel — improving model accuracy over time. Every token generated creates new training signal.

Full-Stack Infrastructure

GPUs, NVLink/NVSwitch fabrics, InfiniBand networking, parallel storage, liquid cooling — plus the software stack: CUDA, TensorRT, NIM microservices. Hardware and software designed as a single integrated system.

Digital Twins

NVIDIA Omniverse lets teams design, simulate, and optimize the entire facility virtually — testing layout changes, modeling failure scenarios, and validating cooling before construction begins.

Continuous Optimization

Automation tools handle hyperparameter tuning, model deployment, and performance monitoring. The factory operates 24/7 — training, fine-tuning, and serving inference at scale with minimal human intervention.

NVIDIA positions AI infrastructure as national infrastructure — as fundamental as roads, power grids, and telecommunications. Sovereign nations are building their own AI factories to cultivate local language models, protect data sovereignty, and drive economic competitiveness. Every enterprise, every government will need access to one.

In the sections ahead, we'll examine every part of this system, from the power grid and cooling systems (Infrastructure) through chip architecture (Silicon), the underlying math (Mathematics), the transformer architecture that makes language models work (Architecture), training and inference at scale (Training & Inference), and real-world considerations (Considerations).

INFRASTRUCTURE

The Building Is the Computer

If you're new here

A data center is a large, climate-controlled building packed with thousands of computers. Most of the internet already runs from buildings like this. What makes an AI data center different is the sheer density of power and heat. Training a single frontier AI model can require thousands of specialised chips called GPUs, each drawing up to 1,000 watts, roughly the same as a microwave oven running flat out. Multiply that by 10,000+ chips and you need as much power as a small town.

All that electricity becomes heat. So data centers need industrial cooling (water loops, heat exchangers, or cooling towers) just to keep the chips within safe operating temperatures. The section below shows how much power a training cluster needs.

A frontier model is not trained on a laptop. It is trained inside a purpose-built industrial facility that converts megawatts of grid power into gradients. Site selection is dominated by three things: cheap reliable electricity, fiber backbones, and water or air cool enough to reject heat.[src]

SIM // CLUSTER POWER

Example training cluster

GPUs (Blackwell-class, ~1 kW each): 8,192
PUE (1.10 hyperscale → 1.56 global avg): 1.20
Facility load: 10.8 MW
Annual energy: 94.7 GWh
Power cost: $757/hr
Cooling water (evap, max): 102,849 gal/day

For reference: 100k Blackwell GPUs ≈ a mid-size city's continuous load.[src]
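The cluster figures above can be approximated with simple arithmetic. The sketch below is not the guide's own calculator: the 10% non-GPU server overhead and the $0.07/kWh electricity rate are assumptions, chosen because they are consistent with the displayed 10.8 MW, 94.7 GWh, and $757/hr.

```python
HOURS_PER_YEAR = 8760

def cluster_power(gpus, pue, gpu_kw=1.0, server_overhead=1.10, usd_per_kwh=0.07):
    """Estimate facility load (MW), annual energy (GWh), and hourly cost (USD).

    server_overhead and usd_per_kwh are assumed values inferred from the
    displayed figures; they are not stated in the guide.
    """
    it_load_mw = gpus * gpu_kw * server_overhead / 1000
    facility_mw = it_load_mw * pue
    annual_gwh = facility_mw * HOURS_PER_YEAR / 1000
    cost_per_hour = facility_mw * 1000 * usd_per_kwh
    return facility_mw, annual_gwh, cost_per_hour

facility_mw, annual_gwh, cost_per_hour = cluster_power(gpus=8192, pue=1.20)
print(f"{facility_mw:.1f} MW, {annual_gwh:.1f} GWh/yr, ${cost_per_hour:.0f}/hr")
```

Changing `pue` from 1.20 to 1.10 in the call shows how much of the facility load is cooling and distribution overhead rather than compute.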

RACK // GB200 NVL72 CLASS

One rack, 72 GPUs, ~120 kW

Compute trays: 18× per rack, each holding 2 Grace CPUs + 4 Blackwell GPUs, lashed together by 5th-gen NVLink at ~1.8 TB/s per GPU.[src][src]
NVSwitch trays: 9× per rack form a non-blocking all-to-all fabric so any GPU can reach any other GPU at full NVLink bandwidth.
Cooling: direct-to-chip liquid cooling. Air alone cannot evacuate ~120 kW from a single 0.6 m² footprint.
Power shelves: bus-bar delivery; redundant 415 V three-phase feeds backed by UPS and diesel/gas generators in 2N or N+1 topology.
Optics: 800 Gb/s InfiniBand or Ethernet links exit the rack toward a rail-optimized fat-tree spine.[src]

POWER // MV-TO-RACK DISTRIBUTION

Complete electrical path — utility to server

Every AI factory converts grid power into GPU compute through a carefully engineered chain. Each stage steps down voltage, adds protection, and provides monitoring. A 2N topology means two fully independent paths — either side can carry the full facility load alone.

UTILITY FEED: 12.47 – 34.5 kV medium voltage · Dual utility feeds at Tier III+
▼
MV SWITCHGEAR: Vacuum/SF₆ breakers · Protective relaying · Revenue metering · SEL / GE Multilin relays · ATS for utility/gen switchover
▼
STANDBY GENERATORS: 2 MW diesel/gas gensets · N+1 or 2N redundancy · Paralleling switchgear · 10s start · Load bank tested
▼
MV → LV TRANSFORMER: Step-down: 12.47 kV → 480 V, 3-phase · Cast-coil dry type (indoor) · K-rated for harmonics · Δ-Y winding
▼
LV SWITCHBOARD / SWBD: 480 V distribution · Circuit breakers · SPD · Feeds split into A + B buses for 2N downstream
▼
A PATH and B PATH (two identical, independent chains, one per bus):

UPS-A / UPS-B: Rotary or Li-ion · 480 V input/output · 10–15 min ride-through · VRLA or LFP battery
PDU-A / PDU-B (Floor): Step-down: 480 V → 208/120 V · Panel boards · breaker distribution
RPP-A / RPP-B (Remote Power Panel): Row-level distribution · breakers per rack
RACK PDU-A / RACK PDU-B: In-rack · metered + switched · Per-outlet V, A, kW, kWh · SNMP/Modbus
▼
GPU SERVER / TRAY: Dual PSUs, one from A path, one from B path · Auto-failover: either PSU carries full load

MV Stage (12.47–34.5 kV)

Utility entrance through MV switchgear. Vacuum or SF₆ breakers provide fault isolation. Protective relays (SEL-751, GE Multilin 489) monitor overcurrent, differential, ground fault. Revenue-grade metering (CT/PT accuracy class 0.3) for utility billing. ATS (automatic transfer switch) or paralleling switchgear coordinates utility ↔ generator transitions with <10s open-transition or <100ms closed-transition transfer.

LV Stage (480 V → 208 V)

Cast-coil dry-type transformers step down to 480 V. K-rated (K-13 or K-20) to handle harmonic distortion from server power supplies. UPS systems (rotary flywheel or lithium-ion) provide 10–15 min ride-through for generator start. Floor PDUs step down 480 → 208/120 V with integrated breakers and metering. RPPs (remote power panels) distribute at row level — each RPP feeds 4–8 racks with individually breakered circuits.

Rack Level

Intelligent rack PDUs (ServerTech, Raritan, APC) provide per-outlet monitoring: voltage, current, kW, kWh, power factor. Data is reported via SNMP, Modbus TCP, or REST API to the EPMS. A typical GPU rack draws 40–120+ kW with >90% power factor. At 120 kW (GB200 NVL72), 208 V delivery becomes impractical: a single 100 A, 208 V three-phase circuit carries only about 36 kVA before derating, so a rack at this density needs several such circuits per path or a direct 415/480 V bus bar feed that bypasses the 208 V transformation entirely.
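Rack feed sizing can be sanity-checked with the three-phase power formula P = √3 · V · I · PF. The sketch below assumes a 0.9 power factor and an 80% continuous-load derating on each breaker; both values, and the function names, are illustrative assumptions rather than figures from a specific electrical design.

```python
import math

def three_phase_kw(volts_line_to_line, amps, power_factor=0.9, derate=0.8):
    """Usable real power (kW) of one three-phase circuit.

    derate=0.8 models loading a breaker to 80% for continuous loads;
    power_factor=0.9 is an assumed value.
    """
    kva = math.sqrt(3) * volts_line_to_line * amps / 1000
    return kva * power_factor * derate

def circuits_needed(rack_kw, volts_line_to_line, amps):
    """Number of circuits required to carry rack_kw on one power path."""
    return math.ceil(rack_kw / three_phase_kw(volts_line_to_line, amps))

for volts in (208, 415, 480):
    per_circuit = three_phase_kw(volts, 100)
    count = circuits_needed(120, volts, 100)
    print(f"{volts} V: {per_circuit:.1f} kW per 100 A circuit, {count} circuits for 120 kW")
```

The comparison makes the case for higher distribution voltage: the same breaker size carries roughly twice the power at 415 V as at 208 V.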

EPMS MONITORING AT EVERY STAGE

Stage	Metering	Key Measurements	Protocol
MV Switchgear	ION 9000 / SEL-735	V, A, kW, kVAR, PF, THD, demand	Modbus TCP / DNP3
Generator	Controller (DSE, DEIF)	kW, fuel level, RPM, coolant temp, battery V	Modbus RTU/TCP
Transformer	Temp sensors (winding/oil)	Winding temp, oil temp, loading %	4–20 mA / Modbus
UPS	Built-in controller	Input/output V, A, kW, battery SOC, temp, runtime	SNMP / Modbus TCP
Floor PDU / STS	ION 7650 / PM8000	V, A, kW per panel, breaker status, transfer count	Modbus TCP
RPP	Branch circuit monitor	Per-breaker A, kW, alarm on trip	Modbus TCP / BACnet
Rack PDU	Intelligent PDU	Per-outlet V, A, kW, kWh, PF, inlet temp	SNMP / Modbus / REST

All metering data feeds into the EPMS for real-time PUE calculation, capacity planning, and alarm management. Total facility power (metered at MV) ÷ IT load (metered at rack PDU) = PUE.
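As a minimal sketch of that calculation, the function below divides one MV meter reading by the sum of rack PDU readings. The readings are hypothetical values made up for illustration, and a real monitoring system would also handle stale or missing meters.

```python
def pue(total_facility_kw, rack_pdu_kw):
    """PUE = total facility power / IT load (sum of rack PDU readings)."""
    it_load_kw = sum(rack_pdu_kw)
    if it_load_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_load_kw

# Hypothetical readings: one MV meter and four rack PDUs, all in kW.
rack_readings = [118.0, 121.5, 119.2, 120.3]
print(round(pue(575.0, rack_readings), 2))
```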

REDUNDANCY // UPS TOPOLOGIES

How many paths to the server?

N: Single path, no redundancy. Maintenance means downtime. Unacceptable for mission-critical.

N+1: One spare module. Can lose one UPS and still support full load. Minimum for production workloads.

2N: Fully mirrored, with two independent paths (A+B); either side carries full load alone. Standard for Tier III/IV. Required for AI training clusters where a power glitch kills a multi-day run.

2N+1: 2N with an additional spare module per side. Highest reliability: allows maintenance on one side while the other has N+1 protection.

GPU racks draw 40–120+ kW each (the GB200 NVL72 sits at the top of that range, at ~120 kW). This drives unique requirements for power density, UPS sizing, and breaker coordination.
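The four topologies differ mainly in how many UPS modules must be installed for a given load. A rough sketch, assuming identical modules and a hypothetical 1,200 kW load on 500 kW modules:

```python
import math

def ups_modules(load_kw, module_kw, topology):
    """Total UPS modules installed for a given redundancy topology.

    n is the minimum module count needed to carry the load once.
    """
    n = math.ceil(load_kw / module_kw)
    layouts = {
        "N": n,               # single path, no spare
        "N+1": n + 1,         # one spare module
        "2N": 2 * n,          # two full, independent paths
        "2N+1": 2 * (n + 1),  # one spare module on each of the two paths
    }
    return layouts[topology]

# Hypothetical example: 1,200 kW of load served by 500 kW modules.
for topology in ("N", "N+1", "2N", "2N+1"):
    print(topology, ups_modules(1200, 500, topology))
```

The module count roughly doubles between N+1 and 2N, which is the capital cost of being able to lose an entire path.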

COOLING // AIR-BASED

Traditional approaches

CRAC — Computer Room Air Conditioning. DX (direct expansion) refrigerant-based, self-contained units. Common in smaller facilities.

CRAH — Computer Room Air Handler. Chilled water coils with fans. More efficient at scale — the chiller plant is centralized.

Containment — hot aisle/cold aisle physical separation prevents mixing. Without containment, 30–40% of cooling capacity is wasted.

Economizer — free cooling when outside air temp is low enough. Raises chilled water setpoint, disables compressors. Saves 20–40% of cooling energy annually depending on climate.

Air cooling works up to ~15–20 kW/rack. Above 50 kW/rack, air alone is physically insufficient regardless of airflow volume.

COOLING // LIQUID FOR AI/GPU

Direct-to-chip: the new standard

Cold plates — mounted directly on GPU/CPU die. Warm water cooling (30–45°C supply) enables free cooling year-round in most climates. Removes 60–80% of server heat via liquid.

CDU — Coolant Distribution Unit. Heat exchanger between facility water loop and IT water loop. IT loop uses deionized water or propylene glycol in a closed circuit. CDU provides pumps, filtration, flow/temp/pressure monitoring.

Chilled water plant — water-cooled chillers (centrifugal/screw), primary/secondary pumping, cooling towers. Typical temps: 42°F supply / 55°F return. Variable-speed drives on pumps and fans are key efficiency levers.

The GB200 NVL72 is liquid-cooled at rack level. Robust leak detection and emergency drain procedures are mandatory — a CDU leak can destroy millions in hardware in minutes.

COOLING // MECHANICAL DEEP DIVE

Inside the cooling loop — component by component

AI data centers reject tens of megawatts of heat continuously. Below is a detailed look at every major mechanical component in the liquid cooling chain, from the cold plate bolted to the GPU die to the cooling tower rejecting heat to the atmosphere. Understanding these components — and how they interact — is essential for anyone involved in design, commissioning, or operations.

END-TO-END LIQUID COOLING LOOP

Two loops: IT WATER LOOP (closed, deionized) and FACILITY WATER LOOP (condenser / tower water)

GPU DIE: ~1000W TDP
▼
COLD PLATE (on GPU die): Copper micro-channel · 45°C supply → 55°C return
▼ hot → ← cool
CDU (Coolant Distribution Unit): Heat Exchanger · Pump (redundant pair) · Filter / DI Resin · Reservoir · Sensors (flow/temp/press/leak) · IT loop: deionized water or propylene glycol
▼ hot → ← cool
CHILLER (Centrifugal or Screw): Evaporator · Condenser · Compressor · Expansion · R-134a / R-1234ze · COP: 5–8
▼ hot CW → ← cool CW
COOLING TOWER (Evaporative): Fan (VFD) · Fill Media · Basin · ↑ heat to atmosphere

CHW PUMPS: Primary + Secondary
CW PUMPS: VFD controlled

Heat flows GPU → cold plate → CDU → chiller → cooling tower → atmosphere. Two isolated water loops prevent contamination of IT equipment.

CDU // COOLANT DISTRIBUTION UNIT

The bridge between IT and facility

CDU INTERIOR — TYPICAL LAYOUT

Flow: HOT from racks → CDU → COOL to racks

PLATE & FRAME HEAT EXCHANGER: IT side ↔ Facility side · Brazed or gasketed stainless plates
PUMP A (Primary): Mag-drive or canned-motor
PUMP B (Standby): Auto-failover on fault
RESERVOIR / EXPANSION TANK: Maintains system pressure & fluid level
FILTRATION: 5μm particulate + DI resin
FLOW CONTROL: Balancing valves
RELIEF VALVE: Overpressure safety
INSTRUMENTATION & SENSORS: Flow meter (GPM) · Supply temp (°F/°C) · Return temp (°F/°C) · Diff. pressure (PSI) · Conductivity (μS/cm) · Leak detection
CDU CONTROLLER: PLC or embedded controller · Modbus TCP / BACnet IP upstream
LEAK CONTAINMENT: Drip pan + cable sensor · Emergency drain valve (N.O.)

Purpose: The CDU is the boundary between the clean, deionized IT water loop and the facility's chilled water loop. It ensures the two never mix — contamination of the IT loop with facility water (which contains corrosion inhibitors, biocides) would damage cold plates and clog micro-channels.

Scale: A typical row-level CDU serves 4–8 racks (200–600 kW). Rack-level CDUs serve a single rack (~40–120 kW). Large deployments use centralized CDUs serving entire rows or pods.

Redundancy: Dual pumps (lead/standby), dual facility-side connections, and N+1 CDU sparing per row. Automatic failover on pump fault or low-flow alarm. Typical response time: <5 seconds.
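CDU pump sizing follows from the heat balance Q = ṁ · c_p · ΔT. The sketch below assumes plain water properties and that all of the rack heat ends up in the liquid loop, which makes it an upper bound because cold plates capture only part of the server heat. The function names are illustrative.

```python
WATER_CP_J_PER_KG_K = 4186    # specific heat of water
WATER_DENSITY_KG_PER_L = 1.0  # approximation; warm water is slightly less dense
LITERS_PER_GALLON = 3.785

def coolant_flow_lpm(heat_kw, delta_t_c):
    """Flow (liters/minute) needed to carry heat_kw at a given temperature rise."""
    mass_flow_kg_s = heat_kw * 1000 / (WATER_CP_J_PER_KG_K * delta_t_c)
    return mass_flow_kg_s / WATER_DENSITY_KG_PER_L * 60

def lpm_to_gpm(lpm):
    """Convert liters/minute to US gallons/minute."""
    return lpm / LITERS_PER_GALLON

# One 120 kW rack with a 10 °C rise across the cold plates.
flow = coolant_flow_lpm(heat_kw=120, delta_t_c=10)
print(f"{flow:.0f} L/min ({lpm_to_gpm(flow):.0f} GPM)")
```

Halving the temperature rise doubles the required flow, which is why supply and return temperatures are fixed early in a design.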

COLD PLATE // GPU THERMAL INTERFACE

Where heat leaves the silicon

COLD PLATE — CROSS SECTION (top to bottom)

QD fittings (quick-disconnect): IN (cool supply) and OUT (warm return) · dry-break, non-drip, 10k+ cycles
COPPER COLD PLATE BODY: MICRO-CHANNELS (0.2–0.5mm width) · Coolant flows through channels → absorbs heat from copper · spring-loaded mounting bolt
THERMAL INTERFACE MATERIAL (TIM): Indium foil or high-performance paste · >5 W/m·K
INTEGRATED HEAT SPREADER (IHS)
GPU DIE: ~800 mm² · 700–1000W TDP
PCB SUBSTRATE

THERMAL STACK TEMPERATURES

GPU junction (Tj): ~83°C max

IHS surface: ~75°C

TIM interface: ~70°C

Cold plate base: ~55°C

Coolant inlet: 40–45°C / outlet: 50–55°C

Micro-channels maximize surface area. Copper fins as thin as 0.2 mm create hundreds of parallel flow paths. Turbulent flow at the channel level dramatically increases heat transfer coefficient vs. smooth bore.

TIM (Thermal Interface Material) fills microscopic air gaps between the IHS and cold plate. Indium foil or high-performance paste with >5 W/m·K conductivity. Poor TIM application is the #1 cause of GPU thermal throttling.

Quick-disconnects (QD) allow hot-swap of servers without draining the loop. Non-drip / dry-break QDs are mandatory in IT environments. Rated for 10,000+ connect/disconnect cycles.
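The thermal stack above can be summarized as a single junction-to-coolant thermal resistance. A minimal sketch, assuming a lumped steady-state model (Tj = T_inlet + P · R) and using the ~83°C junction limit, 45°C inlet, and 1000 W figures quoted in this section; the lumped model itself is a simplification introduced here:

```python
def junction_temp_c(coolant_inlet_c, power_w, r_junction_to_coolant):
    """Steady-state junction temperature for a lumped thermal resistance (K/W)."""
    return coolant_inlet_c + power_w * r_junction_to_coolant

def max_resistance(tj_max_c, coolant_inlet_c, power_w):
    """Largest junction-to-coolant resistance that keeps Tj at or below tj_max_c."""
    return (tj_max_c - coolant_inlet_c) / power_w

r_budget = max_resistance(tj_max_c=83, coolant_inlet_c=45, power_w=1000)
print(f"resistance budget: {r_budget:.3f} K/W")
print(f"Tj with a 40 C inlet: {junction_temp_c(40, 1000, r_budget):.0f} C")
```

The budget is only a few hundredths of a kelvin per watt, which is why a poorly applied TIM layer shows up immediately as throttling.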

COOLING TOWER // HEAT REJECTION

Rejecting heat to atmosphere

COUNTERFLOW COOLING TOWER — CROSS SECTION

Top to bottom:

EXHAUST: Warm moist air exhaust to atmosphere ↑
FAN: axial · 6–30 ft ⌀
MOTOR + VFD: 25–100 HP · gear-reduced drive
DRIFT ELIMINATORS: chevron baffles prevent water droplet carryover into exhaust
HOT WATER DISTRIBUTION DECK: spray nozzles distribute hot water evenly over fill
FILL MEDIA: PVC / polypropylene corrugated sheets · maximizes water-to-air contact surface · Water trickles down · Air flows across · Heat transfers via evaporation
LOUVERS (both sides): air inlets
COLD WATER BASIN: Strainer · Float valve · Bleed-off

Hot CW in: 85–95°F from chiller
Cool CW out: 75–85°F to chiller
Makeup water line: replaces evap + blowdown loss

Evaporative cooling exploits the latent heat of vaporization — water evaporating from the fill media absorbs ~1,000 BTU/lb, cooling the remaining water. Approach temp (basin temp minus wet-bulb) of 5–7°F is typical. Lower approach = larger, more expensive tower.

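Water consumption: evaporation alone takes ~0.4 gal/kWh of heat rejected (3,412 BTU/kWh ÷ ~1,000 BTU/lb ÷ 8.34 lb/gal), and blowdown adds to that. A 100 MW AI facility can consume 300,000+ gallons/day, approaching 1 million gallons/day if all of its heat is rejected evaporatively. Water treatment (biocide, scale inhibitor, pH control) is critical to prevent Legionella, scaling, and corrosion.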

Fan control: VFD modulates fan speed to maintain condenser water return temp setpoint. Multiple cells staged on/off as load changes. Fan energy is 2–5% of total cooling plant energy.
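The ~1,000 BTU/lb latent heat quoted above sets a ceiling on evaporative water use. A rough sketch, assuming every kWh of heat leaves the tower by evaporation (no economizer hours, no sensible cooling) and ignoring blowdown:

```python
BTU_PER_KWH = 3412
LATENT_HEAT_BTU_PER_LB = 1000  # approximate latent heat of vaporization of water
LB_PER_GALLON = 8.34

def evaporation_gal_per_day(heat_mw):
    """Gallons/day evaporated if all heat_mw is rejected by evaporation."""
    kwh_per_day = heat_mw * 1000 * 24
    lb_per_day = kwh_per_day * BTU_PER_KWH / LATENT_HEAT_BTU_PER_LB
    return lb_per_day / LB_PER_GALLON

print(f"{evaporation_gal_per_day(100):,.0f} gal/day at 100 MW of rejected heat")
```

Actual consumption depends on climate, economizer hours, and cycles of concentration, so measured figures can sit well below this ceiling.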

HEAT EXCHANGERS // TYPES IN AI DCs

Moving heat between loops

PLATE & FRAME HEAT EXCHANGER

(most common in CDUs and free-cooling economizers)

Hot side: Hot fluid IN → Cool fluid OUT
Cold side (opposite direction): Cold fluid IN → Warm fluid OUT

Corrugated stainless plates with EPDM gaskets

Counter-flow pattern maximizes ΔT & effectiveness (ε > 0.90)

SHELL & TUBE (CHILLER CONDENSER)

Shell: refrigerant (condenses)

Tubes: condenser water (absorbs heat) · CW in → CW out

Fouling factor maintenance: tube brushing or automatic ball cleaning system

Plate & frame are compact, high-effectiveness (ε > 0.90) and easily expandable — add plates to increase capacity. Used in CDUs, economizer bypass, and free-cooling heat exchangers. Brazed variants (BPHE) are smaller but not field-serviceable.
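Effectiveness (ε) is the ratio of heat actually transferred to the maximum that the inlet temperatures would allow. A minimal sketch using the standard definition; the capacity rates and temperatures in the example are hypothetical values chosen only to illustrate the calculation.

```python
def effectiveness(c_hot, c_cold, t_hot_in, t_hot_out, t_cold_in):
    """Heat exchanger effectiveness: actual heat transfer / maximum possible.

    c_hot and c_cold are capacity rates (mass flow x specific heat, in W/K).
    """
    q_actual = c_hot * (t_hot_in - t_hot_out)
    q_max = min(c_hot, c_cold) * (t_hot_in - t_cold_in)
    return q_actual / q_max

# Hypothetical balanced exchanger: hot side cools from 55 to 41 °C
# against a cold inlet of 40 °C.
eps = effectiveness(12_000, 12_000, t_hot_in=55.0, t_hot_out=41.0, t_cold_in=40.0)
print(f"effectiveness = {eps:.2f}")
```

An effectiveness above 0.90 means the hot stream leaves within a degree or two of the cold inlet temperature, which is what allows a CDU to run with a small approach.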

Shell & tube are used inside chillers as the condenser and evaporator. Refrigerant flows shell-side (changes phase); water flows tube-side. Fouling reduces efficiency — condenser approach temp rises 1°F per year without cleaning, costing ~2% chiller efficiency per degree.

PUMPS // MOVING THE COOLANT

Pump types in cooling plants

CENTRIFUGAL PUMP — CUTAWAY VIEW

SUCTION: axial inlet
IMPELLER: cast iron or bronze
VOLUTE (spiral casing)
DISCHARGE: tangential · high pressure
SHAFT
MECH SEAL: prevents leaks · rotating + stationary face
MOTOR: 10–200 HP · TEFC · 1800 RPM
VFD: speed control · Power ∝ speed³

PUMP TYPES IN AI DC COOLING

CHW PUMPS: Primary: constant flow through chillers · Secondary: VFD, modulates to demand · 50–300 HP · 1000–4000 GPM

CW PUMPS: Chiller condenser ↔ cooling tower · Constant or VFD speed · Must match chiller GPM/ton spec

CDU PUMPS: Mag-drive or canned-motor · No shaft seal (leak-free) · 5–30 GPM · redundant pair per CDU

AFFINITY LAWS — THE KEY EFFICIENCY PRINCIPLE

Flow ∝ Speed (linear)

Head ∝ Speed² (squared)

Power ∝ Speed³ (cubed)

80% speed = ~51% power. VFDs are the single biggest efficiency lever in a cooling plant.
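The affinity laws are easy to tabulate. The sketch below applies them in their ideal form, which ignores static head and motor or drive losses, so real savings are somewhat smaller.

```python
def affinity(speed_fraction):
    """Return (flow, head, power) as fractions of their full-speed values."""
    return speed_fraction, speed_fraction ** 2, speed_fraction ** 3

for pct in (100, 80, 60, 50):
    flow, head, power = affinity(pct / 100)
    print(f"{pct}% speed: flow {flow:.0%}, head {head:.0%}, power {power:.1%}")
```

Running two pumps at half speed instead of one at full speed is a common application of the cube law.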

FANS // AIRFLOW IN COOLING SYSTEMS

Moving air through towers & AHUs

FAN TYPES IN DATA CENTER COOLING

AXIAL FAN (cooling towers, condensers): High volume, low pressure · 6–30 ft diameter · gear-reduced motor

CENTRIFUGAL FAN (CRAHs, AHUs): Medium volume, higher pressure · Scroll housing · plenum or housed

EC FANS (Electronically Commutated): Used in server chassis, rear-door heat exchangers, small AHUs · Brushless DC motor + integrated speed control · 90%+ efficiency · PWM driven

Cooling tower fans — large axial fans (6–30 ft diameter), typically driven by a gear-reduced motor. VFDs modulate speed to maintain condenser water temperature. Fan energy follows the same affinity laws as pumps: power ∝ speed³.

CRAH/AHU fans — centrifugal (plenum or housed) fans move air across chilled water coils. EC fan arrays (fan walls) are replacing single large fans — they provide N+1 redundancy, lower noise, and better efficiency at partial loads.

Server fans — counter-rotating 40–80 mm fans inside each server or tray. Speed controlled by BMC based on inlet/exhaust temp and component thermal sensors. In liquid-cooled systems, server fans handle only the 20–40% of heat not captured by cold plates (VRMs, DIMMs, NVMe drives).

Rear-door heat exchangers (RDHx) — a hybrid approach: chilled water coils with EC fans mounted on the rack rear door. Captures exhaust heat before it enters the room. Can handle 30–50 kW/rack without liquid touching servers. Popular as a retrofit for air-cooled facilities adding GPU density.

CHILLER PLANT // COMPLETE PIPING SCHEMATIC

How it all connects — primary/secondary CHW + condenser loop

CHILLER PLANT — PROCESS FLOW

Read top → bottom to follow the heat rejection path

Legend: CW supply (cool) · CW return (warm) · CHW supply (cold) · IT loop

HEAT REJECTION: COOLING TOWERS

Cooling Towers (3 cells, N+1): Evaporative heat rejection to atmosphere · Fan-assisted counterflow · VFD speed control
CW IN: 85–95°F (warm return) → CW OUT: 75–85°F (cooled supply)

▼ CW supply flows down to pumps

CONDENSER WATER PUMPS

CW Pump 1 (Lead): Constant speed · 50–200 HP
CW Pump 2 (Lag / standby): Constant speed · 50–200 HP

▼ CW supply → chiller condenser

CHILLERS: WHERE CW LOOP MEETS CHW LOOP

Chiller 1 (500 ton, R-134a) and Chiller 2 (500 ton, R-1234ze), each with:
CONDENSER: CW absorbs heat · shell & tube
EVAPORATOR: CHW is produced · shell & tube
Compressor · refrigerant cycle

↑ CW return (warm): 85–95°F back to towers
↓ CHW supply (cold): 42–50°F to pumps below

▼ CHW supply flows down to primary pumps

PRIMARY CHW PUMPS

Primary 1 and Primary 2: Constant flow through evaporator · Matches chiller min flow requirement

▼

BYPASS / DECOUPLER: Excess primary flow recirculates · isolates primary from secondary

▼

SECONDARY CHW PUMPS (VFD)

Secondary 1 (VFD) and Secondary 2 (VFD): Variable speed · matches IT load · ΔP sensor at most-remote CDU

▼ CHW supply → CDU facility side (42–50°F)

CDUs → IT RACKS (FACILITY/IT BOUNDARY)

CDU-1, CDU-2, CDU-3 (each): Plate HX · IT pump pair · Sensors · 4–8 racks · 200–600 kW

IT supply: 40–45°C to cold plates

IT return: 50–55°C back to CDU

▲ RETURN PATH

Warm CHW (55–65°F) returns from the CDUs to the chiller evaporator, where it is cooled again and the cycle repeats.

FREE COOLING / ECONOMIZER MODE

When the outdoor wet-bulb temperature is far enough below the CHW supply setpoint to cover the tower and heat-exchanger approach temperatures, tower water routes through a plate & frame HX directly to the CHW loop, bypassing chillers entirely. Zero compressor energy. In warm-water systems (GPU supply >40°C), free cooling operates 8,000+ hours/year in northern climates, making it the largest single contributor to PUE < 1.10.
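The economizer decision can be sketched as a temperature check. Tower water cannot get colder than the wet-bulb temperature plus the tower approach, and the plate heat exchanger adds its own approach on top. The 6°F tower approach sits inside the 5–7°F range quoted earlier; the 2°F heat-exchanger approach and the function name are assumptions for illustration.

```python
def can_free_cool(wet_bulb_f, chw_setpoint_f, tower_approach_f=6.0, hx_approach_f=2.0):
    """True when tower water alone can meet the CHW supply setpoint.

    tower_approach_f: tower leaving-water temperature minus wet-bulb.
    hx_approach_f: assumed plate heat exchanger approach (illustrative value).
    """
    achievable_chw_f = wet_bulb_f + tower_approach_f + hx_approach_f
    return achievable_chw_f <= chw_setpoint_f

print(can_free_cool(wet_bulb_f=35, chw_setpoint_f=45))  # cold day
print(can_free_cool(wet_bulb_f=44, chw_setpoint_f=45))  # wet-bulb just under setpoint
```

Raising the CHW setpoint widens the range of wet-bulb temperatures that pass this check, which is the mechanism behind the extra economizer hours.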

DESIGN TEMPERATURES AT EACH STAGE

Stage	Supply	Return
Tower Water	75–85°F	85–95°F
Chilled Water	42–50°F	55–65°F
CDU Facility	42–50°F	55–65°F
CDU IT Side	40–45°C	50–55°C
GPU Cold Plate	40–45°C	50–55°C (Tj ~83°C)

PUE // POWER USAGE EFFECTIVENESS

Total Facility Power ÷ IT Equipment Power

Perfect (impossible): 1.00

Hyperscale target (liquid): 1.03–1.10

Good modern DC: 1.20–1.40

Older / inefficient: 1.50–2.00

Optimization levers

• Raise chilled water supply temp → more economizer hours
• Liquid cooling → eliminates fan energy, enables free cooling
• Variable speed drives on pumps and fans
• Hot/cold aisle containment
• Efficient UPS (ECO mode, lithium-ion batteries)
• Higher voltage distribution (415V vs 208V) reduces I²R losses

[src] A 0.1 improvement in PUE at a facility with 100 MW of IT load saves ~10 MW of cooling/overhead power, roughly $6M/year at an industrial electricity rate of about $0.07/kWh.
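The savings estimate is straightforward to reproduce. This sketch assumes the 100 MW figure refers to IT load and uses an assumed rate of $0.07/kWh, which is consistent with the roughly $6M/year quoted above.

```python
def pue_savings(it_load_mw, pue_before, pue_after, usd_per_kwh=0.07):
    """Overhead power (MW) and annual cost (USD) saved by a PUE improvement."""
    saved_mw = it_load_mw * (pue_before - pue_after)
    saved_usd = saved_mw * 1000 * 8760 * usd_per_kwh
    return saved_mw, saved_usd

saved_mw, saved_usd = pue_savings(100, 1.30, 1.20)
print(f"{saved_mw:.1f} MW saved, ${saved_usd / 1e6:.1f}M per year")
```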

COOLING // DESIGN CHALLENGES

Why liquid cooling at scale is hard

Moving from air to liquid sounds straightforward — but at AI-scale densities, it introduces a set of engineering challenges that the data center industry is still actively solving. These are the real-world problems that make or break a liquid-cooled deployment.

Leak Risk in IT Spaces

Water inside a server room is inherently risky. A single fitting failure can destroy hundreds of thousands of dollars of GPU hardware in minutes. Every connection point — quick-disconnects, manifold joints, CDU internals — is a potential leak site. Mitigation: dry-break QD fittings, leak detection cables under every pipe run, drip pans beneath CDUs, automatic isolation valves that shut down a loop segment within seconds of detection. Despite this, many operators still consider liquid cooling "high anxiety" compared to air.

Serviceability & Hot-Swap

Air-cooled servers slide in and out of racks freely. Liquid-cooled servers are tethered to plumbing. Replacing a GPU node means disconnecting fluid lines, managing residual coolant, and reconnecting without introducing air bubbles. Blind-mate connectors (auto-connecting on rack insertion) help, but add cost and complexity. Service technicians need new training — HVAC pipe-fitting skills meet IT operations. Mean-time-to-repair (MTTR) is longer unless the facility is designed with maintenance access and drain points from day one.

Hybrid Cooling — Liquid + Air

Cold plates only capture 60–80% of server heat (GPUs, CPUs). The remaining 20–40% (VRMs, DIMMs, NVMe drives, NICs) still radiates into the room as hot air. You can't eliminate CRAHs or room-level cooling entirely — you need a parallel air system for the "residual" heat. Designing the interaction between these two systems (liquid capturing the bulk, air handling the remainder) requires careful airflow modeling and control coordination. Over-cooling with air wastes energy; under-cooling risks component damage on non-liquid-cooled parts.

Water Quality & Corrosion

The IT loop requires ultra-pure deionized water (<1 μS/cm conductivity) to prevent galvanic corrosion and micro-channel fouling. But DI water is aggressive — it leaches metal ions from fittings, especially if mixed metals (copper cold plates + aluminum manifolds) are present. Ongoing water chemistry monitoring (conductivity, pH, dissolved O₂, particulate count) is mandatory. Glycol-based coolants resist corrosion but reduce heat transfer capacity by 10–15% and complicate leak cleanup. There is no perfect fluid — every choice is a trade-off.

Standardization Gap

Unlike air cooling (standardized 19" racks, ASHRAE guidelines, universal CRAH compatibility), liquid cooling lacks industry-wide standards. Every server vendor has different cold plate designs, manifold connectors, flow requirements, and CDU interfaces. OCP (Open Compute Project) is working on standardized liquid cooling specifications, but adoption is still early. This makes multi-vendor deployments painful — a Dell cold plate won't mate with an HPE manifold. Facilities must commit to a vendor ecosystem or invest heavily in adapters and custom plumbing.

Retrofit vs. Greenfield

Existing air-cooled facilities weren't designed for liquid cooling. Retrofitting requires: raised-floor penetrations or overhead pipe routing, structural reinforcement (water-filled pipes are heavy), new chiller/tower capacity, CDU floor space, and leak containment infrastructure. Floor loading jumps from ~150 lbs/ft² (air-cooled) to 250+ lbs/ft² (liquid-cooled with dense GPU racks). Many buildings simply can't support it without structural modifications. Greenfield builds can design for liquid from the start — but the industry is converting existing facilities faster than it can build new ones.

COOLING // IMMERSION (SUBMERSION) COOLING

Submerging servers in dielectric fluid

Immersion cooling eliminates air entirely — servers are submerged in a tank filled with electrically non-conductive (dielectric) fluid. The fluid makes direct contact with every component on the board, removing heat from GPUs, CPUs, VRMs, DIMMs, and NVMe drives simultaneously. No cold plates, no fans, no CRAHs. Two variants exist: single-phase (fluid stays liquid) and two-phase (fluid boils at the chip surface).

SINGLE-PHASE IMMERSION

SINGLE-PHASE IMMERSION TANK — CROSS SECTION

HEAT EXCHANGER (in-tank or external): Plate & frame or coil · connected to facility CHW loop (CHW in / CHW out)
DIELECTRIC FLUID: mineral oil or synthetic hydrocarbon
SERVERS 1–4 (each: GPU · CPU · DIMM · VRM): mounted vertically on rails, slide up for service
CONVECTION: Hot fluid rises · Cool fluid sinks
CIRCULATION PUMP: forced convection through HX
FILTRATION: particulate removal

How it works: Servers are mounted vertically on rails inside a sealed tank. The dielectric fluid (typically mineral oil at $2–5/liter, or engineered synthetic hydrocarbons at $10–30/liter) circulates via natural convection and/or a pump. Hot fluid rises to the top where an in-tank or external heat exchanger transfers heat to the facility chilled water loop. Fluid temperature is maintained at 35–45°C — the fluid never changes phase.

Heat transfer: Single-phase immersion achieves heat transfer coefficients of 50–200 W/m²·K — better than air (5–25 W/m²·K) but far below two-phase (>10,000 W/m²·K). The advantage over air is total surface contact: every component is cooled simultaneously, eliminating hot spots from poorly positioned fans.

Key vendors: GRC (Green Revolution Cooling) — largest deployed base, oil-based; Submer — synthetic fluid SmartCoolant; Asetek — hybrid immersion + cold plate; LiquidCool Solutions — chassis-level sealed enclosures.

ADVANTAGES

✓ Eliminates all server fans — 10–15% IT power savings
✓ No CRAHs or raised floor required
✓ Every component cooled equally — no hot spots
✓ Operates at higher fluid temps → more free-cooling hours
✓ PUE of 1.02–1.05 achievable
✓ Quieter than air-cooled — no fan noise
✓ Mineral oil is cheap and widely available
✓ No leak risk to room — fluid stays in sealed tank

CHALLENGES

✗ Messy serviceability — servers drip when removed
✗ Increased MTTR — draining & handling required
✗ Material compatibility: some connectors, labels, thermal pads dissolve
✗ Weight: filled tanks reach 1,500–3,000 lbs — structural reinforcement needed
✗ Fluid monitoring (viscosity, particulates, moisture) adds operational overhead
✗ Server OEM warranty may be voided — most OEMs don't certify for immersion
✗ Fire code compliance: mineral oil is combustible (Class IIIB)
✗ Limited vendor ecosystem vs. cold-plate solutions

TWO-PHASE IMMERSION COOLING

TWO-PHASE IMMERSION TANK — CROSS SECTION

CONDENSER COIL (in vapor space)

Vapor contacts cold coil → condenses back to liquid → drips down

Facility CW in (cool)

Facility CW out (warm)

VAPOR SPACE

↑

Dielectric vapor rises from boiling surfaces

BOILING DIELECTRIC FLUID (e.g., 3M Novec 7100, boiling point ~61°C)

SERVERS 1–3 (each: GPU boiling, CPU boiling, VRM / DIMM)

Fluid boils on hottest surfaces (GPU die) — bubbles carry heat upward as vapor

THE PHYSICS: WHY PHASE CHANGE IS SO POWERFUL

When a liquid boils, it absorbs the latent heat of vaporization — the energy required to break intermolecular bonds. For engineered dielectric fluids, this is typically 80–120 kJ/kg. This is in addition to the sensible heat the liquid absorbs as its temperature rises. The result: 10–100× higher heat transfer coefficients vs. single-phase convection. A boiling surface can reject >20 W/cm² with only a few degrees of superheat above the fluid's boiling point. For comparison, forced-air convection maxes out at ~0.5 W/cm².

Air (forced)

5–25 W/m²·K

Single-phase liquid

200–5,000 W/m²·K

Two-phase (boiling)

10,000–100,000 W/m²·K
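To make these coefficients concrete, Newton's law of cooling (q = h·A·ΔT) gives the surface-to-fluid temperature difference needed to move a given load. The sketch below is illustrative only: the 1,000 W load and 100 cm² of bare wetted area are assumptions, and each coefficient is simply a value picked from within the ranges above.

```python
def delta_t_required(power_w: float, h_w_m2k: float, area_m2: float) -> float:
    """Newton's law of cooling solved for the temperature difference: q = h * A * dT."""
    return power_w / (h_w_m2k * area_m2)


# Illustrative assumptions, not datasheet values.
POWER_W = 1000.0   # heat load of one hypothetical device
AREA_M2 = 0.01     # 100 cm^2 of bare wetted surface, no fins

# One representative coefficient per cooling mode, in W/m^2.K.
coefficients = {
    "air (forced)": 25.0,
    "single-phase liquid": 2000.0,
    "two-phase (boiling)": 20000.0,
}

for mode, h in coefficients.items():
    dt = delta_t_required(POWER_W, h, AREA_M2)
    print(f"{mode}: needs a surface-to-fluid difference of about {dt:.0f} K")
```

Under these assumptions air would need an impossible temperature difference across a bare surface, which is why air cooling depends on finned heat sinks that multiply the area, while a boiling surface needs only a few kelvin.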

How it works: Servers are submerged in a low-boiling-point dielectric fluid. The fluid boils at the surface of hot components (GPU die at ~80°C causes vigorous nucleate boiling in a fluid with a 49–61°C boiling point). The generated vapor rises into a vapor space above the liquid level where it contacts a condenser coil cooled by facility water. Vapor condenses back to liquid and drips into the tank — a self-sustaining cycle with no pumps in the primary loop.

Dielectric fluids: The most common are fluorocarbon-based engineered fluids — 3M™ Novec™ 7100 (bp 61°C), Novec 649 (bp 49°C), and Solvay Galden HT fluids. These are non-conductive, non-flammable, non-toxic, and chemically inert. However, many contain PFAS (per- and polyfluoroalkyl substances) and face increasing regulatory scrutiny in the EU and US. 3M exited PFAS manufacturing entirely by end of 2025, creating supply uncertainty. Non-PFAS alternatives (hydrofluoroolefins, synthetic esters) are emerging but not yet proven at scale.

Fluid cost: Engineered fluorocarbon fluids cost $50–150/liter. A typical 48U immersion tank holds 200–500 liters. Fluid loss from vapor escape (even with condensers and freeboard) runs 2–5% annually. At scale, this is a significant operating expense compared to single-phase mineral oil ($2–5/liter) or direct-to-chip water (near-zero fluid cost).
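The fluid-cost figures above can be turned into a per-tank estimate with simple arithmetic. This sketch uses only the ranges quoted in the paragraph (price per liter, tank volume, annual loss rate); the function names are illustrative, and real quotes will vary by vendor and fluid.

```python
def fill_cost(volume_l: float, price_per_l: float) -> float:
    """One-time cost to fill a tank."""
    return volume_l * price_per_l


def annual_loss_cost(volume_l: float, price_per_l: float, loss_fraction: float) -> float:
    """Yearly top-up cost for fluid lost as escaped vapor."""
    return volume_l * loss_fraction * price_per_l


# Low and high ends of the two-phase ranges quoted above.
low_fill = fill_cost(200, 50)
high_fill = fill_cost(500, 150)
low_loss = annual_loss_cost(200, 50, 0.02)
high_loss = annual_loss_cost(500, 150, 0.05)

# Single-phase mineral oil at the top of its quoted price range, for comparison.
oil_fill = fill_cost(500, 5)

print(f"Two-phase fill per tank: ${low_fill:,.0f} to ${high_fill:,.0f}")
print(f"Two-phase annual loss:   ${low_loss:,.0f} to ${high_loss:,.0f}")
print(f"Mineral oil fill (500 L at $5/L): ${oil_fill:,.0f}")
```

Multiplying the per-tank numbers by the tank count of a full hall is what turns fluid price from a line item into a design constraint.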

Key vendors: LiquidCool Solutions — sealed chassis-level two-phase; ZutaCore — two-phase HyperCool technology, applied at the cold plate rather than in an open bath (deployed by Equinix, Aligned); GRC — primarily single-phase but exploring two-phase; Iceotope — precision immersion with chassis-level sealed units.

ADVANTAGES

✓ Highest heat transfer of any cooling method — handles >1,000W TDP
✓ No pumps in primary IT loop — self-circulating phase change
✓ Uniform chip temperature regardless of load (boiling is isothermal)
✓ Can handle 200+ kW/rack densities
✓ Near-silent operation — no fans, no pump vibration
✓ PUE of 1.01–1.03 theoretically achievable
✓ Fluid is non-flammable (unlike mineral oil)

CHALLENGES

✗ Extremely high fluid cost ($50–150/liter × 200–500L per tank)
✗ PFAS regulatory risk — EU REACH restrictions, 3M exit
✗ Vapor management — tank must be sealed; fugitive emissions are GWP concern
✗ Material compatibility — aggressive solvents attack some elastomers, labels, TIMs
✗ Limited production-scale deployments (still early-adopter stage)
✗ Service complexity — servers must drain before removal
✗ Fluid replenishment logistics and environmental disposal
✗ Condenser sizing critical — undersized = vapor loss + capacity limit

COOLING TECHNOLOGY COMPARISON

Attribute	Air Cooling	Direct-to-Chip (Cold Plate)	Single-Phase Immersion	Two-Phase Immersion
Max rack density	15–25 kW	120–200 kW	100–200 kW	200+ kW
Heat transfer coeff.	5–25 W/m²·K	5,000–10,000 W/m²·K	50–200 W/m²·K	10,000–100,000 W/m²·K
PUE achievable	1.3–1.6	1.03–1.15	1.02–1.05	1.01–1.03
Fluid cost	N/A (air)	~$0 (water)	$2–30/L	$50–150/L
Residual heat path	All via air	20–40% via air	100% via fluid	100% via fluid
Serviceability	Excellent	Moderate (QD fittings)	Challenging (drip/drain)	Challenging (drain + vapor)
Maturity	Decades (standard)	Production (NVIDIA std)	Early production	Pilot / early adopter
Regulatory risk	None	None	Low (oil fire codes)	High (PFAS)

CONTROLS & MONITORING — IMMERSION-SPECIFIC

Immersion cooling introduces a different set of monitoring requirements compared to cold-plate systems. The fluid itself becomes a critical asset to monitor and maintain.

FLUID HEALTH

● Fluid temperature (bulk & stratified)
● Viscosity (degradation indicator)
● Dielectric breakdown voltage
● Moisture content (ppm)
● Particulate count (cleanliness)
● Acid number (oxidation products)
● Fluid level (leak/loss detection)

THERMAL MONITORING

● Tank inlet / outlet ΔT
● Per-server inlet temp (immersed sensor)
● Condenser coil in/out temps
● Vapor space temperature (two-phase)
● Condenser approach temperature
● Ambient air above tank

SAFETY SYSTEMS

● High-temp shutdown (fluid overheat)
● Low-level alarm (fluid loss)
● Vapor pressure monitoring (two-phase)
● Leak detection under/around tanks
● Emergency drain valve (gravity-fed)
● Fire suppression integration

COOLING // TWO-PHASE DIRECT-TO-CHIP (2P-DTC)

Vaporization at the cold plate — boiling without immersion

This is the “best of both worlds” approach you may have seen: the extreme heat transfer of phase-change boiling, but contained inside a sealed cold plate bolted to the GPU — no immersion bath, no dripping servers. The dielectric fluid boils at the chip surface inside the cold plate. Vapor travels through tubing to a remote condenser where it releases heat to the facility loop and condenses back to liquid, which returns to the cold plate. This is fundamentally different from single-phase direct-to-chip (where water stays liquid and just gets warmer).

2P-DTC LOOP — HOW IT WORKS

PUMPED / THERMOSIPHON TWO-PHASE LOOP

2P COLD PLATE

sealed evaporator on GPU die

Liquid pool / wick

○

nucleate boiling on hot surface

GPU die ~1000W

Liquid boils @ 30–60°C

VAPOR

→

low pressure
insulated line

REMOTE CONDENSER

in CDU or at rack manifold

Vapor → liquid

↓

heat → facility water

CHW in

CHW out

LIQUID

←

gravity or
small pump

↺ Self-circulating loop — no large mechanical pump required (thermosiphon)

or assisted by small low-pressure pump (pumped 2-phase)

THERMOSIPHON (PASSIVE)

No pump in the IT loop

Vapor rises naturally because it's less dense than liquid. Liquid returns by gravity. Requires the condenser to be physically above the cold plate — typically at the top of the rack or in an overhead manifold. Zero pumping energy in the IT loop. Limited by the height differential and pressure drop.

Pros: No moving parts in IT loop, near-silent, ultra-reliable

Cons: Layout constraints, capacity limited by gravity head

PUMPED 2-PHASE (ACTIVE)

Small low-pressure pump assists return

A small magnetically-coupled pump moves liquid back to the evaporator, freeing the design from gravity constraints. The pump only handles liquid return (low flow, low pressure) — the phase change still does the heavy lifting of heat transport. Condenser can be placed anywhere convenient.

Pros: Flexible layout, scales to higher densities, faster startup

Cons: Pump = failure point, requires redundancy (N+1)

THE PHYSICS — WHY VAPORIZATION AT THE CHIP CHANGES EVERYTHING

Isothermal heat removal

A boiling fluid stays at its saturation temperature regardless of how much heat you dump into it. A 700W GPU and a 1,500W GPU on the same loop both see coolant at the fluid's boiling point (say, 55°C), so their die temperatures differ only by the wall superheat and package thermal resistance, not by coolant warm-up along the loop. This shrinks chip-to-chip variation and gives a flatter, more predictable Tjunction across the entire cluster.

Massive ΔH per kg

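Single-phase water carries ~4.2 kJ per kg per °C of temperature rise. A two-phase dielectric carries 80–120 kJ/kg of latent heat on phase change alone, roughly 20–30× what water carries per degree. The flow saving depends on the water-side ΔT: against a water loop running a 10°C rise (~42 kJ/kg), the same heat needs roughly a third to a half of the coolant mass flow, and the advantage grows as the allowable ΔT shrinks. Result: smaller pipes, smaller pumps, less plumbing complexity.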

Higher density possible

Conventional single-phase cold plates start running out of margin at ~1,000W GPUs because the inlet-to-outlet ΔT widens and the chip sees inconsistent cooling. 2P-DTC handles 1,500W+ TDPs with the same supply temperature and flat thermal profile. This makes it a leading candidate for NVIDIA Rubin and beyond.

Higher supply temps OK

Because phase change happens at a fixed temperature (set by fluid choice), 2P-DTC can run with facility water at 35–45°C while still keeping GPU Tj below throttle limits. This means year-round free cooling in almost any climate — no compressors, no chillers, just a dry cooler or cooling tower.
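The "Massive ΔH per kg" point above can be checked with an energy balance: a single-phase loop moves heat as ṁ·cp·ΔT, a two-phase loop as ṁ·h_fg. The sketch below is a rough comparison under stated assumptions: a 1,000 W load, a 10°C water-side rise, a latent heat of 100 kJ/kg (mid-range of the 80–120 kJ/kg quoted), and complete vaporization at the cold-plate exit, which real evaporators usually avoid to keep margin against dryout.

```python
CP_WATER_J_PER_KG_K = 4186.0  # specific heat of liquid water


def water_mass_flow(heat_w: float, delta_t_k: float) -> float:
    """Mass flow (kg/s) for a single-phase water loop: Q = m * cp * dT."""
    return heat_w / (CP_WATER_J_PER_KG_K * delta_t_k)


def two_phase_mass_flow(heat_w: float, latent_heat_j_per_kg: float,
                        exit_quality: float = 1.0) -> float:
    """Mass flow (kg/s) for a boiling loop: Q = m * h_fg * x_exit.

    exit_quality is the vapor mass fraction leaving the cold plate
    (1.0 = fully vaporized, an idealization).
    """
    return heat_w / (latent_heat_j_per_kg * exit_quality)


HEAT_W = 1000.0          # assumed load
WATER_DELTA_T_K = 10.0   # assumed water-side temperature rise
LATENT_J_PER_KG = 100e3  # assumed dielectric latent heat

m_water = water_mass_flow(HEAT_W, WATER_DELTA_T_K)
m_two_phase = two_phase_mass_flow(HEAT_W, LATENT_J_PER_KG)
print(f"water:     {m_water * 1000:.1f} g/s")
print(f"two-phase: {m_two_phase * 1000:.1f} g/s")
print(f"mass-flow ratio: {m_water / m_two_phase:.1f}x")
```

The ratio is very sensitive to the water-side ΔT chosen, so any flow-reduction claim should state the ΔT it was compared against.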

KEY VENDORS — 2P-DTC LANDSCAPE

ZutaCore — HyperCool

Pumped 2-phase using a proprietary dielectric (HC-1). Sealed evaporators (Enhanced Nucleation Evaporators) bolt directly onto GPU/CPU. Deployed by Equinix, Aligned. Among the most commercially mature 2P-DTC offerings.

Accelsius — NeuCool

Pumped 2-phase using a non-PFAS, low-GWP fluorocarbon refrigerant. Spinoff from Nokia Bell Labs technology. Focuses on drop-in cold-plate replacement for existing direct-to-chip racks.

JetCool — SmartPlate

Microconvective single-phase today, with two-phase variants in development. Uses high-velocity impinging jets inside the cold plate to maximize heat transfer coefficient. HPE OEM partnership.

Chilldyne — Cold Plate w/ Negative Pressure

Operates the IT loop at sub-atmospheric pressure so leaks pull air in rather than push fluid out. Pairs naturally with 2-phase designs where vapor pressure management is critical.

2P-DTC VS. OTHER LIQUID COOLING APPROACHES

Attribute	Single-Phase DTC (water)	Two-Phase DTC	Two-Phase Immersion
Where boiling occurs	N/A (no phase change)	Inside sealed cold plate	Open bath around servers
Fluid	Treated water or PG mix	Dielectric refrigerant (small volume)	Dielectric refrigerant (large volume)
Fluid volume per rack	5–20 L	2–10 L	200–500 L
Server form factor	Standard rack-mount	Standard rack-mount	Vertical in tank
Serviceability	QD fittings, hot-swap	QD fittings, hot-swap	Drain & lift from tank
Max chip TDP	~1,000–1,200 W	1,500 W+	1,500 W+
Facility water supply temp	25–35°C (warm water)	35–45°C (free cooling)	25–40°C
Residual air cooling needed	Yes (~20–40% of heat)	Yes (~20–30% of heat)	No
Leak consequence	Water on electronics = damage	Dielectric — non-conductive, evaporates	Dielectric — non-conductive
Maturity (2026)	Production standard (NVL72)	Early production / pilot	Pilot / early adopter

CONTROLS & TELEMETRY UNIQUE TO 2P-DTC

Saturation pressure monitoring

The fluid's saturation temperature is a direct function of system pressure. A pressure drift indicates fluid loss, non-condensable gas ingress, or condenser fouling. Pressure transducers on the vapor line are mandatory.

Dryout / CHF protection

If heat flux exceeds the critical heat flux (CHF) of the boiling surface, the cold plate can “dry out” — a vapor film forms between liquid and the hot surface, collapsing heat transfer. Tj will spike in seconds. PLC must trip the GPU before this happens.

Non-condensable gas (NCG) detection

Air leaking into a sub-atmospheric loop creates NCG pockets in the condenser, reducing effective area. Most 2P-DTC systems have an automatic vent or NCG purge cycle, with monitoring for purge events.

Charge / fluid inventory

Unlike water systems, you can't just top off from a city tap. Fluid loss = expensive refrigerant replacement. Continuous mass monitoring (via accumulator level or pressure-temperature correlation) catches slow leaks early.
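The four telemetry concerns above can be expressed as a single evaluation step. This is a minimal sketch, not vendor logic: every threshold, field name, and action string is hypothetical, and in a real system the dryout trip would live in the PLC rather than in Python.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LoopReading:
    vapor_pressure_kpa: float     # vapor-line pressure transducer
    heat_flux_w_cm2: float        # estimated flux at the boiling surface
    ncg_purges_last_hour: int     # count of automatic purge events
    accumulator_level_pct: float  # proxy for fluid charge


# Hypothetical limits chosen only for illustration.
BASELINE_PRESSURE_KPA = 120.0
MAX_PRESSURE_DRIFT_KPA = 10.0
CHF_W_CM2 = 100.0
CHF_TRIP_FRACTION = 0.8
MAX_PURGES_PER_HOUR = 3
MIN_LEVEL_PCT = 40.0


def evaluate(reading: LoopReading) -> List[str]:
    """Return the actions a supervisor might raise for one reading."""
    actions: List[str] = []
    # Dryout protection comes first: trip before reaching critical heat flux.
    if reading.heat_flux_w_cm2 >= CHF_TRIP_FRACTION * CHF_W_CM2:
        actions.append("TRIP_GPU")
    # Pressure drift can mean fluid loss, gas ingress, or condenser fouling.
    if abs(reading.vapor_pressure_kpa - BASELINE_PRESSURE_KPA) > MAX_PRESSURE_DRIFT_KPA:
        actions.append("ALARM_PRESSURE_DRIFT")
    # Frequent purging suggests air is leaking into the loop.
    if reading.ncg_purges_last_hour > MAX_PURGES_PER_HOUR:
        actions.append("ALARM_NCG_INGRESS")
    # A falling accumulator level indicates a slow charge leak.
    if reading.accumulator_level_pct < MIN_LEVEL_PCT:
        actions.append("ALARM_LOW_CHARGE")
    return actions


print(evaluate(LoopReading(121.0, 45.0, 0, 75.0)))
print(evaluate(LoopReading(135.0, 85.0, 5, 30.0)))
```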

WHERE 2P-DTC FITS IN THE ROADMAP

Today, single-phase direct-to-chip (water) is the production standard for GB200 NVL72 and equivalent 120 kW racks. 2P-DTC is the leading contender for the next density jump — 200+ kW racks with 1,500W+ GPUs (NVIDIA Rubin generation). It preserves the serviceability and familiar form factors of single-phase DTC while delivering the heat-flux handling of immersion. The main barriers are fluid cost, supply-chain constraints around non-PFAS refrigerants, and the relative newness of the vendor ecosystem compared to mature water-based cold plates.

COOLING // CUTTING-EDGE RESEARCH

What's next — emerging technologies pushing cooling forward

With GPU TDP heading toward 1,500W+ (NVIDIA Rubin) and rack densities exceeding 200 kW, even current liquid cooling approaches face limits. Here's where the industry and academia are pushing boundaries.

MICROFLUIDIC ON-CHIP COOLING

DARPA ICECool, Georgia Tech, EPFL

Instead of bolting a cold plate on top of the chip, etch micro-channels directly into the silicon die or interposer. Coolant flows microns away from the transistors. DARPA's ICECool program demonstrated >1 kW/cm² heat removal — 10× what conventional cold plates achieve. This eliminates TIM, IHS, and the thermal resistance stack entirely. Challenges: fabrication complexity, integrating fluid connections into chip packaging, reliability over billions of thermal cycles. Timeline: 5–10 years from production GPU integration.

THERMOELECTRIC (PELTIER) COOLING

Phononic, Laird Thermal, Intel Research

Solid-state heat pumps with no moving parts — apply voltage and one side gets cold. Current TECs are inefficient (COP ~0.5–1.5 vs mechanical chiller COP of 5–8), but new bismuth telluride nanostructures and thin-film designs are closing the gap. Use case: spot cooling for the hottest chip regions (hot-spot management) rather than bulk heat rejection. Intel has demonstrated embedded TEC layers that reduce Tj by 15°C at targeted hot spots. Could complement liquid cooling rather than replace it.

NANOFLUIDS & ADVANCED COOLANTS

MIT, Purdue, various national labs

Suspending nanoparticles (copper oxide, aluminum oxide, carbon nanotubes) in base fluids can improve thermal conductivity by 10–40%. Lab results show significant heat transfer improvements at low concentrations (1–5% by volume). Challenges: particle settling over time, abrasion of micro-channels, long-term stability, and cost. No production deployments yet in data centers, but active research funded by ARPA-E and DOE. Dielectric nanofluids could make immersion cooling significantly more effective.

WARM-WATER & WASTE HEAT REUSE

Lenovo Neptune, Meta, Nordic DCs

Running GPU supply water at 45–55°C (instead of the traditional 15–20°C) enables year-round free cooling — no chillers needed at all. The "waste" heat at 55–65°C return is hot enough for district heating. Nordic data centers (Meta Luleå, Google Hamina) already feed waste heat to municipal heating networks. Lenovo's Neptune platform runs at 50°C supply, achieving PUE <1.03. The efficiency gain is dramatic: eliminating chiller compressors removes 30–40% of total cooling plant energy. Challenge: GPU silicon must tolerate higher junction temperatures, and not all workloads perform identically at elevated temps.

REAR-DOOR HEAT EXCHANGERS (RDHx)

CoolIT, Motivair, ZutaCore

A middle-ground approach: chilled water coils with EC fans mounted on the rack rear door. Captures 60–100% of rack exhaust heat before it enters the room. Handles 30–50 kW/rack without liquid inside the server. Popular as a retrofit for air-cooled facilities adding GPU density. No cold plates, no QD fittings, no IT-loop plumbing — just chilled water to the door. Limitation: can't match direct-to-chip efficiency at 100+ kW/rack densities, and doesn't address chip-level hot spots.

THE TRAJECTORY

The industry consensus: direct-to-chip liquid cooling is the default for any new AI data center build. Air cooling is being relegated to edge, enterprise, and legacy workloads. The open questions are whether immersion or cold-plate wins for the densest deployments, whether on-chip microfluidics can reach production in time for the next GPU generation, and whether the industry can standardize fast enough to avoid vendor lock-in. Meanwhile, every 100W increase in GPU TDP makes the case for liquid cooling stronger — and the gap between air's ceiling and GPU demand wider.

FRONTIER // ORBITAL DATA CENTERS

When the data center leaves Earth — cooling, comms & controls in vacuum

Starcloud (formerly Lumen Orbit), Axiom Space, Lonestar Data Holdings, and several Chinese ventures are seriously proposing multi-megawatt compute clusters in low Earth orbit (LEO). The pitch: unlimited solar power, no water, no land, no NIMBY. The reality: the two hardest problems on Earth — power and cooling — get even harder in vacuum, and a third problem (communications) becomes load-bearing. Here's how each of those works without an atmosphere underneath you.

COOLING // RADIATION IS THE ONLY OPTION

Common misconception: “space is cold, so cooling is easy.” The opposite is true. With no air, there is no convection. With nothing in contact, there is no conduction. The only way to reject heat into space is by infrared radiation, and radiation is the weakest of the three heat-transfer modes by orders of magnitude.

THE STEFAN–BOLTZMANN LIMIT

Radiative heat flux: q = ε·σ·(T⁴ − T_space⁴)

Where ε ≈ 0.85–0.92 for good radiator coatings, σ is the Stefan–Boltzmann constant (5.67 × 10⁻⁸ W/m²·K⁴), T is the radiator surface temperature in kelvin, and T_space ≈ 3 K (cosmic background) is effectively zero. The catch: T⁴ means radiator capacity drops fast as you try to operate cooler. An ideal blackbody radiator (ε = 1) at 50°C rejects roughly 620 W/m² per radiating face; at 20°C it rejects only about 420 W/m². Real coatings at ε ≈ 0.9 deliver about 10% less. That's why orbital data center designs always want to run radiators hot.
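The radiator figures above follow directly from the Stefan–Boltzmann relation. The sketch below evaluates it for one radiating face; it ignores absorbed sunlight, Earth infrared, and view-factor losses, all of which reduce real-world performance.

```python
SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant, W/m^2.K^4


def radiated_flux(temp_c: float, emissivity: float = 1.0, t_sink_k: float = 3.0) -> float:
    """Net radiated flux (W/m^2) from one face: q = eps * sigma * (T^4 - T_sink^4)."""
    temp_k = temp_c + 273.15
    return emissivity * SIGMA * (temp_k ** 4 - t_sink_k ** 4)


for temp_c in (20.0, 50.0, 80.0):
    ideal = radiated_flux(temp_c)
    coated = radiated_flux(temp_c, emissivity=0.9)
    print(f"{temp_c:.0f} C: blackbody {ideal:.0f} W/m^2, emissivity 0.9 {coated:.0f} W/m^2")
```

Because flux scales with T⁴, raising the radiator temperature by 30°C buys far more capacity than a 30°C change would in a convective system.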

REAL NUMBERS

● ISS: 14 ammonia radiator panels, ~75 kW total rejection
● ISS radiator area: ~1,560 m²
● Effective rejection: ~48 W/m² (avg over orbit)
● A 5 MW orbital DC would need ~100,000 m² of radiator at the same efficiency
● That's a square ~315 m × 315 m — bigger than 14 football fields
● Solar panels for the same 5 MW: ~25,000 m² (4× smaller than the radiators)
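The sizing in the list above is a straightforward scale-up from the ISS figures. This sketch reproduces that arithmetic using only the numbers quoted in the list; it assumes the same effective W/m² would hold for a purpose-built radiator, which is a conservative simplification.

```python
import math

# Figures quoted in the list above.
ISS_REJECTION_W = 75_000.0
ISS_RADIATOR_AREA_M2 = 1_560.0
TARGET_LOAD_W = 5_000_000.0  # 5 MW orbital data center


def effective_flux(rejection_w: float, area_m2: float) -> float:
    """Orbit-averaged heat rejection per unit radiator area (W/m^2)."""
    return rejection_w / area_m2


def radiator_area_needed(load_w: float, flux_w_m2: float) -> float:
    """Radiator area (m^2) required to reject load_w at the given flux."""
    return load_w / flux_w_m2


flux = effective_flux(ISS_REJECTION_W, ISS_RADIATOR_AREA_M2)
area = radiator_area_needed(TARGET_LOAD_W, flux)
side = math.sqrt(area)

print(f"ISS effective rejection: {flux:.0f} W/m^2")
print(f"Area for 5 MW: {area:,.0f} m^2 (a square about {side:.0f} m on a side)")
```

The unrounded result is about 104,000 m², which the list rounds to ~100,000 m².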

THE INSIGHT

In orbit, radiator area, not solar panel area, is the binding constraint. This flips terrestrial design intuition: on Earth power is expensive and cooling is cheap-ish. In orbit, power from the sun is essentially free, but every watt you generate must eventually be radiated, and radiator mass-to-orbit is the dominant cost driver.

HEAT TRANSPORT // GPU → RADIATOR

ORBITAL DATA CENTER — THERMAL ARCHITECTURE

GPU CHASSIS

Cold plates on dies

Tj ≤ 85°C target

→

INTERNAL FLUID LOOP

Pumped water or PG inside pressurized module

→

HEAT EXCHANGER

Liquid–liquid IFHX

→

EXTERNAL LOOP

Ammonia or 2-phase CO₂

→

DEPLOYABLE RADIATORS

Reject to 3K space

Two loops because the working fluid for radiators (ammonia, NH₃) is toxic to crew / damaging to electronics, but has excellent thermal properties at −30°C to +40°C. Internal water loop stays safe; external ammonia loop does the heavy lifting.

Heat pipes & LHPs

Loop heat pipes (LHPs) move heat passively from chip → radiator using capillary-driven phase change. No pump, no moving parts, indefinite lifetime. The workhorse of spacecraft thermal control.

Two-phase ammonia

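NH₃ boils around −33°C at 1 atm but is operated at higher pressure in spacecraft to boil at 5–40°C. Its latent heat (~1,370 kJ/kg at the normal boiling point) is lower than water's (~2,260 kJ/kg) but more than 10× that of the engineered dielectrics used in terrestrial two-phase cooling (80–120 kJ/kg), and ammonia does not freeze until about −78°C. That combination keeps pump power and tubing mass low without the freeze risk water would carry on an external loop.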

Radiator orientation

Radiators must face deep space, not the sun or Earth. Attitude control system (ACS) constantly slews the spacecraft to keep solar arrays sun-pointed and radiators edge-on to the sun — a coupled optimization problem unique to orbital DCs.

DATA // GETTING BITS UP AND DOWN

An orbital DC is useless if you can't move data to it and results back. There are three communication tiers, each with different bandwidth, latency, and standards.

Link Type	Tech	Bandwidth	Latency	Use Case
Ground ↔ Sat (RF)	Ka / Ku band	10–40 Gbps	5–20 ms (LEO)	Bulk uplink/downlink
Ground ↔ Sat (Optical)	Laser comm	100 Gbps – 1 Tbps	5–10 ms (LEO)	High-volume training data transfer
Sat ↔ Sat (Optical ISL)	Inter-satellite laser	100–200 Gbps	<5 ms	Distributed compute mesh between DCs
TT&C (control)	S-band RF	kbps – Mbps	10–50 ms	Telemetry, commands (separate from payload)

CCSDS — the OPC-UA of space

The Consultative Committee for Space Data Systems defines the standards every space agency and most commercial operators follow. Key protocols: Space Packet Protocol (datagram format), AOS/TM/TC (telemetry & telecommand framing), CFDP (file transfer with delay tolerance), DTN/Bundle Protocol (store-and-forward for intermittent links).

Ground station networks

A LEO satellite sees any one ground station for only ~5–10 minutes per pass. To get continuous downlink you need a global network: AWS Ground Station, Azure Orbital, KSAT, Viasat RTE. Optical-link networks (Starlink-style) provide always-on backhaul by routing through inter-satellite links to a sat currently over a station.

CONTROLS // YOU CAN'T SEND A TECHNICIAN

On Earth, the worst case for a misbehaving server is a remote-hands ticket. In orbit, there are no remote hands. Everything from thermal control to GPU error recovery must be autonomous, with ground operators reduced to high-level commanding and forensic analysis after the fact.

FDIR — Fault Detection, Isolation & Recovery

Hierarchical autonomy: every subsystem (power, thermal, comms, GPU cluster) runs local detection of out-of-limit conditions, isolates the fault, executes a recovery action (switch to redundant unit, safe-mode the affected zone, throttle GPUs to reduce heat), and reports up. Ground intervenes only when on-board recovery exhausts its playbook. Think of it as self-healing infrastructure as a hard requirement, not a nice-to-have.
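The detect, isolate, recover, report sequence described above can be sketched as a small loop over an ordered recovery playbook. This is a toy illustration, not flight software: the `Subsystem` fields, the action names, and the 85°C limit are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Subsystem:
    name: str
    read: Callable[[], float]            # current monitored value
    limit: float                         # out-of-limit threshold
    isolate: Callable[[], None]          # fence off the affected zone
    playbook: List[Callable[[], bool]]   # recovery actions, tried in order


def run_fdir(sub: Subsystem) -> str:
    """Run one FDIR pass and return the status reported upward."""
    if sub.read() <= sub.limit:          # detection
        return "NOMINAL"
    sub.isolate()                        # isolation
    for action in sub.playbook:          # recovery
        if action() and sub.read() <= sub.limit:
            return f"RECOVERED:{action.__name__}"
    return "ESCALATE_TO_GROUND"          # playbook exhausted


# Toy thermal zone whose first recovery action fails.
state = {"temp_c": 95.0, "isolated": False}


def read_temp() -> float:
    return state["temp_c"]


def isolate_zone() -> None:
    state["isolated"] = True


def switch_to_redundant_pump() -> bool:
    return False  # pretend the spare pump is unavailable


def throttle_gpus() -> bool:
    state["temp_c"] = 70.0  # less heat, so the zone cools
    return True


thermal = Subsystem("thermal", read_temp, 85.0, isolate_zone,
                    [switch_to_redundant_pump, throttle_gpus])
print(run_fdir(thermal))
```

Ordering the playbook from least to most disruptive is a design choice: the supervisor tries redundancy before it sacrifices compute throughput.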

Rad-hard flight computer

The control computer (BMS equivalent) runs on radiation-hardened silicon: BAE RAD750, Cobham GR740, or newer commercial-off-the-shelf + ECC + TMR (triple modular redundancy) designs. Flight software typically runs on an RTOS such as VxWorks or RTEMS, often under a framework like NASA cFS (core Flight System). The GPUs themselves are commercial — but every command they receive routes through the rad-hard supervisor.
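Triple modular redundancy, mentioned above, comes down to a majority vote across three copies of the same value. The sketch below shows a bitwise voter on integers; it is a software illustration of the idea, whereas flight hardware usually votes in logic gates.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote: each output bit takes the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


correct = 0b1010
upset = correct ^ 0b0100  # one copy suffers a single bit flip

voted = tmr_vote(correct, correct, upset)
print(f"voted value: {voted:04b}")
```

The vote masks any upset confined to one copy. It also survives flips in two different copies as long as they hit different bit positions, but not two copies flipped at the same bit.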

SEU / SEL handling

Cosmic rays cause single-event upsets (SEUs — bit flips) and single-event latch-ups (SELs — parasitic short circuits that draw high current until power is removed) in commercial silicon. Mitigations: ECC memory everywhere, periodic scrubbing of DRAM, latch-up current sensors that power-cycle affected blocks within microseconds, and software-level checkpoint/restart for training jobs.

Internal buses

Inside the spacecraft, the control plane runs on SpaceWire (200 Mbps) or SpaceFibre (multi-Gbps) for high-speed; MIL-STD-1553 (1 Mbps, dual-redundant) for safety-critical commands; CAN bus for low-rate sensor polling. Payload (the actual GPU cluster) uses standard 400G/800G Ethernet or InfiniBand, just shielded and qualified for vibration.

CRITICAL TELEMETRY POINTS

Thermal

● Cold plate inlet/outlet T
● Radiator panel T (each)
● Heat pipe evaporator T
● Ammonia loop pressure
● GPU Tj (per die)
● Sun/shade angle

Power

● Solar array current/voltage (per string)
● Battery state of charge
● Battery cell voltages
● Bus voltages (28V, 100V HVDC)
● Eclipse prediction (orbit-driven)
● Charge regulator status

Health & environment

● SEU/SEL event counter
● Radiation dose accumulator
● Attitude (quaternion, rates)
● Reaction wheel speeds
● Propellant remaining
● Comm link SNR / BER

WHO'S BUILDING THIS

Starcloud (formerly Lumen Orbit)

Proposing 5 GW orbital data centers powered entirely by solar. Pitched as the only way to scale compute past Earth's power-grid limits. First demonstrator: a small GPU cluster aboard a SpaceX rideshare.

Axiom Space

Building a commercial space station; orbital compute as a planned payload. Has an MOU with several hyperscalers for in-orbit data processing of Earth observation feeds.

Lonestar Data Holdings

Lunar data centers — leverages the Moon's vacuum + regolith shielding + cold permanently-shadowed craters for radiative cooling. Successfully booted data payloads on Intuitive Machines IM-1/IM-2 missions.

Thales Alenia Space / ESA ASCEND

EU-funded feasibility study (ASCEND — Advanced Space Cloud for European Net-zero emission and Data sovereignty) concluded orbital DCs are technically feasible and could be net-positive on CO₂ by 2050 if launch emissions drop.

REALITY CHECK

Orbital DCs solve power and (theoretically) cooling but introduce three first-principles problems: launch cost (still ~$1,500/kg even on Starship — and radiators are heavy), servicing impossibility (no swap-a-bad-DIMM in LEO), and orbital debris exposure (large radiator panels are meteoroid/debris bullseyes). The economics work only if launch drops below ~$200/kg and compute density per kg jumps another 5–10×. For an AI controls engineer this is a far-horizon problem — but the thermal physics and FDIR principles are directly applicable to edge / remote / unmanned terrestrial sites today.

CONTROLS, PROTOCOLS & AUTOMATION

PLC programming, SCADA/Ignition architecture, OPC-UA, Modbus, MQTT/Sparkplug B, BMS control loops, PID tuning, VFD sequencing, chiller staging logic, and safety interlocks are covered in depth in the Controls & Automation module below.

COMMISSIONING // L1–L5

Proving the facility works — before IT load arrives

Commissioning is the systematic verification that every system performs per the design intent. In mission-critical facilities, it follows five levels — each building on the last. Skipping levels means discovering failures under live load, which can cost millions per hour.

L1: Factory Witness Testing — verify equipment at the manufacturer before shipping. Switchgear trip tests, UPS load bank, generator rated-load run, chiller performance curves. Catches defects before they're installed in concrete.

L2: Installation Verification — confirm equipment is installed per design: torque checks on bus connections, pipe pressure tests, insulation resistance (Megger), continuity, valve tag-to-P&ID match, controller I/O point-to-point checkout. Every sensor and actuator confirmed wired to the correct controller point.

L3: Startup & Vendor Testing — the installing contractor or OEM vendor energizes and starts each system individually. Motor rotation check, VFD programming, breaker coordination study verification, UPS static bypass test, chiller startup per OEM procedure, BMS point verification (sensor reads match field measurement within tolerance). Vendor provides startup reports and certifies equipment operational.

L4: Functional Performance Testing (CxA) — after vendor startup in L3, the independent commissioning agent (CxA) verifies each system performs per the sequence of operations (SOO). Tests control sequences, setpoints, alarm responses, failover logic, and edge cases the vendor wouldn't test. CxA writes test scripts, witnesses execution, and documents deficiencies. Every SOO sequence gets a pass/fail.

L5: Integrated Systems Testing (IST) — the big one. Multi-system failure scenarios under simulated or real load (load banks). Utility loss → ATS transfer → generator start → UPS carry-through. Chiller trip → backup chiller staging → temp recovery. BMS alarm propagation. IST scripts prove that mechanical, electrical, and controls work together under stress. Typical: 2–4 weeks for a large facility.

The commissioning agent (CxA) is independent from the installing contractor — they represent the owner's interest and verify the contractor's work. All deficiencies are tracked in an issues log with severity, responsible party, and resolution deadline.

DCIM // THE INTEGRATION LAYER

Single pane of glass — from rack to C-suite

Data sources

• BMS — HVAC status, temps, setpoints, alarms
• EPMS — power at every distribution stage
• CMDB — asset inventory, serial numbers, rack positions
• Network management — switch port mapping, bandwidth
• Rack PDU — per-server power, environmental sensors
• CDU / liquid cooling — flow, temp, pressure, leak status

DCIM capabilities

• Capacity planning — power, cooling, space available per rack/row/hall
• Real-time PUE — computed from EPMS data, trended over time
• What-if modeling — "can I add 20 racks to Hall B?"
• Alarm correlation — root cause across mechanical + electrical
• Change management — track every rack move/add/change
• Compliance — SLA uptime reporting, environmental audit trail

In AI factories, DCIM is evolving toward real-time digital twins and ML-driven optimization — adjusting cooling setpoints and power distribution automatically based on predicted GPU workload patterns.
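The real-time PUE capability listed above reduces to one ratio: total facility power divided by IT power. The sketch below assumes the EPMS exposes those totals as kilowatt readings; the sample values are invented for illustration.

```python
def pue(total_facility_kw: float, it_load_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    if it_load_kw <= 0:
        raise ValueError("IT load must be positive to compute PUE")
    return total_facility_kw / it_load_kw


# Hypothetical meter readings in kW.
it_load_kw = 10_000.0
cooling_kw = 250.0
distribution_losses_kw = 150.0

total_kw = it_load_kw + cooling_kw + distribution_losses_kw
print(f"PUE = {pue(total_kw, it_load_kw):.2f}")
```

A DCIM platform would compute this on a rolling window and trend it, since an instantaneous PUE swings with weather and load.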

FABRIC

Why training is bandwidth-bound

Every training step, gradients computed on each GPU must be summed across the entire cluster — an all-reduce over billions of parameters. If the network can't keep up, GPUs sit idle. That's why hyperscalers build dedicated rail-optimized fabrics with one 800 Gb/s NIC per GPU.
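A back-of-envelope estimate shows why link speed matters for the gradient summation described above. The sketch assumes a ring all-reduce, one common algorithm in which each GPU sends and receives roughly 2·(N−1)/N times the payload. It also assumes pure data parallelism, 2-byte gradients, no overlap with compute, and zero latency; the 70-billion-parameter model and 1,024 GPUs are illustrative choices, not figures from this guide.

```python
def ring_allreduce_seconds(num_gpus: int, payload_bytes: float, link_gbps: float) -> float:
    """Bandwidth-only time for a ring all-reduce.

    Each GPU moves 2 * (N - 1) / N * payload bytes over its own link.
    """
    if num_gpus < 2:
        return 0.0  # nothing to exchange
    bytes_per_second = link_gbps * 1e9 / 8
    return 2 * (num_gpus - 1) / num_gpus * payload_bytes / bytes_per_second


PARAMS = 70e9           # assumed model size
BYTES_PER_GRADIENT = 2  # assumed 16-bit gradients
NUM_GPUS = 1024         # assumed cluster size

payload = PARAMS * BYTES_PER_GRADIENT
for gbps in (100, 400, 800):
    seconds = ring_allreduce_seconds(NUM_GPUS, payload, gbps)
    print(f"{gbps} Gb/s per GPU: {seconds:.1f} s per full gradient sync")
```

Production training stacks shrink this cost with sharding, hierarchical collectives, and compute/communication overlap, but the bandwidth term remains the floor.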

STORAGE

The data pipeline

Trillions of pre-tokenized tokens are sharded across a parallel filesystem (, or vendor-specific). Data loaders stream shards into GPU memory while the previous batch is still being processed — overlap is everything.

LIFE SAFETY

Fire, leak, EPO

VESDA (laser-based air sampling) for earliest smoke detection. Clean-agent suppression (FM-200, Novec 1230) for IT rooms. Pre-action sprinklers (double interlock) for high-value spaces. EPO (Emergency Power Off) kills entire zones — controversial due to nuisance activation risk but code-required in many jurisdictions. Leak detection cables along every pipe route and under raised floors.

INDUSTRIAL CONTROLS & MONITORING

You Can't Manage What You Can't Measure

Why industrial monitoring matters

An AI datacenter isn't just a building with servers — it's an industrial plant running at the thermal and electrical edge of what physics allows. A single 120 kW GPU rack operates at power densities that would be classified as heavy industrial in any other context. When a chiller valve fails at 2 AM and rack inlet temperatures start climbing, you have minutes — not hours — before thermal throttling degrades a multi-million-dollar training run.

This is why every serious datacenter operator invests as heavily in monitoring and control systems as they do in the compute hardware itself. The monitoring infrastructure is the operational infrastructure. Without it, you're flying blind through a thunderstorm.

The control stack in a mission-critical facility is itself a layered architecture — field devices at the bottom, supervisory systems in the middle, and DCIM at the top. Each layer serves a different time scale: PLCs react in milliseconds, BMS loops in seconds, DCIM analytics in minutes to hours.

HIERARCHY // THE CONTROL PYRAMID

From field device to operations center

Level 4/5 — Enterprise Network

IT network: DCIM, enterprise analytics, cloud dashboards, ERP. Ignition Cloud Edition, Cirrus Link cloud injectors (AWS/Azure/GCP), Sepasoft ERP connectors, Omniverse digital twins. Directory services, DNS, mail servers.

Level 3 — Operations Systems

Central Ignition Gateway (on-prem), SQL historian, MES, Sepasoft modules (SPC, batch, track & trace, OEE). Plant-wide operations, scheduling, and reporting.

Level 2 — Process (SCADA / HMI)

Local servers, clients (Perspective/Vision), Ignition Edge gateways (IIoT + Panel), Cirrus Link modules. Live one-line diagrams, alarming, trending, reporting. Operator commands flow down from here.

Level 1 — Control (Intelligent Devices)

PLCs, RTUs, IEDs, power meters (Schneider ION/PM), VFDs, motor starters — any device with onboard logic or a processor. Executing closed-loop control: valve modulation, speed control, generator paralleling, UPS bypass. Communicating via protocols such as Modbus TCP and OPC-UA.

Level 0 — Field Devices (Unintelligent / Physical)

Dumb field instruments with no onboard processing: temperature sensors (RTDs, thermistors), pressure transducers, DP sensors, flow switches, limit switches, contact closures, leak detection cables, smoke detectors, control valves, damper actuators. Output is a raw signal (4–20 mA, 0–10 V, dry contact).

SIM // ALARM VOLUME

How many alarms does a datacenter generate?

Rack count: 200

Alarms per rack per day: 2.5

Alarms / day: 500

Alarms / hour: 20.8

Sensor points: 5,600

Total monitored points: 6,200

Without alarm rationalization, operators drown in noise. Best practice: <1 actionable alarm per operator per 10 minutes (ISA-18.2).
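The arithmetic behind those numbers is worth seeing once. The sketch below is a minimal illustration, assuming every alarm is actionable and is handled by a fixed number of operators; the function name and the `operators` parameter are hypothetical, not part of any alarm-management tool.

```python
def alarm_load(racks, alarms_per_rack_per_day, operators=1):
    """Return (alarms/day, alarms/hour, alarms per operator per 10 minutes)."""
    per_day = racks * alarms_per_rack_per_day
    per_hour = per_day / 24
    # 24 hours x 6 ten-minute windows per hour = 144 windows per day
    per_operator_10min = per_day / (24 * 6) / operators
    return per_day, per_hour, per_operator_10min


per_day, per_hour, per_10min = alarm_load(racks=200, alarms_per_rack_per_day=2.5)
print(f"{per_day:.0f} alarms/day, {per_hour:.1f} alarms/hour")
print(f"{per_10min:.2f} alarms per operator per 10 minutes")
```

With a single operator, the default inputs land well above one alarm per 10 minutes, which is the gap that rationalization (or more operators) has to close.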

PLC // PROGRAMMABLE LOGIC CONTROLLERS

The workhorse of industrial automation

A PLC (programmable logic controller) is an industrial digital computer purpose-built for real-time control. Where a server runs an OS and applications, a PLC runs a deterministic scan cycle: (1) read all inputs, (2) execute the control program, (3) update all outputs, (4) handle communications. Typical scan time: 1–20 ms — fast enough to catch a pump cavitation event before damage occurs.

CPU — executes the control program (ladder logic, structured text, function blocks)

I/O Modules — digital inputs (24 VDC switches), digital outputs (relays, solenoids), analog I/O (4–20 mA, 0–10 V)

Power Supply — converts AC to 24 VDC for the rack and field devices

Backplane — high-speed internal bus connecting CPU to I/O cards

Comms Modules — Ethernet/IP, PROFINET, EtherCAT, serial interfaces

For NVIDIA data centers, expect Beckhoff TwinCAT or Siemens TIA Portal for mechanical controls, and Schneider EcoStruxure for power/BMS.
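The four-step scan cycle can be sketched in a few lines. This is an illustration only: Python on a general-purpose OS is not deterministic the way PLC firmware is, and `scan_cycle`, `pump_program`, and the input names are hypothetical.

```python
import time


def scan_cycle(read_inputs, program, write_outputs, scans, period_s=0.01):
    """Run a fixed number of read -> execute -> write scans at a target period."""
    outputs = {}
    for _ in range(scans):
        started = time.monotonic()
        inputs = read_inputs()               # (1) snapshot all inputs
        outputs = program(inputs, outputs)   # (2) execute the control program
        write_outputs(outputs)               # (3) update all outputs
        # (4) communications would be serviced here in a real controller
        remaining = period_s - (time.monotonic() - started)
        if remaining > 0:
            time.sleep(remaining)            # wait out the rest of the scan period
    return outputs


def pump_program(inputs, previous_outputs):
    """Toy control program: run the pump when level is OK and E-Stop is clear."""
    return {"pump_run": inputs["level_ok"] and not inputs["e_stop"]}


written = []
final = scan_cycle(
    read_inputs=lambda: {"level_ok": True, "e_stop": False},
    program=pump_program,
    write_outputs=written.append,
    scans=3,
    period_s=0.001,
)
print(final, len(written))
```

The key idea is the input snapshot: the program sees one consistent set of inputs per scan, and outputs only change at the end of the scan.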

PROGRAMMING // IEC 61131-3

How PLCs are programmed

PLC programming uses IEC 61131-3 languages. The two most common:

Ladder Logic

Visual language resembling electrical relay circuits. Series contacts = AND, parallel contacts = OR. Core elements: NO/NC contacts, output coils, timers (TON/TOF), counters (CTU/CTD), latch/unlatch.

  Start pump when conditions met:
  ─┤Start_PB├──┤NOT E_Stop├──┤Level_OK├──(Pump_Run)─
       NO          NC            NO         COIL

  Seal-in circuit:
  ─┬─┤Start_PB├─┬─┤NOT E_Stop├──(Pump_Run)─
   └─┤Pump_Run├─┘
        (seal)
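For readers more comfortable with code than rungs, the seal-in circuit reduces to one Boolean expression evaluated every scan. A minimal sketch, with the scan inputs invented for illustration:

```python
def pump_rung(start_pb, e_stop, pump_run):
    """One evaluation of the seal-in rung: (Start_PB OR Pump_Run) AND NOT E_Stop."""
    return (start_pb or pump_run) and not e_stop


# Simulated scans as (Start_PB, E_Stop): idle, press start, release, hit E-Stop, release
scans = [(False, False), (True, False), (False, False), (False, True), (False, False)]
pump_run = False
history = []
for start_pb, e_stop in scans:
    pump_run = pump_rung(start_pb, e_stop, pump_run)
    history.append(pump_run)
print(history)
```

The pump stays on after the start button is released because its own state feeds the parallel branch, and it stays off after E-Stop clears until someone presses start again.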

Structured Text

High-level language similar to Pascal. Preferred for complex math, state machines, and data manipulation.

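(* High_Temp is a BOOL helper; Stage_Timer is a TON instance *)
High_Temp := ChW_Supply > Setpoint + Deadband;
IF High_Temp AND NOT Chiller_1_Running THEN
  Chiller_1_Start := TRUE;
END_IF;
(* Call the timer every scan so it keeps timing once Chiller 1 runs *)
Stage_Timer(IN := High_Temp AND Chiller_1_Running, PT := T#300s);
(* Stage Chiller 2 only if supply stays high for 300 s *)
IF Stage_Timer.Q THEN
  Chiller_2_Start := TRUE;
END_IF;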

SCADA // SUPERVISORY CONTROL

The operator's window into the plant

SCADA (Supervisory Control and Data Acquisition) provides a centralized interface to monitor and control industrial processes. The architecture:

Gateway/Server — central hub communicating with PLCs via OPC-UA or native drivers. Manages tags, stores config, runs scripts.

Clients/HMI — operator screens showing process schematics, alarm lists, trend charts. Web-based or desktop-launched.

Historian — time-series database logging tag values at configurable intervals. Powers trending, reporting, and analytics.

Alarm Pipeline — escalation via email, SMS, voice. Alarm journaling to database for audit and analysis.

Tag Types

OPC Tag — reads/writes directly from a PLC register

Memory Tag — internal to SCADA, no device binding

Expression Tag — computed from other tags via formula

Derived Tag — rolling avg, rate-of-change, aggregates

IGNITION // BY INDUCTIVE AUTOMATION

The platform NVIDIA uses

Ignition is a modern, Java-based SCADA platform with a web gateway model. Key differentiators: unlimited licensing (one server price, unlimited clients and tags), cross-platform (Windows/Linux/macOS), modular architecture, and Python scripting via Jython.

Perspective: HTML5 mobile-responsive HMI, the modern standard

Vision: Classic Java desktop client with rich component library

Historian: Logs tag values to SQL database with deadband and scan rates

Alarming: Escalation pipelines (email, SMS, voice) plus journal to DB

Reporting: Automated PDF/Excel reports on schedule or event

MQTT: Sparkplug B edge-to-cloud data publishing

# Ignition scripting (Jython)
temp = system.tag.readBlocking(
  ["[default]Chiller_1/ChW_Supply_Temp"]
)[0].value

system.tag.writeBlocking(
  ["[default]Chiller_1/Setpoint"], [42.0]
)

Gateway Architecture

Ignition runs as a single gateway service — a web server hosting a designer IDE, client sessions, device connections, and module runtime all from one process. The gateway exposes a web UI on port 8088 (HTTP) or 8043 (HTTPS) for admin, and serves Perspective sessions to any browser. Unlike legacy SCADA, there are no seat licenses: you buy one gateway, attach unlimited screens, clients, and tags.

Gateway Network — multiple Ignition gateways can mesh together across sites. Tag data, alarm states, and historian records flow between gateways automatically. A central gateway can pull data from edge gateways at each building or campus.

Edge Edition — lightweight version for embedded PCs and edge hardware. Runs the OPC-UA server, local historian, and MQTT transmission — syncs upstream via Store & Forward if the WAN link drops.

Redundancy — active/standby gateway pairs with automatic failover. The standby mirrors tags, history, and alarm state in real time. Failover is typically <30 seconds.

Tag Model & UDTs

Tags are Ignition's core abstraction — every data point in the system is a tag. Tags live in a hierarchical folder structure and can be organized by system, building, or equipment.

UDT (User Defined Type) — the real power of Ignition at scale. A UDT is a tag template: define once (e.g., “Chiller” with 40 member tags for temps, pressures, alarms, status), then stamp out instances. Change the template → every instance updates. In a datacenter with 50 identical CRAHs, this turns 2,000 tags into 50 UDT instances with consistent naming and alarm config.
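The "define once, stamp out instances" idea can be sketched outside Ignition. This is plain Python, not the Ignition tag API; the class name, folder path, and member names are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UdtDefinition:
    """A tag template: one list of member names shared by every instance."""
    name: str
    members: tuple

    def instantiate(self, folder, instance_name):
        """Return the full tag paths for one instance of this template."""
        return [f"{folder}/{instance_name}/{member}" for member in self.members]


# Hypothetical 40-member CRAH template (member names are placeholders)
crah = UdtDefinition("CRAH", tuple(f"Point_{i:02d}" for i in range(1, 41)))

tags = [
    path
    for n in range(1, 51)
    for path in crah.instantiate("[default]DataHall_A", f"CRAH-{n:02d}")
]
print(len(tags))
print(tags[0])
```

Adding a member to the template changes every instance on the next pass, which is the property that keeps naming and alarm configuration consistent at scale.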

Tag History — any tag can be historized with a click. Configure scan class (100 ms to 1 hr), deadband (absolute or percentage), and storage destination (SQL, InfluxDB, or Ignition's internal DB). Partitioned tables roll automatically for long-term retention.
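Deadband is what keeps a historian from storing thousands of near-identical samples. The sketch below shows one simple absolute-deadband rule; it is an assumption about the general technique, not Ignition's actual compression algorithm, and the readings are invented.

```python
def deadband_filter(samples, deadband):
    """Keep the first sample, then only samples that moved more than
    `deadband` away from the last stored value (absolute deadband)."""
    stored = []
    for timestamp, value in samples:
        if not stored or abs(value - stored[-1][1]) > deadband:
            stored.append((timestamp, value))
    return stored


# (seconds, chilled water supply temp in degrees F)
readings = [(0, 42.0), (1, 42.1), (2, 42.2), (3, 42.6), (4, 42.7), (5, 43.2)]
print(deadband_filter(readings, deadband=0.5))
```

Six raw samples become three stored rows here; a wider deadband saves more storage at the cost of trend resolution.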

Expression & Script Tags — computed values (PUE = Total_kW / IT_kW) or Python-driven logic that fires on change. Useful for derived metrics, unit conversions, and cross-system calculations.
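PUE is conventionally defined as total facility power divided by IT power, so it is always at least 1. A small sketch of the kind of derived values an expression tag computes; the function names and kW figures are illustrative only.

```python
def pue(total_facility_kw, it_kw):
    """Power Usage Effectiveness: total facility power / IT power."""
    if it_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_kw


def fahrenheit_to_celsius(temp_f):
    """Unit conversion of the kind often done in a derived tag."""
    return (temp_f - 32.0) * 5.0 / 9.0


print(round(pue(total_facility_kw=5000.0, it_kw=4000.0), 2))
print(round(fahrenheit_to_celsius(42.0), 2))
```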

Perspective HMI

Perspective is Ignition's HTML5 visualization module — the successor to Vision. Operators open a browser tab (or a native Perspective Workstation app), authenticate via SAML/LDAP/AD, and see live plant graphics. Key concepts:

Views — individual screens (one-line, chiller plant, alarm summary)

Components — drag-and-drop: gauges, charts, power bars, SVG overlays

Bindings — wire a component property to a tag value (live update)

Styles / Themes — dark mode, responsive layouts, role-based views

Sessions — each browser tab is a session; identity-aware with RBAC

Embedding — iFrame Perspective inside DCIM or NOC dashboards

PROTOCOLS // THE LANGUAGE OF INDUSTRIAL SYSTEMS

How equipment talks to each other

A single datacenter may contain equipment from 15+ vendors. Getting them to communicate reliably is one of the hardest integration challenges in the industry. These are the core protocols:

OPC-UA — The Modern Standard

OPC-UA (OPC Unified Architecture) replaces the legacy COM/DCOM-based OPC Classic with a platform-independent, secure protocol. The server exposes a hierarchical node tree (Objects → Variables → Methods). Clients browse, read/write, and subscribe to changes — subscriptions push data on change, eliminating polling overhead. Built-in TLS encryption, certificate-based auth.

Primary standard for PLC-to-SCADA and SCADA-to-IT integration. Ignition's OPC-UA server is a key integration point.

Address Space — OPC-UA organizes all data into a unified address space of typed nodes. Each node has a NodeId (namespace + identifier), a BrowseName, and a set of attributes. Variables hold values (temperature, status), Objects group related variables (a “Chiller” object containing supply temp, return temp, status, runtime hours), and Methods expose callable actions (start, stop, reset). This rich data model means a client can discover the full structure of a PLC program just by browsing — no register map spreadsheets needed.

Subscriptions & Monitored Items — instead of polling every register every second, the client tells the server: “notify me when Chiller_1/Supply_Temp changes by more than 0.5°F.” The server batches change notifications and publishes them at a configurable interval (e.g., 500 ms). This drastically reduces network traffic — in a 10,000-point system where only ~5% of values change in a given cycle, only those ~500 values are transmitted.

Security Model — three layers: Transport (TLS 1.2+), Session (username/password or X.509 certificate), and Application (certificate trust lists). The server and client exchange certificates on first connection. Security policies: None (lab only), Basic256Sha256, and Aes128_Sha256_RsaOaep. In production data centers, mutual certificate auth is required — no anonymous access.

Ignition as OPC-UA Server — Ignition exposes its entire tag tree as an OPC-UA server (default port 62541). This means any OPC-UA client — a DCIM platform, a Python analytics script, a digital twin — can connect and read every tag in Ignition without any Ignition-specific SDK. It's also an OPC-UA client, connecting upstream to PLCs (Siemens, Beckhoff, Allen-Bradley) that expose their own OPC-UA servers. This dual role makes Ignition the integration hub between OT and IT.

Companion Specifications — industry groups define standard OPC-UA information models for specific equipment: PLCopen for motion, FDI for field devices, and emerging specs for data center infrastructure. When a chiller vendor implements the companion spec, their OPC-UA server exposes data with standardized names and types — true plug-and-play interoperability instead of custom register mapping per vendor.

Modbus TCP/RTU — The Universal Fallback

The simplest and most widely deployed industrial protocol (since 1979). Modbus TCP runs over Ethernet port 502; Modbus RTU runs over RS-485 serial. Data organized as registers: Coils (R/W bits), Discrete Inputs (read-only bits), Input Registers (read-only 16-bit), Holding Registers (R/W 16-bit). Common function codes: FC03 (read holding), FC04 (read input), FC06 (write single register), FC16 (write multiple).

Nearly every power meter and VFD, and many simple sensors, speak Modbus. The protocol has no security features, so it relies on network segmentation.
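To make the register model concrete, here is a minimal sketch that builds an FC03 request frame for Modbus TCP and decodes a 32-bit float from two holding registers. The big-endian word order in `registers_to_float` is an assumption; real devices vary, so check the vendor's register map. The addresses and values are invented.

```python
import struct


def build_read_holding_request(transaction_id, unit_id, start_address, quantity):
    """Build a Modbus TCP FC03 (read holding registers) request frame."""
    pdu = struct.pack(">BHH", 0x03, start_address, quantity)
    # MBAP header: transaction id, protocol id (always 0),
    # length of everything after the length field (unit id + PDU), unit id
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu


def registers_to_float(high_word, low_word):
    """Combine two 16-bit registers into an IEEE 754 float (high word first)."""
    return struct.unpack(">f", struct.pack(">HH", high_word, low_word))[0]


frame = build_read_holding_request(transaction_id=1, unit_id=1, start_address=0, quantity=2)
print(frame.hex())
print(registers_to_float(0x42F6, 0xE979))
```

The whole request is 12 bytes with no authentication field anywhere in it, which is why segmentation carries the security burden.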

MQTT — Message Queuing Telemetry Transport

is a lightweight publish/subscribe messaging protocol designed for constrained networks. Unlike request/response protocols (HTTP, Modbus), MQTT decouples producers from consumers — publishers don't need to know who's listening, and subscribers don't need to know who's sending. This makes it ideal for IIoT and data center monitoring where thousands of sensors feed data to multiple consumers (SCADA, DCIM, analytics, digital twins).

Core Architecture

Broker

Central message router. All clients connect to the broker — never directly to each other. The broker receives published messages, filters by topic, and delivers to matching subscribers. Examples: HiveMQ (enterprise), Mosquitto (open-source), EMQX (high-scale).

Topics

Hierarchical UTF-8 strings using / as delimiter. Example: site/bldg-A/mech/chiller/CH-1/chwst. Wildcards: + matches exactly one level, # matches all remaining levels. For example, site/bldg-A/mech/chiller/+/chwst gets all chiller supply temps.

Publish / Subscribe

A client publishes a message to a topic. Any client subscribed to that topic (or a matching wildcard) receives it. One publisher can feed many subscribers. Many publishers can feed one subscriber. No coupling.

Quality of Service (QoS) Levels

Level	Guarantee	Handshake	DC Use Case
QoS 0	At most once (fire & forget)	None — send and move on	High-frequency sensor telemetry (temp every 1s). Losing one reading is fine — next one is 1s away.
QoS 1	At least once	PUBACK from broker	Alarm notifications, setpoint changes. Must arrive but duplicates are tolerable. Most common in IIoT.
QoS 2	Exactly once	4-step handshake (PUBREC/PUBREL/PUBCOMP)	Billing/metering data, command acknowledgments. High overhead — use sparingly.

Key Mechanisms

Retained Messages

Broker stores the last message published to a topic with the “retain” flag. When a new subscriber connects, it immediately gets the retained message — no waiting for the next publish cycle. Critical for DCIM: a new dashboard instance instantly shows current values instead of blank until the next sensor poll.

Last Will & Testament (LWT)

On connect, a client registers a “will” message with the broker. If the client disconnects unexpectedly (network drop, crash), the broker publishes the will message on its behalf. Used to signal device offline status — e.g., site/bldg-A/edge-gw-1/status → OFFLINE. Sparkplug B formalizes this as “death certificates.”

Persistent Sessions

With cleanSession=false, the broker stores a client's subscriptions and queues QoS 1 and QoS 2 messages while it's offline. When the client reconnects, it receives the messages it missed. Essential for edge gateways that may lose WAN connectivity; combined with Ignition's Store & Forward, this protects against data loss during outages.
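Retained messages and Last Will are easier to reason about with a toy model. The class below is an in-memory stand-in written for illustration, not a real MQTT client or broker: it supports exact-match topics only, and it assumes the will is published with the retain flag set.

```python
class ToyBroker:
    """In-memory stand-in for an MQTT broker (exact-match topics only)."""

    def __init__(self):
        self.retained = {}      # topic -> last retained payload
        self.subscribers = {}   # topic -> list of callbacks
        self.wills = {}         # client_id -> (topic, payload)

    def connect(self, client_id, will=None):
        if will is not None:
            self.wills[client_id] = will

    def publish(self, topic, payload, retain=False):
        if retain:
            self.retained[topic] = payload
        for callback in self.subscribers.get(topic, []):
            callback(topic, payload)

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)
        if topic in self.retained:
            # A new subscriber gets the retained value immediately
            callback(topic, self.retained[topic])

    def connection_lost(self, client_id):
        """Unexpected disconnect: the broker publishes the client's will."""
        will = self.wills.pop(client_id, None)
        if will is not None:
            self.publish(will[0], will[1], retain=True)


broker = ToyBroker()
received = []
status_topic = "site/bldg-A/edge-gw-1/status"
broker.connect("edge-gw-1", will=(status_topic, "OFFLINE"))
broker.publish(status_topic, "ONLINE", retain=True)
broker.subscribe(status_topic, lambda topic, payload: received.append(payload))
broker.connection_lost("edge-gw-1")
print(received)
```

The late subscriber sees ONLINE without waiting for a publish, then sees OFFLINE even though the gateway never sent it.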

Topic Design — Data Center Pattern

Good topic design enables flexible subscriptions. A well-designed hierarchy lets DCIM subscribe to everything, while a cooling engineer subscribes only to their building's mechanical data.

# Topic hierarchy pattern:
{site}/{building}/{system}/{subsystem}/{equipment}/{point}

# Examples:
site-sv1/bldg-A/mech/chiller/CH-1/chwst        → 42.1°F
site-sv1/bldg-A/mech/chiller/CH-1/status        → RUNNING
site-sv1/bldg-A/mech/ct/CT-2/fan-speed          → 72%
site-sv1/bldg-A/elec/swgr/MSB-1/kw              → 4250
site-sv1/bldg-A/elec/pdu/PDU-R3-A/phase-a-amps  → 82.4
site-sv1/bldg-A/env/row-12/rack-4/inlet-temp     → 74.8°F

# Subscription examples:
site-sv1/bldg-A/mech/#           → ALL mech data, bldg A
site-sv1/+/elec/swgr/+/kw        → ALL switchgear kW, all bldgs
site-sv1/bldg-A/mech/chiller/+/chwst  → ALL chiller supply temps
#                                → EVERYTHING (DCIM firehose)
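The wildcard rules are simple enough to implement directly, which helps when testing a topic design before deploying it. A minimal sketch of filter matching; it ignores the special handling of topics that begin with `$`, and the function name is hypothetical.

```python
def topic_matches(topic_filter, topic):
    """Return True if `topic` matches an MQTT filter using + and # wildcards."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for index, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard is only valid as the last filter level
            return index == len(filter_levels) - 1
        if index >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[index]:
            return False
    return len(filter_levels) == len(topic_levels)


chwst = "site-sv1/bldg-A/mech/chiller/CH-1/chwst"
print(topic_matches("site-sv1/bldg-A/mech/#", chwst))
print(topic_matches("site-sv1/+/elec/swgr/+/kw", "site-sv1/bldg-A/elec/swgr/MSB-1/kw"))
print(topic_matches("site-sv1/bldg-A/mech/chiller/+/chwst", "site-sv1/bldg-A/mech/ct/CT-2/fan-speed"))
```

The same function can back a quick check of broker ACL rules, since ACLs are usually expressed as topic filters.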

Why MQTT Beats Polling for IIoT

Aspect	Polling (Modbus/OPC-UA)	Pub/Sub (MQTT)
Data flow	SCADA asks each device on schedule	Devices publish when data changes or on interval
Bandwidth	Constant — polls even when nothing changes	Proportional to actual change rate
Latency	Worst-case = poll interval	Near-instant on change
Scale	More devices = slower cycle	Broker handles 100k+ connections
WAN friendly	Fragile over unreliable links	Built for constrained/intermittent networks
Adding consumers	New connection per consumer	Just subscribe — no impact on publisher

Note: OPC-UA also supports subscriptions (server pushes on change) — but MQTT is purpose-built for the edge-to-cloud segment where OPC-UA's rich data model isn't needed.

Sparkplug B — The IIoT Standard on MQTT

Raw MQTT is payload-agnostic — it delivers bytes without caring what they mean. Sparkplug B (maintained by the Eclipse Foundation) adds an application-layer specification that standardizes how industrial data is structured, encoded, and managed over MQTT. It turns MQTT from a transport protocol into a complete IIoT data infrastructure.

Standardized Topic Namespace

spBv1.0/{group_id}/{msg_type}/{edge_node_id}/{device_id}

# Message types:
  NBIRTH  — Edge node comes online (all metrics)
  NDEATH  — Edge node goes offline
  NDATA   — Edge node metric updates
  DBIRTH  — Device comes online
  DDEATH  — Device goes offline
  DDATA   — Device metric updates
  NCMD    — Command to edge node
  DCMD    — Command to device

# Example:
spBv1.0/NVIDIA-SV1/DDATA/EDGE-GW-BLDG-A/CH-1
→ Chiller 1 data from Building A edge gateway
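Because the namespace is fixed, a consumer can route messages by parsing the topic alone. The sketch below covers only the eight message types listed above (it does not handle STATE messages), and the type and function names are hypothetical.

```python
from typing import NamedTuple, Optional

MESSAGE_TYPES = {"NBIRTH", "NDEATH", "NDATA", "DBIRTH", "DDEATH", "DDATA", "NCMD", "DCMD"}


class SparkplugTopic(NamedTuple):
    group_id: str
    message_type: str
    edge_node_id: str
    device_id: Optional[str]


def parse_sparkplug_topic(topic):
    """Split a Sparkplug B topic into its named parts, validating the shape."""
    parts = topic.split("/")
    if len(parts) not in (4, 5) or parts[0] != "spBv1.0":
        raise ValueError(f"not a Sparkplug B topic: {topic!r}")
    _namespace, group_id, message_type, edge_node_id, *rest = parts
    if message_type not in MESSAGE_TYPES:
        raise ValueError(f"unknown message type: {message_type!r}")
    device_id = rest[0] if rest else None
    if message_type.startswith("D") and device_id is None:
        raise ValueError("device-level messages need a device_id")
    return SparkplugTopic(group_id, message_type, edge_node_id, device_id)


parsed = parse_sparkplug_topic("spBv1.0/NVIDIA-SV1/DDATA/EDGE-GW-BLDG-A/CH-1")
print(parsed)
```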

Birth & Death Certificates

When an edge node connects, it publishes an NBIRTH message containing ALL of its metrics with metadata (name, datatype, engineering units, alias). This acts as a self-describing schema — any subscriber immediately knows what data this node provides.

The node also registers an LWT with the broker: an NDEATH message. If the node drops, the broker publishes NDEATH. Every consumer knows instantly that this node's data is stale.

Why it matters: In raw MQTT, if a device just stops publishing, consumers don't know if the device is offline or if the value just hasn't changed. Sparkplug B eliminates this ambiguity.

Protobuf Encoding

Sparkplug B uses Google Protocol Buffers (Protobuf) for payload encoding instead of JSON or XML. Result: 3–10× smaller payloads, faster serialization/deserialization. Each metric carries: name, alias (numeric shorthand), timestamp, datatype, and value. After BIRTH, DDATA messages send only changed metrics using aliases — extremely efficient.

Ignition MQTT Modules

Ignition implements Sparkplug B via Cirrus Link modules:

• MQTT Engine — On the central gateway. Subscribes to the broker and auto-creates tags from BIRTH messages.
• MQTT Transmission — On edge gateways. Publishes OPC-UA tag data as Sparkplug B messages to the broker.
• MQTT Distributor — Optional built-in MQTT broker within Ignition (for simpler deployments without a standalone broker).

MQTT Security — Data Center Requirements

Transport Security

• TLS 1.2+ — Encrypts all traffic between clients and broker. Port 8883 (MQTTS) instead of 1883.
• Mutual TLS (mTLS) — Both client and broker present certificates. Standard for IIoT.
• Certificate rotation — Edge devices need automated cert renewal (SCEP, EST, or custom CA).

Access Control

• Username/password — Basic auth (always over TLS).
• ACLs — Topic-level read/write permissions per client. E.g., an edge gateway can only publish to its own subtree.
• Client ID validation — Broker enforces unique client IDs; duplicate connection = kick the old one.
• DMZ placement — Broker sits in the IT/OT DMZ. OT side publishes in; IT side subscribes out. No inbound connections from IT to OT.

Broker High Availability

A single broker is a single point of failure. Production deployments use clustered brokers (HiveMQ cluster, EMQX cluster) or active/standby pairs with shared persistent storage. Clients configure multiple broker endpoints and reconnect automatically on failure. Sparkplug B's birth/death mechanism ensures state is rebuilt after any failover.

BACnet IP — Building Automation Crossover

BACnet/IP runs over UDP port 47808 (0xBAC0). Data organized as objects: Analog Input/Output/Value, Binary I/O/V, Multi-State, Trend Log, Schedule. Each object has properties (Present-Value, Status-Flags, Description). Supports COV (Change of Value) subscriptions. Many data centers still use BACnet for HVAC and BMS. Gateways translate BACnet ↔ Modbus or BACnet ↔ OPC-UA at system boundaries.

Feature	OPC-UA	Modbus TCP	MQTT	BACnet IP
Model	Client/Server	Master/Slave	Pub/Sub	Client/Server
Security	Built-in TLS	None native	TLS optional	Limited
Data Model	Rich (typed nodes)	Flat registers	Payload agnostic	Object-based
Best For	PLC↔SCADA	Meters, simple I/O	IIoT, edge, cloud	Building HVAC
Scalability	Excellent	Limited (247 dev)	Excellent	Good

The protocol zoo in practice

Power meters → Modbus TCP

UPS systems → SNMP + Modbus

Cooling plant → BACnet IP

Generators → Modbus RTU / CAN

Fire alarm → Proprietary serial

Leak detection → Dry contacts / Modbus

PDUs → SNMP + REST API

Edge/cloud → MQTT Sparkplug B

EPMS

Electrical Power Monitoring

Every branch circuit and every bus section has a power meter reporting voltage, current, kW, kWh, power factor, and THD in real time. This data feeds PUE calculations, capacity forecasting, and — critically — fault detection. A 2% phase imbalance caught by EPMS today can prevent a busbar failure next month.

Common platforms: Schneider ION meters, SEL, Dranetz. Communication via Modbus TCP/RTU. Energy dashboards aggregate for billing, allocation, and sub-metering by tenant or department.
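Phase imbalance is a one-line calculation once the three phase readings are in hand. The sketch below uses the common "maximum deviation from the mean, divided by the mean" definition; applying it to phase currents (rather than voltages) and the sample amp values are assumptions for illustration.

```python
def imbalance_pct(phase_a, phase_b, phase_c):
    """Percent imbalance = max deviation from the mean / mean * 100."""
    phases = (phase_a, phase_b, phase_c)
    mean = sum(phases) / 3
    if mean == 0:
        return 0.0
    return max(abs(p - mean) for p in phases) / mean * 100


print(round(imbalance_pct(82.4, 80.0, 79.0), 2))
print(imbalance_pct(80.0, 80.0, 80.0))
```

An EPMS would typically evaluate this per meter on every poll and alarm when it stays above a threshold for a sustained period.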

BMS / MECHANICAL

Cooling Plant Control

The BMS controls chiller staging, cooling tower fan speed, pump VFDs, CRAH/CRAC units, and economizer dampers. In an AI datacenter running direct-to-chip liquid cooling, the BMS must maintain CDU supply temperature within ±1°C while dynamically responding to GPU workload changes that can swing rack power by 40% in seconds.

PID loops everywhere: supply air temp, chilled water ΔP, condenser water temperature. Tuning parameters (Kp, Ki, Kd) directly impact stability and energy efficiency.
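A textbook PID loop fits in a dozen lines. The sketch below is a generic parallel-form controller with output clamping and a simple anti-windup rule; it is not any vendor's PID block, and the demo holds the measurement fixed (open loop) purely to show how the integral term ramps and then stops at the output limit.

```python
class PID:
    """Parallel-form PID with output clamping and conditional integration."""

    def __init__(self, kp, ki, kd, out_min, out_max):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_min, self.out_max = out_min, out_max
        self.integral = 0.0
        self.previous_error = None

    def update(self, setpoint, measurement, dt):
        error = setpoint - measurement
        if self.previous_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.previous_error) / dt
        self.previous_error = error
        candidate = self.integral + error * dt
        output = self.kp * error + self.ki * candidate + self.kd * derivative
        # Anti-windup: stop integrating when saturated and the error pushes further out
        winding_up = (output > self.out_max and error > 0) or (
            output < self.out_min and error < 0
        )
        if not winding_up:
            self.integral = candidate
        return min(max(output, self.out_min), self.out_max)


# Illustrative gains and limits; DP held 2 PSID below setpoint for 200 one-second steps
dp_loop = PID(kp=2.0, ki=0.5, kd=0.0, out_min=30.0, out_max=100.0)
speeds = [dp_loop.update(setpoint=12.0, measurement=10.0, dt=1.0) for _ in range(200)]
print(speeds[0], speeds[59], speeds[-1], dp_loop.integral)
```

Raising Ki makes the ramp steeper and the loop faster to correct, but in a closed loop it also increases the risk of oscillation, which is the stability-versus-efficiency trade the tuning parameters control.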

FIRE & LIFE SAFETY

Non-negotiable systems

VESDA (laser-based air sampling) for earliest smoke detection. Clean-agent suppression (FM-200, Novec 1230) for IT rooms. Pre-action sprinklers (double interlock) for high-value spaces. EPO (Emergency Power Off) kills power to entire zones. These systems have hard interlocks with BMS and EPMS — a fire alarm can shut down HVAC and trip power to a fire zone. Governed by NFPA 75/76.

Leak detection cables run under raised floors and along every pipe route — critical for liquid-cooled environments where a CDU leak can damage millions in hardware.

SOO // SEQUENCE OF OPERATIONS

The bridge between design and code

A Sequence of Operations (SOO) is the definitive document describing how a system operates under all conditions. It's the spec the PLC programmer codes from, the commissioning agent tests against, and the operations team references for troubleshooting.

System Desc.: Equipment list, design conditions, capacity

Operating Modes: Auto, Manual, Off, Emergency, Standby, Test

Startup Seq.: Step-by-step with prerequisites and time delays

Shutdown Seq.: Normal and emergency shutdown procedures

Modulating: PID loops, setpoints, control ranges, reset schedules

Staging Logic: Lead/lag, load-based staging, rotation schedules

Alarm Matrix: Every alarm condition, trip point, action, reset requirement

Failure Modes: Sensor failure, comms loss, equipment trip fallbacks

EXAMPLE // CHILLER STAGING

SOO excerpt: lead/lag logic

Lead Chiller Start:
  IF ChW_Return > Setpoint + 2°F
  AND Cooling_Demand > 20%
  AND No_Active_Alarms
  THEN Start Lead_Chiller
  WAIT 300s (anti-recycle timer)

Lag Chiller Stage-On:
  IF Load% > 75% on running chillers
     (sustained 10 min)
  OR ChW_Supply > Setpoint + 3°F
     (sustained 5 min)
  THEN Start next chiller in rotation

DP Setpoint Reset:
  Target: most-open valve at 90%
  DP Setpoint: 12 PSID (range 5-25)
  PID: Kp=2.0, Ki=0.5, Kd=0.0
  Output: Pump VFD 30% min - 100% max

Every setpoint, every timer, every PID gain in the SOO becomes a configurable parameter in the PLC program. The commissioning agent verifies each one during L3/L4 testing.

SAMPLE SOO // ANNOTATED CHILLED WATER PLANT

Complete SOO with explanations

Below is a complete, annotated Sequence of Operations for a chilled water plant typical of a hyperscale AI data center. Each section includes an explanation of why it exists and what reviewers/programmers should focus on.

§1 SYSTEM DESCRIPTION

✦ Why this matters: Sets scope. Identifies every piece of equipment under this SOO's control so there is zero ambiguity about what's included. Also defines design conditions — the “rated” environment the control logic must satisfy.

SYSTEM: Central Chilled Water Plant — Data Hall A
CAPACITY: 3 × 1,000-ton centrifugal water-cooled chillers (N+1)
DESIGN CONDITIONS:
  ChW Supply Temp (ChWST):    42 °F
  ChW Return Temp (ChWRT):    55 °F
  Design ΔT:                  13 °F
  CW Supply Temp (CWST):      85 °F (summer design)
  CW Return Temp (CWRT):      95 °F

EQUIPMENT:
  CH-1, CH-2, CH-3       Chillers (York YK, VFD compressors)
  CHWP-1, -2, -3         Primary CHW Pumps (VFD, 1,500 GPM ea)
  CWP-1, -2, -3          Condenser Water Pumps (constant, 2,200 GPM)
  CT-1, CT-2, CT-3       Cooling Towers (variable-speed fans)

§2 OPERATING MODES

✦ Why this matters: Prevents ambiguity — “Auto” means different things to different people unless spelled out. HOA (Hand/Off/Auto) at the local switch must always be described because it overrides BMS commands.

MODE        DESCRIPTION
────        ──────────────────────────────────────────────
AUTO        BMS/SCADA controls all staging and sequencing.
            All equipment HOA switches must be in AUTO.
MANUAL      Operator commands via HMI. Interlocks active.
            Staging logic disabled.
OFF         All equipment commanded off. Safeties monitored.
EMERGENCY   Fire alarm, EPO, or critical leak detection.
            Immediate shutdown. See §8.
STANDBY     Ready to start, waiting for cooling demand
            signal from DCIM (IT load > 50 kW threshold).

§3 STARTUP SEQUENCE

✦ Why this matters: Step-by-step with prerequisites that must be TRUE before each step proceeds. Prevents starting a chiller without flow (which would damage the evaporator). Time delays protect equipment from rapid cycling. Cx tip: During L3 FPT, verify every prerequisite actually blocks startup when forced FALSE.

PREREQUISITES (all must be TRUE):
  □ System mode = AUTO or STANDBY
  □ No active critical alarms (Level 1 or 2)
  □ ChW loop pressure > 15 PSIG (system filled)
  □ All HOA switches = AUTO at MCC
  □ CW isolation valves OPEN (limit switch FB)
  □ No active fire suppression signal

SEQUENCE:
  STEP 1: Start lead CW Pump
          WAIT: Flow switch TRUE within 30s
          FAIL: Alarm "CW Pump Fail", halt

  STEP 2: Start lead CT fan(s) at 30% minimum
          WAIT: 15s for fan to reach speed

  STEP 3: Open chiller CW isolation valves
          WAIT: End-switch OPEN within 45s

  STEP 4: Start lead CHW Pump at 40% speed
          WAIT: Flow > 800 GPM within 30s

  STEP 5: Command lead Chiller to START
          WAIT: Chiller RUNNING within 120s
          NOTE: Chiller has internal start seq
                (oil pump, guide vanes). Don't bypass.

  STEP 6: Release PID control:
          - CHWP → DP setpoint (see §5)
          - CT fan → CW return temp (see §6)
          - Chiller → ChWST setpoint (42°F)

  ANTI-RECYCLE: 300s min between starts
                per chiller (compressor protection)

§4 SHUTDOWN SEQUENCE

✦ Why this matters: Shutdown is the reverse of startup but with critical differences: the chiller must stop before the pumps so the refrigerant cycle can wind down while water is still flowing. Stopping the pumps first leaves the evaporator without flow while the compressor is still running, which can freeze it. The post-circulation timer keeps water flowing through the chiller barrel after the compressor stops.

NORMAL SHUTDOWN:
  1. Command chiller STOP (unload guide vanes)
     WAIT: Chiller STOPPED within 180s
  2. Run CHWP at 40% for 120s (post-circ)
  3. Stop CHW Pump
  4. Close CW isolation valves
  5. Stop CW Pump
  6. Ramp down CT fans over 30s

EMERGENCY SHUTDOWN (see §8 triggers):
  ALL equipment → IMMEDIATE STOP
  Exception: CHWP runs 30s for chiller
  protection if safe to do so

§5 DIFFERENTIAL PRESSURE CONTROL

✦ Why this matters: DP control drives pump speed. The DP sensor is placed at the most hydraulically remote coil — if DP is adequate there, it's adequate everywhere. DP reset saves additional energy by lowering the setpoint when valves aren't demanding full flow.

DP SETPOINT:  12 PSID (range: 5–25 PSID)
DP SENSOR:    Most remote CRAH / coil header
OUTPUT:       VFD speed → CHWP-1/2/3

PID TUNING:
  Kp = 2.0  Ki = 0.5  Kd = 0.0
  Output range: 30% min – 100% max

DP RESET (energy optimization):
  Target: most-open CRAH valve = 85–95%
  IF all valves < 70%: DP setpoint −0.5 PSI
  IF any valve > 95%:  DP setpoint +0.5 PSI
  Rate limit: 1 PSI per 5 minutes
  Floor: never below 6 PSID
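The DP reset rule above translates almost line for line into code. A minimal sketch, assuming the function is evaluated no more often than every 150 seconds so that two 0.5 PSI steps respect the "1 PSI per 5 minutes" rate limit; the function name and the call interval are assumptions.

```python
def dp_reset_step(dp_setpoint, valve_positions, step=0.5, floor=6.0, ceiling=25.0):
    """One evaluation of the DP reset rule; valve positions are in percent open."""
    if not valve_positions:
        return dp_setpoint             # no valve feedback: leave the setpoint alone
    if any(v > 95 for v in valve_positions):
        dp_setpoint += step            # a valve is nearly wide open: raise the setpoint
    elif all(v < 70 for v in valve_positions):
        dp_setpoint -= step            # every valve is throttled: lower the setpoint
    return min(max(dp_setpoint, floor), ceiling)


print(dp_reset_step(12.0, [40, 55, 60]))
print(dp_reset_step(12.0, [40, 97]))
print(dp_reset_step(12.0, [80, 90]))
```

Checking the "any valve starved" case first means a single struggling CRAH always wins over the energy-saving reduction.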

§6 CW TEMPERATURE CONTROL

✦ Lower CW temps improve chiller efficiency (lower lift), but tower fans have diminishing returns. “Approach” (CW temp − wet-bulb) is a key metric: 7°F = good, >12°F = fouled fill.

CW RETURN SETPOINT: 78°F (range 65–85°F)
OUTPUT: CT fan VFD speed (all parallel)
PID: Kp=3.0  Ki=0.8  Kd=0.0
     Output: 20%–100%

OPTIMAL RESET:
  CW SP = Wet_Bulb + 7°F (min 65°F)

FREE COOLING / WATERSIDE ECON:
  IF CW_Supply < ChW_Return − 3°F
     (sustained 10 min):
  → Enable plate HX bypass
  → Modulate valve for ChWST SP
  → Stage down chillers if HX
    handles full load
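The optimal-reset and free-cooling checks are small enough to sketch directly. Function names are hypothetical, the clamp limits come from the setpoint range stated above, and the 10-minute sustain timer on the economizer condition is deliberately left out.

```python
def cw_setpoint_f(wet_bulb_f, approach_f=7.0, minimum_f=65.0, maximum_f=85.0):
    """Optimal reset: wet-bulb plus target approach, clamped to the allowed range."""
    return min(max(wet_bulb_f + approach_f, minimum_f), maximum_f)


def economizer_permitted(cw_supply_f, chw_return_f, margin_f=3.0):
    """Instantaneous free-cooling check (sustain timer not modeled)."""
    return cw_supply_f < chw_return_f - margin_f


print(cw_setpoint_f(70.0))
print(cw_setpoint_f(50.0))
print(economizer_permitted(cw_supply_f=50.0, chw_return_f=55.0))
```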

§7 CHILLER STAGING (LEAD/LAG)

✦ Staging must balance responsiveness (don't let temps rise) against efficiency (don't short-cycle). “Sustained” timers prevent hunting from temporary load spikes.

STAGE-ON (add chiller):
  Load > 80% sustained 10 min
  OR ChWST > SP+3°F sustained 5 min
  → Start next in rotation
  POST-STAGE LOCKOUT: 15 min

STAGE-OFF (remove chiller):
  Load < 30% per unit sustained 15 min
  AND >1 chiller running
  → Stop lag chiller (lowest hours)
  POST-STAGE LOCKOUT: 20 min

ROTATION:
  Lead rotates weekly (Sun 02:00)
  OR runtime delta > 500 hrs
  Sequence: CH-1→CH-2→CH-3→CH-1
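The "sustained" timers and post-stage lockouts are what keep staging from hunting. The sketch below models only the stage-on and stage-off rules with the values from this section; lead rotation and alarms are not modeled, and the class names are hypothetical.

```python
class SustainedCondition:
    """True only after a condition has held continuously for `hold_s` seconds."""

    def __init__(self, hold_s):
        self.hold_s = hold_s
        self.elapsed = 0.0

    def update(self, condition, dt):
        self.elapsed = self.elapsed + dt if condition else 0.0
        return self.elapsed >= self.hold_s


class ChillerStager:
    def __init__(self, max_chillers=3):
        self.running = 1
        self.max_chillers = max_chillers
        self.lockout_s = 0.0
        self.high_load = SustainedCondition(10 * 60)
        self.high_temp = SustainedCondition(5 * 60)
        self.low_load = SustainedCondition(15 * 60)

    def _reset_timers(self):
        for timer in (self.high_load, self.high_temp, self.low_load):
            timer.elapsed = 0.0

    def update(self, load_pct, chwst_f, setpoint_f, dt):
        """Advance all timers by dt seconds and return the number of running chillers."""
        self.lockout_s = max(0.0, self.lockout_s - dt)
        # Update every timer each call so none of them misses a sample
        load_high = self.high_load.update(load_pct > 80, dt)
        temp_high = self.high_temp.update(chwst_f > setpoint_f + 3, dt)
        load_low = self.low_load.update(load_pct < 30, dt)
        if self.lockout_s > 0:
            return self.running
        if (load_high or temp_high) and self.running < self.max_chillers:
            self.running += 1
            self.lockout_s = 15 * 60       # post-stage-on lockout
            self._reset_timers()
        elif load_low and self.running > 1:
            self.running -= 1
            self.lockout_s = 20 * 60       # post-stage-off lockout
            self._reset_timers()
        return self.running


stager = ChillerStager()
# Hold 85% load for 30 minutes, sampled once per minute
results = [stager.update(load_pct=85, chwst_f=42.0, setpoint_f=42.0, dt=60) for _ in range(30)]
print(results[8], results[9], results[23], results[24])
```

In this run the second chiller starts after 10 sustained minutes and the third only after the 15-minute lockout expires, even though load stays high the whole time.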

§8 ALARM & INTERLOCK MATRIX

✦ Why this matters: The most safety-critical part of an SOO. “Advisory” = notification only. “Warning” = operator attention. “Critical” = automatic protective action. Deadbands prevent alarm chatter. Cx tip: Every alarm must be tested during L3 FPT by simulating the condition.

Alarm	Condition	SP	Delay	Severity	Action
CHW_HI_TEMP	ChWST > SP	+5°F	60s	Warning	Notify, stage on lag
CHW_CRIT_TEMP	ChWST > critical	+8°F	30s	Critical	All chillers ON, page on-call
DP_LOW	DP < minimum	4 PSID	30s	Warning	Pump speed → 80%
CHILLER_FAULT	Controller fault	—	0s	Critical	Start standby, page
PUMP_FAIL	Flow FALSE while ON	—	30s	Critical	Start standby pump
LEAK_DETECT	Leak sensor active	—	5s	Critical	Isolate zone, E-shutdown
VFD_FAULT	VFD controller fault	—	0s	Critical	Start standby pump
CHW_LO_PRESS	Loop pressure low	10 PSI	60s	Warning	Notify — possible leak
CW_HI_TEMP	CWRT > limit	98°F	120s	Warning	CT fans → 100%
CT_FAN_FAULT	Fan VFD / vibration	—	0s	Warning	Redistribute to other CTs

§9 FAILURE / FALLBACK MODES

✦ Every sensor failure must have a defined fallback — otherwise the PID loop goes haywire. BMS comms loss: system must “fail safe” (keep running) because losing monitoring is less dangerous than losing cooling.

SENSOR FAILURE:
  ChWST fail → Use ChWRT − design ΔT
               Lock pump speed, alarm
  DP fail    → Lock pump at 70%, alarm
               Switch to backup sensor
  CW fail    → CT fans to 80% fixed

BMS/SCADA COMMS LOSS:
  PLC continues last known sequence
  Setpoints hold at last value
  No staging changes
  Alarm on comms restored
  Operator must re-enable AUTO

POWER FAILURE:
  Chillers trip (need clean power)
  Pumps on UPS: auto-restart
  Wait 30s for stable generator
  Then execute normal startup §3

§10 SETPOINTS & PARAMETERS

✦ The commissioning “cheat sheet.” During L3, verify each parameter matches design. Always include adjustable ranges — without them, operators may set dangerous values.

Parameter	Default	Range
ChWST Setpoint	42°F	38–50°F
DP Setpoint	12 PSID	5–25
CW Return SP	78°F	65–85°F
Stage-On Load	80%	60–95%
Stage-Off Load	30%	15–45%
Anti-Recycle	300s	180–600s
Post-Stage Lock	15 min	10–30 min
CHWP Min Speed	30%	20–40%
CT Fan Min	20%	15–30%
Lead Rotation	7 days	1–30 days

§11 INTERFACE POINTS

✦ Why this matters: No system operates in isolation. Missing an interface point is one of the most common commissioning defects. Cx tip: Verify each interface during L3 by having both sides confirm read/write.

TO DCIM / EPMS:
  → Plant kW total
  → Cooling capacity (tons delivered)
  → PUE contribution data
  → ChWST, ChWRT, ΔT, flow rate

FROM DCIM / EPMS:
  ← IT load (kW) — standby→auto trigger
  ← Cooling demand request

TO/FROM FIRE ALARM:
  → Plant running status
  ← Suppression signal → E-shutdown

TO/FROM LEAK DETECTION:
  ← Zone leak → isolate branch
  ← System leak → E-shutdown

TO/FROM ELECTRICAL:
  ← Generator / utility status
  → Plant load (gen load mgmt)

FUTURE // DESIGN → ENGINEERING → Cx PIPELINE

The platform that doesn't exist yet (but should)

Today, the handoff between design, controls engineering, and commissioning is largely manual. A designer creates P&IDs, a controls engineer manually re-interprets them into PLC/SCADA configuration, and a Cx agent manually writes test scripts from the SOO. Each handoff introduces errors and delay. The industry is converging toward a model-driven approach where a single source of truth drives everything downstream.

DESIGN MODEL

P&IDs · Equipment schedules · IO lists · Control narratives

BIM / Revit / EPLAN · AutomationML (IEC 62714)

auto-parse equipment + connections

TAG DATABASE

Auto-generated from equipment model

UDTs per equipment class · hierarchy · alarm config

map to templates

CONTROL LOGIC

PLC programs from SOO templates

Structured text / FBD auto-generated

HMI / GRAPHICS

ISA-101 compliant screens

Faceplates, trends, alarms from UDTs

Cx SCRIPTS

Test procedures from SOO + tag DB

Auto-generated acceptance criteria

deploy + verify

LIVE SYSTEM + DIGITAL TWIN

Runtime verified against design intent

Omniverse 3D twin · live data overlay · auto-Cx verification

WHAT WORKS TODAY

Template-driven HMI

Define one faceplate per equipment type (valve, pump, VFD), bind to UDTs. Every instance auto-generates its own screen. Ignition Perspective, Siemens WinCC, AVEVA System Platform all support this.

EPLAN → TIA Portal

Electrical design in EPLAN exports hardware config to Siemens TIA Portal via AutomationML. IO mapping, rack layout, and network config auto-transfer.

ISA-101 high-performance HMI

Strict design rules (gray backgrounds, no 3D pipes, color = abnormal state) are codifiable. An AI system could generate compliant screens more consistently than most human designers.

WHAT'S STILL MANUAL

SOO → Control logic

A controls engineer reads the SOO and manually programs PLC logic. Same intent, re-interpreted. No standard for machine-readable SOOs.

SOO → Cx scripts

Cx agents manually create test procedures from the same SOO the programmer used. Three humans reading one document = three interpretations = defects.

P&ID → Tag database

Equipment on a P&ID becomes tags manually. BIM semantic standards (Haystack, Brick Schema) are trying to bridge this, but adoption is slow.

NVIDIA POSITIONING

NVIDIA is uniquely positioned to close this gap. Omniverse provides the 3D digital twin — the “HMI” is the model itself with live data overlay. Agentic AI can parse design documents and generate control logic, tag databases, and Cx scripts from a single source of truth. The platform that unifies design → engineering → commissioning into one model-driven pipeline will fundamentally change how fast AI factories deploy. Instead of flat 2D SCADA screens, you get a photorealistic twin where you click a chiller and its faceplate appears with live data — no one needs to manually draw pump graphics when the 3D model already exists.

PLATFORM LANDSCAPE — WHO'S CLOSEST?

Platform	Design	Engineering	Auto HMI	Auto Cx
Siemens TIA + EPLAN	✓ EPLAN → AML export	✓ PLC + HMI in one IDE	✓ Faceplate templates	✗ Manual
Rockwell FT Design Hub	◐ Cloud-based config	✓ Studio 5000	◐ Global objects	✗ Manual
Ignition (Inductive)	✗ No design tool	✓ Excellent SCADA	✓ UDT templates	✗ Manual
Schneider EcoStruxure	◐ Separate products	✓ Multiple platforms	◐ Cx wizards	◐ Partial
NVIDIA Omniverse	✓ 3D from BIM/CAD	◐ Data connectors	✓ Twin IS the HMI	◐ Emerging (AI)

No single platform does all four today. The convergence of digital twins + semantic data models + agentic AI will close the gap within 2–5 years.

COMMISSIONING // Cx

Trust, but verify — systematically

Commissioning (Cx) is the process that proves every system works as designed, not just individually but together, under realistic load conditions and failure scenarios.

L1 Factory. Witness testing at the manufacturer to verify capacity, features, and nameplate data. Critical for custom switchgear, UPS, and generators with 20–40+ week lead times.

L2 Install. Verify installation meets design: mounting, alignment, wiring terminations, labeling, meggering cables, hydrostatic testing on piping. Trace I/O wiring from sensor → panel → PLC module.

L3 Functional. Individual system testing against the SOO: verify startup/shutdown sequences, test every alarm, verify PID tuning, run at 25/50/75/100% loads, confirm historian logging.

L4 Integrated. Multi-system failure testing with live or simulated IT load: utility loss, generator failure, UPS bypass, chiller trip, CRAH failure, CDU leak, BMS comms loss, cascade failures.

L5 Seasonal. Ongoing work: economizer transitions, seasonal setpoints, re-test after modifications, continuous monitoring and trend analysis for degradation.

An un-commissioned datacenter is a datacenter that will fail under load. The question is when, not if.

IST // INTEGRATED SYSTEMS TESTING

Where data center Cx gets unique

L4/IST is performed with live IT load (or load banks). These are the failure scenarios that must be proven before a facility goes live:

Utility loss (one feed, both feeds)

Generator failure to start

Generator failure during operation

UPS bypass / UPS failure

Chiller trip under load

CRAH/CDU fan or pump failure

Cooling tower pump trip

Fire suppression activation (EPO)

BMS/SCADA communication loss

Cascade: chiller trip → temp rise → load shed

Key documentation: pre-written test scripts with expected results, pass/fail criteria, and space for actual results. Every deviation is a deficiency tracked to resolution.

DCIM // DATA CENTER INFRASTRUCTURE MANAGEMENT

The unified operational view

DCIM provides a single pane of glass across all physical infrastructure. It sits at the top of the control pyramid, consuming data from BMS, EPMS, and IT systems:

Asset Management

Every device tracked: location (building/floor/row/rack/U-position), model, serial, power connections, network connections. Change management for installs, moves, decommissions.

Capacity Planning

Power capacity per rack/row/room, cooling capacity, physical space, network ports. Power chain visualization: trace the path from utility feed to individual rack.

Integration Data Flows

EPMS → BMS: power readings for PUE calculation, load-based staging. BMS → DCIM: environmental data for capacity planning. EPMS → DCIM: actual power per rack for utilization tracking. All connected via API, SQL, or OPC-UA gateways.

Platforms: Nlyte, Sunbird dcTrack, Schneider EcoStruxure IT. Hyperscalers (including NVIDIA) often build custom DCIM in-house.

ARCHITECTURE // SENSOR TO DCIM DATA FLOW

End-to-end data architecture from physical sensors through the OT control stack (PLCs → OPC-UA → Ignition SCADA) to IT systems via MQTT, feeding DCIM, analytics, and cloud platforms.

Level 4/5 — Enterprise Network

Data Lake / Time-Series

InfluxDB

Digital Twin

Omniverse

AI / ML Analytics

Anomaly Det.

Business Dashboards

Grafana

REST API · WebSocket · SQL

DCIM PLATFORM

Asset Mgmt · Capacity Planning · PUE · Alarms · Work Orders

LEVEL 3.5 — IT / OT DMZ

MQTT / Sparkplug B (TLS)

MQTT BROKER

HiveMQ / Mosquitto

Sparkplug B birth/death · QoS 1 · Retained

▼

Level 3 — Operations

CENTRAL IGNITION GATEWAY (On-Prem)

Tag Historian

SQL DB

Alarm Notification

Pipeline

MQTT Transmission

Sparkplug B

Sepasoft MES

SPC / OEE

MES · Scheduling · OEE · Quality Management

OPC-UA Server (port 62541) · Gateway Network ↓

Level 2 — Process (SCADA / HMI)

Historical DB

Local SQL

Local Ignition Server

SCADA / HMI

Local Client

Perspective

Cirrus Link Modules

Distributor / Engine

Edge gateways at remote sites ↓

Ignition Edge IIoT

Bldg A Cooling

Ignition Edge Panel

Bldg B

Ignition Edge

Power

SCADA · HMI · Alarming · Reporting · Trending

OPC-UA · Modbus TCP

Level 1 — Control (Intelligent Devices)

PLC

Beckhoff / Siemens

Onboard logic

RTU

Remote sites

Onboard logic

Power Meters / IEDs

Schneider ION/PM

Processor + comms

VFDs / Motor Starters

ABB / Danfoss

Processor + comms

Hardwired · 4-20mA · RS-485 · Dry contacts

Level 0 — Field Devices (Unintelligent)

Temp Sensors

RTD · Thermistor · 4-20mA

Pressure / DP

Transducers · 4-20mA

Actuators / Valves

Control valves · Dampers

Contacts / Switches

Leak det · Smoke · Door

OT Side (Levels 0–2)

Sensors & actuators (L0, unintelligent) → PLCs/RTUs/IEDs/meters (L1, intelligent) via hardwired I/O. L1 controllers → local Ignition SCADA/HMI & Edge (L2) via OPC-UA. Central Ignition Gateway (L3) aggregates via Gateway Network. Deterministic, real-time control.

IT/OT Bridge (DMZ)

Ignition's MQTT Transmission module publishes to a broker in the DMZ using Sparkplug B. Data flows OT→IT only. TLS encrypted, certificate auth, no inbound connections from IT to OT. The broker is the single point of data egress.
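
As an illustration of what the broker sees, Sparkplug B messages are published on topics of the form `spBv1.0/<group_id>/<message_type>/<edge_node_id>[/<device_id>]`. The sketch below only builds such topic strings; the group, node, and device names in the usage note are hypothetical, and payload encoding (protobuf) is out of scope.

```python
NODE_MESSAGES = {"NBIRTH", "NDEATH", "NDATA", "NCMD"}
DEVICE_MESSAGES = {"DBIRTH", "DDEATH", "DDATA", "DCMD"}


def sparkplug_topic(group_id, message_type, edge_node_id, device_id=None):
    """Build a Sparkplug B topic string for a node-level or device-level message."""
    if message_type in DEVICE_MESSAGES:
        if device_id is None:
            raise ValueError("device-level messages need a device_id")
        return f"spBv1.0/{group_id}/{message_type}/{edge_node_id}/{device_id}"
    if message_type in NODE_MESSAGES:
        if device_id is not None:
            raise ValueError("node-level messages do not carry a device_id")
        return f"spBv1.0/{group_id}/{message_type}/{edge_node_id}"
    raise ValueError(f"unknown Sparkplug message type: {message_type}")
```

With hypothetical names, `sparkplug_topic("DC1", "DDATA", "BldgA-Cooling", "CH-01")` yields `spBv1.0/DC1/DDATA/BldgA-Cooling/CH-01`; a subscriber on the IT side could then subscribe to `spBv1.0/DC1/#` to receive everything from that group.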

IT Side (Level 4)

DCIM platform subscribes to MQTT broker or pulls from Ignition's historian via SQL/API. Data lake stores raw telemetry for ML. Omniverse digital twin consumes real-time feeds. Grafana/custom dashboards for NOC displays.

IT/OT // THE PURDUE MODEL

Network segmentation for industrial systems

OT (PLCs, SCADA, sensors) prioritizes availability and safety. IT (business systems, cloud) prioritizes confidentiality. The Purdue Model (ISA-95) defines the separation:

Level 4/5: Enterprise Network (ERP, cloud, DCIM, analytics)

Level 3.5: IT / OT DMZ

Level 3: Operations (Central Ignition, Historian, MES)

Level 2: Process (Local SCADA/HMI, Ignition Edge)

Level 1: Control (PLCs, RTUs, IEDs, VFDs; intelligent devices)

Level 0: Field Devices (sensors, actuators; unintelligent)

OT networks must be segmented from IT with a DMZ. Data flows OT → IT via historians, OPC-UA gateways, or MQTT brokers in the DMZ. Never expose PLCs directly to the IT network. Security standard: IEC 62443.

DIGITAL TWINS // NVIDIA OMNIVERSE

Virtual replica, real-time data

A digital twin is a virtual replica of a physical system continuously updated with real-time sensor data. NVIDIA Omniverse for data centers provides:

3D visualization of entire facility — racks, cooling, power

Real-time sensor data overlay (temperatures, power, airflow)

CFD simulation for airflow optimization

What-if scenarios: "Add 10 racks to Row C — what happens to cooling?"

Predictive maintenance via anomaly detection

Data pipeline: Sensor → PLC → OPC-UA → Historian → API → Omniverse connector → real-time 3D updates. Uses USD (Universal Scene Description) format.

THE AI DIFFERENCE

Why AI workloads break traditional monitoring

Power volatility

Traditional servers draw steady power. GPU clusters swing from idle (~30% TDP) to full load (~100% TDP) in milliseconds when a training batch launches. This creates transient power spikes that stress UPS systems, trip breakers if margins are thin, and confuse capacity planning models built for static loads.

Thermal density

At 80–120 kW per rack, the thermal time constant collapses. A cooling failure that gives you 15 minutes of runway in a 10 kW/rack enterprise hall gives you <3 minutes in an AI hall. Monitoring latency that was "fine" at 60-second poll intervals becomes dangerous — you need sub-second environmental telemetry.

Workload correlation

In an AI cluster, all GPUs in a training job start and stop together. This means power and thermal loads are highly correlated across hundreds of racks — the opposite of the statistical diversity that traditional datacenter designs depend on. Your cooling plant must handle 0-to-100% step changes.

DEVICE CATALOG // ELECTRICAL SYSTEMS

Sample electrical device types — make, model, protocol, key data

A controls engineer must know what's on the other end of the wire. Below are representative devices found in a modern AI data center electrical distribution system — the equipment your SCADA/EPMS will monitor and control.

Device Type	Example Make / Model	Protocol	Key Data Points	Location
Revenue Meter	Schneider ION 9000	Modbus TCP / DNP3	V (L-L, L-N), A, kW, kVAR, kVA, PF, THD per harmonic, demand (15-min), kWh, frequency	MV switchgear
Branch Circuit Monitor	Schneider PM8000 / ION 7650	Modbus TCP	Per-circuit V, A, kW, kWh, PF, breaker status, alarm thresholds	Floor PDU / RPP
Protective Relay	SEL-751 / GE Multilin 489	Modbus RTU / DNP3 / IEC 61850	Fault current, trip status, fault type (OC/GF/diff), event log, waveform capture	MV / LV switchgear
UPS	Vertiv Liebert EXL S1 / Eaton 93PM	SNMP v3 / Modbus TCP	Input/output V, A, kW, load %, battery SOC, runtime remaining, bypass status, temp, alarm state	UPS room
Automatic Transfer Switch	ASCO 7000 Series / Russelectric	Modbus TCP / BACnet	Source 1/2 status, active source, transfer count, transfer time (ms), V per source	MV switchgear room
Generator Controller	DEIF AGC-4 / DSE 8610	Modbus RTU/TCP	kW output, RPM, coolant temp, oil pressure, fuel level, battery V, run hours, start count	Generator yard
Intelligent Rack PDU	ServerTech PRO3X / Raritan PX3	SNMP v3 / Modbus / REST API	Per-outlet V, A, kW, kWh, PF, inlet temp/humidity, outlet switching, alarm thresholds	In-rack
Static Transfer Switch	Schneider Galaxy VS STS / Eaton STS	Modbus TCP / SNMP	Active source, preferred source, transfer count, transfer time (<4ms), source quality	Critical distribution
Power Quality Analyzer	Dranetz HDPQ / Fluke 1760	Modbus TCP / proprietary	THD (V&I), voltage sags/swells, transients, flicker, unbalance, EN 50160 compliance	MV bus / critical loads

DEVICE CATALOG // MECHANICAL SYSTEMS

Sample mechanical device types — make, model, protocol, key data

The mechanical (cooling) side has its own ecosystem of intelligent devices and dumb field instruments. The BMS/SCADA must integrate all of them into a unified monitoring and control platform.

Device Type	Example Make / Model	Protocol / Signal	Key Data Points	Location
Centrifugal Chiller	York YK / Trane CVHF / Carrier 19XR	BACnet IP / Modbus TCP	ChWST, ChWRT, ΔT, loading %, kW, COP, compressor RPM, refrigerant press/temp, oil temp, alarm codes	Chiller plant
Cooling Tower	Evapco AT / BAC Series 3000	BACnet / Modbus	Fan speed %, CW supply/return temp, basin temp, vibration, fan status, cell enable/disable	Roof / yard
VFD (Variable Freq Drive)	ABB ACS880 / Danfoss VLT / Yaskawa GA800	Modbus TCP / PROFINET / EtherNet/IP	Speed (Hz/RPM), motor A, kW, torque %, run status, fault code, PID setpoint/feedback, drive temp	Pump/fan motors
CDU (Coolant Dist Unit)	CoolIT DCLC / Vertiv XDU / Motivair ChilledDoor	Modbus TCP / BACnet IP	IT supply/return temp, facility supply/return temp, flow (GPM), ΔP, leak detect, pump status, conductivity	Row-end / in-row
CRAH / AHU	Schneider Uniflair / Vertiv Liebert CW	BACnet IP / Modbus	Supply/return air temp, fan speed, valve position, filter ΔP, coil temp, humidity, unit status	Data hall perimeter
Control Valve (2-way)	Belimo / Siemens ACVATIX	4–20 mA / 0–10 V / BACnet	Position feedback (%), command %, actuator status (open/closed/fault), torque	Chiller/CRAH coils
Temp Sensor (RTD/Thermistor)	Siemens QAM2120 / TE Connectivity	4–20 mA / RTD (Pt100/Pt1000)	Temperature °F/°C (pipe, duct, ambient, immersion)	Pipes, ducts, racks, ambient
Differential Pressure Sensor	Setra 231 / Dwyer MS-111	4–20 mA / 0–10 V	ΔP across filter, coil, pump, or chiller (PSID / Pa)	Filter, coils, headers
Flow Meter	Badger Meter ModMAG / Siemens SITRANS FM	4–20 mA / Modbus / HART	Flow rate (GPM), totalizer (gallons), flow velocity, direction	CHW/CW mains, CDU loops
Leak Detection	RLE Technologies / TraceTek TT-FFS	Dry contact / Modbus RTU	Leak presence (Y/N), leak location (distance along cable), zone ID	Under floor, pipe routes, CDU
Smoke / Fire Detection	Xtralis VESDA-E VEA / Honeywell FSL100	Proprietary / relay / Modbus	Smoke level (obscuration), alert/action/fire thresholds, sampling pipe zone, flow status	Above/below rack, ceiling

DCIM TELEMETRY // CRITICAL DATA POINTS

What DCIM needs at the telemetry level — and why

A DCIM platform aggregates thousands of data points into actionable intelligence. But not all data is equal. The points below are the most critical telemetry a DCIM consumes: the data that drives capacity decisions, triggers alarms, calculates efficiency metrics, and keeps the facility running.

POWER

ELECTRICAL TELEMETRY

FACILITY-LEVEL (MV METERING)

Total facility kW — real-time total site power; numerator for PUE
kWh (cumulative) — energy billing, carbon footprint, trend analysis
Power factor — utility penalty if PF < 0.95; indicates harmonic issues
Peak demand (15-min avg) — drives utility demand charges; capacity trigger
THD (voltage & current) — harmonic distortion from switch-mode PSUs; transformer derating
Frequency (Hz) — grid stability indicator; triggers generator start

IT LOAD (RACK / PDU LEVEL)

Total IT kW — denominator for PUE; sum of all rack PDU readings
Per-rack kW — capacity utilization per position; stranding detection
Per-outlet amps — breaker trip risk; load balancing across phases
UPS load % — headroom for transients; triggers capacity alarm at 80%
UPS battery SOC & runtime — ride-through availability; replacement scheduling
Generator run hours & fuel % — maintenance scheduling; fuel delivery trigger

PUE CALCULATION (REAL-TIME)

PUE = Total Facility Power (MV meter) ÷ IT Load Power (sum of rack PDU kW)

Trended at 15-min intervals. Dashboard target: 1.10–1.20 for liquid-cooled AI facilities.
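
The PUE formula above can be sketched in a few lines. This is a minimal illustration, assuming the MV meter reading and the per-rack PDU readings are already available in kW; the function name `pue` and the sample figures are hypothetical.

```python
def pue(total_facility_kw, rack_pdu_kw):
    """PUE = total facility power / IT load power (sum of rack PDU readings)."""
    it_kw = sum(rack_pdu_kw)
    if it_kw <= 0:
        raise ValueError("IT load must be positive to compute PUE")
    return total_facility_kw / it_kw
```

For instance, 100 racks drawing 100 kW each against an 11,500 kW facility reading gives a PUE of 1.15, inside the 1.10–1.20 dashboard target quoted above.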

THERMAL

COOLING & ENVIRONMENTAL TELEMETRY

ENVIRONMENTAL (DATA HALL)

Rack inlet temp (per-rack) — ASHRAE recommended range: 64–81°F (18–27°C); alarms at 82°F; GPU throttling risk above 85°F
Rack exhaust temp — ΔT across rack = proxy for load; abnormal ΔT = airflow issue
Supply air temp (CRAH output) — control variable for CRAH PID loop
Return air temp (CRAH intake) — determines cooling demand and valve modulation
Humidity (%RH) — ASHRAE recommends 8–60% RH; too low = ESD risk; too high = condensation
Dew point — condensation risk for cold pipes; critical for liquid cooling environments

COOLING PLANT

ChW supply/return temp — primary control target; ±0.5°F stability critical for AI loads
ChW ΔT — design ΔT (13°F typ.); low ΔT syndrome wastes energy and reduces chiller capacity
ChW ΔP (header) — secondary pump VFD control variable; reset based on valve positions
Chiller loading % & COP — staging trigger; efficiency optimization
CW supply/return temp — cooling tower fan speed control variable
Wet-bulb temp (outdoor) — economizer enable/disable; cooling tower approach calculation

CDU / LIQUID COOLING

IT loop supply/return temp — GPU thermal margin; alarm at >50°C supply
IT loop flow rate (GPM) — low flow = insufficient heat removal; pump fault indicator
IT loop ΔP — filter clogging indicator; pump curve verification
Coolant conductivity (μS/cm) — DI water quality; >5 μS/cm = resin exhausted
Leak detection status — zone/distance; highest-priority alarm in liquid-cooled DCs

MECHANICAL EQUIPMENT STATUS

Pump/fan VFD speed (%) — efficiency tracking; affinity law verification
Motor current (A) — overload detection; bearing failure early warning
Valve position (%) — hunting detection; DP reset input (most-open valve algorithm)
Filter ΔP — replacement scheduling; airflow restriction alarm
Vibration (mm/s) — rotating equipment health; bearing/impeller failure prediction
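
The "affinity law verification" mentioned for VFD speed rests on the ideal pump and fan affinity laws: flow scales with speed, head with speed squared, and shaft power with speed cubed. A minimal sketch follows, assuming an ideal machine with fixed impeller diameter and ignoring static head; `affinity_scaling` is a hypothetical helper name.

```python
def affinity_scaling(speed_fraction):
    """Ideal affinity-law ratios relative to full speed (speed_fraction in 0..1)."""
    return {
        "flow": speed_fraction,        # Q is proportional to N
        "head": speed_fraction ** 2,   # H is proportional to N^2
        "power": speed_fraction ** 3,  # P is proportional to N^3
    }
```

At 80% speed the ideal power ratio is about 0.51, which is why trimming pump and fan speed is such a strong efficiency lever; a measured kW far above that ratio is worth investigating.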

SAFETY

LIFE SAFETY & SECURITY TELEMETRY

FIRE / SMOKE

● VESDA smoke level (alert/action/fire)
● Fire panel zone status
● Suppression system armed/discharged
● Pre-action valve status
● EPO button status (armed/tripped)

WATER / LEAK

● Leak detection cable alarm + location
● Drip pan water-sense contacts
● CDU reservoir level
● Makeup water flow (unexpected = leak)
● Under-floor flood sensors

PHYSICAL SECURITY

● Door contacts (open/closed/forced)
● Access control events (badge in/out)
● Camera feed status (online/offline)
● Motion detection zones
● Mantrap interlock status

TELEMETRY ARCHITECTURE — DATA RATES & STORAGE

Data Category	Typical Poll Rate	Historian Deadband	Retention	Why It Matters
Power (kW, A, V)	1–5 sec	1–2%	3+ years	PUE, billing, capacity trending, fault detection
Temperature	5–15 sec	0.5°F / 0.3°C	1–3 years	Thermal compliance, SLA, cooling optimization
Flow / Pressure	5–30 sec	2–5%	1–2 years	Pump performance, filter health, balancing
Equipment status (on/off)	On change	N/A (digital)	5+ years	Runtime tracking, PM scheduling, failure analysis
Alarms & events	On change	N/A	7+ years	Root cause analysis, compliance audit trail
Security / access	On event	N/A	1–7 years	Compliance (SOC 2, ISO 27001), incident response

A 1,000-rack AI facility with ~30 points per rack + plant-level instrumentation generates 50,000–100,000 monitored points. At 5-second scan rates, that's 10,000–20,000 writes/second to the historian. Proper deadband configuration reduces actual storage volume by 80–90% without losing operationally significant changes.
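
The write-rate arithmetic and the effect of a deadband can be sketched as follows. This is a simplified illustration, assuming an absolute deadband compared against the last stored value; real historians often add time-based forcing and percentage deadbands, which are not modeled here. Function names are hypothetical.

```python
def writes_per_second(points, scan_seconds):
    """Raw historian write rate if every point is stored on every scan."""
    return points / scan_seconds


def deadband_filter(samples, deadband):
    """Store a sample only when it moves more than `deadband` from the last stored value."""
    stored = []
    for value in samples:
        if not stored or abs(value - stored[-1]) > deadband:
            stored.append(value)
    return stored
```

With 50,000 points at a 5-second scan the raw rate is 10,000 writes/second, matching the lower bound above. Applying a 0.5°F deadband to the sequence `[70.0, 70.1, 70.2, 70.6, 70.7, 71.3]` keeps only three of the six samples.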

SILICON

From Transistor to Tensor Core

Zoom in far enough and a GPU is just billions of switches — transistors patterned at TSMC's 4–3 nm class nodes.[src] Zoom out and they form a hierarchy purpose-built for one operation: matrix multiplication.

HIERARCHY

A GPU is a memory machine

Registers — per-thread, ~tens of KB, fastest path.
Shared memory (SRAM) — per-SM scratchpad, ~hundreds of KB, where FlashAttention tiles its work.[src]
HBM — on-package stacked DRAM. ~3–8 TB/s, 80–192 GB per GPU on Hopper / Blackwell class parts.[src][src]
CPU DRAM — DDR5, ~hundreds of GB/s, used as overflow.
NVMe + network — datasets, checkpoints, inter-node traffic.

For most inference workloads, throughput is bounded not by FLOPs but by how fast weights can be streamed from HBM into the SRAM-resident kernel.
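
A rough way to see the bandwidth bound: at batch size 1, each generated token requires reading every weight once, so the decode rate cannot exceed memory bandwidth divided by the weight footprint. The sketch below assumes exactly that simplification (no KV-cache traffic, no overlap limits, weights fitting in memory); the function name and the sample figures are illustrative assumptions.

```python
def decode_ceiling_tokens_per_s(params_billion, bytes_per_param, hbm_tb_per_s):
    """Upper bound on single-stream decode speed when weight streaming dominates."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return hbm_tb_per_s * 1e12 / weight_bytes
```

For a 70B-parameter model at 2 bytes per parameter and an assumed 3.5 TB/s of memory bandwidth, the ceiling is about 25 tokens/second per stream, regardless of how many FLOPs the chip can deliver. Batching raises aggregate throughput because one pass over the weights serves many requests.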

COMPARE

CPU · GPU · TPU · NPU

Class	Style	Strength	Limit
CPU	Latency, scalar	Branchy code	Few ops/cycle
GPU	SIMT, throughput	Dense matmul	Memory bandwidth
TPU	Systolic array	Big matmul, low overhead	Less flexible
NPU	On-device, INT8/4	Power-efficient inference	Limited memory

All four converge on the same insight: matrix multiplication is ~90% of the work, so dedicate the silicon to it.

NUMERICAL PRECISION // INTERACTIVE

Fewer bits, more throughput, more risk

Bits / element: 16

Relative throughput: 16×

Accuracy regime: Training-grade

Memory footprint: 2.0× smaller

Same exponent range as FP32; the modern training default.

FP8 E4M3/E5M2 are now standard for forward-pass training on Hopper / Blackwell.[src][src]

MATHEMATICS

Matrix Multiplication, All The Way Down

A neural network is, mechanically, a long composition of linear maps separated by simple non-linearities. During training, calculus tells us how to nudge each weight to reduce a scalar loss. That's it. The intelligence emerges from composition and scale.

FORWARD PASS

y = σ(W · x + b)

For each layer, multiply the input vector x by a learned weight matrix W, add a bias b, then apply a non-linearity σ, usually GELU or a gated variant such as SwiGLU in modern transformers.[src]
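
The layer equation can be written out directly. This is a dependency-free sketch for illustration only, assuming small Python lists rather than GPU tensors and using the exact GELU as one possible choice of σ; `dense_layer` and `gelu` are names chosen here.

```python
import math


def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def dense_layer(W, x, b, activation=gelu):
    """Compute y = activation(W . x + b) for a weight matrix given as a list of rows."""
    return [
        activation(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]
```

Passing `activation=lambda v: v` exposes the purely linear part: with `W = [[1, 2], [3, 4]]`, `x = [1, 1]`, and `b = [0.5, -0.5]` the pre-activation output is `[3.5, 6.5]`.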

BACKWARD PASS

Backpropagation in one breath

Compute loss L between prediction and target.
Apply the chain rule from output back to input, accumulating ∂L/∂W for every weight.
Update each weight: W ← W − η · ∂L/∂W.
Repeat for trillions of tokens.

Backpropagation is just the chain rule, executed efficiently as a reverse-mode automatic differentiation pass over the computational graph.
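
The four steps can be traced by hand on the smallest possible network. The sketch below assumes two scalar linear layers with no non-linearity and a squared-error loss, so each chain-rule factor is visible; it is a toy, not how a framework implements automatic differentiation.

```python
def train_two_layer(x, target, w1=0.5, w2=0.5, lr=0.1, steps=100):
    """Gradient descent on loss = (w2 * w1 * x - target)^2 using the chain rule."""
    for _ in range(steps):
        h = w1 * x                  # forward: first linear map
        y = w2 * h                  # forward: second linear map
        dL_dy = 2.0 * (y - target)  # derivative of (y - target)^2
        dL_dw2 = dL_dy * h          # chain rule: dy/dw2 = h
        dL_dh = dL_dy * w2          # propagate back through layer 2
        dL_dw1 = dL_dh * x          # chain rule: dh/dw1 = x
        w1 -= lr * dL_dw1           # W <- W - eta * dL/dW
        w2 -= lr * dL_dw2
    final_loss = (w2 * w1 * x - target) ** 2
    return w1, w2, final_loss
```

Calling `train_two_layer(1.0, 2.0)` drives the product `w1 * w2` toward the target of 2; the gradient for `w1` reuses `dL_dy`, which is the reuse that makes reverse-mode differentiation efficient.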

OPTIMIZER // INTERACTIVE

Loss Landscape Visualization

η (lr): 0.080

The vector field is a synthetic 2D loss; modern training uses AdamW with cosine learning-rate schedules and warmup. The real loss landscape lives in billions of dimensions, where most local minima behave similarly well.
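
A cosine learning-rate schedule with linear warmup can be sketched as below. This covers only the schedule, not AdamW itself, and the parameter names (`warmup_steps`, `peak_lr`, `min_lr`) are assumptions chosen for this illustration rather than any library's API.

```python
import math


def lr_at(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With `peak_lr=0.08`, 100 warmup steps, and 1,000 total steps, the rate climbs linearly to 0.08 by step 99, then follows half a cosine wave down to zero at step 1,000.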

TRANSFORMER ARCHITECTURE

Attention Is The Engine

Every modern frontier model (GPT, Claude, Gemini, Llama) is a stack of identical transformer blocks.[src] Each block does two things: lets every token look at every other token (attention), then transforms each token independently (MLP). Stack 60–120 of those, train on trillions of tokens, and you get an LLM.

TOKENIZER // BPE (DEMO)

Type to see the merges fire

·the·strawberry·weighs·30·grams.

▁ marks a word boundary (sentencepiece convention). Real BPE learns ~50k merges from corpus statistics.[src] Notice how rare words like “strawberry” fragment — that's why some models miscount its letters.
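
To make "learning merges from corpus statistics" concrete, here is a toy byte-pair-encoding trainer. It is a minimal sketch that assumes a tiny word list, character-level starting symbols, and a deterministic tie-break; production tokenizers work on bytes, handle word-boundary markers, and learn tens of thousands of merges.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Greedily learn merge rules: repeatedly fuse the most frequent adjacent pair."""
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=lambda p: (pairs[p], p))  # ties broken lexicographically
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[merge_pair(symbols, best)] += freq
        corpus = merged
    return merges


def encode(word, merges):
    """Tokenize a word by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Trained on `["low", "low", "lower", "lowest"]` with two merges, the frequent stem fuses into a single token while an unseen word such as `"lowly"` falls apart into `["low", "l", "y"]`, the same fragmentation effect described for rare words above.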

ATTENTION HEATMAP

Each token decides who to listen to

recency

Each row = a query token; brightness = how much it attends to each key token. Real heads in trained models specialize spontaneously into induction heads, name-mover heads, etc. Pattern shown is heuristic for clarity.[src]
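
The heatmap rows are the softmax weights of scaled dot-product attention. A single-head version can be sketched in plain Python as follows, assuming small list-of-list matrices and an optional causal mask; real implementations are batched, multi-headed, and run as fused GPU kernels.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, causal=True):
    """Return (outputs, weights); weights[i][j] is how much query i attends to key j."""
    d = len(Q[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        if causal:
            # A token may only look at itself and earlier positions.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w_j * v[c] for w_j, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights
```

Each row of `weights` sums to 1, and with the causal mask the first token can only attend to itself, so its row is `[1.0, 0.0, ...]`.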

BLOCK ANATOMY

One transformer block, repeated N times

RMSNorm instead of LayerNorm: fewer params, same stability.[src]
RoPE injects position into Q/K via rotation and extends cleanly to long contexts.[src]
GQA shares K/V heads across multiple Q heads to shrink the KV cache during inference.[src]
SwiGLU MLP is the de facto feed-forward in modern LLMs.[src]
MoE swaps the dense MLP for a router + many experts; only k experts fire per token.[src]
FlashAttention tiles attention into SRAM-resident blocks: same math, far less HBM traffic.[src]
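
The normalization commonly used in place of LayerNorm in these blocks is RMSNorm, which rescales by the root mean square without subtracting the mean. A minimal sketch, assuming a plain list as the activation vector and a learned per-channel gain; `rms_norm` is a name chosen for this illustration.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Scale x to unit root-mean-square, then apply a learned per-channel gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]
```

With unit gain the output always has a mean square of approximately 1, whatever the scale of the input. Unlike LayerNorm there is no mean subtraction and no bias term.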

FROM LOGITS TO WORDS

Sampling: turning a probability vector into text

The model outputs a probability distribution over the entire vocabulary for the next token. Temperature sharpens or flattens it; top-p truncates the long tail. Then we draw one token, append it, and feed the new sequence back in.

Temperature: 0.70 · Top-p (nucleus): 0.90

prompt: "The system processes the input data, and"

[Interactive demo: candidate next tokens for the prompt, shaded by probability.]

Brighter = higher probability; magenta = the token actually sampled. At temperature 0 the same prompt always picks the argmax.
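
Temperature scaling followed by top-p truncation can be sketched as below. This is an illustrative implementation over a plain list of logits, assuming the usual convention that temperature 0 means greedy argmax; the function name and defaults are choices made here, not a specific library's API.

```python
import math
import random


def sample_next(logits, temperature=0.7, top_p=0.9, rng=random):
    """Return the index of the sampled token."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy argmax
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most-probable tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # random.choices renormalizes the kept weights internally.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

Lower temperatures sharpen the distribution before truncation, so fewer tokens survive the top-p cut; passing a seeded `random.Random` instance as `rng` makes a run reproducible.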

TRAINING

Trillions of Tokens, Months of Wall Time

Pretraining is conceptually simple: predict the next token, average the loss across a batch, backprop, update. The hard parts are data quality, distributed orchestration, and not crashing for 90 days.[src]

SIM // CHINCHILLA COMPUTE

C ≈ 6 · N · D

Parameters (B): 70B
Training tokens (T): 15T
GPUs: 16,384
Sustained TFLOPs / GPU: 750

Total FLOPs: 6.3e+24
Wall time: 5.9 days
Energy: 2,333 MWh
Est. cost: $2.8M

Chinchilla-optimal D ≈ 20 × N. GPT-4 class is in the 1e25–1e26 FLOP regime.[src][src] Llama 3 405B used ~3.8e25 FLOPs.[src]
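
The simulator's arithmetic follows from C ≈ 6 · N · D. The sketch below reproduces the total FLOPs and wall time from the inputs shown; the energy figure additionally assumes roughly 1 kW of all-in power per GPU, which is an assumption made here because it matches the displayed 2,333 MWh, not a value stated in the source.

```python
def training_estimate(params, tokens, gpus, sustained_tflops_per_gpu, kw_per_gpu=1.0):
    """Return (total_flops, wall_days, energy_mwh) for a dense pretraining run."""
    total_flops = 6 * params * tokens                      # C ~ 6 * N * D
    cluster_flops_per_s = gpus * sustained_tflops_per_gpu * 1e12
    seconds = total_flops / cluster_flops_per_s
    wall_days = seconds / 86_400
    energy_mwh = gpus * kw_per_gpu * (seconds / 3600) / 1000  # kWh -> MWh
    return total_flops, wall_days, energy_mwh


def chinchilla_tokens(params):
    """Chinchilla rule of thumb: about 20 training tokens per parameter."""
    return 20 * params
```

`training_estimate(70e9, 15e12, 16_384, 750)` gives about 6.3e24 FLOPs and 5.9 days. By the 20× rule a 70B model would be compute-optimal near 1.4T tokens, so a 15T-token run is deliberately over-trained relative to Chinchilla.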

DISTRIBUTED PARALLELISM

How a forward pass shards across 16k GPUs

Data parallel — each GPU holds the full model, processes a different micro-batch, and all-reduces gradients at the end.
Tensor parallel — slice individual matmuls across GPUs along the hidden dim (Megatron-style).[src]
Pipeline parallel — assign different layer ranges to different GPUs; micro-batches flow through like an assembly line.
FSDP / ZeRO — shard parameters, gradients, and optimizer states across the data-parallel group; gather just-in-time.[src][src]
Expert parallel — for MoE, route tokens to expert shards living on different GPUs.[src]

Real frontier runs combine all five along orthogonal axes ("3D" or "4D" parallelism) to keep every GPU saturated.

DATA

Crawl → dedup → filter → mix

Web crawls (Common Crawl), code (GitHub), books, math, multilingual corpora. Deduplicated at document and substring level, quality-filtered by classifiers, then mixed by domain. Quality > quantity past a point.[src]

POST-TRAIN

SFT → RM → PPO / DPO

Supervised fine-tuning on demonstrations, then preference data shapes a reward model, then PPO or — increasingly — Direct Preference Optimization aligns the policy.[src][src] Anthropic's Constitutional AI automates the preference signal.[src]

EVAL

Benchmarks vs. reality

MMLU, GPQA, SWE-bench, HumanEval, ARC-AGI.[src][src][src] Watch for contamination: if eval data leaked into pretraining, the score is meaningless. Real-world capability often lags benchmarks.

INFERENCE

Serving the Trained Mind

Inference has two phases. Prefill processes your prompt in parallel and is compute-bound. Decode generates one token at a time, streaming weights from HBM on every step, so it is memory-bandwidth bound. Modern serving stacks (vLLM, TensorRT-LLM, SGLang) exist to keep both phases saturated through continuous batching and paged KV cache management.[src]

SIM // LATENCY BUDGET

Configure a deployment

Model size (B params): 70B
Input context (tokens): 2,000
Output tokens: 500
Concurrent requests (batch): 8
Precision
GPU

GPUs needed
TTFT: 1.42 s
Throughput: 459.4 tok/s
KV cache: 6.6 GB
$ / 1M output tok: $8.35

Specs sourced from NVIDIA H100 / Blackwell briefs[src][src] and HBM3e standard.[src] FLOPs assume 40% MFU; HBM at 60% effective.
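
The KV cache figure can be reproduced with a short calculation. The sketch assumes a 70B-class model shape with grouped-query attention (80 layers, 8 KV heads, head dimension 128) and 16-bit cache entries; those shape values are assumptions typical of open 70B models, not parameters stated by the simulator.

```python
def kv_cache_gb(layers, kv_heads, head_dim, tokens_per_request, batch, bytes_per_value=2):
    """KV cache size in GB: keys and values for every layer, head, and token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V
    return per_token_bytes * tokens_per_request * batch / 1e9
```

With 2,000 input plus 500 output tokens and 8 concurrent requests, `kv_cache_gb(80, 8, 128, 2500, 8)` comes to about 6.6 GB. Without grouped-query attention (64 KV heads instead of 8) the same deployment would need eight times as much.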

SERVING TRICKS

Continuous batching — new requests join an in-flight batch each decode step.[src]
Paged KV cache — the KV cache is stored in fixed-size pages, like OS virtual memory.
Speculative decoding — a tiny draft model proposes K tokens; the big model verifies them in one pass.[src]
Prefix caching — reuse KV across requests sharing a system prompt.

QUANTIZATION

Post-training quantization to 4-bit (e.g. GPTQ, AWQ) shrinks weights 4× relative to FP16 and can raise memory-bound decode throughput by up to roughly 4×, with single-digit % accuracy loss on most tasks.[src][src]
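
To show what 4-bit storage means mechanically, here is plain round-to-nearest symmetric quantization of one weight group. This is deliberately the simplest scheme, named as such; calibrated methods add error compensation or activation-aware scaling on top of this basic idea, and none of that is modeled here.

```python
def quantize_int4(weights):
    """Map floats to signed 4-bit integers in [-8, 7] with one shared scale."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # fall back to 1.0 for all-zero groups
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]
```

For the group `[0.7, -0.3, 0.0, 0.12]` the integers are `[7, -3, 0, 1]`; the last weight comes back as roughly 0.10 instead of 0.12, which is the kind of rounding error that accumulates into the accuracy loss mentioned above.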

ON-DEVICE

NPUs in modern phones and laptops run 3–8B models at INT4 or INT8 precision. Local means private, low-latency, and free per query, at the cost of capability. Apple Intelligence, Phi-class, Gemma 2B all live here.

AGENTS

Loops, tools, retrieval

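def run_agent(model, messages, tools, run, tool_result):
    # tools: e.g. [search, code, ...]; run and tool_result are supplied by the host app.
    # Loop until the model replies without requesting any tool calls.
    while True:
        response = model(messages, tools=tools)
        if not response.tool_calls:
            return response.text  # final answer ends the loop
        for call in response.tool_calls:
            result = run(call)  # execute the requested tool
            messages.append(tool_result(result))  # feed the result back into context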

Agents are LLMs in a control loop, calling functions and reading the results back into context. RAG is the same shape with one tool: vector search over your documents. MCP standardizes how tools are exposed to models.
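
The "one tool" in a retrieval setup can be sketched as a cosine-similarity search. The example assumes documents have already been embedded into vectors by some model; the toy two-dimensional vectors and the `vector_search` name are hypothetical, and a real deployment would use an approximate nearest-neighbor index rather than a full sort.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_search(query_vec, index, k=2):
    """Return the k document texts whose embeddings are closest to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _vec in ranked[:k]]
```

The returned passages are appended to the model's context as a tool result, exactly like any other tool call in the loop above.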

CONSIDERATIONS

Considerations & Open Questions

Modern AI is genuinely useful and genuinely fragile. The failures are not bugs to be patched — they are direct consequences of the training objective and the architecture.

FAILURE MODES

Hallucination. The model is trained to maximize next-token likelihood, not truth. When a prompt falls in a region where the training data offered little evidence, it interpolates plausibly. There is no internal "I don't know" signal unless one was explicitly trained in.
Prompt injection. The model cannot distinguish instructions from data. Anything that reaches its context — a webpage, an email, a tool result — can hijack behavior.[src]
Jailbreaks. Safety training is a thin shell over a much larger base model. Adversarial prompts find the seams.
Distribution shift. Performance degrades on inputs unlike the training distribution — long context, niche domains, low-resource languages.
Reward hacking. RLHF optimizes for human-rater approval, which is correlated with — but not identical to — being correct or helpful.

EXTERNALITIES

Energy. A single frontier training run consumes ~10–50 GWh — comparable to a small town for a year. Aggregate inference is now larger than training for major providers.[src]
Water. Evaporative cooling can use 1–2 L per kWh of IT load; closed-loop direct-to-chip designs use far less. Site choice matters more than model choice.
Grid impact. Hyperscalers are now signing multi-gigawatt PPAs and reviving nuclear capacity to keep up.[src]
Synthetic data feedback. If models train on outputs of earlier models, distributions narrow and rare phenomena vanish — "model collapse."[src]

OPEN QUESTIONS

What we still don't know

Interpretability.

We can read individual activations and trace small circuits, but we cannot, for any frontier model, fully explain why a given output was produced.

Alignment.

Specifying what we want a powerful optimizer to do — precisely enough that it won't satisfy the letter while violating the spirit — remains unsolved.

Scaling limits.

Loss continues to fall predictably with compute, but capability jumps are discontinuous and hard to forecast. Data may bind before compute does.

Generalization vs. memorization.

How much of model behavior is learned algorithm vs. retrieved training data is an active research question with real legal and scientific stakes.

QUICK REFERENCE

Quick Reference

Key acronyms and critical concepts across infrastructure, controls, silicon, and AI systems — organized for rapid review and team learning.

ACRONYMS // A–Z

Every abbreviation you need to know

Abbr	Full Name	Quick Note
PUE	Power Usage Effectiveness	Total facility power ÷ IT power. Target: 1.10–1.20 for hyperscale.
TDP	Thermal Design Power	Max sustained chip power draw (watts). Sets cooling requirement.
UPS	Uninterruptible Power Supply	Battery backup bridging 10–15 s gap until generators reach speed.
PDU	Power Distribution Unit	Distributes power from facility feed to rack-level outlets.
ATS	Automatic Transfer Switch	Switches load from utility to generator on failure (100–500 ms).
CDU	Coolant Distribution Unit	Heat exchanger between facility water loop and IT liquid loop.
CRAC	Computer Room Air Conditioning	DX refrigerant-based cooling unit. Good up to ~15 kW/rack.
CRAH	Computer Room Air Handler	Chilled-water based. More efficient at scale than CRAC.
EPO	Emergency Power Off	Kills power to entire zone. Code-required, controversial (nuisance trips).
VFD	Variable Frequency Drive	Controls motor speed (pumps, fans). Key PUE optimization lever.
EPMS	Electrical Power Monitoring System	Network of power meters — V, I, kW, PF, THD at every distribution point.
BMS	Building Management System	Supervisory control for HVAC, cooling plant, environmental monitoring.
SCADA	Supervisory Control and Data Acquisition	Industrial control + HMI layer. Single pane of glass for operators.
DCIM	Data Center Infrastructure Management	Unified view of assets, capacity, power chains, and environment.
PLC	Programmable Logic Controller	Deterministic real-time controller. Scan cycle typically 1–20 ms.
DDC	Direct Digital Control	Microprocessor-based HVAC loop control (replaces pneumatic).
HMI	Human-Machine Interface	Operator screens: one-line diagrams, alarm dashboards, schematics.
SOO	Sequence of Operations	The spec document PLC programmers code from and Cx agents test against.
Cx	Commissioning	Systematic verification: L1 factory → L2 install → L3 functional → L4 integrated → L5 seasonal.
IST	Integrated Systems Testing	L4 commissioning: multi-system failure testing with live IT load.
OPC-UA	OPC Unified Architecture	Modern, secure, platform-independent PLC-to-SCADA protocol.
BACnet	Building Automation and Control Networks	ASHRAE/ISO standard for building automation interoperability.
MQTT	Message Queuing Telemetry Transport	Lightweight pub/sub for IIoT. Sparkplug B adds standardized namespace.
SNMP	Simple Network Management Protocol	UDP-based monitoring for UPS, PDU, CRAC. v3 adds encryption.
GPU	Graphics Processing Unit	Massively parallel processor. Primary AI compute engine.
HBM	High Bandwidth Memory	3D-stacked DRAM via TSVs. 3–8 TB/s bandwidth per GPU.
SM	Streaming Multiprocessor	GPU execution unit containing CUDA cores + Tensor Cores.
FLOPS	Floating Point Operations/Second	Compute throughput. B200: ~2.25 PFLOPS dense at BF16, ~4.5 PFLOPS dense at FP8.
MFU	Model FLOPs Utilization	Achieved model FLOP/s ÷ hardware peak FLOP/s. Good training: 30–45%.
BF16	Brain Floating Point 16	16-bit, same exponent range as FP32. Default training precision.
FP8	8-bit Floating Point	E4M3/E5M2 variants. 2× Tensor Core throughput vs. BF16.
NIC	Network Interface Card	400/800 Gb/s per GPU in AI clusters.
SFT	Supervised Fine-Tuning	Post-training on curated (prompt, response) pairs.
RLHF	Reinforcement Learning from Human Feedback	Reward model + PPO to align outputs with human preference.
DPO	Direct Preference Optimization	Simpler RLHF alternative — no separate reward model needed.
FSDP	Fully Sharded Data Parallel	Shards parameters, gradients, and optimizer state across GPUs. Each rank gathers params just-in-time.
DP	Data Parallel	Full model copy per GPU, AllReduce gradients.
TP	Tensor Parallel	Splits matmuls across GPUs along hidden dim. Needs NVLink.
PP	Pipeline Parallel	Different layers on different GPUs. Assembly line approach.
RAG	Retrieval-Augmented Generation	Inject retrieved docs into prompt. Reduces hallucination.
KV Cache	Key-Value Cache	Cached K/V from prior tokens. Grows linearly with sequence length.
TTFT	Time to First Token	Prefill latency. Users perceive this as responsiveness.
GQA	Grouped Query Attention	Multiple Q heads share K/V heads → smaller KV cache.
MoE	Mixture of Experts	Router selects top-k of N expert MLPs per token. More params, same FLOPs.
BPE	Byte Pair Encoding	Subword tokenizer. Iteratively merges frequent adjacent pairs.

KEY CONCEPTS // ESSENTIALS

Core definitions

AI Factory

NVIDIA's term: a purpose-built facility that manufactures intelligence (tokens), not just stores data. Manages entire AI lifecycle: data ingestion → training → fine-tuning → high-volume inference. Positioned as national-scale critical infrastructure.

2N Redundancy

Two independent power paths (A+B), each able to carry 100% of the load. Typical of Tier IV fault-tolerant designs; Tier III requires at least N+1 with concurrently maintainable paths.

Purdue Model

ISA-95 network segmentation: Level 0 (physical) → Level 5 (enterprise). IT/OT DMZ between Level 3 and 4.

PID Loop

Proportional-Integral-Derivative control. Kp (reacts to error), Ki (eliminates steady-state error), Kd (dampens oscillation).
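
The three terms can be sketched as a discrete controller. This is a minimal illustration, assuming a fixed time step and a toy first-order plant; the gains, the `PID` class, and the `simulate` helper are hypothetical and not tuned for any real loop.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Discrete PID controller with a fixed time step dt (seconds)."""
    kp: float
    ki: float
    kd: float
    dt: float
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, setpoint: float, measurement: float) -> float:
        error = setpoint - measurement
        self.integral += error * self.dt                  # Ki: accumulated past error
        derivative = (error - self.prev_error) / self.dt  # Kd: rate of change of error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(steps: int = 600) -> float:
    """Drive a toy thermal lag from 25.0 toward a setpoint of 18.0; return the final value."""
    pid = PID(kp=2.0, ki=0.5, kd=0.1, dt=0.1)
    ambient, tau = 25.0, 5.0  # hypothetical plant: settles at ambient + output
    value = ambient
    for _ in range(steps):
        output = pid.update(setpoint=18.0, measurement=value)
        # First-order lag: the value relaxes toward (ambient + output) with time constant tau.
        value += pid.dt * (ambient + output - value) / tau
    return value


print(round(simulate(), 3))
```

With Ki set to zero the same loop settles short of the setpoint, which is the steady-state error the integral term exists to remove.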

Scan Cycle

PLC execution: read inputs → execute program → write outputs → comms. Repeats every 1–20 ms.

IEC 61131-3

PLC programming standard: Ladder Diagram, Structured Text, Function Block Diagram, Instruction List, SFC.

Alarm Rationalization

ISA-18.2 best practice: <1 actionable alarm per operator per 10 min. Prevents alarm fatigue.

Hot/Cold Aisle Containment

Physical separation preventing hot exhaust from mixing with cold supply. Without it, 30–40% cooling wasted.

Free Cooling (Economizer)

Using outside air or raised chiller setpoints when ambient is cool enough. Saves 20–40% cooling energy.

Direct-to-Chip Liquid Cooling

Cold plates on CPU/GPU die. Warm water (30–45°C) enables year-round free cooling. Required above ~50 kW/rack.

ASHRAE TC 9.9

Thermal guidelines: A1 class recommends 18–27°C dry-bulb at rack inlet. Outside range = accelerated failures.

AllReduce

Collective op: every GPU contributes a tensor, all are summed, every GPU gets the result. Ring AllReduce moves about 2× the tensor size per GPU regardless of GPU count, so every link stays busy.
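
A ring AllReduce can be simulated in a few lines. The sketch below assumes each rank's tensor is already split into as many chunks as there are ranks (single numbers here for brevity); `ring_allreduce` is an illustrative name, not a library call. It runs the two standard phases: reduce-scatter, then all-gather.

```python
def ring_allreduce(tensors):
    """Simulate a sum AllReduce over len(tensors) ranks arranged in a ring.

    tensors[rank] is a list of n chunks, where n is the number of ranks.
    Returns the per-rank data after the collective completes.
    """
    n = len(tensors)
    data = [list(t) for t in tensors]  # data[rank][chunk]

    # Phase 1, reduce-scatter: after n-1 steps rank r holds the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        # Snapshot all sends first so every rank sends and receives "simultaneously".
        sends = [(r, (r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] += value

    # Phase 2, all-gather: each completed chunk travels once around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] = value
    return data


result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result[0])
```

Each rank sends 2 × (n − 1) chunks of size 1/n of the tensor, which is where the "about 2× the tensor size per GPU" figure comes from.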

FlashAttention

IO-aware attention: tiles QKV into SRAM-sized blocks. Avoids N² HBM materialization. 2–4× speedup.

Chinchilla Scaling

Compute-optimal: D ≈ 20 × N tokens. 70B model → 1.4T tokens. FLOPs ≈ 6ND.
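
The two rules of thumb combine into a quick budget calculation. This is a back-of-envelope sketch; the function names are illustrative, and the cluster size and per-GPU peak in the usage lines are assumptions chosen only to show the arithmetic.

```python
def chinchilla_budget(n_params: float, tokens_per_param: float = 20.0):
    """Return (training tokens D, training FLOPs C) from D = 20 * N and C = 6 * N * D."""
    d_tokens = tokens_per_param * n_params
    flops = 6.0 * n_params * d_tokens
    return d_tokens, flops


def training_days(flops: float, n_gpus: int, peak_flops_per_gpu: float, mfu: float) -> float:
    """Wall-clock days for a given cluster size, per-GPU peak FLOP/s, and MFU."""
    sustained = n_gpus * peak_flops_per_gpu * mfu
    return flops / sustained / 86_400


tokens, flops = chinchilla_budget(70e9)  # 70B parameters
print(f"{tokens:.2e} tokens, {flops:.2e} FLOPs")
# Assumed cluster: 4,096 GPUs at 1e15 peak FLOP/s each, 40% MFU.
print(f"{training_days(flops, 4096, 1e15, 0.40):.1f} days")
```

For a 70B model this gives 1.4T tokens and about 5.9 × 10²³ FLOPs; the day count scales inversely with GPU count and MFU.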

Continuous Batching

New requests join in-flight batch at each decode step. Keeps GPU utilization high.

Speculative Decoding

Small draft model proposes K tokens, large model verifies in one pass. ~2–3× speedup, same quality.

Residual Connection

output = layer(x) + x. Gradient highway enabling 100+ layer training. Without it, gradients vanish.

Softmax Temperature

τ applied to logits: softmax(logits/τ). Low τ → deterministic. High τ → creative. τ→0 = argmax.
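
A minimal sketch of temperature scaling, using only the standard library. The logits are made-up values; the function name is illustrative.

```python
import math


def softmax_with_temperature(logits, tau=1.0):
    """Return softmax(logits / tau) as a list of probabilities."""
    if tau <= 0:
        raise ValueError("tau must be positive; take argmax for the tau -> 0 limit")
    scaled = [x / tau for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
for tau in (0.1, 1.0, 10.0):
    print(tau, [round(p, 3) for p in softmax_with_temperature(logits, tau)])
```

At low τ nearly all probability lands on the largest logit; at high τ the distribution flattens toward uniform.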

VESDA

Very Early Smoke Detection Apparatus. Laser-based aspirating detection: samples room air continuously and detects smoke particles well before they become visible.

Clean Agent Suppression

FM-200, Novec 1230 — gaseous fire suppression for IT rooms. Leaves no residue.

Digital Twin

Virtual replica with real-time sensor data. NVIDIA Omniverse for CFD, what-if scenarios, predictive maintenance.

QUICK MATH // BACK-OF-ENVELOPE

Key numbers to know

Power Chain

• Utility: 12.47–34.5 kV (medium voltage)
• Step-down: 480V (US) / 415V (intl)
• GPU rack (NVL72): ~120 kW
• Single B200 GPU: ~1,000W TDP
• 10k GPU cluster: ~12–15 MW facility at ~1.2–1.5 kW all-in per GPU; ~17 MW of IT load alone at NVL72 density (~1.67 kW per GPU)

Cooling

• Air cooling limit: ~15–20 kW/rack
• Liquid cooling: 50–120+ kW/rack
• ASHRAE A1 inlet: 18–27°C
• Typical ChW supply: 42°F / 5.5°C
• PUE 0.1 improvement at 100 MW ≈ $6M/yr saved
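
The PUE savings figure can be reproduced with simple arithmetic. The sketch below assumes the 100 MW is IT load and an electricity price of $0.07/kWh; the price is an assumption, not a number from this guide.

```python
HOURS_PER_YEAR = 8760


def annual_pue_savings_usd(it_load_mw, pue_before, pue_after, usd_per_kwh):
    """Yearly cost of the facility overhead removed by a PUE improvement."""
    overhead_saved_kw = it_load_mw * 1000 * (pue_before - pue_after)
    return overhead_saved_kw * HOURS_PER_YEAR * usd_per_kwh


savings = annual_pue_savings_usd(it_load_mw=100, pue_before=1.30, pue_after=1.20, usd_per_kwh=0.07)
print(f"${savings / 1e6:.1f}M per year")
```

A 0.1 PUE change on 100 MW of IT load is 10 MW of continuous overhead, so the result scales linearly with both the load and the local tariff.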

AI Training

• Chinchilla: D ≈ 20N tokens
• FLOPs ≈ 6 × N × D
• Good MFU: 30–45%
• NVLink 5: ~1.8 TB/s per GPU
• HBM3e: 3–8 TB/s bandwidth

Q&A // INFRASTRUCTURE & CX

Common questions — Infrastructure

"Walk me through the power distribution from utility to server."

Utility power arrives at 12.47–34.5 kV medium voltage. Main switchgear meters and routes it through an ATS (automatic transfer switch) that can flip to generator feed. Step-down transformers bring it to 480V (US) or 415V (international). From there it splits into redundant A/B paths through UPS systems (battery backup for the 10–15 second generator start gap). UPS output feeds floor PDUs (power distribution units) with static transfer switches. Floor PDUs step down to rack-level bus bars or rack PDUs, which distribute to individual servers via dual power supplies — so each server draws from both the A and B path simultaneously. Every link in this chain has EPMS power meters reporting voltage, current, kW, power factor, and THD in real time.

"What's the difference between 2N and N+1 redundancy?"

N+1: You have N modules needed to carry full load, plus 1 spare. If one fails, the spare picks up the slack. Cheaper, but a second failure means downtime. Example: 4 UPS modules where you only need 3, so you can lose one.

2N: Two completely independent, fully redundant power paths (A and B). Each path alone can carry 100% of the load. There is no shared component: separate utility feeds, separate transformers, separate UPS, separate PDUs. If the entire A side goes down, B carries everything. This is the usual approach for Tier IV fault-tolerant facilities (Tier III requires at least N+1 with concurrently maintainable paths) and is a common choice for AI training clusters, where a power glitch can kill a multi-day training run. 2N+1 adds an extra spare module per side for even higher reliability.

"How does commissioning L4/IST differ from L3?"

L3 (Functional) tests individual systems in isolation against the Sequence of Operations. You verify one chiller starts, ramps, alarms, and shuts down correctly. You test one UPS on bypass. One generator on load. Each system is proven independently.

L4/IST (Integrated Systems Testing) tests multi-system failure scenarios with live IT load (or load banks). You simulate utility loss and verify the entire chain responds: ATS transfers, generators start and sync, UPS bridges the gap, BMS adjusts cooling. Then you cascade failures — chiller trip under load → temperature rise → BMS starts lag chiller → if it fails, load shedding kicks in. L4 proves the systems work together under stress, not just individually. Every deviation is a tracked deficiency requiring resolution before the facility goes live.

"What happens when a chiller trips under full load?"

Immediate: chilled water supply temperature begins rising because remaining capacity can't match the heat load. The BMS detects the trip and initiates the lag chiller start sequence (anti-recycle timer permitting — typically 300 seconds). During this gap, CHW supply temp may rise 3–5°F. CRAH/CDU units see reduced delta-T and increase fan/pump speeds to compensate. If the lag chiller also fails or takes too long, rack inlet temps cross the high-temp alarm threshold (typically 95°F/35°C), triggering a warning alarm. If temps continue rising, the critical alarm (100°F/38°C) triggers IT load shedding — the BMS or DCIM shuts down non-essential compute to reduce heat load. In a liquid-cooled AI hall at 120 kW/rack, you have about 2–3 minutes of thermal runway before throttling begins, versus 10–15 minutes in a traditional 10 kW/rack enterprise hall. This is why proper chiller staging logic, anti-recycle bypass, and N+1 cooling capacity are non-negotiable.

"How do you optimize PUE in a liquid-cooled facility?"

Six primary levers: (1) Raise chilled water supply temperature — warm-water cooling (30–45°C) enables free cooling via dry coolers year-round in most climates, eliminating compressor energy. (2) VFDs on all pumps and fans — variable speed drives match motor speed to actual load instead of running at 100% constant. (3) Eliminate CRAH/CRAC units — liquid cooling removes 60–80% of heat at the chip, so air-side cooling can be minimal or eliminated. (4) Economizer modes — use outside air or raised condenser water setpoints when ambient conditions allow. (5) Higher voltage distribution (415V vs 208V) — reduces I²R distribution losses. (6) Efficient UPS — ECO mode, lithium-ion batteries, and right-sizing UPS capacity to avoid running at low load factors. A well-designed liquid-cooled AI facility can achieve PUE 1.03–1.10; each 0.1 improvement at 100 MW saves ~$6M/year.

"Explain the Purdue Model and why OT networks must be segmented."

The Purdue Model (ISA-95) defines 6 levels: Level 0 (physical process — pipes, wires, air), Level 1 (field I/O — sensors, drives, VFDs), Level 2 (control — PLCs, DDC, BMS controllers), Level 3 (site operations — SCADA, historian), Level 4 (business — DCIM, MES, IT systems), Level 5 (enterprise — ERP, email, cloud). A strict IT/OT DMZ with firewalls sits between Level 3 and Level 4. OT networks prioritize availability and safety — a PLC controlling a fire suppression interlock must never be disrupted by a software update, a vulnerability scan, or an IT policy change. IT networks prioritize confidentiality. Mixing them means an IT compromise (phishing, ransomware) could reach PLCs that control physical safety systems. Data flows OT→IT only, via historians, OPC-UA gateways, or MQTT brokers in the DMZ. Security standard: IEC 62443. Never expose PLCs directly to the IT or internet network.

"What protocols do power meters use? How about HVAC?"

Power meters: Modbus TCP (Ethernet, port 502) or Modbus RTU (RS-485 serial). Some advanced meters also expose data via SNMP or REST APIs. Modbus uses a simple register-based addressing scheme — FC03 reads holding registers, FC04 reads input registers. No native security, so it relies on network segmentation.
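
To make the register-based scheme concrete, here is a sketch that builds and parses raw Modbus TCP frames with the standard library only. The function names are illustrative, and the register address and values in the usage lines are made up; a real meter's register map comes from its vendor documentation.

```python
import struct


def build_read_request(transaction_id, unit_id, start_address, quantity, function_code=0x03):
    """Build a Modbus TCP request for FC03 (holding) or FC04 (input) registers."""
    if function_code not in (0x03, 0x04):
        raise ValueError("only FC03 and FC04 are handled in this sketch")
    if not 1 <= quantity <= 125:
        raise ValueError("a single read may request 1 to 125 registers")
    pdu = struct.pack(">BHH", function_code, start_address, quantity)
    # MBAP header: transaction id, protocol id (0 = Modbus), byte count of unit id + PDU, unit id.
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu


def parse_read_response(frame):
    """Return the list of 16-bit register values carried in a read response."""
    _tid, _proto, _length, _unit, function_code, byte_count = struct.unpack(">HHHBBB", frame[:9])
    if function_code & 0x80:
        # In an exception response the byte after the function code is the exception code.
        raise ValueError(f"Modbus exception code {byte_count}")
    return list(struct.unpack(f">{byte_count // 2}H", frame[9:9 + byte_count]))


request = build_read_request(transaction_id=1, unit_id=1, start_address=0, quantity=2)
print(request.hex())
print(parse_read_response(bytes.fromhex("00010000000701030400e60001")))
```

All fields are big-endian. Nothing in the frame is authenticated or encrypted, which is why the protocol depends on network segmentation.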

HVAC / BMS: BACnet IP (UDP port 47808) is the ASHRAE/ISO standard for building automation. Data is organized as objects (Analog Input, Analog Output, Binary Value, Schedule, Trend Log) with properties (Present-Value, Status-Flags). Supports COV (Change of Value) subscriptions. Some legacy systems use LON or proprietary serial. Modern integration uses OPC-UA as the unifying layer, with gateways translating BACnet↔OPC-UA and Modbus↔OPC-UA at system boundaries.

Q&A // AI & ML SYSTEMS

Common questions — AI/ML

"Why can't you air-cool a GPU training cluster?"

Physics. A single GB200 NVL72 rack dissipates ~120 kW in a 0.6 m² footprint. Air cooling works by blowing cold air across heat sinks and exhausting hot air, but air carries very little heat: its specific heat is ~1 kJ/kg·K vs ~4.2 kJ/kg·K for water, and it is roughly 800× less dense, so per unit volume water carries about 3,500× more heat. To remove 120 kW with air, you'd need airflow volumes that are impractical to route through a single rack; the fans alone would consume massive power and create deafening noise. The practical ceiling for air cooling is ~15–20 kW/rack. Above 50 kW/rack, direct-to-chip liquid cooling is mandatory: cold plates mounted on each GPU/CPU transfer heat to a water loop via a CDU (Coolant Distribution Unit). The CDU rejects heat to the building chilled water plant. Liquid removes 60–80% of the server heat, leaving only residual air cooling for memory, storage, and fans.
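
The air-versus-water gap follows from Q = ṁ · c_p · ΔT. The sketch below compares the flow each fluid needs to carry 120 kW; the temperature rises (15 K for air, 10 K for water) and the fluid properties are round-number assumptions.

```python
def coolant_flow(heat_kw, cp_kj_per_kg_k, density_kg_per_m3, delta_t_k):
    """Return (mass flow in kg/s, volumetric flow in m^3/s) needed to carry heat_kw."""
    mass_flow = heat_kw / (cp_kj_per_kg_k * delta_t_k)  # from Q = m_dot * cp * dT
    return mass_flow, mass_flow / density_kg_per_m3


# Assumed properties: air 1.005 kJ/kg.K at 1.2 kg/m^3, water 4.18 kJ/kg.K at 1000 kg/m^3.
air_mass, air_vol = coolant_flow(120, 1.005, 1.2, delta_t_k=15)
water_mass, water_vol = coolant_flow(120, 4.18, 1000, delta_t_k=10)

print(f"air:   {air_vol:.2f} m^3/s")
print(f"water: {water_vol * 1000:.2f} L/s")
print(f"volume ratio: {air_vol / water_vol:.0f}x")
```

Under these assumptions one rack needs several cubic metres of air per second but only a few litres of water per second.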

"What's the difference between prefill and decode?"

LLM inference has two distinct phases. Prefill processes the entire input prompt in parallel — one big forward pass through all layers, populating the KV cache for every token in the prompt. Prefill is compute-bound (lots of matmuls on many tokens at once). It determines TTFT (time to first token).

Decode generates output tokens one at a time. Each step reads the full model weights from HBM but processes only one new token, using the KV cache from all previous tokens. Decode is memory-bandwidth bound: the GPU spends most of its time loading weights, not computing. Time per decode step ≈ model_size_bytes / HBM_bandwidth, so single-stream throughput ≈ HBM_bandwidth / model_size_bytes tokens per second. This is why techniques like continuous batching (amortize weight loading across many concurrent requests) and speculative decoding (verify K draft tokens in one pass) matter so much for serving efficiency.
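
A rough bandwidth roofline for decode: the time per step is about the weight bytes divided by HBM bandwidth, and tokens per second is the reciprocal. The sketch below ignores KV cache reads, compute limits, and multi-GPU sharding; the 70B FP16 model and 8 TB/s bandwidth are illustrative inputs.

```python
def decode_step_seconds(weight_bytes, hbm_bytes_per_s):
    """Lower bound on one decode step: every weight is read from HBM once."""
    return weight_bytes / hbm_bytes_per_s


def decode_tokens_per_second(weight_bytes, hbm_bytes_per_s, batch_size=1):
    """One weight sweep serves every sequence in the batch, so throughput scales with batch size."""
    return batch_size / decode_step_seconds(weight_bytes, hbm_bytes_per_s)


weights = 70e9 * 2  # 70B parameters at 2 bytes each (FP16)
bandwidth = 8e12    # 8 TB/s of HBM bandwidth

print(f"{decode_step_seconds(weights, bandwidth) * 1000:.1f} ms per step")
print(f"{decode_tokens_per_second(weights, bandwidth):.0f} tokens/s at batch 1")
print(f"{decode_tokens_per_second(weights, bandwidth, batch_size=32):.0f} tokens/s at batch 32")
```

The batch-32 line shows why continuous batching matters: the same weight sweep produces 32 tokens instead of one, until KV cache traffic or compute becomes the limit.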

"How does tensor parallelism differ from data parallelism?"

Data parallelism (DP): Every GPU holds a complete copy of the model. Each GPU processes a different mini-batch of data. After computing local gradients, all GPUs AllReduce them to stay synchronized. Simple, but every GPU needs enough memory for the full model + optimizer states. Works across nodes with moderate bandwidth.

Tensor parallelism (TP): A single matmul is split across GPUs along the hidden dimension (Megatron-style). For the first MLP weight matrix W1 of shape [h, 4h], GPU 0 gets W1[:, :2h] and GPU 1 gets W1[:, 2h:] (column-parallel). Each GPU computes its slice of the activations, applies the nonlinearity locally, and multiplies by its row slice of the second matrix W2 of shape [4h, h]; one AllReduce then sums the partial outputs. This requires high-bandwidth links (NVLink at 1.8 TB/s) because activations are communicated in every layer. TP divides per-GPU weight memory by the number of TP ranks. Typically used within a node (2–8 GPUs on NVLink), while DP is used across nodes (over InfiniBand/Ethernet).
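
The Megatron-style split can be checked numerically on one machine. This sketch uses plain Python lists and a ReLU in place of the usual GeLU; the helper names are illustrative. The first weight matrix is split by columns, the second by rows, and a single elementwise sum stands in for the AllReduce.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


def split_columns(m, parts):
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def mlp_dense(x, w1, w2):
    return matmul(relu(matmul(x, w1)), w2)


def mlp_tensor_parallel(x, w1, w2, ranks=2):
    """Each 'rank' holds a column shard of w1 and the matching row shard of w2."""
    w1_shards = split_columns(w1, ranks)
    w2_shards = split_rows(w2, ranks)
    partials = [matmul(relu(matmul(x, a)), b) for a, b in zip(w1_shards, w2_shards)]
    # AllReduce stand-in: elementwise sum of the per-rank partial outputs.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


x = [[1, -2]]                                                        # batch of 1, h = 2
w1 = [[1, -1, 2, 0, 3, -2, 1, 1], [0, 1, -1, 2, 1, 1, -3, 2]]        # [h, 4h]
w2 = [[1, 0], [0, 1], [2, -1], [1, 1], [-1, 2], [3, 0], [0, -2], [1, 1]]  # [4h, h]
print(mlp_dense(x, w1, w2), mlp_tensor_parallel(x, w1, w2, ranks=2))
```

The column split works because the nonlinearity is elementwise, so each rank can apply it to its own slice without communicating; only the final sum crosses GPUs.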

"What is FlashAttention and why does it matter?"

Standard attention computes softmax(QK^T/√d)·V, which requires materializing the full N×N attention matrix in HBM (GPU main memory). For a 128K context, that's 128K² ≈ 16 billion elements per attention head, a massive memory and bandwidth cost.

FlashAttention (Dao et al., 2022) tiles the computation into small blocks that fit entirely in GPU SRAM (fast on-chip memory, ~20 MB). It computes exact attention, with no approximation, but never materializes the full N×N matrix. This reduces HBM accesses from Θ(Nd + N²) to Θ(N²d²/M), where d is the head dimension and M is SRAM size. Result: 2–4× wallclock speedup, dramatically lower memory usage, and the ability to train with much longer contexts without running out of memory. It's now the default in virtually every major training and inference framework.
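
The trick that makes tiling exact is an online softmax: keep a running maximum, a running normalizer, and a running weighted sum, and rescale them whenever a new block raises the maximum. The sketch below shows this for a single query vector in plain Python. It illustrates the idea only; the real kernel also tiles the queries and runs in on-chip memory.

```python
import math


def attention_naive(q, keys, values):
    """Reference: materialize all scores, then softmax and weight the values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(len(values[0]))]


def attention_tiled(q, keys, values, block=2):
    """Same result, but only one block of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max, running_sum, acc = -math.inf, 0.0, [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk, v_blk = keys[start:start + block], values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)  # 0.0 on the first block
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0], [3.0, 1.0, 0.5], [-1.0, 0.5, 2.0], [0.2, 0.2, 0.2]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
print(attention_naive(q, keys, values))
print(attention_tiled(q, keys, values, block=2))
```

Both functions print the same vector up to floating-point rounding, which is the sense in which the tiled computation is exact rather than approximate.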

"How do you reduce KV cache memory usage?"

The KV cache stores Key and Value tensors for all previous tokens across all layers. For a 70B model with 80 layers, 8 KV heads, 128-dim heads, at FP16, each token takes 2 × 80 × 8 × 128 × 2 bytes ≈ 328 KB, so a single sequence of 4K tokens uses ~1.3 GB. At batch size 64, that's ~86 GB, on the same order as the ~140 GB of FP16 model weights. Three main strategies:

(1) GQA (Grouped Query Attention): share KV heads across multiple Q heads. If 32 Q heads share 8 KV heads, the KV cache is 4× smaller. Llama 2 70B, Mistral, and most modern models use GQA. (2) KV cache quantization: store cached K/V in FP8 or INT8 instead of FP16, halving memory; 4-bit formats cut it to a quarter. (3) PagedAttention (vLLM): manage KV cache in fixed-size pages like OS virtual memory. Eliminates fragmentation that previously wasted 60–80% of allocated KV space. Pages can be shared (prefix caching) or freed independently.
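
KV cache size is a product of a handful of shape parameters, so the effect of each strategy is easy to estimate. The sketch below uses the 70B-class shape from the text (80 layers, 8 KV heads, 128-dim heads, 4,096 tokens); the 64-head case is an assumed no-GQA baseline for comparison, and the function name is illustrative.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2, batch=1):
    """Bytes of cached keys and values: two tensors (K and V) per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens * batch


GB = 1e9
gqa_fp16 = kv_cache_bytes(80, 8, 128, 4096)                      # 8 KV heads, FP16
mha_fp16 = kv_cache_bytes(80, 64, 128, 4096)                     # assumed baseline: one KV head per Q head
gqa_fp8 = kv_cache_bytes(80, 8, 128, 4096, bytes_per_value=1)    # 8-bit cache
gqa_batch64 = kv_cache_bytes(80, 8, 128, 4096, batch=64)

print(f"GQA, FP16, 1 sequence:   {gqa_fp16 / GB:.2f} GB")
print(f"MHA, FP16, 1 sequence:   {mha_fp16 / GB:.2f} GB")
print(f"GQA, FP8,  1 sequence:   {gqa_fp8 / GB:.2f} GB")
print(f"GQA, FP16, 64 sequences: {gqa_batch64 / GB:.1f} GB")
```

The cache grows linearly in every factor, so halving the bytes per value or the number of KV heads halves the footprint, and batch size multiplies it directly.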

"What makes AI datacenter power loads different from enterprise?"

Three fundamental differences: (1) Power volatility: enterprise servers draw relatively steady power. GPU clusters swing from ~30% TDP at idle to 100% TDP in milliseconds when a training batch launches. This creates transient spikes that stress UPS systems and confuse capacity planning models built for static loads. (2) Thermal density: at 80–120 kW/rack vs enterprise 5–15 kW/rack, the thermal time constant collapses. A cooling failure that gives you 15 minutes in an enterprise hall gives you <3 minutes in an AI hall. Monitoring latency that was fine at 60-second intervals becomes dangerous. (3) Workload correlation: in an AI cluster, all GPUs in a training job start and stop together, so power and thermal loads are highly correlated across hundreds of racks. Traditional designs assume statistical diversity (some racks busy, some idle, averaging out). AI training breaks that assumption, so your cooling plant must handle near-instantaneous step changes from idle to full load.

"What is MoE and how does it affect inference?"

Mixture of Experts (MoE) replaces the dense MLP (feed-forward) block in each transformer layer with N parallel "expert" MLPs plus a learned router. For each token, the router selects the top-k experts (typically k=2 of N=8–64). Total model parameters are much larger (since all N experts exist), but per-token FLOPs match those of a much smaller dense model, because only k experts fire per token.

Impact on inference: The full model must be in memory (all experts loaded), so MoE models need more GPU memory than a dense model of equivalent quality — a 47B-active-parameter MoE might have 140B total parameters. However, per-token compute cost equals the active parameter count, so inference is faster than a dense 140B model. The challenge is expert load balancing — if the router sends most tokens to the same few experts, you get hotspots on some GPUs and idle capacity on others. Auxiliary load-balancing losses during training and expert parallelism (spreading experts across GPUs) mitigate this.
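
Top-k routing can be sketched in a few lines. This is a simplified illustration: gate weights come from a softmax over the selected experts' logits, which is one common choice among several, and the experts are stand-in functions rather than real MLPs. All names are hypothetical.

```python
import math


def route_top_k(router_logits, k=2):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(x, experts, router_logits, k=2):
    """Gate-weighted sum of the selected experts' outputs; unselected experts never run."""
    out = [0.0] * len(x)
    for index, gate in route_top_k(router_logits, k):
        y = experts[index](x)
        out = [o + gate * v for o, v in zip(out, y)]
    return out


# Stand-in experts: expert i scales its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(4)]
logits = [0.1, 2.0, -1.0, 1.0]  # router scores for one token (made-up values)

print(route_top_k(logits))
print(moe_layer([1.0, 2.0], experts, logits))
```

All four experts must be resident in memory, but only two execute for this token, which is the memory-versus-compute trade described above.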
