From Power Grid to Inference
A team learning guide
AI is not magic. It is electricity, linear algebra, and scale — disciplined into prediction. This guide walks your team through every layer of the system, from a 130-megawatt facility to the softmax function that selects the next token.
What is AI, really?
When you ask ChatGPT a question and it answers, it feels like thinking. It isn't. What's actually happening is a very fast, very expensive math problem: the system has read billions of sentences and learned statistical patterns between words. Given a prompt, it predicts the most probable next word, then the next, then the next — hundreds of times per second.
That "prediction engine" is called a neural network. Think of it as a massive spreadsheet with billions of adjustable numbers (called parameters). During training, the system reads data, makes predictions, checks its mistakes, and tweaks those numbers to be slightly less wrong next time. Do this trillions of times and the spreadsheet gets eerily good at predicting language, images, code, or music.
Feed the network terabytes of text. It reads, predicts, gets corrected, and adjusts billions of parameters over weeks or months.
After training, you have a giant file of tuned parameters. This is the AI: a frozen snapshot of everything it learned. GPT-4 is widely reported, though not officially confirmed, to have ~1.8 trillion parameters.
When you ask it a question, the model runs those parameters against your prompt to generate one token (word-piece) at a time. This is called inference.
Why does AI need a building?
The math behind AI is simple — multiply matrices and add biases — but the scale is staggering. Training a frontier model means multiplying trillions of numbers together, billions of times. Your laptop's processor can't do this in a useful timeframe.
So companies pack thousands of specialised chips called GPUs into warehouse-sized buildings called data centers. These chips draw enormous amounts of electricity, and all that electricity becomes heat. Cooling that heat requires industrial-scale plumbing — chilled water loops, cooling towers, sometimes even outdoor ponds. A single training cluster can draw as much power as a small city.
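To make "multiply matrices and add biases" concrete, here is a minimal sketch of one layer followed by softmax, in plain Python. The weights, biases, and three-word vocabulary are made-up toy values, not taken from any real model; a real network stacks many such layers and holds billions of parameters.

```python
import math

def dense(x, weights, biases):
    """One layer: multiply the input vector by a weight matrix, then add biases."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy values: a 2-number input and a 3-token vocabulary.
vocab = ["cat", "dog", "fish"]
x = [1.0, 2.0]
W = [[0.5, 0.1], [0.2, 0.9], [-0.3, 0.4]]
b = [0.0, 0.1, 0.2]

probs = softmax(dense(x, W, b))
next_token = vocab[probs.index(max(probs))]
print(next_token, [round(p, 3) for p in probs])
```

With these toy numbers the highest score belongs to "dog", so that is the predicted next token. Inference repeats this step once per generated token.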
The AI Factory
NVIDIA uses a deliberate term for these facilities: AI factory. Not "data center" but factory. A traditional data center stores and serves data. An AI factory manufactures intelligence. Raw data enters one end; trained models and inference tokens come out the other. The primary product is tokens, and throughput is the measure of output, just like any manufacturing operation.[src]
This framing changes how you think about every layer of the stack. The facility isn't just housing for servers — it's an industrial production line that manages the entire AI lifecycle:
Data pipelines ingest, clean, and structure trillions of tokens from unstructured text, images, code, and sensor data. Data quality directly determines model quality — garbage in, garbage out, but at a trillion-token scale.
Trained models generate predictions, decisions, and content in real time. Inference outputs feed back into the system as a data flywheel — improving model accuracy over time. Every token generated creates new training signal.
GPUs, NVLink/NVSwitch fabrics, InfiniBand networking, parallel storage, liquid cooling — plus the software stack: CUDA, TensorRT, NIM microservices. Hardware and software designed as a single integrated system.
NVIDIA Omniverse lets teams design, simulate, and optimize the entire facility virtually — testing layout changes, modeling failure scenarios, and validating cooling before construction begins.
Automation tools handle hyperparameter tuning, model deployment, and performance monitoring. The factory operates 24/7 — training, fine-tuning, and serving inference at scale with minimal human intervention.
NVIDIA positions AI infrastructure as national infrastructure — as fundamental as roads, power grids, and telecommunications. Sovereign nations are building their own AI factories to cultivate local language models, protect data sovereignty, and drive economic competitiveness. Every enterprise, every government will need access to one.
In the sections ahead, we'll examine every part of this system, from the power grid and cooling systems (Infrastructure) through chip architecture (Silicon), the underlying math (Mathematics), the transformer architecture that makes language models work (Architecture), training and inference at scale (Training & Inference), and real-world considerations (Considerations).
The Building Is the Computer
If you're new here
A data center is a large, climate-controlled building packed with thousands of computers. Most of the internet already runs from buildings like this. What makes an AI data center different is the sheer density of power and heat. Training a single frontier AI model can require thousands of specialised chips called GPUs, each drawing up to 1,000 watts, roughly the same as a microwave oven running flat out. Multiply that by 10,000+ chips and you need as much power as a small town.
All that electricity becomes heat. So data centers need industrial cooling — water loops, heat exchangers, or cooling towers — just to stop the chips from melting. The section below lets you explore exactly how much power a training cluster needs.
A frontier model is not trained on a laptop. It is trained inside a purpose-built industrial facility that converts megawatts of grid power into gradients. Site selection is dominated by three things: cheap reliable electricity, fiber backbones, and water or air cool enough to reject heat.[src]
For reference: 100k Blackwell GPUs ≈ a mid-size city's continuous load.[src]
One rack, 72 GPUs, ~120 kW
- Compute trays: 18× per rack, each holding 2 Grace CPUs + 4 Blackwell GPUs, lashed together by 5th-gen NVLink at ~1.8 TB/s per GPU.[src][src]
- NVSwitch trays: 9× per rack form a non-blocking all-to-all fabric so any GPU can reach any other GPU at full NVLink bandwidth.
- Direct-to-chip liquid cooling: air alone cannot evacuate ~120 kW from a single 0.6 m² footprint.
- Power shelves: bus-bar delivery; redundant 415 V three-phase feeds backed by UPS and diesel/gas generators in 2N or N+1 topology.
- Optics: 800 Gb/s InfiniBand or Ethernet links exit the rack toward a rail-optimized fat-tree spine.[src]
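Scaling the rack figures used in this guide (72 GPUs and roughly 120 kW per rack) up to a whole cluster is simple arithmetic. The sketch below is a rough estimate only; the PUE of 1.2 is an assumed value for illustration, and real facilities will differ.

```python
import math

GPUS_PER_RACK = 72   # rack figure quoted in this guide
RACK_KW = 120.0      # approximate per-rack draw quoted in this guide

def cluster_power_mw(num_gpus, pue=1.2):
    """Return (racks, IT load in MW, facility load in MW).

    pue is an assumed overhead multiplier, not a measured value.
    """
    racks = math.ceil(num_gpus / GPUS_PER_RACK)
    it_mw = racks * RACK_KW / 1000.0
    return racks, it_mw, it_mw * pue

racks, it_mw, facility_mw = cluster_power_mw(100_000)
print(f"{racks} racks, {it_mw:.1f} MW IT, {facility_mw:.1f} MW at the meter")
```

Under these assumptions, 100,000 GPUs come to about 1,400 racks and well over 150 MW of IT load before cooling and distribution overhead.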
Complete electrical path — utility to server
Every AI factory converts grid power into GPU compute through a carefully engineered chain. Each stage steps down voltage, adds protection, and provides monitoring. A 2N topology means two fully independent paths — either side can carry the full facility load alone.
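The difference between N+1 and 2N can be sketched as a count of equipment units. The function and the example module size below are hypothetical, chosen only to illustrate the redundancy arithmetic.

```python
import math

def units_required(load_kw, unit_kw, scheme):
    """Number of units (UPS modules, generators, ...) for a redundancy scheme.

    N   : just enough units to carry the load
    N+1 : one spare unit shared across the system
    2N  : two complete, independent sets, each able to carry the full load
    """
    n = math.ceil(load_kw / unit_kw)
    if scheme == "N":
        return n
    if scheme == "N+1":
        return n + 1
    if scheme == "2N":
        return 2 * n
    raise ValueError(f"unknown scheme: {scheme}")

# Hypothetical example: 10 MW of load served by 2.5 MW modules.
for scheme in ("N", "N+1", "2N"):
    print(scheme, units_required(10_000, 2_500, scheme))
```

In a 2N design each path normally runs at no more than half its capacity, which is what lets either side pick up the whole load if the other fails.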
Utility entrance through MV switchgear. Vacuum or SF₆ breakers provide fault isolation. Protective relays (SEL-751, GE Multilin 489) monitor overcurrent, differential, ground fault. Revenue-grade metering (CT/PT accuracy class 0.3) for utility billing. ATS (automatic transfer switch) or paralleling switchgear coordinates utility ↔ generator transitions with <10s open-transition or <100ms closed-transition transfer.
Cast-coil dry-type transformers step down to 480 V. K-rated (K-13 or K-20) to handle harmonic distortion from server power supplies. UPS systems (rotary flywheel or lithium-ion) provide 10–15 min ride-through for generator start. Floor PDUs step down 480 → 208/120 V with integrated breakers and metering. RPPs (remote power panels) distribute at row level — each RPP feeds 4–8 racks with individually breakered circuits.
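Intelligent rack PDUs (ServerTech, Raritan, APC) provide per-outlet monitoring: voltage, current, kW, kWh, power factor. Data is reported via SNMP, Modbus TCP, or REST API to the facility's monitoring platform (for example a DCIM system). A typical GPU rack draws 40–120+ kW with >90% power factor. At 120 kW (GB200 NVL72), a single 100 A, 208 V three-phase circuit supplies only about 36 kW (about 29 kW at an 80% continuous rating), so each rack needs several such circuits or a direct 480 V bus bar feed that bypasses 208 V transformation entirely.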
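Per-rack circuit sizing can be sanity-checked with the three-phase power formula P = √3 · V · I · PF. The sketch below assumes unity power factor and an 80% continuous-load derating; the derating is a common design rule used here as an assumption, not a figure from this guide.

```python
import math

def three_phase_kw(volts_line_to_line, amps, power_factor=1.0, derate=0.8):
    """Usable kW from one three-phase circuit: sqrt(3) * V * I * PF * derate."""
    watts = math.sqrt(3) * volts_line_to_line * amps * power_factor * derate
    return watts / 1000.0

def circuits_needed(rack_kw, volts_line_to_line, amps, **kwargs):
    """Smallest number of identical circuits that covers the rack load."""
    return math.ceil(rack_kw / three_phase_kw(volts_line_to_line, amps, **kwargs))

for volts in (208, 480):
    kw = three_phase_kw(volts, 100)
    print(f"{volts} V, 100 A: {kw:.1f} kW usable, "
          f"{circuits_needed(120, volts, 100)} circuits for a 120 kW rack")
```

Under these assumptions a 100 A circuit yields about 29 kW usable at 208 V versus about 66.5 kW at 480 V, which is one reason high-density racks favor higher-voltage feeds.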
| Stage | Metering | Key Measurements | Protocol |
|---|---|---|---|
| MV Switchgear | ION 9000 / SEL-735 | V, A, kW, kVAR, PF, THD, demand | Modbus TCP / DNP3 |
| Generator | Controller (DSE, DEIF) | kW, fuel level, RPM, coolant temp, battery V | Modbus RTU/TCP |
| Transformer | Temp sensors (winding/oil) | Winding temp, oil temp, loading % | 4–20 mA / Modbus |
| UPS | Built-in controller | Input/output V, A, kW, battery SOC, temp, runtime | SNMP / Modbus TCP |
| Floor PDU / STS | ION 7650 / PM8000 | V, A, kW per panel, breaker status, transfer count | Modbus TCP |
| RPP | Branch circuit monitor | Per-breaker A, kW, alarm on trip | Modbus TCP / BACnet |
| Rack PDU | Intelligent PDU | Per-outlet V, A, kW, kWh, PF, inlet temp | SNMP / Modbus / REST |
All metering data feeds into the facility's monitoring platform (for example a DCIM system) for real-time PUE calculation, capacity planning, and alarm management. Total facility power (metered at MV) ÷ IT load (metered at rack PDU) = PUE.
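The PUE ratio described above can be computed directly from meter readings. This is a minimal sketch; the function name and the sample readings are hypothetical, and a real monitoring system would pull these values over SNMP or Modbus rather than hard-code them.

```python
def pue(facility_kw, rack_pdu_kw):
    """PUE = total facility power / IT power (sum of rack PDU readings)."""
    it_kw = sum(rack_pdu_kw)
    if it_kw <= 0:
        raise ValueError("IT load must be positive")
    return facility_kw / it_kw

# Hypothetical readings in kW: one MV meter value and four rack PDU values.
mv_meter_kw = 540.0
rack_readings_kw = [118.0, 121.5, 119.0, 91.5]
print(round(pue(mv_meter_kw, rack_readings_kw), 2))
```

A value of 1.0 would mean every watt entering the site reaches IT equipment; everything above 1.0 is cooling, conversion loss, and other overhead.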
How many paths to the server?
GPU racks draw 40–120+ kW each (GB200 NVL72 exceeds 120 kW). This drives unique requirements for power density, UPS sizing, and breaker coordination.
Traditional approaches
Air cooling works up to ~15–20 kW/rack. Above 50 kW/rack, air alone is physically insufficient regardless of airflow volume.
Direct-to-chip: the new standard
The GB200 NVL72 is liquid-cooled at rack level. Robust leak detection and emergency drain procedures are mandatory — a CDU leak can destroy millions in hardware in minutes.
Inside the cooling loop — component by component
AI data centers reject tens of megawatts of heat continuously. Below is a detailed look at every major mechanical component in the liquid cooling chain, from the cold plate bolted to the GPU die to the cooling tower rejecting heat to the atmosphere. Understanding these components — and how they interact — is essential for anyone involved in design, commissioning, or operations.
Heat flows left-to-right: GPU → cold plate → CDU → chiller → cooling tower → atmosphere. Two isolated water loops prevent contamination of IT equipment.
The bridge between IT and facility
Purpose: The CDU is the boundary between the clean, deionized IT water loop and the facility's chilled water loop. It ensures the two never mix — contamination of the IT loop with facility water (which contains corrosion inhibitors, biocides) would damage cold plates and clog micro-channels.
Scale: A typical row-level CDU serves 4–8 racks (200–600 kW). Rack-level CDUs serve a single rack (~40–120 kW). Large deployments use centralized CDUs serving entire rows or pods.
Redundancy: Dual pumps (lead/standby), dual facility-side connections, and N+1 CDU sparing per row. Automatic failover on pump fault or low-flow alarm. Typical response time: <5 seconds.
Where heat leaves the silicon
Micro-channels maximize surface area. Copper fins as thin as 0.2 mm create hundreds of parallel flow paths. Turbulent flow at the channel level dramatically increases heat transfer coefficient vs. smooth bore.
TIM (Thermal Interface Material) fills microscopic air gaps between the IHS and cold plate. Indium foil or high-performance paste with >5 W/m·K conductivity. Poor TIM application is the #1 cause of GPU thermal throttling.
Quick-disconnects (QD) allow hot-swap of servers without draining the loop. Non-drip / dry-break QDs are mandatory in IT environments. Rated for 10,000+ connect/disconnect cycles.
Rejecting heat to atmosphere
Evaporative cooling exploits the latent heat of vaporization — water evaporating from the fill media absorbs ~1,000 BTU/lb, cooling the remaining water. Approach temp (basin temp minus wet-bulb) of 5–7°F is typical. Lower approach = larger, more expensive tower.
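Water consumption: ~1.8 L/kWh of heat rejected (evaporation + blowdown), or roughly 0.5 gal/kWh; evaporation alone accounts for about 1.6 L/kWh given water's latent heat of ~2,260 kJ/kg. A 100 MW AI facility can consume 300,000+ gallons/day, and on the order of a million gallons/day if all heat is rejected evaporatively. Water treatment (biocide, scale inhibitor, pH control) is critical to prevent Legionella, scaling, and corrosion.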
Fan control: VFD modulates fan speed to maintain condenser water return temp setpoint. Multiple cells staged on/off as load changes. Fan energy is 2–5% of total cooling plant energy.
Moving heat between loops
Plate & frame are compact, high-effectiveness (ε > 0.90) and easily expandable — add plates to increase capacity. Used in CDUs, economizer bypass, and free-cooling heat exchangers. Brazed variants (BPHE) are smaller but not field-serviceable.
Shell & tube are used inside chillers as the condenser and evaporator. Refrigerant flows shell-side (changes phase); water flows tube-side. Fouling reduces efficiency — condenser approach temp rises 1°F per year without cleaning, costing ~2% chiller efficiency per degree.
Pump types in cooling plants
Moving air through towers & AHUs
How it all connects — primary/secondary CHW + condenser loop
PUE = Total Facility Power ÷ IT Equipment Power
- Raise chilled water supply temp → more economizer hours
- Liquid cooling → eliminates fan energy, enables free cooling
- Variable speed drives on pumps and fans
- Hot/cold aisle containment
- Efficient UPS (ECO mode, lithium-ion batteries)
- Higher voltage distribution (415V vs 208V) reduces I²R losses
A 0.1 improvement in PUE at a facility with 100 MW of IT load saves ~10 MW of cooling/overhead power, roughly $6M/year at an industrial electricity rate of about $0.07/kWh.[src]
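The savings figure can be reproduced with a few lines of arithmetic. The sketch assumes the 100 MW refers to IT load and uses an electricity price of $70/MWh; both the price and the example PUE values are assumptions for illustration.

```python
HOURS_PER_YEAR = 8760

def pue_savings(it_load_mw, pue_before, pue_after, usd_per_mwh):
    """Return (MW of overhead saved, USD saved per year)."""
    saved_mw = it_load_mw * (pue_before - pue_after)
    return saved_mw, saved_mw * HOURS_PER_YEAR * usd_per_mwh

saved_mw, usd_per_year = pue_savings(100.0, 1.3, 1.2, usd_per_mwh=70.0)
print(f"{saved_mw:.1f} MW saved, ${usd_per_year / 1e6:.1f}M per year")
```

Because the savings scale linearly with both IT load and electricity price, the same 0.1 improvement is worth proportionally more at larger sites or in more expensive power markets.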
Why liquid cooling at scale is hard
Moving from air to liquid sounds straightforward — but at AI-scale densities, it introduces a set of engineering challenges that the data center industry is still actively solving. These are the real-world problems that make or break a liquid-cooled deployment.
Water inside a server room is inherently risky. A single fitting failure can destroy hundreds of thousands of dollars of GPU hardware in minutes. Every connection point — quick-disconnects, manifold joints, CDU internals — is a potential leak site. Mitigation: dry-break QD fittings, leak detection cables under every pipe run, drip pans beneath CDUs, automatic isolation valves that shut down a loop segment within seconds of detection. Despite this, many operators still consider liquid cooling "high anxiety" compared to air.
Air-cooled servers slide in and out of racks freely. Liquid-cooled servers are tethered to plumbing. Replacing a GPU node means disconnecting fluid lines, managing residual coolant, and reconnecting without introducing air bubbles. Blind-mate connectors (auto-connecting on rack insertion) help, but add cost and complexity. Service technicians need new training — HVAC pipe-fitting skills meet IT operations. Mean-time-to-repair (MTTR) is longer unless the facility is designed with maintenance access and drain points from day one.
Cold plates only capture 60–80% of server heat (GPUs, CPUs). The remaining 20–40% (VRMs, DIMMs, NVMe drives, NICs) still radiates into the room as hot air. You can't eliminate CRAHs or room-level cooling entirely — you need a parallel air system for the "residual" heat. Designing the interaction between these two systems (liquid capturing the bulk, air handling the remainder) requires careful airflow modeling and control coordination. Over-cooling with air wastes energy; under-cooling risks component damage on non-liquid-cooled parts.
The IT loop requires ultra-pure deionized water (<1 μS/cm conductivity) to prevent galvanic corrosion and micro-channel fouling. But DI water is aggressive — it leaches metal ions from fittings, especially if mixed metals (copper cold plates + aluminum manifolds) are present. Ongoing water chemistry monitoring (conductivity, pH, dissolved O₂, particulate count) is mandatory. Glycol-based coolants resist corrosion but reduce heat transfer capacity by 10–15% and complicate leak cleanup. There is no perfect fluid — every choice is a trade-off.
Unlike air cooling (standardized 19" racks, ASHRAE guidelines, universal CRAH compatibility), liquid cooling lacks industry-wide standards. Every server vendor has different cold plate designs, manifold connectors, flow requirements, and CDU interfaces. OCP (Open Compute Project) is working on standardized liquid cooling specifications, but adoption is still early. This makes multi-vendor deployments painful — a Dell cold plate won't mate with an HPE manifold. Facilities must commit to a vendor ecosystem or invest heavily in adapters and custom plumbing.
Existing air-cooled facilities weren't designed for liquid cooling. Retrofitting requires: raised-floor penetrations or overhead pipe routing, structural reinforcement (water-filled pipes are heavy), new chiller/tower capacity, CDU floor space, and leak containment infrastructure. Floor loading jumps from ~150 lbs/ft² (air-cooled) to 250+ lbs/ft² (liquid-cooled with dense GPU racks). Many buildings simply can't support it without structural modifications. Greenfield builds can design for liquid from the start — but the industry is converting existing facilities faster than it can build new ones.
Submerging servers in dielectric fluid
Immersion cooling eliminates air entirely — servers are submerged in a tank filled with electrically non-conductive (dielectric) fluid. The fluid makes direct contact with every component on the board, removing heat from GPUs, CPUs, VRMs, DIMMs, and NVMe drives simultaneously. No cold plates, no fans, no CRAHs. Two variants exist: single-phase (fluid stays liquid) and two-phase (fluid boils at the chip surface).
- ✓ Eliminates all server fans — 10–15% IT power savings
- ✓ No CRAHs or raised floor required
- ✓ Every component cooled equally — no hot spots
- ✓ Operates at higher fluid temps → more free-cooling hours
- ✓ PUE of 1.02–1.05 achievable
- ✓ Quieter than air-cooled — no fan noise
- ✓ Mineral oil is cheap and widely available
- ✓ No leak risk to room — fluid stays in sealed tank
- ✗ Messy serviceability — servers drip when removed
- ✗ Increased MTTR — draining & handling required
- ✗ Material compatibility: some connectors, labels, thermal pads dissolve
- ✗ Weight: filled tanks reach 1,500–3,000 lbs — structural reinforcement needed
- ✗ Fluid monitoring (viscosity, particulates, moisture) adds operational overhead
- ✗ Server OEM warranty may be voided — most OEMs don't certify for immersion
- ✗ Fire code compliance: mineral oil is combustible (Class IIIB)
- ✗ Limited vendor ecosystem vs. cold-plate solutions
When a liquid boils, it absorbs the latent heat of vaporization — the energy required to break intermolecular bonds. For engineered dielectric fluids, this is typically 80–120 kJ/kg. This is in addition to the sensible heat the liquid absorbs as its temperature rises. The result: 10–100× higher heat transfer coefficients vs. single-phase convection. A boiling surface can reject >20 W/cm² with only a few degrees of superheat above the fluid's boiling point. For comparison, forced-air convection maxes out at ~0.5 W/cm².
- ✓ Highest heat transfer of any cooling method — handles >1,000W TDP
- ✓ No pumps in primary IT loop — self-circulating phase change
- ✓ Uniform chip temperature regardless of load (boiling is isothermal)
- ✓ Can handle 200+ kW/rack densities
- ✓ Near-silent operation — no fans, no pump vibration
- ✓ PUE of 1.01–1.03 theoretically achievable
- ✓ Fluid is non-flammable (unlike mineral oil)
- ✗ Extremely high fluid cost ($50–150/liter × 200–500L per tank)
- ✗ PFAS regulatory risk — EU REACH restrictions, 3M exit
- ✗ Vapor management — tank must be sealed; fugitive emissions are GWP concern
- ✗ Material compatibility — aggressive solvents attack some elastomers, labels, TIMs
- ✗ Limited production-scale deployments (still early-adopter stage)
- ✗ Service complexity — servers must drain before removal
- ✗ Fluid replenishment logistics and environmental disposal
- ✗ Condenser sizing critical — undersized = vapor loss + capacity limit
| Attribute | Air Cooling | Direct-to-Chip (Cold Plate) | Single-Phase Immersion | Two-Phase Immersion |
|---|---|---|---|---|
| Max rack density | 15–25 kW | 120–200 kW | 100–200 kW | 200+ kW |
| Heat transfer coeff. | 5–25 W/m²·K | 5,000–10,000 W/m²·K | 50–200 W/m²·K | 10,000–100,000 W/m²·K |
| PUE achievable | 1.3–1.6 | 1.03–1.15 | 1.02–1.05 | 1.01–1.03 |
| Fluid cost | N/A (air) | ~$0 (water) | $2–30/L | $50–150/L |
| Residual heat path | All via air | 20–40% via air | 100% via fluid | 100% via fluid |
| Serviceability | Excellent | Moderate (QD fittings) | Challenging (drip/drain) | Challenging (drain + vapor) |
| Maturity | Decades (standard) | Production (NVIDIA std) | Early production | Pilot / early adopter |
| Regulatory risk | None | None | Low (oil fire codes) | High (PFAS) |
Immersion cooling introduces a different set of monitoring requirements compared to cold-plate systems. The fluid itself becomes a critical asset to monitor and maintain.
- Fluid temperature (bulk & stratified)
- Viscosity (degradation indicator)
- Dielectric breakdown voltage
- Moisture content (ppm)
- Particulate count (cleanliness)
- Acid number (oxidation products)
- Fluid level (leak/loss detection)
- Tank inlet / outlet ΔT
- Per-server inlet temp (immersed sensor)
- Condenser coil in/out temps
- Vapor space temperature (two-phase)
- Condenser approach temperature
- Ambient air above tank
- High-temp shutdown (fluid overheat)
- Low-level alarm (fluid loss)
- Vapor pressure monitoring (two-phase)
- Leak detection under/around tanks
- Emergency drain valve (gravity-fed)
- Fire suppression integration
Vaporization at the cold plate — boiling without immersion
This is the “best of both worlds” approach you may have seen: the extreme heat transfer of phase-change boiling, but contained inside a sealed cold plate bolted to the GPU — no immersion bath, no dripping servers. The dielectric fluid boils at the chip surface inside the cold plate. Vapor travels through tubing to a remote condenser where it releases heat to the facility loop and condenses back to liquid, which returns to the cold plate. This is fundamentally different from single-phase direct-to-chip (where water stays liquid and just gets warmer).
Vapor rises naturally because it's less dense than liquid. Liquid returns by gravity. Requires the condenser to be physically above the cold plate, typically at the top of the rack or in an overhead manifold. Zero pumping energy in the IT loop. Limited by the height differential and pressure drop.
A small magnetically-coupled pump moves liquid back to the evaporator, freeing the design from gravity constraints. The pump only handles liquid return (low flow, low pressure) — the phase change still does the heavy lifting of heat transport. Condenser can be placed anywhere convenient.
A boiling fluid stays at its saturation temperature regardless of how much heat you dump into it. A 700W GPU and a 1,500W GPU on the same loop will both stabilize at the fluid's boiling point (say, 55°C). This eliminates hot-spot variation between chips and gives a flat, predictable junction temperature (Tj) across the entire cluster.
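Single-phase water carries ~4.2 kJ per kg per °C of temperature rise. A two-phase dielectric carries 80–120 kJ/kg of latent heat on phase change alone, more than 20× what water absorbs per degree of temperature rise. Against a water loop running a 10°C rise (~42 kJ/kg), that is roughly 2–3× more heat per kilogram, so the reduction in coolant flow depends on the ΔT being compared and approaches an order of magnitude only against water loops with a small temperature rise. Lower mass flow means smaller pipes, smaller pumps, less plumbing complexity.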
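The flow-rate comparison follows from two formulas: sensible heat Q = ṁ · cp · ΔT for water, and latent heat Q = ṁ · h_fg for a boiling dielectric. The sketch below uses the figures quoted above (4.2 kJ/kg·°C and a mid-range 100 kJ/kg); the 120 kW load and the water temperature rises are assumed example values.

```python
def sensible_flow_kg_s(heat_kw, cp_kj_per_kg_k, delta_t_k):
    """Mass flow needed when the coolant only warms up (no phase change)."""
    return heat_kw / (cp_kj_per_kg_k * delta_t_k)

def latent_flow_kg_s(heat_kw, latent_kj_per_kg):
    """Mass flow needed when all heat goes into boiling the coolant."""
    return heat_kw / latent_kj_per_kg

HEAT_KW = 120.0  # assumed example load, roughly one dense GPU rack
dielectric = latent_flow_kg_s(HEAT_KW, 100.0)

for delta_t in (2.0, 5.0, 10.0):
    water = sensible_flow_kg_s(HEAT_KW, 4.2, delta_t)
    print(f"water dT={delta_t:>4} C: {water:6.2f} kg/s, "
          f"dielectric: {dielectric:.2f} kg/s, ratio {water / dielectric:.1f}x")
```

The ratio shrinks as the water loop is allowed a larger temperature rise, so the flow advantage of two-phase cooling depends on which single-phase design it is compared against. The sketch compares mass flow only and ignores density differences between the fluids.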
Conventional single-phase cold plates start running out of margin at ~1,000W GPUs because the inlet-to-outlet ΔT widens and the chip sees inconsistent cooling. Two-phase direct-to-chip (2P-DTC) handles 1,500W+ TDPs with the same supply temperature and a flat thermal profile. This makes it a leading candidate for NVIDIA Rubin and beyond.
Because phase change happens at a fixed temperature (set by fluid choice), 2P-DTC can run with facility water at 35–45°C while still keeping GPU Tj below throttle limits. This means year-round free cooling in almost any climate — no compressors, no chillers, just a dry cooler or cooling tower.
Pumped 2-phase using a proprietary dielectric (HC-1). Sealed evaporators (Enhanced Nucleation Evaporators) bolt directly onto GPU/CPU. Deployed by Equinix, Aligned. Among the most commercially mature 2P-DTC offerings.
Pumped 2-phase using a non-PFAS, low-GWP fluorocarbon refrigerant. Spinoff from Nokia Bell Labs technology. Focuses on drop-in cold-plate replacement for existing direct-to-chip racks.
Microconvective single-phase today, with two-phase variants in development. Uses high-velocity impinging jets inside the cold plate to maximize heat transfer coefficient. HPE OEM partnership.
Operates the IT loop at sub-atmospheric pressure so leaks pull air in rather than push fluid out. Pairs naturally with 2-phase designs where vapor pressure management is critical.
| Attribute | Single-Phase DTC (water) | Two-Phase DTC | Two-Phase Immersion |
|---|---|---|---|
| Where boiling occurs | N/A (no phase change) | Inside sealed cold plate | Open bath around servers |
| Fluid | Treated water or PG mix | Dielectric refrigerant (small volume) | Dielectric refrigerant (large volume) |
| Fluid volume per rack | 5–20 L | 2–10 L | 200–500 L |
| Server form factor | Standard rack-mount | Standard rack-mount | Vertical in tank |
| Serviceability | QD fittings, hot-swap | QD fittings, hot-swap | Drain & lift from tank |
| Max chip TDP | ~1,000–1,200 W | 1,500 W+ | 1,500 W+ |
| Facility water supply temp | 25–35°C (warm water) | 35–45°C (free cooling) | 25–40°C |
| Residual air cooling needed | Yes (~20–40% of heat) | Yes (~20–30% of heat) | No |
| Leak consequence | Water on electronics = damage | Dielectric — non-conductive, evaporates | Dielectric — non-conductive |
| Maturity (2026) | Production standard (NVL72) | Early production / pilot | Pilot / early adopter |
The fluid's saturation temperature is a direct function of system pressure. A pressure drift indicates fluid loss, non-condensable gas ingress, or condenser fouling. Pressure transducers on the vapor line are mandatory.
If heat flux exceeds the critical heat flux (CHF) of the boiling surface, the cold plate can “dry out” — a vapor film forms between liquid and the hot surface, collapsing heat transfer. Tj will spike in seconds. PLC must trip the GPU before this happens.
Air leaking into a sub-atmospheric loop creates NCG pockets in the condenser, reducing effective area. Most 2P-DTC systems have an automatic vent or NCG purge cycle, with monitoring for purge events.
Unlike water systems, you can't just top off from a city tap. Fluid loss = expensive refrigerant replacement. Continuous mass monitoring (via accumulator level or pressure-temperature correlation) catches slow leaks early.
Today, single-phase direct-to-chip (water) is the production standard for GB200 NVL72 and equivalent 120 kW racks. 2P-DTC is the leading contender for the next density jump — 200+ kW racks with 1,500W+ GPUs (NVIDIA Rubin generation). It preserves the serviceability and familiar form factors of single-phase DTC while delivering the heat-flux handling of immersion. The main barriers are fluid cost, supply-chain constraints around non-PFAS refrigerants, and the relative newness of the vendor ecosystem compared to mature water-based cold plates.
What's next — emerging technologies pushing cooling forward
With GPU TDP heading toward 1,500W+ (NVIDIA Rubin) and rack densities exceeding 200 kW, even current liquid cooling approaches face limits. Here's where the industry and academia are pushing boundaries.
Instead of bolting a cold plate on top of the chip, etch micro-channels directly into the silicon die or interposer. Coolant flows microns away from the transistors. DARPA's ICECool program demonstrated >1 kW/cm² heat removal — 10× what conventional cold plates achieve. This eliminates TIM, IHS, and the thermal resistance stack entirely. Challenges: fabrication complexity, integrating fluid connections into chip packaging, reliability over billions of thermal cycles. Timeline: 5–10 years from production GPU integration.
Solid-state heat pumps with no moving parts — apply voltage and one side gets cold. Current TECs are inefficient (COP ~0.5–1.5 vs mechanical chiller COP of 5–8), but new bismuth telluride nanostructures and thin-film designs are closing the gap. Use case: spot cooling for the hottest chip regions (hot-spot management) rather than bulk heat rejection. Intel has demonstrated embedded TEC layers that reduce Tj by 15°C at targeted hot spots. Could complement liquid cooling rather than replace it.
Suspending nanoparticles (copper oxide, aluminum oxide, carbon nanotubes) in base fluids can improve thermal conductivity by 10–40%. Lab results show significant heat transfer improvements at low concentrations (1–5% by volume). Challenges: particle settling over time, abrasion of micro-channels, long-term stability, and cost. No production deployments yet in data centers, but active research funded by ARPA-E and DOE. Dielectric nanofluids could make immersion cooling significantly more effective.
Running GPU supply water at 45–55°C (instead of the traditional 15–20°C) enables year-round free cooling — no chillers needed at all. The "waste" heat at 55–65°C return is hot enough for district heating. Nordic data centers (Meta Luleå, Google Hamina) already feed waste heat to municipal heating networks. Lenovo's Neptune platform runs at 50°C supply, achieving PUE <1.03. The efficiency gain is dramatic: eliminating chiller compressors removes 30–40% of total cooling plant energy. Challenge: GPU silicon must tolerate higher junction temperatures, and not all workloads perform identically at elevated temps.
A middle-ground approach: chilled water coils with EC fans mounted on the rack rear door. Captures 60–100% of rack exhaust heat before it enters the room. Handles 30–50 kW/rack without liquid inside the server. Popular as a retrofit for air-cooled facilities adding GPU density. No cold plates, no QD fittings, no IT-loop plumbing — just chilled water to the door. Limitation: can't match direct-to-chip efficiency at 100+ kW/rack densities, and doesn't address chip-level hot spots.
The industry consensus: direct-to-chip liquid cooling is the default for any new AI data center build. Air cooling is being relegated to edge, enterprise, and legacy workloads. The open questions are whether immersion or cold-plate wins for the densest deployments, whether on-chip microfluidics can reach production in time for the next GPU generation, and whether the industry can standardize fast enough to avoid vendor lock-in. Meanwhile, every 100W increase in GPU TDP makes the case for liquid cooling stronger — and the gap between air's ceiling and GPU demand wider.
When the data center leaves Earth — cooling, comms & controls in vacuum
Starcloud (formerly Lumen Orbit), Axiom Space, Lonestar Data Holdings, and several Chinese ventures are seriously proposing multi-megawatt compute clusters in low Earth orbit (LEO). The pitch: unlimited solar power, no water, no land, no NIMBY. The reality: the two hardest problems on Earth — power and cooling — get even harder in vacuum, and a third problem (communications) becomes load-bearing. Here's how each of those works without an atmosphere underneath you.
Common misconception: “space is cold, so cooling is easy.” The opposite is true. With no air, there is no convection. With nothing in contact, there is no conduction. The only way to reject heat into space is by infrared radiation, and radiation is the weakest of the three heat-transfer modes by orders of magnitude.
Radiative heat flux: q = ε·σ·(T⁴ − T_space⁴)
Where ε ≈ 0.85–0.92 for good radiator coatings, σ is the Stefan-Boltzmann constant (5.67 × 10⁻⁸ W/m²·K⁴), and T_space ≈ 3 K (cosmic background) is effectively zero. The catch: T⁴ means radiator capacity drops fast as you try to operate cooler. An ideal radiator surface (ε = 1) at 50°C rejects roughly 620 W/m²; at 20°C it rejects only ~420 W/m², and real coatings with ε ≈ 0.9 reject about 10% less. That's why orbital data center designs always want to run radiators hot.
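The radiative flux formula can be evaluated directly. This sketch treats the radiator as a single flat surface that sees only deep space, which is an idealization: it ignores absorbed sunlight, Earth's infrared, and two-sided panels.

```python
SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant, W/(m^2 K^4)

def radiator_flux(temp_c, emissivity=1.0, t_space_k=3.0):
    """Net radiated flux in W/m^2: q = eps * sigma * (T^4 - T_space^4)."""
    t_k = temp_c + 273.15
    return emissivity * SIGMA * (t_k ** 4 - t_space_k ** 4)

for temp_c in (20, 50, 80):
    ideal = radiator_flux(temp_c)
    coated = radiator_flux(temp_c, emissivity=0.9)
    print(f"{temp_c} C: {ideal:.0f} W/m^2 ideal, {coated:.0f} W/m^2 at eps=0.9")
```

Running the radiator 30°C hotter raises the rejected flux by nearly half, which is the T⁴ effect in practice.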
- ● ISS: 14 ammonia radiator panels, ~75 kW total rejection
- ● ISS radiator area: ~1,560 m²
- ● Effective rejection: ~48 W/m² (avg over orbit)
- ● A 5 MW orbital DC would need ~100,000 m² of radiator at the same efficiency
- ● That's a square ~315 m × 315 m — bigger than 14 football fields
- ● Solar panels for the same 5 MW: ~25,000 m² (4× smaller than the radiators)
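The sizing arithmetic in the list above can be reproduced as follows. A rough sketch: it simply scales the ISS's orbit-averaged rejection per square meter to a 5 MW load and ignores radiator temperature, view factors, and two-sided panels. Variable names are illustrative.

```python
import math

ISS_REJECTION_W = 75_000       # ~75 kW total rejection (figure quoted above)
ISS_RADIATOR_AREA_M2 = 1_560   # ~1,560 m^2 of radiator (figure quoted above)
HEAT_LOAD_W = 5_000_000        # 5 MW orbital data center

flux = ISS_REJECTION_W / ISS_RADIATOR_AREA_M2   # effective W/m^2
area = HEAT_LOAD_W / flux                        # required radiator area, m^2
side = math.sqrt(area)                           # side of an equivalent square, m

print(f"Effective flux: {flux:.0f} W/m^2")
print(f"Radiator area:  {area:,.0f} m^2")
print(f"Square side:    {side:.0f} m")
```

Unrounded, this gives about 104,000 m² (a square roughly 322 m on a side), in line with the rounded ~100,000 m² figure.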
In orbit, radiator area, not solar panel area, is the binding constraint. This flips terrestrial design intuition: on Earth power is expensive and cooling is cheap-ish. In orbit, power from the sun is essentially free, but every watt you generate must eventually be radiated, and radiator mass-to-orbit is the dominant cost driver.
Loop heat pipes (LHPs) move heat passively from chip → radiator using capillary-driven phase change. No pump, no moving parts, indefinite lifetime. The workhorse of spacecraft thermal control.
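NH₃ boils around −33°C at 1 atm but is operated at higher pressure in spacecraft to boil at 5–40°C. Its latent heat (~1,370 kJ/kg) is lower than water's (~2,260 kJ/kg) but several times that of common fluorocarbon refrigerants, which keeps mass flow, pump power, and tubing mass low.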
Radiators must face deep space, not the sun or Earth. Attitude control system (ACS) constantly slews the spacecraft to keep solar arrays sun-pointed and radiators edge-on to the sun — a coupled optimization problem unique to orbital DCs.
An orbital DC is useless if you can't move data to it and results back. There are three communication tiers, each with different bandwidth, latency, and standards.
| Link Type | Tech | Bandwidth | Latency | Use Case |
|---|---|---|---|---|
| Ground ↔ Sat (RF) | Ka / Ku band | 10–40 Gbps | 5–20 ms (LEO) | Bulk uplink/downlink |
| Ground ↔ Sat (Optical) | Laser comm | 100 Gbps – 1 Tbps | 5–10 ms (LEO) | High-volume training data transfer |
| Sat ↔ Sat (Optical ISL) | Inter-satellite laser | 100–200 Gbps | <5 ms | Distributed compute mesh between DCs |
| TT&C (control) | S-band RF | kbps – Mbps | 10–50 ms | Telemetry, commands (separate from payload) |
The Consultative Committee for Space Data Systems (CCSDS) defines the standards every space agency and most commercial operators follow. Key protocols: Space Packet Protocol (datagram format), AOS/TM/TC (telemetry & telecommand framing), CFDP (file transfer with delay tolerance), DTN/Bundle Protocol (store-and-forward for intermittent links).
A LEO satellite sees any one ground station for only ~5–10 minutes per pass. To get continuous downlink you need a global network: AWS Ground Station, Azure Orbital, KSAT, Viasat RTE. Optical-link networks (Starlink-style) provide always-on backhaul by routing through inter-satellite links to a sat currently over a station.
On Earth, the worst case for a misbehaving server is a remote-hands ticket. In orbit, there are no remote hands. Everything from thermal control to GPU error recovery must be autonomous, with ground operators reduced to high-level commanding and forensic analysis after the fact.
Hierarchical autonomy: every subsystem (power, thermal, comms, GPU cluster) runs its own fault detection, isolation, and recovery (FDIR) loop. It detects out-of-limit conditions locally, isolates the fault, executes a recovery action (switch to redundant unit, safe-mode the affected zone, throttle GPUs to reduce heat), and reports up. Ground intervenes only when on-board recovery exhausts its playbook. Think of it as self-healing infrastructure as a hard requirement, not a nice-to-have.
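The detect, isolate, recover, report loop described above can be sketched for one subsystem. This is an illustrative toy, not flight software: the class name `SubsystemFDIR`, the action names in the playbook, and the 85°C limit are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class SubsystemFDIR:
    """Local fault detection and recovery for one subsystem (toy model)."""
    name: str
    limit: float                 # out-of-limit threshold for the monitored value
    playbook: list               # ordered recovery actions (hypothetical names)
    attempts: int = 0
    report: list = field(default_factory=list)   # messages sent up the hierarchy

    def step(self, reading: float) -> str:
        if reading <= self.limit:          # detection: value is within limits
            self.attempts = 0              # fault cleared, reset the playbook
            return "NOMINAL"
        if self.attempts >= len(self.playbook):
            action = "ESCALATE_TO_GROUND"  # on-board playbook exhausted
        else:
            action = self.playbook[self.attempts]
            self.attempts += 1
        self.report.append((self.name, reading, action))
        return action

thermal = SubsystemFDIR(
    "thermal", limit=85.0,
    playbook=["throttle_gpus", "switch_to_redundant_loop", "safe_mode_zone"],
)
for temp in (70.0, 90.0, 92.0, 95.0, 97.0):
    print(temp, thermal.step(temp))
```

Each out-of-limit reading advances one step through the playbook; only when every local action has been tried does the subsystem ask for ground intervention.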
The control computer (BMS equivalent) runs on radiation-hardened silicon: BAE RAD750, Cobham GR740, or newer commercial-off-the-shelf + ECC + TMR (triple modular redundancy) designs. Flight software is typically built on a framework such as NASA cFS (Core Flight System) running on an RTOS like VxWorks or RTEMS. The GPUs themselves are commercial, but every command they receive routes through the rad-hard supervisor.
Cosmic rays cause single-event upsets (SEUs: bit flips) and single-event latch-ups (SELs: parasitic short circuits that persist until power is removed) in commercial silicon. Mitigations: ECC memory everywhere, periodic scrubbing of DRAM, latch-up current sensors that power-cycle affected blocks within microseconds, and software-level checkpoint/restart for training jobs.
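Triple modular redundancy, mentioned above, comes down to a majority vote across three copies of the same value. A minimal sketch in Python (real TMR is done in hardware or at the register level; the function name `tmr_vote` is illustrative):

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote across three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)

golden = 0b1011_0010
upset = golden ^ 0b0000_1000   # simulate a single-event upset flipping one bit

voted = tmr_vote(golden, upset, golden)
print(f"voted={voted:#010b}, corrected={voted == golden}")
```

A single corrupted copy is outvoted bit by bit; two simultaneous upsets in the same bit position would defeat the vote, which is why scrubbing runs periodically.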
Inside the spacecraft, the control plane uses SpaceWire (200 Mbps) or SpaceFibre (multi-Gbps) for high-speed links, MIL-STD-1553 (1 Mbps, typically dual-redundant) for safety-critical commands, and CAN bus for low-rate sensor polling. The payload (the actual GPU cluster) uses standard 400G/800G Ethernet or InfiniBand, just shielded and qualified for vibration.
- ● Cold plate inlet/outlet T
- ● Radiator panel T (each)
- ● Heat pipe evaporator T
- ● Ammonia loop pressure
- ● GPU Tj (per die)
- ● Sun/shade angle
- ● Solar array current/voltage (per string)
- ● Battery state of charge
- ● Battery cell voltages
- ● Bus voltages (28V, 100V HVDC)
- ● Eclipse prediction (orbit-driven)
- ● Charge regulator status
- ● SEU/SEL event counter
- ● Radiation dose accumulator
- ● Attitude (quaternion, rates)
- ● Reaction wheel speeds
- ● Propellant remaining
- ● Comm link SNR / BER
Starcloud: Proposing 5 GW orbital data centers powered entirely by solar. Pitched as the only way to scale compute past Earth's power-grid limits. First demonstrator: a small GPU cluster aboard a SpaceX rideshare.
Axiom Space: Building a commercial space station, with orbital compute as a planned payload. Has an MOU with several hyperscalers for in-orbit data processing of Earth observation feeds.
Lonestar Data Holdings: Lunar data centers that leverage the Moon's vacuum, regolith shielding, and cold permanently-shadowed craters for radiative cooling. Successfully booted data payloads on Intuitive Machines IM-1/IM-2 missions.
EU-funded feasibility study (ASCEND — Advanced Space Cloud for European Net-zero emission and Data sovereignty) concluded orbital DCs are technically feasible and could be net-positive on CO₂ by 2050 if launch emissions drop.
Orbital DCs solve power and (theoretically) cooling but introduce three first-principles problems: launch cost (still ~$1,500/kg even on Starship — and radiators are heavy), servicing impossibility (no swap-a-bad-DIMM in LEO), and orbital debris exposure (large radiator panels are meteoroid/debris bullseyes). The economics work only if launch drops below ~$200/kg and compute density per kg jumps another 5–10×. For an AI controls engineer this is a far-horizon problem — but the thermal physics and FDIR principles are directly applicable to edge / remote / unmanned terrestrial sites today.
PLC programming, SCADA/Ignition architecture, OPC-UA, Modbus, MQTT/Sparkplug B, BMS control loops, PID tuning, VFD sequencing, chiller staging logic, and safety interlocks are covered in depth in the Controls & Automation module below.
Proving the facility works — before IT load arrives
Commissioning (Cx) is the systematic verification that every system performs according to the design intent. In mission-critical facilities, it follows five levels, each building on the last. Skipping levels means discovering failures under live load, which can cost millions per hour.
The commissioning agent (CxA) is independent from the installing contractor — they represent the owner's interest and verify the contractor's work. All deficiencies are tracked in an issues log with severity, responsible party, and resolution deadline.
Single pane of glass — from rack to C-suite
- • BMS — HVAC status, temps, setpoints, alarms
- • EPMS — power at every distribution stage
- • CMDB — asset inventory, serial numbers, rack positions
- • Network management — switch port mapping, bandwidth
- • Rack PDU — per-server power, environmental sensors
- • CDU / liquid cooling — flow, temp, pressure, leak status
- • Capacity planning — power, cooling, space available per rack/row/hall
- • Real-time PUE — computed from EPMS data, trended over time
- • What-if modeling — "can I add 20 racks to Hall B?"
- • Alarm correlation — root cause across mechanical + electrical
- • Change management — track every rack move/add/change
- • Compliance — SLA uptime reporting, environmental audit trail
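Two of the functions in the list above, real-time PUE and a simple what-if check, reduce to short calculations. The sketch below is deliberately simplified: it checks power headroom only (a real DCIM what-if also checks cooling and space), and the function names and sample numbers are assumptions.

```python
def pue(total_facility_kw: float, it_load_kw: float) -> float:
    """Power usage effectiveness: total facility power / IT power."""
    if it_load_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_load_kw

def can_add_racks(n_racks: int, kw_per_rack: float,
                  it_load_kw: float, it_capacity_kw: float) -> bool:
    """Power-only what-if: does the hall have headroom for n more racks?"""
    return it_load_kw + n_racks * kw_per_rack <= it_capacity_kw

# Hypothetical hall: 10 MW of IT load, 12 MW drawn at the utility meter.
print(f"PUE = {pue(12_000, 10_000):.2f}")
print("Add 20 racks at 120 kW each?",
      can_add_racks(20, 120, it_load_kw=10_000, it_capacity_kw=13_000))
```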
In AI factories, DCIM is evolving toward real-time digital twins and ML-driven optimization — adjusting cooling setpoints and power distribution automatically based on predicted GPU workload patterns.
Why training is bandwidth-bound
Every training step, gradients computed on each GPU must be summed across the entire cluster: an all-reduce over billions of parameters. If the network can't keep up, GPUs sit idle. That's why hyperscalers build dedicated rail-optimized fabrics with one 800 Gb/s network interface per GPU.
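A back-of-the-envelope estimate shows why the network matters. The sketch below uses the standard bandwidth term for a ring all-reduce, in which each GPU sends and receives 2·(N−1)/N times the gradient size. It ignores latency, compute/communication overlap, and hierarchical or sharded schemes, and the 70-billion-parameter model with 2-byte gradients is a hypothetical example.

```python
def ring_allreduce_seconds(n_gpus: int, n_params: float,
                           bytes_per_grad: int, link_gbps: float) -> float:
    """Bandwidth-only time for one ring all-reduce of the full gradient."""
    payload_bytes = n_params * bytes_per_grad
    link_bytes_per_s = link_gbps * 1e9 / 8
    traffic_per_gpu = 2 * (n_gpus - 1) / n_gpus * payload_bytes
    return traffic_per_gpu / link_bytes_per_s

t = ring_allreduce_seconds(n_gpus=1024, n_params=70e9,
                           bytes_per_grad=2, link_gbps=800)
print(f"~{t:.2f} s per full-gradient all-reduce")
```

Under these assumptions each step would spend a few seconds on communication alone, which is why real systems overlap the all-reduce with the backward pass and shard gradients across parallelism dimensions.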
The data pipeline
Trillions of tokens, already tokenized, are sharded across a parallel filesystem (which may be vendor-specific). Data loaders stream shards into GPU memory while the previous batch is still being processed; overlap is everything.
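The overlap idea can be illustrated with a bounded queue and a background thread. This is a toy sketch using only the standard library: `load_shard` is a hypothetical stand-in for reading a shard from storage, and production loaders use framework-specific prefetching instead.

```python
import queue
import threading

def load_shard(shard_id):
    """Hypothetical stand-in for reading one tokenized shard from storage."""
    return [shard_id * 10 + i for i in range(4)]

def prefetching_loader(shard_ids, depth=2):
    """Yield shards in order while a background thread loads the next ones."""
    buffer = queue.Queue(maxsize=depth)
    done = object()   # sentinel marking the end of the stream

    def producer():
        for shard_id in shard_ids:
            buffer.put(load_shard(shard_id))   # blocks when the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = buffer.get()
        if batch is done:
            return
        yield batch

for batch in prefetching_loader(range(3)):
    print(batch)   # the consumer works on this while the producer loads ahead
```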
Fire, leak, EPO
VESDA (laser-based air sampling) for earliest smoke detection. Clean-agent suppression (FM-200, Novec 1230) for IT rooms. Pre-action sprinklers (double interlock) for high-value spaces. EPO (Emergency Power Off) kills entire zones — controversial due to nuisance activation risk but code-required in many jurisdictions. Leak detection cables along every pipe route and under raised floors.
You Can't Manage What You Can't Measure
Why industrial monitoring matters
An AI datacenter isn't just a building with servers — it's an industrial plant running at the thermal and electrical edge of what physics allows. A single 120 kW GPU rack operates at power densities that would be classified as heavy industrial in any other context. When a chiller valve fails at 2 AM and rack inlet temperatures start climbing, you have minutes — not hours — before thermal throttling degrades a multi-million-dollar training run.
This is why every serious datacenter operator invests as heavily in monitoring and control systems as in the compute hardware itself. The monitoring infrastructure is the operational infrastructure. Without it, you're flying blind through a thunderstorm.
The control stack in a mission-critical facility is itself a layered architecture: field devices at the bottom, supervisory systems in the middle, and DCIM analytics at the top. Each layer serves a different time scale: PLCs react in milliseconds, BMS loops in seconds, DCIM analytics in minutes to hours.
From field device to operations center
IT network: DCIM, enterprise analytics, cloud dashboards, ERP. Ignition Cloud Edition, Cirrus Link cloud injectors (AWS/Azure/GCP), Sepasoft ERP connectors, Omniverse digital twins. Directory services, DNS, mail servers.
Central Ignition Gateway (on-prem), SQL historian, MES, Sepasoft modules (SPC, batch, track & trace, OEE). Plant-wide operations, scheduling, and reporting.
Local servers, clients (Perspective/Vision), Ignition Edge gateways (IIoT + Panel), Cirrus Link modules. Live one-line diagrams, alarming, trending, reporting. Operator commands flow down from here.
PLCs, RTUs, IEDs, power meters (Schneider ION/PM), VFDs, motor starters: any device with onboard logic or a processor. These execute closed-loop control (valve modulation, speed control, generator paralleling, UPS bypass) and communicate via Modbus TCP, OPC-UA, and similar protocols.
Dumb field instruments with no onboard processing: temperature sensors (RTDs, thermistors), pressure transducers, DP sensors, flow switches, limit switches, contact closures, leak detection cables, smoke detectors, control valves, damper actuators. Output is a raw signal (4–20 mA, 0–10 V, dry contact).
How many alarms does a datacenter generate?
Without alarm rationalization, operators drown in noise. Best practice: <1 actionable alarm per operator per 10 minutes (ISA-18.2).
The workhorse of industrial automation
A PLC (programmable logic controller) is an industrial digital computer purpose-built for real-time control. Where a server runs an OS and applications, a PLC runs a deterministic scan cycle: (1) read all inputs, (2) execute the control program, (3) update all outputs, (4) handle communications. Typical scan time: 1–20 ms, fast enough to catch a pump cavitation event before damage occurs.
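The scan cycle can be mimicked in a few lines of Python. This only illustrates the read, execute, write ordering; a real PLC runs compiled IEC 61131-3 code with hard timing guarantees. The pump logic and signal names are hypothetical.

```python
def scan(field_inputs: dict, outputs: dict) -> dict:
    """One simulated PLC scan: read inputs, run logic, write outputs."""
    # (1) Read all inputs into a snapshot so logic sees consistent values.
    inputs = dict(field_inputs)
    # (2) Execute the control program: start/seal-in with stop conditions.
    pump_run = ((inputs["start_pb"] or outputs["pump_run"])
                and not inputs["e_stop"]
                and inputs["level_ok"])
    # (3) Update all outputs at once.
    outputs["pump_run"] = pump_run
    # (4) Communications would be serviced here before the next scan.
    return outputs

outputs = {"pump_run": False}
sequence = [
    {"start_pb": True,  "e_stop": False, "level_ok": True},   # operator presses start
    {"start_pb": False, "e_stop": False, "level_ok": True},   # button released
    {"start_pb": False, "e_stop": True,  "level_ok": True},   # E-stop pressed
]
for step in sequence:
    print(scan(step, outputs)["pump_run"])
```

The pump stays on after the start button is released because the previous output feeds back into the logic, and it drops out as soon as the E-stop input goes true.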
For NVIDIA data centers, expect Beckhoff TwinCAT or Siemens TIA Portal for mechanical controls, and Schneider EcoStruxure for power/BMS.
How PLCs are programmed
PLC programming uses IEC 61131-3 languages. The two most common:
Ladder Diagram (LD): Visual language resembling electrical relay circuits. Series contacts = AND, parallel contacts = OR. Core elements: NO/NC contacts, output coils, timers (TON/TOF), counters (CTU/CTD), latch/unlatch.
Start pump when conditions met:
─┤Start_PB├──┤NOT E_Stop├──┤Level_OK├──(Pump_Run)─
NO NC NO COIL
Seal-in circuit:
─┬─┤Start_PB├─┬─┤NOT E_Stop├──(Pump_Run)─
└─┤Pump_Run├─┘
(seal)
Structured Text (ST): High-level language similar to Pascal. Preferred for complex math, state machines, and data manipulation.
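(* Call the timer every scan so Stage_Timer.Q can update. It runs only
   while chiller 1 is running and supply temperature is still high. *)
Stage_Timer(IN := Chiller_1_Running AND (ChW_Supply > Setpoint + Deadband),
            PT := T#300s);
IF ChW_Supply > Setpoint + Deadband THEN
    IF NOT Chiller_1_Running THEN
        Chiller_1_Start := TRUE;
    ELSIF Stage_Timer.Q THEN
        Chiller_2_Start := TRUE;
    END_IF;
END_IF;
The operator's window into the plant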
SCADA (supervisory control and data acquisition) provides a centralized interface to monitor and control industrial processes.
The platform NVIDIA uses
Ignition is a modern, Java-based SCADA platform with a web gateway model. Key differentiators: unlimited licensing (one server price, unlimited clients and tags), cross-platform (Windows/Linux/macOS), modular architecture, and Python scripting via Jython.
# Ignition scripting (Jython)
temp = system.tag.readBlocking(
    ["[default]Chiller_1/ChW_Supply_Temp"]
)[0].value
system.tag.writeBlocking(
    ["[default]Chiller_1/Setpoint"], [42.0]
)
Ignition runs as a single gateway service — a web server hosting a designer IDE, client sessions, device connections, and module runtime all from one process. The gateway exposes a web UI on port 8088 (HTTP) or 8043 (HTTPS) for admin, and serves Perspective sessions to any browser. Unlike legacy SCADA, there are no seat licenses: you buy one gateway, attach unlimited screens, clients, and tags.
Tags are Ignition's core abstraction — every data point in the system is a tag. Tags live in a hierarchical folder structure and can be organized by system, building, or equipment.
Perspective is Ignition's HTML5 visualization module — the successor to Vision. Operators open a browser tab (or a native Perspective Workstation app), authenticate via SAML/LDAP/AD, and see live plant graphics. Key concepts:
How equipment talks to each other
A single datacenter may contain equipment from 15+ vendors. Getting them to communicate reliably is one of the hardest integration challenges in the industry. These are the core protocols:
OPC-UA (OPC Unified Architecture) replaces the legacy COM/DCOM-based OPC Classic with a platform-independent, secure protocol. The server exposes a hierarchical node tree (Objects → Variables → Methods). Clients browse, read/write, and subscribe to changes; subscriptions push data on change, eliminating polling overhead. Built-in encryption, certificate-based auth.
Primary standard for PLC-to-SCADA and SCADA-to-IT integration. Ignition's OPC-UA server is a key integration point.
The simplest and most widely deployed industrial protocol (since 1979). Modbus TCP runs over Ethernet port 502; Modbus RTU runs over RS-485 serial. Data organized as registers: Coils (R/W bits), Discrete Inputs (read-only bits), Input Registers (read-only 16-bit), Holding Registers (R/W 16-bit). Common function codes: FC03 (read holding), FC04 (read input), FC06 (write single register), FC16 (write multiple).
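To make the register model concrete, the sketch below builds a Modbus TCP "read holding registers" (FC03) request and decodes a response using only `struct`. The frame layout follows the public Modbus specification (7-byte MBAP header plus PDU); the function names and the sample response bytes are illustrative, and no network I/O is performed.

```python
import struct

def build_read_holding_request(transaction_id, unit_id, start_addr, quantity):
    """MBAP header + PDU for function code 0x03 (read holding registers)."""
    pdu = struct.pack(">BHH", 0x03, start_addr, quantity)
    # MBAP: transaction id, protocol id (0 = Modbus), length, unit id.
    # The length field counts the unit id byte plus the PDU.
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu

def parse_read_holding_response(frame):
    """Return the list of 16-bit register values from an FC03 response."""
    _tid, _proto, _length, _unit, fc, third = struct.unpack(">HHHBBB", frame[:9])
    if fc & 0x80:   # exception response: the next byte is the exception code
        raise ValueError(f"Modbus exception code {third}")
    byte_count = third
    return list(struct.unpack(f">{byte_count // 2}H", frame[9:9 + byte_count]))

request = build_read_holding_request(1, 1, start_addr=0, quantity=2)
print(request.hex())

# Hand-built sample response carrying two registers: 42 and 420.
response = bytes.fromhex("00010000000701030400" "2a" "01a4")
print(parse_read_holding_response(response))
```

Note that registers are raw 16-bit words; scaling (for example tenths of a degree) and 32-bit values split across two registers are device-specific and come from the vendor's register map.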
Every power meter, VFD, and simple sensor speaks Modbus. No security features — relies on network segmentation.
MQTT is a lightweight publish/subscribe messaging protocol designed for constrained networks. Unlike request/response protocols (HTTP, Modbus), MQTT decouples producers from consumers: publishers don't need to know who's listening, and subscribers don't need to know who's sending. This makes it ideal for IIoT and data center monitoring where thousands of sensors feed data to multiple consumers (SCADA, DCIM, analytics, digital twins).
Central message router. All clients connect to the broker — never directly to each other. The broker receives published messages, filters by topic, and delivers to matching subscribers. Examples: HiveMQ (enterprise), Mosquitto (open-source), EMQX (high-scale).
Hierarchical UTF-8 strings using / as delimiter. Example: site/bldg-A/mech/chiller/CH-1/chwst. Wildcards: + matches one level, # matches all remaining levels. site/bldg-A/mech/+/+/chwst gets all chiller supply temps.
A client publishes a message to a topic. Any client subscribed to that topic (or a matching wildcard) receives it. One publisher can feed many subscribers. Many publishers can feed one subscriber. No coupling.
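The wildcard rules described above are simple enough to implement in a few lines. Brokers do this matching internally; the sketch below is only an illustration of the semantics (it ignores the special handling of topics beginning with `$`), and the function name `topic_matches` is hypothetical.

```python
def topic_matches(pattern: str, topic: str) -> bool:
    """Return True if an MQTT subscription pattern matches a topic name."""
    p_levels = pattern.split("/")
    t_levels = topic.split("/")
    for i, p in enumerate(p_levels):
        if p == "#":
            # Multi-level wildcard: valid only as the last level,
            # matches everything from here down.
            return i == len(p_levels) - 1
        if i >= len(t_levels):
            return False
        if p != "+" and p != t_levels[i]:   # '+' matches exactly one level
            return False
    return len(p_levels) == len(t_levels)

print(topic_matches("site/bldg-A/mech/+/+/chwst",
                    "site/bldg-A/mech/chiller/CH-1/chwst"))
print(topic_matches("site/bldg-A/mech/#",
                    "site/bldg-A/mech/ct/CT-2/fan-speed"))
print(topic_matches("site/+/elec/swgr/+/kw",
                    "site/bldg-A/elec/pdu/PDU-R3-A/phase-a-amps"))
```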
| Level | Guarantee | Handshake | DC Use Case |
|---|---|---|---|
| QoS 0 | At most once (fire & forget) | None — send and move on | High-frequency sensor telemetry (temp every 1s). Losing one reading is fine — next one is 1s away. |
| QoS 1 | At least once | PUBACK from broker | Alarm notifications, setpoint changes. Must arrive but duplicates are tolerable. Most common in IIoT. |
| QoS 2 | Exactly once | 4-step handshake (PUBREC/PUBREL/PUBCOMP) | Billing/metering data, command acknowledgments. High overhead — use sparingly. |
Broker stores the last message published to a topic with the “retain” flag. When a new subscriber connects, it immediately gets the retained message — no waiting for the next publish cycle. Critical for DCIM: a new dashboard instance instantly shows current values instead of blank until the next sensor poll.
On connect, a client registers a “will” message with the broker. If the client disconnects unexpectedly (network drop, crash), the broker publishes the will message on its behalf. Used to signal device offline status — e.g., site/bldg-A/edge-gw-1/status → OFFLINE. Sparkplug B formalizes this as “death certificates.”
With cleanSession=false, the broker stores a client's subscriptions and queues QoS 1 and QoS 2 messages while it's offline. When the client reconnects, it receives the queued messages. Essential for edge gateways that may lose WAN connectivity; combined with Ignition's Store & Forward, this targets zero data loss.
Good topic design enables flexible subscriptions. A well-designed hierarchy lets DCIM subscribe to everything, while a cooling engineer subscribes only to their building's mechanical data.
# Topic hierarchy pattern:
{site}/{building}/{system}/{subsystem}/{equipment}/{point}
# Examples:
site-sv1/bldg-A/mech/chiller/CH-1/chwst → 42.1°F
site-sv1/bldg-A/mech/chiller/CH-1/status → RUNNING
site-sv1/bldg-A/mech/ct/CT-2/fan-speed → 72%
site-sv1/bldg-A/elec/swgr/MSB-1/kw → 4250
site-sv1/bldg-A/elec/pdu/PDU-R3-A/phase-a-amps → 82.4
site-sv1/bldg-A/env/row-12/rack-4/inlet-temp → 74.8°F
# Subscription examples:
site-sv1/bldg-A/mech/# → ALL mech data, bldg A
site-sv1/+/elec/swgr/+/kw → ALL switchgear kW, all bldgs
site-sv1/bldg-A/mech/chiller/+/chwst → ALL chiller supply temps
# → EVERYTHING (DCIM firehose)
| Aspect | Polling (Modbus/OPC-UA) | Pub/Sub (MQTT) |
|---|---|---|
| Data flow | SCADA asks each device on schedule | Devices publish when data changes or on interval |
| Bandwidth | Constant — polls even when nothing changes | Proportional to actual change rate |
| Latency | Worst-case = poll interval | Near-instant on change |
| Scale | More devices = slower cycle | Broker handles 100k+ connections |
| WAN friendly | Fragile over unreliable links | Built for constrained/intermittent networks |
| Adding consumers | New connection per consumer | Just subscribe — no impact on publisher |
Note: OPC-UA also supports subscriptions (server pushes on change) — but MQTT is purpose-built for the edge-to-cloud segment where OPC-UA's rich data model isn't needed.
Raw MQTT is payload-agnostic: it delivers bytes without caring what they mean. Sparkplug B (maintained by the Eclipse Foundation) adds an application-layer specification that standardizes how industrial data is structured, encoded, and managed over MQTT. It turns MQTT from a transport protocol into a complete IIoT data infrastructure.
spBv1.0/{group_id}/{msg_type}/{edge_node_id}/{device_id}
# Message types:
NBIRTH — Edge node comes online (all metrics)
NDEATH — Edge node goes offline
NDATA — Edge node metric updates
DBIRTH — Device comes online
DDEATH — Device goes offline
DDATA — Device metric updates
NCMD — Command to edge node
DCMD — Command to device
# Example:
spBv1.0/NVIDIA-SV1/DDATA/EDGE-GW-BLDG-A/CH-1
→ Chiller 1 data from Building A edge gateway
When an edge node connects, it publishes an NBIRTH message containing ALL of its metrics with metadata (name, datatype, engineering units, alias). This acts as a self-describing schema: any subscriber immediately knows what data this node provides.
The node also registers an LWT with the broker: an NDEATH message. If the node drops, the broker publishes NDEATH. Every consumer knows instantly that this node's data is stale.
Why it matters: In raw MQTT, if a device just stops publishing, consumers don't know if the device is offline or if the value just hasn't changed. Sparkplug B eliminates this ambiguity.
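How a consumer uses birth and death messages can be sketched with a small state tracker. This toy version looks only at the topic namespace shown above; it does not decode Protobuf payloads or check sequence numbers, and the names `parse_sparkplug_topic` and `NodeTracker` are hypothetical.

```python
def parse_sparkplug_topic(topic: str) -> dict:
    """Split spBv1.0/{group_id}/{msg_type}/{edge_node_id}[/{device_id}]."""
    parts = topic.split("/")
    if len(parts) < 4 or parts[0] != "spBv1.0":
        raise ValueError(f"not a Sparkplug B topic: {topic}")
    return {
        "group_id": parts[1],
        "msg_type": parts[2],
        "edge_node_id": parts[3],
        "device_id": parts[4] if len(parts) > 4 else None,
    }

class NodeTracker:
    """Tracks whether each edge node's data should be treated as live or stale."""
    def __init__(self):
        self.online = {}

    def on_message(self, topic: str) -> None:
        t = parse_sparkplug_topic(topic)
        key = (t["group_id"], t["edge_node_id"])
        if t["msg_type"] == "NBIRTH":
            self.online[key] = True
        elif t["msg_type"] == "NDEATH":
            self.online[key] = False

    def is_stale(self, group_id: str, edge_node_id: str) -> bool:
        return not self.online.get((group_id, edge_node_id), False)

tracker = NodeTracker()
tracker.on_message("spBv1.0/NVIDIA-SV1/NBIRTH/EDGE-GW-BLDG-A")
tracker.on_message("spBv1.0/NVIDIA-SV1/DDATA/EDGE-GW-BLDG-A/CH-1")
print(tracker.is_stale("NVIDIA-SV1", "EDGE-GW-BLDG-A"))   # node is live
tracker.on_message("spBv1.0/NVIDIA-SV1/NDEATH/EDGE-GW-BLDG-A")
print(tracker.is_stale("NVIDIA-SV1", "EDGE-GW-BLDG-A"))   # node data is stale
```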
Sparkplug B uses Google Protocol Buffers (Protobuf) for payload encoding instead of JSON or XML. Result: 3–10× smaller payloads, faster serialization/deserialization. Each metric carries: name, alias (numeric shorthand), timestamp, datatype, and value. After BIRTH, DDATA messages send only changed metrics using aliases — extremely efficient.
Ignition implements Sparkplug B via Cirrus Link modules:
- • MQTT Engine — On the central gateway. Subscribes to the broker and auto-creates tags from BIRTH messages.
- • MQTT Transmission — On edge gateways. Publishes OPC-UA tag data as Sparkplug B messages to the broker.
- • MQTT Distributor — Optional built-in MQTT broker within Ignition (for simpler deployments without a standalone broker).
- • TLS 1.2+ — Encrypts all traffic between clients and broker. Port 8883 (MQTTS) instead of 1883.
- • Mutual TLS (mTLS) — Both client and broker present certificates. Standard for IIoT.
- • Certificate rotation — Edge devices need automated cert renewal (SCEP, EST, or custom CA).
- • Username/password — Basic auth (always over TLS).
- • ACLs — Topic-level read/write permissions per client. E.g., an edge gateway can only publish to its own subtree.
- • Client ID validation — Broker enforces unique client IDs; duplicate connection = kick the old one.
- • DMZ placement — Broker sits in the IT/OT DMZ. OT side publishes in; IT side subscribes out. No inbound connections from IT to OT.
A single broker is a single point of failure. Production deployments use clustered brokers (HiveMQ cluster, EMQX cluster) or active/standby pairs with shared persistent storage. Clients configure multiple broker endpoints and reconnect automatically on failure. Sparkplug B's birth/death mechanism ensures state is rebuilt after any failover.
BACnet/IP runs over UDP port 47808 (0xBAC0). Data organized as objects: Analog Input/Output/Value, Binary I/O/V, Multi-State, Trend Log, Schedule. Each object has properties (Present-Value, Status-Flags, Description). Supports COV (Change of Value) subscriptions. Many data centers still use BACnet for HVAC and BMS. Gateways translate BACnet ↔ Modbus or BACnet ↔ OPC-UA at system boundaries.
| Feature | OPC-UA | Modbus TCP | MQTT | BACnet IP |
|---|---|---|---|---|
| Model | Client/Server | Master/Slave | Pub/Sub | Client/Server |
| Security | Built-in TLS | None native | TLS optional | Limited |
| Data Model | Rich (typed nodes) | Flat registers | Payload agnostic | Object-based |
| Best For | PLC↔SCADA | Meters, simple I/O | IIoT, edge, cloud | Building HVAC |
| Scalability | Excellent | Limited (247 dev) | Excellent | Good |
Electrical Power Monitoring
Every branch circuit and every bus section has a power meter reporting voltage, current, kW, kWh, power factor, and THD in real time. This data feeds PUE calculations, capacity forecasting, and, critically, fault detection. A 2% phase imbalance caught by EPMS today can prevent a busbar failure next month.
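A phase-imbalance check is one of the simplest analytics an EPMS runs. The sketch below uses one common definition (maximum deviation from the three-phase average, divided by the average); other definitions exist, and the 2% alarm threshold and function name are assumptions for illustration.

```python
def phase_imbalance_pct(phase_values):
    """Max deviation from the mean, as a percentage of the mean."""
    mean = sum(phase_values) / len(phase_values)
    if mean == 0:
        raise ValueError("mean of phase values is zero")
    max_dev = max(abs(v - mean) for v in phase_values)
    return 100.0 * max_dev / mean

amps = [100.0, 102.0, 98.0]          # hypothetical phase A/B/C currents
imbalance = phase_imbalance_pct(amps)
print(f"Imbalance: {imbalance:.1f}%", "ALARM" if imbalance >= 2.0 else "ok")
```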
Common platforms: Schneider ION meters, SEL, Dranetz. Communication via Modbus TCP/RTU. Energy dashboards aggregate for billing, allocation, and sub-metering by tenant or department.
Cooling Plant Control
The BMS controls chiller staging, cooling tower fan speed, pump VFDs, CRAH/CRAC units, and economizer dampers. In an AI datacenter running direct-to-chip liquid cooling, the BMS must maintain CDU supply temperature within ±1°C while dynamically responding to GPU workload changes that can swing rack power by 40% in seconds.
PID loops everywhere: supply air temp, chilled water ΔP, condenser water temperature. Tuning parameters (Kp, Ki, Kd) directly impact stability and energy efficiency.
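For readers who have not seen a PID loop in code, here is a minimal discrete-time version with output clamping and a simple anti-windup rule (the integral is frozen while the output is saturated). It is a teaching sketch rather than a tuned controller; the class name, the sample time, and the direct-acting sign convention are assumptions.

```python
class PID:
    """Discrete PID with output limits and conditional-integration anti-windup."""
    def __init__(self, kp, ki, kd, out_min, out_max):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_min, self.out_max = out_min, out_max
        self.integral = 0.0
        self.prev_error = None

    def update(self, setpoint, measurement, dt):
        error = setpoint - measurement          # direct-acting loop
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        candidate = self.integral + error * dt
        raw = self.kp * error + self.ki * candidate + self.kd * derivative
        output = min(max(raw, self.out_min), self.out_max)
        if output == raw:                       # not saturated: accept the integral
            self.integral = candidate
        self.prev_error = error
        return output

# Example: pump speed (%) driven by a differential-pressure error.
loop = PID(kp=2.0, ki=0.5, kd=0.0, out_min=30.0, out_max=100.0)
for dp in (4.0, 6.0, 9.0, 11.5):
    print(f"DP={dp} -> speed {loop.update(setpoint=12.0, measurement=dp, dt=1.0):.1f}%")
```

Higher Kp reacts faster but risks oscillation; Ki removes steady-state offset at the cost of overshoot, which is why the clamp and anti-windup matter when a VFD sits at its minimum or maximum speed.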
Non-negotiable systems
VESDA (laser-based air sampling) for earliest smoke detection. Clean-agent suppression (FM-200, Novec 1230) for IT rooms. Pre-action sprinklers (double interlock) for high-value spaces. EPO (Emergency Power Off) kills power to entire zones. These systems have hard interlocks with BMS and EPMS: a fire alarm can shut down HVAC and trip power to a fire zone. Governed by NFPA 75/76.
Leak detection cables run under raised floors and along every pipe route — critical for liquid-cooled environments where a CDU leak can damage millions in hardware.
The bridge between design and code
A Sequence of Operations (SOO) is the definitive document describing how a system operates under all conditions. It's the spec the PLC programmer codes from, the commissioning agent tests against, and the operations team references for troubleshooting.
SOO excerpt: lead/lag logic
Lead Chiller Start:
IF ChW_Return > Setpoint + 2°F
AND Cooling_Demand > 20%
AND No_Active_Alarms
THEN Start Lead_Chiller
WAIT 300s (anti-recycle timer)
Lag Chiller Stage-On:
IF Load% > 75% on running chillers
(sustained 10 min)
OR ChW_Supply > Setpoint + 3°F
(sustained 5 min)
THEN Start next chiller in rotation
DP Setpoint Reset:
Target: most-open valve at 90%
DP Setpoint: 12 PSID (range 5-25)
PID: Kp=2.0, Ki=0.5, Kd=0.0
Output: Pump VFD 30% min - 100% max
Every setpoint, every timer, every PID gain in the SOO becomes a configurable parameter in the PLC program. The commissioning agent verifies each one during L3/L4 testing.
Complete SOO with explanations
Below is a complete, annotated Sequence of Operations for a chilled water plant typical of a hyperscale AI data center. Each section includes an explanation of why it exists and what reviewers/programmers should focus on.
SYSTEM: Central Chilled Water Plant, Data Hall A
CAPACITY: 3 × 1,000-ton centrifugal water-cooled chillers (N+1)
DESIGN CONDITIONS:
  ChW Supply Temp (ChWST): 42 °F
  ChW Return Temp (ChWRT): 55 °F
  Design ΔT: 13 °F
  CW Supply Temp (CWST): 85 °F (summer design)
  CW Return Temp (CWRT): 95 °F
EQUIPMENT:
  CH-1, CH-2, CH-3     Chillers (York YK, VFD compressors)
  CHWP-1, -2, -3       Primary CHW Pumps (VFD, 1,500 GPM ea)
  CWP-1, -2, -3        Condenser Water Pumps (constant, 2,200 GPM)
  CT-1, CT-2, CT-3     Cooling Towers (variable-speed fans)
MODE DESCRIPTION
──── ──────────────────────────────────────────────
AUTO BMS/SCADA controls all staging and sequencing.
All equipment HOA switches must be in AUTO.
MANUAL Operator commands via HMI. Interlocks active.
Staging logic disabled.
OFF All equipment commanded off. Safeties monitored.
EMERGENCY Fire alarm, EPO, or critical leak detection.
Immediate shutdown. See §8.
STANDBY Ready to start, waiting for cooling demand
signal from DCIM (IT load > 50 kW threshold).
PREREQUISITES (all must be TRUE):
□ System mode = AUTO or STANDBY
□ No active critical alarms (Level 1 or 2)
□ ChW loop pressure > 15 PSIG (system filled)
□ All HOA switches = AUTO at MCC
□ CW isolation valves OPEN (limit switch FB)
□ No active fire suppression signal
SEQUENCE:
STEP 1: Start lead CW Pump
WAIT: Flow switch TRUE within 30s
FAIL: Alarm "CW Pump Fail", halt
STEP 2: Start lead CT fan(s) at 30% minimum
WAIT: 15s for fan to reach speed
STEP 3: Open chiller CW isolation valves
WAIT: End-switch OPEN within 45s
STEP 4: Start lead CHW Pump at 40% speed
WAIT: Flow > 800 GPM within 30s
STEP 5: Command lead Chiller to START
WAIT: Chiller RUNNING within 120s
NOTE: Chiller has internal start seq
(oil pump, guide vanes). Don't bypass.
STEP 6: Release PID control:
- CHWP → DP setpoint (see §5)
- CT fan → CW return temp (see §6)
- Chiller → ChWST setpoint (42°F)
ANTI-RECYCLE: 300s min between starts
per chiller (compressor protection)
NORMAL SHUTDOWN:
1. Command chiller STOP (unload guide vanes)
WAIT: Chiller STOPPED within 180s
2. Run CHWP at 40% for 120s (post-circ)
3. Stop CHW Pump
4. Close CW isolation valves
5. Stop CW Pump
6. Ramp down CT fans over 30s
EMERGENCY SHUTDOWN (see §8 triggers):
ALL equipment → IMMEDIATE STOP
Exception: CHWP runs 30s for chiller
protection if safe to do so
DP SETPOINT: 12 PSID (range: 5–25 PSID)
DP SENSOR: Most remote CRAH / coil header
OUTPUT: VFD speed → CHWP-1/2/3
PID TUNING:
  Kp = 2.0
  Ki = 0.5
  Kd = 0.0
  Output range: 30% min – 100% max
DP RESET (energy optimization):
  Target: most-open CRAH valve = 85–95%
  IF all valves < 70%: DP setpoint −0.5 PSI
  IF any valve > 95%: DP setpoint +0.5 PSI
  Rate limit: 1 PSI per 5 minutes
  Floor: never below 6 PSID
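The DP reset rules above translate almost line for line into code. The sketch below shows one evaluation of the reset logic; it assumes the caller invokes it no more than twice per five minutes so that the ±0.5 PSI steps respect the 1 PSI per 5 minutes rate limit. The function name and the 25 PSID ceiling (taken from the stated setpoint range) are assumptions.

```python
def reset_dp_setpoint(current_sp, valve_positions_pct,
                      step=0.5, floor=6.0, ceiling=25.0):
    """One evaluation of the DP setpoint reset based on CRAH valve positions."""
    most_open = max(valve_positions_pct)
    if most_open > 95.0:        # some coil is starved: raise the setpoint
        new_sp = current_sp + step
    elif most_open < 70.0:      # all valves throttled: lower the setpoint
        new_sp = current_sp - step
    else:                       # inside the dead zone: hold
        new_sp = current_sp
    return min(max(new_sp, floor), ceiling)

print(reset_dp_setpoint(12.0, [60, 65, 50]))   # all valves below 70%
print(reset_dp_setpoint(12.0, [80, 96, 70]))   # one valve above 95%
print(reset_dp_setpoint(6.0,  [10, 20, 15]))   # already at the floor
```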
CW RETURN SETPOINT: 78°F (range 65–85°F)
OUTPUT: CT fan VFD speed (all parallel)
PID: Kp=3.0 Ki=0.8 Kd=0.0
Output: 20%–100%
OPTIMAL RESET:
CW SP = Wet_Bulb + 7°F (min 65°F)
FREE COOLING / WATERSIDE ECON:
IF CW_Supply < ChW_Return − 3°F
(sustained 10 min):
→ Enable plate HX bypass
→ Modulate valve for ChWST SP
→ Stage down chillers if HX
handles full load
STAGE-ON (add chiller):
  Load > 80% sustained 10 min
  OR ChWST > SP+3°F sustained 5 min
  → Start next in rotation
  POST-STAGE LOCKOUT: 15 min
STAGE-OFF (remove chiller):
  Load < 30% per unit sustained 15 min
  AND >1 chiller running
  → Stop lag chiller (lowest hours)
  POST-STAGE LOCKOUT: 20 min
ROTATION:
  Lead rotates weekly (Sun 02:00)
  OR runtime delta > 500 hrs
  Sequence: CH-1→CH-2→CH-3→CH-1
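The "sustained for N minutes" conditions in the staging rules are usually implemented as on-delay timers. The sketch below models only the stage-on decision and the rotation order; the post-stage lockout and stage-off logic are omitted for brevity, and all class and function names are hypothetical.

```python
class Sustained:
    """True once a condition has held continuously for hold_s seconds."""
    def __init__(self, hold_s):
        self.hold_s = hold_s
        self.elapsed = 0.0

    def update(self, active, dt_s):
        self.elapsed = self.elapsed + dt_s if active else 0.0
        return self.elapsed >= self.hold_s

class StageOn:
    """Stage-on request: high load for 10 min OR high ChWST for 5 min."""
    def __init__(self):
        self.high_load = Sustained(600)
        self.high_temp = Sustained(300)

    def update(self, load_pct, chwst_f, sp_f, dt_s):
        by_load = self.high_load.update(load_pct > 80.0, dt_s)
        by_temp = self.high_temp.update(chwst_f > sp_f + 3.0, dt_s)
        return by_load or by_temp

def next_in_rotation(current, order=("CH-1", "CH-2", "CH-3")):
    return order[(order.index(current) + 1) % len(order)]

stage = StageOn()
for minute in range(1, 7):
    request = stage.update(load_pct=60.0, chwst_f=46.0, sp_f=42.0, dt_s=60)
    print(minute, request)
print("Next chiller:", next_in_rotation("CH-3"))
```

Resetting the timer whenever the condition drops out is what prevents a brief spike from starting another chiller.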
| Alarm | Condition | SP | Delay | Severity | Action |
|---|---|---|---|---|---|
| CHW_HI_TEMP | ChWST > SP | +5°F | 60s | Warning | Notify, stage on lag |
| CHW_CRIT_TEMP | ChWST > critical | +8°F | 30s | Critical | All chillers ON, page on-call |
| DP_LOW | DP < minimum | 4 PSID | 30s | Warning | Pump speed → 80% |
| CHILLER_FAULT | Controller fault | — | 0s | Critical | Start standby, page |
| PUMP_FAIL | Flow FALSE while ON | — | 30s | Critical | Start standby pump |
| LEAK_DETECT | Leak sensor active | — | 5s | Critical | Isolate zone, E-shutdown |
| VFD_FAULT | VFD controller fault | — | 0s | Critical | Start standby pump |
| CHW_LO_PRESS | Loop pressure low | 10 PSI | 60s | Warning | Notify — possible leak |
| CW_HI_TEMP | CWRT > limit | 98°F | 120s | Warning | CT fans → 100% |
| CT_FAN_FAULT | Fan VFD / vibration | — | 0s | Warning | Redistribute to other CTs |
SENSOR FAILURE:
ChWST fail → Use ChWRT − design ΔT
Lock pump speed, alarm
DP fail → Lock pump at 70%, alarm
Switch to backup sensor
CW fail → CT fans to 80% fixed
BMS/SCADA COMMS LOSS:
PLC continues last known sequence
Setpoints hold at last value
No staging changes
Alarm on comms restored
Operator must re-enable AUTO
POWER FAILURE:
Chillers trip (need clean power)
Pumps on UPS: auto-restart
Wait 30s for stable generator
Then execute normal startup §3
| Parameter | Default | Range |
|---|---|---|
| ChWST Setpoint | 42°F | 38–50°F |
| DP Setpoint | 12 PSID | 5–25 |
| CW Return SP | 78°F | 65–85°F |
| Stage-On Load | 80% | 60–95% |
| Stage-Off Load | 30% | 15–45% |
| Anti-Recycle | 300s | 180–600s |
| Post-Stage Lock | 15 min | 10–30 min |
| CHWP Min Speed | 30% | 20–40% |
| CT Fan Min | 20% | 15–30% |
| Lead Rotation | 7 days | 1–30 days |
TO DCIM / EPMS:
  → Plant kW total
  → Cooling capacity (tons delivered)
  → PUE contribution data
  → ChWST, ChWRT, ΔT, flow rate
FROM DCIM / EPMS:
  ← IT load (kW): standby→auto trigger
  ← Cooling demand request
TO/FROM FIRE ALARM:
  → Plant running status
  ← Suppression signal → E-shutdown
TO/FROM LEAK DETECTION:
  ← Zone leak → isolate branch
  ← System leak → E-shutdown
TO/FROM ELECTRICAL:
  ← Generator / utility status
  → Plant load (gen load mgmt)
The platform that doesn't exist yet (but should)
Today, the handoff between design, controls engineering, and commissioning is largely manual. A designer creates P&IDs, a controls engineer manually re-interprets them into PLC/SCADA configuration, and a Cx agent manually writes test scripts from the SOO. Each handoff introduces errors and delay. The industry is converging toward a model-driven approach where a single source of truth drives everything downstream.
Define one faceplate per equipment type (valve, pump, VFD), bind to UDTs. Every instance auto-generates its own screen. Ignition Perspective, Siemens WinCC, AVEVA System Platform all support this.
Electrical design in EPLAN exports hardware config to Siemens TIA Portal via AutomationML. IO mapping, rack layout, and network config auto-transfer.
Strict design rules (gray backgrounds, no 3D pipes, color = abnormal state) are codifiable. An AI system could generate compliant screens more consistently than most human designers.
A controls engineer reads the SOO and manually programs PLC logic. Same intent, re-interpreted. No standard for machine-readable SOOs.
Cx agents manually create test procedures from the same SOO the programmer used. Three humans reading one document = three interpretations = defects.
Equipment on a P&ID becomes tags manually. BIM semantic standards (Haystack, Brick Schema) are trying to bridge this, but adoption is slow.
NVIDIA is uniquely positioned to close this gap. Omniverse provides the 3D digital twin — the “HMI” is the model itself with live data overlay. Agentic AI can parse design documents and generate control logic, tag databases, and Cx scripts from a single source of truth. The platform that unifies design → engineering → commissioning into one model-driven pipeline will fundamentally change how fast AI factories deploy. Instead of flat 2D SCADA screens, you get a photorealistic twin where you click a chiller and its faceplate appears with live data — no one needs to manually draw pump graphics when the 3D model already exists.
| Platform | Design | Engineering | Auto HMI | Auto Cx |
|---|---|---|---|---|
| Siemens TIA + EPLAN | ✓ EPLAN → AML export | ✓ PLC + HMI in one IDE | ✓ Faceplate templates | ✗ Manual |
| Rockwell FT Design Hub | ◐ Cloud-based config | ✓ Studio 5000 | ◐ Global objects | ✗ Manual |
| Ignition (Inductive) | ✗ No design tool | ✓ Excellent SCADA | ✓ UDT templates | ✗ Manual |
| Schneider EcoStruxure | ◐ Separate products | ✓ Multiple platforms | ◐ Cx wizards | ◐ Partial |
| NVIDIA Omniverse | ✓ 3D from BIM/CAD | ◐ Data connectors | ✓ Twin IS the HMI | ◐ Emerging (AI) |
No single platform does all four today. The convergence of digital twins + semantic data models + agentic AI will close the gap within 2–5 years.
Trust, but verify — systematically
Commissioning (Cx) is the process that proves every system works as designed: not just individually, but together, under realistic load conditions and failure scenarios.
An un-commissioned datacenter is a datacenter that will fail under load. The question is when, not if.
Where data center CX gets unique
L4/IST is performed with live IT load (or load banks). These are the failure scenarios that must be proven before a facility goes live:
Key documentation: pre-written test scripts with expected results, pass/fail criteria, and space for actual results. Every deviation is a deficiency tracked to resolution.
The unified operational view
DCIM (Data Center Infrastructure Management) provides a single pane of glass across all physical infrastructure. It sits at the top of the control pyramid, consuming data from BMS, EPMS, and IT systems:
Every device tracked: location (building/floor/row/rack/U-position), model, serial, power connections, network connections. Change management for installs, moves, decommissions.
Power capacity per rack/row/room, cooling capacity, physical space, network ports. Power chain visualization: trace the path from utility feed to individual rack.
EPMS → BMS: power readings for PUE calculation, load-based staging. BMS → DCIM: environmental data for capacity planning. EPMS → DCIM: actual power per rack for utilization tracking. All connected via API, SQL, or OPC-UA gateways.
Platforms: Nlyte, Sunbird dcTrack, Schneider EcoStruxure IT. Hyperscalers (including NVIDIA) often build custom DCIM in-house.
End-to-end data architecture from physical sensors through the OT control stack (PLCs → OPC-UA → Ignition SCADA) to IT systems via MQTT, feeding DCIM, analytics, and cloud platforms.
Sensors & actuators (L0, unintelligent) → PLCs/RTUs/IEDs/meters (L1, intelligent) via hardwired I/O. L1 controllers → local Ignition SCADA/HMI & Edge (L2) via OPC-UA. Central Ignition Gateway (L3) aggregates via Gateway Network. Deterministic, real-time control.
Ignition's MQTT Transmission module publishes to a broker in the DMZ using Sparkplug B. Data flows OT→IT only. TLS encrypted, certificate auth, no inbound connections from IT to OT. The broker is the single point of data egress.
DCIM platform subscribes to MQTT broker or pulls from Ignition's historian via SQL/API. Data lake stores raw telemetry for ML. Omniverse digital twin consumes real-time feeds. Grafana/custom dashboards for NOC displays.
Network segmentation for industrial systems
OT (PLCs, SCADA, sensors) prioritizes availability and safety. IT (business systems, cloud) prioritizes confidentiality. The Purdue Model (ISA-95) defines the separation:
OT networks must be segmented from IT with a DMZ. Data flows OT → IT via historians, OPC-UA gateways, or MQTT brokers in the DMZ. Never expose PLCs directly to the IT network. Security standard: IEC 62443.
Virtual replica, real-time data
A digital twin is a virtual replica of a physical system continuously updated with real-time sensor data. NVIDIA Omniverse for data centers provides:
Data pipeline: Sensor → PLC → OPC-UA → Historian → API → Omniverse connector → real-time 3D updates. Uses USD (Universal Scene Description) format.
Why AI workloads break traditional monitoring
Traditional servers draw steady power. GPU clusters swing from idle (~30% TDP) to full load (~100% TDP) in milliseconds when a training batch launches. This creates transient power spikes that stress UPS systems, trip breakers if margins are thin, and confuse capacity planning models built for static loads.
At 80–120 kW per rack, the thermal time constant collapses. A cooling failure that gives you 15 minutes of runway in a 10 kW/rack enterprise hall gives you <3 minutes in an AI hall. Monitoring latency that was "fine" at 60-second poll intervals becomes dangerous — you need sub-second environmental telemetry.
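As a rough check on those figures, here is a sketch that assumes the runway scales inversely with rack power for a fixed thermal mass. That is a simplification (real halls differ in air volume, containment, and chilled-water inertia), and `runway_minutes` is a hypothetical helper.

```python
def runway_minutes(rack_kw, ref_kw=10.0, ref_minutes=15.0):
    """Scale the 15-minute runway of a 10 kW/rack hall to another rack density.

    Assumes the same thermal mass absorbs heat at a rate proportional to
    rack power, so time-to-limit shrinks as 1 / rack_kw.
    """
    return ref_minutes * ref_kw / rack_kw


for kw in (10, 80, 120):
    print(f"{kw:>4} kW/rack -> {runway_minutes(kw):.2f} min")
```

Under this assumption both ends of the 80–120 kW range land below the 3-minute mark quoted above.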
In an AI cluster, all GPUs in a training job start and stop together. This means power and thermal loads are highly correlated across hundreds of racks — the opposite of the statistical diversity that traditional datacenter designs depend on. Your cooling plant must handle 0-to-100% step changes.
Sample electrical device types — make, model, protocol, key data
A controls engineer must know what's on the other end of the wire. Below are representative devices found in a modern AI data center electrical distribution system — the equipment your SCADA/EPMS will monitor and control.
| Device Type | Example Make / Model | Protocol | Key Data Points | Location |
|---|---|---|---|---|
| Revenue Meter | Schneider ION 9000 | Modbus TCP / DNP3 | V (L-L, L-N), A, kW, kVAR, kVA, PF, THD per harmonic, demand (15-min), kWh, frequency | MV switchgear |
| Branch Circuit Monitor | Schneider PM8000 / ION 7650 | Modbus TCP | Per-circuit V, A, kW, kWh, PF, breaker status, alarm thresholds | Floor PDU / RPP |
| Protective Relay | SEL-751 / GE Multilin 489 | Modbus RTU / DNP3 / IEC 61850 | Fault current, trip status, fault type (OC/GF/diff), event log, waveform capture | MV / LV switchgear |
| UPS | Vertiv Liebert EXL S1 / Eaton 93PM | SNMP v3 / Modbus TCP | Input/output V, A, kW, load %, battery SOC, runtime remaining, bypass status, temp, alarm state | UPS room |
| Automatic Transfer Switch | ASCO 7000 Series / Russelectric | Modbus TCP / BACnet | Source 1/2 status, active source, transfer count, transfer time (ms), V per source | MV switchgear room |
| Generator Controller | DEIF AGC-4 / DSE 8610 | Modbus RTU/TCP | kW output, RPM, coolant temp, oil pressure, fuel level, battery V, run hours, start count | Generator yard |
| Intelligent Rack PDU | ServerTech PRO3X / Raritan PX3 | SNMP v3 / Modbus / REST API | Per-outlet V, A, kW, kWh, PF, inlet temp/humidity, outlet switching, alarm thresholds | In-rack |
| Static Transfer Switch | Schneider Galaxy VS STS / Eaton STS | Modbus TCP / SNMP | Active source, preferred source, transfer count, transfer time (<4ms), source quality | Critical distribution |
| Power Quality Analyzer | Dranetz HDPQ / Fluke 1760 | Modbus TCP / proprietary | THD (V&I), voltage sags/swells, transients, flicker, unbalance, EN 50160 compliance | MV bus / critical loads |
Sample mechanical device types — make, model, protocol, key data
The mechanical (cooling) side has its own ecosystem of intelligent devices and dumb field instruments. The BMS/SCADA must integrate all of them into a unified monitoring and control platform.
| Device Type | Example Make / Model | Protocol / Signal | Key Data Points | Location |
|---|---|---|---|---|
| Centrifugal Chiller | York YK / Trane CVHF / Carrier 19XR | BACnet IP / Modbus TCP | ChWST, ChWRT, ΔT, loading %, kW, COP, compressor RPM, refrigerant press/temp, oil temp, alarm codes | Chiller plant |
| Cooling Tower | Evapco AT / BAC Series 3000 | BACnet / Modbus | Fan speed %, CW supply/return temp, basin temp, vibration, fan status, cell enable/disable | Roof / yard |
| VFD (Variable Freq Drive) | ABB ACS880 / Danfoss VLT / Yaskawa GA800 | Modbus TCP / PROFINET / EtherNet/IP | Speed (Hz/RPM), motor A, kW, torque %, run status, fault code, PID setpoint/feedback, drive temp | Pump/fan motors |
| CDU (Coolant Dist Unit) | CoolIT DCLC / Vertiv XDU / Motivair ChilledDoor | Modbus TCP / BACnet IP | IT supply/return temp, facility supply/return temp, flow (GPM), ΔP, leak detect, pump status, conductivity | Row-end / in-row |
| CRAH / AHU | Schneider Uniflair / Vertiv Liebert CW | BACnet IP / Modbus | Supply/return air temp, fan speed, valve position, filter ΔP, coil temp, humidity, unit status | Data hall perimeter |
| Control Valve (2-way) | Belimo / Siemens ACVATIX | 4–20 mA / 0–10 V / BACnet | Position feedback (%), command %, actuator status (open/closed/fault), torque | Chiller/CRAH coils |
| Temp Sensor (RTD/Thermistor) | Siemens QAM2120 / TE Connectivity | 4–20 mA / RTD (Pt100/Pt1000) | Temperature °F/°C (pipe, duct, ambient, immersion) | Pipes, ducts, racks, ambient |
| Differential Pressure Sensor | Setra 231 / Dwyer MS-111 | 4–20 mA / 0–10 V | ΔP across filter, coil, pump, or chiller (PSID / Pa) | Filter, coils, headers |
| Flow Meter | Badger Meter ModMAG / Siemens SITRANS FM | 4–20 mA / Modbus / HART | Flow rate (GPM), totalizer (gallons), flow velocity, direction | CHW/CW mains, CDU loops |
| Leak Detection | RLE Technologies / TraceTek TT-FFS | Dry contact / Modbus RTU | Leak presence (Y/N), leak location (distance along cable), zone ID | Under floor, pipe routes, CDU |
| Smoke / Fire Detection | Xtralis VESDA-E VEA / Honeywell FSL100 | Proprietary / relay / Modbus | Smoke level (obscuration), alert/action/fire thresholds, sampling pipe zone, flow status | Above/below rack, ceiling |
What DCIM needs at the telemetry level — and why
A DCIM platform aggregates thousands of data points into actionable intelligence. But not all data is equal. The points below are the most critical telemetry a DCIM consumes: the data that drives capacity decisions, triggers alarms, calculates efficiency metrics, and keeps the facility running.
- Total facility kW — real-time total site power; denominator for PUE
- kWh (cumulative) — energy billing, carbon footprint, trend analysis
- Power factor — utility penalty if PF < 0.95; indicates harmonic issues
- Peak demand (15-min avg) — drives utility demand charges; capacity trigger
- THD (voltage & current) — harmonic distortion from switch-mode PSUs; transformer derating
- Frequency (Hz) — grid stability indicator; triggers generator start
- Total IT kW — numerator for PUE; sum of all rack PDU readings
- Per-rack kW — capacity utilization per position; stranding detection
- Per-outlet amps — breaker trip risk; load balancing across phases
- UPS load % — headroom for transients; triggers capacity alarm at 80%
- UPS battery SOC & runtime — ride-through availability; replacement scheduling
- Generator run hours & fuel % — maintenance scheduling; fuel delivery trigger
- Rack inlet temp (per-rack) — ASHRAE A1 limit: 64–80°F; alarms at 82°F; GPU throttling risk above 85°F
- Rack exhaust temp — ΔT across rack = proxy for load; abnormal ΔT = airflow issue
- Supply air temp (CRAH output) — control variable for CRAH PID loop
- Return air temp (CRAH intake) — determines cooling demand and valve modulation
- Humidity (%RH) — ASHRAE recommends 8–60% RH; too low = ESD risk; too high = condensation
- Dew point — condensation risk for cold pipes; critical for liquid cooling environments
- ChW supply/return temp — primary control target; ±0.5°F stability critical for AI loads
- ChW ΔT — design ΔT (13°F typ.); low ΔT syndrome wastes energy and reduces chiller capacity
- ChW ΔP (header) — secondary pump VFD control variable; reset based on valve positions
- Chiller loading % & COP — staging trigger; efficiency optimization
- CW supply/return temp — cooling tower fan speed control variable
- Wet-bulb temp (outdoor) — economizer enable/disable; cooling tower approach calculation
- IT loop supply/return temp — GPU thermal margin; alarm at >50°C supply
- IT loop flow rate (GPM) — low flow = insufficient heat removal; pump fault indicator
- IT loop ΔP — filter clogging indicator; pump curve verification
- Coolant conductivity (μS/cm) — DI water quality; >5 μS/cm = resin exhausted
- Leak detection status — zone/distance; highest-priority alarm in liquid-cooled DCs
- Pump/fan VFD speed (%) — efficiency tracking; affinity law verification
- Motor current (A) — overload detection; bearing failure early warning
- Valve position (%) — hunting detection; DP reset input (most-open valve algorithm)
- Filter ΔP — replacement scheduling; airflow restriction alarm
- Vibration (mm/s) — rotating equipment health; bearing/impeller failure prediction
- VESDA smoke level (alert/action/fire)
- Fire panel zone status
- Suppression system armed/discharged
- Pre-action valve status
- EPO button status (armed/tripped)
- Leak detection cable alarm + location
- Drip pan water-sense contacts
- CDU reservoir level
- Makeup water flow (unexpected = leak)
- Under-floor flood sensors
- Door contacts (open/closed/forced)
- Access control events (badge in/out)
- Camera feed status (online/offline)
- Motion detection zones
- Mantrap interlock status
| Data Category | Typical Poll Rate | Historian Deadband | Retention | Why It Matters |
|---|---|---|---|---|
| Power (kW, A, V) | 1–5 sec | 1–2% | 3+ years | PUE, billing, capacity trending, fault detection |
| Temperature | 5–15 sec | 0.5°F / 0.3°C | 1–3 years | Thermal compliance, SLA, cooling optimization |
| Flow / Pressure | 5–30 sec | 2–5% | 1–2 years | Pump performance, filter health, balancing |
| Equipment status (on/off) | On change | N/A (digital) | 5+ years | Runtime tracking, PM scheduling, failure analysis |
| Alarms & events | On change | N/A | 7+ years | Root cause analysis, compliance audit trail |
| Security / access | On event | N/A | 1–7 years | Compliance (SOC 2, ISO 27001), incident response |
A 1,000-rack AI facility with ~30 points per rack + plant-level instrumentation generates 50,000–100,000 monitored points. At 5-second scan rates, that's 10,000–20,000 writes/second to the historian. Proper deadband configuration reduces actual storage volume by 80–90% without losing operationally significant changes.
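The arithmetic above, plus the way a deadband suppresses insignificant changes, can be sketched as follows. The function names are hypothetical, and the deadband rule shown (store a sample only when it moves more than the deadband from the last stored value) is one common formulation rather than any specific historian's algorithm.

```python
def writes_per_second(points, scan_seconds):
    """Raw historian write rate if every point is written on every scan."""
    return points / scan_seconds


def deadband_filter(samples, deadband):
    """Keep a sample only if it differs from the last stored value by more than deadband."""
    stored = []
    for value in samples:
        if not stored or abs(value - stored[-1]) > deadband:
            stored.append(value)
    return stored


low = writes_per_second(50_000, 5)    # 10,000 writes/s
high = writes_per_second(100_000, 5)  # 20,000 writes/s

# Six temperature samples with a 0.5 degF deadband: only three are stored.
kept = deadband_filter([70.0, 70.1, 70.2, 70.6, 70.7, 71.3], deadband=0.5)
```

Small drifts are dropped while the two larger moves survive, which is where the 80–90% storage reduction comes from on slowly changing signals.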
From Transistor to Tensor Core
Zoom in far enough and a GPU is just billions of switches — transistors patterned at TSMC's 4–3 nm class nodes.[src] Zoom out and they form a hierarchy purpose-built for one operation: matrix multiplication.
A GPU is a memory machine
- Registers — per-thread, ~tens of KB, fastest path.
- Shared memory (SRAM) — per-SM scratchpad, ~hundreds of KB, where FlashAttention tiles its work.[src]
- HBM — on-package stacked DRAM. ~3–8 TB/s, 80–192 GB per GPU on Hopper / Blackwell class parts.[src][src]
- CPU DRAM — DDR5, ~hundreds of GB/s, used as overflow.
- NVMe + network — datasets, checkpoints, inter-node traffic.
For most inference workloads, throughput is bounded not by FLOPS but by how fast weights can be streamed from HBM into the SRAM-resident kernel.
CPU · GPU · TPU · NPU
| Class | Style | Strength | Limit |
|---|---|---|---|
| CPU | Latency, scalar | Branchy code | Few ops/cycle |
| GPU | SIMT, throughput | Dense matmul | Memory bandwidth |
| TPU | Systolic array | Big matmul, low overhead | Less flexible |
| NPU | On-device, INT8/4 | Power-efficient inference | Limited memory |
All four converge on the same insight: matrix multiplication is ~90% of the work, so dedicate the silicon to it.
Fewer bits, more throughput, more risk
BF16: same exponent range as FP32; the modern training default.
FP8 E4M3/E5M2 are now standard for forward-pass training on Hopper / Blackwell.[src][src]
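To make the "same exponent range as FP32" point concrete, the sketch below converts an FP32 value to BF16 by keeping only its top 16 bits (sign, 8 exponent bits, 7 mantissa bits). Truncation is an assumption made for brevity; hardware typically rounds to nearest, and `to_bf16` is a hypothetical helper.

```python
import struct


def fp32_bits(x):
    """Return the 32-bit IEEE-754 pattern of x as an unsigned int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def to_bf16(x):
    """Truncate an FP32 value to BF16 precision and return it as a Python float."""
    bits = fp32_bits(x) & 0xFFFF0000  # drop the low 16 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


print(to_bf16(3.14159265))  # 3.140625: only about 3 significant decimal digits survive
print(to_bf16(3.0e38))      # still finite: the 8-bit exponent covers the FP32 range
```

The same value would overflow in FP16, whose largest finite number is 65504; BF16 trades mantissa bits for that range.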
Matrix Multiplication, All The Way Down
A neural network is, mechanically, a long composition of linear maps separated by simple non-linearities. During training, calculus tells us how to nudge each weight to reduce a scalar loss. That's it. The intelligence emerges from composition and scale.
y = σ(W · x + b)
For each layer, multiply the input vector x by a learned weight matrix W, add a bias b, then apply a non-linearity σ, usually GELU or SwiGLU in modern transformers.[src]
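A minimal pure-Python sketch of one such layer follows. GELU is an assumed choice for σ here, and `layer` is a hypothetical helper; real frameworks run the same arithmetic as batched matrix multiplies on the GPU.

```python
import math


def gelu(z):
    """Exact GELU: z times the standard normal CDF evaluated at z."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def layer(W, x, b, act=gelu):
    """Compute y = act(W . x + b) for a weight matrix W given as a list of rows."""
    return [
        act(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]


# Identity activation makes the linear part easy to check by hand:
# [1*1 + 2*1 + 0.5, 3*1 + 4*1 - 0.5] = [3.5, 6.5]
y = layer([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0], [0.5, -0.5], act=lambda z: z)
```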
Backpropagation in one breath
- Compute loss L between prediction and target.
- Apply the chain rule from output back to input, accumulating ∂L/∂W for every weight.
- Update each weight: W ← W − η · ∂L/∂W.
- Repeat for trillions of tokens.
Backpropagation is just the chain rule, executed efficiently as a reverse-mode automatic differentiation pass over the computational graph.
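The steps above can be sketched on the smallest possible network: two scalar weights, h = tanh(w1 · x), ŷ = w2 · h, and a squared-error loss. The model and the names `forward_backward` and `train` are illustrative assumptions, not anything from a specific framework.

```python
import math


def forward_backward(w1, w2, x, y):
    """Return (loss, dL/dw1, dL/dw2) using the chain rule from output to input."""
    # Forward pass
    h = math.tanh(w1 * x)
    y_hat = w2 * h
    loss = (y_hat - y) ** 2
    # Backward pass: each line multiplies by one local derivative
    dL_dyhat = 2.0 * (y_hat - y)
    dL_dw2 = dL_dyhat * h
    dL_dh = dL_dyhat * w2
    dL_dw1 = dL_dh * (1.0 - h * h) * x  # d tanh(u)/du = 1 - tanh(u)^2
    return loss, dL_dw1, dL_dw2


def train(w1, w2, x, y, eta=0.1, steps=200):
    """Plain gradient descent: W <- W - eta * dL/dW."""
    for _ in range(steps):
        _, g1, g2 = forward_backward(w1, w2, x, y)
        w1 -= eta * g1
        w2 -= eta * g2
    return w1, w2
```

Comparing the analytic gradients with finite differences is a quick way to confirm the chain rule was applied correctly.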
Loss Landscape Visualization
The vector field is a synthetic 2D loss; modern training uses AdamW with cosine learning-rate schedules and warmup. The real loss landscape lives in billions of dimensions, where most local minima behave similarly well.
Attention Is The Engine
Every modern frontier model (GPT, Claude, Gemini, Llama) is a stack of identical transformer blocks.[src] Each block does two things: lets every token look at every other token (self-attention), then transforms each token independently (MLP). Stack 60–120 of those, train on trillions of tokens, and you get an LLM.
Type to see the merges fire
▁ marks a word boundary (sentencepiece convention). Real BPE learns ~50k merges from corpus statistics.[src] Notice how rare words like “strawberry” fragment — that's why some models miscount its letters.
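For readers who want to see how merges are learned, here is a toy sketch of BPE training: count adjacent symbol pairs, merge the most frequent pair, repeat. It is a simplified illustration (character-level start, a made-up tie-break, no byte fallback), and `learn_bpe` is a hypothetical name.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges merge rules from a list of words."""
    # Each word starts as a tuple of characters, prefixed with the boundary marker.
    vocab = Counter(tuple("▁" + word) for word in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


rules = learn_bpe(["aa", "aa", "ab"], num_merges=2)
```

Frequent sequences collapse into single tokens after a few merges, while rare words never accumulate enough pair counts and stay fragmented.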
Each token decides who to listen to
Each row = a query token; brightness = how much it attends to each key token. Real heads in trained models specialize spontaneously into induction heads, name-mover heads, etc. Pattern shown is heuristic for clarity.[src]
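The heatmap rows correspond to softmax-normalized attention weights. Below is a small pure-Python sketch of scaled dot-product attention with a causal mask; it covers a single head with no learned projections, and the function names are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, causal=True):
    """Return (outputs, weights) where weights[i][j] is how much query i attends to key j."""
    d = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        if causal:
            # A token may only look at itself and earlier positions.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w_j * v[c] for w_j, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights


X = [[1.0, 0.0], [0.0, 1.0]]
out, attn = attention(X, X, X)
```

Each row of `attn` sums to 1, and with the causal mask the first token can only attend to itself.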
One transformer block, repeated N times
- RMSNorm instead of LayerNorm — fewer params, same stability.[src]
- Rotary position embeddings (RoPE) inject position into Q/K via rotation — extends cleanly to long contexts.[src]
- Grouped-query attention (GQA) shares K/V heads across multiple Q heads to shrink the KV cache during inference.[src]
- SwiGLU MLP is the de facto feed-forward in modern LLMs.[src]
- Mixture of experts (MoE) swaps the dense MLP for a router + many experts; only k experts fire per token.[src]
- FlashAttention tiles attention into SRAM-resident blocks — same math, far less HBM traffic.[src]
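Sharing K/V heads across query heads matters because the key-value cache grows with the number of K/V heads. Here is a sizing sketch; the model shape (80 layers, 64 query heads, head dimension 128, 8192-token context, 2 bytes per element) is a hypothetical configuration chosen for illustration.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2, batch=1):
    """Bytes needed to cache keys and values for every layer and position."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem * batch  # 2 = K and V


mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=8192)  # one K/V head per Q head
gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=8192)   # 8 Q heads share each K/V head

print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB per sequence")
```

With these assumed numbers the cache drops from 20 GiB to 2.5 GiB per sequence, which translates directly into more concurrent requests per GPU.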
Sampling: turning a probability vector into text
The model outputs a probability distribution over the entire vocabulary for the next token. Temperature sharpens or flattens it; top-p truncates the long tail. Then we draw one token, append it, and feed the new sequence back in.
Brighter = higher probability; magenta = the token actually sampled. At temperature 0 the same prompt always picks the argmax.
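A minimal sketch of that sampling step is shown below, assuming raw logits as input. `sample_next` is a hypothetical helper; production samplers add further details such as repetition penalties and batching.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick the next token id from logits using temperature and nucleus (top-p) sampling."""
    if temperature == 0:
        # Greedy decoding: always the argmax.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most-probable tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


greedy = sample_next([1.0, 3.0, 2.0], temperature=0)
nucleus = sample_next([1.0, 3.0, 2.0], temperature=1.0, top_p=0.5, rng=random.Random(0))
```

In the second call the top token alone holds more than half the probability mass, so the nucleus contains only that token.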
Trillions of Tokens, Months of Wall Time
Pretraining is conceptually simple: predict the next token, average the loss across a batch, backprop, update. The hard parts are data quality, distributed orchestration, and not crashing for 90 days.[src]
C ≈ 6 · N · D
Chinchilla-optimal D ≈ 20 × N. GPT-4 class is in the 1e25–1e26 FLOP regime.[src][src] Llama 3 405B used ~3.8e25 FLOPs.[src]
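The two rules of thumb combine into a quick budget calculator. The sketch uses only the figures quoted in this guide; `train_flops` and `chinchilla_tokens` are hypothetical helper names.

```python
def train_flops(n_params, n_tokens):
    """Approximate training compute: C = 6 * N * D."""
    return 6 * n_params * n_tokens


def chinchilla_tokens(n_params):
    """Compute-optimal token count: D = 20 * N."""
    return 20 * n_params


# A 70B-parameter model trained compute-optimally.
d_70b = chinchilla_tokens(70e9)     # 1.4e12 tokens
c_70b = train_flops(70e9, d_70b)    # about 5.9e23 FLOPs

# Inverting C = 6ND for the Llama 3 405B figure gives the implied token count.
implied_tokens = 3.8e25 / (6 * 405e9)  # about 1.56e13 tokens
```

The implied token count is far above 20 × N for a 405B model, which illustrates that recent runs train well past the Chinchilla-optimal point.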
How a forward pass shards across 16k GPUs
- Data parallel — each GPU holds the full model, processes a different micro-batch, and all-reduces gradients at the end.
- Tensor parallel — slice individual matmuls across GPUs along the hidden dim (Megatron-style).[src]
- Pipeline parallel — assign different layer ranges to different GPUs; micro-batches flow through like an assembly line.
- FSDP — shard parameters, gradients, and optimizer states across the data-parallel group; gather just-in-time.[src][src]
- Expert parallel — for MoE, route tokens to expert shards living on different GPUs.[src]
Real frontier runs combine all five along orthogonal axes ("3D" or "4D" parallelism) to keep every GPU saturated.
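The simplest of these axes, data parallelism, can be simulated without any GPU. The sketch assumes a one-weight model and equal-sized shards; `all_reduce_mean` is a stand-in for the real collective, not a library call.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the 1-D model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def all_reduce_mean(values):
    """Stand-in for AllReduce: every worker ends up holding the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def data_parallel_step(w, batch, n_workers, eta=0.01):
    """One SGD step with the batch split evenly across n_workers replicas."""
    shard = len(batch) // n_workers  # assumes the batch divides evenly
    local = [grad_mse(w, batch[i * shard:(i + 1) * shard]) for i in range(n_workers)]
    synced = all_reduce_mean(local)
    return [w - eta * g for g in synced]  # every replica applies the same update


batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
replicas = data_parallel_step(0.5, batch, n_workers=2)
single_gpu = 0.5 - 0.01 * grad_mse(0.5, batch)
```

With equal shards, averaging the per-worker gradients reproduces the full-batch gradient, so every replica stays in lockstep with what a single large GPU would have computed.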
Crawl → dedup → filter → mix
Web crawls (Common Crawl), code (GitHub), books, math, multilingual corpora. Deduplicated at document and substring level, quality-filtered by classifiers, then mixed by domain. Quality > quantity past a point.[src]
SFT → RM → PPO / DPO
Supervised fine-tuning on demonstrations, then preference data shapes a reward model, then PPO or — increasingly — Direct Preference Optimization aligns the policy.[src][src] Anthropic's Constitutional AI automates the preference signal.[src]
Benchmarks vs. reality
MMLU, GPQA, SWE-bench, HumanEval, ARC-AGI.[src][src][src] Watch for contamination: if eval data leaked into pretraining, the score is meaningless. Real-world capability often lags benchmarks.
Serving the Trained Mind
Inference has two phases. Prefill processes your prompt in parallel and is compute-bound. Decode generates one token at a time, streaming weights from HBM on every step, and is memory-bandwidth-bound. Modern serving stacks (vLLM, TensorRT-LLM, SGLang) exist to keep both phases saturated through continuous batching and paged KV cache management.[src]
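A back-of-envelope sketch of the two regimes follows. It assumes roughly 2 FLOPs per parameter per token for a forward pass and that every decode step reads all weights once. The hardware numbers (1 PFLOP/s peak, 3 TB/s HBM) and the 40% and 60% efficiency factors are illustrative assumptions, and the function names are hypothetical.

```python
def prefill_seconds(n_params, prompt_tokens, peak_flops, mfu=0.4):
    """Compute-bound estimate: total forward FLOPs divided by achieved FLOP/s."""
    return 2 * n_params * prompt_tokens / (peak_flops * mfu)


def decode_tokens_per_second(n_params, bytes_per_param, hbm_bytes_per_s, efficiency=0.6):
    """Bandwidth-bound estimate at batch size 1: one full weight read per token."""
    return hbm_bytes_per_s * efficiency / (n_params * bytes_per_param)


# Hypothetical 70B-parameter model stored in 16-bit weights.
ttft = prefill_seconds(70e9, prompt_tokens=1000, peak_flops=1e15)                   # about 0.35 s
rate = decode_tokens_per_second(70e9, bytes_per_param=2, hbm_bytes_per_s=3e12)      # about 12.9 tok/s
```

Batching many requests together amortizes each weight read across several tokens, which is why serving stacks work so hard to keep batches full during decode.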
Configure a deployment
Specs sourced from NVIDIA H100 / Blackwell briefs[src][src] and HBM3e standard.[src] FLOPs assume 40% MFU; HBM at 60% effective.
- Continuous batching — new requests join an in-flight batch each decode step.[src]
- Paged KV cache — stored in fixed-size pages, like an OS virtual memory.
- Speculative decoding — a tiny draft model proposes K tokens; the big model verifies them in one pass.[src]
- Prefix caching — reuse KV across requests sharing a system prompt.
Post-training quantization to 4-bit shrinks weights 4× relative to 16-bit and can raise memory-bound inference throughput by up to roughly 4×, with single-digit % accuracy loss on most tasks.[src][src]
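The core idea can be sketched with symmetric per-tensor rounding to the 4-bit integer range. This is a deliberately simple scheme assumed for illustration; production methods typically use per-group scales and calibration data, and the function names are hypothetical.

```python
def quantize_int4(weights):
    """Map floats to integers in [-8, 7] with a single shared scale."""
    scale = max(abs(w) for w in weights) / 7.0 or 1.0  # fall back to 1.0 for all-zero input
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]


w = [0.9, -0.42, 0.13, 0.0, -0.77]
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

Each weight now needs 4 bits plus a shared scale, and the reconstruction error is bounded by half a quantization step.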
NPUs in modern phones and laptops run 3–8B models at low precision (INT8/INT4). Local means private, low-latency, and free per query, at the cost of capability. Apple Intelligence, Phi-class, Gemma 2B all live here.
Loops, tools, retrieval
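
    def agent_loop(model, messages, tools):
        # run() and tool_result() are placeholders for the tool executor and
        # the message wrapper; tools is a list such as [search, code, ...].
        done = False
        while not done:
            response = model(messages, tools=tools)
            if response.tool_calls:
                for call in response.tool_calls:
                    result = run(call)
                    messages += [tool_result(result)]  # feed results back into context
            else:
                done = True  # no tool calls means the model produced a final answer
        return response.text

Agents are LLMs in a control loop, calling functions and reading the results back into context. RAG is the same shape with one tool: vector search over your documents. The Model Context Protocol (MCP) standardizes how tools are exposed to models.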
Considerations & Open Questions
Modern AI is genuinely useful and genuinely fragile. The failures are not bugs to be patched — they are direct consequences of the training objective and the architecture.
- Hallucination. The model is trained to maximize next-token likelihood, not truth. When the prompt enters a region of weight-space with low evidence, it interpolates plausibly. There is no internal "I don't know" signal unless one was explicitly trained in.
- Prompt injection. The model cannot distinguish instructions from data. Anything that reaches its context — a webpage, an email, a tool result — can hijack behavior.[src]
- Jailbreaks. Safety training is a thin shell over a much larger base model. Adversarial prompts find the seams.
- Distribution shift. Performance degrades on inputs unlike the training distribution — long context, niche domains, low-resource languages.
- Reward hacking. RLHF optimizes for human-rater approval, which is correlated with — but not identical to — being correct or helpful.
- Energy. A single frontier training run consumes ~10–50 GWh — comparable to a small town for a year. Aggregate inference is now larger than training for major providers.[src]
- Water. Evaporative cooling can use 1–2 L per kWh of IT load; closed-loop direct-to-chip designs use far less. Site choice matters more than model choice.
- Grid impact. Hyperscalers are now signing multi-gigawatt PPAs and reviving nuclear capacity to keep up.[src]
- Synthetic data feedback. If models train on outputs of earlier models, distributions narrow and rare phenomena vanish — "model collapse."[src]
What we still don't know
Interpretability.
We can read individual activations and trace small circuits, but we cannot, for any frontier model, fully explain why a given output was produced.
Alignment.
Specifying what we want a powerful optimizer to do — precisely enough that it won't satisfy the letter while violating the spirit — remains unsolved.
Scaling limits.
Loss continues to fall predictably with compute, but capability jumps are discontinuous and hard to forecast. Data may bind before compute does.
Generalization vs. memorization.
How much of model behavior is learned algorithm vs. retrieved training data is an active research question with real legal and scientific stakes.
Quick Reference
Key acronyms and critical concepts across infrastructure, controls, silicon, and AI systems — organized for rapid review and team learning.
Every abbreviation you need to know
| Abbr | Full Name | Quick Note |
|---|---|---|
| PUE | Power Usage Effectiveness | Total facility power ÷ IT power. Target: 1.10–1.20 for hyperscale. |
| TDP | Thermal Design Power | Max sustained chip power draw (watts). Sets cooling requirement. |
| UPS | Uninterruptible Power Supply | Battery backup bridging 10–15 s gap until generators reach speed. |
| PDU | Power Distribution Unit | Distributes power from facility feed to rack-level outlets. |
| ATS | Automatic Transfer Switch | Switches load from utility to generator on failure (100–500 ms). |
| CDU | Coolant Distribution Unit | Heat exchanger between facility water loop and IT liquid loop. |
| CRAC | Computer Room Air Conditioning | DX refrigerant-based cooling unit. Good up to ~15 kW/rack. |
| CRAH | Computer Room Air Handler | Chilled-water based. More efficient at scale than CRAC. |
| EPO | Emergency Power Off | Kills power to entire zone. Code-required, controversial (nuisance trips). |
| VFD | Variable Frequency Drive | Controls motor speed (pumps, fans). Key PUE optimization lever. |
| EPMS | Electrical Power Monitoring System | Network of power meters — V, I, kW, PF, THD at every distribution point. |
| BMS | Building Management System | Supervisory control for HVAC, cooling plant, environmental monitoring. |
| SCADA | Supervisory Control and Data Acquisition | Industrial control + HMI layer. Single pane of glass for operators. |
| DCIM | Data Center Infrastructure Management | Unified view of assets, capacity, power chains, and environment. |
| PLC | Programmable Logic Controller | Deterministic real-time controller. Scan cycle <10 ms. |
| DDC | Direct Digital Control | Microprocessor-based HVAC loop control (replaces pneumatic). |
| HMI | Human-Machine Interface | Operator screens: one-line diagrams, alarm dashboards, schematics. |
| SOO | Sequence of Operations | The spec document PLC programmers code from and Cx agents test against. |
| Cx | Commissioning | Systematic verification: L1 factory → L2 install → L3 functional → L4 integrated → L5 seasonal. |
| IST | Integrated Systems Testing | L4 commissioning: multi-system failure testing with live IT load. |
| OPC-UA | OPC Unified Architecture | Modern, secure, platform-independent PLC-to-SCADA protocol. |
| BACnet | Building Automation and Control Networks | ASHRAE/ISO standard for building automation interoperability. |
| MQTT | Message Queuing Telemetry Transport | Lightweight pub/sub for IIoT. Sparkplug B adds standardized namespace. |
| SNMP | Simple Network Management Protocol | UDP-based monitoring for UPS, PDU, CRAC. v3 adds encryption. |
| GPU | Graphics Processing Unit | Massively parallel processor. Primary AI compute engine. |
| HBM | High Bandwidth Memory | 3D-stacked DRAM via TSVs. 3–8 TB/s bandwidth per GPU. |
| SM | Streaming Multiprocessor | GPU execution unit containing CUDA cores + Tensor Cores. |
| FLOPS | Floating Point Operations/Second | Compute throughput. B200: ~2.25 PFLOPS dense BF16, ~4.5 PFLOPS dense FP8. |
| MFU | Model FLOPs Utilization | Actual vs. peak utilization. Good training: 30–45%. |
| BF16 | Brain Floating Point 16 | 16-bit, same exponent range as FP32. Default training precision. |
| FP8 | 8-bit Floating Point | E4M3/E5M2 variants. 2× Tensor Core throughput vs. BF16. |
| NIC | Network Interface Card | 400/800 Gb/s per GPU in AI clusters. |
| SFT | Supervised Fine-Tuning | Post-training on curated (prompt, response) pairs. |
| RLHF | Reinforcement Learning from Human Feedback | Reward model + PPO to align outputs with human preference. |
| DPO | Direct Preference Optimization | Simpler RLHF alternative — no separate reward model needed. |
| FSDP | Fully Sharded Data Parallel | Shards model across GPUs. Each gathers params just-in-time. |
| DP | Data Parallel | Full model copy per GPU, AllReduce gradients. |
| TP | Tensor Parallel | Splits matmuls across GPUs along hidden dim. Needs NVLink. |
| PP | Pipeline Parallel | Different layers on different GPUs. Assembly line approach. |
| RAG | Retrieval-Augmented Generation | Inject retrieved docs into prompt. Reduces hallucination. |
| KV Cache | Key-Value Cache | Cached K/V from prior tokens. Grows linearly with sequence length. |
| TTFT | Time to First Token | Prefill latency. Users perceive this as responsiveness. |
| GQA | Grouped Query Attention | Multiple Q heads share K/V heads → smaller KV cache. |
| MoE | Mixture of Experts | Router selects top-k of N expert MLPs per token. More params, same FLOPs. |
| BPE | Byte Pair Encoding | Subword tokenizer. Iteratively merges frequent adjacent pairs. |
Core definitions
NVIDIA's term: a purpose-built facility that manufactures intelligence (tokens), not just stores data. Manages entire AI lifecycle: data ingestion → training → fine-tuning → high-volume inference. Positioned as national-scale critical infrastructure.
Two independent power paths (A+B), each carrying 100% load. Standard for Tier III/IV mission-critical facilities.
ISA-95 network segmentation: Level 0 (physical) → Level 5 (enterprise). IT/OT DMZ between Level 3 and 4.
Proportional-Integral-Derivative control. Kp (reacts to error), Ki (eliminates steady-state error), Kd (dampens oscillation).
PLC execution: read inputs → execute program → write outputs → comms. Repeats every 1–20 ms.
PLC programming standard: Ladder Diagram, Structured Text, Function Block Diagram, Instruction List, SFC.
ISA-18.2 best practice: <1 actionable alarm per operator per 10 min. Prevents alarm fatigue.
Physical separation preventing hot exhaust from mixing with cold supply. Without it, 30–40% cooling wasted.
Using outside air or raised chiller setpoints when ambient is cool enough. Saves 20–40% cooling energy.
Cold plates on CPU/GPU die. Warm water (30–45°C) enables year-round free cooling. Required above ~50 kW/rack.
Thermal guidelines: A1 class recommends 18–27°C dry-bulb at rack inlet. Outside range = accelerated failures.
Collective op: every GPU contributes a tensor, all are summed, every GPU gets the result. Ring topology minimizes waste.
IO-aware attention: tiles QKV into SRAM-sized blocks. Avoids N² HBM materialization. 2–4× speedup.
Compute-optimal: D ≈ 20 × N tokens. 70B model → 1.4T tokens. FLOPs ≈ 6ND.
New requests join in-flight batch at each decode step. Keeps GPU utilization high.
Small draft model proposes K tokens, large model verifies in one pass. ~2–3× speedup, same quality.
output = layer(x) + x. Gradient highway enabling 100+ layer training. Without it, gradients vanish.
τ applied to logits: softmax(logits/τ). Low τ → deterministic. High τ → creative. τ→0 = argmax.
Very Early Smoke Detection Apparatus. Laser-based air sampling — detects smoke before visible particles form.
FM-200, Novec 1230 — gaseous fire suppression for IT rooms. Leaves no residue.
Virtual replica with real-time sensor data. NVIDIA Omniverse for CFD, what-if scenarios, predictive maintenance.
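The temperature entry in the glossary above, softmax(logits/τ), can be tried out with a few lines of code. This is a minimal sketch: the function name and the three example logits are made up for illustration, and a real sampler would go on to draw a token from the resulting distribution.

```python
import math

def softmax_with_temperature(logits, tau=1.0):
    """Return softmax(logits / tau) as a list of probabilities."""
    if tau <= 0:
        raise ValueError("tau must be positive; use argmax for the tau -> 0 limit")
    scaled = [x / tau for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens.
logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, tau=0.1)   # nearly all mass on the top token
flat = softmax_with_temperature(logits, tau=10.0)   # close to uniform
```

With a low τ the top-scoring token takes almost all of the probability mass, which is why sampling becomes effectively deterministic; with a high τ the three probabilities move toward one third each.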
Key numbers to know
- Utility: 12.47–34.5 kV (medium voltage)
- Step-down: 480V (US) / 415V (intl)
- GPU rack (NVL72): ~120 kW
- Single B200 GPU: ~1,000W TDP
- 10k GPU cluster: ~12–15 MW facility
- Air cooling limit: ~15–20 kW/rack
- Liquid cooling: 50–120+ kW/rack
- ASHRAE A1 inlet: 18–27°C
- Typical ChW supply: 42°F / 5.5°C
- PUE 0.1 improvement at 100 MW ≈ $6M/yr saved
- Chinchilla: D ≈ 20N tokens
- FLOPs ≈ 6 × N × D
- Good MFU: 30–45%
- NVLink 5: ~1.8 TB/s per GPU
- HBM3e: 3–8 TB/s bandwidth
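Two of the numbers above are simple arithmetic and can be reproduced. The sketch below assumes that the 100 MW figure is IT load and that electricity costs $0.07/kWh; the price is an assumption chosen for illustration, not a value from this guide. The function names are hypothetical.

```python
HOURS_PER_YEAR = 8760

def annual_pue_savings_usd(it_load_mw, pue_delta, usd_per_kwh):
    """Yearly cost of the overhead power removed by a PUE improvement."""
    saved_kw = it_load_mw * 1000 * pue_delta
    return saved_kw * HOURS_PER_YEAR * usd_per_kwh

def chinchilla_tokens(n_params):
    """Compute-optimal training tokens, D ≈ 20 × N."""
    return 20 * n_params

def training_flops(n_params, n_tokens):
    """Approximate training compute, FLOPs ≈ 6 × N × D."""
    return 6 * n_params * n_tokens

# Assumed electricity price: $0.07/kWh (illustrative only).
savings = annual_pue_savings_usd(it_load_mw=100, pue_delta=0.1, usd_per_kwh=0.07)
tokens = chinchilla_tokens(70e9)        # 70B-parameter model
flops = training_flops(70e9, tokens)
print(f"${savings / 1e6:.1f}M per year, {tokens:.2e} tokens, {flops:.2e} FLOPs")
```

At that assumed price the savings come out a little above $6M per year, and the 70B model lands on the 1.4T tokens quoted in the glossary.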
Common questions — Infrastructure
Utility power arrives at 12.47–34.5 kV medium voltage. Main switchgear meters and routes it through an ATS (automatic transfer switch) that can flip to generator feed. Step-down transformers bring it to 480V (US) or 415V (international). From there it splits into redundant A/B paths through UPS systems (battery backup for the 10–15 second generator start gap). UPS output feeds floor PDUs (power distribution units) with static transfer switches. Floor PDUs step down to rack-level bus bars or rack PDUs, which distribute to individual servers via dual power supplies — so each server draws from both the A and B path simultaneously. Every link in this chain has EPMS power meters reporting voltage, current, kW, power factor, and THD in real time.
N+1: You have N modules needed to carry full load, plus 1 spare. If one fails, the spare picks up the slack. Cheaper, but a second failure means downtime. Example: 4 UPS modules where you only need 3, so you can lose one.
2N: Two completely independent, fully redundant power paths (A and B). Each path alone can carry 100% of the load. There is no shared component — separate utility feeds, separate transformers, separate UPS, separate PDUs. If the entire A side goes down, B carries everything. This is the standard for Tier III/IV mission-critical facilities and is mandatory for AI training clusters where a power glitch kills a multi-day training run. 2N+1 adds an extra spare module per side for even higher reliability.
L3 (Functional) tests individual systems in isolation against the Sequence of Operations. You verify one chiller starts, ramps, alarms, and shuts down correctly. You test one UPS on bypass. One generator on load. Each system is proven independently.
L4/IST (Integrated Systems Testing) tests multi-system failure scenarios with live IT load (or load banks). You simulate utility loss and verify the entire chain responds: ATS transfers, generators start and sync, UPS bridges the gap, BMS adjusts cooling. Then you cascade failures — chiller trip under load → temperature rise → BMS starts lag chiller → if it fails, load shedding kicks in. L4 proves the systems work together under stress, not just individually. Every deviation is a tracked deficiency requiring resolution before the facility goes live.
Immediate: chilled water supply temperature begins rising because remaining capacity can't match the heat load. The BMS detects the trip and initiates the lag chiller start sequence (anti-recycle timer permitting — typically 300 seconds). During this gap, CHW supply temp may rise 3–5°F. CRAH/CDU units see reduced delta-T and increase fan/pump speeds to compensate. If the lag chiller also fails or takes too long, rack inlet temps cross the high-temp alarm threshold (typically 95°F/35°C), triggering a warning alarm. If temps continue rising, the critical alarm (100°F/38°C) triggers IT load shedding — the BMS or DCIM shuts down non-essential compute to reduce heat load. In a liquid-cooled AI hall at 120 kW/rack, you have about 2–3 minutes of thermal runway before throttling begins, versus 10–15 minutes in a traditional 10 kW/rack enterprise hall. This is why proper chiller staging logic, anti-recycle bypass, and N+1 cooling capacity are non-negotiable.
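The alarm escalation described above can be sketched as a simple threshold check. The two thresholds are the typical values quoted in the text; the function names and return strings are hypothetical, and a real BMS would add hysteresis, time delays, and per-zone logic.

```python
WARNING_F = 95.0    # typical high-temp warning threshold from the text
CRITICAL_F = 100.0  # typical critical threshold that triggers load shedding

def classify_inlet_temp(temp_f):
    """Map a rack inlet temperature in °F to an alarm state."""
    if temp_f >= CRITICAL_F:
        return "critical: shed IT load"
    if temp_f >= WARNING_F:
        return "warning"
    return "normal"

def f_to_c(temp_f):
    """Convert °F to °C."""
    return (temp_f - 32) * 5 / 9

for reading in (80.0, 96.0, 101.0):
    print(f"{reading:.0f}°F ({f_to_c(reading):.1f}°C): {classify_inlet_temp(reading)}")
```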
Six primary levers: (1) Raise chilled water supply temperature — warm-water cooling (30–45°C) enables free cooling via dry coolers year-round in most climates, eliminating compressor energy. (2) VFDs on all pumps and fans — variable speed drives match motor speed to actual load instead of running at 100% constant. (3) Eliminate CRAH/CRAC units — liquid cooling removes 60–80% of heat at the chip, so air-side cooling can be minimal or eliminated. (4) Economizer modes — use outside air or raised condenser water setpoints when ambient conditions allow. (5) Higher voltage distribution (415V vs 208V) — reduces I²R distribution losses. (6) Efficient UPS — ECO mode, lithium-ion batteries, and right-sizing UPS capacity to avoid running at low load factors. A well-designed liquid-cooled AI facility can achieve PUE 1.03–1.10; each 0.1 improvement at 100 MW saves ~$6M/year.
The Purdue Model (ISA-95) defines 6 levels: Level 0 (physical process — pipes, wires, air), Level 1 (field I/O — sensors, drives, VFDs), Level 2 (control — PLCs, DDC, BMS controllers), Level 3 (site operations — SCADA, historian), Level 4 (business — DCIM, MES, IT systems), Level 5 (enterprise — ERP, email, cloud). A strict IT/OT DMZ with firewalls sits between Level 3 and Level 4. OT networks prioritize availability and safety — a PLC controlling a fire suppression interlock must never be disrupted by a software update, a vulnerability scan, or an IT policy change. IT networks prioritize confidentiality. Mixing them means an IT compromise (phishing, ransomware) could reach PLCs that control physical safety systems. Data flows OT→IT only, via historians, OPC-UA gateways, or MQTT brokers in the DMZ. Security standard: IEC 62443. Never expose PLCs directly to the IT or internet network.
Power meters: Modbus TCP (Ethernet, port 502) or Modbus RTU (RS-485 serial). Some advanced meters also expose data via SNMP or REST APIs. Modbus uses a simple register-based addressing scheme — FC03 reads holding registers, FC04 reads input registers. No native security, so it relies on network segmentation.
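To make the register-based scheme concrete, here is a minimal sketch that builds the bytes of a Modbus TCP read request using only the standard library. The transaction ID, unit ID, and register address are hypothetical; real meters publish their own register maps. The frame is only constructed, not sent.

```python
import struct

READ_HOLDING_REGISTERS = 0x03  # FC03
READ_INPUT_REGISTERS = 0x04    # FC04

def build_read_request(transaction_id, unit_id, function_code, start_address, quantity):
    """Build a Modbus TCP request frame: MBAP header followed by the PDU."""
    if function_code not in (READ_HOLDING_REGISTERS, READ_INPUT_REGISTERS):
        raise ValueError("only FC03 and FC04 are handled in this sketch")
    if not 1 <= quantity <= 125:
        raise ValueError("a single read may request 1 to 125 registers")
    # PDU: function code (1 byte), start address (2 bytes), quantity (2 bytes)
    pdu = struct.pack(">BHH", function_code, start_address, quantity)
    # MBAP: transaction id, protocol id (always 0), length of what follows, unit id
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu

# Hypothetical example: read 2 holding registers starting at address 0 from unit 1.
frame = build_read_request(1, 1, READ_HOLDING_REGISTERS, 0, 2)
print(frame.hex())
```

Nothing in the frame authenticates the sender, which is the "no native security" point in practice: any host that can reach port 502 can issue this request.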
HVAC / BMS: BACnet IP (UDP port 47808) is the ASHRAE/ISO standard for building automation. Data is organized as objects (Analog Input, Analog Output, Binary Value, Schedule, Trend Log) with properties (Present-Value, Status-Flags). Supports COV (Change of Value) subscriptions. Some legacy systems use LON or proprietary serial. Modern integration uses OPC-UA as the unifying layer, with gateways translating BACnet↔OPC-UA and Modbus↔OPC-UA at system boundaries.
Common questions — AI/ML
Physics. A single GB200 NVL72 rack dissipates ~120 kW in a 0.6 m² footprint. Air cooling works by blowing cold air across heat sinks and exhausting hot air, but air carries very little heat: its specific heat is ~1 kJ/kg·K vs ~4.2 kJ/kg·K for water, and it is roughly 800× less dense, so per unit volume water carries about 3,500× more heat. To remove 120 kW with air, you'd need airflow volumes that cannot practically be routed through a rack; the fans alone would consume massive power and create deafening noise. The practical ceiling for air cooling is ~15–20 kW/rack. Above 50 kW/rack, direct-to-chip liquid cooling is mandatory: cold plates mounted on each GPU/CPU transfer heat to a water loop via a CDU (Coolant Distribution Unit). The CDU rejects heat to the building chilled water plant. Liquid removes 60–80% of the server heat, leaving only residual air cooling for memory, storage, and fans.
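The gap between air and water follows from the heat balance Q = ṁ × cp × ΔT. The sketch below uses the specific heats quoted above; the densities (1.2 kg/m³ for air, 1000 kg/m³ for water) and the temperature rises (15 K for air, 10 K for water) are assumptions chosen for illustration.

```python
AIR_CP = 1.0      # kJ/kg·K, from the text
WATER_CP = 4.2    # kJ/kg·K, from the text
AIR_DENSITY = 1.2       # kg/m³, assumed
WATER_DENSITY = 1000.0  # kg/m³, assumed

def mass_flow_kg_s(heat_kw, cp_kj_per_kg_k, delta_t_k):
    """Mass flow needed to carry heat_kw at a given temperature rise."""
    return heat_kw / (cp_kj_per_kg_k * delta_t_k)

rack_kw = 120.0

air_kg_s = mass_flow_kg_s(rack_kw, AIR_CP, delta_t_k=15.0)
air_m3_s = air_kg_s / AIR_DENSITY

water_kg_s = mass_flow_kg_s(rack_kw, WATER_CP, delta_t_k=10.0)
water_l_s = water_kg_s / WATER_DENSITY * 1000.0

print(f"air:   {air_m3_s:.1f} m³/s through one rack")
print(f"water: {water_l_s:.1f} L/s through one rack")
```

Under these assumptions the air path needs several cubic metres per second through a single rack, while the water path needs only a few litres per second.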
LLM inference has two distinct phases. Prefill processes the entire input prompt in parallel — one big forward pass through all layers, populating the KV cache for every token in the prompt. Prefill is compute-bound (lots of matmuls on many tokens at once). It determines TTFT (time to first token).
Decode generates output tokens one at a time. Each step reads the full model weights from HBM but processes only one new token, using the KV cache from all previous tokens. Decode is memory-bandwidth bound: the GPU spends most of its time loading weights, not computing. Time per decode step ≈ model_size_bytes / HBM_bandwidth, so single-sequence throughput ≈ HBM_bandwidth / model_size_bytes tokens per second. This is why techniques like continuous batching (amortize weight loading across many concurrent requests) and speculative decoding (verify K draft tokens in one pass) matter so much for serving efficiency.
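A back-of-the-envelope estimate shows why decode is bandwidth-bound: time per step is roughly weight bytes divided by memory bandwidth, and tokens per second is the inverse. The sketch assumes a 70B model in FP16 and 8 TB/s of HBM bandwidth (the top of the range quoted in this guide, treated here as aggregate bandwidth). It ignores KV cache reads, compute time, and inter-GPU communication, so the results are upper bounds, not measurements.

```python
def decode_step_seconds(model_bytes, hbm_bytes_per_s):
    """Lower bound on one decode step: every weight is read once."""
    return model_bytes / hbm_bytes_per_s

def decode_tokens_per_second(model_bytes, hbm_bytes_per_s, batch_size=1):
    """Upper bound on throughput; one weight read serves the whole batch."""
    return batch_size / decode_step_seconds(model_bytes, hbm_bytes_per_s)

model_bytes = 70e9 * 2   # 70B parameters at 2 bytes each (FP16)
hbm_bandwidth = 8e12     # 8 TB/s, assumed

step = decode_step_seconds(model_bytes, hbm_bandwidth)
single = decode_tokens_per_second(model_bytes, hbm_bandwidth)
batched = decode_tokens_per_second(model_bytes, hbm_bandwidth, batch_size=64)
print(f"{step * 1000:.1f} ms/step, {single:.0f} tok/s alone, {batched:.0f} tok/s at batch 64")
```

The batched figure scales linearly only in this simplified model; in practice KV cache traffic grows with batch size and eventually the step becomes compute-bound.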
Data parallelism (DP): Every GPU holds a complete copy of the model. Each GPU processes a different mini-batch of data. After computing local gradients, all GPUs AllReduce them to stay synchronized. Simple, but every GPU needs enough memory for the full model + optimizer states. Works across nodes with moderate bandwidth.
Tensor parallelism (TP): A single matmul is split across GPUs along the hidden dimension (Megatron-style). For the first MLP weight matrix W of shape [h, 4h], GPU 0 gets W[:, :2h] and GPU 1 gets W[:, 2h:] (a column split), and the second matrix of shape [4h, h] is split by rows to match. Each GPU computes its slice through both matmuls with no communication in between, then they AllReduce the partial outputs. This requires high-bandwidth links (NVLink at 1.8 TB/s) because activations are communicated every layer. TP divides per-GPU weight memory by the number of TP ranks. Typically used within a node (2–8 GPUs on NVLink), while DP is used across nodes (over InfiniBand/Ethernet).
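The Megatron-style split can be checked on toy matrices in plain Python. In this sketch two "GPUs" are just list entries, the first weight matrix is split by columns and the second by rows, and the AllReduce is an element-wise sum of the partial outputs. The shapes are tiny and the nonlinearity between the two matmuls is omitted to keep the example short; all names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(m, parts):
    """Split a matrix into equal column blocks, one per rank."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]

def split_rows(m, parts):
    """Split a matrix into equal row blocks, one per rank."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]

def all_reduce_sum(partials):
    """Element-wise sum of same-shaped matrices, standing in for AllReduce."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

x = [[1, 2], [3, 4]]                    # activations: 2 tokens, hidden size 2
w1 = [[1, 0, 2, 1], [0, 1, 1, 2]]       # first MLP matrix, split by columns
w2 = [[1, 0], [0, 1], [1, 1], [2, 0]]   # second MLP matrix, split by rows

reference = matmul(matmul(x, w1), w2)   # single-device result

w1_shards = split_columns(w1, 2)
w2_shards = split_rows(w2, 2)
# Each rank runs both matmuls on its own shards with no communication.
partials = [matmul(matmul(x, a), b) for a, b in zip(w1_shards, w2_shards)]
result = all_reduce_sum(partials)       # one communication step per MLP block
```

The sharded result equals the single-device result, and each rank only ever holds half of each weight matrix.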
Standard attention computes softmax(QKᵀ/√d)·V, which requires materializing the full N×N attention matrix in HBM (GPU main memory). For a 128K context, that's 128K² ≈ 16 billion elements per attention head, a massive memory and bandwidth cost.
FlashAttention (Dao et al., 2022) tiles the computation into small blocks that fit entirely in GPU SRAM (fast on-chip memory, ~20 MB). It computes exact attention, with no approximation, but never materializes the full N×N matrix. This reduces HBM reads/writes from O(N² + Nd) to O(N²d²/M), where d is the head dimension and M is SRAM size. Result: 2–4× wallclock speedup, dramatically lower memory usage, and the ability to train with much longer contexts without running out of memory. It's now the default in virtually every major training and inference framework.
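The key trick that lets tiling stay exact is an online softmax: a running maximum and running sum are corrected as each block of keys arrives. The sketch below shows that idea for a single query vector in plain Python. It illustrates the numerics only, not the GPU kernel; the block size and example vectors are arbitrary.

```python
import math

def naive_attention(q, keys, values):
    """Reference: materialize all scores, then softmax, then weight the values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def tiled_attention(q, keys, values, block_size=2):
    """Process keys/values block by block, keeping only running statistics."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.0, -1.0], [0.5, 0.5], [-0.3, 0.8], [2.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [0.0, 3.0]]
full = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, block_size=2)
```

The tiled version never holds more than one block of scores at a time, yet agrees with the reference up to floating-point rounding.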
The KV cache stores Key and Value tensors for all previous tokens across all layers. For a 70B model with 80 layers, 8 KV heads, 128-dim heads, at FP16, each token takes 2 × 80 × 8 × 128 × 2 bytes ≈ 328 KB, so a single sequence of 4K tokens uses ~1.3 GB. At batch size 64, that's ~86 GB, on the same order as the ~140 GB of FP16 weights, and it overtakes them at longer contexts or larger batches. Three main strategies:
(1) GQA (Grouped Query Attention): share KV heads across multiple Q heads. If 32 Q heads share 8 KV heads, the KV cache is 4× smaller. Llama 2 70B, Mistral, and most modern models use GQA. (2) KV cache quantization: store cached K/V in FP8 or INT8 instead of FP16, halving memory (4-bit formats quarter it). (3) PagedAttention (vLLM): manage KV cache in fixed-size pages like OS virtual memory. Eliminates fragmentation that previously wasted 60–80% of allocated KV space. Pages can be shared (prefix caching) or freed independently.
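KV cache sizing is a product of a handful of model dimensions, so it is easy to script. The sketch below uses the 70B-class configuration described above (80 layers, 8 KV heads, 128-dim heads); the function name is hypothetical, and the 64-head case is included only to show what the same model would need without GQA.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes for K and V across all layers: the factor 2 covers K plus V."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

GB = 1e9

per_sequence = kv_cache_bytes(80, 8, 128, seq_len=4096)
batch_64 = kv_cache_bytes(80, 8, 128, seq_len=4096, batch=64)
batch_64_fp8 = kv_cache_bytes(80, 8, 128, seq_len=4096, batch=64, bytes_per_elem=1)
without_gqa = kv_cache_bytes(80, 64, 128, seq_len=4096)  # 64 KV heads instead of 8

for label, size in [("one 4K sequence", per_sequence),
                    ("batch 64", batch_64),
                    ("batch 64, FP8 cache", batch_64_fp8),
                    ("one 4K sequence, no GQA", without_gqa)]:
    print(f"{label}: {size / GB:.1f} GB")
```

Varying `seq_len` and `batch` shows how quickly the cache grows relative to the fixed cost of the weights.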
Three fundamental differences: (1) Power volatility — enterprise servers draw relatively steady power. GPU clusters swing from ~30% TDP at idle to 100% TDP in milliseconds when a training batch launches. This creates transient spikes that stress UPS systems and confuse capacity planning models built for static loads. (2) Thermal density — at 80–120 kW/rack vs enterprise 5–15 kW/rack, the thermal time constant collapses. A cooling failure that gives you 15 minutes in an enterprise hall gives you <3 minutes in an AI hall. Monitoring latency that was fine at 60-second intervals becomes dangerous. (3) Workload correlation — in an AI cluster, all GPUs in a training job start and stop together, so power and thermal loads are highly correlated across hundreds of racks. Traditional designs assume statistical diversity (some racks busy, some idle, averaging out). AI training breaks that assumption — your cooling plant must handle near-instantaneous 0-to-100% step changes.
Mixture of Experts (MoE) replaces the dense MLP (feed-forward) block in each transformer layer with N parallel "expert" MLPs plus a learned router. For each token, the router selects the top-k experts (typically k=2 of N=8–64). Total model parameters are much larger (since all N experts exist), but per-token FLOPs stay close to those of a dense model whose MLP is the size of k experts, because only k experts fire per token.
Impact on inference: The full model must be in memory (all experts loaded), so MoE models need more GPU memory than a dense model of equivalent quality. A 47B-active-parameter MoE might have 140B total parameters. However, per-token compute cost scales with the active parameter count, so inference is faster than a dense 140B model. The challenge is expert load balancing: if the router sends most tokens to the same few experts, you get hotspots on some GPUs and idle capacity on others. Auxiliary load-balancing losses during training and expert parallelism (spreading experts across GPUs) mitigate this.
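Top-k routing itself is only a few lines. The sketch below routes one token through 2 of 4 toy experts, normalizing the gate weights with a softmax over the selected experts, which is one common choice and not necessarily what any particular model does. The experts here are stand-in functions that just scale the input; in a real model each is an MLP and the router logits come from a learned linear layer.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their gates."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def moe_layer(token, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    out = [0.0] * len(token)
    for index, gate in top_k_route(router_logits, k):
        expert_out = experts[index](token)
        out = [o + gate * v for o, v in zip(out, expert_out)]
    return out

# Toy experts: expert i multiplies its input by (i + 1).
experts = [lambda t, s=i + 1: [s * v for v in t] for i in range(4)]
router_logits = [0.1, 2.0, -1.0, 1.0]   # hypothetical router scores for one token

routes = top_k_route(router_logits, k=2)
output = moe_layer([1.0, 2.0], experts, router_logits, k=2)
```

All four experts have to exist in memory, but only the two selected ones run for this token, which is the memory-versus-compute trade described above.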
Further Reading
Every technical claim in this guide traces to a primary source. The field continues to evolve rapidly — interpretability, mechanistic analysis, training dynamics, and hardware co-design all remain active areas of research.
Source Material
- NVIDIA H100 Tensor Core GPU Architecture — NVIDIA, 2022
- NVIDIA Blackwell Architecture Technical Brief — NVIDIA, 2024
- GB200 NVL72 Datasheet — NVIDIA, 2024
- TSMC N3 / N3E Process Technology — TSMC, 2023
- HBM3E Standard (JESD238A) — JEDEC, 2024
- FP8 Formats for Deep Learning — NVIDIA / Arm / Intel, 2022
- Uptime Institute Global Data Center Survey — Uptime Institute, 2024
- Electricity 2024 — Analysis and Forecast to 2026 — IEA, 2024
- AI Datacenter Power & Cooling — SemiAnalysis, 2024
- NVLink and NVLink Switch (5th gen) — NVIDIA, 2024
- InfiniBand NDR (400 Gb/s) Specification — InfiniBand Trade Association, 2022
- Attention Is All You Need — Vaswani et al., 2017
- Training Compute-Optimal Large Language Models — Hoffmann et al. (DeepMind), 2022
- GPT-4 Technical Report — OpenAI, 2023
- The Llama 3 Herd of Models — Meta AI, 2024
- RoFormer: Enhanced Transformer with Rotary Position Embedding — Su et al., 2021
- GQA: Training Generalized Multi-Query Transformer Models — Ainslie et al., 2023
- GLU Variants Improve Transformer — Shazeer, 2020
- Root Mean Square Layer Normalization — Zhang & Sennrich, 2019
- FlashAttention: Fast and Memory-Efficient Exact Attention — Dao et al., 2022
- Switch Transformers — Fedus et al., 2021
- ZeRO: Memory Optimizations Toward Training Trillion-Parameter Models — Rajbhandari et al., 2019
- Megatron-LM: Training Multi-Billion Parameter Models — Shoeybi et al., 2019
- PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel — Zhao et al., 2023
- Training language models to follow instructions with human feedback — Ouyang et al. (OpenAI), 2022
- Direct Preference Optimization — Rafailov et al., 2023
- Constitutional AI: Harmlessness from AI Feedback — Bai et al. (Anthropic), 2022
- Efficient Memory Management for LLM Serving with PagedAttention — Kwon et al., 2023
- Fast Inference from Transformers via Speculative Decoding — Leviathan et al., 2022
- GPTQ: Accurate Post-Training Quantization for GPT — Frantar et al., 2022
- AWQ: Activation-aware Weight Quantization — Lin et al., 2023
- Neural Machine Translation of Rare Words with Subword Units (BPE) — Sennrich et al., 2015
- Measuring Massive Multitask Language Understanding — Hendrycks et al., 2020
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark — Rein et al., 2023
- SWE-bench: Can Language Models Resolve Real-World Issues? — Jimenez et al., 2023
- Prompt Injection attacks against LLM-integrated Applications — Greshake et al., 2023
- The Curse of Recursion: Training on Generated Data Makes Models Forget — Shumailov et al., 2023
- What Is an AI Factory? — NVIDIA, 2025
End of guide.