GPU Mathematics for Machine Learning

Master the mathematical foundations, scaling strategies, and optimization formulas that power modern AI from single GPUs to massive distributed clusters

16×
Performance Improvement
1.2
PFLOPS Peak
95%
Of ML Compute in Matrix Operations
3.35
TB/s Memory

Real-Time Performance Dashboard

Interactive on the original page; example snapshot: GPU efficiency 75.2%, memory utilization 68.4%, training speed 342 tok/s, cost efficiency $0.045, scale factor 6.8×, overall score 8.4/10.

GPU Architectural Foundations

Tensor Core Performance Calculator

Example output for a 4096×4096 FP32 GEMM:

Operations Count
137.4 GFLOP
Execution Time
0.034 ms
Memory Required
201.3 MB
Arithmetic Intensity
682.7 ops/byte
GEMM Formula: C = αAB + βC
Operations = 2 × N³ (for N×N matrices)
Memory = 3 × N² × bytes_per_element
Arithmetic Intensity = Operations / Memory_transferred
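A minimal Python sketch of the calculator's arithmetic, applying the three formulas above; the N = 4096 and FP32 inputs are inferred from the displayed outputs:

```python
# GEMM cost model for a square N x N matrix multiply, C = alpha*AB + beta*C.
def gemm_stats(n: int, bytes_per_element: int = 4) -> dict:
    flops = 2 * n**3                       # one multiply + one add per inner-product term
    memory = 3 * n**2 * bytes_per_element  # read A and B, write C (ignores cache reuse)
    return {
        "gflop": flops / 1e9,
        "memory_mb": memory / 1e6,
        "ops_per_byte": flops / memory,    # arithmetic intensity
    }

# N = 4096 in FP32 reproduces the calculator output above:
# ~137.4 GFLOP, ~201.3 MB, ~682.7 ops/byte.
print(gemm_stats(4096))
```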

Architecture Comparison

Architecture | Parallelization | Memory Pattern | GPU Utilization | Tensor Core Benefit
Transformers | Excellent | Large matrix ops | 70-85% | 8× speedup
CNNs | Excellent | Regular, coalesced | 80-95% | 4× speedup
RNNs/LSTMs | Limited | Sequential dependencies | 10-40% | Minimal
GNNs | Poor | Irregular, sparse | 10-30% | None

GPU Performance Benchmark

GPU Model | Memory | FP32 TFLOPS | FP16 TFLOPS | Memory BW | Power | Price | Performance Score
H100 SXM | 80GB HBM3 | 67 | 1,979* | 3.35 TB/s | 700W | $30,000 | 9.8/10
A100 SXM | 80GB HBM2e | 19.5 | 312 | 2.0 TB/s | 400W | $15,000 | 8.5/10
V100 SXM | 32GB HBM2 | 15.7 | 125 | 900 GB/s | 300W | $8,000 | 6.8/10
RTX 4090 | 24GB GDDR6X | 82.6 | 165 | 1.0 TB/s | 450W | $1,600 | 7.2/10

*The H100 FP16 figure assumes 2:4 structured sparsity; dense FP16 peak is ~990 TFLOPS.
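A short roofline check follows from this table: a kernel is compute-bound when its arithmetic intensity exceeds the GPU's ratio of peak FLOPS to memory bandwidth. This sketch uses the H100's dense FP16 peak (~990 TFLOPS) as an assumed input:

```python
# Roofline model: compare a kernel's arithmetic intensity to the GPU's
# ridge point (peak FLOPs per byte of memory bandwidth).
def roofline(ops_per_byte: float, peak_tflops: float, bandwidth_tb_s: float) -> str:
    ridge = peak_tflops / bandwidth_tb_s   # ops/byte at which compute saturates
    bound = "compute-bound" if ops_per_byte >= ridge else "memory-bound"
    return f"ridge = {ridge:.0f} ops/byte -> {bound}"

# H100 dense FP16 (~990 TFLOPS, 3.35 TB/s) has a ridge near 295 ops/byte,
# so the 4096x4096 GEMM above (~683 ops/byte) is compute-bound.
print(roofline(683, 990, 3.35))
```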

Memory Calculations

Training Memory Calculator

Example output for a 175B-parameter model trained with FP32 Adam (the configuration implied by 700 GB of 4-byte parameters):

Model Parameters
700 GB
Gradient Memory
700 GB
Optimizer Memory
1400 GB
Activation Memory
2.4 TB
Total Memory Required
5.2 TB
GPUs Required (80GB each)
65 GPUs
Training Memory Formula
Total = Parameters + Gradients + Optimizer + Activations
Activations = B × S × H × L × (16 + 2/p) bytes (B = batch size, S = sequence length, H = hidden size, L = layers)
Mixed Precision: 10 bytes/param vs FP32: 16 bytes/param
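A sketch of the same accounting, assuming the 16 bytes/param FP32 regime above (4 B parameters + 4 B gradients + 8 B Adam moments); activation memory is passed in separately since it depends on batch and sequence shape:

```python
import math

# Training memory estimate: model state (params + grads + optimizer) plus activations.
def training_memory(n_params: float, bytes_per_param: float = 16,
                    activations_gb: float = 0.0, gpu_gb: float = 80) -> dict:
    state_gb = n_params * bytes_per_param / 1e9
    total_gb = state_gb + activations_gb
    return {"state_gb": state_gb,
            "total_gb": total_gb,
            "gpus_required": math.ceil(total_gb / gpu_gb)}

# 175B parameters in FP32 with 2.4 TB of activations reproduces the
# calculator above: 5.2 TB total, 65 x 80GB GPUs.
print(training_memory(175e9, 16, activations_gb=2400))
```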

Parameter Counting for Transformers

Example output for GPT-2 small (E = 768, V = 50,257, P = 1,024, L = 12):

Embedding Parameters
39.4M
Attention Parameters
28.3M
MLP Parameters
56.6M
Total Parameters
124.4M
Transformer Parameter Formula
C = E(V + P) + L(12E² + 13E) + 2E
E = embedding dim, V = vocab size, P = position embeddings, L = layers
12E² per layer = 4E² attention (QKV + output projection) + 8E² MLP (4E hidden dim)
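A direct implementation of the formula, as a rough count; it assumes learned position embeddings and a 4E MLP hidden dimension, as in GPT-2:

```python
# Transformer parameter count: C = E(V + P) + L(12E^2 + 13E) + 2E.
def transformer_params(E: int, V: int, P: int, L: int) -> dict:
    embedding = E * (V + P)
    per_layer = 12 * E**2 + 13 * E     # 4E^2 attention + 8E^2 MLP + biases/LayerNorm
    total = embedding + L * per_layer + 2 * E
    return {"embedding_M": embedding / 1e6,
            "attention_M": L * 4 * E**2 / 1e6,
            "mlp_M":       L * 8 * E**2 / 1e6,
            "total_M":     total / 1e6}

# GPT-2 small: E=768, V=50257, P=1024, L=12 -> ~124.4M parameters.
print(transformer_params(768, 50257, 1024, 12))
```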

Distributed Training Mathematics

Scaling Efficiency Calculator

Example output (these figures correspond to 8-way data parallelism with 350 GB of FP16 gradients, a ~50 GB/s interconnect, and ~0.1 s of compute per step):

Communication Volume
612.5 GB
Communication Time
12.25 s
Scaling Efficiency
0.8%
Effective Speedup
0.06×
Recommended Strategy
ZeRO-3 + Pipeline
Distributed Training Formulas
Efficiency = Compute_time / (Compute_time + Communication_time)
All-reduce: 2×(N-1)/N × model_size
Pipeline bubble: (stages-1) / microbatches
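The formulas above in code form, using the ring all-reduce volume model; the inputs shown are the ones inferred for the example output:

```python
# Data-parallel scaling model: ring all-reduce traffic vs. per-step compute.
def scaling_efficiency(n_gpus: int, grad_bytes: float,
                       bus_gb_s: float, compute_s: float) -> dict:
    volume_gb = 2 * (n_gpus - 1) / n_gpus * grad_bytes / 1e9  # all-reduce volume
    comm_s = volume_gb / bus_gb_s
    efficiency = compute_s / (compute_s + comm_s)
    return {"volume_gb": volume_gb,
            "comm_s": comm_s,
            "efficiency": efficiency,
            "effective_speedup": n_gpus * efficiency}

# 8 GPUs, 350 GB of FP16 gradients, 50 GB/s, 0.1 s compute per step:
# 612.5 GB, 12.25 s, ~0.8% efficiency, ~0.06x speedup (communication-dominated).
print(scaling_efficiency(8, 350e9, 50, 0.1))
```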

ZeRO Optimizer Analysis

• ZeRO Stage 1: partitions optimizer states across data-parallel ranks
• ZeRO Stage 2: additionally partitions gradients
• ZeRO Stage 3: additionally partitions parameters; per-GPU model-state memory now scales linearly (1/N) with GPU count
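A sketch of per-GPU model-state memory under each stage, using the mixed-precision Adam baseline from the ZeRO paper (2 B FP16 params + 2 B FP16 grads + 12 B FP32 optimizer states = 16 bytes/param):

```python
# Per-GPU model-state memory under ZeRO stages (mixed-precision Adam).
def zero_memory_gb(n_params: float, n_gpus: int, stage: int) -> float:
    if stage == 1:
        bpp = 2 + 2 + 12 / n_gpus        # partition optimizer states
    elif stage == 2:
        bpp = 2 + (2 + 12) / n_gpus      # + partition gradients
    elif stage == 3:
        bpp = (2 + 2 + 12) / n_gpus      # + partition parameters (linear scaling)
    else:
        bpp = 16                         # plain data parallelism, fully replicated
    return n_params * bpp / 1e9

# 7B parameters on 8 GPUs: 112 GB replicated -> 14 GB/GPU with ZeRO-3.
for stage in (0, 1, 2, 3):
    print(f"stage {stage}: {zero_memory_gb(7e9, 8, stage):.1f} GB/GPU")
```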

FLOPS Analysis & Performance

Training FLOPS Calculator

Example output for a 175B-parameter model trained on 300B tokens (GPT-3 scale):

FLOPs per Token
1.05 TFLOPs
Total Training FLOPs
3.15 × 10²³
Model FLOPS Utilization
60.6%
Training Time (days)
3.0 days
Performance Rating
Excellent
OpenAI Scaling Law
Training FLOPS ≈ 6N per token (N = non-embedding parameters)
Forward: 2N FLOPS, Backward: 4N FLOPS
MFU = Model_FLOPS_per_second / Hardware_peak_FLOPS
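The budget arithmetic as code; the 2 EFLOPS cluster peak is an assumed input, chosen so that at 60.6% MFU the run finishes in the 3.0 days shown above:

```python
# Compute budget from the ~6N-FLOPs-per-token approximation
# (2N forward + 4N backward, N = non-embedding parameters).
def training_budget(n_params: float, n_tokens: float,
                    cluster_peak_flops: float, mfu: float) -> dict:
    total_flops = 6 * n_params * n_tokens
    seconds = total_flops / (cluster_peak_flops * mfu)
    return {"total_flops": total_flops, "days": seconds / 86400}

# 175B params x 300B tokens = 3.15e23 FLOPs; at an assumed 2 EFLOPS peak
# and 60.6% MFU, that is ~3.0 days of training.
print(training_budget(175e9, 300e9, 2e18, 0.606))
```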

Inference FLOPS & KV Cache

Prefill FLOPs
35 TFLOPs
Generation FLOPs (total)
17.5 TFLOPs
KV Cache Memory
4.5 GB
Total Inference FLOPs
52.5 TFLOPs
Tokens per Second (est.)
4.5 tok/s
Inference FLOPS Formulas
Autoregressive generation: ≈ T × 2N FLOPs for T output tokens
KV Cache = B × S × H × L × 2 × bytes_per_param (the 2 covers keys and values)
Prefill attention cost scales with the square of the prompt length
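The KV-cache formula as code, with a hypothetical model shape; the batch, context, hidden size, and layer count below are illustrative inputs, not taken from the dashboard:

```python
# KV-cache size: batch x seq_len x hidden x layers x 2 (K and V) x bytes/param.
def kv_cache_gb(batch: int, seq_len: int, hidden: int,
                layers: int, bytes_per_param: int = 2) -> float:
    return batch * seq_len * hidden * layers * 2 * bytes_per_param / 1e9

# Hypothetical: batch 8, 4096-token context, hidden 4096, 32 layers, FP16
# -> ~17.2 GB of cache that must fit alongside the weights.
print(kv_cache_gb(8, 4096, 4096, 32))
```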

Cost Mathematics & Economics

Total Cost of Ownership Calculator

Example output (consistent with 8× H100 at $30,000 each, ~14 kW including cooling at $0.30/kWh, and $6/GPU-hour cloud pricing):

Hardware Cost
$240,000
Annual Power Cost
$36,792
5-Year On-Premises TCO
$423,960
5-Year Cloud Cost
$2,102,400
Savings with On-Premises
$1,678,440
Breakeven Point
6.8 months
TCO Analysis Formula
On-Premises = Hardware + (Power × Hours × Days × Years)
Cloud = Hourly_rate × Hours × Days × Years
Power includes 30-50% cooling overhead
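The TCO formulas as code; the inputs are the ones the example output above appears to assume, and the breakeven calculation ignores power costs (as the 6.8-month figure does):

```python
# On-premises vs. cloud TCO, following the formulas above.
def tco(hardware_usd: float, power_kw: float, usd_per_kwh: float,
        cloud_usd_per_hour: float, years: int = 5) -> dict:
    hours = 24 * 365 * years
    on_prem = hardware_usd + power_kw * hours * usd_per_kwh
    cloud = cloud_usd_per_hour * hours
    monthly_cloud = cloud_usd_per_hour * 24 * 365 / 12
    return {"on_prem_usd": on_prem,
            "cloud_usd": cloud,
            "savings_usd": cloud - on_prem,
            "breakeven_months": hardware_usd / monthly_cloud}

# 8x H100 at $30k, 14 kW with cooling at $0.30/kWh, vs. $6/GPU-hour cloud:
# $423,960 vs. $2,102,400 over 5 years; breakeven in ~6.8 months.
print(tco(240_000, 14, 0.30, 8 * 6))
```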

Inference Cost per Token

Effective Tokens/Hour
288,000
Cost per 1K Tokens
$0.083
Daily Revenue Potential
$5,760
Optimization Opportunity
Medium

Cost tiers by GPU utilization:

• $0.001 per 1K tokens: optimized (80%+ utilization)
• $0.005 per 1K tokens: average (40-60% utilization)
• $0.01+ per 1K tokens: poor (<30% utilization)
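A sketch of the per-token economics; the $24/hour instance price and 80 tok/s throughput are inferred from the 288,000 tokens/hour and $0.083 figures above:

```python
# Serving cost per 1K tokens as a function of throughput and utilization.
def cost_per_1k_tokens(gpu_usd_per_hour: float, tokens_per_s: float,
                       utilization: float) -> float:
    effective_tokens_per_hour = tokens_per_s * 3600 * utilization
    return gpu_usd_per_hour / effective_tokens_per_hour * 1000

# 80 tok/s fully utilized on a $24/hour instance -> ~$0.083 per 1K tokens;
# dropping to 50% utilization doubles the cost per token.
print(cost_per_1k_tokens(24, 80, 1.0))
print(cost_per_1k_tokens(24, 80, 0.5))
```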

Latest Optimization Strategies

FlashAttention Performance Impact

Memory Usage
2.1 GB
Speed Improvement
2.8×
H100 Utilization
75%
Memory Reduction
65%

FlashAttention-3 Improvements:

  • Warp specialization for the H100 architecture
  • Native FP8 support for maximum throughput
  • 75% of theoretical H100 maximum utilization
  • 1.5-2× speedup over FlashAttention-2
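To make the memory saving concrete, this sketch computes the size of the S × S attention score matrices that standard attention materializes in HBM and FlashAttention does not; the model shape is a hypothetical example:

```python
# Standard attention materializes a seq_len x seq_len score matrix per head;
# FlashAttention tiles the computation in on-chip SRAM instead.
def attention_scores_gb(batch: int, heads: int, seq_len: int,
                        bytes_per_el: int = 2) -> float:
    return batch * heads * seq_len**2 * bytes_per_el / 1e9

# Hypothetical: batch 8, 32 heads, 8192-token context in FP16 ->
# ~34.4 GB of score matrices that FlashAttention never writes to HBM.
print(attention_scores_gb(8, 32, 8192))
```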

Gradient Checkpointing Trade-offs

Memory Reduction
√L ≈ 75%
Compute Overhead
+20%
Effective Memory
125 GB
Batch Size Increase
2-4×
Gradient Checkpointing Formula
Memory = O(√L) instead of O(L)
Compute overhead ≈ 20% additional time
Net benefit: 2-4× larger batch sizes possible
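A sketch of the trade-off: with √L checkpoints, retained activation memory shrinks by the ratio √L/L, at the cost of roughly 20% recomputation. With L = 16 segments the ratio is 0.25, which reproduces the 75% reduction and 125 GB figure above (assuming 500 GB of unsharded activations):

```python
import math

# Gradient checkpointing: keep activations at ~sqrt(L) checkpoints and
# recompute the rest during the backward pass.
def checkpointed_activations(full_gb: float, segments: int) -> dict:
    ratio = math.sqrt(segments) / segments   # O(sqrt(L)) retained vs O(L)
    return {"memory_gb": full_gb * ratio,
            "memory_saved_pct": 100 * (1 - ratio),
            "compute_overhead_pct": 20}      # typical recomputation cost

# 500 GB of activations, 16 segments -> 125 GB retained (75% saved).
print(checkpointed_activations(500, 16))
```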

5D Parallelization Strategy

• Data parallel: batch splitting
• Tensor parallel: weight distribution across devices
• Pipeline parallel: layer splitting into stages
• Context parallel: sequence splitting
• Expert parallel: MoE expert distribution

Total GPUs Required
128 (the product of the five parallel degrees; see the sketch below)
Communication Complexity
High
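A minimal sketch of the composition rule; the 4 × 8 × 4 layout is one hypothetical way to reach the 128 GPUs shown above:

```python
# The total GPU count is the product of the parallelism degrees.
def total_gpus(dp: int, tp: int, pp: int, cp: int = 1, ep: int = 1) -> int:
    return dp * tp * pp * cp * ep

# Hypothetical layout: 4-way data x 8-way tensor x 4-way pipeline = 128 GPUs.
print(total_gpus(dp=4, tp=8, pp=4))
```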

Meta's Llama 3.1 405B Configuration:

  • 16,000+ H100 GPUs on Meta's production training clusters
  • 4D parallelism (tensor, pipeline, context, data); expert parallelism is unused since the model is dense
  • Optimal efficiency achieved at 64-512 GPU clusters
  • Sophisticated fault tolerance and checkpointing

GPU Mathematics Mastery

Understanding GPU mathematics is essential for building efficient, scalable ML infrastructure. From memory optimization through distributed communication complexity, these mathematical principles determine the success of modern AI projects.

10×
Cost Reduction Possible
75%
Max H100 Utilization
3.5×
Tensor Core Speedup
37%
Mixed Precision Savings