GPU Architectural Foundations
Tensor Core Performance Calculator
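Example results, consistent with an N×N GEMM where N = 4096 and elements are 4-byte FP32:
Metric | Value |
---|---|
Operations Count | 137.4 GFLOP |
Execution Time | 0.034 ms |
Memory Required | 201.3 MB |
Arithmetic Intensity | 682.7 ops/byte |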
GEMM Formula: C = αAB + βC
Operations = 2 × N³ (for N×N matrices)
Memory = 3 × N² × bytes_per_element
Arithmetic Intensity = Operations / Memory_transferred
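The formulas above can be checked with a few lines of Python. This is a minimal sketch, not the calculator's actual implementation: the function name `gemm_metrics` and its parameters are hypothetical, and it assumes each of A, B, and C is transferred exactly once, so the memory footprint and the memory transferred are the same number.

```python
def gemm_metrics(n, bytes_per_element=4):
    """Estimate cost figures for an N x N GEMM using the formulas above.

    Assumption: A, B, and C each cross the memory bus exactly once,
    so memory required equals memory transferred.
    """
    operations = 2 * n**3                         # one multiply and one add per inner-product term
    memory_bytes = 3 * n**2 * bytes_per_element   # three N x N matrices: A, B, C
    intensity = operations / memory_bytes         # operations per byte moved
    return {
        "operations": operations,
        "memory_bytes": memory_bytes,
        "intensity": intensity,
    }


m = gemm_metrics(4096, bytes_per_element=4)       # FP32 elements
print(f"Operations: {m['operations'] / 1e9:.1f} GFLOP")
print(f"Memory:     {m['memory_bytes'] / 1e6:.1f} MB")
print(f"Intensity:  {m['intensity']:.1f} ops/byte")
```

With N = 4096 and 4-byte elements, the intensity simplifies to N / 6, so it grows linearly with matrix size. Halving `bytes_per_element` (for example, FP16) doubles it.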
Architecture Comparison
Architecture | Parallelization | Memory Pattern | GPU Utilization | Tensor Core Benefit |
---|---|---|---|---|
Transformers | Excellent | Large matrix ops | 70-85% | 8× speedup |
CNNs | Excellent | Regular, coalesced | 80-95% | 4× speedup |
RNNs/LSTMs | Limited | Sequential dependencies | 10-40% | Minimal |
GNNs | Poor | Irregular, sparse | 10-30% | None |
GPU Performance Benchmark
GPU Model | Memory | FP32 TFLOPS | FP16 TFLOPS | Memory BW | Power | Price | Performance Score |
---|---|---|---|---|---|---|---|
H100 SXM | 80GB HBM3 | 67 | 989 (1,979 with sparsity) | 3.35 TB/s | 700W | $30,000 | 9.8/10 |
A100 SXM | 80GB HBM2e | 19.5 | 312 | 2.0 TB/s | 400W | $15,000 | 8.5/10 |
V100 SXM | 32GB HBM2 | 15.7 | 125 | 900 GB/s | 300W | $8,000 | 6.8/10 |
RTX 4090 | 24GB GDDR6X | 82.6 | 165 | 1.0 TB/s | 450W | $1,600 | 7.2/10 |
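One way to relate the arithmetic intensity figure to this table is a simple roofline estimate: a kernel is compute-bound only if its intensity exceeds the ratio of peak throughput to memory bandwidth. The sketch below is illustrative and rests on assumptions. The helper names `ridge_point` and `attainable_tflops` are hypothetical, the FP16 and bandwidth figures are copied from three rows of the table, and intensity is treated as a free parameter rather than being tied to a particular precision.

```python
# (peak FP16 TFLOPS, memory bandwidth in TB/s), copied from the table above
GPUS = {
    "A100 SXM": (312.0, 2.0),
    "V100 SXM": (125.0, 0.9),
    "RTX 4090": (165.0, 1.0),
}


def ridge_point(peak_tflops, bandwidth_tbs):
    """Intensity (FLOP/byte) above which a kernel is compute-bound.

    TFLOPS divided by TB/s leaves FLOP per byte.
    """
    return peak_tflops / bandwidth_tbs


def attainable_tflops(intensity, peak_tflops, bandwidth_tbs):
    """Roofline bound: limited by compute or by bandwidth, whichever is lower."""
    return min(peak_tflops, intensity * bandwidth_tbs)


for name, (peak, bw) in GPUS.items():
    ridge = ridge_point(peak, bw)
    high = attainable_tflops(682.7, peak, bw)   # intensity from the calculator example
    low = attainable_tflops(10.0, peak, bw)     # a low-intensity, bandwidth-limited kernel
    print(f"{name}: ridge {ridge:.1f} FLOP/byte, "
          f"bound at 682.7 = {high:.0f} TFLOPS, at 10 = {low:.0f} TFLOPS")
```

Under these assumptions, a large GEMM sits well above the ridge point of each listed GPU, while a kernel at 10 ops/byte is capped by memory bandwidth regardless of how many Tensor Core FLOPS are available.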