When you save a file to a solid-state drive, something happens at the atomic level that your hard drive could never accomplish. Electrons tunnel through an insulating barrier and become trapped in a microscopic cage, where they can remain for years without power. This is the fundamental magic of flash memory—and understanding it explains everything from why SSDs slow down when full to why they eventually wear out.
The first commercial flash memory chip appeared in 1988, but the technology traces back to a 1967 paper by Dawon Kahng and Simon Sze at Bell Labs. They proposed storing charge in a transistor’s floating gate—a conductive layer completely surrounded by insulator. Nearly six decades later, every NAND flash cell operates on this same principle, even as manufacturers have stacked cells hundreds of layers high and squeezed multiple bits into each one.
The Floating Gate: A Trap for Electrons
A NAND flash cell is, at its core, a transistor with an extra gate. The control gate connects to the wordline for addressing, while the floating gate sits isolated between two layers of oxide. When you program the cell, electrons tunnel through the thin tunnel oxide and accumulate on the floating gate. Their presence or absence changes the transistor’s threshold voltage—the voltage needed to turn it on.
Reading is straightforward: apply a voltage to the control gate and check if current flows. If the floating gate holds electrons, the cell requires higher voltage to conduct. If empty, it conducts at lower voltage. The difference represents your data.
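This read comparison can be sketched in a few lines. A minimal toy model, with made-up voltage values rather than real device parameters:

```python
# Toy model of a NAND read: a cell "stores" a bit as a threshold voltage.
# All voltage values here are illustrative, not real device parameters.

PROGRAMMED_VTH = 3.0   # electrons on the floating gate raise the threshold
ERASED_VTH = 1.0       # empty floating gate: low threshold
READ_VOLTAGE = 2.0     # reference voltage applied to the control gate

def read_cell(threshold_voltage: float) -> int:
    """Apply the read voltage; if the cell conducts, it reads as 1 (erased)."""
    conducts = READ_VOLTAGE > threshold_voltage
    return 1 if conducts else 0

assert read_cell(ERASED_VTH) == 1      # erased cell conducts -> logical 1
assert read_cell(PROGRAMMED_VTH) == 0  # programmed cell stays off -> logical 0
```

The whole trick of multi-level cells, covered below, is squeezing more reference voltages between those two thresholds.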
The programming mechanism—Fowler-Nordheim tunneling—requires voltages around 20 volts. This isn’t applied directly but generated internally through charge pumps, since standard logic circuits operate at much lower voltages. The high voltage accelerates electrons enough to quantum-mechanically tunnel through the oxide barrier. It’s inefficient but reliable enough: modern flash cells can endure thousands of these cycles before the oxide degrades.
```mermaid
graph TB
subgraph "Floating Gate Transistor"
CG[Control Gate]
FG[Floating Gate]
TO[Tunnel Oxide]
GO[Gate Oxide]
S[Source]
D[Drain]
SUB[Substrate]
end
CG --> TO
TO --> FG
FG --> GO
GO --> SUB
SUB --> S
SUB --> D
style FG fill:#f9f,stroke:#333,stroke-width:2px
style CG fill:#bbf,stroke:#333
style TO fill:#dfd,stroke:#333
style GO fill:#dfd,stroke:#333
```
The floating gate is completely isolated by oxide layers, trapping electrons indefinitely without power.
Why Erase Before Write Changes Everything
Hard drives write data by magnetizing regions on a spinning platter. The magnetic field simply overwrites whatever was there before. Flash memory cannot do this. Once electrons are on the floating gate, adding more electrons only changes bits from 1 to 0—you cannot selectively remove electrons from individual cells.
Erasing requires removing all electrons from a block of cells, resetting every bit to 1. This is the erase operation, and it happens at the block level, not the page level. A typical block contains 256 to 1024 pages, each page being 8-16KB. When you want to update a single byte in an existing page, the SSD must:
- Read the entire page into a register
- Modify the byte
- Write the new page to a different physical location (a “free” page)
- Mark the old page as invalid
- Eventually, when enough pages in a block are invalid, copy valid pages elsewhere and erase the entire block
This read-modify-write sequence is why SSDs require sophisticated controllers. The operating system sees a simple array of logical block addresses, but the controller maintains a complex mapping table translating these addresses to physical locations that shift constantly.
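The out-of-place update described above can be sketched as a few dictionaries. This is an illustrative toy, not a real controller's data structures:

```python
# Minimal sketch of an out-of-place update: writes never overwrite a page,
# they go to a fresh physical page and invalidate the old one.

flash = {}            # physical page number -> bytes
mapping = {}          # logical page number -> physical page number
invalid = set()       # physical pages holding stale data
next_free = 0         # naive free-page allocator

def write_page(lpn: int, data: bytes) -> None:
    """Write goes to a fresh physical page; the old one is only invalidated."""
    global next_free
    if lpn in mapping:
        invalid.add(mapping[lpn])   # old location becomes garbage, not overwritten
    flash[next_free] = data
    mapping[lpn] = next_free
    next_free += 1

def update_byte(lpn: int, offset: int, value: int) -> None:
    """Read-modify-write: changing one byte still costs a whole-page write."""
    page = bytearray(flash[mapping[lpn]])
    page[offset] = value
    write_page(lpn, bytes(page))

write_page(0, b"AAAA")
update_byte(0, 1, ord("B"))
assert flash[mapping[0]] == b"ABAA"
assert invalid == {0}   # the original physical page is now stale
```

Note that the logical address never changes from the host's point of view; only the physical location behind it moves.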
The Hierarchy: From Package to Page
Flash memory is organized in a strict hierarchy that affects both performance and parallelism.
Packages are the black chips visible on an SSD’s circuit board. A consumer SSD might have 4-8 packages; enterprise drives can have 16 or more.
Each package contains one or more dies (typically 2-4). A die is the smallest unit that can execute commands independently. Multiple dies allow parallel operations.
Each die contains one or more planes (usually 2). Planes can perform identical operations simultaneously—reading from two different blocks, for instance.
Each plane contains hundreds to thousands of blocks. The block is the smallest unit that can be erased. Modern blocks are 4-16MB.
Each block contains hundreds of pages. The page is the smallest unit for reading and writing, typically 8-16KB plus metadata and error correction bytes.
This hierarchy enables parallelism. An SSD with 8 packages, 4 dies per package, and 2 planes per die can theoretically perform 64 concurrent operations. The controller schedules operations across this parallel structure to maximize throughput.
```mermaid
graph TD
SSD[SSD Drive]
P1[Package 1]
P2[Package 2]
D1[Die 1]
D2[Die 2]
PL1[Plane 1]
PL2[Plane 2]
B1[Block 1]
B2[Block 2]
PG1[Page 1]
PG2[Page 2]
SSD --> P1
SSD --> P2
P1 --> D1
P1 --> D2
D1 --> PL1
D1 --> PL2
PL1 --> B1
PL1 --> B2
B1 --> PG1
B1 --> PG2
```
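The parallelism arithmetic above is easy to make concrete. A small sketch using the example geometry from the text (8 packages, 4 dies per package, 2 planes per die); the `locate` helper is a hypothetical address decomposition, not any real controller's scheme:

```python
# Decompose a flat plane index into the package/die/plane hierarchy.
# Geometry matches the example in the text; the layout is illustrative.

PACKAGES, DIES_PER_PACKAGE, PLANES_PER_DIE = 8, 4, 2

def independent_units() -> int:
    """Planes that can work concurrently: the theoretical parallelism."""
    return PACKAGES * DIES_PER_PACKAGE * PLANES_PER_DIE

def locate(plane_index: int) -> tuple[int, int, int]:
    """Map a flat plane index to (package, die, plane)."""
    package, rest = divmod(plane_index, DIES_PER_PACKAGE * PLANES_PER_DIE)
    die, plane = divmod(rest, PLANES_PER_DIE)
    return package, die, plane

assert independent_units() == 64   # matches the figure in the text
assert locate(0) == (0, 0, 0)
assert locate(63) == (7, 3, 1)     # last plane of the last die
```

Striping consecutive writes across plane indices like this is roughly how controllers keep every die busy.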
SLC, MLC, TLC, QLC: Trading Reliability for Density
The floating gate can store more than one bit. By programming it to multiple voltage levels, each cell represents additional data states:
- SLC (Single-Level Cell): 1 bit per cell, 2 voltage states. Fastest, most durable (50,000-100,000 P/E cycles), most expensive.
- MLC (Multi-Level Cell): 2 bits per cell, 4 voltage states. Good balance (3,000-10,000 P/E cycles).
- TLC (Triple-Level Cell): 3 bits per cell, 8 voltage states. High density, lower endurance (1,000-3,000 P/E cycles).
- QLC (Quad-Level Cell): 4 bits per cell, 16 voltage states. Maximum density, lowest endurance (100-1,000 P/E cycles).
The endurance penalty isn’t arbitrary. With more voltage states, the gaps between them shrink. Reading TLC requires distinguishing between 8 voltage levels instead of 2; the sense amplifiers must be more precise, and noise margins are smaller. Programming also takes longer because each step must be verified more carefully to ensure the cell lands in the correct voltage window.
```mermaid
graph LR
subgraph "SLC: 2 States"
S1["State 0"]
S2["State 1"]
end
subgraph "MLC: 4 States"
M1["11"]
M2["10"]
M3["00"]
M4["01"]
end
subgraph "TLC: 8 States"
T1["111"]
T2["110"]
T3["100"]
T4["101"]
T5["001"]
T6["000"]
T7["010"]
T8["011"]
end
```
More bits per cell means narrower voltage windows, requiring more precise sensing and slower programming.
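The state count and margin relationship can be expressed directly. The total voltage window below is an illustrative round number, not a real device parameter:

```python
# Bits per cell vs. voltage states: each extra bit doubles the states
# and halves the gap between them. VOLTAGE_WINDOW is a made-up figure.

VOLTAGE_WINDOW = 6.4  # volts spanned by all states; illustrative only

def states(bits_per_cell: int) -> int:
    return 2 ** bits_per_cell

def margin(bits_per_cell: int) -> float:
    """Voltage gap per state: halves with every extra bit."""
    return VOLTAGE_WINDOW / states(bits_per_cell)

for name, bits in [("SLC", 1), ("MLC", 2), ("TLC", 3), ("QLC", 4)]:
    print(f"{name}: {states(bits)} states, {margin(bits):.1f} V per state")

assert states(3) == 8                 # TLC distinguishes 8 levels
assert margin(4) == margin(1) / 8     # QLC margins are 8x tighter than SLC
```

That factor-of-eight squeeze is why QLC needs more sensitive sense amplifiers and slower, more careful program-verify loops.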
3D NAND: Building Upward When You Can’t Build Downward
For years, manufacturers scaled flash by shrinking cells, fitting more in the same silicon area. But below 20nm, physics intervened. Cells became so small that electrons could leak between them, and manufacturing yields plummeted.
In 2013, a different approach reached production: instead of shrinking cells horizontally, stack them vertically. 3D NAND (also called V-NAND) builds cells as vertical strings, with the channel running up through the center and wordlines encircling it like the floors of a building.
The manufacturing process is remarkably different from planar flash. Rather than etching individual transistors, manufacturers deposit alternating layers of oxide and sacrificial silicon nitride, then etch a deep channel hole through the entire stack and line it with a charge-trapping layer (itself silicon nitride) and the vertical channel. The sacrificial nitride layers are later hollowed out and replaced with metal wordlines, forming a gate around each cell.
This approach reversed the endurance decline. First-generation 3D NAND used larger process nodes (40nm+ equivalent), so cells were physically bigger and more robust. The vertical structure also meant manufacturers could afford to use better materials and more reliable cell designs. Modern 3D TLC often matches or exceeds planar MLC in endurance.
Current production has reached 232 layers, with 300+ layers in development. The industry is already discussing paths to 1,000 layers. The challenge isn’t making the layers—it’s keeping them straight. A 232-layer stack is roughly 5-7 microns tall; maintaining structural integrity during manufacturing requires extraordinary precision in deposition and etching.
The Flash Translation Layer: Making Flash Look Like a Disk
Operating systems expect storage to behave like hard drives: sectors that can be overwritten in place. Flash memory works nothing like this. The Flash Translation Layer (FTL) bridges this gap.
The FTL maintains a mapping table that translates logical block addresses (LBAs) from the host to physical block addresses (PBAs) in the NAND. When you write to LBA 1000, the FTL finds a free page, writes the data there, and updates the mapping table. The old physical location is marked invalid.
This mapping table is typically stored in DRAM for speed—a 1TB SSD needs roughly 1GB of mapping table for page-level mapping. Budget SSDs without DRAM use Host Memory Buffer (HMB) to store part of the table in system RAM, or store it in NAND with significant performance penalties.
The FTL also implements garbage collection. When blocks accumulate invalid pages, the valid pages must be copied elsewhere and the block erased. This happens in the background during idle time when possible, but under sustained write loads, garbage collection competes with host operations, causing performance drops.
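The garbage-collection step just described can be sketched in a few lines: pick the block with the most invalid pages, relocate its valid pages, erase it. The structures here are illustrative, not a real FTL:

```python
# Sketch of greedy garbage collection: the victim is the block with the
# most invalid pages, since its valid data is cheapest to relocate.

def garbage_collect(blocks: dict[int, dict[str, int]]):
    """blocks maps block id -> {'valid': n, 'invalid': n} page counts.
    Returns (erased block id, pages copied); copies are extra NAND writes."""
    victim = max(blocks, key=lambda b: blocks[b]["invalid"])
    copied = blocks[victim]["valid"]             # valid data must move first
    blocks[victim] = {"valid": 0, "invalid": 0}  # block erased, fully free
    return victim, copied

blocks = {
    0: {"valid": 60, "invalid": 4},
    1: {"valid": 10, "invalid": 54},   # mostly stale: the cheapest victim
}
victim, copied = garbage_collect(blocks)
assert victim == 1
assert copied == 10   # only 10 pages of background writes to reclaim the block
```

The `copied` count is exactly the write amplification contribution of garbage collection, which leads into the next section.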
Write Amplification: When 4KB Becomes 16KB
Write amplification occurs when the SSD writes more data to flash than the host requested. Several factors contribute:
Page alignment: Writing 4KB to a 16KB page means the other 12KB were written unnecessarily (though modern controllers handle this better than early designs).
Read-modify-write: If a write isn’t aligned to page boundaries, the controller must read the existing page, modify it, and write it back.
Garbage collection: Moving valid pages during block recycling generates additional writes.
Wear leveling: To prevent some blocks from wearing out faster than others, the controller occasionally moves static data.
The write amplification factor (WAF) is calculated as:
$$WAF = \frac{\text{NAND Writes}}{\text{Host Writes}}$$

A perfectly optimized SSD might achieve a WAF of 1.0; real-world values range from 2-5 for consumer drives to 10+ for poorly configured enterprise workloads. High WAF directly reduces endurance: if a drive has 3,000 P/E cycles and a WAF of 3, you only get 1,000 effective cycles.
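The arithmetic is trivial but worth pinning down; the numbers below are the ones used in this section:

```python
# The WAF formula and its endurance consequence, as small helpers.

def waf(nand_writes: float, host_writes: float) -> float:
    """Write amplification factor: NAND writes per host write."""
    return nand_writes / host_writes

def effective_cycles(rated_pe_cycles: int, waf_value: float) -> float:
    """Endurance the host actually sees once amplification is accounted for."""
    return rated_pe_cycles / waf_value

assert waf(300, 100) == 3.0
assert effective_cycles(3_000, 3.0) == 1_000   # the example from the text
```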
Wear Leveling: Spreading the Pain
NAND cells have finite endurance. Without intervention, frequently written blocks would wear out quickly while static data blocks remained fresh, creating a capacity failure while most cells were still usable.
Wear leveling algorithms track erase counts for each block and make decisions to balance wear. Static wear leveling periodically moves cold data from low-wear blocks to high-wear blocks, freeing the low-wear blocks for new writes. Dynamic wear leveling simply chooses the lowest-wear block among candidates for new writes.
The trade-off is complexity and write amplification. Moving static data generates writes that wouldn’t otherwise occur. Controllers must balance wear leveling aggressiveness against performance and endurance goals.
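Dynamic wear leveling, the simpler of the two policies above, reduces to one decision per write. A minimal sketch with illustrative data shapes:

```python
# Dynamic wear leveling: among the currently free blocks, always direct
# the next write to the one with the fewest erases.

def pick_block(erase_counts: dict[int, int], free_blocks: set[int]) -> int:
    """Choose the least-worn free block for the next write."""
    return min(free_blocks, key=lambda b: erase_counts[b])

erase_counts = {0: 120, 1: 15, 2: 300, 3: 15}
free = {0, 2, 3}
assert pick_block(erase_counts, free) == 3   # block 1 is low-wear but not free
```

Static wear leveling adds the second half: noticing that block 1's low count means it holds cold data, and occasionally evicting that data so the block rejoins the free pool.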
TRIM: Telling the SSD What You’ve Deleted
On a hard drive, deleting a file merely marks the space as available; the data remains until overwritten. On an SSD, this creates a problem: the SSD doesn’t know those pages are invalid, so it wastes time and endurance preserving them during garbage collection.
The TRIM command solves this. When you delete a file, the operating system sends TRIM to the SSD, indicating which LBAs are now unused. The SSD can then mark those physical pages as invalid, making them candidates for erase operations.
Without TRIM, an SSD that appears half-full might actually be nearly full of stale data, leaving little space for garbage collection. Performance degrades severely as the controller struggles to find free pages. With TRIM, the controller knows exactly which space is available.
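The effect of TRIM on the controller's bookkeeping can be sketched directly; the structures are illustrative, not a real drive's:

```python
# Minimal sketch of TRIM handling: deleted LBAs become invalid pages the
# garbage collector can reclaim without copying anything.

mapping = {100: 7, 101: 8, 102: 9}   # LBA -> physical page
invalid_pages = set()

def trim(lbas: list[int]) -> None:
    """Host tells the drive these LBAs no longer hold live data."""
    for lba in lbas:
        if lba in mapping:
            invalid_pages.add(mapping.pop(lba))

trim([100, 102, 999])                 # 999 was never mapped; safely ignored
assert mapping == {101: 8}
assert invalid_pages == {7, 9}        # GC can erase these without copying them
```

Without the `trim` call, pages 7 and 9 would stay in `mapping`, and garbage collection would keep dutifully copying dead data forever.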
Over-Provisioning: The Hidden Reserve
Every SSD has more flash capacity than the user can access. This over-provisioning (OP) serves several purposes:
Factory over-provisioning: A “1TB” SSD typically carries 1024GiB of raw flash but exposes only 1TB (about 931GiB) to the user—a gap that falls out of decimal gigabytes versus binary gibibytes. The hidden remainder handles manufacturing defects and provides working area.
User over-provisioning: Allocating extra unpartitioned space increases OP. Enterprise SSDs often have 20-30% OP; consumer drives typically have 7-15%.
More OP means more free blocks for garbage collection, reducing write amplification and maintaining performance under heavy loads. It also provides spare blocks to replace failed ones, extending apparent endurance.
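The over-provisioning arithmetic from this section, as a one-liner. The 1024GiB raw and ~931GiB usable figures are the ones quoted above:

```python
# Spare capacity as a fraction of user-addressable capacity.

def op_percent(raw_gib: float, usable_gib: float) -> float:
    return (raw_gib - usable_gib) / usable_gib * 100

# A "1TB" drive: 1024 GiB of raw NAND, ~931 GiB exposed to the user.
assert round(op_percent(1024, 931), 1) == 10.0

# Enterprise configuration: same raw flash, less exposed capacity.
assert round(op_percent(1024, 800), 1) == 28.0
```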
SLC Caching: The Speed Illusion
Modern TLC and QLC SSDs would be painfully slow without a trick: temporarily treating some cells as SLC. A TLC cell programmed to store only one bit can be written much faster because there are only two voltage states to distinguish. The controller writes incoming data to this pseudo-SLC cache, then gradually folds it to TLC in the background.
The cache size varies—some drives have fixed SLC partitions, others dynamically allocate space. When the cache fills, write speeds can drop from 2,000 MB/s to 200 MB/s as data must go directly to TLC or QLC. This explains why a large file transfer starts fast and suddenly slows dramatically.
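That write cliff is easy to model. The cache size and speeds below are the illustrative figures from this section, not measurements of any real drive:

```python
# Sustained-write model: full speed while the pseudo-SLC cache absorbs
# writes, then direct-to-TLC speed once it fills. All figures illustrative.

CACHE_GB = 100            # assumed pseudo-SLC cache size
SLC_SPEED = 2000          # MB/s while the cache absorbs writes
TLC_SPEED = 200           # MB/s once writes go straight to TLC

def transfer_seconds(total_gb: float) -> float:
    """Time to write total_gb, modeling the cliff when the cache fills."""
    fast_gb = min(total_gb, CACHE_GB)
    slow_gb = total_gb - fast_gb
    return (fast_gb * 1000) / SLC_SPEED + (slow_gb * 1000) / TLC_SPEED

assert transfer_seconds(100) == 50.0     # fits in cache: full speed throughout
assert transfer_seconds(200) == 550.0    # the second half is 10x slower
```

Doubling the transfer size here makes it eleven times slower, which matches the everyday experience of a big copy that starts fast and then crawls.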
Error Correction: When Bits Go Wrong
NAND flash is inherently unreliable. Each page includes spare bytes for error correction codes (ECC). Modern SSDs use LDPC (Low-Density Parity-Check) codes, which can correct dozens of bit errors per 1KB chunk.
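LDPC itself is far too involved for a snippet, but the principle—spare parity bits that locate and fix errors—can be shown with a much simpler classic, a Hamming(7,4) code. This is purely illustrative and is not what SSD controllers actually use:

```python
# Hamming(7,4): 4 data bits plus 3 parity bits; any single bit flip can be
# located and corrected. A toy stand-in for the far stronger LDPC codes.

def encode(d: list[int]) -> list[int]:
    """Positions 1..7 with parity bits at 1, 2, and 4."""
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def correct(code: list[int]) -> list[int]:
    """Recompute parities; their failure pattern names the flipped position."""
    s1 = code[0] ^ code[2] ^ code[4] ^ code[6]
    s2 = code[1] ^ code[2] ^ code[5] ^ code[6]
    s3 = code[3] ^ code[4] ^ code[5] ^ code[6]
    syndrome = s1 + 2 * s2 + 4 * s3        # 0 means no error detected
    if syndrome:
        code[syndrome - 1] ^= 1            # flip the bad bit back
    return [code[2], code[4], code[5], code[6]]  # recover the data bits

word = encode([1, 0, 1, 1])
word[4] ^= 1                               # simulate one bit error in flash
assert correct(word) == [1, 0, 1, 1]       # error located and fixed
```

Real LDPC codes carry the same idea to thousands of bits with iterative probabilistic decoding, correcting dozens of errors per chunk rather than one.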
As cells wear, error rates increase. The controller tracks these rates and can trigger early retirement of problematic blocks. Temperature also affects reliability—cold cells hold charge better but require higher voltages to program; hot cells program faster but leak charge more quickly.
Read disturb presents another challenge: reading a page applies a pass-through voltage to every other wordline in the same block, and over many thousands of reads this stress can weakly program neighboring cells. Controllers track read counts and migrate data from heavily-read blocks before errors accumulate.
Data Retention: The Shelf Life Problem
An SSD sitting unpowered isn’t truly static. Electrons slowly leak from floating gates, causing data to decay over time. This retention period depends on:
- Temperature: Higher temperatures accelerate leakage. An SSD rated for 1 year retention at 25°C might only last weeks at 85°C.
- Wear: Well-used cells leak faster. A cell near its P/E limit might retain data for months instead of years.
- Cell type: QLC’s narrow voltage margins mean small amounts of leakage cause bit errors sooner than SLC.
JEDEC standards specify minimum retention: 1 year for consumer SSDs, 3 months for enterprise. In practice, fresh drives often retain data for many years at room temperature. But a heavily used drive stored in a hot car could lose data in months.
NVMe vs SATA: The Interface Matters
The move from SATA to NVMe wasn’t just about bandwidth. SATA was designed for hard drives, with a single command queue depth of 32. NVMe was designed for flash, with 65,535 queues of 65,536 commands each.
This parallelism matters because flash can handle many concurrent operations across its multiple dies. A SATA SSD must serialize commands through the AHCI protocol, limiting parallelism. An NVMe SSD can keep all its dies busy simultaneously.
Latency also differs significantly. SATA’s AHCI adds protocol overhead. NVMe’s direct PCIe connection adds microseconds rather than milliseconds. For random I/O workloads, this latency advantage often matters more than raw bandwidth.
When SSDs Fail
SSDs fail differently than hard drives. HDDs typically fail catastrophically—heads crash, motors seize, or electronics fry. SSDs tend to fail gracefully: blocks are marked bad, capacity slowly decreases, performance degrades, but the drive often remains readable.
Backblaze’s reliability data shows SSDs with annualized failure rates around 0.5-1.5%, compared to 1.5-2.5% for HDDs. But the failure modes differ: SSDs rarely give audible warning, and sudden power loss can corrupt mapping tables or interrupt firmware updates catastrophically.
SSD controllers also have finite lifetimes, typically designed for 5-7 years of operation. The controller can fail while the NAND remains functional, making data recovery extremely difficult—the encryption keys and mapping tables are lost.
The Future: Where Flash Goes Next
The industry isn’t done scaling. Beyond stacking more layers, manufacturers are exploring:
PLC (Penta-Level Cell): 5 bits per cell, 32 voltage states. Density increases 25% over QLC, but endurance drops to perhaps 100 cycles. Suitable only for cold storage.
ZNS (Zoned Namespace): Exposing the flash’s zone structure to the host, allowing applications like databases to align writes with erase boundaries. Reduces write amplification by eliminating the FTL’s guessing game.
Computational Storage: Moving compute to the drive, processing data where it sits rather than moving it to the CPU. Early products target video transcoding and database operations.
The fundamental physics hasn’t changed since Kahng and Sze’s 1967 paper. We’re still trapping electrons on isolated gates. But the engineering around that simple concept has become extraordinarily sophisticated—and will continue evolving as long as we need to store more data in less space.
References
- Kahng, D., & Sze, S. M. (1967). A Floating-Gate and Its Application to Memory Devices. Bell System Technical Journal, 46(4), 1288-1295.
- IEEE. (2025). Milestones: The Floating Gate EEPROM, 1976-1978. https://ethw.org/Milestones:The_Floating_Gate_EEPROM,_1976_-_1978
- Cooke, J. (2007). Flash Memory 101: An Introduction to NAND Flash. Micron Technology. https://www.micron.com/-/media/client/global/documents/products/technical-note/nand-flash/tn2919.pdf
- Lee, B. C., et al. (2009). Architecting Flash Memory-Based SSDs for Reliability and Performance. MIT Computer Science and Artificial Intelligence Laboratory.
- Chen, F., et al. (2009). Understanding Intrinsic Characteristics and System Implications of Flash Memory Based Solid State Drives. ACM SIGMETRICS.
- Agrawal, N., et al. (2008). Design Tradeoffs for SSD Performance. USENIX ATC.
- Chung, T., et al. (2009). A Survey of Flash Translation Layer. Journal of Systems Architecture.
- Park, C., et al. (2008). A Reconfigurable FTL (Flash Translation Layer) Architecture for NAND Flash-Based Applications. ACM TODAES.
- Iyengar, A., et al. (2023). A Survey of Data Management on Modern SSDs. Communications of the ACM.
- Backblaze. (2024). Backblaze Drive Stats Report. https://www.backblaze.com/blog/backblaze-drive-stats-for-q2-2025/
- Hansen, C. (2019). Coding for SSDs - Part 3: Pages, Blocks, and the Flash Translation Layer. Code Capsule. https://codecapsule.com/2014/02/12/coding-for-ssds-part-3-pages-blocks-and-the-flash-translation-layer/
- flashdba. (2014). Understanding Flash: Blocks, Pages and Program / Erases. https://flashdba.com/2014/06/20/understanding-flash-blocks-pages-and-program-erases/
- Cai, Y., et al. (2017). Errors in Flash-Memory-Based Solid-State Drives: Analysis, Mitigation, and Recovery. Now Publishers.
- Liu, D., et al. (2018). Read Disturb Errors in MLC NAND Flash Memory. ACM/IEEE DSN.
- Marelli, A., & Micheloni, R. (2018). BCH and LDPC Error Correction Codes for NAND Flash Memories. Inside Solid State Drives, Springer.
- Zhao, K., et al. (2013). LDPC-in-SSD: Making Advanced Error Correction Codes Work for Solid-State Drives. USENIX FAST.
- Samsung Semiconductor. (2024). Different Types of NAND Flash. https://semiconductor.samsung.com/support/tools-resources/dictionary/semiconductors-101-part-2-different-types-of-nand-flash/
- Kingston Technology. (2025). Difference Between SLC, MLC, TLC and 3D NAND. https://www.kingston.com/en/blog/pc-performance/difference-between-slc-mlc-tlc-3d-nand
- Lam Research. (2024). The Road to 1,000 Layer 3D NAND. https://newsroom.lamresearch.com/road-1000-layer-3D-NAND
- NVM Express. (2024). NVMe Zoned Namespaces (ZNS) Command Set Specification. https://nvmexpress.org/specification/nvme-zoned-namespaces-zns-command-set-specification/