Storage is the backbone of any data system. Without smart ways to store and access data, even the fanciest algorithms fall flat. Today, we're breaking down everything you need to know about data system storage.
First Things First: What Even Is a Data System's Job?
Before we get to storage, let's set the stage. A data system (think databases, data warehouses) has two big jobs: store data and answer queries. But it's not as simple as saving a file to your laptop---there are tons of questions to answer first.
For storage: Where do we put the data? Disk? Cloud? Both? If the data is spread across multiple machines, the CAP theorem (Consistency, Availability, Partition tolerance) comes into play---during a network partition you can't have both consistency and availability, so you pick what matters most for your use case. How do we format it? CSV? Record files? And how long do we keep it? Stable storage (like a hard drive) for the long haul, or in-memory (like RAM) for fast but temporary access? Durability matters here---you don't want to lose data if the power goes out!
For queries: We need answers fast. Batch queries (like monthly sales reports) can wait a bit, but real-time queries (like tracking live app users) need results in milliseconds. And answers have to be accurate---do you need exact numbers, or is an approximate result okay? To pull this off, data systems rely on key components: indexing (to speed up searches), metadata (data about data, like file sizes), query processing (how to run the query), and query optimization (finding the fastest way to run it).
Storage Device Hierarchy: Speed vs. Capacity
Not all storage is created equal. Imagine a pyramid: at the top, super-fast but tiny devices; at the bottom, slow but massive ones. This is the storage-device hierarchy, and it's how data systems decide where to put data based on how often it's used.
At the very top are registers---built into the CPU, they're the fastest (nanoseconds!) but can only hold a few bytes. Next is cache (L1, L2, L3)---still fast, but bigger than registers. Then main memory (RAM)---faster than disks but loses data when power is off. Below that are solid-state disks (SSDs)---no moving parts, faster than hard drives. Then magnetic disks (HDDs)---the classic spinning disks, slower but cheaper. At the bottom are optical disks (like DVDs) and magnetic tapes---super cheap, huge capacity, but really slow (good for archiving old data).
To put speed differences in perspective: A CPU cycle takes 0.3 nanoseconds---if we scale that to 1 second, accessing main memory (120 ns) would take 6 minutes, and a rotational disk I/O (1-10 ms) would take 1-12 months! That's why data systems don't just dump everything into RAM---even though it's fast, it's too small and temporary. Most data lives on secondary storage (HDDs/SSDs), but we have to move it to RAM first before the CPU can process it.
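To make those scaled numbers concrete, here's a tiny back-of-the-envelope script in Python. The latency figures are the rough ballpark values quoted above (the SSD number is an added assumption), not measurements:

```python
# Back-of-the-envelope scaling: if one 0.3 ns CPU cycle were stretched to 1 second,
# how long would other latencies feel? Figures are rough ballparks, not benchmarks.
CPU_CYCLE_NS = 0.3
SCALE = 1.0 / CPU_CYCLE_NS  # nanoseconds -> "perceived" seconds

def humanize(seconds: float) -> str:
    for unit, size in [("days", 86_400), ("hours", 3_600), ("minutes", 60)]:
        if seconds >= size:
            return f"{seconds / size:,.1f} {unit}"
    return f"{seconds:,.1f} seconds"

for name, ns in [("main memory access (~120 ns)", 120),
                 ("SSD random read (assumed ~0.1 ms)", 100_000),
                 ("HDD seek + rotation (~10 ms)", 10_000_000)]:
    print(f"{name}: ~{humanize(ns * SCALE)}")
# main memory access (~120 ns): ~6.7 minutes
# SSD random read (assumed ~0.1 ms): ~3.9 days
# HDD seek + rotation (~10 ms): ~385.8 days
```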
The 3-Level Storage Hierarchy for Data Systems
Data systems simplify the device hierarchy into three levels, each with tradeoffs between capacity, speed, and cost:
- Primary Storage: This is the "fast and small" tier---RAM and cache. It's used for data the CPU is actively processing. Speed is top-tier (nanoseconds), but capacity is low, and cost per byte is high. It's also volatile (loses data when power is off).
- Secondary Storage: The "middle ground" for everyday data---HDDs and SSDs. It's non-volatile (keeps data when power is off), has more capacity than primary storage, and is cheaper per byte. But it's slower (microseconds to milliseconds). For databases, this is where most data lives by default---since databases are way too big to fit in RAM.
- Tertiary Storage: The "slow and huge" tier---optical disks, tapes. It's for long-term archiving (data you rarely access). Capacity is massive, cost per byte is super low, but access speed is glacial (seconds to minutes).
The golden rule here: As you go down the hierarchy, capacity increases, speed decreases, and cost per byte decreases. Data systems move data between tiers automatically---frequently used data goes to primary storage for speed, while rarely used data drops to tertiary for cost savings.
The Big Bottleneck: Secondary Storage Access
Here's a problem: Databases are too large to fit in RAM, so they live on secondary storage (HDDs/SSDs). But the CPU can't access data directly from secondary storage---it has to first load data into RAM, then into registers. This means secondary storage access time is the biggest bottleneck for data systems.
Let's crunch the numbers: Accessing data from RAM takes 30-120 nanoseconds. From an HDD? 10-60 milliseconds---roughly 100,000x slower. From an SSD? Tens of microseconds---far better than an HDD, but still orders of magnitude slower than RAM. So if we want fast queries, we need to minimize how often we access secondary storage. That's where smart file structures and optimization come in---but first, let's understand how disks store data.
Disk Anatomy: How Data Lives on HDDs
Let's focus on HDDs first (they're still widely used for large-scale storage). An HDD is a spinning platter with concentric circles called tracks. Each track is split into small chunks called sectors---usually 512 bytes to a few KB. To store database data, we follow a simple recipe:
- Take a magnetic disk (the platter).
- Draw concentric tracks on it.
- Split each track into sectors.
- Take the database data and organize it into records (each record is a row in a table, like an employee's name, SSN, and salary).
- Group records into files (a file is a collection of related records, like all employee records).
- Split each file into blocks---fixed-size chunks (usually 1-10 KB). Blocks are the smallest unit of I/O---when you read/write data, you move entire blocks, not individual records.
- Store each block across a set of sectors (since a block is bigger than a sector).
Pro tip: Databases don't interact directly with the disk---they use the operating system's file system. The file system handles the low-level details of where blocks are stored on tracks/sectors.
How We Access Data on Disks: Seek and Rotation Delay
When you need to read a block from an HDD, the disk has to do two things:
- Seek Delay: The read/write head moves to the correct track. This takes 3-6 milliseconds---think of it like moving a needle to the right groove on a vinyl record.
- Rotation Delay: The platter spins until the desired sector is under the read/write head. This takes 3-5 milliseconds---like waiting for the right part of the record to spin under the needle.
Once the head is in place, transferring data from the track to the disk buffer (and then to RAM via DMA) is super fast. So the total I/O time is mostly seek + rotation delay. The optimization goal here is simple: organize blocks on the disk so that seek and rotation delays are minimized. For example, storing related blocks next to each other (contiguous) means less seeking.
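If you want to play with those numbers, here's a rough cost model in Python. The default figures are just the ballpark ranges above (not specs of any real drive), and the 0.1 ms per-block transfer time is an assumption for illustration:

```python
# Rough model of one random HDD block read: seek + rotational delay + transfer.
# All numbers are ballpark figures from the text, not real device specs.
def hdd_block_read_ms(seek_ms=4.5, rotation_ms=4.0, transfer_ms=0.1):
    """Estimated time to read one block at a random position on the platter."""
    return seek_ms + rotation_ms + transfer_ms

random_read = hdd_block_read_ms()
# For contiguous blocks, we pay seek + rotation once, then only transfer time.
sequential_100 = hdd_block_read_ms() + 99 * 0.1

print(f"1 random block read : ~{random_read:.1f} ms")
print(f"100 contiguous reads: ~{sequential_100:.1f} ms "
      f"(vs ~{100 * random_read:.0f} ms if the blocks were scattered)")
```

The point the model makes: once related blocks sit next to each other, the expensive seek and rotation costs are paid once instead of per block.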
Record Types: Fixed-Length vs. Variable-Length
Records (rows of data) come in two flavors, and how we store them affects efficiency:
Fixed-Length Records
Every record has the same size---set by the designer. For example, an employee record might be 150 bytes: 30 bytes for name, 10 for SSN, 60 for address, 50 for other fields. This is easy to handle---you can calculate exactly where each record starts in a block (e.g., record 1 starts at byte 0, record 2 at byte 150, etc.).
Variable-Length Records
Record sizes vary because some fields are optional (e.g., a "middle name" field that's 0 bytes if empty) or have variable lengths (e.g., a "comments" field that can be 10 or 1000 bytes). Storing these is trickier---we need ways to tell where one field/record ends and the next begins.
Two common solutions:
- Magic Sequences: Use special characters as separators. For example, a "|" might separate fields, and a "#" might end a record. But if the actual data contains these characters (e.g., a comment like "Hello | World"), it breaks---so we need to "escape" them (e.g., use "\|" instead).
- Prepend Field Sizes: For each field, store its length first (as a fixed-size number, like a 32-bit integer) followed by the field data. For example, a name field might be "11Smith, John" (11 bytes for the name, so length is 11). This avoids separator issues---you just read the length, then the next N bytes are the field data.
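Here's a minimal sketch of the length-prefix approach in Python. The 4-byte big-endian length is an arbitrary choice for illustration; real formats pick their own width and byte order:

```python
import struct

# Minimal sketch of length-prefixed variable-length fields.
# Each field is stored as a 4-byte big-endian length followed by its bytes
# (the 4-byte width is an assumption for illustration).

def encode_record(fields: list[str]) -> bytes:
    out = bytearray()
    for field in fields:
        data = field.encode("utf-8")
        out += struct.pack(">I", len(data))  # prepend the field size
        out += data
    return bytes(out)

def decode_record(buf: bytes) -> list[str]:
    fields, offset = [], 0
    while offset < len(buf):
        (size,) = struct.unpack_from(">I", buf, offset)  # read the length...
        offset += 4
        fields.append(buf[offset:offset + size].decode("utf-8"))  # ...then N bytes
        offset += size
    return fields

record = encode_record(["Smith, John", "123456789", "Hello | World"])
print(decode_record(record))  # separators inside the data cause no trouble
```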
Blocking Factor: How Many Records Fit in a Block?
Blocks are fixed-size (set by the OS, usually 512 bytes to 4096 bytes). The blocking factor (bfr) tells us how many records fit into one block---it's the foundation of calculating I/O cost (since we access blocks, not records).
The formula is simple:
bfr = floor(B / R)
Where:
- B = block size (in bytes)
- R = record size (in bytes)
- floor() = round down to the nearest whole number (you can't fit a fraction of a record in a block).
For example, if B = 512 bytes and R = 100 bytes, bfr = 512 / 100 = 5.12 → floor to 5. So each block holds 5 records.
Why does this matter? If you have 1000 records, you'll need 1000 / 5 = 200 blocks. Fewer blocks mean fewer I/O operations---faster queries!
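In code, both calculations are a one-liner each: floor division for the blocking factor, ceiling division for the block count. A quick sketch:

```python
import math

def blocking_factor(block_size: int, record_size: int) -> int:
    """How many whole records fit in one block: bfr = floor(B / R)."""
    return block_size // record_size  # integer division = floor

def blocks_needed(num_records: int, bfr: int) -> int:
    """Blocks needed for r records: ceil(r / bfr)."""
    return math.ceil(num_records / bfr)

bfr = blocking_factor(512, 100)       # 5 records per block
print(bfr, blocks_needed(1000, bfr))  # 5 200
```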
Allocating Blocks to Files: Contiguous, Linked, Indexed
Once we have blocks of records, we need to allocate them to files on the disk. There are three main methods, each with pros and cons:
1. Contiguous Allocation
All blocks of a file are stored next to each other (spatially consecutive) on the disk. Think of a book---pages are in order. To find a block, you just need the starting address and the number of blocks (e.g., start at block 10, 5 blocks total → blocks 10-14).
Pros: Super fast access---no seeking between blocks (since they're next to each other). Great for sequential access (e.g., reading all employee records in order).
Cons: Inflexible. If the file grows, you need to find contiguous free space (which might not exist). If the file shrinks, you leave "holes" of wasted space.
2. Linked Allocation
Blocks are scattered across the disk, but each block has a pointer to the next block in the file (like a linked list). The file header stores the address of the first block. To read the file, you start at the first block, follow the pointer to the second, and so on.
Pros: Flexible---you can add blocks anywhere there's free space (no need for contiguity). No wasted space from holes.
Cons: Slow sequential access---you have to follow pointers (each pointer means a seek delay). You can't jump to a specific block (e.g., block 5) directly---you have to start from the first block and iterate.
3. Indexed Allocation
There's a special "index block" that stores pointers to every block of the file. The file header just needs the address of the index block. To find any block, you read the index block (one I/O), look up the pointer, and go directly to the block.
Pros: Fast random access---jump to any block in one I/O (plus the index block). Flexible---add blocks by updating the index.
Cons: Uses extra space for the index block (e.g., a file with 1000 blocks needs an index block with 1000 pointers). For very large files, you might need multiple index blocks (a "multi-level index").
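To see the difference in access patterns, here's a toy sketch contrasting linked and indexed allocation. The block addresses and the "disk" dictionaries are invented purely for illustration:

```python
# Toy contrast of linked vs. indexed allocation.
# linked_disk maps a block address to (data, next_block_address); the
# addresses and contents are made up for illustration only.
linked_disk = {
    10: ("block A", 47),
    47: ("block B", 3),
    3:  ("block C", None),   # None = end of file
}

def read_linked(first_block: int) -> list[str]:
    """Sequential read: follow pointers block by block (one seek per hop)."""
    data, addr = [], first_block
    while addr is not None:
        payload, addr = linked_disk[addr]
        data.append(payload)
    return data

# Indexed allocation: one index block lists every data block's address.
index_block = [10, 47, 3]                               # pointers, in file order
data_blocks = {10: "block A", 47: "block B", 3: "block C"}

def read_indexed(block_number: int) -> str:
    """Random access: look up the index block, then go straight to the block."""
    return data_blocks[index_block[block_number]]

print(read_linked(10))   # ['block A', 'block B', 'block C']
print(read_indexed(2))   # 'block C', with no pointer chasing
```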
File Structures for Databases: Heap, Sequential, Hash
Now that we know how to allocate blocks, let's talk about file structures---how we organize records within blocks to minimize I/O cost. The three most common structures are Heap (unordered), Sequential (ordered), and Hash (hashed). We'll judge each by how fast they are for three operations: retrieval (finding a record), insertion (adding a record), and deletion (removing a record).
1. Heap Files (Unordered Files)
A heap file is the simplest structure: new records are added to the end of the last block (append). There's no order---records are stored in the order they're inserted.
Insertion: Super Fast
To insert a record:
- Find the last block of the file (the file header stores its address).
- If there's space, add the record and write the block back to disk.
- If the last block is full, create a new block, add the record, and update the file header.
Complexity: O(1) block accesses---you only touch the last block (or one new block).
Retrieval: Slow (Linear Search)
To find a record (e.g., "find employee with SSN 123456789"):
- You have to scan every block from the first to the last.
- For each block, load it into RAM, search for the record, and move to the next block if not found.
On average, you access ~b/2 blocks (where b is the total number of blocks). If the record isn't in the file, you access all b blocks.
Complexity: O(b) block accesses---slow for large files.
Deletion: Slow (Linear Search + Update)
To delete a record:
- First, find the block with the record (O(b) accesses, same as retrieval).
- Load the block into RAM, remove the record, and write the block back to disk.
This creates "holes" (empty space) in the block. To fix this, you can either:
- Use tombstone markers: Mark the record as deleted (e.g., set a bit from 0 to 1) instead of removing it. Later, you can "compact" the file---remove tombstones and shift records to fill holes.
- Shift records in the block to fill the hole (but this takes extra I/O if the block has many records).
Complexity: O(b) + O(1) block accesses---slow because of the initial search.
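Here's a small in-memory sketch that ties the three heap-file operations together. Blocks are plain Python lists and the records-per-block limit is arbitrary; a real heap file would read and write disk blocks, but the access pattern is the same:

```python
# In-memory sketch of a heap (unordered) file. Blocks are plain lists and
# RECORDS_PER_BLOCK is an arbitrary toy value; a real system would do block I/O.
RECORDS_PER_BLOCK = 4
TOMBSTONE = None  # deleted slots are marked, not physically removed

class HeapFile:
    def __init__(self):
        self.blocks = [[]]  # start with one empty block

    def insert(self, record: dict) -> None:
        """O(1) blocks touched: append to the last block, or open a new one."""
        if len(self.blocks[-1]) == RECORDS_PER_BLOCK:
            self.blocks.append([])
        self.blocks[-1].append(record)

    def find(self, field: str, value) -> dict | None:
        """Linear search: scan every block until a match is found (O(b))."""
        for block in self.blocks:
            for record in block:
                if record is not TOMBSTONE and record.get(field) == value:
                    return record
        return None

    def delete(self, field: str, value) -> bool:
        """Linear search, then overwrite the slot with a tombstone marker."""
        for block in self.blocks:
            for i, record in enumerate(block):
                if record is not TOMBSTONE and record.get(field) == value:
                    block[i] = TOMBSTONE
                    return True
        return False

hf = HeapFile()
for ssn in ["111", "222", "333", "444", "555"]:
    hf.insert({"ssn": ssn, "name": f"Employee {ssn}"})
print(hf.find("ssn", "444"))
hf.delete("ssn", "222")
```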
2. Sequential Files (Ordered Files)
A sequential file stores records in sorted order based on an "ordering field" (e.g., SSN, name). If the ordering field is unique (like SSN), it's called the "ordering key." This structure is great for queries that need sorted data or range searches (e.g., "find all employees with SSN between 100000000 and 200000000").
Retrieval: Fast (If Using the Ordering Field)
If you're searching by the ordering field (e.g., "find employee with SSN 123456789"):
- Use binary search on the blocks. Binary search splits the search space in half each time---no need to scan all blocks.
For example, if there are 1000 blocks, binary search takes log₂(1000) ≈ 10 steps---way faster than linear search.
Complexity: O(log₂b) block accesses.
If you're searching by a non-ordering field (e.g., "find employee with name 'John'" when ordering is by SSN), you can't use binary search---you have to scan all blocks, same as a heap file. Complexity: O(b).
Range Queries: Efficient (If On Ordering Field)
For range queries (e.g., "find employees with SSN 100000000--200000000"):
- Use binary search to find the block with the lower bound (SSN 100000000) → O(log₂b) accesses.
- Then, load the following blocks in order until you pass the upper bound (SSN 200000000) → up to O(b) accesses in the worst case, but you only touch the blocks that actually fall in the range.
Complexity: O(log₂b) + O(b) in the worst case---but much faster than a heap file, which has to scan every block no matter how narrow the range is.
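Here's a compact sketch of both searches over an ordered file, using Python's bisect module to binary-search on each block's highest key. The block size and SSN values are made up for illustration:

```python
import bisect

# Sketch of a sequential (ordered) file: blocks are sorted by the ordering key,
# and we binary-search on each block's highest key to find the right block.
RECORDS_PER_BLOCK = 3
records = sorted(
    [{"ssn": s} for s in [105, 120, 133, 150, 171, 189, 204, 230, 255]],
    key=lambda r: r["ssn"],
)
blocks = [records[i:i + RECORDS_PER_BLOCK]
          for i in range(0, len(records), RECORDS_PER_BLOCK)]
max_key_per_block = [block[-1]["ssn"] for block in blocks]  # for binary search

def find_equal(ssn: int) -> dict | None:
    """Equality search on the ordering field: O(log2 b) block accesses."""
    b = bisect.bisect_left(max_key_per_block, ssn)  # which block could hold it?
    if b == len(blocks):
        return None
    return next((r for r in blocks[b] if r["ssn"] == ssn), None)

def find_range(lo: int, hi: int) -> list[dict]:
    """Range search: binary-search the lower bound, then scan forward."""
    b = bisect.bisect_left(max_key_per_block, lo)
    out = []
    while b < len(blocks) and blocks[b][0]["ssn"] <= hi:
        out.extend(r for r in blocks[b] if lo <= r["ssn"] <= hi)
        b += 1
    return out

print(find_equal(171))       # {'ssn': 171}
print(find_range(120, 200))  # all records with 120 <= ssn <= 200
```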
Insertion: Expensive
Inserting a record means keeping the file sorted. Here's how it works:
- Use binary search to find the block where the record should go (O(log₂b) accesses).
- If the block has space, insert the record and shift existing records to make room.
- If the block is full, you have to split the block (or use an overflow block):
- Overflow blocks: Add the new record to a separate overflow block, and link it to the original block (like a linked list). Later, you can reorganize the file to merge overflow blocks back into the main sequence.
Without overflow blocks, keeping the file strictly sorted means shifting records to make room---on average about half the records in the file---which is very slow for large files.
Complexity: O(log₂b) + O(1) per insert when using overflow blocks, plus the cost of periodic reorganization → still expensive for big datasets.
Deletion: Expensive
To delete a record:
- Use binary search to find the block (O(log₂b) accesses).
- Mark the record as deleted (tombstone) or shift records to fill the hole.
- Periodically, reorganize the file to re-sort and remove tombstones (this uses external sorting, which is slow for large files).
Complexity: O(log₂b) + O(1) → expensive because of the reorganization step.
Updates: Depends on the Field
- Updating a non-ordering field (e.g., changing an employee's salary when ordering is by SSN): Fast. Find the block (O(log₂b)), update the field, and write the block back. Complexity: O(log₂b) + O(1).
- Updating the ordering field (e.g., changing an employee's SSN): Slow. You have to delete the record from its old position (O(log₂b)) and insert it into the new position (O(log₂b) + O(1)). That's two full ordered-file operations---plus the record shifting that comes with them---per update → expensive.
3. Hash Files
Hash files are designed for equality queries (e.g., "find employee with SSN = 123456789"). They use a hash function to map each record to a specific "bucket" (a group of blocks). The goal is to spread records evenly across buckets so that each bucket has roughly the same number of records.
How Hashing Works
- Choose a hash field (e.g., SSN) and a number of buckets (M, e.g., 100).
- Use a hash function h(k) to compute a bucket ID for each record's hash field k. A common hash function is h(k) = k mod M (e.g., SSN 123456789 mod 100 = 89 → bucket 89).
- Each bucket has one or more blocks. The file header stores a "hash map" (bucket ID → block address) for the main blocks of each bucket.
This is called external hashing (since hashing happens on disk, not in memory).
Retrieval: Super Fast (O(1) Normally)
To find a record with hash key k (e.g., SSN 123456789):
- Load the hash map from the file header into RAM (1 block access).
- Compute the bucket ID: y = h(k) (e.g., 89).
- Look up the block address for bucket y in the hash map.
- Load that block into RAM (1 block access).
- Search the block for the record (linear search in RAM, which is fast).
If the record is in the main bucket block, you only need 2 block accesses---O(1) complexity.
The Problem: Collisions
Hash functions aren't perfect---sometimes two records map to the same bucket (e.g., SSN 123456789 mod 100 = 89, and SSN 987654389 mod 100 = 89). This is a collision.
To fix collisions, we use overflow buckets:
- Each main bucket has a pointer to an overflow bucket (initially NULL).
- If a main bucket is full, the new record goes into the overflow bucket. If that's full, add another overflow bucket and link it to the first.
When retrieving a record from a bucket with overflows, you have to load the main block plus all overflow blocks (O(1) + O(n), where n is the number of overflow blocks). But if the hash function is good (spreads records evenly), collisions are rare.
Insertion: Fast (O(1) Normally)
To insert a record:
- Compute the bucket ID y = h(k) (using the hash map, 1 block access).
- Load the main block of bucket y (1 block access).
- If there's space, add the record and write the block back.
- If full, add the record to an overflow bucket (link it to the main bucket, 1 more block access).
Complexity: O(1) (2-3 block accesses) if no overflows; O(1) + O(n) if there are overflows.
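Here's a minimal sketch of external hashing with chained overflow blocks, covering both insertion and lookup. The bucket count and two-records-per-block capacity are toy values (they happen to match the quiz at the end), and the access count here covers only bucket blocks, not the hash map:

```python
# Minimal sketch of external hashing with chained overflow blocks.
# M (number of buckets) and the 2-records-per-block capacity are toy values.
M = 3
RECORDS_PER_BLOCK = 2

class Block:
    def __init__(self):
        self.records = []     # up to RECORDS_PER_BLOCK records
        self.overflow = None  # pointer to the next overflow Block, if any

buckets = [Block() for _ in range(M)]

def insert(record: dict) -> None:
    """Hash the key to a bucket; chain a new overflow block if the chain is full."""
    block = buckets[record["ssn"] % M]
    while len(block.records) == RECORDS_PER_BLOCK:
        if block.overflow is None:
            block.overflow = Block()
        block = block.overflow
    block.records.append(record)

def find(ssn: int) -> tuple[dict | None, int]:
    """Equality lookup: returns (record, number of bucket blocks touched)."""
    block, accesses = buckets[ssn % M], 0
    while block is not None:
        accesses += 1
        for record in block.records:
            if record["ssn"] == ssn:
                return record, accesses
        block = block.overflow
    return None, accesses

for ssn in [1000, 4540, 4541, 4323, 1321, 1330]:
    insert({"ssn": ssn})
print(find(1330))  # found in bucket 1's overflow chain -> 2 bucket blocks touched
```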
Deletion: Fast (If Using Hash Field)
- Deleting by hash field: Find the bucket (O(1) accesses), load the block, delete the record, and write back. If the record is in an overflow bucket, you may need to load overflow blocks (O(n) accesses). You can also move a record from an overflow bucket to the main bucket to fill space.
- Deleting by non-hash field: Slow. You have to scan all buckets and overflow blocks---O(b) accesses, same as a heap file.
Complexity: O(1) (or O(1) + O(n) for overflows) if using the hash field; O(b) otherwise.
Updates: Depends on the Field
- Updating a non-hash field: Fast. Find the bucket (O(1)), update the field, write back. Complexity: O(1) + O(n) (for overflows).
- Updating the hash field: More work. You have to delete the record from the old bucket (O(1)) and insert it into the new bucket (O(1))---two operations instead of one, but still far cheaper than the same update in a sequential file.
The Catch: Bad for Range Queries
Hash files are terrible for range queries (e.g., "find employees with SSN 100000000--200000000"). Why? Because the hash function spreads records evenly---logically continuous values (like 100000000 and 100000001) can map to completely different buckets. To run a range query, you have to treat each value in the range as a separate equality query---O(n) complexity, where n is the number of values in the range.
Quiz Time! Let's Test Your Skills
We've covered a lot---now let's put your knowledge to the test with two quiz questions. Each has a step-by-step solution, so don't worry if you get stuck.
Quiz 1: Calculating Blocking Factor and Number of Blocks
Problem: The EMPLOYEE relation has 1103 fixed-length records. Each record has three fields: NAME (30 bytes), SSN (10 bytes), and ADDRESS (60 bytes). The OS uses a block size of 512 bytes.
a) What is the blocking factor (bfr) for the EMPLOYEE file?
b) How many blocks are needed to store all 1103 records?
Step-by-Step Solution
Part a: Calculate Blocking Factor (bfr)
First, find the size of one record (R):
R = NAME size + SSN size + ADDRESS size = 30 + 10 + 60 = 100 bytes.
The blocking factor formula is bfr = floor(B / R), where B = 512 bytes (block size).
Plug in the numbers:
bfr = floor(512 / 100) = floor(5.12) = 5.
So each block holds 5 records.
Part b: Calculate Number of Blocks
Total records (r) = 1103.
Number of blocks = ceil(r / bfr) → we use ceil() because even if the last block is not full, we still need a block for the remaining records.
Plug in the numbers:
1103 / 5 = 220.6 → ceil(220.6) = 221.
So we need 221 blocks to store all records.
Quiz 2: Hash File Block Accesses (Worst Case)
Problem: A hash file for EMPLOYEE uses SSN as the hash field. The file has M = 3 buckets, 1 block per main bucket, and a blocking factor of 2 records per block. The SSN values of the 6 employees are: {1000, 4540, 4541, 4323, 1321, 1330}. The hash function is h(SSN) = SSN mod 3.
a) Assign each employee to a bucket.
b) Calculate the expected number of block accesses for a random equality query (e.g., "SELECT * FROM EMPLOYEE WHERE SSN = k") in the worst case (assuming each SSN is equally likely to be queried).
Step-by-Step Solution
Part a: Assign Employees to Buckets
We calculate the bucket ID for each SSN using h(SSN) = SSN mod 3:
- 1000 mod 3 = 1 (since 3*333 = 999, 1000-999=1) → Bucket 1
- 4540 mod 3 = 1 (3*1513=4539, 4540-4539=1) → Bucket 1
- 4541 mod 3 = 2 (3*1513=4539, 4541-4539=2) → Bucket 2
- 4323 mod 3 = 0 (3*1441=4323, remainder 0) → Bucket 0
- 1321 mod 3 = 1 (3*440=1320, 1321-1320=1) → Bucket 1
- 1330 mod 3 = 1 (3*443=1329, 1330-1329=1) → Bucket 1
So the bucket assignments are:
- Bucket 0: 1 record (SSN 4323) → 1 block (main bucket, no overflow)
- Bucket 1: 4 records (SSNs 1000, 4540, 1321, 1330) → 1 main block (holds 2 records) + 1 overflow block (holds 2 more records) → 2 blocks total
- Bucket 2: 1 record (SSN 4541) → 1 block (main bucket, no overflow)
Part b: Calculate Expected Block Accesses
First, recall that for a hash file:
- To access a record in a bucket with no overflow: 2 block accesses (load hash map + load main block).
- To access a record in a bucket with overflow: 2 + n block accesses (load hash map + load main block + load n overflow blocks).
In this case:
- Bucket 0 has 1 record (1 SSN) → 2 block accesses.
- Bucket 1 has 4 records (4 SSNs) → 2 + 1 = 3 block accesses (1 overflow block).
- Bucket 2 has 1 record (1 SSN) → 2 block accesses.
Since each SSN is equally likely (probability = 1/6 per SSN):
Expected accesses = (Number of SSNs in Bucket 0 * Accesses for Bucket 0 + Number of SSNs in Bucket 1 * Accesses for Bucket 1 + Number of SSNs in Bucket 2 * Accesses for Bucket 2) / Total SSNs
Plug in the numbers:
Expected accesses = (1 × 2 + 4 × 3 + 1 × 2) / 6
= (2 + 12 + 2) / 6
= 16 / 6 ≈ 2.67
Let's double-check per bucket:
Bucket 0: 1 SSN → 1 × 2 = 2
Bucket 1: 4 SSNs → 4 × 3 = 12
Bucket 2: 1 SSN → 1 × 2 = 2
Total: 2 + 12 + 2 = 16
16 / 6 ≈ 2.67
So the expected number of block accesses is ~2.67.
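If you want to sanity-check that arithmetic, here's a short script that reproduces it. It uses the same counting convention as above (1 access for the hash map, 1 for the main block, plus 1 per overflow block) and charges every query in a bucket that bucket's worst case:

```python
import math

ssns = [1000, 4540, 4541, 4323, 1321, 1330]
M, BFR = 3, 2  # 3 buckets, 2 records per block

# Group SSNs by bucket, then count blocks per bucket (main + overflows).
buckets = {b: [s for s in ssns if s % M == b] for b in range(M)}
blocks_per_bucket = {b: max(1, math.ceil(len(v) / BFR)) for b, v in buckets.items()}

# Worst case per query: 1 (hash map) + all blocks in the record's bucket.
total = sum(len(buckets[b]) * (1 + blocks_per_bucket[b]) for b in range(M))
print(buckets)            # {0: [4323], 1: [1000, 4540, 1321, 1330], 2: [4541]}
print(total / len(ssns))  # 2.666...
```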
Wrapping Up
Storage is the unsung hero of data systems - get it right, and your queries fly; get it wrong, and your system crawls. We've covered the basics: storage hierarchies, disk anatomy, record/block structures, and the three key file structures (heap, sequential, hash). Each structure has its sweet spot: heap for simple inserts, sequential for sorted/range queries, and hash for fast equality queries.
If you have questions (or want to debate the pros and cons of SSDs vs. HDDs), drop a comment below! Happy coding, and see you next time.