Exploring DBMS Storage Structures: Heap- vs Hash- vs Index-Organised Tables

·

9 min read

Introduction

Database Management Systems (DBMS) use different storage structures to organise and access data. Three of the most popular storage structures are Heap-Organised Tables, Hash-Organised Tables, and Index-Organised Tables. In this post, we’ll see their benefits, use cases and drawbacks.

Heap-Organised Tables aka Heap Files

When I think of Heap-organised tables, I imagine a heap of clothes before doing laundry. Heap-organised tables are typically associated with row-oriented DBMS. In heap-organised tables, records are not arranged in any specific order. Instead, data is inserted into these files sequentially, following the order in which it was received or written (write-order). Hence the nickname “unordered tables” is appropriate for heap-organised tables.

Heap Organised Tables

The lack of specific ordering in heap files allows for fast data insertion as the DBMS does not need to reorganise or maintain any inherent order. However, the downside lies in data retrieval. As data isn't sorted, any search operation would require a full table scan, which can be time-consuming for large tables.

To make heap files searchable, additional index structures can be created on specific columns of the table. These indexes contain pointers to the locations within the heap file where data records with matching values are stored.

While these indexes improve the search performance of heap files, they come with additional overhead in terms of storage space and maintenance. This is because indexes need to be updated whenever data records are inserted, updated or deleted from the table which can impact the overall performance of the DBMS.

Heap files are great if your operations are heavily focused on data insertion aka write-heavy workloads, and the table size remains relatively small.

Several DBMS, including PostgreSQL and Microsoft SQL Server, default to using heap-organised tables when a table is created without specifying a specific indexing method. In these systems, if no clustered index or primary key is defined during table creation, the table is typically created as a heap-organised table, storing data in the order of insertion.

Hash-Organised Tables aka Hashed Files

Hash-organised tables use a hash function to determine the storage location of data based on a hash of one or more key columns. Each hash value corresponds to a bucket or slot in the table where data records with that specific hash value are stored. Accessing data in a hashed file typically involves computing the hash value of the search key and then looking up the corresponding bucket to retrieve the data. This approach aims to distribute data evenly across the underlying storage structure, providing fast retrieval for exact match queries aka point lookup queries.

Hash organised tables

While hash-organised tables excel at point lookups, they may not perform as well for range scans or queries involving consecutive key values.

In practice, hash functions may lead to collisions, where multiple keys map to the same hash value. Collision resolution techniques, such as chaining or open addressing, address this issue. Chaining involves storing collided records in a linked list or similar structure within the bucket, while open addressing techniques search for an empty slot within the bucket for the collided record.

Data records within buckets can be stored in either append order or sorted by key to optimise lookup speed.

With the append-order approach, each new record is appended to the end of the bucket when it arrives. This technique is simple, straightforward and efficient for insertion operations. However, lookup operations in append-ordered buckets may require scanning through all the records in the bucket. As you can guess, this linear search is less efficient with a bucket storing a large number of records.

In contrast, the sorted-by-key approach allows for faster lookups using algorithms like binary search. However, maintaining a sorted order within the bucket may introduce additional overhead during insertion and deletion operations as records need to be inserted at the correct position to preserve sorted order.

If a bucket becomes full due to insertions, overflow handling mechanisms are necessary to accommodate additional records. These mechanisms may involve splitting the bucket into multiple buckets or reallocating records to other buckets through a rehashing process.

Some database management systems (DBMS) such as Oracle Berkeley DB and certain configurations of Cassandra, Amazon DynamoDB, and Redis utilise hash-organised tables or hash-based storage mechanisms to optimise for specific workloads. However, these systems often use a combination of data structures tailored to their performance requirements.

Index-Organised Tables aka IOTs

Index-organised tables store the index and data records in a single structure, typically a B-tree index. The primary key of the table is used as the index key, and the data rows are stored in the leaf nodes of the index.

Unlike heap files which can have a separate index structure, IOTs store the records directly within the index structure. This means that the leaf nodes contain the actual data record rather than pointers. This organisation reduces the number of disk seeks and allows for efficient data retrieval and storage.

Accessing data in an index-organised table involves traversing the B-tree index to locate the desired rows.

IOTs are better suited for OLAP (Online Analytical Processing) workloads where queries often involve range scans, ordered retrieval, and aggregation. They are great if your operations involve frequent data retrieval or the table size is large.

While IOTs get a lot of praise, there are some notable drawbacks such as:

  • Index Maintenance Overhead: Because the data and index are stored together in the same structure, any modification to the data (inserts, updates, deletes) may require corresponding updates to the index. This can lead to increased index maintenance overhead, especially for tables with frequent data modifications.

  • Storage Requirements: They require more storage space compared to heap-organised tables, especially if the table has a wide primary key or if there are many secondary indexes. Storing data within the index structure can lead to increased storage overhead, particularly for large tables.

  • Performance for OLTP Workloads: While IOTs are well-suited for OLAP (Online Analytical Processing) workloads with range scans and ordered retrieval, they may not perform as well for OLTP (Online Transaction Processing) workloads with frequent data modifications and single-row lookups. In some cases, heap-organised tables may offer better performance for OLTP scenarios.

  • Complexity: Being more complex to maintain, backup, and recovery operations may require more specialised knowledge and tools.

DBMS whose default table organisation is index-organised tables are relatively rare as most systems default to heap-organised tables or other structures unless explicitly configured otherwise.

Nonetheless, Oracle Database is one of the few DBMS that provides explicit support for IOTs. Oracle is well-known for its robust implementation and support for IOTs.

Clustered Index vs Index-Organised Table (IOT)

Clustered Indexes are often confused with Index-Organised Tables (IOTs). While both clustered indexes and Index-Organised Tables involve organising data based on an index, they are distinct concepts.

Clustered Index:

  • Clustered Indexes use a B-tree structure, where the data rows are stored in the same physical order as the index keys. In a table with a clustered index, the index not only stores pointers to the data rows but also determines the physical layout of the data rows themselves. The data is part of the index structure, there is no separate index structure pointing to the data.

  • Only one clustered index can exist per table because the data rows can be physically ordered in only one way. Typically, a clustered index is created on a primary key column by default. Additional or secondary indexes must be non-clustered, meaning they maintain separate index structures that point to the data rows indirectly.

  • Many DBMS systems, such as MySQL and its fork MariaDB (using InnoDB storage engine) and SQL Server, use clustered indexes to organise table data for efficient retrieval. In these systems, the primary key often serves as the clustered index by default, providing faster access to data based on the primary key.

Index-Organised Tables (IOTs):

  • Index-Organised Tables also use a B-tree structure, integrating both the index and data into a single structure. The data rows are stored directly within the B-tree index, with the leaf nodes containing the actual data records. There is no separate table storage outside the index.

  • An IOT inherently organises the entire table around the primary key index. The whole table’s data is stored within this integrated index structure, allowing for highly efficient access patterns, especially for range scans and ordered data access. However, this design requires careful consideration of the primary key's design, as it significantly impacts the physical storage of the entire table.

  • IOTs are specific to databases like Oracle, where they are designed for efficient range scans and ordered data access. The primary key in an IOT dictates the storage order of the data, which can significantly improve query performance for operations involving the primary key or a range of keys.

Conclusion

In this article, we learnt about three popular storage structures used in database systems to organise and access data: Heap-Organised Tables, Hash-Organised Tables, and Index-Organised Tables. Each of these storage structures has its strengths and weaknesses. Below is a summary of the pros and cons associated with each one:

Storage StructureProsCons
Heap-Organised TablesThey offer fast data insertion due to the lack of specific ordering. This structure can be made searchable with the addition of indexes, increasing its flexibility and adaptability.The major drawback is that data retrieval could be slow since it requires a full table scan. Additionally, the use of indexes introduces overhead in terms of storage space and maintenance.
Hash-Organised TablesThese structures provide fast retrieval for exact match queries, which can be a significant advantage in many use cases. The data is evenly distributed across the storage structure, which can help improve performance.However, they can perform poorly for range scans or queries involving consecutive key values. The use of hash functions may also lead to collisions, adding complexity to the data management process.
Index-Organised TablesThese are known for their efficient data retrieval and storage, as well as their ability to reduce the number of disk seeks. This can significantly improve performance, especially in large databases.On the downside, they have an increased index maintenance overhead. They also require more storage space, and may not perform well for workloads with frequent data modifications and single-row lookups.

Understanding the characteristics of these storage structures can help in making informed decisions when designing and managing databases. The goal is to leverage the strengths of each storage structure to meet your specific database requirements, while also being aware of their limitations. By doing so, you can design and manage your databases more effectively, ensuring optimal performance and efficiency.

Glossary

  • Point queries retrieve individual records from a DB based on certain criteria. They are simple and efficient since they access a specific record without scanning large portions of the DB. For example, retrieving a user’s information based on their unique user ID.
SELECT * FROM users WHERE user_id = 100;
  • Range scans on the other hand involve querying a range of records within a specified range of values. Range scans return multiple records that satisfy the conditions. For example, retrieving data based on date ranges.
SELECT * FROM sales_transactions
WHERE transaction_date BETWEEN '2022-01-01' AND '2022-03-31';

References and Additional Reading

  1. Database Internals Book

  2. Index-Organised Table VS. Clustered index

  3. Heap vs Index organized table

  4. Javatpoint.com

  5. GeeksforGeeks.com