HDF (Hierarchical Data Format) stores multiple large datasets inside a single file with a hierarchical structure similar to your computer’s file system (C:\Users\Profile\Desktop). When you work with interconnected datasets that exceed memory limits, HDF5 offers a practical alternative to CSVs.
I’ve found HDF5 particularly useful when dealing with scientific data, time-series measurements, or any scenario where you need to store related datasets together. The format saves files with .h5 or .hdf5 extensions and works across Python, R, C++, and other languages.
While formats like Parquet, Feather, and Pickle exist for specific use cases, HDF5 excels at storing hierarchical data structures.
What Is Hierarchical Data Storage?
HDF5 organizes data into groups, metadata, and datasets within a single parent file, letting you store images, text, numbers, and structured data together.
Think of an HDF5 file as a miniature file system. You create groups (like folders), add metadata (like file properties), and store datasets (like individual files). Python libraries like PyTables and h5py make reading and writing these files straightforward.
The format handles petabyte-scale datasets efficiently. I’ve used it for projects where CSVs would create hundreds of separate files, and HDF5 consolidated everything into one manageable file.
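To make the file-system analogy concrete, here is a minimal sketch using h5py (the group, dataset, and attribute names are placeholders I chose for illustration):

import h5py
import numpy as np

with h5py.File('experiment.h5', 'w') as f:
    # A group behaves like a folder
    grp = f.create_group('measurements')
    # A dataset behaves like a file inside that folder
    grp.create_dataset('temperature', data=np.array([21.5, 22.1, 21.9]))
    # Attributes behave like file properties (metadata)
    grp.attrs['units'] = 'celsius'
    grp.attrs['sensor'] = 'probe-A'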
How Does HDFStore.append Work?
The HDFStore.append() method adds a Series or DataFrame to an existing HDF file without overwriting existing data.
Here’s the complete syntax:
HDFStore.append(key, value, format=None, axes=None, index=True,
                append=True, complib=None, complevel=None,
                columns=None, min_itemsize=None, nan_rep=None,
                chunksize=None, expectedrows=None, dropna=None,
                data_columns=None, encoding=None, errors='strict')
The parameters you’ll use most often:
- key: The identifier for your data in the HDF file
- value: The Series or DataFrame you’re appending
- format: Storage format (default is ‘table’)
- index: Whether to include the DataFrame index as a column
- append: Set to True to add rows; False replaces any data already stored under the key
- nan_rep: String to replace NaN values
I typically stick with the defaults for most projects. The ‘table’ format allows querying later, which proves useful for large datasets.
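That querying capability is worth a quick illustration. The sketch below assumes you have a table-format store like the store1.h5 file created in the next section, written with data_columns=True so column 'A' can appear in a where clause:

import pandas as pd

with pd.HDFStore('store1.h5', mode='r') as store:
    # Only rows where A > 12 are read from disk, not the whole dataset
    subset = store.select('data', where='A > 12')

print(subset)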
How to Append a Series to an HDF File?
Open the HDF file in append mode, concatenate your new series with existing data, then write it back using store.put().
Let me show you the complete workflow. First, create an HDF file:
import pandas as pd

df = pd.DataFrame([[11, 12], [13, 14]], columns=['A', 'B'])

with pd.HDFStore('store1.h5', 'w') as store:
    store.put('data', df, format='table')

read = pd.read_hdf('store1.h5')
print(read)
Output:

    A   B
0  11  12
1  13  14
Now append a series to the existing file:
with pd.HDFStore('store1.h5', mode='a') as store:
    existing_data = store['data']
    new_data = pd.DataFrame({'A': [10], 'B': [11]})
    updated_data = pd.concat([existing_data, new_data], ignore_index=True)
    store.put('data', updated_data, format='table', data_columns=True)

with pd.HDFStore('store1.h5', mode='r') as store:
    updated_data = store['data']

print(updated_data)
I use pd.concat() when the new data might have different column structures. The ignore_index=True parameter prevents index conflicts.
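If your new data starts life as an actual pandas Series rather than a one-row DataFrame, convert it first so the columns line up. A small sketch of that conversion:

import pandas as pd

existing_data = pd.read_hdf('store1.h5', 'data')
new_row = pd.Series({'A': 10, 'B': 11})
# to_frame().T turns the Series into a one-row DataFrame with matching columns
updated_data = pd.concat([existing_data, new_row.to_frame().T], ignore_index=True)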
How to Append a Complete DataFrame to an HDF File
Use store.append() directly when your DataFrame has the same column structure as the existing data.
with pd.HDFStore('store1.h5', mode='a') as store:
    df = pd.DataFrame([[5, 6], [7, 8]], columns=['A', 'B'])
    store.append('data', df)

read = pd.read_hdf('store1.h5')
print(read)
This approach is more direct than the Series workflow because the columns already match, so no concatenation step is needed. The updated file is then read back with read_hdf().

Be careful when appending, though: the method happily writes the same rows twice, so it is up to you to keep duplicates out of the file.
What Are the Limitations of HDFStore.append?
HDFStore.append() does not check for duplicate data. You can append the same DataFrame multiple times, creating duplicate rows.
Watch what happens when I append identical data:
with pd.HDFStore('store1.h5', mode='a') as store:
    df_new = pd.DataFrame([[5, 6], [7, 8]], columns=['A', 'B'])
    store.append('data', df_new)

read = pd.read_hdf('store1.h5')
print(read)

You need to implement your own duplicate checking logic before appending. Here’s how I handle it:
with pd.HDFStore('store1.h5', mode='a') as store:
    existing_data = store['data']
    new_data = pd.DataFrame([[5, 6], [7, 8]], columns=['A', 'B'])
    # Check for duplicates
    combined = pd.concat([existing_data, new_data])
    deduplicated = combined.drop_duplicates()
    # Overwrite with deduplicated data
    store.put('data', deduplicated, format='table')
When to Use HDF5 vs. Alternative Formats
Choose HDF5 when you need hierarchical data organization, partial data loading, or cross-language compatibility. Consider Parquet for columnar analytics or Pickle for pure Python projects.
Use HDF5 when you need:
- Multiple related datasets in one file
- Ability to read subsets without loading everything
- Cross-language support (Python, R, C++, Julia)
- Built-in metadata and documentation
Use Parquet when you need:
- Column-oriented analytics
- Better compression ratios
- Compatibility with big data tools (Spark, Hadoop)
- Immutable storage (no updates needed)
Use Pickle when you need:
- Fastest Python-specific serialization
- Simple implementation
- No cross-language requirements
I keep HDF5 for experimental data where I store raw measurements, processed results, and metadata together. For production analytics pipelines, I prefer Parquet because of its compression and query performance.
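For comparison, the Parquet round trip in pandas is a one-liner in each direction (this sketch assumes pyarrow or fastparquet is installed):

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
# Columnar, compressed storage; well suited to write-once analytics data
df.to_parquet('data.parquet')
restored = pd.read_parquet('data.parquet')
print(restored)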
Summary
Appending DataFrames to HDF files gives you efficient data management for large, interconnected datasets. The store.append() method works well when you understand its limitations, particularly around duplicate checking.
I recommend creating your own wrapper functions that handle duplicate detection automatically. The manual duplicate checking adds minimal overhead compared to the benefits of clean data.
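Here is roughly what such a wrapper could look like; append_unique is a name I made up for illustration, and because it reloads the whole key before writing, it suits moderately sized stores rather than huge ones:

import pandas as pd

def append_unique(path, key, new_df):
    """Append new_df under key, dropping rows that already exist in the store."""
    with pd.HDFStore(path, mode='a') as store:
        if key in store:
            combined = pd.concat([store[key], new_df], ignore_index=True)
            new_df = combined.drop_duplicates()
        # Rewrite the key with the deduplicated (or initial) data
        store.put(key, new_df, format='table', data_columns=True)

append_unique('store1.h5', 'data', pd.DataFrame([[5, 6]], columns=['A', 'B']))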
HDF5 remains a solid choice for scientific computing and time-series data in 2025, though you should evaluate Parquet for analytics-focused workloads. The format’s hierarchical structure and partial I/O capabilities make it irreplaceable for certain use cases.

