HDF (Hierarchical Data Format) stores multiple large datasets inside a single file with a hierarchical structure similar to your computer’s file system (C:\Users\Profile\Desktop). When you work with interconnected datasets that exceed memory limits, HDF5 offers a practical alternative to CSVs.
I’ve found HDF5 particularly useful when dealing with scientific data, time-series measurements, or any scenario where you need to store related datasets together. The format saves files with .h5 or .hdf5 extensions and works across Python, R, C++, and other languages.
While formats like Parquet, Feather, and Pickle exist for specific use cases, HDF5 excels at storing hierarchical data structures.
What Is Hierarchical Data Storage?
HDF5 organizes data into groups, metadata, and datasets within a single parent file, letting you store images, text, numbers, and structured data together.
Think of an HDF5 file as a miniature file system. You create groups (like folders), add metadata (like file properties), and store datasets (like individual files). Python libraries like PyTables and h5py make reading and writing these files straightforward.
The format handles petabyte-scale datasets efficiently. I’ve used it for projects where CSVs would create hundreds of separate files, and HDF5 consolidated everything into one manageable file.
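To make the file-system analogy concrete, here is a minimal sketch using h5py (the group, dataset, and attribute names are placeholders I chose for illustration):

import h5py
import numpy as np

with h5py.File('experiment.h5', 'w') as f:
    # A group behaves like a folder
    grp = f.create_group('measurements')
    # A dataset behaves like a file inside that folder
    grp.create_dataset('temperature', data=np.array([21.5, 22.1, 21.9]))
    # Attributes behave like file properties (metadata)
    grp.attrs['units'] = 'celsius'
    grp.attrs['sensor'] = 'probe-A'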
How Does HDFStore.append Work?
The HDFStore.append() method adds a Series or DataFrame to an existing HDF file without overwriting existing data.
Here’s the complete syntax:
HDFStore.append(key, value, format=None, axes=None, index=True,
                append=True, complib=None, complevel=None,
                columns=None, min_itemsize=None, nan_rep=None,
                chunksize=None, expectedrows=None, dropna=None,
                data_columns=None, encoding=None, errors='strict')
The parameters you’ll use most often:
- key: The identifier for your data in the HDF file
- value: The Series or DataFrame you’re appending
- format: Storage format (default is ‘table’)
- index: Whether to include the DataFrame index as a column
- append: Set to True to add rows; False replaces any data already stored under the key
- nan_rep: String to replace NaN values
I typically stick with the defaults for most projects. The ‘table’ format allows querying later, which proves useful for large datasets.
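That querying capability is worth a quick illustration. The sketch below assumes you have a table-format store like the store1.h5 file created in the next section, written with data_columns=True so column 'A' can appear in a where clause:

import pandas as pd

with pd.HDFStore('store1.h5', mode='r') as store:
    # Only rows where A > 12 are read from disk, not the whole dataset
    subset = store.select('data', where='A > 12')

print(subset)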
How to Append a Series to an HDF File?
Open the HDF file in append mode, concatenate your new series with existing data, then write it back using store.put().
Let me show you the complete workflow. First, create an HDF file:
import pandas as pd

df = pd.DataFrame([[11, 12], [13, 14]], columns=['A', 'B'])

with pd.HDFStore('store1.h5', 'w') as store:
    store.put('data', df, format='table')

read = pd.read_hdf('store1.h5')
print(read)
Output:

    A   B
0  11  12
1  13  14
Now append a series to the existing file:
with pd.HDFStore('store1.h5', mode='a') as store:
    existing_data = store['data']
    new_data = pd.DataFrame({'A': [10], 'B': [11]})
    updated_data = pd.concat([existing_data, new_data], ignore_index=True)
    store.put('data', updated_data, format='table', data_columns=True)

with pd.HDFStore('store1.h5', mode='r') as store:
    updated_data = store['data']

print(updated_data)
I use pd.concat() when the new data might have different column structures. The ignore_index=True parameter prevents index conflicts.
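If your new data starts life as an actual pandas Series rather than a one-row DataFrame, convert it first so the columns line up. A small sketch of that conversion:

import pandas as pd

existing_data = pd.read_hdf('store1.h5', 'data')
new_row = pd.Series({'A': 10, 'B': 11})
# to_frame().T turns the Series into a one-row DataFrame with matching columns
updated_data = pd.concat([existing_data, new_row.to_frame().T], ignore_index=True)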
How to Append a Complete DataFrame to an HDF File
Use store.append() directly when your DataFrame has the same column structure as the existing data.
with pd.HDFStore('store1.h5', mode='a') as store:
    df = pd.DataFrame([[5, 6], [7, 8]], columns=['A', 'B'])
    store.append('data', df)

read = pd.read_hdf('store1.h5')
print(read)
This approach is more direct than the Series workflow because the columns already match, so no concatenation step is needed. The updated file is then read back with read_hdf().

Be careful when appending, though: the method happily writes the same rows twice, so it is up to you to keep duplicates out of the file.
What Are the Limitations of HDFStore.append?
HDFStore.append() does not check for duplicate data. You can append the same DataFrame multiple times, creating duplicate rows.
Watch what happens when I append identical data:
with pd.HDFStore('store1.h5', mode='a') as store:
    df_new = pd.DataFrame([[5, 6], [7, 8]], columns=['A', 'B'])
    store.append('data', df_new)

read = pd.read_hdf('store1.h5')
print(read)

You need to implement your own duplicate checking logic before appending. Here’s how I handle it:
with pd.HDFStore('store1.h5', mode='a') as store:
    existing_data = store['data']
    new_data = pd.DataFrame([[5, 6], [7, 8]], columns=['A', 'B'])
    # Check for duplicates
    combined = pd.concat([existing_data, new_data])
    deduplicated = combined.drop_duplicates()
    # Overwrite with deduplicated data
    store.put('data', deduplicated, format='table')
When to Use HDF5 vs. Alternative Formats
Choose HDF5 when you need hierarchical data organization, partial data loading, or cross-language compatibility. Consider Parquet for columnar analytics or Pickle for pure Python projects.
Use HDF5 when you need:
- Multiple related datasets in one file
- Ability to read subsets without loading everything
- Cross-language support (Python, R, C++, Julia)
- Built-in metadata and documentation
Use Parquet when you need:
- Column-oriented analytics
- Better compression ratios
- Compatibility with big data tools (Spark, Hadoop)
- Immutable storage (no updates needed)
Use Pickle when you need:
- Fastest Python-specific serialization
- Simple implementation
- No cross-language requirements
I keep HDF5 for experimental data where I store raw measurements, processed results, and metadata together. For production analytics pipelines, I prefer Parquet because of its compression and query performance.
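For comparison, the Parquet round trip in pandas is a one-liner in each direction (this sketch assumes pyarrow or fastparquet is installed):

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
# Columnar, compressed storage; well suited to write-once analytics data
df.to_parquet('data.parquet')
restored = pd.read_parquet('data.parquet')
print(restored)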
Summary
Appending DataFrames to HDF files gives you efficient data management for large, interconnected datasets. The store.append() method works well when you understand its limitations, particularly around duplicate checking.
I recommend creating your own wrapper functions that handle duplicate detection automatically. The manual duplicate checking adds minimal overhead compared to the benefits of clean data.
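Here is roughly what such a wrapper could look like; append_unique is a name I made up for illustration, and because it reloads the whole key before writing, it suits moderately sized stores rather than huge ones:

import pandas as pd

def append_unique(path, key, new_df):
    """Append new_df under key, dropping rows that already exist in the store."""
    with pd.HDFStore(path, mode='a') as store:
        if key in store:
            combined = pd.concat([store[key], new_df], ignore_index=True)
            new_df = combined.drop_duplicates()
        # Rewrite the key with the deduplicated (or initial) data
        store.put(key, new_df, format='table', data_columns=True)

append_unique('store1.h5', 'data', pd.DataFrame([[5, 6]], columns=['A', 'B']))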
HDF5 remains a solid choice for scientific computing and time-series data in 2025, though you should evaluate Parquet for analytics-focused workloads. The format’s hierarchical structure and partial I/O capabilities make it irreplaceable for certain use cases.

