Pandas makes moving data around straightforward, and ORC has become a reliable choice for analytical workloads. It sits in the same space as Parquet, with strong compression and solid type preservation. Writing a Pandas DataFrame to ORC takes about three lines of code, and the round trip is fast enough that it rarely becomes a bottleneck.

This article covers exactly how to write a DataFrame to ORC, practical examples that go beyond the basics, and the things that trip people up the first few times. By the end, the differences between ORC, Parquet, and CSV should be clear, along with how to get data into ORC without friction.

TLDR

  • Install PyArrow first with pip install pyarrow or conda install -c conda-forge pyarrow
  • Write with df.to_orc("output.orc") and read back with df = pd.read_orc("output.orc")
  • Set index=True to preserve the DataFrame index in the ORC file
  • Pass path=None to get bytes back instead of writing to disk
  • Convert categorical columns to str before writing to avoid NotImplementedError
  • ORC preserves data types across write-read cycles, unlike CSV

What Is ORC Format?

ORC stands for Optimized Row Columnar. It is a columnar storage format that was originally developed for Apache Hive, and it has become a general-purpose option for analytical data in the Python ecosystem. Spark pipelines use it heavily, but Pandas supports it natively through PyArrow without needing Spark.

Columnar storage means faster reads when only a subset of columns is needed. Built-in compression reduces storage footprint compared to row-based formats like CSV. ORC preserves data types across the write-read cycle, so an integer column stays an integer column, not a string that needs casting downstream.

ORC sits alongside Parquet and Feather as one of the columnar formats Pandas supports natively. The main difference from Parquet is that ORC has better support for Hive-style partitioning and is often the default choice in the Hadoop ecosystem. For Pandas users outside of Spark or Hive, Parquet is more common, but ORC is worth knowing. It becomes the right call when the data will eventually feed into a Hive or Spark job.

Prerequisites: Installing PyArrow

Pandas uses the PyArrow library as the engine for ORC reads and writes. Starting with Pandas 1.5, to_orc() is available out of the box, but it requires PyArrow to be installed. If a NotImplementedError appears or the method does not exist, PyArrow is usually the missing piece.

Install it with pip install pyarrow. On M-series Macs or custom Python builds, conda tends to handle the binary dependencies better:


conda install -c conda-forge pyarrow

Once installed, verify it works by running import pyarrow in a Python shell. No errors means the library is available.

Writing a DataFrame to ORC

Here is the simplest possible example. Create a DataFrame from a dictionary and write it to an ORC file:


import pandas as pd

data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "score": [88.5, 92.1, 78.3]
}
df = pd.DataFrame(data)

df.to_orc("output.orc")
print("File written successfully")

Pandas uses PyArrow under the hood by default, so no engine parameter is needed in most cases. The file output.orc gets created in the working directory.

Reading the file back is equally straightforward:


df_read = pd.read_orc("output.orc")
print(df_read)

The to_orc Method Syntax

Here is the full signature of the method:


DataFrame.to_orc(path=None, *, engine="pyarrow", index=None, engine_kwargs=None)

The most useful parameters:

  • path – Where to save the file. Can be a string path, a file-like object, or None (returns bytes in memory).
  • engine – Must be “pyarrow”. The only supported engine as of Pandas 2.x.
  • index – If True, writes the DataFrame index as a column in the ORC file. Defaults to None, which excludes the index.

Including the Index in ORC Output

The index is not written to the ORC file by default. To preserve it, set index=True:


df.to_orc("output_indexed.orc", index=True)
df_indexed = pd.read_orc("output_indexed.orc")
print(df_indexed)

A meaningful index like a DatetimeIndex from a time series needs to be preserved across the write-read cycle, and setting index=True handles that.

Writing to Bytes Instead of a File

Pass path=None and the method returns bytes instead of writing to disk. The bytes approach proves useful when sending data over a network or storing it in a database blob:


orc_bytes = df.to_orc(path=None)
print(type(orc_bytes))
print(f"Size in bytes: {len(orc_bytes)}")

Checking Data Type Preservation

ORC versus CSV type preservation is a common question. Here is exactly what that means in practice:


import pandas as pd
import pyarrow.orc as orc

data = {
    "ints": [1, 2, 3, 4, 5],
    "floats": [1.1, 2.2, 3.3, 4.4, 5.5],
    "strings": ["a", "b", "c", "d", "e"]
}
df = pd.DataFrame(data)

df.to_orc("typed.orc", index=True)

# Read back and check types
df_back = pd.read_orc("typed.orc")
print("Dtypes after round-trip:")
print(df_back.dtypes)

# Inspect the ORC schema directly
with open("typed.orc", "rb") as f:
    reader = orc.ORCFile(f)
    schema = reader.schema
    print("ORC Schema:")
    for field, dtype in zip(schema.names, schema.types):
        print(f"  {field}: {dtype}")

Integer and float types are preserved exactly: int64 stays int64, float64 stays float64. With CSV, types must be re-inferred from text on every read, so datetimes come back as plain strings and anything ambiguous needs manual casting.

Handling Unsupported Column Types

ORC via PyArrow does not handle category, unsigned integer, interval, period, or sparse dtypes. When a NotImplementedError appears, convert those columns to supported types before writing.

Handling categorical data:


# Convert categorical column to string before writing
df["category"] = df["category"].astype(str)
df.to_orc("fixed.orc")

Handling unsigned integer types:


# Convert unsigned columns (uint64, nullable UInt64, etc.) to signed
for col in df.columns:
    if str(df[col].dtype).lower().startswith("uint"):
        df[col] = df[col].astype("int64")
df.to_orc("fixed.orc")

Common Errors and How to Handle Them

NotImplementedError appears when a DataFrame contains unsupported column types. The section above covers the most common cases.

ValueError about the engine means an engine other than “pyarrow” was specified. The engine parameter must be exactly engine="pyarrow" or omitted entirely.

AttributeError: 'DataFrame' object has no attribute 'to_orc' means Pandas is older than 1.5. The method was added in Pandas 1.5.0. Run pip install --upgrade pandas to get there.

ORC vs Parquet vs CSV

CSV is the universal fallback. It works everywhere, but it has no compression by default and no type preservation. Every read re-infers types from text, and anything the parser cannot guess, like dates, must be cast manually.

Parquet is the most widely adopted columnar format outside of the Hadoop ecosystem. It has excellent compression, broad tool support, and works well with Pandas, Spark, DuckDB, and most other data tools. If there is uncertainty about which to pick and the environment is not Hadoop-based, Parquet is the safe starting point.

ORC is the right choice when working within the Hadoop or Hive ecosystem, or when Hive-style partitioning capabilities are specifically needed. ORC also handles complex nested types slightly better than Parquet in some cases. For typical Pandas workflows, the performance difference between ORC and Parquet is negligible.

FAQ

Q: Does ORC support compression?

Yes. ORC has built-in compression using Zlib or Snappy by default. PyArrow handles this automatically when writing. The exact compression used depends on the PyArrow version and settings, but for most use cases no configuration is needed.

Q: How does ORC compare to Parquet for Pandas?

Both are columnar formats with similar performance characteristics. Parquet tends to have broader support across languages and tools, while ORC has better compatibility with the Hadoop ecosystem. For pure Pandas workflows, Parquet is more commonly used, but ORC is equally valid and sometimes preferred when interoperability with Hive or Spark is a factor.

Q: Can I write to a specific location in an S3 bucket?

Yes. Pass an S3 path like s3://bucket/path/file.orc to to_orc() if s3fs is installed. PyArrow handles S3 writes natively in recent versions.

Q: What is the maximum DataFrame size for ORC?

There is no hard limit. It depends on available disk space and memory during the write operation. ORC files are splittable, so even very large files can be read in chunks without loading the entire file into memory.

Q: Does ORC support all Pandas data types?

No. Unsigned integer types (uint8, UInt16, etc.), Categorical, Interval, Period, and sparse dtypes are not supported directly. These must be converted to supported types like int64 or str before writing. This is one of the main rough edges when working with ORC and Pandas.

ORC is worth having in the toolkit. Once familiar with it, plenty of situations arise where it fits better than CSV or even Parquet. Type preservation alone makes it worth reaching for whenever data quality matters downstream, and it is the natural choice for any pipeline that will eventually touch Spark or Hive.
