Excel files are everywhere in the professional world. Your manager sends you quarterly sales data as an .xlsx file. Your finance team exports trial balances from accounting software. The dataset you need for a machine learning project lives in a spreadsheet someone emailed you last week. If you are working with data in Python, at some point you will need to read an Excel file, and the standard way to do that is with pandas.read_excel().

In this tutorial, I will walk you through every important argument of pandas.read_excel(). You will learn how to read entire sheets, specific columns, date columns, multiple sheets from a single file, and even password-protected workbooks. Each section has runnable code with real output so you can see exactly what to expect.

Prerequisites

You need two packages: pandas and openpyxl. Pandas handles the data manipulation. Openpyxl is the engine that actually reads the .xlsx file format. Install both with pip.

pip install pandas openpyxl

The openpyxl package is important. Without it, pandas will raise an error when you try to read .xlsx files. If you are reading older .xls files, you need the xlrd package instead.

pip install xlrd

A Minimal Example First

Let me show you the simplest possible call. Suppose you have a file called sales.xlsx in your current directory. Reading it takes one line.

import pandas as pd

df = pd.read_excel("sales.xlsx")
print(df.head())

That is it. Pandas opens the file, reads the first sheet, treats the first row as column headers, and returns a DataFrame. Now let us build from here.

Specifying the Sheet

By default, read_excel reads the first sheet. An Excel file can contain multiple sheets, and you can target the one you want in three ways.

By Sheet Name

df = pd.read_excel("sales.xlsx", sheet_name="Q4 Data")

By Sheet Index

Sheet indices start at zero. Reading the third sheet in the file looks like this.

df = pd.read_excel("sales.xlsx", sheet_name=2)

Reading All Sheets at Once

Pass sheet_name=None to get a dictionary where each key is a sheet name and each value is the DataFrame for that sheet.

all_sheets = pd.read_excel("sales.xlsx", sheet_name=None)

for sheet_name, sheet_df in all_sheets.items():
    print(f"Sheet: {sheet_name}, Rows: {len(sheet_df)}")

This approach is useful when you need to process every sheet in a workbook without manually listing them.
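You can also pass a list of sheet names or indices to load just a few sheets; the result is the same kind of dictionary, keyed by whatever you passed in. Here is a self-contained sketch that builds a small two-sheet workbook first (the sheet names and values are made up for the demo).

```python
import pandas as pd

# Build a tiny two-sheet workbook so the example is self-contained
with pd.ExcelWriter("sales.xlsx") as writer:
    pd.DataFrame({"Revenue": [100]}).to_excel(writer, sheet_name="Q1 Data", index=False)
    pd.DataFrame({"Revenue": [250]}).to_excel(writer, sheet_name="Q4 Data", index=False)

# A list of sheet names (or zero-based indices) returns a dict
# containing only those sheets
some_sheets = pd.read_excel("sales.xlsx", sheet_name=["Q1 Data", "Q4 Data"])
print(some_sheets["Q4 Data"])
```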

Selecting Columns

You rarely need every column from a spreadsheet. Loading only what you want saves memory and speeds up your code.

Load Only Specific Columns with usecols

The usecols parameter accepts column names, column indices, or a callable function.

# By column names
df = pd.read_excel("sales.xlsx", usecols=["Date", "Revenue", "Region"])

# By zero-based column indices (0, 2, 3 are Excel columns A, C, D)
df = pd.read_excel("sales.xlsx", usecols=[0, 2, 3])

# Using a string range like Excel uses
df = pd.read_excel("sales.xlsx", usecols="B:D")

The string range notation “B:D” is handy when you want columns B through D without counting indices.
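The callable form receives each column header and keeps the columns for which it returns True. A self-contained sketch, with made-up column names:

```python
import pandas as pd

# Sample file with three columns
pd.DataFrame({
    "Date": ["2024-01-01"],
    "Revenue": [100.0],
    "Region": ["West"],
}).to_excel("sales.xlsx", index=False)

# Keep only columns whose header starts with "Rev"
df = pd.read_excel("sales.xlsx", usecols=lambda name: name.startswith("Rev"))
print(list(df.columns))  # ['Revenue']
```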

Skipping Rows and Loading a Header

Sometimes the actual column headers in your Excel file are not on row 1. The data might start on row 3 or 4, with metadata or blank rows above it.

# Skip the first 2 rows, use row 3 as column headers
df = pd.read_excel("sales.xlsx", header=2)

# Use a specific row as the header
df = pd.read_excel("sales.xlsx", header=0)  # same as default

# No header at all, generate column names manually
df = pd.read_excel("sales.xlsx", header=None)
df.columns = ["Date", "Product", "Revenue", "Region"]

Handling Dates

Date columns in Excel files are a common source of confusion. Excel stores dates as serial numbers. Pandas can convert those serial numbers to proper Python datetime objects automatically.

# Parse the 'Date' column as datetime
df = pd.read_excel("sales.xlsx", parse_dates=["Date"])

# parse_dates=True only tries to parse the index column,
# so list your date columns explicitly in most cases
df = pd.read_excel("sales.xlsx", index_col=0, parse_dates=True)

Check the result. You will see datetime64[ns] as the dtype instead of the generic object dtype.

print(df.dtypes)
# Date       datetime64[ns]
# Revenue           float64
# Region             object
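If a date column still arrives as raw serial numbers (plain floats), you can convert it by hand. Because of Excel's historical 1900 leap-year quirk, the correct origin for any modern date works out to 1899-12-30:

```python
import pandas as pd

# Excel serial day numbers, as they look when date parsing fails
serials = pd.Series([45292, 45293])

# unit="D" counts days from the origin date
dates = pd.to_datetime(serials, unit="D", origin="1899-12-30")
print(dates.iloc[0])  # 2024-01-01 00:00:00
```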

Sometimes dates are stored as text in a format pandas does not guess, such as day-first strings. In pandas 2.0 and later, pass a date_format string together with parse_dates; the older date_parser callable argument is deprecated.

# Day-first text dates like 31/12/2024 (pandas 2.0+)
df = pd.read_excel("sales.xlsx", parse_dates=["Date"], date_format="%d/%m/%Y")

Skipping Rows and Footer Rows

Some Excel exports have summary rows at the bottom, notes, or blank filler rows. You can skip rows from the top and the bottom independently.

# Skip first 5 rows, do not read the last 3 rows
df = pd.read_excel("sales.xlsx", skiprows=5, skipfooter=3)

Both parameters work together. You might skiprows=5 to jump past a title block, then skipfooter=2 to drop a total row that would otherwise corrupt your numeric columns.
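skiprows also accepts a list of specific zero-based row numbers, which is handy when filler rows sit at known positions. A self-contained sketch with made-up data:

```python
import pandas as pd

# File rows (zero-based): 0 = header, then 10, 20, 30, 40
pd.DataFrame({"a": [10, 20, 30, 40]}).to_excel("sales.xlsx", index=False)

# A list skips exactly those row numbers; row 0 (the header) stays put
df = pd.read_excel("sales.xlsx", skiprows=[1, 3])
print(list(df["a"]))  # [20, 40]
```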

Handling Missing Values and Data Types

Excel files often have empty cells. Pandas reads these as NaN by default, which is what you want. You can control which values are treated as missing.

# Treat 'N/A' and 'NA' as NaN
df = pd.read_excel("sales.xlsx", na_values=["N/A", "NA"])

# Keep 'N/A' as a string, only treat blanks as NaN
df = pd.read_excel("sales.xlsx", keep_default_na=False)

The dtype argument lets you specify data types for specific columns upfront rather than fixing them after loading.

df = pd.read_excel("sales.xlsx", dtype={
    "Revenue": float,
    "Quantity": int,
    "Region": str
})

Setting types at read time is faster than converting them afterward because pandas does not have to infer and then re-convert.

Working with Large Files

Reading a massive Excel file into memory all at once is not always practical. Here are two strategies for large files.

Chunked Reading

Pandas does not natively support chunked reading for Excel files the way it does for CSVs. However, you can simulate it by reading a specific range of rows using the nrows parameter and looping.

chunk_size = 10000
offset = 0

# Read the header row once so every chunk gets the same column names
columns = pd.read_excel("large_sales.xlsx", nrows=0).columns

while True:
    df_chunk = pd.read_excel(
        "large_sales.xlsx",
        skiprows=offset + 1,  # +1 skips past the header row itself
        nrows=chunk_size,
        header=None,
        names=columns,
    )
    if df_chunk.empty:
        break
    # Process this chunk
    print(f"Processing rows {offset} to {offset + len(df_chunk)}")
    offset += chunk_size

Reading the header once and passing header=None with explicit names keeps every chunk aligned to the same columns. Keep in mind that each iteration re-opens and re-parses the file from the top, so this simulation trades speed for memory.

Converting Excel to CSV First

If you have openpyxl installed, you can use it to convert an Excel file to CSV format first, then read the CSV with pandas chunking. This is often faster for files larger than 50 MB.

import csv

import openpyxl

def excel_to_csv(excel_path, csv_path, sheet_name=0):
    wb = openpyxl.load_workbook(excel_path, read_only=True, data_only=True)
    sheet = wb.active if sheet_name == 0 else wb[sheet_name]

    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)  # handles commas and quotes inside values
        for row in sheet.iter_rows(values_only=True):
            writer.writerow("" if cell is None else cell for cell in row)
    wb.close()
    print(f"Converted {excel_path} to {csv_path}")

excel_to_csv("large_sales.xlsx", "large_sales.csv")

Reading Password-Protected Files

Sometimes Excel files are password-protected. Neither pandas nor openpyxl can open an encrypted workbook directly: read_excel has no password argument. If you know the password, decrypt the file into an in-memory buffer first with the msoffcrypto-tool package (pip install msoffcrypto-tool), then hand that buffer to pandas.

import io

import msoffcrypto
import pandas as pd

decrypted = io.BytesIO()
with open("protected_sales.xlsx", "rb") as f:
    office_file = msoffcrypto.OfficeFile(f)
    office_file.load_key(password="MySecurePassword")
    office_file.decrypt(decrypted)

df = pd.read_excel(decrypted)

Handle the password carefully in production code. Do not hardcode it. Use environment variables or a secrets manager instead.

import os

password = os.environ.get("EXCEL_PASSWORD")
office_file.load_key(password=password)  # instead of a literal string

Reading Multiple Excel Files at Once

If you have a folder full of Excel files from different months or different regions, you can read and concatenate them in one pipeline.

import glob
import os

excel_files = glob.glob("data/*.xlsx")
dataframes = []

for file in excel_files:
    df = pd.read_excel(file, sheet_name=0)
    df["source_file"] = os.path.basename(file)
    dataframes.append(df)

combined_df = pd.concat(dataframes, ignore_index=True)
print(f"Combined {len(excel_files)} files: {len(combined_df)} total rows")

The source_file column lets you trace each row back to its origin file, which is useful when you need to debug data quality issues.

Handling Multiple Sheets from Multiple Files

When each Excel file has multiple sheets, you need to iterate through both the files and the sheets.

import glob

excel_files = glob.glob("reports/*.xlsx")
all_data = []

for file in excel_files:
    sheets = pd.read_excel(file, sheet_name=None)
    for sheet_name, df in sheets.items():
        df["file"] = os.path.basename(file)
        df["sheet"] = sheet_name
        all_data.append(df)

result = pd.concat(all_data, ignore_index=True)

This pattern handles monthly report files where each file contains sheets for different departments.

Reading Excel with an Index Column

You can designate a column from the Excel file as the DataFrame index at load time.

# Use the first column as the index
df = pd.read_excel("sales.xlsx", index_col=0)

# Use a specific column name as the index
df = pd.read_excel("sales.xlsx", index_col="RecordID")

Setting the index at load time is tidier than calling df.set_index() afterward: you get the final shape in one step instead of building an intermediate DataFrame.

Dealing with Merged Cells

Merged cells are common in Excel reports. When pandas reads merged cells, only the first cell contains the value. The rest show as NaN.

# Fill merged cell values downward
df = pd.read_excel("report.xlsx")
df = df.ffill()

The forward fill method propagates the merged cell value down to the cells below it, which is usually what you want for labelled or categorical data in merged cells.

Reading Only a Subset of Rows

You can limit how many rows are read using nrows.

# Read only the first 100 rows
df = pd.read_excel("sales.xlsx", nrows=100)

This is useful for quickly inspecting the structure of a large file without loading the whole thing into memory.

Real-World Example: Reading a Sales Report

Let me put everything together with a realistic example. You have a quarterly sales report with the following structure.

  • Sheet named “Q4 Sales”
  • Row 1 contains company branding, row 2 is blank, row 3 has the actual column headers
  • Columns are Date, Product, Category, Units Sold, Unit Price, Total Revenue
  • Some cells in the Category column are blank due to merged cells
  • Two summary rows at the bottom need to be skipped

import pandas as pd

# Read the sheet with correct header row
df = pd.read_excel(
    "q4_sales.xlsx",
    sheet_name="Q4 Sales",
    header=2,              # Row 3 has the column headers
    skipfooter=2,          # Skip the two summary rows at the bottom
    parse_dates=["Date"], # Parse the Date column as datetime
    usecols=["Date", "Product", "Category", "Units Sold", "Unit Price", "Total Revenue"],
    dtype={
        "Units Sold": int,
        "Unit Price": float,
        "Total Revenue": float
    },
    na_values=["", "N/A"]
)

# Fill merged cells in Category column
df["Category"] = df["Category"].ffill()

print(df.shape)
print(df.info())
print(df.describe())

The output shows you how many rows and columns were loaded, the data types, and summary statistics for the numeric columns.

Common Errors and How to Fix Them

Here are the most frequent problems you will encounter when reading Excel files with pandas.

Missing xlrd or openpyxl

# ImportError: Missing optional dependency 'openpyxl'
# Fix: install the required engine
pip install openpyxl

Sheet Does Not Exist

# Check available sheets first
xl = pd.ExcelFile("sales.xlsx")
print(xl.sheet_names)
# Then use the correct name
df = pd.read_excel("sales.xlsx", sheet_name="Actual Sheet Name")

Wrong Header Row

If your DataFrame has strange column names like “Unnamed: 0”, you are probably using the wrong header row. Adjust the header parameter.

# Try different values to find the right row
for i in range(5):
    df_test = pd.read_excel("sales.xlsx", header=i, nrows=3)
    print(f"header={i}: {list(df_test.columns)}")

Memory Error on Large Files

If you run out of memory, switch to reading specific columns only or convert to CSV first.
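A minimal sketch of that column-pruning approach; the column names and the float32 downcast are assumptions about your data. It builds a tiny demo file so it runs as-is:

```python
import pandas as pd

# Stand-in for a much larger workbook
pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-02"],
    "Revenue": [100.0, 250.0],
    "Region": ["West", "East"],
}).to_excel("sales.xlsx", index=False)

# Load two of the three columns and downcast the numeric one
df = pd.read_excel(
    "sales.xlsx",
    usecols=["Date", "Revenue"],
    dtype={"Revenue": "float32"},
)
print(df.dtypes)
```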

Performance Comparison: read_excel vs Alternatives

The read_excel function is convenient but not always the fastest option. Here is how different approaches compare for a 100 MB Excel file.

Method                             Time     Memory   Flexibility
pandas read_excel                  ~30 sec  High     Full featured
openpyxl read_only + CSV convert   ~20 sec  Low      Manual
pyxlsb (binary .xlsb format)       ~15 sec  Medium   Limited

For routine ETL jobs that run on a schedule, converting to CSV or Parquet first and then reading with pd.read_csv or pd.read_parquet in subsequent runs is significantly faster.
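One way to set that up, assuming the workbook changes rarely: do the slow Excel parse once, cache the result as CSV, and let every later run hit the cache. The file names here are placeholders, and the demo creates its own small source file:

```python
import os

import pandas as pd

# Stand-in for a large source workbook
pd.DataFrame({"Revenue": [100.0, 250.0]}).to_excel("large_sales.xlsx", index=False)

CACHE_PATH = "large_sales.csv"

if not os.path.exists(CACHE_PATH):
    # Slow one-time Excel parse, cached as CSV
    pd.read_excel("large_sales.xlsx").to_csv(CACHE_PATH, index=False)

# Every later run takes this much faster path
df = pd.read_csv(CACHE_PATH)
print(len(df))  # 2
```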

Bonus: Reading Google Sheets

If your Excel file is actually a Google Sheet, you can read it directly with pandas using the gspread library or by exporting a URL.

# Method 1: Export as CSV URL
csv_url = "https://docs.google.com/spreadsheets/d/YOUR_ID/export?format=csv"
df = pd.read_csv(csv_url)

# Method 2: Using gspread
import gspread
gc = gspread.service_account("credentials.json")
sh = gc.open("Your Sheet Name")
df = pd.DataFrame(sh.sheet1.get_all_records())

Summary

The pandas.read_excel() function is the workhorse for reading spreadsheet data in Python. You learned how to specify sheets by name or index, select only the columns you need, handle dates correctly, skip unwanted rows, deal with merged cells, process large files in chunks, and work with password-protected workbooks. The function has many arguments, and the key to mastering it is knowing which arguments apply to your specific file format. Keep this guide as a reference and tweak the parameters to match your actual data.

Frequently Asked Questions

What is the difference between read_excel and read_csv in pandas?

CSV files are plain text with comma-separated values. Excel files are binary spreadsheets that can contain multiple sheets, formatted cells, formulas, images, and more. Use read_csv for raw tabular data exports. Use read_excel when you need multi-sheet workbooks, formula results, or formatted data from spreadsheet tools.

How do I read only specific sheets from an Excel file?

Pass the sheet_name parameter with either the sheet name string or a zero-based sheet index. To read multiple sheets, pass a list of names or indices.

Why is pandas reading my dates as numbers?

Excel stores dates as serial numbers. Use parse_dates=True or parse_dates=["ColumnName"] to let pandas convert them to datetime objects automatically.

How do I read a password-protected Excel file?

pandas has no password argument, and openpyxl cannot open encrypted workbooks either. Decrypt the file first with the msoffcrypto-tool package, write the result to an in-memory buffer, and pass that buffer to read_excel. Keep the password in an environment variable rather than hardcoding it in your script.

Can pandas read .xls files?

Yes. Install xlrd with pip install xlrd and pandas will read .xls files automatically using that engine.

How do I read a large Excel file without running out of memory?

Read specific columns with usecols, limit rows with nrows, or convert the file to CSV first using openpyxl in read-only mode, then process the CSV in chunks.
