How to Accelerate SQL Server Data Fetching Using Apache Arrow in mssql-python

By

Introduction

Fetching a million rows from SQL Server into a Polars DataFrame used to mean creating a million Python objects, triggering countless garbage-collector allocations, and then discarding it all to build the DataFrame. That overhead is now gone. The mssql-python driver, thanks to a community contribution by Felix Graßl (@ffelixg), can fetch SQL Server data directly as Apache Arrow structures. This guide walks you through using Arrow support in mssql-python to make your data pipelines faster and more memory-efficient, whether you work with Polars, Pandas, DuckDB, or any other Arrow-native library.

How to Accelerate SQL Server Data Fetching Using Apache Arrow in mssql-python
Source: devblogs.microsoft.com

What You Need

  • Python 3.8+ installed on your system.
  • mssql-python driver (version 1.2.0 or later) with Arrow support. Install via pip install mssql-python[arrow].
  • A target library that can consume Arrow data: Polars, Pandas (with ArrowDtype), DuckDB, or similar.
  • SQL Server database with appropriate credentials and network access.

Step-by-Step Guide

Step 1: Install Required Packages

First, ensure you have the latest mssql-python package with Arrow support. Open your terminal and run:

pip install mssql-python[arrow] polars

This installs both the driver and Polars as an example consumer. For Pandas or DuckDB, adjust accordingly.

Step 2: Import Libraries

Create a Python script and import the necessary modules:

import mssql
import polars as pl
from mssql import connect

The mssql module provides the driver; polars will be used to verify the zero-copy integration.

Step 3: Establish a Connection

Use your SQL Server credentials to create a connection. Replace placeholders with your actual server, database, username, and password:

connection = connect(
    server="your_server.database.windows.net",
    database="your_database",
    username="your_username",
    password="your_password"
)
cursor = connection.cursor()

Step 4: Execute a Query and Fetch as Arrow

Execute a SELECT query. The key new method is fetcharrow(), which returns an Arrow Table directly—no Python objects per row are created:

cursor.execute("SELECT TOP 1000000 * FROM large_table")
arrow_table = cursor.fetcharrow()

This call runs entirely in C++, writing values into contiguous, typed Arrow buffers. Nulls are tracked with a compact bitmap—no None objects per cell.

Step 5: Convert to Your DataFrame Library

Because Arrow uses a stable ABI (the Arrow C Data Interface), conversion is instantaneous—zero-copy:

How to Accelerate SQL Server Data Fetching Using Apache Arrow in mssql-python
Source: devblogs.microsoft.com
  • To Polars: df = pl.from_arrow(arrow_table)
  • To Pandas: df = arrow_table.to_pandas(types_mapper=pd.ArrowDtype)
  • To DuckDB: duckdb.sql("SELECT * FROM arrow_table")

No serialization, no copies, no re-parsing—just a pointer exchange.

Step 6: Perform Operations Without Python Overhead

Once data is in an Arrow-native DataFrame, operations like filters, joins, and aggregations also work in-place on the same shared memory buffers. For example, in Polars:

result = df.filter(pl.col("datetime_col") > "2023-01-01").group_by("category").agg(pl.col("value").sum())

No intermediate Python objects are materialized at any stage—this is the foundation for high-throughput pipelines.

Tips and Best Practices

  • Speed gains are most noticeable with temporal types (like DATETIME and DATETIMEOFFSET) because per-value Python-side conversions are eliminated entirely.
  • Memory usage drops dramatically: A column of one million integers becomes a single contiguous C array, not a million individual Python objects. Great for large datasets.
  • Seamless interoperability with other Arrow-native tools—Polars, Pandas (with ArrowDtype), DuckDB, and even tools like Hugging Face datasets—means you can mix and match libraries without conversion costs.
  • Ensure you're using a recent version of mssql-python (1.2.0+) to have the fetcharrow() method available.
  • For best results, avoid fetching rows individually—always use bulk fetch methods like fetcharrow() when working with Arrow.
  • Monitor your garbage collector; with Arrow, you'll see far less GC pressure, making your overall application more predictable.

Related Articles

Recommended

Discover More

Block Protocol Progress Revives Semantic Web Promise After Two Decades of Stalled AdoptionEuropa Universalis 5's 72-Page Patch 1.2 'Echinades' Released – Biggest Update YetMastering NYT Strands: Guide to Game #798 Hints and AnswersHow to Fortify Your Medical Device Company Against Iran-Linked Wiper Attacks7 Key Insights from the JetStream 3 Benchmark Overhaul