Supercharging SQL Server Data Pipelines: Arrow Support in mssql-python

By

Faster, Leaner Data Fetching with Apache Arrow

Retrieving a million rows from SQL Server into a Polars DataFrame used to be a sluggish process: each row required creating a Python object, allocating memory in the garbage collector, and then discarding those objects to build the DataFrame. That era is over. The mssql-python driver now supports fetching SQL Server data directly as Apache Arrow structures, offering a faster and more memory-efficient path for anyone working with Polars, Pandas, DuckDB, or other Arrow-native libraries. This feature was contributed by community developer Felix Graßl (@ffelixg), and we are excited to make it available.

Supercharging SQL Server Data Pipelines: Arrow Support in mssql-python
Source: devblogs.microsoft.com

Understanding the Core Concepts

Before diving into the benefits, it helps to clarify a few terms that underpin this new capability:

  • API (Application Programming Interface): A source-code contract that defines how to call a function or library.
  • ABI (Application Binary Interface): A binary-level contract specifying how compiled code is laid out in memory. Two programs built in different languages can share an ABI and exchange data directly — no serialization is needed.
  • Arrow C Data Interface: Apache Arrow’s ABI specification — the standard that makes zero-copy data exchange between languages possible.

What Is Apache Arrow?

The key insight behind Apache Arrow is zero-copy language interoperability. Arrow defines a stable shared-memory layout — the Arrow C Data Interface, a cross-language ABI — that any language can produce or consume by exchanging a pointer. This eliminates serialization, copies, and re-parsing. A C++ database driver and a Python DataFrame library can operate on the exact same memory without either one knowing about the other.

Built on this foundation, Arrow uses a columnar in-memory format: instead of representing a table as a list of rows (each row a collection of Python objects), Arrow stores all values for a column contiguously in a typed buffer. Nulls are tracked in a compact bitmap rather than per-cell None objects.

For a database driver, this means the entire fetch loop can run in C++ and write values directly into Arrow buffers — no Python object creation per row, no garbage-collector pressure. The DataFrame library receives a pointer to that memory and can begin operating on it immediately. Crucially, subsequent operations — filters, joins, aggregations — also work in-place on those same buffers. A Polars pipeline reading from mssql-python never needs to materialize intermediate Python objects at any stage, making Arrow the right foundation for high-throughput data processing pipelines.

Four Concrete Benefits for Users

1. Speed

The columnar fetch path avoids Python object creation per row, making fetching noticeably faster for many SQL Server types — especially temporal types like DATETIME and DATETIMEOFFSET, where Python-side per-value conversions are eliminated entirely.

Supercharging SQL Server Data Pipelines: Arrow Support in mssql-python
Source: devblogs.microsoft.com

2. Lower Memory Usage

A column of one million integers is a single contiguous C array, not a million individual Python objects. This drastically reduces memory consumption and GC overhead.

3. Seamless Interoperability

Arrow-native libraries like Polars, Pandas (via ArrowDtype), DuckDB, Hugging Face datasets, and others can consume the data directly without conversion. This streamlines multi-tool workflows.

4. Simplified Data Pipelines

By eliminating intermediate serialization steps, you can build cleaner, more efficient ETL processes. The same Arrow buffers can be passed between systems without re-formatting.

Getting Started with Arrow in mssql-python

To enable Arrow fetching, use the fetch_arrow=True parameter when executing a query. For example:

import mssql

conn = mssql.connect(server='localhost', database='test')
cursor = conn.cursor()
cursor.execute('SELECT * FROM large_table')
arrow_table = cursor.fetch_arrow()  # Returns a PyArrow Table

You can then pass arrow_table directly to Polars, Pandas, or other Arrow-compatible tools.

Polars Integration Example

import polars as pl
df = pl.from_arrow(arrow_table)  # zero-copy conversion

This avoids the overhead of creating a Python list of tuples and then constructing a DataFrame. For large datasets, the performance improvement is dramatic.

What’s Next?

The mssql-python team continues to enhance Arrow support, with plans to cover more SQL Server data types and optimize further. Community contributions are welcome — Felix Graßl’s work is a great example of how open-source collaboration accelerates innovation.

Explore the mssql-python repository for full documentation and examples.

Related Articles

Recommended

Discover More

Understanding Inheritance in Java: A Complete Guide8 Enduring Lessons from The Mythical Man-Month for Software Teams TodayOpenAI Integrates C2PA and Google SynthID to Verify AI Image OriginsHow to Continuously Train Custom AI Models Using Your Existing Production Workflows7 Key Insights into Swift's Expanding IDE Ecosystem