How to Use Apache Arrow for Lightning-Fast Data Fetching from SQL Server with mssql-python

Introduction

Fetching millions of rows from SQL Server used to be a bottleneck: each row became a Python object, triggering garbage collector overhead and memory bloat. With the latest mssql-python driver, you can now retrieve data directly as Apache Arrow structures—a zero-copy, columnar format that bypasses Python object creation entirely. This feature, contributed by community developer Felix Graßl (@ffelixg), integrates seamlessly with Polars, Pandas (via ArrowDtype), DuckDB, and other Arrow-native tools. In this guide, you'll learn how to set up and use this high-performance fetch path, step by step.

Source: devblogs.microsoft.com

What You Need

- Python 3 installed on your machine
- The latest mssql-python driver
- Access to a SQL Server instance (on-premises or Azure SQL)
- An Arrow-native library such as Polars, Pandas (with pyarrow), or DuckDB

Step-by-Step Guide

Step 1: Install the Required Packages

Open your terminal and install the latest mssql-python driver and your preferred Arrow-native library. For this guide, we'll use Polars as an example.

pip install mssql-python polars

If you plan to use Pandas with Arrow support, also install pyarrow:

pip install pandas pyarrow

For DuckDB:

pip install duckdb mssql-python

Step 2: Establish a Connection to SQL Server

Create a connection object using mssql.connect() with your server details. The connection handles authentication and session state—same as the non-Arrow path.

import mssql

conn = mssql.connect(
    host='your_server.database.windows.net',
    database='your_database',
    user='your_username',
    password='your_password',
    port=1433
)

Note: Use an environment variable or a secrets manager for credentials in production.
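As a hedged sketch of that advice, connection settings can be assembled from environment variables before connecting. The helper name and the MSSQL_* variable names below are illustrative, not part of the driver; the keyword arguments mirror the connect() call shown above.

```python
import os

# Hypothetical helper: assemble connection settings from environment
# variables, falling back to the default SQL Server port when unset.
def connection_settings(env=os.environ):
    required = ['MSSQL_HOST', 'MSSQL_DATABASE', 'MSSQL_USER', 'MSSQL_PASSWORD']
    missing = [name for name in required if name not in env]
    if missing:
        raise RuntimeError(f'missing environment variables: {missing}')
    return {
        'host': env['MSSQL_HOST'],
        'database': env['MSSQL_DATABASE'],
        'user': env['MSSQL_USER'],
        'password': env['MSSQL_PASSWORD'],
        'port': int(env.get('MSSQL_PORT', '1433')),
    }

# With a real server you would then pass these straight through:
# conn = mssql.connect(**connection_settings())
```

This keeps secrets out of source control and lets the same code run unchanged across development and production environments.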

Step 3: Execute a Query with Arrow Output

The magic happens in the cursor.execute() method. Set the arrow=True parameter to instruct the driver to fetch results as Apache Arrow structures.

with conn.cursor() as cursor:
    cursor.execute('SELECT * FROM your_table', arrow=True)
    result = cursor.fetchall()  # returns an Arrow Table (pyarrow.lib.Table)

Behind the scenes, the entire fetch loop runs in C++, writing column values directly into Arrow buffers. No Python objects are created per row—only the final Arrow Table object materializes in Python. This dramatically reduces garbage-collector pressure and speeds up data retrieval, especially for temporal types like DATETIME and DATETIMEOFFSET.

Step 4: Convert Arrow Table to Your DataFrame

The returned object is a PyArrow Table. Convert it to your preferred DataFrame format with zero additional copies.


For example, with Polars:

import polars as pl

df = pl.from_arrow(result)
print(df)

The DataFrame library receives a pointer to the same memory—no serialization, no copies. Subsequent filters, joins, and aggregations then operate directly on those Arrow-backed buffers, keeping your pipeline efficient.

Tips for Optimal Use

- Select only the columns you need rather than SELECT * to keep Arrow buffers small.
- Reserve the Arrow path for large result sets; for a handful of rows, the classic fetch is fine.
- Keep data in Arrow format end to end (Polars, DuckDB, Pandas with ArrowDtype) to avoid conversion copies.
- Store credentials in environment variables or a secrets manager, never in source code.

Conclusion

By following these steps, you can unlock the full potential of Apache Arrow in your SQL Server data pipelines—faster fetches, lower memory usage, and seamless interoperability with modern DataFrame libraries.
