Add DB-API 2.0 cursor support to pandas DataFrame constructor `data` parameter #54376

kesmit13 · 2023-08-02T18:18:12Z

The current DataFrame constructor takes many objects as the data to use including a list-like object which can contain other list-like objects as rows. Most DB-API clients support iterating over a cursor object that has data to download from a database. If the rows of data are in namedtuples, the column names are extracted from the _fields attribute of the namedtuple, but normal tuples and StructSequences are left without column names. The feature in this code adds support for detecting cursor objects and extracting the column names from the cursor if they haven't been explicitly specified.

import sqlite3

conn = sqlite3.connect('aquarium.db')

cursor = conn.cursor()
cursor.execute("CREATE TABLE fish (name TEXT, species TEXT, tank_number INTEGER)")

cursor.execute("INSERT INTO fish VALUES ('Sammy', 'shark', 1)")
cursor.execute("INSERT INTO fish VALUES ('Jamie', 'cuttlefish', 7)")

cursor.execute("SELECT * FROM fish")

import pandas as pd

print(pd.DataFrame(cursor))

Output:

    name     species  tank_number
0  Sammy       shark            1
1  Jamie  cuttlefish            7

The current way of doing this is:

import pandas as pd

columns = [x[0] for x in cursor.description]

print(pd.DataFrame(cursor, columns=columns))

Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

WillAyd · 2023-08-02T20:02:52Z

pandas/_libs/lib.pyx

@@ -1219,6 +1219,35 @@ cdef bint c_is_list_like(object obj, bint allow_sets) except -1:
    )


+def is_cursor(obj: object) -> bool:


You can just move this to the sql.py file itself; I don't think this would be used elsewhere

The intent is that it would be used by the DataFrame constructor after seeing if the data is list-like (see pandas/core/frame.py diff). I realize that is a bit invasive for such a special case, but it's the place where it is most generally useful for this use-case.

WillAyd · 2023-08-02T20:03:59Z

pandas/tests/dtypes/test_inference.py

@@ -131,6 +131,42 @@ def shape(self):
        return self._values.shape


+class MockDBCursor:


Do we really need this or can we just set up a test that assigns this to the cursor?

Not sure what you mean, but I could drop this specific of a test and add a test with the rest of the database tests. That would be more of a real-world use of the feature.

mroeschke

Is there an issue discussing this feature? We usually require an issue where enhancements are discussed with buy-in from the core team.

At first brush. I think the work-around seems adequate and not sure if the DataFrame constructor needs to support this

kesmit13 · 2023-08-02T21:28:05Z

@mroeschke There isn't an open issue. I didn't realize that opening an issue was required before submitting a PR.

asishm · 2023-08-03T02:03:32Z

Just a general question - why this as opposed to pd.read_sql ?

kesmit13 · 2023-08-03T02:12:07Z

Just a general question - why this as opposed to pd.read_sql ?

All of the pandas integrations with databases require SQLAlchemy. Supporting plain DB-API cursors in the DataFrame constructor allows this technique to work with any database client without that extra dependency.

I guess my main motivation is consistency between database clients for our customers. For example, you can do pd.DataFrame(cursor-or-result) on a SQLAlchemy object or the output of %%sql magics, but it only works on DB-API cursors if they use namedtuples.

My particular issue is with returning StructSequences (because they are much faster than namedtuples), but unfortunately even though they are advertised as equivalent to namedtuples, they don't have a _fields attribute (this is a long outstanding issue in Python itself).

WillAyd · 2023-08-04T19:21:05Z

I missed the overall goal of this on initial review but I also am not really sure we want the DataFrame constructor to support this. The fact that you can call the DataFrame constructor on a SQLAlchemy object is just an implementation detail at the moment right? I don't recall that being a documented feature and think we should just stick to the read_sql interface

kesmit13 · 2023-08-04T19:32:37Z

I missed the overall goal of this on initial review but I also am not really sure we want the DataFrame constructor to support this. The fact that you can call the DataFrame constructor on a SQLAlchemy object is just an implementation detail at the moment right? I don't recall that being a documented feature and think we should just stick to the read_sql interface

SQLAlchemy uses namedtuples in its results and pandas looks for a _fields attribute on the first row of a list-like object to extract the column names. Unfortunately, StructSequences which claim to be equivalent to namedtuples (but are much faster and used by our database connector and others) don't have that attribute and the column names don't get applied. I wish there was a way to extract the field names from StructSequences because that would be the better solution for this case, but alas, there isn't...

kesmit13 · 2023-08-08T14:04:55Z

Here's another idea. Rather than having something specific to the DB-API, how about adding support for some other arbitrary attribute / property on the data parameter of the DataFrame constructor (maybe _row_fields). Then any package that wants to surface column names to a DataFrame can add that property to their class.

asishm-wk · 2023-08-11T18:39:09Z

while the official stance is that DBAPI2 connections (other than sqlite3) are untested (and raises a UserWarning), it works fine (as long as you're using a connection that returns a list of lists/tuples (dictcursors don't work since 2.0 - xref #53028)

in your case, you can run pd.read_sql("SELECT * FROM fish", con=conn)

mroeschke · 2023-08-25T18:14:38Z

Thanks for the PR, but it appears that there isn't sufficient enthusiasm for this feature yet in the PR. I would recommend opening an issue first so this can be discussed more thoroughly before moving forward with a PR so closing

Add DB-API 2.0 cursor support to pandas DataFrame constructor

ac52dfa

kesmit13 requested a review from WillAyd as a code owner August 2, 2023 18:18

pre-commit fixes

a9bbbb2

WillAyd requested changes Aug 2, 2023

View reviewed changes

Export is_cursor

4c62bab

mroeschke requested changes Aug 2, 2023

View reviewed changes

mroeschke closed this Aug 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add DB-API 2.0 cursor support to pandas DataFrame constructor `data` parameter #54376

Add DB-API 2.0 cursor support to pandas DataFrame constructor `data` parameter #54376

kesmit13 commented Aug 2, 2023 •

edited

Loading

WillAyd Aug 2, 2023

kesmit13 Aug 2, 2023

WillAyd Aug 2, 2023

kesmit13 Aug 4, 2023

mroeschke left a comment

kesmit13 commented Aug 2, 2023 •

edited

Loading

asishm commented Aug 3, 2023

kesmit13 commented Aug 3, 2023 •

edited

Loading

WillAyd commented Aug 4, 2023

kesmit13 commented Aug 4, 2023

kesmit13 commented Aug 8, 2023

asishm-wk commented Aug 11, 2023

mroeschke commented Aug 25, 2023

		@@ -1219,6 +1219,35 @@ cdef bint c_is_list_like(object obj, bint allow_sets) except -1:
		)


		def is_cursor(obj: object) -> bool:

		@@ -131,6 +131,42 @@ def shape(self):
		return self._values.shape


		class MockDBCursor:

Add DB-API 2.0 cursor support to pandas DataFrame constructor data parameter #54376

Add DB-API 2.0 cursor support to pandas DataFrame constructor data parameter #54376

Conversation

kesmit13 commented Aug 2, 2023 • edited Loading

WillAyd Aug 2, 2023

Choose a reason for hiding this comment

kesmit13 Aug 2, 2023

Choose a reason for hiding this comment

WillAyd Aug 2, 2023

Choose a reason for hiding this comment

kesmit13 Aug 4, 2023

Choose a reason for hiding this comment

mroeschke left a comment

Choose a reason for hiding this comment

kesmit13 commented Aug 2, 2023 • edited Loading

asishm commented Aug 3, 2023

kesmit13 commented Aug 3, 2023 • edited Loading

WillAyd commented Aug 4, 2023

kesmit13 commented Aug 4, 2023

kesmit13 commented Aug 8, 2023

asishm-wk commented Aug 11, 2023

mroeschke commented Aug 25, 2023

Add DB-API 2.0 cursor support to pandas DataFrame constructor `data` parameter #54376

Add DB-API 2.0 cursor support to pandas DataFrame constructor `data` parameter #54376

kesmit13 commented Aug 2, 2023 •

edited

Loading

kesmit13 commented Aug 2, 2023 •

edited

Loading

kesmit13 commented Aug 3, 2023 •

edited

Loading