Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DB-API 2.0 cursor support to pandas DataFrame constructor data parameter #54376

Closed
wants to merge 3 commits into from

Conversation

kesmit13
Copy link

@kesmit13 kesmit13 commented Aug 2, 2023

The current DataFrame constructor takes many objects as the data to use including a list-like object which can contain other list-like objects as rows. Most DB-API clients support iterating over a cursor object that has data to download from a database. If the rows of data are in namedtuples, the column names are extracted from the _fields attribute of the namedtuple, but normal tuples and StructSequences are left without column names. The feature in this code adds support for detecting cursor objects and extracting the column names from the cursor if they haven't been explicitly specified.

import sqlite3

conn = sqlite3.connect('aquarium.db')

cursor = conn.cursor()
cursor.execute("CREATE TABLE fish (name TEXT, species TEXT, tank_number INTEGER)")

cursor.execute("INSERT INTO fish VALUES ('Sammy', 'shark', 1)")
cursor.execute("INSERT INTO fish VALUES ('Jamie', 'cuttlefish', 7)")

cursor.execute("SELECT * FROM fish")

import pandas as pd

print(pd.DataFrame(cursor))

Output:

    name     species  tank_number
0  Sammy       shark            1
1  Jamie  cuttlefish            7

The current way of doing this is:

import pandas as pd

columns = [x[0] for x in cursor.description]

print(pd.DataFrame(cursor, columns=columns))

@kesmit13 kesmit13 requested a review from WillAyd as a code owner August 2, 2023 18:18
@@ -1219,6 +1219,35 @@ cdef bint c_is_list_like(object obj, bint allow_sets) except -1:
)


def is_cursor(obj: object) -> bool:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just move this to the sql.py file itself; I don't think this would be used elsewhere

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent is that it would be used by the DataFrame constructor after seeing if the data is list-like (see pandas/core/frame.py diff). I realize that is a bit invasive for such a special case, but it's the place where it is most generally useful for this use-case.

@@ -131,6 +131,42 @@ def shape(self):
return self._values.shape


class MockDBCursor:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this or can we just set up a test that assigns this to the cursor?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean, but I could drop this specific of a test and add a test with the rest of the database tests. That would be more of a real-world use of the feature.

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an issue discussing this feature? We usually require an issue where enhancements are discussed with buy-in from the core team.

At first brush. I think the work-around seems adequate and not sure if the DataFrame constructor needs to support this

@kesmit13
Copy link
Author

kesmit13 commented Aug 2, 2023

@mroeschke There isn't an open issue. I didn't realize that opening an issue was required before submitting a PR.

@asishm
Copy link
Contributor

asishm commented Aug 3, 2023

Just a general question - why this as opposed to pd.read_sql ?

@kesmit13
Copy link
Author

kesmit13 commented Aug 3, 2023

Just a general question - why this as opposed to pd.read_sql ?

All of the pandas integrations with databases require SQLAlchemy. Supporting plain DB-API cursors in the DataFrame constructor allows this technique to work with any database client without that extra dependency.

I guess my main motivation is consistency between database clients for our customers. For example, you can do pd.DataFrame(cursor-or-result) on a SQLAlchemy object or the output of %%sql magics, but it only works on DB-API cursors if they use namedtuples.

My particular issue is with returning StructSequences (because they are much faster than namedtuples), but unfortunately even though they are advertised as equivalent to namedtuples, they don't have a _fields attribute (this is a long outstanding issue in Python itself).

@WillAyd
Copy link
Member

WillAyd commented Aug 4, 2023

I missed the overall goal of this on initial review but I also am not really sure we want the DataFrame constructor to support this. The fact that you can call the DataFrame constructor on a SQLAlchemy object is just an implementation detail at the moment right? I don't recall that being a documented feature and think we should just stick to the read_sql interface

@kesmit13
Copy link
Author

kesmit13 commented Aug 4, 2023

I missed the overall goal of this on initial review but I also am not really sure we want the DataFrame constructor to support this. The fact that you can call the DataFrame constructor on a SQLAlchemy object is just an implementation detail at the moment right? I don't recall that being a documented feature and think we should just stick to the read_sql interface

SQLAlchemy uses namedtuples in its results and pandas looks for a _fields attribute on the first row of a list-like object to extract the column names. Unfortunately, StructSequences which claim to be equivalent to namedtuples (but are much faster and used by our database connector and others) don't have that attribute and the column names don't get applied. I wish there was a way to extract the field names from StructSequences because that would be the better solution for this case, but alas, there isn't...

@kesmit13
Copy link
Author

kesmit13 commented Aug 8, 2023

Here's another idea. Rather than having something specific to the DB-API, how about adding support for some other arbitrary attribute / property on the data parameter of the DataFrame constructor (maybe _row_fields). Then any package that wants to surface column names to a DataFrame can add that property to their class.

@asishm-wk
Copy link

while the official stance is that DBAPI2 connections (other than sqlite3) are untested (and raises a UserWarning), it works fine (as long as you're using a connection that returns a list of lists/tuples (dictcursors don't work since 2.0 - xref #53028)

in your case, you can run pd.read_sql("SELECT * FROM fish", con=conn)

@mroeschke
Copy link
Member

Thanks for the PR, but it appears that there isn't sufficient enthusiasm for this feature yet in the PR. I would recommend opening an issue first so this can be discussed more thoroughly before moving forward with a PR so closing

@mroeschke mroeschke closed this Aug 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants