Proposal to implement ADBC driver for Apache Cassandra #2245

Open
SChakravorti21 opened this issue Oct 13, 2024 · 2 comments
Labels
Type: enhancement New feature or request

Comments

@SChakravorti21

What feature or improvement would you like to see?

There isn't an existing ADBC driver for Cassandra as far as I can tell, and it would be great to have one! I'm interested in starting this effort as I have experience getting Arrow data to/from Cassandra, and have a little experience working on an ADBC driver for a different database (comdb2). I met @zeroshade at Community Over Code recently, who inspired me to start the discussion around creating a Cassandra driver :)

Some initial thoughts:

  • Choice of language

    • I'm personally most familiar with the Cassandra C/C++ driver as well as Arrow C++. However, if there's good reason to implement the driver in a different language, I'm open to that and happy to get up to speed.

    • Matt explained that it would be better to use nanoarrow rather than Arrow C++ as the latter is a heavy dependency and can complicate building/deploying drivers. Using nanoarrow sounds like a good idea to me.

  • Implementation considerations

    • Cassandra currently does not offer any native mechanism for fetching/ingesting data in Arrow format, so we would likely have to implement row ↔ column transposition on the client side (in the driver).

    • The Cassandra Query Language (CQL) can be thought of as an extremely limited subset of SQL. This StackOverflow answer is a good overview of the general limitations. I figure this shouldn't matter as far as implementing an ADBC driver is concerned, but thought it was worth mentioning in case I'm wrong.

    • Matt also mentioned that there is now an ADBC driver framework. I don't see any reason not to use this. If we find any gaps in the framework while implementing the driver, I'm happy to help fill them in.

  • First step(s)

    • Matt mentioned that, before implementing anything, it would be good to stand up a Cassandra node/cluster in CI so that others can also play around with and contribute to the driver.

    • I suppose the next step would be to configure the build system to pull in the necessary dependencies (like the Cassandra C/C++ driver).

    • ... Start implementing the driver along with integration tests?

I'd love to hear any other considerations for implementing this ADBC driver and/or recommendations on getting started!

@SChakravorti21 SChakravorti21 added the Type: enhancement New feature or request label Oct 13, 2024
@paleolimbot
Member

paleolimbot commented Oct 13, 2024

> I'm personally most familiar with the Cassandra C/C++ driver as well as Arrow C++.

> Matt also mentioned that there is now an ADBC driver framework.

I'm hoping to finish it this week, but there's a work-in-progress tutorial on getting started building a driver in C++ using nanoarrow and the framework in #2186. Arrow C++ presents a packaging problem (e.g., it is difficult or impossible to make an R driver wrapper, and a Python wrapper would require pinning a version of pyarrow until we sort out how to put two different Arrow C++ versions in the same process), which is probably why Matt recommended nanoarrow.

> However, if there's good reason to implement the driver in a different language, I'm open to that and happy to get up to speed.

It's a bit subjective, but all our existing drivers lean on the most arrowish SDK available for the driver (e.g., Postgres has libpq for C, so we implemented that in C++; Snowflake and BigQuery have Arrow integrations in their Go connectors, so we wrote those in Go). I have no idea what Cassandra provides, but if it had a fairly complete Go or Rust client already and nothing for C++ that might be a good reason to implement it in those languages. The fact that you know C++ and you're motivated counts for a lot, though!

> Matt mentioned that, before implementing anything, it would be good to stand up a Cassandra node/cluster in CI

We have some docker compose services for databases for this purpose. You could first do a PR that makes it possible to run docker compose up apache-cassandra-test. (Since I think you would be a "first time contributor", this would also mean that the PR where you actually implement the driver doesn't require one of us to OK the CI jobs after every push.) (Apologies if I understand Cassandra too poorly and this is not a good fit!)

> so we would likely have to implement row ↔ column transposition on the client side (in the driver).

The Postgres driver has an example of writing tests for this without a live connection to the database (the "copy" tests). No pressure to do it exactly like that but I found it useful to accelerate the process of adding full type support there.
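
As a rough illustration of the client-side transposition under discussion, here is a minimal sketch that turns row-major nullable int32 cells into Arrow-style column buffers (a values buffer plus a validity bitmap with least-significant-bit numbering). It is not taken from the Postgres driver or the ADBC framework: `Int32Column`, `Row`, and `Transpose` are hypothetical names, and a real driver would decode cells from CQL result rows and hand the buffers to nanoarrow. Because it needs no database connection, logic like this can be unit tested in the same spirit as the "copy" tests.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for one column's Arrow-style buffers.
struct Int32Column {
  std::vector<int32_t> values;    // one slot per row; null slots hold 0
  std::vector<uint8_t> validity;  // bit i set => row i is non-null (LSB first)
  int64_t null_count = 0;
};

// Hypothetical row type: one optional cell per column, empty meaning NULL.
using Row = std::vector<std::optional<int32_t>>;

// Transpose row-major cells into one column per field.
// Assumes every row holds exactly num_columns cells.
std::vector<Int32Column> Transpose(const std::vector<Row>& rows,
                                   std::size_t num_columns) {
  std::vector<Int32Column> columns(num_columns);
  const std::size_t bitmap_bytes = (rows.size() + 7) / 8;
  for (Int32Column& col : columns) {
    col.values.reserve(rows.size());
    col.validity.assign(bitmap_bytes, 0);  // start with every row marked null
  }
  for (std::size_t r = 0; r < rows.size(); ++r) {
    for (std::size_t c = 0; c < num_columns; ++c) {
      const std::optional<int32_t>& cell = rows[r][c];
      Int32Column& col = columns[c];
      if (cell.has_value()) {
        col.values.push_back(*cell);
        col.validity[r / 8] |= static_cast<uint8_t>(1u << (r % 8));
      } else {
        col.values.push_back(0);  // placeholder; the bitmap marks it null
        ++col.null_count;
      }
    }
  }
  return columns;
}
```

Feeding a handful of hand-written rows through `Transpose` and comparing the resulting buffers is enough to exercise null handling before any Cassandra cluster is involved.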

> I'd love to hear any other considerations for implementing this ADBC driver

Where to put it is a good thing to think about. Ideally we'd (maybe just speaking for me here) like ADBC connectors to live with the database project instead of with us, to spread out the maintenance load (e.g., like DuckDB), but there is also not a straightforward way to implement the validation suite outside this repository (or if there is, nobody has tried it yet!). Probably the easiest place to start is as a PR into apache/arrow-adbc, moving it when we sort out those details.

Feel free to ping me early and often as you get started (probably everybody else is game too, but I'll let them volunteer themselves 🙂). All of this is helpful for us too, since we've all had build setups for ADBC since the beginning and forget the issues encountered by those new to the project.

> Start implementing the driver along with integration tests?

🚀

If I had to suggest a place to start, it would be to get a "hello world" example running where you can open and close a connection to the database. Then you could perhaps follow that up with implementing the statement's ExecuteQuery for a single type (int32 or string, maybe). These are all just suggestions (do your thing!)
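
For the single-type string case, the target is Arrow's variable-length layout: an offsets buffer with one more entry than there are rows, plus one contiguous data buffer. The following is only a sketch of that layout under simplifying assumptions. `StringColumn`, `AppendCell`, and `GetCell` are hypothetical names rather than framework or nanoarrow APIs, and validity is kept in a `std::vector<bool>` for brevity where Arrow uses a packed bitmap.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical holder for an Arrow-style variable-length string column.
struct StringColumn {
  std::vector<int32_t> offsets{0};  // always one more entry than rows
  std::string data;                 // all non-null values, concatenated
  std::vector<bool> is_valid;       // simplification: Arrow packs this into a bitmap
};

// Append one cell. A null repeats the previous offset, so it occupies no bytes.
void AppendCell(StringColumn& col, const std::optional<std::string>& cell) {
  if (cell.has_value()) col.data += *cell;
  col.is_valid.push_back(cell.has_value());
  col.offsets.push_back(static_cast<int32_t>(col.data.size()));
}

// Read row i back out, mainly to check that the layout round-trips.
std::optional<std::string> GetCell(const StringColumn& col, std::size_t i) {
  if (!col.is_valid[i]) return std::nullopt;
  return col.data.substr(col.offsets[i], col.offsets[i + 1] - col.offsets[i]);
}
```

A first ExecuteQuery implementation could append each fetched cell this way and then expose the buffers through whatever array builder the driver ends up using.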

@lidavidm
Member

Thanks Dewey for the detailed guide!

I'd say so long as Cassandra in C++ doesn't require gRPC, and you can use nanoarrow (instead of Arrow C++ for the aforementioned reasons), then C++ would be great.

> there is also not a straightforward way to implement the validation suite outside this repository (or if there is, nobody has tried it yet!)

We briefly did this when I tried to get Flight SQL into pyarrow instead of having a separate wheel. I don't think it was complicated, though we'd probably have to fix up the CMake definitions again.
