Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C++ <-> Python conversion performance #3

Open
benbovy opened this issue Nov 29, 2022 · 5 comments
Open

C++ <-> Python conversion performance #3

benbovy opened this issue Nov 29, 2022 · 5 comments

Comments

@benbovy
Copy link
Owner

benbovy commented Nov 29, 2022

I did a quick benchmark to evaluate the overhead of C++ -> Python conversion:

py::array_t<PyObjectGeography> create(py::array_t<double> xs, py::array_t<double> ys) {
    py::buffer_info xbuf = xs.request(), ybuf = ys.request();

    auto result = py::array_t<PyObjectGeography>(xbuf.size);
    py::buffer_info rbuf = result.request();

    double *xptr = static_cast<double *>(xbuf.ptr);
    double *yptr = static_cast<double *>(ybuf.ptr);
    py::object *rptr = static_cast<py::object *>(rbuf.ptr);

    py::gil_scoped_release();
    for (size_t i = 0; i < xbuf.shape[0]; i++) {
        auto point_ptr = PointFactory::FromLatLonDegrees(xptr[i], yptr[i]);
        
        // either one or the other of the two code lines below are commented out.
        // "cast" benchmark (calls py::cast)
        rptr[i] = py::cast(std::move(point_ptr));
        // or
        // "no_cast" benchmark (just create an empty python object)
        rptr[i] = py::object()
    }

    return result;

...

m.def("create", &create);

Here are the results:

x = np.random.rand(10_000)
y = np.random.rand(10_000)

# "cast" benchmark
%timeit s2shapely.create(x, y)
# 12.6 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# "no cast" benchmark
%timeit s2shapely.create(x, y)   
# 4.46 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The cast version is almost 3x slower than the no cast version. That's a lot for "just" wrapping the C++ geography in a Python object. Not sure what's happening (the Geography (sub)classes only implement move semantics, so there shouldn't be any data copy at this level), but there must be something wrong.

@benbovy
Copy link
Owner Author

benbovy commented Nov 29, 2022

Tracked this down in pybind11's type_caster_generic::cast.

The bottleneck seems that pybind11 keeps a map of all registered instances and first tries to find and return the corresponding instance from the map, creating a new one if not found.

In the example above (more generally in geography creation functions), we know that no instance already exists, we could skip this step.

EDIT:

Other potential optimizations:

  • In certain geography creation functions, we may know exactly the type of the (derived) geography object to create in Python, so we might bypass pybind11's automatic downcasting (see https://pybind11.readthedocs.io/en/stable/advanced/classes.html#custom-automatic-downcasters)
  • type_caster_generic::cast handles multiple inheritance via values_and_holders. This is unlikely necessary for Geography classes and we may want to skip creating 1-sized std::vector objects (+ executing other logic) each time a Python geography object is created.
  • Repeated look-up for Python types in pybind11::detail::get_type_info(), which is not necessary here.

Note also: the benchmark above may be biased as pybind11 probably optimizes the empty py::object() case.

@benbovy
Copy link
Owner Author

benbovy commented Nov 30, 2022

Now a quick benchmark to evaluate the overhead of Python -> C++ conversion:

int dummy(PyObjectGeography obj) {
    // equivalent to `obj.cast<Geography*>()` where obj is a `py::object`
    obj.as_geog_ptr();
    return 0;
}

int dummy_no_cast(PyObjectGeography obj) {
    return 0;
}

int get_dimensions(PyObjectGeography obj) {
    return obj.as_geog_ptr()->dimension();
}

...

m.def("dummy", py::vectorize(&dummy));
m.def("dummy_no_cast", py::vectorize(&dummy_no_cast));
m.def("get_dimensions", py::vectorize(&get_dimensions));

Results:

x = np.random.rand(100_000)
y = np.random.rand(100_000)

points = s2shapely.create(x, y)

%timeit s2shapely.dummy_no_cast(points)
# 542 µs ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit s2shapely.dummy(points)
# 8.39 ms ± 29.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit s2shapely.get_dimensions(points)
# 8.93 ms ± 36.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Significant overhead here too.

@benbovy
Copy link
Owner Author

benbovy commented Nov 30, 2022

After more investigation, it looks like most of the conversion overhead is coming from calls to pybind11::detail::get_type_info() to get the C++ type from a Python type or vice-versa. This helper function looks into pybind11 registered mappings of C++ / Python types and is called in various places like py::isinstance<T>(handle), py::object::cast(), py::cast(), etc.

In our case (at least for the Python -> C++ conversion within a vectorized function operating on geographies), we don't need to perform this lookup for each array element. It is enough (and almost free!) to just extract the pointer to the C++ object from the Python object and cast it to the Geography base class:

auto inst = reinterpret_cast<py::detail::instance *>(py_obj.ptr());
return reinterpret_cast<Geography*>(inst->simple_value_holder[0]);

"Just extract" is not really right, we also want to check first if we can safely cast the Python object to a Geography object (or raise), and check that in an efficient way. Unfortunately, this doesn't seem to be supported in pybind11. Maybe we could wrap pybind11::detail::get_type_info() and cache the results so the lookup is done only once.

This still uses pybind11's internals, though (pybind/pybind11#1572).

@benbovy
Copy link
Owner Author

benbovy commented Dec 1, 2022

@benbovy
Copy link
Owner Author

benbovy commented Dec 1, 2022

Nanobind has a low-level instance interface that could make things much easier: https://github.com/wjakob/nanobind/blob/master/docs/lowlevel.md. However, it doesn't support automatic vectorization (yet?).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant