-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: geoarrow export #30
feat: geoarrow export #30
Conversation
src/s2geography/geoarrow.cc
Outdated
case OutputType::kPoints: | ||
impl_->Init(GEOARROW_TYPE_INTERLEAVED_POINT, options, out_schema); | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just added kPoints
(mapping to interleaved points) for a quick test. I assume we might want to add some generic "native geoarrow" option like kGeoArrow(Interleaved)
, and then infer the schema from the geographies? (with something similar like GeoArrowGEOSSchemaCalculator
you did in geoarrow-c-geos)
It might still be worth allowing the user to specify the type themselves to avoid inference / choose a specific type (e.g. always the multi-version regardless of presence of any multi geom)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
geoarrow-c-geos also has a function to help you make a schema:
The right entry point I think is still ArrowSchema*
(until a point where geoarrow-c exposes an ABI for specifying geometry type constraints, which is I think a ways off). It's definitely good to let the user specify this (often they already know because they're calculating a centroid or something).
The calculator is a definite must! It can probably use geography->num_shape()
and geography->dimension()
:
s2geography/src/s2geography/geography.h
Lines 32 to 48 in 829a0fe
virtual int dimension() const { | |
if (num_shapes() == 0) { | |
return -1; | |
} | |
int dim = Shape(0)->dimension(); | |
for (int i = 1; i < num_shapes(); i++) { | |
if (dim != Shape(i)->dimension()) { | |
return -1; | |
} | |
} | |
return dim; | |
} | |
// The number of S2Shape objects needed to represent this Geography | |
virtual int num_shapes() const = 0; |
5fb0170
to
269e5b9
Compare
src/s2geography/geoarrow.cc
Outdated
void Writer::Init(OutputType output_type, const ImportOptions& options, | ||
struct ArrowSchema* out_schema) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case of initializing the writer with just a type, I added an ArrowSchema as output argument, because for exporting the array to another producer, we need both the array and schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put this above, too, but I think you want a const ArrowSchema*
as your entry point into the Writer
(you can expose a function to generate the appropriate ArrowSchema*
so that those APIs are decoupled).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, if we have a separate function to construct this schema from a type / encoding (or those inferred from an array of gemetries), then it is perfectly fine to only have the Init
here starting from a schema
Pushed a small update:
|
src/s2geography/geoarrow.cc
Outdated
for (int i = 0; i < poly->num_vertices(); i++) { | ||
S2LatLng ll(poly->vertex(i)); | ||
coords[0] = ll.lng().degrees(); | ||
coords[1] = ll.lat().degrees(); | ||
GEOARROW_RETURN_NOT_OK(visitor_.coords(&visitor_, &coords_view_)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently for geometries with >1 coordinates (here for the example of a linestring), I loop over the s2points and call the coords()
callback multiple times.
The alternative is to build up a single GeoArrowCoordView
with the n_coords
set to something larger than 1, and then only do a single call to coords()
.
Now, given that we don't start from a contiguous array of coords, I suppose it will (performance wise) not matter that much if we would do the conversion to a single array ourselves and do a single coords()
call, or or call coords()
multiple times and let the underlying writer do the conversion?
Especially since we don't know for sure if the output actually needs single buffers (like for WKT/WKB it would iterate over the coords pairs anyway then, when passing a single CoordView).
Unless the typicaly coords()
visitor implementation might have a bunch of overhead per call? In that case it might be beneficial to limit the number of calls to it (I suppose I should just test the difference .. ;))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably better to limit the calls to the coords()
callback if it's easy to do it (but not necessary for this PR as long as it is correct!). On other end, we also have to make a copy of the coordinates we send via the callback (i.e., to populate the requisite double[]
), so you don't want to send all of them at once. One other slight wrench in that is that some visitor implementations can "return early" (e.g., the WKT formatter, which has a character limit after which it returns EAGAIN
to force the reader to stop). As a data point, geoarrow-c's WKB reader buffers 3072 ordinates (had to be divisible by 2, 3, and 4) at a time:
Adding the ability to transmit more than one coordinate on this callback was one of the things I changed between wk's design (early in my career writing any C code whatsoever) and this visitor. I also never measured the difference comprehensively though...in theory this is in such a tight loop that the function pointer call does matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, then I think I will leave it at a coords()
call per point for this PR, and we can optimize that later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note that some basic tests are going to be needed at some point. At least for this PR, I think you could replace the WKTWriter
and add a few cases to the "wkt roundtrip" tests. The nice part about the visitor pattern is that you can basically just check WKT output and in theory everything else "just works" (or is a problem with code that isn't yours 🙂 ).
Yes I know ;) For now I have been testing locally directly with wiring it up in spherely and so tests in python, and this confirms the basics are working. But definitely will have to add tests in s2geography as well! |
src/vendored/geoarrow/geoarrow.c
Outdated
writer->precision = 16; | ||
private->precision = 16; | ||
writer->use_flat_multipoint = 1; | ||
private->use_flat_multipoint = 1; | ||
// TODO should be changed upstream to make this configurable | ||
writer->precision = 6; | ||
private->precision = 6; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@paleolimbot what would you recommend here?
Pass that as an extra param to GeoArrowArrayWriterInitFromSchema, which can then pass it down here?
Or a setter after the object has been initialized? (I suppose that is more backwards compatible)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! Hang tight, I'll fix that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this is merged into s2geography (or should be!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
OK, I deleted most of the WKT writer implementation (which I essentially had moved to the geoarrow writer already, updating the "handler" pattern for "visitor" code), and so that this now reuses the geoarrow based writer, like is done for the WKTReader as well. And expanded the WKT roundtrip tests which thus indirectly test the GeoArrow::Writer |
S2LatLng ll(loop->vertex(loop->num_vertices() - 1)); | ||
coords_[0] = ll.lng().degrees(); | ||
coords_[1] = ll.lat().degrees(); | ||
GEOARROW_RETURN_NOT_OK(visitor_.coords(&visitor_, &coords_view_)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@paleolimbot In the case of a hole, the code here (which I essentially took from the wkt writer) explicitly repeats the first coordinate pair (i.e. in the last vertex because we are iterating reversed, so we started with the last).
But for VisitLoopShell
above, the code does not do that (it just iterates over the vertices, and s2 doesn't duplicate the first/last).
In practice, though (at least in case of WKT), it seems both work fine, and so I assume that the visitor implemented by GeoArrowArrayWriter (or for WKT the GeoArrowWKTWriter used under the hood) checks if the first coordinate was repeated, and if not repeats it themselves? But is that meant to be a guarantee? Or should ideally the one calling the visitor ensure they are providing closed ring coordinates?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, nevermind, I just noticed that in the case for the shell, we are looping until i <= loop->num_vertices()
and not until num_vertices() - 1
. So the loop.vertex(i)
call wraps around, and so we are actually explicitly adding the first coordinate as well
(that trick just not works when iterating in reverse)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember fiddling a lot with that, but I think it exploits the special behaviour of vertex()
:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, that's the explanation I also had found
S2LatLng ll(loop->vertex(loop->num_vertices() - 1)); | ||
coords_[0] = ll.lng().degrees(); | ||
coords_[1] = ll.lat().degrees(); | ||
GEOARROW_RETURN_NOT_OK(visitor_.coords(&visitor_, &coords_view_)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember fiddling a lot with that, but I think it exploits the special behaviour of vertex()
:
GEOARROW_DIMENSIONS_XY)); | ||
|
||
for (const auto& child_geog : geog.Features()) { | ||
auto child_point = dynamic_cast<const PointGeography*>(child_geog.get()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These dynamic casts are necessary given the current design of the Geography
but hopefully I'll find a way to clean that up (maybe making an internal but virtual Visit(void* visitor)
method on the Geography).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For spherely, we recently added the type as essentially an attribute on the spherely::Geography C++ class (i.e. which otherwise wraps the s2geography::Geography class): https://github.com/benbovy/spherely/blob/4e39dee1bf46c9591c68bd2b64541ff7c2f8b14f/src/geography.cpp#L116-L162, so whenever a Geography object gets constructed we also store the type (https://github.com/benbovy/spherely/blob/4e39dee1bf46c9591c68bd2b64541ff7c2f8b14f/src/geography.hpp#L55-L59). Not sure if something similar could/should be moved upstream at some point.
(the clone method a bit higher up in the file is certainly a candidate of moving into s2geography)
|
||
void Init(OutputType output_type, const ExportOptions& options); | ||
|
||
void WriteGeography(const Geography& geog); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you also need a WriteNull()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you also need a WriteNull()?
That's a good point (for spherely actually not yet, because we don't yet support missing values, but we should support it here in general).
Add a separate method for it, or allow the geog
to be a nullptr? (and check that inside WriteGeography
?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably a separate method!
src/vendored/geoarrow/geoarrow.c
Outdated
writer->precision = 16; | ||
private->precision = 16; | ||
writer->use_flat_multipoint = 1; | ||
private->use_flat_multipoint = 1; | ||
// TODO should be changed upstream to make this configurable | ||
writer->precision = 6; | ||
private->precision = 6; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WKTWriter writer_default; | ||
EXPECT_EQ(writer_default.write_feature(*geog), "POINT (0 3.333333333333334)"); | ||
EXPECT_EQ(writer_default.write_feature(*geog), "POINT (0 3.3333333333333344)"); | ||
|
||
WKTWriter writer_6digits(6); | ||
EXPECT_EQ(writer_6digits.write_feature(*geog), "POINT (0 3.33333)"); | ||
|
||
EXPECT_EQ(writer_6digits.write_feature(*geog), "POINT (0 3.333333)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is the main difference I see compared to the previous WKTWriter implementation (maybe because of using ryu?), so that the significant digits are interpreted differently.
(actually, the significant_digits
name is then not really correct any more, and should we use precision
here as well like is used in geoarrow-c?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, precision is a better name. I think the previous version was using the sprintf()
(or maybe the ostream
)-based implementation, which uses slightly different words for all of that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this!
There are a few follow-ups (null writing, maybe more efficient coordinate things, tessellation, more comprehensive testing), but I think those are more effectively handled one at a time based on this work.
Yep, indeed, and for the tessellation I already have a WIP branch Will just rename the keyword to |
OK, all green! |
(I think you have write access too? Feel free to merge things!) |
I think I forgot to accept your invite and it expired .. :) |
Similar to #28, just opening to be able to add some comments / discuss the design.
For now it just can process points, without any options (but the points do work, as being tested with benbovy/spherely#53