Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: make it possible to use rowid and rowaddr in filters #2973

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

westonpace
Copy link
Contributor

This is particularly useful for operations like "delete by id"

I ran into a fair bit of difficulty with this PR and punted a few things to follow-ups (#2971 #2972).

@github-actions github-actions bot added enhancement New feature or request python java labels Oct 3, 2024
Comment on lines 272 to +285
batch_size: Optional[int] = None,
batch_readahead: Optional[int] = None,
fragment_readahead: Optional[int] = None,
scan_in_order: bool = True,
scan_in_order: bool = None,
fragments: Optional[Iterable[LanceFragment]] = None,
full_text_query: Optional[Union[str, dict]] = None,
*,
prefilter: bool = False,
with_row_id: bool = False,
with_row_address: bool = False,
use_stats: bool = True,
fast_search: bool = False,
prefilter: bool = None,
with_row_id: bool = None,
with_row_address: bool = None,
use_stats: bool = None,
fast_search: bool = None,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes in defaults should not be breaking changes since the defaults in ScannerBuilder match the defaults that used to be here.

By using None we can easily tell if the user is specifying a non-default value, in which case we will override whatever is in default_scan_options.

/// the dataset schema (`dataset_schema`). This means that Substrait will
/// not be able to access columns that are not in the dataset schema (e.g.
/// _rowid, _rowaddr, etc.)
#[allow(unused)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was needed because dataset_schema is unused if the substrait feature is not specified.

///
/// The schema for this conversion should be the full schema available to
/// the filter (`full_schema`). However, due to a limitation in the way
/// we do Substrait conversion today we can only do Substrait conversion with
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lilmitation has been filed as a follow-up in #2972

@codecov-commenter
Copy link

codecov-commenter commented Oct 3, 2024

Codecov Report

Attention: Patch coverage is 66.66667% with 20 lines in your changes missing coverage. Please review.

Project coverage is 78.77%. Comparing base (cdac5de) to head (1a142da).

Files with missing lines Patch % Lines
rust/lance/src/dataset/scanner.rs 66.00% 10 Missing and 7 partials ⚠️
java/core/lance-jni/src/blocking_scanner.rs 0.00% 1 Missing ⚠️
rust/lance-core/src/datatypes/schema.rs 87.50% 1 Missing ⚠️
rust/lance/src/dataset/fragment.rs 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2973   +/-   ##
=======================================
  Coverage   78.77%   78.77%           
=======================================
  Files         237      237           
  Lines       74099    74143   +44     
  Branches    74099    74143   +44     
=======================================
+ Hits        58374    58409   +35     
- Misses      12703    12711    +8     
- Partials     3022     3023    +1     
Flag Coverage Δ
unittests 78.77% <66.66%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

RT.block_on(async { scanner.filter_substrait(substrait).await })?;
RT.block_on(async { scanner.filter_substrait(substrait) })?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 Why shouldn't we await that future?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method is no longer async. Now we don't actually compile the filters until scan time (since the input schema may depend on other calls to the scanner builder). The scanner.filter and scanner.filter_substrait methods just record whatever the user passes in.

Copy link
Contributor

@wjones127 wjones127 Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's no longer async, you could also consider removing the wrapping RT.block_on. I think that would be a lot clearer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦

python/python/tests/test_integration.py Show resolved Hide resolved
Comment on lines +13 to +14
ds = lance.write_dataset(tab, str(tmp_path))
ds = lance.dataset(str(tmp_path), default_scan_options={"with_row_id": True})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unimportant, but these should take paths as-is:

Suggested change
ds = lance.write_dataset(tab, str(tmp_path))
ds = lance.dataset(str(tmp_path), default_scan_options={"with_row_id": True})
ds = lance.write_dataset(tab, tmp_path)
ds = lance.dataset(tmp_path, default_scan_options={"with_row_id": True})

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request java python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants