feat: make it possible to use rowid and rowaddr in filters #2973

westonpace · 2024-10-03T16:53:17Z

This is particularly useful for operations like "delete by id".

I ran into a fair bit of difficulty with this PR and punted a few things to follow-ups (#2971, #2972).

westonpace · 2024-10-03T16:54:46Z

python/python/lance/dataset.py

 batch_size: Optional[int] = None,
 batch_readahead: Optional[int] = None,
 fragment_readahead: Optional[int] = None,
- scan_in_order: bool = True,
+ scan_in_order: bool = None,
 fragments: Optional[Iterable[LanceFragment]] = None,
 full_text_query: Optional[Union[str, dict]] = None,
 *,
- prefilter: bool = False,
- with_row_id: bool = False,
- with_row_address: bool = False,
- use_stats: bool = True,
- fast_search: bool = False,
+ prefilter: bool = None,
+ with_row_id: bool = None,
+ with_row_address: bool = None,
+ use_stats: bool = None,
+ fast_search: bool = None,


These default changes should not be breaking, since the defaults in ScannerBuilder match the defaults that used to be here.

By using None we can tell whether the user explicitly passed a value, in which case it overrides whatever is in default_scan_options.
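The None-as-sentinel merge described in this comment can be sketched roughly as follows. This is a minimal illustration, not the actual lance code: `BUILDER_DEFAULTS` and `resolve_scan_options` are hypothetical names, and the default values are simply copied from the old keyword defaults in the diff above.

```python
from typing import Any, Dict, Optional

# Hypothetical stand-in for the defaults that live in ScannerBuilder.
# The values mirror the old Python keyword defaults shown in the diff.
BUILDER_DEFAULTS: Dict[str, Any] = {
    "scan_in_order": True,
    "prefilter": False,
    "with_row_id": False,
    "with_row_address": False,
    "use_stats": True,
    "fast_search": False,
}


def resolve_scan_options(
    default_scan_options: Optional[Dict[str, Any]] = None,
    **call_options: Optional[Any],
) -> Dict[str, Any]:
    """Merge options; None means "the caller did not specify this"."""
    resolved = dict(BUILDER_DEFAULTS)
    # Dataset-level defaults override the builder defaults.
    resolved.update(default_scan_options or {})
    # Per-call values override both, but only when explicitly given.
    for name, value in call_options.items():
        if value is not None:
            resolved[name] = value
    return resolved


dataset_defaults = {"with_row_id": True}

# Caller leaves with_row_id as None, so the dataset-level default applies.
inherited = resolve_scan_options(dataset_defaults, with_row_id=None)

# Caller passes an explicit False, which overrides the dataset-level default.
overridden = resolve_scan_options(dataset_defaults, with_row_id=False)
```

With a plain `False` keyword default, the second call would be indistinguishable from the first, which is the reason for switching the signature to `None`.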

westonpace · 2024-10-03T16:55:25Z

rust/lance/src/dataset/scanner.rs

+ /// the dataset schema (`dataset_schema`). This means that Substrait will
+ /// not be able to access columns that are not in the dataset schema (e.g.
+ /// _rowid, _rowaddr, etc.)
+ #[allow(unused)]


This was needed because dataset_schema is unused if the substrait feature is not specified.

westonpace · 2024-10-03T16:55:56Z

rust/lance/src/dataset/scanner.rs

+ ///
+ /// The schema for this conversion should be the full schema available to
+ /// the filter (`full_schema`). However, due to a limitation in the way
+ /// we do Substrait conversion today we can only do Substrait conversion with


This limitation has been filed as a follow-up in #2972.

codecov-commenter · 2024-10-03T17:17:35Z

Codecov Report

Attention: Patch coverage is 66.66667% with 20 lines in your changes missing coverage. Please review.

Project coverage is 78.77%. Comparing base (cdac5de) to head (1a142da).

Files with missing lines	Patch %	Lines
rust/lance/src/dataset/scanner.rs	66.00%	10 Missing and 7 partials ⚠️
java/core/lance-jni/src/blocking_scanner.rs	0.00%	1 Missing ⚠️
rust/lance-core/src/datatypes/schema.rs	87.50%	1 Missing ⚠️
rust/lance/src/dataset/fragment.rs	0.00%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2973   +/-   ##
=======================================
  Coverage   78.77%   78.77%           
=======================================
  Files         237      237           
  Lines       74099    74143   +44     
  Branches    74099    74143   +44     
=======================================
+ Hits        58374    58409   +35     
- Misses      12703    12711    +8     
- Partials     3022     3023    +1

Flag	Coverage Δ
unittests	`78.77% <66.66%> (+<0.01%)`	⬆️

wjones127 · 2024-10-03T17:03:21Z

java/core/lance-jni/src/blocking_scanner.rs

- RT.block_on(async { scanner.filter_substrait(substrait).await })?;
+ RT.block_on(async { scanner.filter_substrait(substrait) })?;


🤔 Why shouldn't we await that future?

westonpace · 2024-10-04

The method is no longer async. Now we don't actually compile the filters until scan time (since the input schema may depend on other calls to the scanner builder). The scanner.filter and scanner.filter_substrait methods just record whatever the user passes in.

wjones127 · 2024-10-04

If it's no longer async, you could also consider removing the wrapping RT.block_on. I think that would be a lot clearer.
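To illustrate the record-now, compile-at-scan-time design discussed in this thread, here is a small Python sketch. It is not the Rust scanner code: `SketchScanner`, its methods, and the simple column-name check standing in for filter compilation are all hypothetical. It only shows why deferring lets a filter see columns that a later builder call enables.

```python
from typing import List, Optional


class SketchScanner:
    """Hypothetical builder: filter() only records, build() validates."""

    def __init__(self, dataset_columns: List[str]) -> None:
        self._dataset_columns = list(dataset_columns)
        self._with_row_id = False
        self._filter_column: Optional[str] = None

    def with_row_id(self) -> "SketchScanner":
        self._with_row_id = True
        return self

    def filter(self, column: str) -> "SketchScanner":
        # No schema lookup here: the full schema is not final yet.
        self._filter_column = column
        return self

    def build(self) -> List[str]:
        # The schema visible to the filter depends on every builder call.
        full_schema = list(self._dataset_columns)
        if self._with_row_id:
            full_schema.append("_rowid")
        if self._filter_column is not None and self._filter_column not in full_schema:
            raise ValueError(f"unknown filter column: {self._filter_column}")
        return full_schema


# The filter is set before with_row_id(), yet still resolves at build time.
schema = SketchScanner(["x"]).filter("_rowid").with_row_id().build()
```

Because `filter()` does no work beyond storing its argument, nothing in it needs to be awaited, which matches the point that the wrapping `block_on` becomes unnecessary.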

wjones127 · 2024-10-03T17:08:32Z

python/python/tests/test_integration.py

+ ds = lance.write_dataset(tab, str(tmp_path))
+ ds = lance.dataset(str(tmp_path), default_scan_options={"with_row_id": True})


Unimportant, but these should take paths as-is:

Suggested change

- ds = lance.write_dataset(tab, str(tmp_path))
- ds = lance.dataset(str(tmp_path), default_scan_options={"with_row_id": True})
+ ds = lance.write_dataset(tab, tmp_path)
+ ds = lance.dataset(tmp_path, default_scan_options={"with_row_id": True})

github-actions bot added the enhancement, python, and java labels on Oct 3, 2024

westonpace commented Oct 3, 2024

wjones127 reviewed Oct 3, 2024

wjones127 approved these changes Oct 4, 2024

westonpace added 2 commits October 13, 2024 04:44

Make it possible to use rowid and rowaddr in filters

d3da352

Remove block_on since filter_substrait is now a synchronous method

1a142da

westonpace force-pushed the feat/meta-cols-in-filter branch from d231010 to 1a142da Compare October 13, 2024 10:46
