Apply filters to a Hugging Face dataset to avoid repeating all variants. (#719)

One remaining issue: `regex` is a regular expression while `includes` is
a glob pattern, so I use heuristics to convert one into the other. I don't
think this is a problem for Hugging Face datasets, since we control their
form, but a generic conversion would be challenging. The best fix would be
to use either regular expressions or glob patterns everywhere.
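
To illustrate why a generic conversion is hard, here is a minimal sketch in Python. It is not the code from this commit. The function names `glob_to_regex` and `simple_regex_to_glob` are hypothetical, and the regex-to-glob direction only covers a small, assumed subset of regex syntax (anchors, `.*`, `.`, and escaped punctuation). Anything outside that subset is rejected rather than guessed at.

```python
import fnmatch


def glob_to_regex(pattern: str) -> str:
    """Glob -> regex is the easy direction: the standard library does it."""
    return fnmatch.translate(pattern)


def simple_regex_to_glob(pattern: str) -> str:
    """Heuristically convert a *simple* regex into a glob (hypothetical helper).

    Supported: a leading `^`, a trailing `$`, `.*`, `.`, escaped punctuation,
    and plain literals. Everything else raises ValueError, because globs
    cannot express alternation, groups, or general quantifiers.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "^" and i == 0:
            i += 1  # globs are implicitly anchored at the start
        elif ch == "$" and i == len(pattern) - 1:
            i += 1  # ...and at the end
        elif pattern.startswith(".*", i):
            out.append("*")
            i += 2
        elif ch == ".":
            out.append("?")
            i += 1
        elif ch == "\\" and i + 1 < len(pattern):
            literal = pattern[i + 1]
            if literal.isalnum():
                # `\d`, `\w`, `\b`, ... are character classes, not literals.
                raise ValueError(f"unsupported escape: \\{literal}")
            # Protect characters that are special in glob syntax.
            out.append(f"[{literal}]" if literal in "*?[" else literal)
            i += 2
        elif ch in "[](){}|+?*^$\\":
            raise ValueError(f"unsupported regex syntax at index {i}: {ch!r}")
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For example, `simple_regex_to_glob(r"^data/.*\.jsonl$")` returns `data/*.jsonl`, while a pattern such as `a|b` raises `ValueError`. That failure case is the reason a single pattern syntax everywhere would be more robust than converting between the two.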
marcenacp authored Jul 19, 2024
1 parent eaf6d61 commit 6fc0adb
Showing 7 changed files with 35 additions and 404 deletions.
238 changes: 0 additions & 238 deletions datasets/0.8/huggingface-c4/metadata.json

This file was deleted.

1 change: 0 additions & 1 deletion datasets/0.8/huggingface-c4/output/en.jsonl

This file was deleted.
