Merge pull request #100 from mlcommons/refacto/pierremarcenac-2
Simplify code and add developer-friendly documentation for nodes.
marcenacp authored Jul 7, 2023
2 parents c1fcbb5 + bd5fc2f commit 646d560
Showing 35 changed files with 785 additions and 1,301 deletions.
4 changes: 3 additions & 1 deletion .github/workflows/ci.yml
@@ -56,8 +56,10 @@ jobs:
run: pip install .

- name: Validate JSON-LD files
# wiki-text is excluded at the moment. See: https://github.com/mlcommons/croissant/issues/101.
# movielens is excluded at the moment. See: https://github.com/mlcommons/croissant/issues/103.
run: |
JSON_FILES=$(python -c "import os; from etils import epath; [print(os.fspath(path)) for path in epath.Path('../../datasets').glob('*/*.json')]")
JSON_FILES=$(find ../../datasets/ -type f -name "*.json" ! -path '*wiki-text*' ! -path '*movielens*')
for file in ${JSON_FILES}
do
echo "Validating ${file}..."
4 changes: 1 addition & 3 deletions datasets/movielens/metadata.json
@@ -183,7 +183,7 @@
]
},
{
"name": "movies+ratings+tags",
"name": "movies_with_ratings_with_tags",
"@type": "ml:RecordSet",
"source": "#{movies}",
"key": "#{movie_id}",
@@ -209,7 +209,6 @@
"dataType": "ml:RecordSet",
"source": "#{ratings}",
"parentField": {
"@type": "ml:Field",
"source": "#{ratings/movie_id}",
"references": "#{movies}"
},
@@ -237,7 +236,6 @@
"dataType": "ml:RecordSet",
"source": "#{tags}",
"parentField": {
"@type": "ml:Field",
"source": "#{tags/movie_id}",
"references": "#{movies}"
},
2 changes: 1 addition & 1 deletion datasets/recipes/compressed_archive.json
@@ -10,7 +10,7 @@
"source": "ml:source"
},
"@type": "sc:Dataset",
"name": "Compressed archive example",
"name": "compressed_archive_example",
"description": "This is a fairly minimal example, showing a way to describe archive files.",
"url": "https://example.com/datasets/recipes/compressed_archive/about",
"distribution": [
2 changes: 1 addition & 1 deletion datasets/recipes/enum.json
@@ -11,7 +11,7 @@
"references": "ml:references"
},
"@type": "sc:Dataset",
"name": "Enum example",
"name": "enum_example",
"description": "This is a fairly minimal example, showing a way to describe enumerations.",
"url": "https://example.com/datasets/enum/about",
"distribution": [
2 changes: 1 addition & 1 deletion datasets/recipes/minimal.json
@@ -4,7 +4,7 @@
"sc": "https://schema.org/"
},
"@type": "sc:Dataset",
"name": "Minimal example",
"name": "minimal_example",
"description": "This is a very minimal example, with only the required fields.",
"url": "https://example.com/dataset/minimal/about"
}
2 changes: 1 addition & 1 deletion datasets/recipes/minimal_recommended.json
@@ -10,7 +10,7 @@
"references": "ml:references"
},
"@type": "sc:Dataset",
"name": "Minimal example with recommended fields",
"name": "minimal_example_with_recommended_fields",
"description": "This is a minimal example, including the required and the recommended fields.",
"url": "https://example.com/dataset/recipes/minimal-recommended",
"license": "https://creativecommons.org/licenses/by/4.0/",
1 change: 1 addition & 0 deletions datasets/wiki-text/metadata.json
@@ -14,6 +14,7 @@
"applyTransform": "ml:applyTransform",
"format": "ml:format",
"regex": "ml:regex",
"replace": "ml:replace",
"separator": "ml:separator",
"references": "ml:references"
},
50 changes: 46 additions & 4 deletions python/ml_croissant/README.md
@@ -35,10 +35,52 @@ python -m pip install ".[dev]"
pytest .
```

## Roadmap
## Design

Refer to the [design doc](https://docs.google.com/document/d/1zYQIUX9ae1sZOOBq9OCsJ8JW8-Ejy3NLSeqaI5LtOEM/edit?resourcekey=0-CK78DfFvF7fnufyZqF3h3Q) for an overview of the implementation.
The most important modules in the library are:

Refer to the [GitHub project](https://github.com/orgs/mlcommons/projects/26) for more detailed user stories.
- [`ml_croissant/_src/structure_graph`](./ml_croissant/_src/structure_graph/graph.py) is responsible for the **static analysis** of the Croissant files. We convert Croissant files to a Python representation called "**structure graph**" (using [NetworkX](https://networkx.org/)). In the process, we catch any static analysis issues (e.g., a missing mandatory field or a logic problem in the file).
- [`ml_croissant/_src/operation_graph`](./ml_croissant/_src/operation_graph/graph.py) is responsible for the **dynamic analysis** of the Croissant files (i.e., actually loading the dataset by yielding examples). We convert the structure graph into an "**operation graph**". Operations are the unit transformations that build the dataset (such as [`Download`](./ml_croissant/_src/operation_graph/operations/download.py), [`Extract`](./ml_croissant/_src/operation_graph/operations/extract.py), etc.).
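
As a rough illustration of the two graphs described above (not the library's actual code), the sketch below models a structure graph as a plain dependency mapping and derives an ordered list of steps from it, the way an operation graph would have to run them. The node names and the `build_operations` helper are hypothetical, and the standard-library `graphlib` stands in for NetworkX so that the example has no dependencies.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical structure graph: each node maps to the set of nodes it depends on.
# The node names are invented for illustration only.
STRUCTURE_GRAPH = {
    "archive.zip": set(),
    "ratings.csv": {"archive.zip"},
    "ratings": {"ratings.csv"},
    "ratings/movie_id": {"ratings"},
}


def build_operations(structure_graph):
    """Returns the node names in an order that respects their dependencies.

    A cycle is the kind of logic problem static analysis should catch, so it
    is reported as a ValueError before anything is loaded.
    """
    try:
        return list(TopologicalSorter(structure_graph).static_order())
    except CycleError as error:
        # error.args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"cyclic dependency: {error.args[1]}") from error


operations = build_operations(STRUCTURE_GRAPH)
print(operations)
```

Each name in the resulting list would then be mapped to a concrete operation (download, extract, read, and so on) that runs only after its dependencies have run.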

All contributions are welcome! We even have [good first issues](https://github.com/mlcommons/croissant/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) to start in the project.
Other important modules are:

- [`ml_croissant/_src/core`](./ml_croissant/_src/core) defines the core internals. For instance, [`Issues`](./ml_croissant/_src/core/issues.py) are a way to track errors and warnings during the analysis of Croissant files.
- [`ml_croissant/__init__`](./ml_croissant/__init__.py) declares the public API with [`ml_croissant.Dataset`](./ml_croissant/_src/datasets.py).
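
To make the idea of tracking errors and warnings concrete, here is a minimal sketch of an issue collector. It is not the library's `Issues` class: the method names, the `check_metadata` helper, and the choice of `name` and `url` as mandatory fields and `license` as a recommended field are assumptions made for illustration.

```python
import dataclasses


@dataclasses.dataclass
class Issues:
    """Hypothetical collector; the real class in core/issues.py may differ."""

    errors: list[str] = dataclasses.field(default_factory=list)
    warnings: list[str] = dataclasses.field(default_factory=list)

    def add_error(self, message: str) -> None:
        self.errors.append(message)

    def add_warning(self, message: str) -> None:
        self.warnings.append(message)

    def report(self) -> str:
        """Formats all collected issues, errors first."""
        lines = [f"ERROR: {message}" for message in self.errors]
        lines += [f"WARNING: {message}" for message in self.warnings]
        return "\n".join(lines)


def check_metadata(metadata: dict, issues: Issues) -> None:
    """Records problems instead of raising on the first one (assumed rules)."""
    for key in ("name", "url"):
        if key not in metadata:
            issues.add_error(f'missing mandatory field "{key}"')
    if "license" not in metadata:
        issues.add_warning('missing recommended field "license"')


issues = Issues()
check_metadata({"name": "minimal_example"}, issues)
print(issues.report())
```

Collecting issues rather than raising immediately lets a single validation run report every problem in a file at once.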

For the full design and an overview of the implementation, refer to the [design doc](https://docs.google.com/document/d/1zYQIUX9ae1sZOOBq9OCsJ8JW8-Ejy3NLSeqaI5LtOEM/edit?resourcekey=0-CK78DfFvF7fnufyZqF3h3Q).

## Contribute

All contributions are welcome! We even have [good first issues](https://github.com/mlcommons/croissant/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) to help you get started with the project. Refer to the [GitHub project](https://github.com/orgs/mlcommons/projects/26) for more detailed user stories.

The development workflow is as follows:

- [Fork](https://docs.github.com/en/get-started/quickstart/fork-a-repo) the repository: https://github.com/mlcommons/croissant.
- Clone the newly forked repository:
```bash
git clone git@github.com:<YOUR_GITHUB_LDAP>/croissant.git
```
- Create a new branch:
```bash
cd croissant/
git checkout -b feature/my-awesome-new-feature
```
- Code the feature. We support [VS Code](https://code.visualstudio.com) with pre-set settings.
- Push to GitHub:
```bash
git add .
git commit -m "Describe my awesome new feature"
git push --set-upstream origin feature/my-awesome-new-feature
```
- Open a pull request (PR) against the main branch of https://github.com/mlcommons/croissant, and ask for feedback!

## Debug

You can debug the validation of a Croissant file with the `--debug` flag:

```bash
python scripts/validate.py --file ../../datasets/titanic/metadata.json --debug
```

This will:
1. print extra information, like the generated nodes;
2. save the generated structure graph to a folder indicated in the logs.