Merge pull request #100 from mlcommons/refacto/pierremarcenac-2

Simplify code and add developer-friendly documentation for nodes.
mlcommons · Jul 7, 2023 · 646d560
2 parents c1fcbb5 + bd5fc2f

Showing 35 changed files with 785 additions and 1,301 deletions.
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -56,8 +56,10 @@ jobs:
  run: pip install .
 
  - name: Validate JSON-LD files
+ # wiki-text is excluded at the moment. See: https://github.com/mlcommons/croissant/issues/101.
+ # movielens is excluded at the moment. See: https://github.com/mlcommons/croissant/issues/103.
  run: |
- JSON_FILES=$(python -c "import os; from etils import epath; [print(os.fspath(path)) for path in epath.Path('../../datasets').glob('*/*.json')]")
+ JSON_FILES=$(find ../../datasets/ -type f -name "*.json" ! -path '*wiki-text*' ! -path '*movielens*')
  for file in ${JSON_FILES}
  do
  echo "Validating ${file}..."
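
The new `find` command selects JSON files recursively and skips any path containing `wiki-text` or `movielens`, whereas the replaced `glob('*/*.json')` call only matched files exactly one directory below `datasets`. As a rough illustration of that selection logic (not part of this commit; the function name `json_files_to_validate` is hypothetical), an equivalent filter in Python might look like:

```python
from pathlib import Path

# Path fragments skipped by the CI step (copied from the `find` command above).
EXCLUDED = ("wiki-text", "movielens")


def json_files_to_validate(root: Path) -> list[Path]:
    """Return every *.json file under `root` whose path mentions no excluded name."""
    return sorted(
        path
        for path in root.rglob("*.json")  # recursive, like `find`
        if path.is_file()
        and not any(name in path.as_posix() for name in EXCLUDED)
    )
```

Like `! -path '*movielens*'`, the substring test applies to the whole path, so a file nested at any depth under an excluded dataset is skipped.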

diff --git a/datasets/movielens/metadata.json b/datasets/movielens/metadata.json
@@ -183,7 +183,7 @@
  ]
  },
  {
- "name": "movies+ratings+tags",
+ "name": "movies_with_ratings_with_tags",
  "@type": "ml:RecordSet",
  "source": "#{movies}",
  "key": "#{movie_id}",
@@ -209,7 +209,6 @@
  "dataType": "ml:RecordSet",
  "source": "#{ratings}",
  "parentField": {
- "@type": "ml:Field",
  "source": "#{ratings/movie_id}",
  "references": "#{movies}"
  },
@@ -237,7 +236,6 @@
  "dataType": "ml:RecordSet",
  "source": "#{tags}",
  "parentField": {
- "@type": "ml:Field",
  "source": "#{tags/movie_id}",
  "references": "#{movies}"
  },

diff --git a/datasets/recipes/compressed_archive.json b/datasets/recipes/compressed_archive.json
@@ -10,7 +10,7 @@
  "source": "ml:source"
  },
  "@type": "sc:Dataset",
- "name": "Compressed archive example",
+ "name": "compressed_archive_example",
  "description": "This is a fairly minimal example, showing a way to describe archive files.",
  "url": "https://example.com/datasets/recipes/compressed_archive/about",
  "distribution": [

diff --git a/datasets/recipes/enum.json b/datasets/recipes/enum.json
@@ -11,7 +11,7 @@
  "references": "ml:references"
  },
  "@type": "sc:Dataset",
- "name": "Enum example",
+ "name": "enum_example",
  "description": "This is a fairly minimal example, showing a way to describe enumerations.",
  "url": "https://example.com/datasets/enum/about",
  "distribution": [

diff --git a/datasets/recipes/minimal.json b/datasets/recipes/minimal.json
@@ -4,7 +4,7 @@
  "sc": "https://schema.org/"
  },
  "@type": "sc:Dataset",
- "name": "Minimal example",
+ "name": "minimal_example",
  "description": "This is a very minimal example, with only the required fields.",
  "url": "https://example.com/dataset/minimal/about"
 }
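
The renames in this commit all move `name` values toward identifier-like strings (for example `Minimal example` becomes `minimal_example`). The diff does not state the exact rule being enforced, so the following is only a sketch under the assumption that names should consist of ASCII letters, digits and underscores; both helper names are hypothetical and are not part of `ml_croissant`:

```python
import re


def to_identifier(name: str) -> str:
    """Lower-case `name` and collapse each run of other characters into "_"."""
    return re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")


def is_identifier_like(name: str) -> bool:
    """Assumed rule: only ASCII letters, digits and underscores are accepted."""
    return re.fullmatch(r"[A-Za-z0-9_]+", name) is not None
```

A mechanical conversion like this reproduces the recipe renames, but it would turn `movies+ratings+tags` into `movies_ratings_tags`; the commit instead chose the more descriptive `movies_with_ratings_with_tags` by hand.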
diff --git a/datasets/recipes/minimal_recommended.json b/datasets/recipes/minimal_recommended.json
@@ -10,7 +10,7 @@
  "references": "ml:references"
  },
  "@type": "sc:Dataset",
- "name": "Minimal example with recommended fields",
+ "name": "minimal_example_with_recommended_fields",
  "description": "This is a minimal example, including the required and the recommended fields.",
  "url": "https://example.com/dataset/recipes/minimal-recommended",
  "license": "https://creativecommons.org/licenses/by/4.0/",

diff --git a/datasets/wiki-text/metadata.json b/datasets/wiki-text/metadata.json
@@ -14,6 +14,7 @@
  "applyTransform": "ml:applyTransform",
  "format": "ml:format",
  "regex": "ml:regex",
+ "replace": "ml:replace",
  "separator": "ml:separator",
  "references": "ml:references"
  },

diff --git a/python/ml_croissant/README.md b/python/ml_croissant/README.md
@@ -35,10 +35,52 @@ python -m pip install ".[dev]"
 pytest .
 ```
 
-## Roadmap
+## Design
 
-Refer to the [design doc](https://docs.google.com/document/d/1zYQIUX9ae1sZOOBq9OCsJ8JW8-Ejy3NLSeqaI5LtOEM/edit?resourcekey=0-CK78DfFvF7fnufyZqF3h3Q) for an overview of the implementation.
+The most important modules in the library are:
 
-Refer to the [GitHub project](https://github.com/orgs/mlcommons/projects/26) for more detailed user stories.
+- [`ml_croissant/_src/structure_graph`](./ml_croissant/_src/structure_graph/graph.py) is responsible for the **static analysis** of the Croissant files. We convert Croissant files to a Python representation called "**structure graph**" (using [NetworkX](https://networkx.org/)). In the process, we catch any static analysis issues (e.g., a missing mandatory field or a logic problem in the file).
+- [`ml_croissant/_src/operation_graph`](./ml_croissant/_src/operation_graph/graph.py) is responsible for the **dynamic analysis** of the Croissant files (i.e., actually loading the dataset by yielding examples). We convert the structure graph into an "**operation graph**". Operations are the unit transformations used to build the dataset (such as [`Download`](./ml_croissant/_src/operation_graph/operations/download.py), [`Extract`](./ml_croissant/_src/operation_graph/operations/extract.py), etc.).
 
-All contributions are welcome! We even have [good first issues](https://github.com/mlcommons/croissant/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) to start in the project.
+Other important modules are:
+
+- [`ml_croissant/_src/core`](./ml_croissant/_src/core) defines the core internals. For instance, [`Issues`](./ml_croissant/_src/core/issues.py) are a way to track errors and warnings during the analysis of Croissant files.
+- [`ml_croissant/__init__`](./ml_croissant/__init__.py) declares the public API with [`ml_croissant.Dataset`](./ml_croissant/_src/datasets.py).
+
+For the full design and an overview of the implementation, refer to the [design doc](https://docs.google.com/document/d/1zYQIUX9ae1sZOOBq9OCsJ8JW8-Ejy3NLSeqaI5LtOEM/edit?resourcekey=0-CK78DfFvF7fnufyZqF3h3Q).
+
+## Contribute
+
+All contributions are welcome! We even have [good first issues](https://github.com/mlcommons/croissant/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) to help you get started with the project. Refer to the [GitHub project](https://github.com/orgs/mlcommons/projects/26) for more detailed user stories.
+
+The development workflow is as follows:
+
+- [Fork](https://docs.github.com/en/get-started/quickstart/fork-a-repo) the repository: https://github.com/mlcommons/croissant.
+- Clone the newly forked repository:
+ ```bash
+ git clone git@github.com:<YOUR_GITHUB_LDAP>/croissant.git
+ ```
+- Create a new branch:
+ ```bash
+ cd croissant/
+ git checkout -b feature/my-awesome-new-feature
+ ```
+- Code the feature. We support [VS Code](https://code.visualstudio.com) with pre-set settings.
+- Push to GitHub:
+ ```bash
+ git add . && git commit -m "Add my awesome new feature"
+ git push --set-upstream origin feature/my-awesome-new-feature
+ ```
+- Open a pull request (PR) against the main branch of https://github.com/mlcommons/croissant, and ask for feedback!
+
+## Debug
+
+You can debug the validation of the file with the `--debug` flag:
+
+```bash
+python scripts/validate.py --file ../../datasets/titanic/metadata.json --debug
+```
+
+This will:
+1. print extra information, like the generated nodes;
+2. save the generated structure graph to a folder indicated in the logs.
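
The Design section of the README above describes two phases: a static pass that turns a Croissant file into a structure graph while collecting issues, and a dynamic pass that derives an ordered set of operations from that graph. The sketch below illustrates that split only; it is not the library's implementation. The standard library's `graphlib` stands in for NetworkX to keep the example dependency-free, and the `Issues` fields, the node layout, and the two validation rules are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter


@dataclass
class Issues:
    """Collects problems instead of raising on the first one (assumed shape)."""

    errors: list[str] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)


def check_structure(nodes: dict[str, dict], issues: Issues) -> dict[str, set[str]]:
    """Static pass: map each node to its known sources and record issues."""
    known = set(nodes)
    graph: dict[str, set[str]] = {}
    for name, node in nodes.items():
        # Both rules below are invented for this example.
        if "@type" not in node:
            issues.errors.append(f"{name}: missing mandatory field '@type'")
        if "description" not in node:
            issues.warnings.append(f"{name}: missing recommended field 'description'")
        sources = set(node.get("source", []))
        for source in sorted(sources - known):
            issues.errors.append(f"{name}: unknown source '{source}'")
        graph[name] = sources & known
    return graph


def plan_operations(graph: dict[str, set[str]], issues: Issues) -> list[str]:
    """Dynamic pass: order nodes so each source is handled before its users."""
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError:
        issues.errors.append("cyclic dependency between nodes")
        return []


# Hypothetical three-node dataset: an archive, a file inside it, a record set.
nodes = {
    "archive.zip": {"@type": "sc:FileObject", "description": "raw download"},
    "ratings.csv": {"@type": "sc:FileObject", "source": ["archive.zip"]},
    "ratings": {
        "@type": "ml:RecordSet",
        "source": ["ratings.csv"],
        "description": "one row per rating",
    },
}
issues = Issues()
order = plan_operations(check_structure(nodes, issues), issues)
```

Collecting issues in one object rather than raising immediately lets a validator report every problem in a file in a single run, which matches the README's description of `Issues` as a way to track errors and warnings during analysis.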