Switch from sphinx to use mkdocs #209

Merged: 15 commits (Nov 15, 2023)
@@ -1,6 +1,6 @@
name: Documentation_deploy
run-name: ${{ github.actor }} triggered doc generation
on:
name: Documentation_deploy_mkdocs
run-name: ${{ github.actor }} triggered mkdocs generation
on:
pull_request:
types:
- closed
@@ -16,9 +16,9 @@ on:
- 'docs/**'
permissions:
contents: write

jobs:
Sphinx_Doc_generation:
Mkdocs_Doc_generation:
if: github.event.pull_request.merged == true
runs-on: ubuntu-latest
steps:
@@ -34,19 +34,30 @@ jobs:
cache-dependency-path: '**/pip'
run: echo '${{ steps.cp38.outputs.cache-hit }}'

- name: Set pip cache directory path
id: pip-cache-dir-path
run: |
echo "PIPCACHE=$(pip cache dir)" >> "$GITHUB_OUTPUT"

- name: Get pip cache dir
env:
PIPCACHE: ${{ steps.pip-cache-dir-path.outputs.PIPCACHE }}
run: echo "The pip cache dir located is $PIPCACHE"

- name: Install Dependencies
run: |
pip install -e .[doc]

- name: Sphinx Build
- name: mkdocs deploy
run: |
sh scripts/setup/docs/build_sphinx_docs.sh
mkdocs build

- name: Deploy to GitHub Pages
uses: peaceiris/actions-gh-pages@v3
if: ${{ github.event_name == 'pull_request' && github.ref == 'refs/heads/main' }}
with:
publish_branch: gh-pages
publish_branch: mkdocs
github_token: ${{ secrets.GITHUB_TOKEN }}
publish_dir: docs/build/html
publish_dir: ./site
force_orphan: true
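The two pip-cache steps above hand a value between steps via the `$GITHUB_OUTPUT` key=value protocol. A minimal sketch of that hand-off, with a temp file standing in for the runner's output file and a fixed path standing in for the `pip cache dir` result (both assumptions, so the sketch runs anywhere):

```python
import tempfile

# Stand-ins: a temp file plays the role of $GITHUB_OUTPUT, and a fixed
# path plays the role of the `pip cache dir` command's output.
pip_cache = "/tmp/pip-cache"

with tempfile.NamedTemporaryFile("w+", suffix=".out") as github_output:
    # Step "Set pip cache directory path": append a KEY=value line.
    github_output.write(f"PIPCACHE={pip_cache}\n")
    github_output.flush()

    # Step "Get pip cache dir": a later step reads the value back by key.
    github_output.seek(0)
    outputs = dict(line.rstrip("\n").split("=", 1) for line in github_output)

print(f"The pip cache dir located is {outputs['PIPCACHE']}")
```

Note that the command substitution must be `$(pip cache dir)` with no extra parentheses, otherwise the literal `(` and `)` end up in the stored value.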

32 changes: 32 additions & 0 deletions docs/gen_ref_pages.py
@@ -0,0 +1,32 @@
"""Generate the code reference pages for mkdocs."""

from pathlib import Path

import mkdocs_gen_files

nav = mkdocs_gen_files.Nav()

for path in sorted(Path("src").rglob("*.py")):
    module_path = path.with_suffix("")
    doc_path = path.relative_to("src").with_suffix(".md")
    full_doc_path = Path("reference", doc_path)

    parts = tuple(module_path.parts)

    if parts[-1] == "__init__":
        parts = parts[:-1]
        doc_path = doc_path.with_name("index.md")
        full_doc_path = full_doc_path.with_name("index.md")
    elif parts[-1] == "__main__":
        continue

    nav[parts] = doc_path.as_posix()

    with mkdocs_gen_files.open(full_doc_path, "w") as fd:
        identifier = ".".join(parts)
        print("::: " + identifier, file=fd)

    mkdocs_gen_files.set_edit_path(full_doc_path, Path("../") / path)

with mkdocs_gen_files.open("reference/SUMMARY.md", "w") as nav_file:
nav_file.writelines(nav.build_literate_nav())
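The path mapping in the loop above can be checked in isolation. A standalone sketch of the same logic using pure `pathlib`, with no `mkdocs_gen_files` dependency; the input paths are illustrative, taken from (or modeled on) this repository's layout:

```python
from pathlib import Path


def doc_paths(src_file: str):
    """Mirror the script's mapping: source file -> (identifier, doc page path)."""
    path = Path(src_file)
    module_path = path.with_suffix("")
    doc_path = path.relative_to("src").with_suffix(".md")
    full_doc_path = Path("reference", doc_path)

    parts = tuple(module_path.parts)
    if parts[-1] == "__init__":
        parts = parts[:-1]
        full_doc_path = full_doc_path.with_name("index.md")
    elif parts[-1] == "__main__":
        return None  # skipped, as in the script

    return ".".join(parts), full_doc_path.as_posix()


# A real module from the project layout, and a hypothetical __main__:
print(doc_paths("src/ensembl/io/genomio/fasta/process.py"))
print(doc_paths("src/ensembl/__main__.py"))
```

One observable quirk: the identifier keeps the leading `src.` segment, because the script builds `parts` from `module_path` without first making it relative to `src`.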
Binary file added docs/img/logo.png
Binary file added docs/img/metazoa_logo.png
94 changes: 94 additions & 0 deletions docs/index.md
@@ -0,0 +1,94 @@
# [Ensembl GenomIO](https://github.com/Ensembl/ensembl-genomio)

*Ensembl-genomIO Base Library Documentation*

A repository dedicated to pipelines that turn basic genomic data into formatted
Ensembl core databases. It also allows users to dump core databases into various formats.

File formats handled: FASTA, GFF3 and JSON (*following BRC4 specifications*).

Contents
--------
Check out the [installation](install.md) section for further information on how
to install the project.

1. [Usage](usage.md)
2. [Install](install.md)

Ehive pipelines
-------------------------------------------
Check out the [usage](usage.md) section for further information on the requirements to
run ensembl-genomio pipelines.

1. __Genome loader__: Creates an Ensembl core database from a set of flat files.
2. __Genome dumper__: Dumps flat files from an Ensembl core database.

Nextflow pipelines
-------------------------------------------
1. __Additional seq prepare__: BRC/Ensembl metazoa pipeline. Prepares genome data loading files for adding new sequence(s) to existing species databases.
2. __Genome prepare__: BRC/Ensembl metazoa pipeline. Retrieves data for genome(s) from INSDC and RefSeq, then validates and prepares GFF3, FASTA and JSON files for each genome accession.


## Project layout
```
src/ensembl/
├── brc4
│   └── runnable
│       ├── compare_fasta.py
│       ├── compare_report.py
│       ├── core_server.py
│       ├── download_genbank.py
│       ├── dump_stable_ids.py
│       ├── extract_from_gb.py
│       ├── fill_metadata.py
│       ├── gff3_specifier.py
│       ├── integrity.py
│       ├── json_schema_factory.py
│       ├── load_sequence_data.py
│       ├── manifest.py
│       ├── manifest_stats.py
│       ├── prepare_genome.py
│       ├── read_json.py
│       ├── say_accession.py
│       └── seqregion_parser.py
└── io
    └── genomio
        ├── assembly
        │   └── download.py
        ├── database
        │   └── factory.py
        ├── events
        │   ├── dump.py
        │   ├── format.py
        │   └── load.py
        ├── fasta
        │   └── process.py
        ├── genbank
        │   ├── download.py
        │   └── extract_data.py
        ├── genome_metadata
        │   ├── dump.py
        │   ├── extend.py
        │   └── prepare.py
        ├── genome_stats
        │   ├── compare.py
        │   └── dump.py
        ├── gff3
        │   ├── extract_annotation.py
        │   └── process.py
        ├── manifest
        │   ├── check_integrity.py
        │   ├── compute_stats.py
        │   └── generate.py
        ├── schemas
        │   └── json
        │       ├── factory.py
        │       └── validate.py
        ├── seq_region
        │   ├── dump.py
        │   └── prepare.py
        └── utils
            ├── archive_utils.py
            └── json_utils.py
```

## License
Software as part of [Ensembl GenomIO](https://github.com/Ensembl/ensembl-genomio) is distributed under the [Apache-2.0 License](https://www.apache.org/licenses/LICENSE-2.0.txt).
53 changes: 53 additions & 0 deletions docs/install.md
@@ -0,0 +1,53 @@
API Setup and installation
===========================

Requirements
--------------

An Ensembl API checkout including:

- [ensembl-genomio](https://github.com/Ensembl/ensembl-genomio) (export /src/perl into PERL5LIB)
- [ensembl-hive](https://github.com/Ensembl/ensembl-hive)
- [ensembl-production](https://github.com/Ensembl/ensembl-production)
- [ensembl-analysis](https://github.com/Ensembl/ensembl-analysis/tree/dev/hive_master) (on dev/hive_master branch)
- [ensembl-taxonomy](https://github.com/Ensembl/ensembl-taxonomy)
- [ensembl-orm](https://github.com/Ensembl/ensembl-orm)

Software
--------------

- Python 3.8+
- Perl 5.26
- Bioperl 1.6.9+

Python Modules
--------------
- bcbio-gff
- biopython
- jsonschema
- intervaltree
- mysql-connector-python
- python-redmine
- requests


Installation
--------------
### Directly from GitHub:
```
git clone https://github.com/Ensembl/ensembl-genomio
git clone https://github.com/Ensembl/ensembl-analysis -b dev/hive_master
git clone https://github.com/Ensembl/ensembl-production
git clone https://github.com/Ensembl/ensembl-hive
git clone https://github.com/Ensembl/ensembl-taxonomy
git clone https://github.com/Ensembl/ensembl-orm
```


### Documentation
Documentation for Ensembl-genomio is generated using _mkdocs_. For full information visit [mkdocs.org](https://www.mkdocs.org).
#### Commands
* `mkdocs new [dir-name]` - Create a new project.
* `mkdocs serve` - Start the live-reloading docs server.
* `mkdocs build` - Build the documentation site.
* `mkdocs -h` - Print help message and exit.
47 changes: 47 additions & 0 deletions docs/pipelines.md
@@ -0,0 +1,47 @@
# Ensembl Genomio Pipelines

## Genomio prepare pipeline
_Module [Bio::EnsEMBL::Pipeline::PipeConfig::BRC4_genome_prepare_conf]_

**Genome prepare pipeline for BRC/Metazoa**

#### Description
Retrieve data for a genome from INSDC and prepare the following files in a separate folder
for each genome:

- FASTA for DNA sequences
- FASTA for protein sequences
- GFF gene models
- JSON functional annotation
- JSON seq_region
- JSON genome
- JSON manifest

The JSON files follow the schemas defined in the /schemas folder.

These files can then be fed to the Genome loader pipeline.

### How to run

```
init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::BRC4_genome_prepare_conf \
--host $HOST --port $PORT --user $USER --pass $PASS \
--hive_force_init 1 \
--pipeline_dir temp/prepare \
--data_dir $INPUT \
--output_dir $OUTPUT \
${OTHER_OPTIONS}
```

### Parameters

| option | default value | meaning |
| - | - | - |
| `--pipeline_name` | brc4_genome_prepare | name of the hive pipeline
| `--pipeline_dir` | | temp directory for this pipeline run
| `--data_dir` | | directory with json files for each genome to prepare, following the format set by schemas/genome_schema.json
| `--output_dir` | | directory where the prepared files are to be stored
| `--merge_split_genes` | 0 | Sometimes gene features are split in a GFF file. Ensembl expects genes to be contiguous, so this option merges the parts into one.
| `--exclude_seq_regions` | | Exclude these seq_regions (applies to all genomes; should be seldom used)
| `--validate_gene_id` | 0 | Enforce a strong gene ID pattern (replaced by GeneID if available)
| `--ensembl_mode` | 0 | By default, set additional metadata for BRC genomes. With this parameter, use vanilla Ensembl metadata instead.