From 6d1231762e5c7e0f0ba07b718f3f88faeb6ed6af Mon Sep 17 00:00:00 2001 From: Keshav Priyadarshi Date: Thu, 17 Oct 2024 23:50:40 +0530 Subject: [PATCH] Add documentation for new pipeline design - Add tutorials on how to write new importer/ improver pipeline Signed-off-by: Keshav Priyadarshi --- docs/source/contributing.rst | 587 +----------------- docs/source/index.rst | 8 + .../source/tutorial_add_importer_pipeline.rst | 366 +++++++++++ .../source/tutorial_add_improver_pipeline.rst | 267 ++++++++ 4 files changed, 651 insertions(+), 577 deletions(-) create mode 100644 docs/source/tutorial_add_importer_pipeline.rst create mode 100644 docs/source/tutorial_add_improver_pipeline.rst diff --git a/docs/source/contributing.rst b/docs/source/contributing.rst index fa6e7075b..6f57bab21 100644 --- a/docs/source/contributing.rst +++ b/docs/source/contributing.rst @@ -18,9 +18,9 @@ Do Your Homework ---------------- Before adding a contribution or create a new issue, take a look at the project’s -`README `_, read through our +`README `_, read through our `documentation `_, -and browse existing `issues `_, +and browse existing `issues `_, to develop some understanding of the project and confirm whether a given issue/feature has previously been discussed. @@ -35,7 +35,7 @@ First Timers You are here to help, but you are a new contributor! No worries, we always welcome newcomer contributors. We maintain some -`good first issues `_ +`good first issues `_ and encourage new contributors to work on those issues for a smooth start. .. tip:: @@ -47,15 +47,18 @@ Code Contributions For more established contributors, you can contribute to the codebase in several ways: -- Report a `bug `_; just remember to be as +- Report a `bug `_; just remember to be as specific as possible. -- Submit a `bug fix `_ for any existing +- Submit a `bug fix `_ for any existing issue. -- Create a `new issue `_ to request a +- Create a `new issue `_ to request a feature, submit a feedback, or ask a question. 
+* Want to add support for a new importer pipeline? See the detailed tutorial here: :ref:`tutorial_add_importer_pipeline`. +* Interested in adding a new improver pipeline? Check out the tutorial here: :ref:`tutorial_add_improver_pipeline`. + .. note:: - Make sure to check existing `issues `_, + Make sure to check existing `issues `_, to confirm whether a given issue or a question has previously been discussed. @@ -90,576 +93,6 @@ Helpful Resources - `Pro Git book `_ - `How to write a good bug report `_ -.. _tutorial_add_a_new_importer: - -Add a new importer -------------------- - -This tutorial contains all the things one should know to quickly implement an importer. -Many internal details about importers can be found inside the -:file:`vulnerabilites/importer.py` file. -Make sure to go through :ref:`importer-overview` before you begin writing one. - -TL;DR -------- - -#. Create a new :file:`vulnerabilities/importers/{importer_name.py}` file. -#. Create a new importer subclass inheriting from the ``Importer`` superclass defined in - ``vulnerabilites.importer``. It is conventional to end an importer name with *Importer*. -#. Specify the importer license. -#. Implement the ``advisory_data`` method to process the data source you are - writing an importer for. -#. Add the newly created importer to the importers registry at - ``vulnerabilites/importers/__init__.py`` - -.. _tutorial_add_a_new_importer_prerequisites: - -Prerequisites --------------- - -Before writing an importer, it is important to familiarize yourself with the following concepts. - -PackageURL -^^^^^^^^^^^^ - -VulnerableCode extensively uses Package URLs to identify a package. See the -`PackageURL specification `_ and its `Python implementation -`_ for more details. - -**Example usage:** - -.. 
code:: python - - from packageurl import PackageURL - purl = PackageURL(name="ffmpeg", type="deb", version="1.2.3") - - -AdvisoryData -^^^^^^^^^^^^^ - -``AdvisoryData`` is an intermediate data format: -it is expected that your importer will convert the raw scraped data into ``AdvisoryData`` objects. -All the fields in ``AdvisoryData`` dataclass are optional; it is the importer's resposibility to -ensure that it contains meaningful information about a vulnerability. - -AffectedPackage -^^^^^^^^^^^^^^^^ - -``AffectedPackage`` data type is used to store a range of affected versions and a fixed version of a -given package. For all version-related data, `univers `_ library -is used. - -Univers -^^^^^^^^ - -`univers `_ is a Python implementation of the `vers specification `_. -It can parse and compare all the package versions and all the ranges, -from debian, npm, pypi, ruby and more. -It processes all the version range specs and expressions. - -Importer -^^^^^^^^^ - -All the generic importers need to implement the ``Importer`` class. -For ``Git`` or ``Oval`` data source, ``GitImporter`` or ``OvalImporter`` could be implemented. - -.. note:: - - ``GitImporter`` and ``OvalImporter`` need a complete rewrite. - Interested in :ref:`contributing` ? - -Writing an importer ---------------------- - -Create Importer Source File -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -All importers are located in the :file:`vulnerabilites/importers` directory. -Create a new file to put your importer code in. -Generic importers are implemented by writing a subclass for the ``Importer`` superclass and -implementing the unimplemented methods. - -Specify the Importer License -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Importers scrape data off the internet. In order to make sure the data is useable, a license -must be provided. -Populate the ``spdx_license_expression`` with the appropriate value. -The SPDX license identifiers can be found at https://spdx.org/licenses/. - -.. 
note:: - An SPDX license identifier by itself is a valid licence expression. In case you need more complex - expressions, see https://spdx.github.io/spdx-spec/v2.3/SPDX-license-expressions/ - -Implement the ``advisory_data`` Method -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The ``advisory_data`` method scrapes the advisories from the data source this importer is -targeted at. -It is required to return an *Iterable of AdvisoryData objects*, and thus it is a good idea to yield -from this method after creating each AdvisoryData object. - -At this point, an example importer will look like this: - -:file:`vulnerabilites/importers/example.py` - -.. code-block:: python - - from typing import Iterable - - from packageurl import PackageURL - - from vulnerabilities.importer import AdvisoryData - from vulnerabilities.importer import Importer - - - class ExampleImporter(Importer): - - spdx_license_expression = "BSD-2-Clause" - - def advisory_data(self) -> Iterable[AdvisoryData]: - return [] - -This importer is only a valid skeleton and does not import anything at all. - -Let us implement another dummy importer that actually imports some data. - -Here we have a ``dummy_package`` which follows ``NginxVersionRange`` and ``SemverVersion`` for -version management from `univers `_. - -.. note:: - - It is possible that the versioning scheme you are targeting has not yet been - implemented in the `univers `_ library. - If this is the case, you will need to head over there and implement one. - -.. 
code-block:: python - - from datetime import datetime - from datetime import timezone - from typing import Iterable - - import requests - from packageurl import PackageURL - from univers.version_range import NginxVersionRange - from univers.versions import SemverVersion - - from vulnerabilities.importer import AdvisoryData - from vulnerabilities.importer import AffectedPackage - from vulnerabilities.importer import Importer - from vulnerabilities.importer import Reference - from vulnerabilities.importer import VulnerabilitySeverity - from vulnerabilities.severity_systems import SCORING_SYSTEMS - - - class ExampleImporter(Importer): - - spdx_license_expression = "BSD-2-Clause" - - def advisory_data(self) -> Iterable[AdvisoryData]: - raw_data = fetch_advisory_data() - for data in raw_data: - yield parse_advisory_data(data) - - - def fetch_advisory_data(): - return [ - { - "id": "CVE-2021-23017", - "summary": "1-byte memory overwrite in resolver", - "advisory_severity": "medium", - "vulnerable": "0.6.18-1.20.0", - "fixed": "1.20.1", - "reference": "http://mailman.nginx.org/pipermail/nginx-announce/2021/000300.html", - "published_on": "14-02-2021 UTC", - }, - { - "id": "CVE-2021-1234", - "summary": "Dummy advisory", - "advisory_severity": "high", - "vulnerable": "0.6.18-1.20.0", - "fixed": "1.20.1", - "reference": "http://example.com/cve-2021-1234", - "published_on": "06-10-2021 UTC", - }, - ] - - - def parse_advisory_data(raw_data) -> AdvisoryData: - purl = PackageURL(type="example", name="dummy_package") - affected_version_range = NginxVersionRange.from_native(raw_data["vulnerable"]) - fixed_version = SemverVersion(raw_data["fixed"]) - affected_package = AffectedPackage( - package=purl, affected_version_range=affected_version_range, fixed_version=fixed_version - ) - severity = VulnerabilitySeverity( - system=SCORING_SYSTEMS["generic_textual"], value=raw_data["advisory_severity"] - ) - references = [Reference(url=raw_data["reference"], severities=[severity])] - 
date_published = datetime.strptime(raw_data["published_on"], "%d-%m-%Y %Z").replace( - tzinfo=timezone.utc - ) - - return AdvisoryData( - aliases=[raw_data["id"]], - summary=raw_data["summary"], - affected_packages=[affected_package], - references=references, - date_published=date_published, - ) - - -.. note:: - - | Use ``make valid`` to format your new code using black and isort automatically. - | Use ``make check`` to check for formatting errors. - -Register the Importer -^^^^^^^^^^^^^^^^^^^^^^ - -Finally, register your importer in the importer registry at -:file:`vulnerabilites/importers/__init__.py` - -.. code-block:: python - :emphasize-lines: 1, 4 - - from vulnerabilities.importers import example - from vulnerabilities.importers import nginx - - IMPORTERS_REGISTRY = [nginx.NginxImporter, example.ExampleImporter] - - IMPORTERS_REGISTRY = {x.qualified_name: x for x in IMPORTERS_REGISTRY} - -Congratulations! You have written your first importer. - -Run Your First Importer -^^^^^^^^^^^^^^^^^^^^^^^^^^ - -If everything went well, you will see your importer in the list of available importers. - -.. code-block:: console - :emphasize-lines: 5 - - $ ./manage.py import --list - - Vulnerability data can be imported from the following importers: - vulnerabilities.importers.nginx.NginxImporter - vulnerabilities.importers.example.ExampleImporter - -Now, run the importer. - -.. code-block:: console - - $ ./manage.py import vulnerabilities.importers.example.ExampleImporter - - Importing data using vulnerabilities.importers.example.ExampleImporter - Successfully imported data using vulnerabilities.importers.example.ExampleImporter - -See :ref:`command_line_interface` for command line usage instructions. -Enable Debug Logging (Optional) -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -For more visibility, turn on debug logs in :file:`vulnerablecode/settings.py`. - -.. 
code-block:: python - - DEBUG = True - LOGGING = { - 'version': 1, - 'disable_existing_loggers': False, - 'handlers': { - 'console': { - 'class': 'logging.StreamHandler', - }, - }, - 'root': { - 'handlers': ['console'], - 'level': 'DEBUG', - }, - } - -Invoke the import command now and you will see (in a fresh database): - -.. code-block:: console - - $ ./manage.py import vulnerabilities.importers.example.ExampleImporter - - Importing data using vulnerabilities.importers.example.ExampleImporter - Starting import for vulnerabilities.importers.example.ExampleImporter - [*] New Advisory with aliases: ['CVE-2021-23017'], created_by: vulnerabilities.importers.example.ExampleImporter - [*] New Advisory with aliases: ['CVE-2021-1234'], created_by: vulnerabilities.importers.example.ExampleImporter - Finished import for vulnerabilities.importers.example.ExampleImporter. Imported 2 advisories. - Successfully imported data using vulnerabilities.importers.example.ExampleImporter - -.. _tutorial_add_a_new_improver: - -Add a new improver ---------------------- - -This tutorial contains all the things one should know to quickly -implement an improver. -Many internal details about improvers can be found inside the -:file:`vulnerabilites/improver.py` file. -Make sure to go through :ref:`improver-overview` before you begin writing one. - -TL;DR -------- - -#. Locate the importer that this improver will be improving data of at - :file:`vulnerabilities/importers/{importer_name.py}` file. -#. Create a new improver subclass inheriting from the ``Improver`` superclass defined in - ``vulnerabilites.improver``. It is conventional to end an improver name with *Improver*. -#. Implement the ``interesting_advisories`` property to return a QuerySet of imported data - (``Advisory``) you are interested in. -#. Implement the ``get_inferences`` method to return an iterable of ``Inference`` objects for the - given ``AdvisoryData``. -#. 
Add the newly created improver to the improvers registry at - ``vulnerabilites/improvers/__init__.py``. - -Prerequisites --------------- - -Before writing an improver, it is important to familiarize yourself with the following concepts. - -Importer -^^^^^^^^^^ - -Importers are responsible for scraping vulnerability data from various data sources without creating -a complete relational model between vulnerabilites and their fixes and storing them in a structured -fashion. These data are stored in the ``Advisory`` model and can be converted to an equivalent -``AdvisoryData`` for various use cases. -See :ref:`importer-overview` for a brief overview on importers. - -Importer Prerequisites -^^^^^^^^^^^^^^^^^^^^^^^ - -Improvers consume data produced by importers, and thus it is important to familiarize yourself with -:ref:`Importer Prerequisites `. - -Inference -^^^^^^^^^^^ - -Inferences express the contract between the improvers and the improve runner framework. -An inference is intended to contain data points about a vulnerability without any uncertainties, -which means that one inference will target one vulnerability with the specific relevant affected and -fixed packages (in the form of `PackageURLs `_). -There is no notion of version ranges here: all package versions must be explicitly specified. - -Because this concrete relationship is rarely available anywhere upstream, we have to *infer* -these values, thus the name. -As inferring something is not always perfect, an Inference also comes with a confidence score. - -Improver -^^^^^^^^^ - -All the Improvers must inherit from ``Improver`` superclass and implement the -``interesting_advisories`` property and the ``get_inferences`` method. - -Writing an improver ---------------------- - -Locate the Source File -^^^^^^^^^^^^^^^^^^^^^^^^ - -If the improver will be working on data imported by a specific importer, it will be located in -the same file at :file:`vulnerabilites/importers/{importer-name.py}`. 
Otherwise, if it is a -generic improver, create a new file :file:`vulnerabilites/improvers/{improver-name.py}`. - -Explore Package Managers (Optional) -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -If your Improver depends on the discrete versions of a package, the package managers' VersionAPI -located at :file:`vulnerabilites/package_managers.py` could come in handy. You will need to -instantiate the relevant ``VersionAPI`` in the improver's constructor and use it later in the -implemented methods. See an already implemented improver (NginxBasicImprover) for an example usage. - -Implement the ``interesting_advisories`` Property -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -This property is intended to return a QuerySet of ``Advisory`` on which the ``Improver`` is -designed to work. - -For example, if the improver is designed to work on Advisories imported by ``ExampleImporter``, -the property can be implemented as - -.. code-block:: python - - class ExampleBasicImprover(Improver): - - @property - def interesting_advisories(self) -> QuerySet: - return Advisory.objects.filter(created_by=ExampleImporter.qualified_name) - -Implement the ``get_inferences`` Method -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The framework calls ``get_inferences`` method for every ``AdvisoryData`` that is obtained from -the ``Advisory`` QuerySet returned by the ``interesting_advisories`` property. - -It is expected to return an iterable of ``Inference`` objects for the given ``AdvisoryData``. To -avoid storing a lot of Inferences in memory, it is preferable to yield from this method. - -A very simple Improver that processes all Advisories to create the minimal relationships that can -be obtained by existing data can be found at :file:`vulnerabilites/improvers/default.py`, which is -an example of a generic improver. For a more sophisticated and targeted example, you can look -at an already implemented improver (e.g., :file:`vulnerabilites/importers/nginx.py`). 
- -Improvers are not limited to improving discrete versions and may also improve ``aliases``. -One such example, improving the importer written in the :ref:`importer tutorial -`, is shown below. - -.. code-block:: python - - from datetime import datetime - from datetime import timezone - from typing import Iterable - - import requests - from django.db.models.query import QuerySet - from packageurl import PackageURL - from univers.version_range import NginxVersionRange - from univers.versions import SemverVersion - - from vulnerabilities.importer import AdvisoryData - from vulnerabilities.improver import MAX_CONFIDENCE - from vulnerabilities.improver import Improver - from vulnerabilities.improver import Inference - from vulnerabilities.models import Advisory - from vulnerabilities.severity_systems import SCORING_SYSTEMS - - - class ExampleImporter(Importer): - ... - - - class ExampleAliasImprover(Improver): - @property - def interesting_advisories(self) -> QuerySet: - return Advisory.objects.filter(created_by=ExampleImporter.qualified_name) - - def get_inferences(self, advisory_data) -> Iterable[Inference]: - for alias in advisory_data.aliases: - new_aliases = fetch_additional_aliases(alias) - aliases = new_aliases + [alias] - yield Inference(aliases=aliases, confidence=MAX_CONFIDENCE) - - - def fetch_additional_aliases(alias): - alias_map = { - "CVE-2021-23017": ["PYSEC-1337", "CERTIN-1337"], - "CVE-2021-1234": ["ANONSEC-1337", "CERTDES-1337"], - } - return alias_map.get(alias) - - -.. note:: - - | Use ``make valid`` to format your new code using black and isort automatically. - | Use ``make check`` to check for formatting errrors. - -Register the Improver -^^^^^^^^^^^^^^^^^^^^^^ - -Finally, register your improver in the improver registry at -:file:`vulnerabilites/improvers/__init__.py`. - -.. 
code-block:: python - :emphasize-lines: 7 - - from vulnerabilities import importers - from vulnerabilities.improvers import default - - IMPROVERS_REGISTRY = [ - default.DefaultImprover, - importers.nginx.NginxBasicImprover, - importers.example.ExampleAliasImprover, - ] - - IMPROVERS_REGISTRY = {x.qualified_name: x for x in IMPROVERS_REGISTRY} - -Congratulations! You have written your first improver. - -Run Your First Improver -^^^^^^^^^^^^^^^^^^^^^^^^^^ - -If everything went well, you will see your improver in the list of available improvers. - -.. code-block:: console - :emphasize-lines: 6 - - $ ./manage.py improve --list - - Vulnerability data can be processed by these available improvers: - vulnerabilities.improvers.default.DefaultImprover - vulnerabilities.importers.nginx.NginxBasicImprover - vulnerabilities.importers.example.ExampleAliasImprover - -Before running the improver, make sure you have imported the data. An improver cannot improve if -there is nothing imported. - -.. code-block:: console - - $ ./manage.py import vulnerabilities.importers.example.ExampleImporter - - Importing data using vulnerabilities.importers.example.ExampleImporter - Successfully imported data using vulnerabilities.importers.example.ExampleImporter - -Now, run the improver. - -.. code-block:: console - - $ ./manage.py improve vulnerabilities.importers.example.ExampleAliasImprover - - Improving data using vulnerabilities.importers.example.ExampleAliasImprover - Successfully improved data using vulnerabilities.importers.example.ExampleAliasImprover - -See :ref:`command_line_interface` for command line usage instructions. - -Enable Debug Logging (Optional) -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -For more visibility, turn on debug logs in :file:`vulnerablecode/settings.py`. - -.. 
code-block:: python - - DEBUG = True - LOGGING = { - 'version': 1, - 'disable_existing_loggers': False, - 'handlers': { - 'console': { - 'class': 'logging.StreamHandler', - }, - }, - 'root': { - 'handlers': ['console'], - 'level': 'DEBUG', - }, - } - -Invoke the improve command now and you will see (in a fresh database, after importing): - -.. code-block:: console - - $ ./manage.py improve vulnerabilities.importers.example.ExampleAliasImprover - - Improving data using vulnerabilities.importers.example.ExampleAliasImprover - Running improver: vulnerabilities.importers.example.ExampleAliasImprover - Improving advisory id: 1 - New alias for : PYSEC-1337 - New alias for : CVE-2021-23017 - New alias for : CERTIN-1337 - Improving advisory id: 2 - New alias for : CERTDES-1337 - New alias for : ANONSEC-1337 - New alias for : CVE-2021-1234 - Finished improving using vulnerabilities.importers.example.ExampleAliasImprover. - Successfully improved data using vulnerabilities.importers.example.ExampleAliasImprover - -.. note:: - Even though CVE-2021-23017 and CVE-2021-1234 are not supplied by this improver, the output above shows them - because we left out running the ``DefaultImprover`` in the example. The ``DefaultImprover`` - inserts minimal data found via the importers in the database (here, the above two CVEs). Run - importer, DefaultImprover and then your improver in this sequence to avoid this anomaly. diff --git a/docs/source/index.rst b/docs/source/index.rst index be51eca80..b20b1c9b5 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -29,6 +29,14 @@ In this documentation you will find information on: faq misc +.. toctree:: + :maxdepth: 2 + :caption: Tutorial + + tutorial_add_importer_pipeline + tutorial_add_improver_pipeline + + .. 
toctree:: :maxdepth: 2 :caption: Reference Documentation diff --git a/docs/source/tutorial_add_importer_pipeline.rst b/docs/source/tutorial_add_importer_pipeline.rst new file mode 100644 index 000000000..1374164d3 --- /dev/null +++ b/docs/source/tutorial_add_importer_pipeline.rst @@ -0,0 +1,366 @@ +.. _tutorial_add_importer_pipeline: + +Add a new pipeline to import advisories +======================================== + + +TL;DR +------- + +#. Create a new file ``{name}_importer.py`` inside **vulnerabilities/pipelines/**. +#. Create a new importer pipeline by inheriting **VulnerableCodeBaseImporterPipeline** + defined in **vulnerabilities.pipelines**. By convention, the importer pipeline + name should end with **ImporterPipeline**. +#. Specify the license of the upstream data being imported. +#. Implement the ``advisories_count`` and ``collect_advisories`` methods. +#. Add the newly created importer pipeline to the importers registry at + **vulnerabilities/importers/__init__.py**. + + +Pipeline +-------- + +We use `aboutcode.pipeline `_ +for importing and improving data. At a very high level, a working pipeline contains a classmethod +``steps`` that defines which steps to run and in what order. These steps are essentially just +functions. Pipeline provides an easy and effective way to log events inside these steps (it +automatically handles rendering and dissemination of these logs). + +It also includes a built-in progress indicator, which is essential since some of the jobs we run +in the pipeline are long-running tasks that require proper progress indicators. Pipeline provides +a way to seamlessly record progress (it automatically takes care of rendering and dissemination +of this progress). + +Additionally, the pipeline offers a consistent structure, making it easy to run these pipeline steps +with a message queue like RQ and store all events related to a particular pipeline for +debugging/improvements. 
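The ``steps`` idea described above can be illustrated with a small, self-contained sketch. It does not use ``aboutcode.pipeline``: ``ToyPipeline``, its ``log`` and ``execute`` methods, and the ``collect``/``store`` step names are hypothetical stand-ins chosen for illustration, and the real base classes may behave differently.

```python
import logging
from typing import Callable, Tuple

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("toy_pipeline")


class ToyPipeline:
    """Hypothetical stand-in for a pipeline; NOT the aboutcode.pipeline API."""

    def __init__(self):
        # Names of the steps that have run, in execution order.
        self.executed = []

    @classmethod
    def steps(cls) -> Tuple[Callable, ...]:
        # The order of this tuple is the order in which the steps run.
        return (cls.collect, cls.store)

    def log(self, message: str) -> None:
        logger.info(message)

    def collect(self) -> None:
        self.log("Collecting data")

    def store(self) -> None:
        self.log("Storing data")

    def execute(self) -> None:
        # Each step is a plain function that receives the pipeline instance.
        for step in self.steps():
            self.log(f"Step [{step.__name__}] starting")
            step(self)
            self.executed.append(step.__name__)


pipeline = ToyPipeline()
pipeline.execute()
```

Keeping the step order in a single classmethod means a runner needs only the class itself to know what to execute and in which sequence.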
+ +This tutorial contains all the things one should know to quickly implement an importer pipeline. +Many internal details about the importer pipeline can be found inside the `vulnerabilities/pipelines/__init__.py +`_ file. + + +.. _tutorial_add_importer_pipeline_prerequisites: + +Prerequisites +-------------- + +Before writing a pipeline to import advisories, it is important to familiarize yourself with +the following concepts. + +PackageURL +~~~~~~~~~~ + +VulnerableCode extensively uses Package URLs to identify a package. See the +`PackageURL specification `_ and its `Python implementation +`_ for more details. + +**Example usage:** + +.. code:: python + + from packageurl import PackageURL + purl = PackageURL(name="ffmpeg", type="deb", version="1.2.3") + + +AdvisoryData +~~~~~~~~~~~~~ + +``AdvisoryData`` is an intermediate data format: +it is expected that your importer will convert the raw scraped data into ``AdvisoryData`` objects. +All the fields in the ``AdvisoryData`` dataclass are optional; it is the importer's responsibility to +ensure that it contains meaningful information about a vulnerability. + +AffectedPackage +~~~~~~~~~~~~~~~ + +The ``AffectedPackage`` data type is used to store a range of affected versions and a fixed version of a +given package. For all version-related data, the `univers `_ library +is used. + +Univers +~~~~~~~ + +`univers `_ is a Python implementation of the `vers specification `_. +It can parse and compare all the package versions and all the ranges, +from debian, npm, pypi, ruby and more. +It processes all the version range specs and expressions. + + +Writing an Importer Pipeline +----------------------------- + + +Create file for the new importer pipeline +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +All pipelines, including the importer pipeline, are located in the +`vulnerabilities/pipelines/ +`_ directory. 
+ +The importer pipeline is implemented by subclassing **VulnerableCodeBaseImporterPipeline** +and implementing the unimplemented methods. Since most tasks, such as inserting **AdvisoryData** +into the database and creating package-vulnerability relationships, are the same regardless of +the source of the advisory, these tasks are already taken care of in the base importer pipeline, +i.e., **VulnerableCodeBaseImporterPipeline**. You can simply focus on collecting the raw data and +parsing it to create proper **AdvisoryData** objects. + + +Specify the importer license +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The pipeline scrapes data off the internet. To make sure the data is usable, a license +must be provided. + +Populate the ``spdx_license_expression`` with the appropriate value. The SPDX license identifiers +can be found at `ScanCode LicenseDB `_. + +.. note:: + An SPDX license identifier by itself is a valid license expression. In case you need more + complex expressions, see https://spdx.github.io/spdx-spec/v2.3/SPDX-license-expressions/ + + +Implement the ``advisories_count`` method +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``advisories_count`` method returns the total number of advisories that will be collected by +this pipeline. + +Suppose the upstream data is a single JSON file containing a list of security advisories; +in that case, you can simply return the count of security advisories in the JSON file, +and that's it. + +.. note:: + In some cases, it could be difficult to get the exact total number of advisories that would + be collected without actually processing the advisories. In such cases, returning the best + estimate will also work. + + **advisories_count** is used to enable a proper progress indicator and is not used beyond that. + If it is impossible (a very rare case) to compute the total advisory count beforehand, + just return ``0``. 
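As a rough illustration of the counting logic described above, the sketch below counts the entries of a JSON list and falls back to ``0`` when the count cannot be determined. The helper name ``count_advisories`` and the sample payload are assumptions made for this example and are not part of VulnerableCode; in a real pipeline this logic would sit inside ``advisories_count``.

```python
import json


def count_advisories(raw_json: str) -> int:
    """Return the number of advisories in a JSON list, or 0 if unknown."""
    try:
        advisories = json.loads(raw_json)
    except json.JSONDecodeError:
        # The count cannot be determined; 0 only affects the progress indicator.
        return 0
    if not isinstance(advisories, list):
        return 0
    return len(advisories)


# Hypothetical upstream payload: a single JSON list of advisories.
sample = '[{"id": "CVE-2021-23017"}, {"id": "CVE-2021-1234"}]'
count = count_advisories(sample)
fallback = count_advisories("not json")
```

Since the count only drives the progress indicator, a best-effort value with a ``0`` fallback is enough here.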
+ + +Implement the ``collect_advisories`` method +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``collect_advisories`` method collects and parses the advisories from the data source and +yields *AdvisoryData* objects. + +At this point, an example importer will look like this: + +.. code-block:: python + :caption: vulnerabilities/pipelines/example_importer.py + :linenos: + :emphasize-lines: 16-17, 20-21, 23-24 + + from vulnerabilities.pipelines import VulnerableCodeBaseImporterPipeline + + class ExampleImporterPipeline(VulnerableCodeBaseImporterPipeline): + """Collect advisories Example.""" + + pipeline_id = "example_importer" + + root_url = "https://example.org/path/to/advisories/" + license_url = "https://example.org/license/" + spdx_license_expression = "CC-BY-4.0" + importer_name = "Example Importer" + + @classmethod + def steps(cls): + return ( + cls.collect_and_store_advisories, + cls.import_new_advisories, + ) + + def advisories_count(self) -> int: + raise NotImplementedError + + def collect_advisories(self) -> Iterable[AdvisoryData]: + raise NotImplementedError + + +This pipeline is only a valid skeleton and does not import anything at all. + +Let us implement a working pipeline that actually imports some data. + +Here we have a ``dummy_package`` which follows ``NginxVersionRange`` and ``SemverVersion`` for +version management from `univers `_. + +.. note:: + + It is possible that the versioning scheme you are targeting has not yet been + implemented in the `univers `_ library. + If this is the case, you will need to head over there and implement one. + +.. 
code-block:: python + :caption: vulnerabilities/pipelines/example_importer.py + :linenos: + :emphasize-lines: 34-35, 37-40 + + from datetime import datetime + from datetime import timezone + from typing import Iterable + + from packageurl import PackageURL + from univers.version_range import NginxVersionRange + from univers.versions import SemverVersion + + from vulnerabilities.importer import AdvisoryData + from vulnerabilities.importer import AffectedPackage + from vulnerabilities.importer import Reference + from vulnerabilities.importer import VulnerabilitySeverity + from vulnerabilities.pipelines import VulnerableCodeBaseImporterPipeline + from vulnerabilities.severity_systems import SCORING_SYSTEMS + + + class ExampleImporterPipeline(VulnerableCodeBaseImporterPipeline): + """Collect advisories Example.""" + + pipeline_id = "example_importer" + + root_url = "https://example.org/path/to/advisories/" + license_url = "https://example.org/license/" + spdx_license_expression = "CC-BY-4.0" + importer_name = "Example Importer" + + @classmethod + def steps(cls): + return ( + cls.collect_and_store_advisories, + cls.import_new_advisories, + ) + + def advisories_count(self) -> int: + return len(fetch_advisory_data()) + + def collect_advisories(self) -> Iterable[AdvisoryData]: + raw_data = fetch_advisory_data() + for data in raw_data: + yield parse_advisory_data(data) + + + def fetch_advisory_data(): + return [ + { + "id": "CVE-2021-23017", + "summary": "1-byte memory overwrite in resolver", + "advisory_severity": "medium", + "vulnerable": "0.6.18-1.20.0", + "fixed": "1.20.1", + "reference": "http://mailman.nginx.org/pipermail/nginx-announce/2021/000300.html", + "published_on": "14-02-2021 UTC", + }, + { + "id": "CVE-2021-1234", + "summary": "Dummy advisory", + "advisory_severity": "high", + "vulnerable": "0.6.18-1.20.0", + "fixed": "1.20.1", + "reference": "http://example.org/cve-2021-1234", + "published_on": "06-10-2021 UTC", + }, + ] + + + def 
parse_advisory_data(raw_data) -> AdvisoryData: + purl = PackageURL(type="example", name="dummy_package") + affected_version_range = NginxVersionRange.from_native(raw_data["vulnerable"]) + fixed_version = SemverVersion(raw_data["fixed"]) + affected_package = AffectedPackage( + package=purl, affected_version_range=affected_version_range, fixed_version=fixed_version + ) + severity = VulnerabilitySeverity( + system=SCORING_SYSTEMS["generic_textual"], value=raw_data["advisory_severity"] + ) + references = [Reference(url=raw_data["reference"], severities=[severity])] + date_published = datetime.strptime(raw_data["published_on"], "%d-%m-%Y %Z").replace( + tzinfo=timezone.utc + ) + advisory_url = f"https://example.org/advisory/{raw_data['id']}" + + return AdvisoryData( + aliases=[raw_data["id"]], + summary=raw_data["summary"], + affected_packages=[affected_package], + references=references, + url=advisory_url, + date_published=date_published, + ) + + +.. important:: + Steps should include ``collect_and_store_advisories`` and ``import_new_advisories`` + in the order shown above. They are defined in **VulnerableCodeBaseImporterPipeline**. + + It is the **collect_and_store_advisories** that is responsible for making calls to + **collect_advisories** and **advisories_count**, and hence **collect_advisories** and + **advisories_count** should never be directly added in steps. + + + +.. note:: + + | Use ``make valid`` to format your code using black and isort automatically. + | Use ``make check`` to check for formatting errors. + +Register the Importer Pipeline +------------------------------ + +Finally, register your pipeline in the importer registry at +`vulnerabilities/importers/__init__.py +`_ + +.. 
code-block:: python + :caption: vulnerabilities/importers/__init__.py + :linenos: + :emphasize-lines: 1, 6 + + from vulnerabilities.pipelines import example_importer + from vulnerabilities.pipelines import nginx_importer + + IMPORTERS_REGISTRY = [ + nginx_importer.NginxImporterPipeline, + example_importer.ExampleImporterPipeline, + ] + + IMPORTERS_REGISTRY = { + x.pipeline_id if issubclass(x, VulnerableCodeBaseImporterPipeline) else x.qualified_name: x + for x in IMPORTERS_REGISTRY + } + +Congratulations! You have written your first importer pipeline. + +Run Your First Importer Pipeline +-------------------------------- + +If everything went well, you will see your pipeline in the list of available importers. + +.. code-block:: console + :emphasize-lines: 5 + + $ ./manage.py import --list + + Vulnerability data can be imported from the following importers: + nginx_importer + example_importer + +Now, run the importer. + +.. code-block:: console + + $ ./manage.py import example_importer + + Importing data using example_importer + INFO 2024-10-16 10:15:10.483 Pipeline [ExampleImporterPipeline] starting + INFO 2024-10-16 10:15:10.483 Step [collect_and_store_advisories] starting + INFO 2024-10-16 10:15:10.483 Collecting 2 advisories + INFO 2024-10-16 10:15:10.498 Successfully collected 2 advisories + INFO 2024-10-16 10:15:10.498 Step [collect_and_store_advisories] completed in 0 seconds + INFO 2024-10-16 10:15:10.498 Step [import_new_advisories] starting + INFO 2024-10-16 10:15:10.499 Importing 2 new advisories + INFO 2024-10-16 10:15:10.562 Successfully imported 2 new advisories + INFO 2024-10-16 10:15:10.563 Step [import_new_advisories] completed in 0 seconds + INFO 2024-10-16 10:15:10.563 Pipeline completed in 0 seconds + + +See :ref:`command_line_interface` for command line usage instructions. 
\ No newline at end of file diff --git a/docs/source/tutorial_add_improver_pipeline.rst b/docs/source/tutorial_add_improver_pipeline.rst new file mode 100644 index 000000000..855d109a2 --- /dev/null +++ b/docs/source/tutorial_add_improver_pipeline.rst @@ -0,0 +1,267 @@ +.. _tutorial_add_improver_pipeline: + +Add pipeline to improve/enhance data +===================================== + +TL;DR +------- + +#. Create a new file ``{improver_name}.py`` inside **vulnerabilities/pipelines/**. +#. Create a new improver pipeline by inheriting from **VulnerableCodePipeline** defined + in **vulnerabilities.pipelines**. +#. Implement the ``steps`` **classmethod** to define which functions to run and in which order. +#. Implement the individual functions defined in ``steps``. +#. Add the newly created pipeline to the improvers registry at + **vulnerabilities/improvers/__init__.py**. + +Pipeline +-------- + +We use `aboutcode.pipeline `_ +for importing and improving data. At a very high level, a working pipeline contains a classmethod +``steps`` that defines what steps to run and in what order. These steps are essentially just +functions. Pipeline provides an easy and effective way to log events inside these steps (it +automatically handles rendering and dissemination of these logs). + +It also includes a built-in progress indicator, which is essential since some of the jobs we run +in the pipeline are long-running tasks that require proper progress indicators. Pipeline provides +a way to seamlessly record progress (it automatically takes care of rendering and dissemination +of this progress). + +Additionally, the pipeline offers a consistent structure, making it easy to run these pipeline steps +with a message queue like RQ and store all events related to a particular pipeline for +debugging/improvements. + +This tutorial contains all the things one should know to quickly implement an improver pipeline. 
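The pattern described above (a ``steps`` classmethod that returns functions, which are then run in order with logging around each one) can be illustrated with a small stdlib-only sketch. This is not the ``aboutcode.pipeline`` implementation; ``MiniPipeline``, ``GreetingPipeline``, and the ``messages`` attribute are hypothetical names used only to show the idea.

```python
from datetime import datetime
from datetime import timezone


class MiniPipeline:
    """Hypothetical stand-in for a pipeline base class (not aboutcode.pipeline)."""

    def __init__(self):
        # Keep every logged message so that a run can be inspected afterwards.
        self.messages = []

    @classmethod
    def steps(cls):
        # Subclasses return a tuple of functions, in the order they should run.
        raise NotImplementedError

    def log(self, message):
        timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.messages.append(message)
        print(f"{timestamp} {message}")

    def execute(self):
        self.log(f"Pipeline [{type(self).__name__}] starting")
        for step in self.steps():
            self.log(f"Step [{step.__name__}] starting")
            # Steps are plain functions taken from the class, so pass self explicitly.
            step(self)
            self.log(f"Step [{step.__name__}] completed")
        self.log("Pipeline completed")


class GreetingPipeline(MiniPipeline):
    """Toy pipeline with two steps that share state through the instance."""

    @classmethod
    def steps(cls):
        return (cls.fetch, cls.report)

    def fetch(self):
        self.names = ["nginx", "npm"]

    def report(self):
        self.log(f"Fetched {len(self.names)} names")


pipeline = GreetingPipeline()
pipeline.execute()
```

Because each step is an ordinary function that receives the pipeline instance, steps can pass data to later steps through attributes, which is the same approach the improver in this tutorial takes.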
+ + +Prerequisites +------------- + +The new improver design lets you do all sorts of cool improvements and enhancements. +Some of those are: + +* Let's suppose you have a certain number of packages and vulnerabilities in your database, + and you want to make sure that the packages being shown in VulnerableCode do indeed exist upstream. + Oftentimes, we come across advisory data that contains made-up package versions. We can write + (well, we already have) a pipeline that iterates through all the packages in VulnerableCode and + labels them as ghost packages if they don't exist upstream. + + +* A basic security advisory only contains CVE/aliases, summary, fixed/affected version, and + severity. But now we can use the new pipeline to enhance the vulnerability info with exploits from + various sources like ExploitDB, Metasploit, etc. + + +* Likewise, we can have more pipelines to flag malicious/yanked packages. + + +So you see, the new improver pipeline is very powerful in what you can achieve, but as always, with +great power comes great responsibility. By design, the new improvers are unconstrained, and you must +be absolutely sure of what you're doing and should have robust tests for these pipelines in place. + + +Writing an Improver Pipeline +----------------------------- + +**Scenario:** Suppose we come across a source that curates and stores the list of packages that don't +exist upstream and makes it available through the REST API endpoint https://example.org/api/non-existent-packages, +which gives a JSON response with a list of non-existent packages. +Let's write a pipeline that will use this source to flag these non-existent packages as ghost packages. + + +Create file for the new improver pipeline +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +All pipelines, including the improver pipeline, are located in the +`vulnerabilities/pipelines/ +`_ directory. + +The improver pipeline is implemented by subclassing ``VulnerableCodePipeline``. 
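For the scenario above, it can help to see how the JSON body of such an endpoint might be reduced to a clean list of Package URLs before any database work happens. The sketch below uses only the standard library. The function name ``parse_non_existent_packages``, the ``non-existent`` key, and the sample body are assumptions made for illustration, since example.org is an imaginary source; in a real pipeline the body would come from an HTTP call.

```python
import json
from typing import List


def parse_non_existent_packages(body: str) -> List[str]:
    """Return the unique purl strings listed under the assumed "non-existent" key."""
    payload = json.loads(body)
    candidates = payload.get("non-existent", [])

    seen = set()
    purls = []
    for candidate in candidates:
        # Skip anything that is not a string starting with the "pkg:" scheme.
        if not isinstance(candidate, str) or not candidate.startswith("pkg:"):
            continue
        # Drop duplicates while preserving the order of first appearance.
        if candidate in seen:
            continue
        seen.add(candidate)
        purls.append(candidate)
    return purls


# Hypothetical response body, shaped like the mocked data used in this tutorial.
sample_body = json.dumps(
    {
        "non-existent": [
            "pkg:npm/dojo@1.0.0",
            "pkg:npm/dojo@1.0.0",
            "not-a-purl",
            "pkg:npm/electron@1.8.0",
        ]
    }
)
print(parse_non_existent_packages(sample_body))
# prints ['pkg:npm/dojo@1.0.0', 'pkg:npm/electron@1.8.0']
```

Filtering and de-duplicating up front keeps the later database query small and avoids matching on malformed input. The ``startswith`` check is only a rough guard, not a full Package URL validation.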
+ +Specify the improver license +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If the improver pipeline scrapes data off the internet, we need to track the license for +the scraped data to make sure that we can legally use it. + +Populate the ``spdx_license_expression`` with the appropriate value. The SPDX license identifiers +can be found at `ScanCode LicenseDB `_. + +.. note:: + An SPDX license identifier by itself is a valid license expression. In case you need more + complex expressions, see https://spdx.github.io/spdx-spec/v2.3/SPDX-license-expressions/ + + +Add skeleton for new pipeline +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In this scenario, the pipeline needs to do two things: fetch the raw data and use it to flag those packages. + +At this point, the improver will look like this: + +.. code-block:: python + :caption: vulnerabilities/pipelines/flag_ghost_package_with_example_org.py + :linenos: + :emphasize-lines: 14-15, 18-19, 21-22 + + from vulnerabilities.pipelines import VulnerableCodePipeline + + class FlagGhostPackagesWithExampleOrg(VulnerableCodePipeline): + """Example improver pipeline to flag ghost packages.""" + + pipeline_id = "flag_ghost_package_with_example_org" + + license_url = "https://example.org/license/" + spdx_license_expression = "CC-BY-4.0" + + @classmethod + def steps(cls): + return ( + cls.fetch_response, + cls.flag_ghost_packages, + ) + + def fetch_response(self): + raise NotImplementedError + + def flag_ghost_packages(self): + raise NotImplementedError + + +Implement the steps +~~~~~~~~~~~~~~~~~~~ + +We will evolve our high-level design by implementing the ``fetch_response`` and ``flag_ghost_packages`` +methods. + +.. 
code-block:: python + :caption: vulnerabilities/pipelines/flag_ghost_package_with_example_org.py + :linenos: + :emphasize-lines: 20-32, 34-42 + + from vulnerabilities.models import Package + from vulnerabilities.pipelines import VulnerableCodePipeline + + + class FlagGhostPackagesWithExampleOrg(VulnerableCodePipeline): + """Example improver pipeline to flag ghost packages.""" + + pipeline_id = "flag_ghost_package_with_example_org" + + license_url = "https://example.org/license/" + spdx_license_expression = "CC-BY-4.0" + + @classmethod + def steps(cls): + return ( + cls.fetch_response, + cls.flag_ghost_packages, + ) + + def fetch_response(self): + # Since this is an imaginary source, we will mock the response. + # In an actual implementation, you would use the requests library to get the data. + mock_response = { + "non-existent": [ + "pkg:npm/626@1.1.1", + "pkg:npm/bootstrap-tagsinput@0.8.0", + "pkg:npm/dojo@1.0.0", + "pkg:npm/dojo@1.1.0", + "pkg:npm/electron@1.8.0", + ] + } + self.fetched_data = mock_response + + def flag_ghost_packages(self): + non_existent_packages = self.fetched_data.get("non-existent", []) + + ghost_packages = Package.objects.filter(package_url__in=non_existent_packages) + ghost_package_count = ghost_packages.count() + + ghost_packages.update(is_ghost=True) + + self.log(f"Successfully flagged {ghost_package_count:,d} ghost Packages") + + +.. note:: + + | Use ``make valid`` to format your new code using black and isort automatically. + | Use ``make check`` to check for formatting errors. + + +Register the Improver Pipeline +------------------------------ + +Finally, register your improver in the improver registry at +`vulnerabilities/improvers/__init__.py +`_ + + +.. 
code-block:: python + :caption: vulnerabilities/improvers/__init__.py + :linenos: + :emphasize-lines: 2, 6 + + from vulnerabilities.pipelines import enhance_with_kev + from vulnerabilities.pipelines import flag_ghost_package_with_example_org + + IMPROVERS_REGISTRY = [ + enhance_with_kev.VulnerabilityKevPipeline, + flag_ghost_package_with_example_org.FlagGhostPackagesWithExampleOrg, + ] + + IMPROVERS_REGISTRY = { + x.pipeline_id if issubclass(x, VulnerableCodePipeline) else x.qualified_name: x + for x in IMPROVERS_REGISTRY + } + + +Congratulations! You have written your first improver pipeline. + +Run Your First Improver Pipeline +-------------------------------- + +If everything went well, you will see your improver in the list of available improvers. + +.. code-block:: console + :emphasize-lines: 5 + + $ ./manage.py improve --list + + Vulnerability data can be processed by these available improvers: + enhance_with_kev + flag_ghost_package_with_example_org + +Now, run the improver. + +.. code-block:: console + + $ ./manage.py improve flag_ghost_package_with_example_org + + Improving data using flag_ghost_package_with_example_org + INFO 2024-10-17 14:37:54.482 Pipeline [FlagGhostPackagesWithExampleOrg] starting + INFO 2024-10-17 14:37:54.482 Step [fetch_response] starting + INFO 2024-10-17 14:37:54.482 Step [fetch_response] completed in 0 seconds + INFO 2024-10-17 14:37:54.482 Step [flag_ghost_packages] starting + INFO 2024-10-17 14:37:54.488 Successfully flagged 5 ghost Packages + INFO 2024-10-17 14:37:54.488 Step [flag_ghost_packages] completed in 0 seconds + INFO 2024-10-17 14:37:54.488 Pipeline completed in 0 seconds + + +See :ref:`command_line_interface` for command line usage instructions. + +.. tip:: + + If you need to improve package vulnerability relations created using a certain pipeline, + simply use the **pipeline_id** to filter out only those items. 
For example, if you want + to improve only those **AffectedByPackageRelatedVulnerability** entries that were created + by the npm_importer pipeline, you can do so with the following query: + + .. code-block:: python + + AffectedByPackageRelatedVulnerability.objects.filter(created_by=NpmImporterPipeline.pipeline_id) + +.. note:: + + Make sure to use properly optimized query sets, and wherever needed, use paginated query sets.
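The idea behind paginated query sets is to process a large collection in fixed-size batches instead of loading everything at once. The sketch below shows that idea with the standard library only, so it runs without a database; ``in_batches`` and the sample purls are hypothetical. In a Django project, ``QuerySet.iterator(chunk_size=...)`` or ``django.core.paginator.Paginator`` would typically play this role.

```python
from itertools import islice
from typing import Iterable
from typing import Iterator
from typing import List


def in_batches(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most ``batch_size`` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")

    iterator = iter(items)
    while True:
        # islice pulls at most batch_size items without materializing the rest.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical purls standing in for rows that a pipeline step would process.
purls = [f"pkg:npm/example@1.{minor}.0" for minor in range(5)]
batches = list(in_batches(purls, batch_size=2))

for batch in batches:
    print(len(batch), batch)
```

With five items and a batch size of two, this yields batches of two, two, and one item. Processing per batch keeps memory use bounded and gives a natural place to log progress between batches.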