Add statistical analytics to the records module #216

Merged
43 changes: 37 additions & 6 deletions README.rst
@@ -1,5 +1,5 @@
..
Copyright (C) 2021 Graz University of Technology.
Copyright (C) 2021-2024 Graz University of Technology.

Invenio-Records-Marc21 is free software; you can redistribute it and/or modify it
under the terms of the MIT License; see LICENSE file for more details.
@@ -35,20 +35,51 @@ Install
Choose a version of elasticsearch and a DB, then run:

.. code-block:: console

   pipenv run pip install -e .[all]
   pipenv run pip install invenio-search[elasticsearch7]
   pipenv run pip install invenio-db[postgresql,versioning]


Service
=========
Setting Up Statistics
=====================

To enable statistics for MARC21 records in Invenio, update your ``invenio.cfg`` so that the MARC21 statistics settings are merged into the configuration of Invenio's standard statistics modules.

Configuration Steps
-------------------

1. **Import the required configurations.**

   Before updating any configuration values, import the necessary settings from both the ``invenio_records_marc21`` module and the ``invenio_app_rdm`` module. Add the following lines to your ``invenio.cfg``:

   .. code-block:: python

      from invenio_records_marc21.config import MARC21_STATS_CELERY_TASKS, MARC21_STATS_EVENTS, MARC21_STATS_AGGREGATIONS, MARC21_STATS_QUERIES
      from invenio_app_rdm.config import CELERY_BEAT_SCHEDULE, STATS_EVENTS, STATS_AGGREGATIONS, STATS_QUERIES

2. **Update the Celery beat schedule.**

   Add the MARC21-specific scheduled tasks to Invenio's scheduler:

   .. code-block:: python

      CELERY_BEAT_SCHEDULE.update(MARC21_STATS_CELERY_TASKS)

3. **Update events, aggregations, and queries.**

   Merge the MARC21 statistics configurations into the global statistics settings:

   .. code-block:: python

      STATS_EVENTS.update(MARC21_STATS_EVENTS)
      STATS_AGGREGATIONS.update(MARC21_STATS_AGGREGATIONS)
      STATS_QUERIES.update(MARC21_STATS_QUERIES)

**Create Marc21 Record**

Tests
=========

.. code-block:: console

   pipenv run ./run-tests.sh
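
The configuration steps in the README rely on `dict.update`, which mutates the imported defaults in place and silently overwrites any key that already exists. A minimal sketch of the same merge with an explicit collision check follows. The dictionaries are stand-in values rather than the real `invenio_app_rdm` and `invenio_records_marc21` settings, and `merge_config` is a hypothetical helper that is not part of either package.

```python
# Stand-in values: in a real invenio.cfg these dictionaries would be imported
# from invenio_app_rdm.config and invenio_records_marc21.config.
STATS_EVENTS = {"record-view": {"cls": "EventsIndexer"}}
MARC21_STATS_EVENTS = {"marc21-record-view": {"cls": "EventsIndexer"}}


def merge_config(base, extra):
    """Return a new dict containing the entries of both inputs.

    Hypothetical helper: raises ValueError when both dicts define the same
    key, instead of letting ``extra`` silently win as ``dict.update`` would.
    """
    overlap = base.keys() & extra.keys()
    if overlap:
        raise ValueError(f"conflicting keys: {sorted(overlap)}")
    return {**base, **extra}


merged = merge_config(STATS_EVENTS, MARC21_STATS_EVENTS)
print(sorted(merged))
```

Because the MARC21 keys all carry a `marc21-` prefix, a collision with the default keys looks unlikely, so the plain `.update()` calls from the README should behave the same way in practice.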
221 changes: 219 additions & 2 deletions invenio_records_marc21/config.py
@@ -2,7 +2,7 @@
#
# This file is part of Invenio.
#
# Copyright (C) 2021-2023 Graz University of Technology.
# Copyright (C) 2021-2024 Graz University of Technology.
#
# Invenio-Records-Marc21 is free software; you can redistribute it and/or
# modify it under the terms of the MIT License; see LICENSE file for more
@@ -13,15 +13,20 @@
from __future__ import absolute_import, print_function

import idutils
from celery.schedules import crontab
from celery.schedules import crontab, timedelta
from flask_principal import RoleNeed
from invenio_i18n import lazy_gettext as _
from invenio_rdm_records.services import facets as rdm_facets
from invenio_rdm_records.services.pids import providers
from invenio_stats.aggregations import StatAggregator
from invenio_stats.contrib.event_builders import build_file_unique_id
from invenio_stats.processors import EventsIndexer, anonymize_user, flag_robots
from invenio_stats.queries import TermsQuery

from .resources.serializers.datacite import Marc21DataCite43JSONSerializer
from .services import facets
from .services.pids import Marc21DataCitePIDProvider
from .utils import build_record_unique_id

MARC21_FACETS = {
    "access_status": {
@@ -42,6 +47,12 @@
            "field": "files.types",
        },
    },
    "mostviewed": dict(
        title=_("Most viewed"), fields=["-stats.all_versions.unique_views"]
    ),
    "mostdownloaded": dict(
        title=_("Most downloaded"), fields=["-stats.all_versions.unique_downloads"]
    ),
}

MARC21_SORT_OPTIONS = {
@@ -69,6 +80,12 @@
        title=_("Least recently updated"),
        fields=["updated"],
    ),
    "mostviewed": dict(
        title=_("Most viewed"), fields=["-stats.all_versions.unique_views"]
    ),
    "mostdownloaded": dict(
        title=_("Most downloaded"), fields=["-stats.all_versions.unique_downloads"]
    ),
}

MARC21_SEARCH_DRAFTS = {
@@ -89,6 +106,8 @@
        "newest",
        "oldest",
        "version",
        "mostviewed",
        "mostdownloaded",
    ],
}
"""Record search configuration."""
@@ -262,3 +281,201 @@ def make_doi(prefix, record):

MARC21_RECORD_CURATOR_NEEDS = [RoleNeed("Marc21Curator")]
"""This Role is to modify records only, no creation, no deletion possible."""


# Statistics configuration

MARC21_STATS_CELERY_TASKS = {
    # indexing of statistics events & aggregations
    "marc21-stats-process-events": {
        "task": "invenio_stats.tasks.process_events",
        "args": [("marc21-record-view", "marc21-file-download")],
        "schedule": crontab(
            minute="20,50",
        ),  # Every hour at minute 20 and 50
    },
    "marc21-stats-aggregate-events": {
        "task": "invenio_stats.tasks.aggregate_events",
        "args": [
            (
                "marc21-record-view-agg",
                "marc21-file-download-agg",
            )
        ],
        "schedule": crontab(minute="5"),  # Every hour at minute 5
    },
    "marc21-reindex-stats": {
        "task": "invenio_records_marc21.services.tasks.marc21_reindex_stats",
        "args": [
            (
                "stats-marc21-record-view",
                "stats-marc21-file-download",
            )
        ],
        "schedule": crontab(minute="10"),
    },
}
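
The three schedules above interleave within each hour. A small sketch that lists the firing order may help when reasoning about how stale the statistics can be. It uses only the minute values copied from `MARC21_STATS_CELERY_TASKS`; `SCHEDULE_MINUTES` and `hourly_timeline` are illustrative names, and the sketch does not call Celery at all.

```python
# Minute-of-hour values copied from MARC21_STATS_CELERY_TASKS above.
SCHEDULE_MINUTES = {
    "marc21-stats-process-events": (20, 50),
    "marc21-stats-aggregate-events": (5,),
    "marc21-reindex-stats": (10,),
}


def hourly_timeline(schedule):
    """Return (minute, task_name) pairs in the order they fire within an hour."""
    return sorted(
        (minute, name) for name, minutes in schedule.items() for minute in minutes
    )


for minute, name in hourly_timeline(SCHEDULE_MINUTES):
    print(f"xx:{minute:02d}  {name}")
```

Read this way, events indexed at minute 50 are picked up by the aggregation run at minute 5 of the following hour, and the reindex task at minute 10 runs after that aggregation.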

# Invenio-Stats
# =============
# See https://invenio-stats.readthedocs.io/en/latest/configuration.html

MARC21_STATS_EVENTS = {
    "marc21-file-download": {
        "templates": "invenio_records_marc21.records.statistics.templates.events.marc21_file_download",
        "event_builders": [
            "invenio_rdm_records.resources.stats.file_download_event_builder",
            "invenio_rdm_records.resources.stats.check_if_via_api",
        ],
        "cls": EventsIndexer,
        "params": {
            "preprocessors": [flag_robots, anonymize_user, build_file_unique_id]
        },
    },
    "marc21-record-view": {
        "templates": "invenio_records_marc21.records.statistics.templates.events.marc21_record_view",
        "event_builders": [
            "invenio_rdm_records.resources.stats.record_view_event_builder",
            "invenio_rdm_records.resources.stats.check_if_via_api",
            "invenio_rdm_records.resources.stats.drop_if_via_api",
        ],
        "cls": EventsIndexer,
        "params": {
            "preprocessors": [flag_robots, anonymize_user, build_record_unique_id],
        },
    },
}
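
The `preprocessors` lists above are applied to each event in order before it is indexed. The sketch below illustrates that chaining under one assumption: each preprocessor takes the event dictionary and returns it, or returns `None` to drop the event. The `*_demo` functions are toy stand-ins and do not reproduce the behaviour of `flag_robots` or `build_record_unique_id`; `run_preprocessors` is a hypothetical helper, not the `EventsIndexer` implementation.

```python
def flag_robots_demo(event):
    """Toy stand-in: flag events whose user agent mentions 'bot'."""
    event["is_robot"] = "bot" in event.get("user_agent", "").lower()
    return event


def build_record_unique_id_demo(event):
    """Toy stand-in: derive the aggregation key from the record id."""
    event["unique_id"] = f"recid_{event['recid']}"
    return event


def run_preprocessors(event, preprocessors):
    """Apply each preprocessor in order; stop early if one returns None."""
    event = dict(event)  # work on a copy so the caller's dict is untouched
    for step in preprocessors:
        event = step(event)
        if event is None:
            return None
    return event


raw = {"recid": "abc12", "user_agent": "ExampleBot/1.0"}
processed = run_preprocessors(raw, [flag_robots_demo, build_record_unique_id_demo])
print(processed)
```

The order matters in this model: a step that builds `unique_id` has to run before anything that aggregates on that field, which matches its position at the end of both lists in the configuration.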

MARC21_STATS_AGGREGATIONS = {
    "marc21-file-download-agg": {
        "templates": "invenio_records_marc21.records.statistics.templates.aggregations.aggr_marc21_file_download",
        "cls": StatAggregator,
        "params": {
            "event": "marc21-file-download",
            "field": "unique_id",
            "interval": "day",
            "index_interval": "month",
            "copy_fields": {
                "file_id": "file_id",
                "file_key": "file_key",
                "bucket_id": "bucket_id",
                "recid": "recid",
                "parent_recid": "parent_recid",
            },
            "metric_fields": {
                "unique_count": (
                    "cardinality",
                    "unique_session_id",
                    {"precision_threshold": 1000},
                ),
                "volume": ("sum", "size", {}),
            },
        },
    },
    "marc21-record-view-agg": {
        "templates": "invenio_records_marc21.records.statistics.templates.aggregations.aggr_marc21_record_view",
        "cls": StatAggregator,
        "params": {
            "event": "marc21-record-view",
            "field": "unique_id",
            "interval": "day",
            "index_interval": "month",
            "copy_fields": {
                "recid": "recid",
                "parent_recid": "parent_recid",
                "via_api": "via_api",
            },
            "metric_fields": {
                "unique_count": (
                    "cardinality",
                    "unique_session_id",
                    {"precision_threshold": 1000},
                ),
            },
            "query_modifiers": [lambda query, **_: query.filter("term", via_api=False)],
        },
    },
}
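
Conceptually, each aggregation above groups events by `unique_id` and day, then derives a raw count, a distinct-session count (`unique_count`), and for downloads a summed `volume`. The in-memory sketch below mirrors that grouping on toy data. It is not how `StatAggregator` runs (the real aggregation is executed by the search engine, where `cardinality` is an approximate count), and the `day` field and the `aggregate_daily` helper are assumptions made for illustration.

```python
from collections import defaultdict

# Toy download events; "day" is a simplification of the real event timestamp.
events = [
    {"unique_id": "b1_f1", "day": "2024-01-01", "unique_session_id": "s1", "size": 100},
    {"unique_id": "b1_f1", "day": "2024-01-01", "unique_session_id": "s1", "size": 100},
    {"unique_id": "b1_f1", "day": "2024-01-01", "unique_session_id": "s2", "size": 100},
    {"unique_id": "b1_f1", "day": "2024-01-02", "unique_session_id": "s1", "size": 100},
]


def aggregate_daily(events):
    """Group events by (unique_id, day) and compute the three metrics."""
    buckets = defaultdict(lambda: {"count": 0, "sessions": set(), "volume": 0})
    for event in events:
        bucket = buckets[(event["unique_id"], event["day"])]
        bucket["count"] += 1
        bucket["sessions"].add(event["unique_session_id"])
        bucket["volume"] += event["size"]
    return {
        key: {
            "count": bucket["count"],
            "unique_count": len(bucket["sessions"]),  # exact here, approximate in search
            "volume": bucket["volume"],
        }
        for key, bucket in buckets.items()
    }


daily = aggregate_daily(events)
print(daily[("b1_f1", "2024-01-01")])
```

In this toy data the first day has three downloads from two sessions, so `count` and `unique_count` differ; that gap is what the separate "downloads" and "unique downloads" metrics expose.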

MARC21_STATS_QUERIES = {
    "marc21-record-view": {
        "cls": TermsQuery,
        "permission_factory": None,
        "params": {
            "index": "stats-marc21-record-view",
            "doc_type": "marc21-record-view-day-aggregation",
            "copy_fields": {
                "recid": "recid",
                "parent_recid": "parent_recid",
            },
            "query_modifiers": [],
            "required_filters": {
                "recid": "recid",
            },
            "metric_fields": {
                "views": ("sum", "count", {}),
                "unique_views": ("sum", "unique_count", {}),
            },
        },
    },
    "marc21-record-view-all-versions": {
        "cls": TermsQuery,
        "permission_factory": None,
        "params": {
            "index": "stats-marc21-record-view",
            "doc_type": "marc21-record-view-day-aggregation",
            "copy_fields": {
                "parent_recid": "parent_recid",
            },
            "query_modifiers": [],
            "required_filters": {
                "parent_recid": "parent_recid",
            },
            "metric_fields": {
                "views": ("sum", "count", {}),
                "unique_views": ("sum", "unique_count", {}),
            },
        },
    },
    "marc21-record-download": {
        "cls": TermsQuery,
        "permission_factory": None,
        "params": {
            "index": "stats-marc21-file-download",
            "doc_type": "marc21-file-download-day-aggregation",
            "copy_fields": {
                "recid": "recid",
                "parent_recid": "parent_recid",
            },
            "query_modifiers": [],
            "required_filters": {
                "recid": "recid",
            },
            "metric_fields": {
                "downloads": ("sum", "count", {}),
                "unique_downloads": ("sum", "unique_count", {}),
                "data_volume": ("sum", "volume", {}),
            },
        },
    },
    "marc21-record-download-all-versions": {
        "cls": TermsQuery,
        "permission_factory": None,
        "params": {
            "index": "stats-marc21-file-download",
            "doc_type": "marc21-file-download-day-aggregation",
            "copy_fields": {
                "parent_recid": "parent_recid",
            },
            "query_modifiers": [],
            "required_filters": {
                "parent_recid": "parent_recid",
            },
            "metric_fields": {
                "downloads": ("sum", "count", {}),
                "unique_downloads": ("sum", "unique_count", {}),
                "data_volume": ("sum", "volume", {}),
            },
        },
    },
}
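
The queries above differ only in their required filter (`recid` for a single version, `parent_recid` for all versions) and in which daily aggregation index they sum over. A minimal in-memory sketch of the view queries, assuming daily aggregation documents shaped like the ones the aggregations produce; `daily_docs` and `view_stats` are illustrative names, not part of invenio-stats.

```python
# Toy daily aggregation documents for two versions of the same parent record.
daily_docs = [
    {"recid": "a1", "parent_recid": "p1", "count": 3, "unique_count": 2},
    {"recid": "a1", "parent_recid": "p1", "count": 1, "unique_count": 1},
    {"recid": "a2", "parent_recid": "p1", "count": 5, "unique_count": 4},
]


def view_stats(docs, field, value):
    """Sum the daily metrics of all documents whose `field` equals `value`."""
    matching = [doc for doc in docs if doc[field] == value]
    return {
        "views": sum(doc["count"] for doc in matching),
        "unique_views": sum(doc["unique_count"] for doc in matching),
    }


print(view_stats(daily_docs, "recid", "a1"))          # single version
print(view_stats(daily_docs, "parent_recid", "p1"))   # all versions
```

Because `unique_views` is a sum of per-day unique counts, a session that returns on a second day is counted once for each day in this model.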