Backend: implement PopularPages KPI #33

Open
benoit74 opened this issue Nov 9, 2023 · 7 comments
Labels
enhancement New feature or request stale
Comments

@benoit74
Collaborator

benoit74 commented Nov 9, 2023

Implement backend logic to create PopularPages KPI:

  • KPI
  • Indicator
  • Input
  • Logic to generate input
| Name | Unique ID | Definition | Value stored in DB for each aggregation |
|------|-----------|------------|------------------------------------------|
| PopularPages | 2002 | Number of visits per package objects (typically a web page), ignoring assets (images, technical files like CSS, JS, ...), implemented only for ZIMs in v1 | List of top 50 pages (package + page name, with number of visits for each of them) + total number of visits |
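A minimal sketch of how the stored value could be derived, assuming visits have already been reduced to (package, page) pairs. The function name `popular_pages` and the shape of the returned dictionary are hypothetical and not taken from the existing code on the main branch.

```python
from collections import Counter

TOP_N = 50  # size of the stored list, per the KPI definition


def popular_pages(visits, top_n=TOP_N):
    """Aggregate (package, page) visits into a top-N list plus a total.

    `visits` is an iterable of (package, page) tuples, one per visit.
    The output layout is an assumption made for illustration only.
    """
    counts = Counter(visits)
    top = [
        {"package": package, "page": page, "visits": n}
        for (package, page), n in counts.most_common(top_n)
    ]
    # The total covers every visit, not just those in the top-N list.
    return {"top": top, "total_visits": sum(counts.values())}
```

Note that the total is computed over all counted visits, so it can be larger than the sum of the visits in the top 50 list.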

Note: this is in fact already mostly ready on the main branch, but it needs to be revisited following recent discussions and the name change.

Important to discuss / adjust: how do we distinguish a real object from an asset? Currently we assume that everything with an html/epub/pdf content-type is a real object (and hence tracked) and everything else is an asset (and hence ignored).
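The current heuristic can be sketched roughly as follows. This is an illustration only: the names `is_page` and `PAGE_CONTENT_TYPE_MARKERS` are hypothetical, and the substring matching is an assumption about how "an html/epub/pdf content-type" is tested.

```python
# Substrings that mark a response as a "real" object (assumed matching rule).
PAGE_CONTENT_TYPE_MARKERS = ("html", "epub", "pdf")


def is_page(content_type):
    """Return True if the response looks like a real object, not an asset."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before matching.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return any(marker in media_type for marker in PAGE_CONTENT_TYPE_MARKERS)
```

With this rule, `text/html; charset=utf-8`, `application/epub+zip` and `application/pdf` are tracked, while `image/png` or `text/css` are ignored.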

@benoit74
Collaborator Author

benoit74 commented Nov 9, 2023

@kelson42 @rgaudin @Popolechien (sorry discussion is a bit technical but it also has "business" sides)

Do you have any input about what should be considered a page in this indicator? We want to get the number of visits per page (package + page name) but only "real" pages should be considered, not all the "dummy" assets needed to display the page (images, CSS, JS, ...).

The situation is that we base our metrics system on Caddy (web reverse proxy) logs. At this level, we have one log line per web request made to the offspot. To load one single page in the client (e.g. "Hove_War_Memorial" in "Wikipedia"), many log lines are generated by Caddy (one per asset).
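To illustrate what one such log line carries, here is a small parsing sketch. It assumes Caddy is configured for JSON access logs with a `request` object (`host`, `uri`), a `status` field and a `resp_headers` map of header name to list of values; the sample line and its hostname are made up, and `parse_log_line` is a hypothetical helper, not the actual offspot code.

```python
import json


def parse_log_line(line):
    """Extract the fields of interest from one JSON access-log line.

    Returns None for lines that are not valid JSON or lack a request object.
    """
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return None
    request = entry.get("request") if isinstance(entry, dict) else None
    if not isinstance(request, dict):
        return None
    # Header values are assumed to be lists; keep the first Content-Type.
    headers = entry.get("resp_headers") or {}
    content_types = headers.get("Content-Type") or [None]
    return {
        "host": request.get("host"),
        "uri": request.get("uri"),
        "status": entry.get("status"),
        "content_type": content_types[0],
    }


# Made-up sample line, for illustration only.
sample = (
    '{"request": {"host": "wikipedia.example", "uri": "/Hove_War_Memorial"}, '
    '"resp_headers": {"Content-Type": ["text/html; charset=utf-8"]}, '
    '"status": 200}'
)
parsed = parse_log_line(sample)
```

Every asset request produces a line of the same shape, so any page/asset distinction has to be made from fields like these.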

My first instinct was to only consider URLs whose content-type contains either "html" or "epub" or "pdf", but in fact we also have video and audio files, and maybe other kinds of content.

And one single object of interest might even need multiple HTML requests to load the whole content (e.g. with iframes, ...).

Maybe we should consider only "html" content (instead of "html" + "epub" + ...) since otherwise we get duplicates in many cases (e.g. once for the "html" page holding the video and once for the video itself). But this does not hold when the ZIM contains an application (e.g. freecodecamp, sooner or later Kolibri ...) which has only one page and loads assets dynamically (we will have only one "html" page per client and it will have a generic name).
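The double counting can be shown on a tiny made-up example. The request list and the `count_objects` helper below are hypothetical, and matching on content-type substrings is an assumption.

```python
# Hypothetical (uri, content_type) pairs for one visit to a page embedding a video.
requests = [
    ("/videos/intro", "text/html"),
    ("/videos/intro.webm", "video/webm"),
    ("/style.css", "text/css"),
]


def count_objects(requests, markers):
    """Count requests whose content-type contains any of the given markers."""
    return sum(1 for _, ctype in requests if any(m in ctype for m in markers))


html_only = count_objects(requests, ("html",))           # 1 object counted
with_media = count_objects(requests, ("html", "video"))  # 2 objects counted
```

Restricting to "html" gives one visit for this page, while also tracking the video gives two for the same user action.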

I don't know how we could detect that stuff properly (or at least for 80% of the cases).

Probably what we want is to have only (and all) ZIM items marked as is_front, and probably get their Title (as stored in the ZIM and displayed in the suggestions for instance) instead of the whole URL (which does not matter in fact). But AFAIK kiwix-serve does not present this information in the web response (I'm not even 100% sure libkiwix exposes the information). Would it make any sense to make such a modification?

@benoit74
Collaborator Author

Given the limited progress in our reflections here, I dug a bit deeper and opened kiwix/libkiwix#1026

@benoit74
Collaborator Author

The more we dive into this issue and into kiwix/libkiwix#1026, the more doubts I have about the pertinence of this KPI.

If we want to measure this KPI correctly, it has many impacts in terms of software (see kiwix/libkiwix#1026 to just cover the ZIM case, not all other apps).

This KPI also induces a large impact in terms of storage on the offspot (I guess it might account for at least 80% of the DB size) + a significant impact on live performance of the metrics subsystem (many logs will need to be analyzed in fine detail to determine whether each one is a page or an asset).

Finally, I heard in some of our discussions that this might not cover the real need which is more around specific pages / assets (APKs, ...) for which we need details but might never make it to the top 50.

So I ask the question pretty boldly: are we really sure we want to include this KPI in metrics v1?

@Popolechien

@benoit74 Just to be clear, the question is about whether we monitor individual article metrics. This does not impact the measurement at the zim level, correct?

@benoit74
Collaborator Author

Yes, measurement at ZIM level is done (with two KPIs so far, usage in minutes and number of visits)

@Popolechien

Ok so let's park this one for v2 or until we think this through a little better.


stale bot commented Jan 31, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Jan 31, 2024