Backend: implement PopularPages KPI #33

benoit74 · 2023-11-09T07:52:35Z

Implement backend logic to create PopularPages KPI:

KPI
Indicator
Input
Logic to generate input

Name	Unique ID	Definition	Value stored in DB for each aggregation
PopularPages	2002	Number of visits per package objects (typically a web page), ignoring assets (images, technical files like CSS, JS, ...), implemented only for ZIMs in v1	List of top 50 pages (package + page name, with number of visits for each of them) + total number of visits
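To make the "value stored in DB" column concrete, here is a minimal sketch of one aggregation, assuming visits have already been reduced to (package, page) pairs. The function name `popular_pages` and the shape of the returned dict are hypothetical and are not taken from the existing backend code.

```python
from collections import Counter

TOP_N = 50  # "top 50 pages" from the KPI definition


def popular_pages(visits, top_n=TOP_N):
    """Aggregate visits for one period.

    `visits` is an iterable of (package, page) tuples, one per counted visit.
    Returns the top pages with their visit counts plus the total number of
    visits (which includes pages that did not make it into the top list).
    """
    counts = Counter(visits)
    top = [
        {"package": package, "page": page, "visits": n}
        for (package, page), n in counts.most_common(top_n)
    ]
    return {"top": top, "total_visits": sum(counts.values())}
```

The total is computed over all pages, not only the retained ones, so it stays meaningful even when most traffic falls outside the top list.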

Note: this is in fact already mostly implemented on the main branch, but it needs to be revisited following recent discussions and the name change.

Important to discuss / adjust: how do we make the distinction between a real object and an asset? Currently we suppose that everything with an html/epub/pdf content-type is a real object (and hence tracked) and everything else is an asset (and hence ignored).
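The current assumption could be sketched as follows. This is only an illustration of the rule described above (substring match on the content-type), not the actual code on main; `TRACKED_MARKERS` and `is_tracked_object` are hypothetical names.

```python
# Substrings that mark a response as a "real object" under the current assumption.
TRACKED_MARKERS = ("html", "epub", "pdf")


def is_tracked_object(content_type):
    """Return True if the response content-type looks like a real object.

    Anything else (images, CSS, JS, ...) is treated as an asset and ignored.
    A missing content-type is also treated as an asset.
    """
    if not content_type:
        return False
    lowered = content_type.lower()
    return any(marker in lowered for marker in TRACKED_MARKERS)
```

For example, `is_tracked_object("text/html; charset=utf-8")` is true while `is_tracked_object("image/png")` is false. Video and audio responses are ignored by this rule.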


benoit74 · 2023-11-09T12:38:16Z

@kelson42 @rgaudin @Popolechien (sorry discussion is a bit technical but it also has "business" sides)

Do you have any input about what should be considered a page in this indicator? We want to get the number of visits per page (package + page name, in fact), but only "real" pages should be counted, not all the "dummy" assets needed to display the page (images, CSS, JS, ...).

The situation is that we base our metrics system on Caddy (web reverse proxy) logs. At this level, we have one log line per web request made to the offspot. Loading one single page in the client (e.g. "Hove_War_Memorial" in "Wikipedia") makes Caddy generate many log lines (one per asset).
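As an illustration of the one-line-per-request problem, here is a small sketch that reads Caddy-style JSON access log lines and counts the response content-types seen. The sample lines are invented and heavily trimmed (real entries carry many more fields), and the host name and URL layout are assumptions; only the `request` and `resp_headers` keys are relied upon.

```python
import json
from collections import Counter

# Invented, simplified log lines for a single page load (hypothetical host and URLs).
SAMPLE_LOG = """\
{"request": {"host": "kiwix.example", "uri": "/wikipedia/Hove_War_Memorial"}, "status": 200, "resp_headers": {"Content-Type": ["text/html; charset=utf-8"]}}
{"request": {"host": "kiwix.example", "uri": "/wikipedia/style.css"}, "status": 200, "resp_headers": {"Content-Type": ["text/css"]}}
{"request": {"host": "kiwix.example", "uri": "/wikipedia/script.js"}, "status": 200, "resp_headers": {"Content-Type": ["application/javascript"]}}
{"request": {"host": "kiwix.example", "uri": "/wikipedia/memorial.png"}, "status": 200, "resp_headers": {"Content-Type": ["image/png"]}}
"""


def response_content_type(entry):
    """Return the bare media type of a log entry, or None if it is absent."""
    headers = entry.get("resp_headers") or {}
    for name, values in headers.items():
        if name.lower() == "content-type" and values:
            # Drop parameters such as "; charset=utf-8".
            return values[0].split(";")[0].strip().lower()
    return None


def content_types_seen(log_text):
    """Count log lines per response content-type."""
    counts = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if line:
            counts[response_content_type(json.loads(line))] += 1
    return counts
```

With the sample above, four log lines are produced for one page view, and only one of them has an HTML content-type.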

My first instinct was to only consider URLs whose content-type contains either "html", "epub" or "pdf", but in fact we also have videos, audio files and maybe other kinds of content.

And one single object of interest might even need multiple HTML requests to load the whole content (e.g. with iframes, ...).

Maybe we should consider only "html" content (instead of "html" + "epub" + ...), since otherwise we get duplicates in many cases (e.g. once for the "html" page holding the video and once for the video itself). But this does not work when the ZIM contains an application (e.g. freecodecamp, sooner or later Kolibri, ...) which has only one page and loads assets dynamically: we will see only one "html" page per client, and it will have a generic name.

I don't know how we could detect that stuff properly (or at least for 80% of the cases).

Probably what we want is to have only (and all) ZIM items marked as is_front, and probably get their Title (as stored in the ZIM and displayed in the suggestions for instance) instead of the whole URL (which does not matter in fact). But AFAIK kiwix-serve does not present this information in the web response (I'm not even 100% sure libkiwix exposes the information). Would it make any sense to make such a modification?

benoit74 · 2023-11-24T14:37:42Z

Given the limited progress in our reflections here, I dug a bit and opened kiwix/libkiwix#1026

benoit74 · 2023-11-27T09:54:20Z

The more we dive into this issue and into kiwix/libkiwix#1026, the more doubts I have about the pertinence of this KPI.

If we want to measure this KPI correctly, it has many impacts in terms of software (see kiwix/libkiwix#1026 to just cover the ZIM case, not all other apps).

This KPI also has a large impact in terms of storage on the offspot (I guess it might account for at least 80% of the DB size), plus a significant impact on the live performance of the metrics subsystem (many log lines will need to be analyzed in fine detail to determine whether each one is a page or an asset).

Finally, I heard in some of our discussions that this might not cover the real need, which is more about specific pages / assets (APKs, ...) for which we need details but which might never make it into the top 50.

So I ask the question pretty boldly: are we really sure we want to include this KPI in metrics v1?

Popolechien · 2023-11-27T13:19:06Z

@benoit74 Just to be clear, the question is about whether we monitor individual article metrics. This does not impact the measurement at the ZIM level, correct?

benoit74 · 2023-11-27T13:43:05Z

Yes, measurement at ZIM level is done (with two KPIs so far, usage in minutes and number of visits)

Popolechien · 2023-11-27T13:57:12Z

Ok so let's park this one for v2 or until we think this through a little better.

stale · 2024-01-31T06:58:34Z

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

benoit74 self-assigned this Nov 9, 2023

benoit74 added the enhancement New feature or request label Nov 9, 2023

benoit74 added this to the v1 milestone Nov 9, 2023

This was referenced Nov 9, 2023

Parse timestamp in reverse proxy logs + make log conversion logic more generic / modular #31

Merged

Decide which KPIs will be included in v1 #8

Closed

This was referenced Nov 10, 2023

Add usage related stuff + fix popularity related stuff #35

Merged

Test on real hardware and measure resources consumptions #7

Closed

benoit74 mentioned this issue Nov 27, 2023

kiwix-serve indicates that the served item is marked "is_front" kiwix/libkiwix#1026

Open

benoit74 modified the milestones: v1, v2 Dec 1, 2023

benoit74 mentioned this issue Dec 1, 2023

Backend: remove PopularPages KPI and associated stuff for v1 #44

Closed

stale bot added the stale label Jan 31, 2024
