Search feature for meeting documents #491

alfredgrip · 2024-09-19T12:06:21Z

It would be fun and useful to have a fuzzy search feature for meeting documents. Sometimes you want to find a specific motion but can't remember which meeting it was brought up at. Since basically all our documents are LaTeX PDFs, there probably exists some tool that allows for indexing and searching among them.

alfredgrip · 2024-09-20T08:50:59Z

Idea so far: for all PDFs, generate text files with their content using Poppler which provides pdftotext. Connect PDFs with the text files and provide some search functionality for the text files, but return PDF results.

https://poppler.freedesktop.org/
https://www.npmjs.com/package/node-poppler
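
A minimal sketch of that idea in bash, assuming pdftotext (from Poppler) and a grep that supports --include are installed. The function names index_pdfs and search_pdfs and the foo.pdf -> foo.pdf.txt naming are made up for illustration, and it does plain substring matching, not fuzzy search:

```shell
#!/bin/bash

# Convert every PDF under a directory to a sibling text file: foo.pdf -> foo.pdf.txt.
# (Hypothetical helper; the naming scheme is an assumption, not an existing convention.)
index_pdfs() {
    local dir=$1 pdf
    find "$dir" -type f -name '*.pdf' -print0 |
    while IFS= read -r -d '' pdf; do
        # Skip PDFs whose text dump is newer than the PDF itself.
        [ "${pdf}.txt" -nt "$pdf" ] && continue
        pdftotext "$pdf" "${pdf}.txt" 2> /dev/null || echo "could not convert: $pdf" >&2
    done
}

# Print the PDFs whose text dump contains the query (case-insensitive, literal match).
search_pdfs() {
    local query=$1 dir=$2
    # grep lists matching .pdf.txt files; sed maps each one back to its PDF.
    grep -rliF --include='*.pdf.txt' -- "$query" "$dir" | sed 's/\.txt$//' | sort
}
```

Usage would be something like `index_pdfs ./documents` once after each upload, then `search_pdfs "motion" ./documents` to list the matching PDFs.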

alfredgrip · 2024-09-20T08:54:50Z

@01ste02 I vaguely remember you had a script that checked how many times your name was mentioned in guild documents, how did that work?

01ste02 · 2024-09-21T06:18:41Z

I found an old message on Discord containing this script:

#!/bin/bash
# Usage: ./filename.sh "Name" NumberOfLatestBoardMeeting
# Both arguments are optional and may be given in either order.

PERSON="Axel Svensson"
LATEST_MEETING=26

if [ $# -eq 1 ]; then
    # A lone argument is the meeting number if it is an integer, otherwise the name.
    if [[ "$1" =~ ^-?[0-9]+$ ]]; then
        LATEST_MEETING=$1
    else
        PERSON=$1
    fi
elif [ $# -eq 2 ]; then
    if [[ "$1" =~ ^-?[0-9]+$ ]]; then
        LATEST_MEETING=$1
        PERSON=$2
    else
        PERSON=$1
        LATEST_MEETING=$2
    fi
fi

cd /tmp || exit 1

# Download each protocol and its "_2" variant, then dump the first page (-l 1) to text.
# Missing files are expected, so errors are silenced.
for fix in ".pdf" "_2.pdf"; do
    for i in $(seq -f "%02g" 1 "${LATEST_MEETING}"); do
        file="protokoll_S${i}_2023${fix}"
        #echo "https://minio.api.dsek.se/documents/public/2023/S${i}/${file}"
        wget -q "https://minio.api.dsek.se/documents/public/2023/S${i}/${file}" 2> /dev/null
        pdftotext -l 1 "${file}" 2> /dev/null
        rm -f "${file}"
    done
done

# Succeeds if PERSON occurs as a whole word in any text dump for meeting $1.
# Using grep -q means a meeting with two matching files is still counted once.
mentioned() {
    grep -qw -- "${PERSON}" /tmp/protokoll_S"$1"_* 2> /dev/null
}

present=()
absent=()
for i in $(seq -f "%02g" 1 "${LATEST_MEETING}"); do
    if mentioned "${i}"; then
        present+=("S${i}")
    else
        absent+=("S${i}")
    fi
done

echo "${PERSON} has attended approximately ${#present[@]} board meetings"
echo "${PERSON} probably attended these meetings:"
for m in "${present[@]}"; do echo "${m}"; done
echo ""
echo "Which means that ${PERSON} probably did not attend:"
for m in "${absent[@]}"; do echo "${m}"; done

# Clean up the text dumps.
rm -f /tmp/protokoll_S*

In essence, it downloads all protocols from that year (2023 is hardcoded) and dumps them to text using pdftotext. The plain text is then searched. Beware that this script was built during a board meeting where I was extra bored and wanted to procrastinate, so the quality is not too great. :)

If you want to build a search for the website, an easy solution is to dump all pdfs we upload to text and just search through those texts and display files containing the search string. This won't really get you "this page, this line, this column" unless you do some magic stuff when you dump to plaintext. If you prepend the page and row from the pdf to each line of plaintext, you could probably display that along with the file in the results.
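
A rough sketch of the "prepend the page and row" part, assuming the text comes from pdftotext, which separates pages with a form feed character. The function name annotate_pages and the page:line: prefix format are made up for illustration:

```shell
#!/bin/bash

# Read pdftotext output on stdin and prefix every line with "page:line:".
# Pages are split on the form feed (\f) that pdftotext writes after each page.
annotate_pages() {
    awk 'BEGIN { RS = "\f" }
    {
        # Each record is one page; NR is therefore the page number.
        n = split($0, lines, "\n")
        # The page text ends with a newline, which leaves an empty last piece.
        if (n > 0 && lines[n] == "") n--
        for (i = 1; i <= n; i++) printf "%d:%d:%s\n", NR, i, lines[i]
    }'
}
```

With something like `pdftotext file.pdf - | annotate_pages > file.pdf.txt` (file name hypothetical), a plain `grep -rF "motion" .` over the dumps would then print the file, page and line next to each hit. A column could be found afterwards by locating the search string within the matched line.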

Since the pdfs are user-uploaded, we just need to make sure that the file is actually a pdf so that we do not essentially execute arbitrary code that is uploaded as a .pdf...
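
One cheap check, as a sketch: a PDF file starts with the bytes %PDF-, so the upload can be rejected before it is handed to pdftotext if that header is missing. The function name is_pdf is made up, and this only confirms the header. It does not prove the rest of the file is well-formed or harmless.

```shell
#!/bin/bash

# Succeeds if the file exists and begins with the PDF header "%PDF-".
# tr strips NUL bytes so that binary files do not trigger shell warnings.
is_pdf() {
    [ -f "$1" ] && [ "$(head -c 5 -- "$1" | tr -d '\0')" = "%PDF-" ]
}
```

In an upload handler this could be used as `is_pdf "$upload" || reject`, where `$upload` and `reject` stand in for whatever the real handler uses.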

So basically, I had already built what you suggested above: generate text files from the PDFs with pdftotext, search those text files, and return the PDFs as results.

alfredgrip added the enhancement New feature or request label Sep 19, 2024
