Search feature for meeting documents #491

Open
alfredgrip opened this issue Sep 19, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@alfredgrip
Contributor

It would be fun and useful to have a fuzzy search feature for meeting documents. Sometimes you want to find a specific motion but can't remember which meeting it was brought up at. Since basically all our documents are PDFs generated from LaTeX, there is probably some tool that can index and search them.

@alfredgrip alfredgrip added the enhancement New feature or request label Sep 19, 2024
@alfredgrip
Contributor Author

Idea so far: for all PDFs, generate text files with their content using Poppler which provides pdftotext. Connect PDFs with the text files and provide some search functionality for the text files, but return PDF results.

https://poppler.freedesktop.org/
https://www.npmjs.com/package/node-poppler
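
A minimal shell sketch of that idea, assuming the extracted text for `foo.pdf` is stored next to it as `foo.txt` (the default output name when running `pdftotext foo.pdf`). The function name `search_docs` and the directory layout are hypothetical, and the matching is a plain case-insensitive substring search, not a fuzzy one. It relies on `grep --include`, which GNU and BSD grep provide.

```shell
#!/bin/bash
# Hypothetical helper: search the extracted .txt files under a directory
# and print the path of the PDF each matching text file came from.
# Assumes foo.txt sits next to foo.pdf.
search_docs() {
    local query=$1 dir=$2
    # -r: recurse, -l: list matching files only,
    # -i: ignore case, -F: treat the query as a fixed string
    grep -rliF --include='*.txt' -e "$query" "$dir" \
        | sed 's/\.txt$/.pdf/' \
        | sort
}
```

It could be called as `search_docs "motion" ./documents`, which prints one PDF path per matching document and nothing when there is no match.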

@alfredgrip
Contributor Author

@01ste02 I vaguely remember you had a script that checked how many times your name was mentioned in guild documents, how did that work?

@01ste02

01ste02 commented Sep 21, 2024

I found an old message on Discord containing this script:

#!/bin/bash
# Usage: ./filename.sh "Name" NumberOfLatestBoardMeeting
# Both arguments are optional and may be given in either order.
PERSON="Axel Svensson"
LATEST_MEETING=26

# An all-digit argument is the meeting number; anything else is the name.
for arg in "${@:1:2}"; do
    if [[ "$arg" =~ ^[0-9]+$ ]]; then
        LATEST_MEETING=$arg
    else
        PERSON=$arg
    fi
done

cd /tmp || exit 1

# Download each set of minutes (plus a "_2" variant, if one exists) and
# convert only its first page (-l 1) to text. Missing files are skipped silently.
for fix in ".pdf" "_2.pdf"; do
    for i in $(seq -f "%02g" 1 "${LATEST_MEETING}"); do
        file="protokoll_S${i}_2023${fix}"
        #echo "https://minio.api.dsek.se/documents/public/2023/S${i}/${file}"
        wget -q "https://minio.api.dsek.se/documents/public/2023/S${i}/${file}" 2> /dev/null
        pdftotext -l 1 "${file}" 2> /dev/null
        rm -f "${file}"
    done
done

# Sort meetings by whether the name appears (as a whole word) in the extracted text.
# Testing with grep -q counts each meeting once, even when both variants match.
attended=()
absent=()
for i in $(seq -f "%02g" 1 "${LATEST_MEETING}"); do
    if grep -qsw -- "${PERSON}" "protokoll_S${i}_2023"*.txt; then
        attended+=("S${i}")
    else
        absent+=("S${i}")
    fi
done

echo "${PERSON} has attended about ${#attended[@]} board meetings"
echo "${PERSON} probably attended these meetings:"
for m in "${attended[@]}"; do echo "$m"; done
echo ""
echo "Which means that ${PERSON} probably did not attend:"
for m in "${absent[@]}"; do echo "$m"; done
rm -f /tmp/protokoll_S*

In essence, it downloads all the meeting minutes (protokoll) from that year and converts the first page of each to text using pdftotext. The plain text is then searched. Beware that this script was built during a board meeting where I was extra bored and wanted to procrastinate, so the quality is not too great... :)

If you want to build a search for the website, an easy solution is to convert every PDF we upload to text, search those texts, and display the files containing the search string. This won't really get you "this page, this line, this column" unless you do some extra work when you convert to plain text. If you prepend the page and row from the PDF to each line of plain text, you could probably display that along with the file in the results.
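
One way the page-and-row prefix could be produced, as a rough sketch: pdftotext separates pages with a form feed character, so the output can be split on it and each line numbered within its page. The function name `annotate_pages` is hypothetical, and the line numbers refer to lines of the extracted text, which need not match the visual rows of the PDF exactly.

```shell
#!/bin/bash
# Hypothetical helper: prefix every line of pdftotext output with "page:line: ".
# Reads the extracted text on stdin and writes the annotated text to stdout.
annotate_pages() {
    awk '
        # Treat each form-feed-separated chunk as one record, i.e. one page.
        BEGIN { RS = "\f" }
        {
            n = split($0, lines, "\n")
            # Drop the empty fragment left after the final newline of the page.
            if (n > 0 && lines[n] == "") n--
            # NR is the page number, i the line number within that page.
            for (i = 1; i <= n; i++) printf "%d:%d: %s\n", NR, i, lines[i]
        }
    '
}
```

For example, `pdftotext -layout protokoll.pdf - | annotate_pages > protokoll.txt` would write the annotated text, so a later `grep` hit already carries its page and line.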

Since the PDFs are user-uploaded, we just need to make sure that each file is actually a PDF, so that we do not essentially execute arbitrary code that is uploaded as a .pdf...
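
A cheap first check, sketched under the assumption that looking at the file signature is enough for a first pass: a conforming PDF begins with the bytes `%PDF-`. The function name `is_pdf` is hypothetical. This only rejects files that are obviously something else; it does not prove the file is well-formed or harmless, so the conversion step should still treat the input as untrusted.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the file starts with the PDF
# signature "%PDF-". A cheap sanity check, not full validation.
is_pdf() {
    local file=$1
    [ -f "$file" ] || return 1
    # Read the first five bytes; strip NUL bytes so the shell can compare them.
    [ "$(head -c 5 "$file" | tr -d '\0')" = "%PDF-" ]
}
```

It could gate the conversion step, for example `is_pdf upload.pdf && pdftotext upload.pdf`.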

So basically I had built what you suggested in:

Idea so far: for all PDFs, generate text files with their content using Poppler which provides pdftotext. Connect PDFs with the text files and provide some search functionality for the text files, but return PDF results.

https://poppler.freedesktop.org/ https://www.npmjs.com/package/node-poppler
