Chatbots are very popular right now, and much of the openly accessible information on the web is stored in some kind of MediaWiki. A Retrieval-Augmented Generation (RAG) chatbot is becoming a powerful alternative to traditional data gathering. This project provides a basic template for building your own chatbot that runs locally on Linux.
MediaWikis hosted by Fandom usually allow you to download an XML dump of the entire wiki as it currently exists. This project primarily leverages Langchain, along with a few other open source projects, to combine many of the readily available quickstart guides into a complete vertical application based on MediaWiki data.
graph TD;
a[/xml dump a/] --MWDumpLoader--> emb
b[/xml dump b/] --MWDumpLoader--> emb
emb{Embedding} --> db
db[(Chroma)] --Document Retriever--> lc
hf(Huggingface) --Sentence Transformer --> emb
hf --LLM--> modelfile
modelfile[/Modelfile/] --> Ollama
Ollama(((Ollama))) <-.ChatOllama.-> lc
lc{Langchain} <-.LLMChain.-> cl(((Chainlit)))
click db href "https://github.com/chroma-core/chroma"
click hf href "https://huggingface.co/"
click cl href "https://github.com/Chainlit/chainlit"
click lc href "https://github.com/langchain-ai/langchain"
click Ollama href "https://github.com/jmorganca/ollama"
multi-mediawiki-rag # $HOME/app
├── .chainlit
│ ├── .langchain.db # Server Cache
│ └── config.toml # Server Config
├── app.py
├── chainlit.md
├── config.yaml
├── data # VectorDB
│ ├── 47e4e036-****-****-****-************
│ │ └── *
│ └── chroma.sqlite3
├── embed.py
├── entrypoint.sh
└── requirements.txt
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
These steps assume you are using a modern Linux OS like Ubuntu 22.04 with Python 3.10+.
apt-get install -y curl git python3-venv
git clone https://github.com/tylertitsworth/multi-mediawiki-rag.git
cd multi-mediawiki-rag
curl https://ollama.ai/install.sh | sh
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip setuptools wheel
pip install -r requirements.txt
- Run the above setup steps
- Download a MediaWiki's XML dump by browsing to `/wiki/Special:Statistics` or by using a tool like `wikiteam3`
  - If downloading directly, download only the current pages, not the entire history
  - If using `wikiteam3`, scrape only namespace 0
  - Provide the dump in the following format: `sources/<wikiname>_pages_current.xml`
- Edit `config.yaml` with the location of the XML MediaWiki data you downloaded and other configuration information (a rough illustration follows this list)
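The exact schema of `config.yaml` lives in the repository; as a hypothetical illustration only (the `mediawikis` and `model` keys below are assumptions, not the repo's real schema), a script could read it like this:

```python
# Hypothetical sketch of reading config.yaml; the "mediawikis" and "model"
# keys are illustrative assumptions, not the repository's actual schema.
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# e.g. {"forgottenrealms": "sources/forgottenrealms_pages_current.xml", ...}
for wiki, dump_path in config.get("mediawikis", {}).items():
    print(f"{wiki}: {dump_path}")

print("LLM:", config.get("model"))  # the Ollama model name used later
```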
Caution
Installing Ollama creates a new user and a service on your system. To avoid this, follow the manual installation steps and launch the Ollama API with `ollama serve` instead.
After installing Ollama, we can use a Modelfile to download an LLM and tune it to be more precise for document retrieval QA.
ollama create volo -f ./Modelfile
Tip
Choose a model from the Ollama model library and download it with `ollama pull <modelname>:<version>`, then edit the `model` field in `config.yaml` with the same information.
- Download a model of choice from Huggingface with `git clone https://huggingface.co/<org>/<modelname> model/<modelname>`.
- If your model of choice is not in GGUF format, convert it with `docker run --rm -v $PWD/model/<modelname>:/model ollama/quantize -q q4_0 /model`.
- Modify the Modelfile's `FROM` line to contain the path to the `q4_0.bin` file in the `<modelname>` directory.
Your XML data needs to be loaded and transformed into embeddings to create a Chroma VectorDB.
python embed.py
2023-12-16 09:50:53 - Loaded .env file
2023-12-16 09:50:55 - Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
2023-12-16 09:51:18 - Use pytorch device: cpu
2023-12-16 09:56:09 - Anonymized telemetry enabled. See
https://docs.trychroma.com/telemetry for more information.
Batches: 100%|████████████████████████████████████████| 1303/1303 [1:23:14<00:00, 3.83s/it]
...
Batches: 100%|████████████████████████████████████████| 1172/1172 [1:04:08<00:00, 3.28s/it]
2023-12-16 19:47:01 - Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
2023-12-16 19:47:33 - Use pytorch device: cpu
Batches: 100%|████████████████████████████████████████████████| 1/1 [00:00<00:00, 40.41it/s]
A Tako was an intelligent race of octopuses found in the Kara-Tur setting. They were known for
their territorial nature and combat skills, as well as having incredible camouflaging abilities
that allowed them to blend into various environments. Takos lived in small tribes with a
matriarchal society led by one or two female rulers. Their diet consisted mainly of crabs,
lobsters, oysters, and shellfish, while their ink was highly sought after for use in calligraphy
within Kara-Tur.
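The exact implementation of `embed.py` is in the repository; the following is only a minimal sketch of the general Langchain indexing pattern it follows, with assumed import paths (they differ across Langchain versions) and assumed chunking parameters:

```python
# Minimal sketch of the indexing step: load a MediaWiki XML dump, split it
# into chunks, embed the chunks, and persist them to a local Chroma database.
# Import paths and chunk sizes are assumptions, not the repo's exact values.
from langchain_community.document_loaders import MWDumpLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter

docs = MWDumpLoader(file_path="sources/<wikiname>_pages_current.xml").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)
# persist_directory="data" writes the index into the data/ folder shown above
Chroma.from_documents(chunks, embeddings, persist_directory="data")
```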
Choose a new File type Document Loader or App Document Loader and add it using your own script. Check out the provided example.
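For instance, a custom script might swap `MWDumpLoader` for another Langchain loader before embedding. This is a hypothetical illustration; the `TextLoader` choice and file path are placeholders:

```python
# Hypothetical example of using a different Langchain document loader;
# the loader choice and file path are placeholders.
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

docs = TextLoader("sources/notes.txt").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)
# ...then embed `chunks` into the same Chroma database as in the sketch above.
```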
chainlit run app.py -h
Access the Chatbot GUI at http://localhost:8000.
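`app.py` wires the retriever and the Ollama-hosted LLM together behind the Chainlit UI. The snippet below is a minimal sketch of one way to assemble that chain, not the repo's exact code; the import paths, the `RetrievalQA` chain type, and the model name `volo` are assumptions:

```python
# Minimal sketch of the retrieval QA wiring behind the Chainlit app;
# import paths, the chain type, and the model name ("volo") are assumptions.
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)
vectordb = Chroma(persist_directory="data", embedding_function=embeddings)

llm = ChatOllama(model="volo")  # created earlier with `ollama create volo -f ./Modelfile`
qa = RetrievalQA.from_chain_type(llm=llm, retriever=vectordb.as_retriever())

print(qa.invoke({"query": "What is a Tako?"})["result"])
```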
export DISCORD_BOT_TOKEN=...
chainlit run app.py -h
Tip
Develop locally with ngrok.
This chatbot is hosted on Huggingface Spaces for free, so it is very slow due to the minimal hardware resources allocated to it. The provided Dockerfile offers a generic way to host the solution as a single unified container; however, this method is not ideal and can lead to many issues if used for professional production systems.
Cypress tests modern web applications with visual debugging. Here it is used to test the Chainlit UI functionality.
npm install
# Run Test Suite
bash cypress/test.sh
Note
Cypress requires node >= 16.
Pytest is a mature full-featured Python testing tool that helps you write better programs.
pip install pytest
# Test Embedding Functions
pytest test/test_embed.py -W ignore::DeprecationWarning
# Test e2e with Ollama Backend
pytest test -W ignore::DeprecationWarning
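The repository's own suites live under `test/`; as a hedged illustration of the kind of check a test for the embedding step might make (assuming `embed.py` has already populated the Chroma index in `data/`), consider:

```python
# Hypothetical test: the persisted Chroma index should return documents for a
# simple query. Assumes embed.py has already populated the data/ directory.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma


def test_similarity_search_returns_documents():
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-mpnet-base-v2"
    )
    vectordb = Chroma(persist_directory="data", embedding_function=embeddings)
    results = vectordb.similarity_search("Tako", k=2)
    assert len(results) > 0
```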