
Merge pull request #19 from marklogic/feature/docs-fixes
Various doc fixes
rjrudin authored Sep 24, 2024
2 parents fcadf30 + 6a5142b commit 5a729a4
Showing 6 changed files with 30 additions and 29 deletions.
8 changes: 4 additions & 4 deletions docs/embedding.md
@@ -4,8 +4,8 @@ title: Embedding Examples
nav_order: 5
---

-The vector queries shown in the [langchain](../rag-langchain-python/README.md),
-[langchain4j](../rag-langchain-java), and [langchain.js](../rag-langchain-js/README.md) RAG examples
+The vector queries shown in the [LangChain](rag-examples/rag-python.md),
+[langchain4j](rag-examples/rag-java.md), and [LangChain.js](rag-examples/rag-javascript.md) RAG examples
depend on embeddings - vector representations of text - being added to documents in MarkLogic. Vector queries can
then be implemented using [the new vector functions](https://docs.marklogic.com/12.0/js/vec) in MarkLogic 12.
This project demonstrates the use of a
@@ -21,9 +21,9 @@ documents in MarkLogic.

## Setup

-This example depends both on the [main setup for all examples](../setup/README.md) and also on having run the
+This example depends both on the [main setup for all examples](setup.md) and also on having run the
"Split to multiple documents" example program in the
-[document splitting examples](../splitting-langchain-java/README.md). That example program used langchain4j to split
+[document splitting examples](splitting.md). That example program used langchain4j to split
the text in Enron email documents and write each chunk of text to a separate document. This example will then use
langchain4j to generate an embedding for the chunk of text and add it to each chunk document.
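As a rough illustration of what "generate an embedding for the chunk of text and add it to each chunk document" means, consider the sketch below. It is illustrative only: the real example uses langchain4j with an Azure OpenAI embedding model, and `fake_embedding` here is a made-up stand-in, not a real model call.

```python
# Illustrative sketch: attach an embedding to each chunk document.
# `fake_embedding` is a hypothetical stand-in for a real embedding model call.

def fake_embedding(text: str) -> list:
    # Deterministic toy vector derived from the text; NOT a real embedding.
    return [sum(ord(c) for c in text) % 97 / 97.0, len(text) / 100.0]

def add_embedding(chunk_doc: dict) -> dict:
    # Add the vector representation alongside the chunk's existing text.
    chunk_doc["embedding"] = fake_embedding(chunk_doc["text"])
    return chunk_doc

chunks = [{"uri": "/enron/chunk-1.json", "text": "Meeting moved to Tuesday."}]
enriched = [add_embedding(doc) for doc in chunks]
```

The point is simply that each chunk document ends up carrying both its original text and a vector, which MarkLogic can then index for vector queries.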

12 changes: 6 additions & 6 deletions docs/index.md
@@ -15,27 +15,27 @@ execute these examples as-is, you will need an Azure OpenAI account and API key.

## Setup

-If you would like to try out the example programs, please [follow these instructions](setup/README.md).
+If you would like to try out the example programs, please [follow these instructions](setup.md).

## RAG Examples

MarkLogic excels at supporting RAG, or ["Retrieval-Augmented Generation"](https://python.langchain.com/docs/tutorials/rag/),
via its schema-agnostic nature as well as its powerful and flexible indexing. This repository contains the following
examples of RAG with MarkLogic:

-- The [rag-langchain-python](rag-langchain-python/README.md) project demonstrates RAG with Python, langchain, and MarkLogic.
-- The [rag-langchain-java](rag-langchain-java/README.md) project demonstrates RAG with Java, langchain4j, and MarkLogic.
-- The [rag-langchain-js](rag-langchain-js/README.md) project demonstrates RAG with JavaScript, langchain.js, and MarkLogic.
+- The [LangChain](rag-examples/rag-python.md) project demonstrates RAG with Python, LangChain, and MarkLogic.
+- The [langchain4j](rag-examples/rag-java.md) project demonstrates RAG with Java, langchain4j, and MarkLogic.
+- The [LangChain.js](rag-examples/rag-javascript.md) project demonstrates RAG with JavaScript, LangChain.js, and MarkLogic.

## Splitting / Chunking Examples

A RAG approach typically benefits from sending multiple smaller segments or "chunks" of text to an LLM. Please
-see [this guide on splitting documents](splitting-langchain-java/README.md) for more information on how to split
+see [this guide on splitting documents](splitting.md) for more information on how to split
your documents and why you may wish to do so.

## Embedding examples

To utilize the vector queries shown in the RAG Examples listed above, embeddings - vector representations of text -
should be added to your documents in MarkLogic.
-See [this guide on adding embeddings](embedding-langchain-java/README.md) for more information.
+See [this guide on adding embeddings](embedding.md) for more information.

12 changes: 6 additions & 6 deletions docs/rag-examples/rag-java.md
@@ -28,7 +28,7 @@ A key feature of MarkLogic is its ability to index all text in a document during
with MarkLogic is to select documents based on the words in a user's question.

To demonstrate this, you can run the Gradle `askWordQuery` task with any question. This example program uses a custom
-langchain retriever that selects documents in the `ai-examples-content` MarkLogic database containing one or more words
+langchain4j retriever that selects documents in the `ai-examples-content` MarkLogic database containing one or more words
in the given question. It then includes the top 10 most relevant documents in the request that it sends to Azure OpenAI.
For example:

@@ -46,8 +46,8 @@ of the configured deployment model):
You can alter the value of the `-Pquestion=` parameter to be any question you wish.

-Note as well that if you have tried the [Python langchain examples](../rag-langchain-python/README.md), you will notice
-some differences in the results. These differences are primarily due to the different prompts used by langchain and
+Note as well that if you have tried the [Python LangChain examples](rag-python.md), you will notice
+some differences in the results. These differences are primarily due to the different prompts used by LangChain and
langchain4j. See [the langchain4j documentation](https://docs.langchain4j.dev/intro) for more information on prompt
templates when using langchain4j.

@@ -99,7 +99,7 @@ the following process:

To try RAG with a vector query, you will need to have installed MarkLogic 12 and also have defined
`AZURE_EMBEDDING_DEPLOYMENT_NAME` in your `.env` file. Please see the
-[top-level README in this repository](../README.md) for more information.
+[setup guide](../setup.md) for more information.

You can now run the Gradle `vectorQueryExample` task:

@@ -117,11 +117,11 @@ An example result is shown below:
The results are similar to, but slightly different from, the results shown above for a simple word query. You can compare
the document URIs printed by each program to see that a different set of documents is selected by each approach.

-For an example of how to add embeddings to your data, please see [this embeddings example](../embedding-langchain-java/README.md).
+For an example of how to add embeddings to your data, please see [this embeddings example](../embedding.md).

## Summary

The three RAG approaches shown above - a simple word query, a contextual query, and a vector query - demonstrate how
-easily data can be queried and retrieved from MarkLogic using langchain. Identifying the optimal approach for your own
+easily data can be queried and retrieved from MarkLogic using langchain4j. Identifying the optimal approach for your own
data will require testing the approaches you choose and possibly leveraging additional MarkLogic indexes and/or
further enriching your data.
11 changes: 6 additions & 5 deletions docs/rag-examples/rag-javascript.md
@@ -23,8 +23,9 @@ Minimum versions of npm are dependent on the version of Node.
See [Node Releases](https://nodejs.org/en/about/previous-releases#looking-for-latest-release-of-a-version-branch)
for more information.

-For this LangChain.js example, in addition to the environment variables in the `.env` file described in the README in the
-root directory of this project, you'll also need to add the `AZURE_OPENAI_API_INSTANCE_NAME` setting to the `.env` file.
+For this LangChain.js example, in addition to the environment variables in the `.env` file described in the
+[setup guide](../setup.md), you'll also need to add the `AZURE_OPENAI_API_INSTANCE_NAME` setting to the `.env` file.
+
```
OPENAI_API_VERSION=2023-12-01-preview
AZURE_OPENAI_ENDPOINT=<Your Azure OpenAI endpoint>
@@ -69,14 +70,14 @@ documents are first selected in a manner similar to the approaches shown above -
set of indexes that have long been available in MarkLogic. The documents are then further filtered and sorted via
the following process:

-1. An embedding of the user's question is generated using [langchain and Azure OpenAI](https://python.langchain.com/docs/integrations/text_embedding/).
+1. An embedding of the user's question is generated using [LangChain.js and Azure OpenAI](https://python.langchain.com/docs/integrations/text_embedding/).
2. Using MarkLogic's new vector API, the generated embedding is compared against the embeddings in each
selected crime event document to generate a similarity score for each document.
3. The documents with the highest similarity scores are sent to the LLM to augment the user's question.

To try the `askVectorQuery.js` module, you will need to have installed MarkLogic 12 and also have defined
`AZURE_EMBEDDING_DEPLOYMENT_NAME` in your `.env` file. Please see the
-[top-level README in this repository](../README.md) for more information.
+[setup guide](../setup.md) for more information.

You can now run `askVectorQuery.js`:
```
@@ -97,6 +98,6 @@ the document URIs printed by each program to see that a different set of documen
## Summary

The three RAG approaches shown above - a simple word query, a contextual query, and a vector query - demonstrate how
-easily data can be queried and retrieved from MarkLogic using langchain. Identifying the optimal approach for your own
+easily data can be queried and retrieved from MarkLogic using LangChain.js. Identifying the optimal approach for your own
data will require testing the approaches you choose and possibly leveraging additional MarkLogic indexes and/or
further enriching your data.
14 changes: 7 additions & 7 deletions docs/rag-examples/rag-python.md
@@ -6,7 +6,7 @@ nav_order: 1
---

[Retrieval Augmented Generation (RAG)](https://python.langchain.com/docs/tutorials/rag/) can be implemented in Python
-with [langchain](https://python.langchain.com/docs/introduction/) and MarkLogic via a "retriever". The examples in this
+with [LangChain](https://python.langchain.com/docs/introduction/) and MarkLogic via a "retriever". The examples in this
directory demonstrate three different kinds of retrievers that you can consider for your own AI application.

## Table of contents
@@ -28,7 +28,7 @@ python -m venv .venv
source .venv/bin/activate
```

-Once you have a virtual environment created, run the following to install the necessary langchain dependencies along
+Once you have a virtual environment created, run the following to install the necessary LangChain dependencies along
with the [MarkLogic Python client](https://pypi.org/project/marklogic-python-client/):

pip install --quiet --upgrade langchain langchain-community langchain_openai marklogic_python_client
@@ -40,7 +40,7 @@ You are now ready to execute the example RAG programs.
A key feature of MarkLogic is its ability to index all text in a document during ingest. Thus, a simple approach to RAG
with MarkLogic is to select documents based on the words in a user's question.

-To demonstrate this, you can run the `ask_word_query.py` module with any question. The module uses a custom langchain
+To demonstrate this, you can run the `ask_word_query.py` module with any question. The module uses a custom LangChain
retriever that selects documents in the `ai-examples-content` MarkLogic database containing one or more of the words
in the given question. It then includes the top 10 most relevant documents in the request that it sends to Azure OpenAI.
For example:
@@ -85,14 +85,14 @@ documents are first selected in a manner similar to the approaches shown above -
set of indexes that have long been available in MarkLogic. The documents are then further filtered and sorted via
the following process:

-1. An embedding of the user's question is generated using [langchain and Azure OpenAI](https://python.langchain.com/docs/integrations/text_embedding/).
+1. An embedding of the user's question is generated using [LangChain and Azure OpenAI](https://python.langchain.com/docs/integrations/text_embedding/).
2. Using MarkLogic's new vector API, the generated embedding is compared against the embeddings in each
selected crime event document to generate a similarity score for each document.
3. The documents with the highest similarity scores are sent to the LLM to augment the user's question.
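The three steps above can be sketched in plain Python. This is a conceptual sketch only: in the actual example, MarkLogic's vector API computes the similarity server-side, and the document URIs and vectors below are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two vectors: dot(a, b) / (|a| * |b|).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical chunk documents already selected by a word or contextual query,
# each with a stored embedding (invented values for illustration).
docs = [
    ("/chunk-1.json", [0.9, 0.1]),
    ("/chunk-2.json", [0.1, 0.9]),
    ("/chunk-3.json", [0.7, 0.3]),
]

# Step 1 (stand-in): the question's embedding would come from the embedding model.
question_embedding = [1.0, 0.0]

# Steps 2-3: score each document against the question and keep the best matches.
ranked = sorted(docs, key=lambda d: cosine_similarity(question_embedding, d[1]), reverse=True)
top = [uri for uri, _ in ranked[:2]]  # these would be sent to the LLM
```

The highest-scoring chunks are the ones whose stored vectors point in nearly the same direction as the question's vector, which is what "similarity score" means in the steps above.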

To try the `ask_vector_query.py` module, you will need to have installed MarkLogic 12 and also have defined
`AZURE_EMBEDDING_DEPLOYMENT_NAME` in your `.env` file. Please see the
-[top-level README in this repository](../README.md) for more information.
+[setup guide](../setup.md) for more information.

You can now run `ask_vector_query.py`:

@@ -107,11 +107,11 @@ An example result is shown below:
The results are similar to, but slightly different from, the results shown above for a simple word query. You can compare
the document URIs printed by each program to see that a different set of documents is selected by each approach.

-For an example of how to add embeddings to your data, please see [this embeddings example](../embedding-langchain-java/README.md).
+For an example of how to add embeddings to your data, please see [this embeddings example](../embedding.md).

## Summary

The three RAG approaches shown above - a simple word query, a contextual query, and a vector query - demonstrate how
-easily data can be queried and retrieved from MarkLogic using langchain. Identifying the optimal approach for your own
+easily data can be queried and retrieved from MarkLogic using LangChain. Identifying the optimal approach for your own
data will require testing the approaches you choose and possibly leveraging additional MarkLogic indexes and/or
further enriching your data.
2 changes: 1 addition & 1 deletion docs/splitting.md
@@ -30,7 +30,7 @@ to show how easily you can split and store chunks of text and thus get you start

## Setup

-Assuming you have followed the [setup instructions for these examples](../setup/README.md), then you already have a
+Assuming you have followed the [setup instructions for these examples](setup.md), then you already have a
database in your MarkLogic cluster named `ai-examples-content`. This database contains a small set - specifically,
3,034 text documents - of the
[Enron email dataset](https://www.loc.gov/item/2018487913/) in a collection named `enron`. These documents are good
