This repository builds a sample application that detects fraudulent transactions in real time using vector search in Azure Cosmos DB.

AzureCosmosDB/vector-search-fraud-detection

Fraud Detection System Using Azure Cosmos DB and Azure OpenAI

Overview

This project implements a fraud detection system that combines Azure Cosmos DB with Azure OpenAI embeddings. It detects suspicious activity based on transaction patterns, geographical information, and vector similarity between embeddings generated through the Azure OpenAI embedding API. The system stores transaction data in Cosmos DB, generates an embedding for each transaction's location, and runs vector searches to detect anomalous transactions.

Prerequisites

To set up and run this project, the following Python packages are required:

  1. python-dotenv: For loading environment variables from a .env file.
  2. openai: To interact with the OpenAI API for generating embeddings.
  3. geopy: For geocoding city names into latitude and longitude coordinates.
  4. azure-cosmos: For interacting with the Azure Cosmos DB service.

You can install these packages by running the following in a notebook cell (drop the leading ! when running in a shell):

!pip install python-dotenv
!pip install openai
!pip install geopy
!pip install azure-cosmos

Environment Setup

Create a .env file that contains your connection details for Azure Cosmos DB and Azure OpenAI. The following template lists the environment variables it should include:

NOSQL_URI=<your_cosmos_db_uri>
NOSQL_PRIMARY_KEY=<your_cosmos_db_primary_key>
AOAI_ENDPOINT=<your_openai_endpoint>
AOAI_KEY=<your_openai_api_key>
API_VERSION=<openai_api_version>
AOAI_EMBEDDING_DEPLOYMENT=<openai_embedding_deployment_name>
AOAI_EMBEDDING_DEPLOYMENT_MODEL=<openai_embedding_model_name>
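
A minimal sketch of how these variables might be validated at startup. It assumes the .env file has already been loaded into the process environment (for example with python-dotenv's load_dotenv()). The helper name read_settings is hypothetical and is not part of the repository.

```python
import os

# Variable names taken from the .env template above.
REQUIRED_VARS = (
    "NOSQL_URI",
    "NOSQL_PRIMARY_KEY",
    "AOAI_ENDPOINT",
    "AOAI_KEY",
    "API_VERSION",
    "AOAI_EMBEDDING_DEPLOYMENT",
    "AOAI_EMBEDDING_DEPLOYMENT_MODEL",
)


def read_settings(env=None):
    """Return the required settings as a dict, or raise if any are missing.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    """
    env = os.environ if env is None else env
    # Treat unset and empty values the same way: both count as missing.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing early with the full list of missing names tends to be easier to debug than an authentication error raised later by one of the service clients.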

Project Structure

  1. Cosmos DB Setup:

    • A Cosmos DB database and container are created if they don't already exist.
    • The container is configured with a vector index for the locationVector field, which allows for efficient vector searches.
  2. Generating Location Embeddings:

    • The system uses an Azure OpenAI embedding model to generate vector representations of geographical locations (latitude and longitude).
    • These embeddings are then stored in Cosmos DB alongside transaction data.
  3. Transaction Storage:

    • A pre-existing JSON file (data_with_tenants.json) containing transaction data is loaded.
    • Each transaction is updated with its corresponding location embeddings before being stored in the Cosmos DB container.
  4. Vector Search:

    • The system allows vector-based searches to detect anomalies by comparing the current transaction's location vector with the average vector of previous transactions.
    • Transactions are retrieved if the vector distance exceeds a certain threshold, indicating a possible anomaly.
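
The threshold check described in step 4 can be sketched in plain Python. This is an illustration only: the README does not state which distance metric or threshold the project uses, so cosine distance and the default threshold of 0.5 are assumptions, and the function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def is_anomalous(current_vector, average_vector, threshold=0.5):
    """Flag a transaction whose location vector is far from the tenant average.

    The threshold is a placeholder value, not one taken from the project.
    """
    return cosine_distance(current_vector, average_vector) > threshold
```

In the project itself the distance computation runs inside Cosmos DB; a local version like this is mainly useful for choosing a threshold on sample data.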

Key Functions

generate_embeddings(lat_lon)

Generates embeddings for a given latitude and longitude using OpenAI's embedding model.
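
A hedged sketch of what such a function might look like. It takes the client and deployment name as explicit arguments so it can be exercised without network access; the name embed_location, the extra parameters, and the text format used to encode the coordinates are all assumptions rather than the project's actual code. The client is expected to expose embeddings.create(input=..., model=...) in the way the openai package's AzureOpenAI client does.

```python
def embed_location(lat_lon, client, deployment):
    """Return the embedding vector for a (latitude, longitude) pair.

    `client` should behave like openai.AzureOpenAI, and `deployment` is the
    embedding deployment name (AOAI_EMBEDDING_DEPLOYMENT in the .env file).
    """
    lat, lon = lat_lon
    # Assumption: coordinates are serialized as "lat, lon" with 4 decimals.
    text = f"{lat:.4f}, {lon:.4f}"
    response = client.embeddings.create(input=text, model=deployment)
    # The response carries one embedding per input; a single input was sent.
    return response.data[0].embedding
```

Passing the client in, instead of creating it inside the function, keeps credentials handling in one place and lets a stub client stand in during tests.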

get_average_location_vector(container, tenant_id)

Fetches the average location vector for all transactions associated with a specific tenant from the Cosmos DB container.
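
The averaging step can be sketched as an element-wise mean. This assumes the tenant's locationVector values have already been read from the container into a list; the helper name average_vectors is hypothetical, and the project may compute the average differently.

```python
def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("no vectors to average")
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("all vectors must have the same length")
    count = len(vectors)
    # zip(*vectors) groups the i-th component of every vector together.
    return [sum(column) / count for column in zip(*vectors)]
```

For example, average_vectors([[1.0, 2.0], [3.0, 4.0]]) gives [2.0, 3.0].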

vector_search(current_location_vector, tenant_id, average_location_vector, amount, num_results=5)

Performs a vector-based search over the tenant's transactions and reports each result's vector distance to both the current transaction vector and the tenant's average vector, so that transactions far from either can be flagged.
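
One way such a search could be expressed is with the VectorDistance function of the Cosmos DB NoSQL query language. The sketch below only builds the query text and parameter list. The field names come from the example output in this README and the locationVector path from the container setup; the filter, the ordering, the helper name build_vector_search_query, and the omission of the amount argument are assumptions.

```python
def build_vector_search_query(current_vector, average_vector, tenant_id, num_results=5):
    """Build a Cosmos DB NoSQL query string and its parameter list."""
    top_n = int(num_results)  # inlined as a validated integer, not user text
    if top_n <= 0:
        raise ValueError("num_results must be positive")
    query = (
        f"SELECT TOP {top_n} c.TransactionID, c.Amount, c.Timestamp, "
        "c.Location, c.Merchant, c.TenantId, "
        "VectorDistance(c.locationVector, @current) AS ProximityOfCurrentToLast, "
        "VectorDistance(c.locationVector, @average) AS ProximityOfAverageToLast "
        "FROM c WHERE c.TenantId = @tenant_id "
        "ORDER BY VectorDistance(c.locationVector, @current)"
    )
    # Parameters use the name/value shape accepted by azure-cosmos query_items.
    parameters = [
        {"name": "@current", "value": list(current_vector)},
        {"name": "@average", "value": list(average_vector)},
        {"name": "@tenant_id", "value": tenant_id},
    ]
    return query, parameters
```

The returned pair can then be passed to the container's query_items method as query=query and parameters=parameters, with enable_cross_partition_query=True if the container is not partitioned by tenant.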

perform_search(tenant_id, city, query, amount)

Main function to perform the entire search operation. It calculates the average vector, generates the current transaction's embeddings, and runs a vector-based search in Cosmos DB.
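
The flow described above can be sketched as a small orchestration function. Each step is passed in as a callable so the example stays self-contained; run_search and its parameter names are hypothetical. The README does not describe how the query argument of perform_search is used, so it is left out here.

```python
def run_search(tenant_id, city, amount, geocode, embed, fetch_average, search):
    """Chain the steps: geocode -> embed -> tenant average -> vector search.

    geocode(city)            -> (lat, lon) tuple, or None if the lookup fails
    embed(lat_lon)           -> embedding vector for the current transaction
    fetch_average(tenant_id) -> average location vector for the tenant
    search(current, tenant_id, average, amount) -> list of result rows
    """
    lat_lon = geocode(city)
    if lat_lon is None:
        raise ValueError(f"could not geocode {city!r}")
    current_vector = embed(lat_lon)
    average_vector = fetch_average(tenant_id)
    return search(current_vector, tenant_id, average_vector, amount)
```

In the project, the geocoding step would be backed by geopy and the other three steps by the functions listed in this section.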

How to Use

  1. Set Up Environment Variables: Ensure your .env file is correctly configured with the necessary credentials and endpoints for both Cosmos DB and OpenAI.

  2. Run the Search: You can run the following code snippet to perform a vector search and detect anomalies in transactions:

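import pandas as pd  # used only to display the results as a table

# perform_search is defined in the project code (see Key Functions).
tenant_id = "10"
city = "Sweden"       # place name to geocode; this example passes a country
merchant = "Walmart"  # passed as the query argument of perform_search
amount = 1000

results = perform_search(tenant_id, city, merchant, amount)
print(pd.DataFrame(results))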

This prints a DataFrame with the results of the vector-based search, listing transactions that deviate from the tenant's normal patterns.

Example Output

Wrapped in a DataFrame, the output of perform_search lists the transactions returned by the vector search:

  TransactionID   Amount            Timestamp  Location  Merchant  TenantId  ProximityOfCurrentToLast  ProximityOfAverageToLast
0        T3235    282.75  2024-09-15 14:28:38    Boston    Amazon        10                 0.428310                 0.523418
1        T7275    939.29  2024-09-15 14:24:38    Boston   Walmart        10                 0.428310                 0.523418
...

Additional Notes

  • Vector Indexing: The project uses Azure Cosmos DB's DiskANN index for vector searches. Location embeddings are stored as 1536-dimensional float arrays.
  • Azure OpenAI Integration: The project uses Azure OpenAI's embedding API to generate location embeddings.
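
As an illustration of the vector indexing note, a container's vector settings could be described with two policy dictionaries like the ones below. The locationVector path, the 1536 dimensions, and the diskANN index type come from this README; the float32 data type, the cosine distance function, and the excluded path are assumptions.

```python
VECTOR_PATH = "/locationVector"

# Describes the vector stored at VECTOR_PATH.
vector_embedding_policy = {
    "vectorEmbeddings": [
        {
            "path": VECTOR_PATH,
            "dataType": "float32",         # assumption
            "distanceFunction": "cosine",  # assumption
            "dimensions": 1536,
        }
    ]
}

# Adds a DiskANN vector index on the same path. The vector path is excluded
# from the regular index so that writes do not index every array element.
indexing_policy = {
    "includedPaths": [{"path": "/*"}],
    "excludedPaths": [{"path": VECTOR_PATH + "/*"}],
    "vectorIndexes": [{"path": VECTOR_PATH, "type": "diskANN"}],
}
```

Recent versions of the azure-cosmos package accept dictionaries of this shape through the indexing_policy and vector_embedding_policy arguments when a container is created; check the SDK version in use before relying on that.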

License

This project is open-source and available for modification under the MIT License.
