eth-library/dataset-dj

DataDJ

Data DJ is a value-adding service for collections and archives, initially conceived at ETH Library Lab and currently in development at ETH Library. It provides more convenient and efficient access to batches of digitised records and files. The service works in conjunction with collections' existing websites and search portals: the collection's website forwards the user's request for a list of files to the Data DJ, which then gathers and compresses the files and notifies the user via email with a download link.

The sample DataDJ application can be accessed at https://dj-api-ucooq6lz5a-oa.a.run.app/. The requests presented throughout this README are written for the Visual Studio Code REST Client; however, they can easily be adapted for other API clients or curl.

If you are planning to work on this project, contact us to ask for the detailed internal documentation.

Quickstart Guide

1. Request an archive from a list of files

Edit the request below to include your email address and the list of files that you want to download (note the included file path). Additional meta information can be passed in the meta field. The request is written for REST Client but can equally be sent with curl. Once the files have been gathered and compressed, you should receive an email with the download link. This endpoint is meant to be called by a data collection, which forwards the files requested by a user and specifies the user's email address. Please note that archiveID remains empty in the current iteration of the service.

Example:

POST https://dj-api-ucooq6lz5a-oa.a.run.app/archive
Content-Type: application/json
Authorization: Bearer service_key

{
    "email": "[email protected]",
    "archiveID": "",
    "content": [
        {
            "sourceID": "0ff529e3",
            "files": ["/test/dir/file1", "/test/dir/file2"]
        },
        {
            "sourceID": "eba48cdb",
            "files": ["/test/dir/file3", "/test/dir/file4"]
        }],
    "meta": "{meta: information}"
}
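
For collections that call the DJ from Go rather than from REST Client, the request above could be built roughly as follows. This is only a sketch: the type and function names (SourceFiles, ArchiveRequest, newArchiveRequest) are invented for illustration and are not taken from the DataDJ code base, and the email address is a placeholder.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// SourceFiles groups the requested file paths by the source that holds them.
// (Hypothetical type name; the field names follow the JSON example above.)
type SourceFiles struct {
	SourceID string   `json:"sourceID"`
	Files    []string `json:"files"`
}

// ArchiveRequest mirrors the JSON body of the example request.
type ArchiveRequest struct {
	Email     string        `json:"email"`
	ArchiveID string        `json:"archiveID"`
	Content   []SourceFiles `json:"content"`
	Meta      string        `json:"meta"`
}

// newArchiveRequest builds (but does not send) the POST request to /archive.
func newArchiveRequest(baseURL, serviceKey string, body ArchiveRequest) (*http.Request, error) {
	payload, err := json.Marshal(body)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/archive", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+serviceKey)
	return req, nil
}

func main() {
	req, err := newArchiveRequest("https://dj-api-ucooq6lz5a-oa.a.run.app", "service_key", ArchiveRequest{
		Email: "user@example.org", // placeholder address
		Content: []SourceFiles{
			{SourceID: "0ff529e3", Files: []string{"/test/dir/file1", "/test/dir/file2"}},
			{SourceID: "eba48cdb", Files: []string{"/test/dir/file3", "/test/dir/file4"}},
		},
		Meta: "{meta: information}",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	// To actually send the request: resp, err := http.DefaultClient.Do(req)
}
```

The request is only constructed here, so the sketch can be run without a valid service key.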

API Endpoints

Check if DataDJ service is live (Public)

GET https://dj-api-ucooq6lz5a-oa.a.run.app/ping

Register Services, Taskhandler and Sources

1. Register new Service (Admin)

An admin can task the DJ to generate a new service token/key and to send an email with a redeem link to the specified email address. The service key is required by collections to interact with the DJ for anything related to creating and altering archives.

POST https://dj-api-ucooq6lz5a-oa.a.run.app/admin/createKeyLink
Content-Type: application/json
Authorization: Bearer admin_key

{
    "email": "[email protected]"
}

2. Register new Taskhandler (Admin)

A taskhandler is the part of the DataDJ responsible for gathering and compressing the requested files, as well as sending an email containing a download link to the user who requested the files. In order to interact with the API part of the DataDJ, the taskhandler requires a handler token/key similar to a service key. This key can be generated by an admin via the following request and, for now, has to be handed manually to the operator of the taskhandler in question.

POST https://dj-api-ucooq6lz5a-oa.a.run.app/admin/registerHandler
Content-Type: application/json
Authorization: Bearer admin_key

3. Register new Source (Service)

A source is a representation of a collection holding files to be downloaded. It serves to identify which files have to be gathered from where, and to keep track of the origin of every file so that each source's contribution to the final archive can be reported. The registration request returns a source-id, which subsequently has to be used to uniquely identify the source when interacting with the DataDJ.

POST https://dj-api-ucooq6lz5a-oa.a.run.app/source
Content-Type: application/json
Authorization: Bearer service_key

{
    "name": "Test-Source-One",
    "Organisation": "ETHZ"
}

Creating, modifying or downloading archives (Service)

https://dj-api-ucooq6lz5a-oa.a.run.app/archive

This endpoint expects a request that contains four fields:

{
  "email":"",
  "archiveID":"",
  "files":[],
  "meta": ""
}

email, archiveID and meta are strings, whereas files is a list of strings containing the names of the files. Note that the example requests in this README group the file lists by source in a content field rather than in a top-level files field. Depending on which fields are left empty, the API triggers different operations. For now, only option 4 is used in tests; the other options are kept for future use.
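
The selection rule can be summarised in a few lines of Go. This is a sketch of the four numbered operations documented in this README, not the DJ's actual handler code; the names Operation and classify are hypothetical, and how the server treats any other combination of fields is not documented here, so those cases are reported as unknown.

```go
package main

import "fmt"

// Operation enumerates the four documented uses of the /archive endpoint.
type Operation int

const (
	OpUnknown        Operation = iota // combination not covered by the README
	OpCreate                          // 1. create an archive from a list of files
	OpAddFiles                        // 2. add a list of files to an archive
	OpDownload                        // 3. download an archive
	OpDirectDownload                  // 4. directly download a list of files
)

// classify maps the emptiness of the request fields to an operation.
// hasFiles is true when the request carries at least one file.
func classify(email, archiveID string, hasFiles bool) Operation {
	switch {
	case email == "" && archiveID == "" && hasFiles:
		return OpCreate
	case email == "" && archiveID != "" && hasFiles:
		return OpAddFiles
	case email != "" && archiveID != "" && !hasFiles:
		return OpDownload
	case email != "" && archiveID == "" && hasFiles:
		return OpDirectDownload
	default:
		return OpUnknown
	}
}

func main() {
	// An email plus files, but no archiveID, corresponds to option 4.
	fmt.Println(classify("user@example.org", "", true) == OpDirectDownload)
}
```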

1. Create an archive from a list of files

Both email and archiveID are left empty, whereas files contains the names of the files the archive should be initialised with.

Example:

POST https://dj-api-ucooq6lz5a-oa.a.run.app/archive
Content-Type: application/json
Authorization: Bearer service_key

{
    "email": "",
    "archiveID": "",
    "content": [
        {
            "sourceID": "0ff529e3",
            "files": ["/test/dir/file1", "/test/dir/file2"]
        },
        {
            "sourceID": "eba48cdb",
            "files": ["/test/dir/file3", "/test/dir/file4"]
        }],
    "meta": "{meta: information}"
}

2. Add a list of files to an archive

email is left empty. archiveID contains the identifier of a previously created archive and files the list of files you want to add to the archive.

Example:

POST https://dj-api-ucooq6lz5a-oa.a.run.app/archive
Content-Type: application/json
Authorization: Bearer service_key

{
    "email": "",
    "archiveID": "e01fd941",
    "content": [
        {
            "sourceID": "0ff529e3",
            "files": ["/test/dir/file1", "/test/dir/file2"]
        },
        {
            "sourceID": "eba48cdb",
            "files": ["/test/dir/file3", "/test/dir/file4"]
        }],
    "meta": "{meta: information}"
}

3. Download an archive

email contains the email address the download link is sent to, archiveID specifies the archive you want to download and files is left empty. The DataDJ will send you a download link that allows you to download the archive as a .zip file.

Example:

POST https://dj-api-ucooq6lz5a-oa.a.run.app/archive
Content-Type: application/json
Authorization: Bearer service_key

{
    "email": "[email protected]",
    "archiveID": "e01fd941",
    "content": [],
    "meta": ""
}

4. Directly download a list of files as archive

email contains the email address the download link is sent to, archiveID is left empty and files contains the names of the files you want to download. The DJ creates an archive of the files in the request and also returns its identifier in the response, in case that archive needs to be accessed or modified later on. It is not necessary to separately trigger the notification containing the download link, as this happens automatically.

Example:

POST https://dj-api-ucooq6lz5a-oa.a.run.app/archive
Content-Type: application/json
Authorization: Bearer service_key

{
    "email": "[email protected]",
    "archiveID": "",
    "content": [
        {
            "sourceID": "0ff529e3",
            "files": ["/test/dir/file1", "/test/dir/file2"]
        },
        {
            "sourceID": "eba48cdb",
            "files": ["/test/dir/file3", "/test/dir/file4"]
        }],
    "meta": "{meta: information}"
}

Currently, the /archive endpoint returns an object describing the order which was created for the archive in question. Orders are objects telling the taskhandlers which archives should be downloaded.

{
  "orderID": "a5777ffb",
  "archiveID": "4afc3f67",
  "email": "[email protected]",
  "date": "2022-12-14 16:27:28.967665178 +0000 UTC m=+67114.216955617",
  "status": "opened",
  "sources": [
    "0ff529e3"
  ]
}
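
The date field in the example has the shape produced by Go's time.Time.String(), including a trailing monotonic clock reading (m=+...). A client that wants a real timestamp has to strip that suffix before parsing. The sketch below assumes the field always has this shape; the Order struct and parseOrderDate are illustrative names, not part of the DataDJ API, and the email address is a placeholder.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// Order mirrors the example response of the /archive endpoint.
type Order struct {
	OrderID   string   `json:"orderID"`
	ArchiveID string   `json:"archiveID"`
	Email     string   `json:"email"`
	Date      string   `json:"date"`
	Status    string   `json:"status"`
	Sources   []string `json:"sources"`
}

// parseOrderDate drops the monotonic clock suffix (" m=+...") if present
// and parses the remainder using the layout of time.Time.String().
func parseOrderDate(s string) (time.Time, error) {
	if i := strings.Index(s, " m="); i >= 0 {
		s = s[:i]
	}
	return time.Parse("2006-01-02 15:04:05.999999999 -0700 MST", s)
}

func main() {
	raw := `{
	  "orderID": "a5777ffb",
	  "archiveID": "4afc3f67",
	  "email": "user@example.org",
	  "date": "2022-12-14 16:27:28.967665178 +0000 UTC m=+67114.216955617",
	  "status": "opened",
	  "sources": ["0ff529e3"]
	}`

	var o Order
	if err := json.Unmarshal([]byte(raw), &o); err != nil {
		panic(err)
	}
	created, err := parseOrderDate(o.Date)
	if err != nil {
		panic(err)
	}
	fmt.Println(o.OrderID, o.Status, created.UTC().Format(time.RFC3339))
}
```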

Inspecting an archive (Service)

https://data-dj-2021.oa.r.appspot.com/archive/id

This endpoint allows you to inspect the contents of an archive, identified by its id, either in the browser or via an API client. The response is a JSON object representing the archive.

Example:

GET https://dj-api-ucooq6lz5a-oa.a.run.app/archive/a2e11165
Content-Type: application/json
Authorization: Bearer service_key

Example Response:

{
  "id": "a2e11165",
  "content": [
    {
      "sourceID": "0ff529e3",
      "files": [
        "/test/dir/file1",
        "/test/dir/file2"
      ]
    },
    {
      "sourceID": "eba48cdb",
      "files": [
        "/test/dir/file3",
        "/test/dir/file4"
      ]
    }
  ],
  "meta": "{meta: information}",
  "timeCreated": "2022-12-09 13:31:43.320372 +0100 CET m=+305.508934168",
  "timeUpdated": "",
  "status": "opened",
  "sources": [
    "0ff529e3",
    "eba48cdb"
  ]
}
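
Because the response keeps the files grouped by source, a client can derive each source's contribution to the archive directly from it. The following sketch decodes a shortened copy of the example response and counts the files per source; the Archive and SourceFiles types and the filesPerSource helper are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SourceFiles is one entry of the "content" list in the response.
type SourceFiles struct {
	SourceID string   `json:"sourceID"`
	Files    []string `json:"files"`
}

// Archive mirrors the example response of GET /archive/<id>.
type Archive struct {
	ID          string        `json:"id"`
	Content     []SourceFiles `json:"content"`
	Meta        string        `json:"meta"`
	TimeCreated string        `json:"timeCreated"`
	TimeUpdated string        `json:"timeUpdated"`
	Status      string        `json:"status"`
	Sources     []string      `json:"sources"`
}

// filesPerSource returns the number of files each source contributes.
func filesPerSource(a Archive) map[string]int {
	counts := make(map[string]int, len(a.Content))
	for _, c := range a.Content {
		counts[c.SourceID] += len(c.Files)
	}
	return counts
}

func main() {
	// Shortened copy of the example response above.
	raw := `{
	  "id": "a2e11165",
	  "content": [
	    {"sourceID": "0ff529e3", "files": ["/test/dir/file1", "/test/dir/file2"]},
	    {"sourceID": "eba48cdb", "files": ["/test/dir/file3", "/test/dir/file4"]}
	  ],
	  "status": "opened",
	  "sources": ["0ff529e3", "eba48cdb"]
	}`

	var a Archive
	if err := json.Unmarshal([]byte(raw), &a); err != nil {
		panic(err)
	}
	for id, n := range filesPerSource(a) {
		fmt.Printf("source %s contributes %d file(s)\n", id, n)
	}
}
```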

Local Development (Outdated)

  1. make a copy of .env.example and save it as .env.local
  2. replace the example directory paths, bucketnames and other settings as needed.

option a: run with go

download and run the redis image with docker

docker pull redis
docker run --name dj-redis -p 6379:6379 -d redis

start the task handler
open a terminal in project root.
export all of the variables in the .env.local file
run the task handler

source .env.local
export $(cut -d= -f1 .env.local)
go run ./taskHandler/*.go

open a separate terminal in project root.
export all of the variables in the .env.local file
run the api

source .env.local && export $(cut -d= -f1 .env.local)
go run ./api/*.go

note that for any changes in the environment file to take effect, you must export the variables again and restart that part of the application.

option b: run with docker (to be completed)
To run the publisher and subscriber applications using docker, include the path to the .env.local file in the docker run command.

docker run --env-file=./.env.local -p 8080:8080 data-dj-image

Docker commands

  • docker build --platform=linux/amd64 -f Dockerfile.api -t dj-api-amd64:0.0.1 .
  • docker tag dj-api-amd64:0.0.1 europe-west6-docker.pkg.dev/data-dj-2021/dj-docker-repo/dj-api:0.0.1
  • docker push europe-west6-docker.pkg.dev/data-dj-2021/dj-docker-repo/dj-api:0.0.1

Steps for Google Cloud Run

curl -X POST "0.0.0.0:8765/admin/createKeyLink" \
  -H "Authorization: Bearer $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{"email":"[email protected]"}'

Authentication

  • generates a token
  • saves the hashed token in mongo
  • a middleware function validates the token during requests

Set the mongo collection to delete a document after the given number of seconds. This does not apply if the index field is not in the document, e.g. a doc that does not have expiryRequestedDate will not be deleted:

db.apiKeys.createIndex( { "expiryRequestedDate": 1 }, { expireAfterSeconds: 3600 } )
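
The token flow described in this section (generate, store only a hash, validate on each request) might look roughly like the sketch below. It is not the DJ's implementation: the README does not say which hash function or token length is used, so SHA-256 and 32 random bytes are assumptions, the function names are invented, and the MongoDB lookup is replaced by a plain string holding the stored hash.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"strings"
)

// generateToken returns a random token, hex encoded (assumed: 32 bytes).
func generateToken() (string, error) {
	b := make([]byte, 32)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// hashToken returns the hex-encoded SHA-256 digest of the token.
// Only this value would be stored (assumed hash function).
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// bearerToken extracts the token from an "Authorization: Bearer <token>" value.
func bearerToken(header string) (string, bool) {
	const prefix = "Bearer "
	if !strings.HasPrefix(header, prefix) {
		return "", false
	}
	token := strings.TrimSpace(header[len(prefix):])
	return token, token != ""
}

// validate checks an Authorization header value against a stored hash,
// comparing in constant time.
func validate(header, storedHash string) bool {
	token, ok := bearerToken(header)
	if !ok {
		return false
	}
	return subtle.ConstantTimeCompare([]byte(hashToken(token)), []byte(storedHash)) == 1
}

func main() {
	token, err := generateToken()
	if err != nil {
		panic(err)
	}
	stored := hashToken(token) // stands in for the document saved in mongo

	fmt.Println(validate("Bearer "+token, stored)) // header with Bearer prefix
	fmt.Println(validate(token, stored))           // header without the prefix
}
```

In a real service a check like validate would sit in an HTTP middleware that reads the Authorization header of each request.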

Useful Reference Material for Go

  • Learning Go by Jon Bodner
    general reference for programming in GO; types, syntax, imports etc.
    see Ch13 for writing tests

  • Cloud Native Go

Material for MongoDB

http://www.inanzzz.com/index.php/post/g7e8/running-mongodb-migration-script-at-the-docker-startup
