Merge pull request #351 from oduwsdl/issue-70
A monolithic Dockerfile that actually works
machawk1 authored Dec 23, 2017
2 parents 8d7f59e + 9b508cc commit c41f956
Showing 3 changed files with 99 additions and 9 deletions.
51 changes: 46 additions & 5 deletions Dockerfile
@@ -1,8 +1,49 @@
FROM python:2-onbuild
# Use Python 2.7 as default, but facilitate change at build time
ARG PYTHON_TAG=2.7
FROM python:${PYTHON_TAG}

LABEL maintainer="Sawood Alam <@ibnesayeed>"
# Add some metadata
LABEL app.name="InterPlanetary Wayback (IPWB)" \
      app.description="A distributed and persistent archive replay system using IPFS" \
      app.license="MIT License" \
      app.license.url="https://github.com/oduwsdl/ipwb/blob/master/LICENSE" \
      app.repo.url="https://github.com/oduwsdl/ipwb" \
      app.authors="Mat Kelly <@machawk1> and Sawood Alam <@ibnesayeed>"

RUN pip install .
EXPOSE 5000
# Create folders for WARC, CDXJ and IPFS stores
RUN mkdir /warc /cdxj /ipfs

CMD ["ipwb"]
# Download and install IPFS
ENV IPFS_PATH=/ipfs
ARG IPFS_VERSION=v0.4.13
RUN cd /tmp \
    && wget https://dist.ipfs.io/go-ipfs/${IPFS_VERSION}/go-ipfs_${IPFS_VERSION}_linux-amd64.tar.gz \
    && tar xvfz go-ipfs*.tar.gz \
    && mv go-ipfs/ipfs /usr/local/bin/ipfs \
    && rm -rf go-ipfs* \
    && ipfs init

# Make necessary changes to prepare the environment for IPWB
RUN apt update && apt install -y locales \
    && rm -rf /var/lib/apt/lists/* \
    && echo "en_US.UTF-8 UTF-8" > /etc/locale.gen \
    && locale-gen

# Add a custom entrypoint script
COPY entrypoint.sh /usr/local/bin/
RUN chmod a+x /usr/local/bin/entrypoint.sh

# Copy source files and install IPWB
WORKDIR /ipwb
COPY requirements.txt ./
RUN pip install -r requirements.txt
COPY . ./
RUN python setup.py install

# Run ipfs daemon in background
# Wait for the daemon to be ready
# Runs provided command
ENTRYPOINT ["entrypoint.sh"]

# Index a sample WARC file and replay it
CMD ["ipwb", "replay"]
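
Note on the IPFS download step: the archive URL carries the version twice, once in the path and once in the file name, so a build-time override of `IPFS_VERSION` has to reach both places. A minimal shell sketch of deriving the URL from a single value, assuming the `dist.ipfs.io` layout used in the `RUN` instruction above (the function name `ipfs_dist_url` is hypothetical and not part of this commit):

```shell
# Hypothetical helper: build the go-ipfs download URL from one version string.
# Assumes the URL layout used by the Dockerfile in this commit.
ipfs_dist_url() {
    printf '%s\n' "https://dist.ipfs.io/go-ipfs/$1/go-ipfs_$1_linux-amd64.tar.gz"
}

# The default version from the Dockerfile
ipfs_dist_url v0.4.13
```

With the default version this prints the same URL the Dockerfile downloads.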
35 changes: 31 additions & 4 deletions README.rst
@@ -8,7 +8,7 @@ Peer-To-Peer Permanence of Web Archives

|travis| |pypi| |codecov|

InterPlanetary Wayback (ipwb) facilitates permanence and collaboration in web archives by disseminating the contents of `WARC`_ files into the IPFS network. `IPFS`_ is a peer-to-peer content-addressable file system that inherently allows deduplication and facilitates opt-in replication. ipwb splits the header and payload of WARC response records before disseminating into IPFS to leverage the deduplication, builds a `CDXJ index`_ with references to the IPFS hashes returned, and combines the header and payload from IPFS at the time of replay.

InterPlanetary Wayback primarily consists of two scripts:

@@ -32,7 +32,7 @@ The latest release of ipwb can be installed using pip:
The latest development version containing changes not yet released can be installed from source:

.. code-block:: bash

      $ git clone https://github.com/oduwsdl/ipwb
      $ cd ipwb
      $ pip install -r requirements.txt

@@ -69,7 +69,7 @@ In a separate terminal session (or the same if you started the daemon in the bac
      $ ipwb index ipwb/samples/warcs/salam-home.warc

`indexer.py`, the default script called by the ipwb binary, partitions the WARC into WARC Records, extracts the WARC Response headers, HTTP response headers, and HTTP response body (payload). Relevant information is extracted from the WARC Response headers, temporary byte strings are created for the HTTP response headers and payload, and these two byte strings are pushed into IPFS. The resulting CDXJ data is written to `stdout` by default but can be redirected to a file, e.g.,

.. code-block:: bash
@@ -86,7 +86,7 @@ An archival replay system is also included with ipwb to re-experience the conten
.. code-block:: bash

      $ ipwb replay

A CDXJ index can also be provided and used by the ipwb replay system by specifying the path of the index file as a parameter to the replay system:

.. code-block:: bash
@@ -104,6 +104,33 @@ Once started, the replay system's web interface can be accessed through a web br

.. (TODO: provide instructions on specifying a CDXJ file/directory to be read from the CDXJ replay system)

Using Docker
------------

A pre-built Docker image is available and can be run as follows:

.. code-block:: bash

      $ docker container run -it --rm -p 5000:5000 oduwsdl/ipwb

The container will run an IPFS daemon, index a sample WARC file, and replay it using the newly created index.
It will take a few seconds to be ready, then the replay will be accessible at http://localhost:5000/ with a sample archived page.

To index and replay your own WARC file, mount your WARC store at `/warc` using the `-v` (or `--volume`) flag and run the following commands:

.. code-block:: bash

      $ docker container run -it --rm -v /path/to/warc/folder:/warc oduwsdl/ipwb ipwb index /warc/custom.warc.gz > /cdxj/custom.cdxj
      $ docker container run -it --rm -p 5000:5000 oduwsdl/ipwb ipwb replay /cdxj/custom.cdxj

Generated index files (CDXJ) and the IPFS store can also be persisted using volumes.
Bind mount corresponding host directories at `/cdxj` and `/ipfs`.

To build an image from the source, run the following command from the directory where the source code is checked out.

.. code-block:: bash

      $ docker image build -t ipwb .

Help
-------------
Usage of sub-commands in ipwb can be accessed through providing the `-h` or `--help` flag, like any of the below.
22 changes: 22 additions & 0 deletions entrypoint.sh
@@ -0,0 +1,22 @@
#!/usr/bin/env bash

set -e

if [[ ("$1" = "ipwb") && ("$1" != "$@") && ("$@" != *" -h"*) && ("$@" != *" --help"*) ]]
then
    # Initialize IPFS if not initialized already
    if [ ! -f "$IPFS_PATH/config" ]
    then
        ipfs init
    fi
    # Run the IPFS daemon in background
    ipfs daemon &

    # Wait for IPFS daemon to be ready
    while ! curl -s localhost:5001 > /dev/null
    do
        sleep 1
    done
fi

exec "$@"
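
Note on the guard at the top of the script: the daemon is started only when the command is `ipwb` with at least one further argument and no `-h` or `--help` flag. The sketch below restates that condition as a standalone function so it can be tried without Docker or IPFS. The name `needs_ipfs_daemon` is hypothetical, and the `case` form is a POSIX rewriting of the `[[ ... ]]` test, not the code from this commit.

```shell
# Hypothetical helper mirroring the guard in entrypoint.sh.
# Returns 0 only for an "ipwb" command that has further arguments
# and does not ask for help.
needs_ipfs_daemon() {
    guard_all="$*"                              # all arguments joined by spaces
    [ "$1" = "ipwb" ] || return 1               # some other command, e.g. bash
    [ "$1" != "$guard_all" ] || return 1        # bare "ipwb" without sub-command
    case "$guard_all" in
        *" -h"*|*" --help"*) return 1 ;;        # help output needs no daemon
    esac
    return 0
}

# Try a few command lines; $cmd is left unquoted on purpose so that it
# splits into separate arguments.
for cmd in "ipwb replay" "ipwb" "ipwb index -h" "bash"; do
    if needs_ipfs_daemon $cmd; then
        echo "$cmd -> start IPFS daemon"
    else
        echo "$cmd -> skip IPFS daemon"
    fi
done
```

Only `ipwb replay` passes the guard in this list; the other three command lines run without starting the daemon.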

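Note on the readiness loop: it polls `localhost:5001` once per second until `curl` succeeds and has no upper bound, so a daemon that never comes up would leave the container waiting. A hedged sketch of a bounded variant follows. The function `wait_until` and its attempt limit are hypothetical and not part of this commit.

```shell
# Hypothetical helper: run a probe command once per second until it
# succeeds, giving up after a maximum number of attempts.
# Usage: wait_until MAX_ATTEMPTS COMMAND [ARGS...]
wait_until() {
    wait_max="$1"
    shift
    wait_attempt=0
    until "$@" > /dev/null 2>&1; do
        wait_attempt=$((wait_attempt + 1))
        if [ "$wait_attempt" -ge "$wait_max" ]; then
            return 1    # probe never succeeded within the limit
        fi
        sleep 1
    done
    return 0
}

# Demonstration with stand-in probes instead of curl
wait_until 3 true && echo "probe succeeded"
wait_until 2 false || echo "gave up after 2 attempts"
```

In the entrypoint this could be called as `wait_until 30 curl -s localhost:5001`, followed by an error exit when it fails. The limit of 30 is an arbitrary illustration.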