An iterable downloader for e621.net that uses e621 db post exports, designed for deep learning.

slobodaapl/py621dl

py621dl - an iterable E621 downloader

This package is meant for deep learning applications and automation, not for downloading specific images or post IDs or for searching by tag. For those uses, please check out py621, which is not related to this package in any way.

The package is meant to be used with the official E621 db export of post information. See here for available db exports and here for general information on the API.

!! This is a pre-release version, and is not meant for production use !!

Proper documentation, tests, and automated updates to the package will be added later.

Installation

You can install the package using pip install py621dl. Python 3.11 or newer is required.

Usage

The E621Downloader class must be initialized using the Reader class, to which the csv file must be passed. The Reader supports only the official db export csv files of the format "posts-YYYY-MM-DD.csv.gz", either compressed or uncompressed.
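The file naming rule above can be checked before a path is handed to Reader. The following is a minimal sketch using only the standard library; it is not how Reader itself validates or opens files. The helper names is_posts_export and open_posts_export are hypothetical, and UTF-8 encoding is an assumption.

```python
import gzip
import re
from pathlib import Path

# Pattern for the export name described above; the ".gz" suffix is optional.
EXPORT_NAME = re.compile(r"^posts-\d{4}-\d{2}-\d{2}\.csv(\.gz)?$")


def is_posts_export(path):
    """Return True if the file name looks like posts-YYYY-MM-DD.csv[.gz]."""
    return EXPORT_NAME.match(Path(path).name) is not None


def open_posts_export(path):
    """Open a posts export as text, handling gzip compression transparently."""
    path = Path(path)
    if not is_posts_export(path):
        raise ValueError(f"unexpected export file name: {path.name}")
    if path.suffix == ".gz":
        # Text mode ("rt") decompresses and decodes in one step.
        return gzip.open(path, "rt", encoding="utf-8", newline="")
    return path.open("r", encoding="utf-8", newline="")
```

A check like this can be useful for failing early with a clear message when a wrong file (for example a tags or pools export) is passed by mistake.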

The E621Downloader class can be initialized with the following parameters:

  • csv_reader: the Reader object
  • timeout: the timeout for the requests, in seconds
  • retries: the number of retries for the requests

It can be used as an iterable, yielding lists of np.ndarray objects of the images. The list size will depend on your batch_size specified for Reader. The images are of opencv BGR format. The downloader automatically handles and filters deleted or flagged posts, and will attempt to fill the batch with new images so that it will always yield a full batch.
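Because the images arrive in BGR channel order, most deep learning pipelines will want them converted to RGB first. A minimal sketch with NumPy follows, assuming every image in the batch is a 3-channel array of shape (height, width, 3). The function name bgr_batch_to_rgb is hypothetical and not part of py621dl.

```python
import numpy as np


def bgr_batch_to_rgb(batch):
    """Return a new list with each HxWx3 BGR image converted to RGB."""
    # Reversing the last axis swaps the B and R channels. The result of the
    # slice is a view with a negative stride, so np.ascontiguousarray makes
    # a contiguous copy that downstream libraries can consume directly.
    return [np.ascontiguousarray(img[..., ::-1]) for img in batch]
```

The input arrays are left unmodified, so the original BGR batch can still be passed to OpenCV functions afterwards.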

The Reader class can be initialized with the following parameters:

  • csv_file: the path to the csv file
  • batch_size: the size of the batch to be returned by the E621Downloader
  • excluded_tags: a list of E621 tags to be excluded from the results
  • minimum_score: the minimum score of the posts to be included in the results
  • chunk_size: the size of the chunk to be read from the csv file at once
  • checkpoint_file: the path to the checkpoint file, to resume from any point. If path doesn't exist, a new file will be created.
  • repeat: whether to automatically restart from the beginning of the csv file when the end is reached. Otherwise StopIteration is raised. E621Downloader handles this exception and raises its own StopIteration when the end is reached.
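The parameters listed above can be collected in one place and passed to Reader as keyword arguments. This is only a sketch: the parameter names come from the list above, but every value, the value types (integers, a list of strings, a boolean), and the helper name build_reader_kwargs are assumptions made for illustration.

```python
def build_reader_kwargs(csv_file, checkpoint_file=None):
    """Collect keyword arguments for Reader; all values are illustrative."""
    kwargs = {
        "csv_file": csv_file,
        "batch_size": 16,
        "excluded_tags": ["tag_a", "tag_b"],  # placeholder tag names
        "minimum_score": 10,
        "chunk_size": 1000,
        "repeat": False,  # stop at the end of the csv file instead of looping
    }
    if checkpoint_file is not None:
        # Per the list above, a missing checkpoint file is created.
        kwargs["checkpoint_file"] = checkpoint_file
    return kwargs


# Usage (requires py621dl to be installed and the csv file to exist):
# from py621dl import Reader, E621Downloader
# reader = Reader(**build_reader_kwargs("posts-2022-10-30.csv.gz", "checkpoint.txt"))
# downloader = E621Downloader(reader, timeout=10, retries=3)
```

Keeping the settings in a plain dictionary makes it easy to log them alongside a training run, so a resumed run can be checked against the configuration that produced the checkpoint.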

Example use

from py621dl import Reader, E621Downloader

reader = Reader("posts-2022-10-30.csv.gz")
downloader = E621Downloader(reader, timeout=10, retries=3)

for batch in downloader:
    # do something with the batch
    pass

Contributing

For any opened issues, please create a linked branch for that issue and create pull requests into the test branch for completed edits.

To start contributing to this repository, you will need Python 3.11 and Poetry. Then navigate to the folder into which you cloned this repository and run the following:

poetry env use 3.11
poetry install --with dev

Note that Python 3.11 must be on your PATH for poetry env use 3.11 to work. Otherwise, refer to the Poetry documentation.

To write your own tests for new code (strongly recommended), run pip install -e . from the project folder. This installs the package in editable mode from the current state of the files, so pytest can use it as if it were properly installed on an end-user system, without rebuilding and reinstalling it after every change.

The same editable install also lets you simply use import py621dl while debugging, and any changes in your code are reflected immediately.
