This repository contains the code, data, and models for the paper Improving Language Understanding from Screenshots. In this paper, we focus on improving the language understanding ability of "screenshot LMs" (models that process everything -- including text -- within visual inputs) and propose patch-and-text prediction (PTP), a novel pre-training objective for screenshot LMs.
- Environment
- Preparing the data
- Reproducing our pre-trained models
- Downloading our models
- Fine-tuning PTP models
- Bugs or Questions?
- Citation
Firstly, please install the latest compatible PyTorch.
Then, install all the required packages by running:
pip install -r requirements.txt
We strongly recommend using the exact same transformers and accelerate versions for best reproducibility. Please check out the renderer readme to make sure that the renderer is correctly configured.
For our encoder-decoder experiments and the train-from-scratch autoregressive screenshot LM experiments, we use Wikipedia+BookCorpus as the pre-training data. The already-tokenized dataset is available on Hugging Face. You can download it by running:
git clone https://huggingface.co/datasets/princeton-nlp/ptp_data data
This folder contains four files:
- wikibook_256_opt_tk_train.npy and wikibook_256_opt_tk_val.npy: Wiki+Book using the OPT tokenizer, 256 tokens per example (for encoder-decoder).
- wikibook_512_llama_tk_train.npy and wikibook_512_llama_tk_val.npy: Wiki+Book using the LLaMA tokenizer, 512 tokens per example (for train-from-scratch autoregressive).
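As a quick sanity check after downloading, the sketch below inspects one of these files without loading it fully into memory. It assumes each .npy file holds a single 2-D integer array of shape (number of examples, tokens per example); that layout is inferred from the file descriptions above and is not confirmed here. The helper name describe_token_file is hypothetical and not part of this repository.

```python
import numpy as np


def describe_token_file(path):
    """Return (num_examples, tokens_per_example, dtype) for a tokenized .npy file.

    Assumption: the file stores one 2-D array of token ids. This helper is
    illustrative only and is not part of the repository.
    """
    # mmap_mode="r" maps the file lazily, so large files are not read into RAM.
    tokens = np.load(path, mmap_mode="r")
    if tokens.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {tokens.shape}")
    num_examples, seq_len = tokens.shape
    return num_examples, seq_len, str(tokens.dtype)


# Example usage (path follows the git clone command above):
# print(describe_token_file("data/wikibook_256_opt_tk_val.npy"))
```

If the assumption holds, the second value should be 256 for the OPT files and 512 for the LLaMA files.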
To continue training Sheared-llama on screenshots, we use Sheared-llama's pipeline for processing RedPajama data. Please follow its data-processing guideline. Our example config uses ./data/sheared-llama-rp/for_ft for continued pre-training and ./data/sheared-llama-rp/eval for evaluation.
To reproduce our models, run the following command (requires 8 GPUs):
NUM_GPU=8 bash run_multiple_gpus.sh {CONFIG PATH}
There are three example configs:
- run_configs/ptp.yaml: our main PTP model (encoder-decoder).
- run_configs/screenshot-llama-380m.yaml: train-from-scratch autoregressive.
- run_configs/screenshot-llama-1.3b-from-sheared-llama.yaml: continuing pre-training sheared-llama.
You can also use the single-GPU script run_single_gpu.sh for testing. To keep the same hyperparameters, adjust the per-GPU batch size (per_device_train_batch_size) or the gradient accumulation steps (gradient_accumulation_steps) accordingly if you are not using 8 GPUs or if your GPUs cannot fit our preset batch sizes.
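The adjustment can be worked out with a little arithmetic. The sketch below assumes the usual data-parallel relation, global batch size = number of GPUs x per-GPU batch size x gradient accumulation steps, and solves for the accumulation steps. The function name and the numbers in the example are hypothetical; read the actual preset values from the config you are running.

```python
def gradient_accumulation_steps(target_global_batch, num_gpus, per_device_batch):
    """Accumulation steps needed to reach target_global_batch.

    Assumes global batch = num_gpus * per_device_batch * accumulation steps.
    Illustrative helper, not part of the repository.
    """
    per_step = num_gpus * per_device_batch
    if per_step <= 0:
        raise ValueError("num_gpus and per_device_batch must be positive")
    if target_global_batch % per_step != 0:
        raise ValueError(
            f"global batch {target_global_batch} is not divisible by "
            f"{num_gpus} GPUs x {per_device_batch} per GPU"
        )
    return target_global_batch // per_step


# Hypothetical example: a config tuned for 8 GPUs x 32 examples x 1 step
# (global batch 256), re-run on 2 GPUs that only fit 16 examples each.
steps = gradient_accumulation_steps(256, num_gpus=2, per_device_batch=16)
print(steps)
```

The resulting value would go into gradient_accumulation_steps, with per_device_train_batch_size set to the per-GPU batch size that fits in memory.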
We provide the following pre-trained models on Hugging Face:
- princeton-nlp/ptp
- princeton-nlp/screenshot-llama-380m
- princeton-nlp/screenshot-llama-1.3b-from-sheared-llama
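One possible way to fetch these checkpoints is the huggingface_hub package, sketched below. This only downloads the files; how to load them depends on the model classes in this repository, which the sketch does not cover. The helper names and the checkpoints directory are hypothetical choices, and huggingface_hub must be installed separately.

```python
import os

# Repository ids as listed above.
PTP_MODELS = [
    "princeton-nlp/ptp",
    "princeton-nlp/screenshot-llama-380m",
    "princeton-nlp/screenshot-llama-1.3b-from-sheared-llama",
]


def checkpoint_dir(repo_id, local_root="checkpoints"):
    """Map a repo id such as 'princeton-nlp/ptp' to a local folder (hypothetical layout)."""
    return os.path.join(local_root, repo_id.split("/")[-1])


def download_checkpoints(repo_ids=PTP_MODELS, local_root="checkpoints"):
    """Download each model repository and return {repo_id: local_path}."""
    # Imported lazily so the helpers above work without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    paths = {}
    for repo_id in repo_ids:
        paths[repo_id] = snapshot_download(
            repo_id=repo_id, local_dir=checkpoint_dir(repo_id, local_root)
        )
    return paths


# Example usage (requires network access):
# print(download_checkpoints(["princeton-nlp/ptp"]))
```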
Fine-tuning code for PTP models is coming soon!
If you have any questions related to the paper, feel free to email Tianyu ([email protected]). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please describe the problem in detail so we can help you better and more quickly!
Please cite our paper if you use PTP in your work:
@article{gao2024improving,
title={Improving Language Understanding from Screenshots},
author={Gao, Tianyu and Wang, Zirui and Bhaskar, Adithya and Chen, Danqi},
journal={arXiv preprint arXiv:2402.14073},
year={2024}
}