GPT-J 6B Inference

Best known configurations for GPT-J 6B inference with Intel® Extension for PyTorch.

Model Information

| Use Case  | Framework | Model Repo                                 | Branch/Commit/Tag | Optional Patch |
|-----------|-----------|--------------------------------------------|-------------------|----------------|
| Inference | PyTorch   | https://huggingface.co/EleutherAI/gpt-j-6b | -                 | -              |

Pre-Requisite

Bare Metal

General setup

Follow the link to build PyTorch, IPEX, TorchVision, and TCMalloc.

Model Specific Setup

  • Install Intel OpenMP

    pip install packaging intel-openmp accelerate
    
  • Set IOMP and tcmalloc preload for better performance (a sketch for locating the libraries follows this list)

    export LD_PRELOAD="<path_to>/tcmalloc/lib/libtcmalloc.so":"<path_to_iomp>/lib/libiomp5.so":$LD_PRELOAD
    
  • Install datasets

    pip install datasets
    
  • Set INPUT_TOKEN before running the model

    export INPUT_TOKEN=32
    (choose from [32 64 128 256 512 1024 2016]; we prefer to benchmark with 32 and 2016)
    
  • Set OUTPUT_TOKEN before running the model

    export OUTPUT_TOKEN=32
    (32 is preferred, though any other length can be set)
    
  • About the BATCH_SIZE in scripts

    Use BATCH_SIZE=1 for realtime mode.
    Use BATCH_SIZE=N for throughput mode (N defaults to 1 and can be tuned to the testing host).
    
  • About the BEAM_SIZE in scripts

    BEAM_SIZE=4 is used by default.
    
  • Run calibration to generate "qconfig.json" before running INT8.

    # Generate "qconfig.json" via calibration:
    bash do_quantization.sh calibration sq  # SmoothQuant is used by default
    
    
  • Set this environment variable to use FP16 AMX if you are on a supported platform (a support-check sketch follows this list)

    export DNNL_MAX_CPU_ISA=AVX512_CORE_AMX_FP16
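
The LD_PRELOAD paths above depend on where the libraries live on your host. A minimal sketch, assuming the intel-openmp pip package from the first bullet provides libiomp5.so inside the active environment (the tcmalloc placeholder path is kept as-is):

    # Sketch: locate libiomp5.so installed by the intel-openmp pip package
    IOMP_LIB=$(find "$(python -c 'import sys; print(sys.prefix)')" -name 'libiomp5.so' | head -n 1)
    export LD_PRELOAD="<path_to>/tcmalloc/lib/libtcmalloc.so:${IOMP_LIB}:$LD_PRELOAD"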
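
The FP16 AMX override only helps on hardware that supports it. As a minimal sketch, assuming a Linux kernel recent enough to expose the amx_fp16 CPU flag, you can guard the export:

    # Sketch: enable the FP16 AMX ISA only when the CPU advertises amx_fp16
    if grep -q amx_fp16 /proc/cpuinfo; then
        export DNNL_MAX_CPU_ISA=AVX512_CORE_AMX_FP16
    fi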
    

Inference

  1. git clone https://github.com/IntelAI/models.git

  2. cd models/models_v2/pytorch/gptj/inference/cpu

  3. Create a virtual environment venv and activate it:

    python3 -m venv venv
    . ./venv/bin/activate
    
  4. Run setup.sh

    ./setup.sh
    
  5. Install the latest CPU versions of torch, torchvision, and intel_extension_for_pytorch

  6. Set up the required environment parameters

| Parameter                                  | Export command                                                      |
|--------------------------------------------|---------------------------------------------------------------------|
| TEST_MODE (THROUGHPUT, ACCURACY, REALTIME) | export TEST_MODE=THROUGHPUT                                         |
| OUTPUT_DIR                                 | export OUTPUT_DIR=$(pwd)                                            |
| PRECISION                                  | export PRECISION=bf16 (fp32, bf32, bf16, fp16, int8-fp32, int8-bf16) |
| MODEL_DIR                                  | export MODEL_DIR=$(pwd)                                             |
| BATCH_SIZE (optional)                      | export BATCH_SIZE=256                                               |
  7. Run run_model.sh (see the consolidated sketch below)
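
Putting steps 1–7 together, an end-to-end bf16 throughput run might look like the sketch below. The exported values are the illustrative ones from the table above; the int8-* precisions additionally require the qconfig.json calibration step from the setup section.

    git clone https://github.com/IntelAI/models.git
    cd models/models_v2/pytorch/gptj/inference/cpu
    python3 -m venv venv
    . ./venv/bin/activate
    ./setup.sh
    # ...install torch, torchvision, and intel_extension_for_pytorch per step 5

    # Required parameters (values are illustrative)
    export TEST_MODE=THROUGHPUT
    export OUTPUT_DIR=$(pwd)
    export PRECISION=bf16
    export MODEL_DIR=$(pwd)
    export INPUT_TOKEN=32
    export OUTPUT_TOKEN=32

    ./run_model.sh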

Output

Single-tile output will typically look like:

 ---------- Summary: ----------
inference-latency: 246.340 sec.
first-token-latency: 38.192 sec.
rest-token-latency: 6.681 sec.
P90-rest-token-latency: 6.857 sec.

Final results of the inference run can be found in the results.yaml file.

results:
 - key: throughput
   value: N/A
   unit: N/A
 - key: latency
   value: 246.340
   unit: s
 - key: accuracy
   value: N/A
   unit: AP
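
If downstream tooling needs these numbers, the latency value can be pulled out of results.yaml with standard tools; a sketch assuming the two-line key/value layout shown above:

    # Sketch: print the value that follows "key: latency" in results.yaml
    awk '/key: latency/ {getline; print $2}' results.yaml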