v4.4.0 (2024-09-09)
Removed: Flash Attention support in the Python package, due to a significant package size increase for minimal performance gain.
Note: Flash Attention remains supported in the C++ package with the `WITH_FLASH_ATTN` option.
Flash Attention may be re-added in the future if substantial improvements are made.
- Support Llama3 (#1751)
- Support Gemma2 (#1772)
- Add log probs for all tokens in vocab (#1755)
- Grouped conv1d (#1749 + #1758)
- Fix pipeline (#1723 + #1747)
- Some improvements in Flash Attention (#1732)
- Fix a crash when using `return_alternatives` on CUDA (#1733)
- Support AWQ quantization (GEMM + GEMV) (#1727)
v4.3.1 (2024-06-10)
Note: The v4.3.0 release could not be pushed to PyPI because it exceeded the project's total size limit (> 20 GB).
- Improve the compilation (#1706 and #1705)
- Fix position bias in tensor parallel mode (#1714)
v4.3.0 (2024-05-17)
- Support phi-3 (8k and 128k) (#1700 and #1680)
- Fix regression Flash Attention (#1695)
v4.2.1 (2024-04-24)
Note: The v4.2.0 release could not be pushed to PyPI because the package size exceeded the limit (> 100 MB).
- Support load/unload for generator/Whisper Attention (#1670)
- Fix Llama 3 (#1671)
v4.2.0 (2024-04-10)
- Support Flash Attention (#1651)
- Implement GEMM for the FLOAT32 compute type with the Ruy backend (#1598)
- Support Conv1D quantization on CPU only (the DNNL and CUDA backends are not supported) (#1601)
- Fix a bug in tensor parallel mode (#1643)
- Use `BestSampler` when the temperature is 0 (#1659)
- Fix a bug in Gemma (#1660)
- Optimize loading/unloading time for Translator with cache (#1645)
v4.1.0 (2024-03-11)
- Support Gemma Model (#1631)
- Support Tensor Parallelism (#1599)
- Avoid initializing unused GPU (#1633)
- Read very large tensors by chunks when the size exceeds the maximum int value (#1636)
- Update the README
v4.0.0 (2024-02-15)
This major version introduces a breaking change by updating to CUDA 12.
- Support CUDA 12
- Add method `to_device()` to the Python class `StorageView` to move data between host and device
- Implement Conv1D with im2col and GEMM to improve performance
- Get tokens in the range of the vocabulary size for Llama models
- Fix a performance regression
- Update cibuildwheel to 2.16.5
v3.24.0 (2024-01-08)
- Support the new option `offset` to ignore the token scores of special tokens
v3.23.0 (2023-12-05)
- Support Phi model
- Fix the conversion for Whisper models without "alignment_heads" in the "generation_config.json"
- Fix forward batch
v3.22.0 (2023-11-22)
- Support "sliding window" and "chunking input" for Mistral
- Take into account the "generation_config.json" and fix "lang_ids" getter for Whisper converter
- Accept a callback even in the "generate_tokens" method
- Fix iomp5 linking with the latest Intel oneAPI on Ubuntu
- Fix "decoder_start_token_id" for T5
v3.21.0 (2023-11-09)
- Minimal support for Mistral (loader and rotary extension for long sequences); no sliding window yet
- Support Distil-Whisper
- Support Whisper-large-v3
v3.20.0 (2023-09-18)
- Update the Transformers converter to support more model architectures:
  - MixFormerSequential (used by microsoft/phi-1_5)
- Accept batch inputs in method `generate_tokens`
- Add method `async_generate_tokens` to return an asynchronous generator compatible with `asyncio` (see the sketch after this list)
- Remove the epsilon value in the softmax CPU kernel for consistency with other implementations
- Optimize the implementation of the Dynamic Time Warping (DTW) function (used for Whisper alignment)
- Avoid an unnecessary copy of the input arguments in method `Whisper::align`
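A minimal sketch of `async_generate_tokens`, assuming `model_dir` is a placeholder path to a converted language model and the prompt token is illustrative:

```python
import asyncio

import ctranslate2

async def main():
    generator = ctranslate2.Generator("model_dir")
    # Tokens are yielded asynchronously as soon as the model produces them.
    async for step_result in generator.async_generate_tokens(["Hello"], max_length=32):
        print(step_result.token, end="", flush=True)

asyncio.run(main())
```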
v3.19.0 (2023-08-31)
- Binary wheels for Python 3.7 are no longer built
- Build wheels for Python 3.12
- Update the Transformers converter to support more model architectures:
  - Falcon-RW
  - DistilBERT
  - Llama with linear RoPE scaling (e.g. Vicuna v1.5)
  - Llama with a non-default RoPE base period (e.g. CodeLlama)
- Accept the token type IDs as inputs for encoder models
- Add property `GenerationStepResult.hypothesis_id` to identify the different hypotheses when running random sampling with `num_hypotheses` > 1
- Improve performance of 8-bit models on CPU:
  - Vectorize the GEMM output dequantization
  - Fuse the GEMM output dequantization with bias and activation
- Allow inputs shorter than 30 seconds in Whisper methods
- Fix incorrect `batch_id` values passed to the callback function
- Fix a shape error in models using both MQA and relative positions
- Fix compilation error related to AVX512 when using GCC 7
- Call `.detach()` on PyTorch tensors before getting the Numpy array in converters
v3.18.0 (2023-08-03)
Converted models now use the same floating point precision as the original models. For example, a model saved in float16 will be converted to a float16 model. Before this change, the weights were cast to float32 by default.
Similarly, selecting int8 keeps non-quantized weights in their original precision unless a more specific quantization type is selected:
- `int8_float32`
- `int8_float16`
- `int8_bfloat16`
- Add property `compute_type` to model instances
- Extend the Python class `StorageView` with additional methods and properties: `to(dtype)`, `device_index`, `device`, `dtype`, `shape` (see the sketch after this list)
- Update the function `get_supported_compute_types` to correctly return bfloat16 when supported
- Update the HF Llama converter to accept extra tokens in the vocabulary
- Fix a shape error when enabling `return_alternatives` with a model using relative positions
- Fix a conversion error when using `torch<1.13`
- Fix a type error when running Whisper models with the bfloat16 type
- Update pybind11 to 2.11.1
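A minimal sketch of the `StorageView` properties, assuming a CPU array built with `StorageView.from_array` (for the `.to(dtype)` conversion listed above, consult the API reference for the accepted dtype values):

```python
import numpy as np

import ctranslate2

x = np.ones((2, 4), dtype=np.float32)
view = ctranslate2.StorageView.from_array(x)

print(view.shape)         # [2, 4]
print(view.dtype)         # data type of the storage
print(view.device)        # "cpu"
print(view.device_index)  # 0

# On CPU, StorageView implements the array interface, so it can be
# converted back to a Numpy array.
y = np.array(view)
```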
v3.17.1 (2023-07-20)
- Fix an error when running models with the new `int8_bfloat16` computation type
- Fix a vocabulary error when converting Llama 2 models with the Transformers converter
- Update the Transformers converter to correctly convert Llama models using GQA
- Stop the decoding when the generator returned by the method `generate_tokens` is closed
v3.17.0 (2023-07-18)
- Add new computation types: `bfloat16` and `int8_bfloat16` (they require a GPU with Compute Capability 8.0 or above; see the sketch after this list)
- Support multi-query attention for encoder-decoder models
- Allow converters to register weights as PyTorch tensors instead of Numpy arrays
- Pass the flag `trust_remote_code` when loading the tokenizer in the Transformers converter
- Improve performance of T5 models by reusing the same relative position bias in every layer
- Whisper: disable the first timestamp decoding rule when a prefix is used
- Install the CMake configuration in the correct library directory (e.g. some platforms use `lib64` instead of `lib`)
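A minimal sketch of selecting one of the new computation types, assuming `model_dir` is a placeholder path to a converted model and an Ampere-class GPU is available:

```python
import ctranslate2

# bfloat16 requires Compute Capability >= 8.0 (e.g. A100, RTX 30xx).
translator = ctranslate2.Translator(
    "model_dir",
    device="cuda",
    compute_type="bfloat16",  # or "int8_bfloat16" for mixed INT8/BF16
)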
v3.16.1 (2023-07-03)
- Fix repeated outputs in version 3.16.0 when using `include_prompt_in_result=False` and a batch input with variable lengths: a typo in the code led to `min_length` being incorrectly applied
- Update the Transformers converter to accept extra tokens for Falcon models
- Release the Python GIL when loading the model
- Initialize the rotary embeddings on the GPU instead of the CPU
- Avoid a copy for the input features passed to the Whisper methods
- Vectorize copy in the Tile CUDA operator
v3.16.0 (2023-06-15)
- Update the Transformers converter to support more architectures:
  - Falcon-40B
  - XLM-RoBERTa
- Add the generation option `sampling_topp` to enable top-p (nucleus) sampling
- Save vocabulary files in the JSON format to better support tokens containing newlines or carriage returns
- Fix the application of `min_length` and `max_length` when using `include_prompt_in_result=False` and a batch input with variable lengths: the length constraint should only apply to the sequence after the prompt
- Update oneDNN to 3.1.1
v3.15.1 (2023-06-09)
- Fix an error when using the new `static_prompt` argument in the methods `generate_tokens` and `generate_batch`
- Improve the performance of models using ALiBi
v3.15.0 (2023-06-06)
- Initial support of encoder-only Transformer models via a new class `ctranslate2.Encoder`
- Update the Transformers converter to support the Falcon models
- Add a generation argument `static_prompt` to optimize the execution for models using system prompts: the model state for this prompt is cached and reused in future calls (see the sketch after this list)
- Support early stopping in greedy search when the callback function returns `True`
- Make the layer norm epsilon value configurable in the model configuration file `config.json`
- Add Tanh as a possible activation function
- Fix a performance issue when running models using ALiBi on the GPU
- Fix application of the rotary embeddings when multi-query attention is used
- Fix conversion of Marian models using `tied-embeddings-all: false`
- Remove the `use_fast` argument when loading Hugging Face tokenizers to use the default tokenizer for the model
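A minimal sketch of `static_prompt`, assuming `model_dir` is a placeholder path to a converted language model and the pre-tokenized system prompt below is illustrative:

```python
import ctranslate2

generator = ctranslate2.Generator("model_dir")

# The model state for this prompt is computed once, cached, and reused
# by future calls that pass the same static_prompt.
system_prompt = ["You", "are", "a", "helpful", "assistant", "."]

results = generator.generate_batch(
    [["Hello", "!"]],
    static_prompt=system_prompt,
    max_length=64,
)
print(results[0].sequences[0])
```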
v3.14.0 (2023-05-26)
- Update the Transformers converter with new architectures:
  - CodeGen
  - GPTBigCode
  - LLaMa
  - MPT
- Update the OpenNMT-py converter to support some recent options:
  - `layer_norm="rms"`
  - `max_relative_positions=-1` (rotary embeddings)
  - `max_relative_positions=-2` (ALiBi)
  - `pos_ffn_activation_fn="silu"`
- Update the OpenNMT-tf converter to support models using different configurations for the encoder and decoder (e.g. post-norm in the encoder and pre-norm in the decoder)
- Implement the multi-query attention (used by GPTBigCode)
- Support paths containing Unicode characters on Windows
- Fix the `generate_tokens` method to properly raise the underlying exception instead of hanging indefinitely
- Fix compilation error when using `-DBUILD_SHARED_LIBS=OFF`
- Fix runtime errors when linking against `libctranslate2.a` without using the "whole archive" flags
v3.13.0 (2023-04-25)
- Support conversion of GPT-NeoX models with the Transformers converter
- Extend the `end_token` argument to also accept a list of tokens
- Add option `return_end_token` to include the end token in the results of the methods `generate_batch` and `translate_batch` (by default the end token is removed)
- Expose the `callback` argument in the methods `generate_batch` and `translate_batch` to get early results from the decoding loop (see the sketch after this list)
- Fall back to a custom threading implementation when OpenMP is not used (which is currently the case for the macOS ARM64 Python wheels)
- Define the CMake package `CTranslate2::ctranslate2` to facilitate the library integration in other CMake projects
- Fix the vocabulary loading when some tokens end with the carriage return
- Implement a fused kernel to apply the rotary embeddings
- Update the Ruy library to commit 363f2522
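A minimal sketch of the decoding callback, assuming `model_dir` is a placeholder path to a converted language model and the prompt token is illustrative:

```python
import ctranslate2

generator = ctranslate2.Generator("model_dir")

def on_token(step_result):
    # Called with a GenerationStepResult for each generated token.
    print(step_result.step, step_result.batch_id, step_result.token)
    return False  # returning True stops the decoding (supported since v3.15.0)

generator.generate_batch(
    [["Hello"]],
    max_length=32,
    beam_size=1,  # the callback applies to greedy decoding
    callback=on_token,
)
```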
v3.12.0 (2023-04-17)
- Add methods `Generator.generate_tokens` and `Translator.generate_tokens` returning a generator that yields tokens as soon as they are generated by the model (not compatible with beam search; see the sketch after this list)
- Improve performance of rotary embeddings on CPU with an alternative implementation that is enabled when setting `rotary_interleave=False` in the model specification (may require to permute QK weights)
- Support a variable number of input frames in method `Whisper.align` to improve batch support
- Expose flag `low_cpu_mem_usage` in the Transformers converter to reduce the memory usage when loading large models (requires the package `accelerate`)
- Fix crash in `Whisper.align` when `num_frames // 2 <= median_filter_width`
- Raise an error if arguments `end_token` or `suppress_sequences` contain tokens that are not in the vocabulary
- Optimize the quantization of FP16 weights during the model conversion
- In the Transformers converter, also load the model weights in FP16 when the selected quantization is `int8_float16`
- Update the Whisper timestamp decoding rules to prevent the generation of segments with zero duration
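A minimal sketch of token streaming with `generate_tokens`, assuming `model_dir` is a placeholder path to a converted language model:

```python
import ctranslate2

generator = ctranslate2.Generator("model_dir")

# Each GenerationStepResult is yielded as soon as the token is generated.
for step_result in generator.generate_tokens(["Hello"], max_length=32):
    print(step_result.token, end="", flush=True)
```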
v3.11.0 (2023-04-06)
- The Python wheels for macOS ARM are now built with the Ruy backend to support INT8 computation. This will change the performance and results when loading an INT8 model and/or using the `auto` compute type. To keep the previous behavior, set `compute_type="float32"`.
- Support conversion of the GPT-J architecture
- Support conversion of models using rotary position embeddings
- Apply the new OpenNMT-py option `decoder_start_token`
- Add option `revision` in the Transformers converter to download a specific revision of the model from the Hugging Face Hub
v3.10.3 (2023-03-28)
- Fix a synchronization issue when the model input is a CUDA storage
v3.10.2 (2023-03-27)
- Select the correct device when copying a `StorageView` instance
v3.10.1 (2023-03-25)
- Add missing device setter in `Whisper.encode`
v3.10.0 (2023-03-24)
- Add `Generator` option `include_prompt_in_result` (`True` by default)
- Add method `Whisper.encode` to only run the Whisper encoder
- Add model properties `Whisper.device` and `Whisper.device_index`
- Update the methods `Whisper.detect_language`, `Whisper.generate`, and `Whisper.align` to accept the encoder output (see the sketch after this list)
- Fix a crash when running `Generator.forward` on GPU and the generator object is destroyed before the forward output
- Fix parsing of Marian YAML vocabulary files containing "complex key mappings" and escaped sequences such as "\x84"
v3.9.1 (2023-03-17)
- Fix missing alignments in the `Whisper.align` result due to a bug in the DTW implementation
- Fix error when converting a Whisper model from a path
v3.9.0 (2023-03-15)
- Support BLOOM language models
- Add method `Whisper.align` to return the text/audio alignment and implement word-level timestamps
- Do not force `intra_threads` to 1 when loading a model on the GPU, as some ops may still run on the CPU
- Disable multithreading when copying a batch of small arrays
v3.8.0 (2023-03-06)
- Experimental support of AVX512 in manually vectorized functions: this code path is not enabled by default but can be enabled by setting the environment variable `CT2_FORCE_CPU_ISA=AVX512`
- Add Transformers converter option `copy_files` to copy any files from the Hugging Face model to the converted model directory
- Expose some Whisper parameters:
  - `max_initial_timestamp_index`
  - `suppress_blank`
  - `suppress_tokens`
- Reduce conversion time for large models by skipping some weights comparisons
- Reduce maximum memory usage when converting Transformers models with `--quantization float16`
- Set FP32 compute type for FP16 convolutions to match the PyTorch behavior and accuracy
- Update oneDNN to 3.0.1
v3.7.0 (2023-02-23)
- Rename the "float" compute type to "float32" for clarity. "float" is still accepted for backward compatibility.
- Add the environment variable `CT2_CUDA_TRUE_FP16_GEMM`. This flag is enabled by default so that FP16 GEMMs run in full FP16. When disabled, the compute type of FP16 GEMMs is set to FP32, which is what PyTorch and TensorFlow do by default.
- Improve the numerical precision of Whisper models running in FP16 by setting the FP32 compute type for GEMMs (same behavior as PyTorch)
- Improve support for running the Whisper models with INT16 quantization
- Ensure the Whisper decoding does not continue past `max_length`, which could previously happen when the prompt was longer than `max_length/2`
- Include the EOS score in the score returned by Whisper during greedy search
v3.6.0 (2023-02-16)
- Build the Windows Python wheels with cuDNN to enable GPU execution of Whisper models
- Add the model attribute `Whisper.is_multilingual`
- Reduce the beam search memory usage by not duplicating the decoder states that are the same in each beam (e.g. the projected memory keys and values)
- Optimize the dot product attention during beam search by moving the query beam dimension to the time dimension
- Fix support of English-only Whisper models
- Include the prefix tokens (if they exist) in the output of `Whisper.generate`
- Log a warning when the model weights are implicitly converted to another type
v3.5.1 (2023-02-13)
- Whisper: fix an incorrect timestamp rule that prevented timestamps from being generated in pairs
- Whisper: ignore the EOS token when applying the length penalty to match the original implementation
v3.5.0 (2023-02-10)
- Add a patience factor for beam search to continue decoding until `beam_size * patience` hypotheses are finished, as described in Kasai et al. 2022
- Implement all GELU variants and select them accordingly when converting models:
  - Tanh approximation (already implemented)
  - Sigmoid approximation
  - Reference implementation based on the CDF
- Fix incorrect outputs of T5 models due to a bug in the CUDA kernel of the RMS normalization
- Raise an error if the Whisper input shape is incorrect
- Optimize the transposition operator used in the multi-head attention when running on GPU
- Remove the upper limit in `python_requires` to facilitate the package installation with tools like Poetry and PDM
v3.4.0 (2023-02-03)
- Fix incorrect vocabulary in M2M100 models after conversion with `transformers>=4.24`
- Fix incorrect model outputs when executing with very large batch sizes on GPU
- Fix memory error in biased decoding: the vector of divergence was read and updated past its length
- Allow setting `prefix_bias_beta` > 0 with `beam_size` == 1
- Prevent timestamps from decreasing during Whisper generation
- Make some error messages more helpful when implementing a custom converter
v3.3.0 (2023-01-02)
- Support T5 models, including the variants T5v1.1 and mT5
- Support loading the model files from memory (see the sketch after this list):
  - Python: see the `files` argument in the constructor of classes loading models
  - C++: see the `models::ModelMemoryReader` class
- Improve the quantization accuracy of OPT models by applying the SmoothQuant technique during conversion (pre-computed activation scales should be passed to the converter option `--activation_scales`)
- Fix conversion of BART-like models from HuggingFace that are using a different number of encoder and decoder layers
- Fix compilation when no BLAS CPU backend is selected
- Remove no longer relevant CMake warning when the project is compiled without oneDNN
- Update oneDNN to 3.0
- Update oneMKL to 2023.0
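A minimal sketch of loading a model from memory via the `files` argument, assuming `model_dir` contains a converted model (the file names below are illustrative; the exact set depends on the model):

```python
import ctranslate2

file_names = ["model.bin", "config.json", "shared_vocabulary.txt"]
model_files = {name: open("model_dir/" + name, "rb") for name in file_names}

# When "files" is given, the model is read from the provided file objects
# instead of from the directory on disk.
translator = ctranslate2.Translator("model_dir", files=model_files)
```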
v3.2.0 (2022-12-12)
- Add decoding option `suppress_sequences` to prevent specific sequences of tokens from being generated (see the sketch after this list)
- Add decoding option `end_token` to stop the decoding on a different token than the model EOS token
- Allow returning multiple random hypotheses from greedy search + random sampling when setting `num_hypotheses` > 1
- Improve support for batch generation with the Whisper model:
  - Improve performance of batch generation with a context (we only require the prompts to have the same length, which is easily done by adapting the number of previous text tokens)
  - Support batch mode for option `return_no_speech_prob`
  - Support cases where some prompts in the batch have the token `<|notimestamps|>` but not others
- Enable the Conv1D layer in more Python wheels:
  - macOS x64 (using oneDNN)
  - macOS ARM64 (using a custom implementation)
  - Linux AArch64 (using a custom implementation)
- Update the OpenNMT-py converter to support the latest checkpoint structure
- Generalize the `TransformerSpec` constructor to accept arbitrary encoder and decoder specifications
- Remove the global compilation flag `-ffast-math`, which introduces unwanted side effects, and enable it only for the layer norm CPU kernel where it is actually useful
- Fix CMake error on Windows when setting `-DOPENMP_RUNTIME=COMP`
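A minimal sketch of the new decoding options, assuming `model_dir` is a placeholder path to a converted translation model and the tokens are illustrative SentencePiece pieces:

```python
import ctranslate2

translator = ctranslate2.Translator("model_dir")

results = translator.translate_batch(
    [["▁Hello", "▁world"]],
    suppress_sequences=[["▁bad", "▁phrase"]],  # never generate this sequence
    end_token="<stop>",  # hypothetical token to end decoding on, instead of EOS
)
print(results[0].hypotheses[0])
```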
v3.1.0 (2022-11-29)
- The input prompt is no longer included in the result of `Whisper.generate` as it is usually not useful in a transcription loop
- The default beam size in `Whisper.generate` is updated from 1 to 5 to match the default value in openai/whisper
- Generation options `min_length` and `no_repeat_ngram_size` now penalize the logits instead of the log probs, which may change some scores
- Raise a deprecation warning when reading the `TranslationResult` object as a list of dictionaries
- Allow configuring the C++ logs from Python with the function `ctranslate2.set_log_level` (see the sketch after this list)
- Implement the timestamp decoding rules when the Whisper prompt does not include the token `<|notimestamps|>`
- Add option `return_no_speech_prob` to the method `Whisper.generate` for the result to include the probability of the no speech token
- Improve performance of the Whisper model when generating with a context
- Fix timestamp tokens in the Whisper vocabulary to use the correct format (`<|X.XX|>`)
- Fix AVX and NEON log functions to return -inf on log(0) instead of NaN
- When info logs are enabled, log the system configuration only when the first model is loaded and not immediately when the library is loaded
- Define a `LogitsProcessor` abstract class to apply arbitrary updates to the logits during decoding
- Update oneDNN to 2.7.2
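A minimal sketch of configuring the C++ logs from Python (`set_log_level` accepts the standard `logging` levels):

```python
import logging

import ctranslate2

# Show the library's info logs, including the system configuration
# printed when the first model is loaded.
ctranslate2.set_log_level(logging.INFO)
```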
v3.0.2 (2022-11-14)
- Whisper: fix `generate` arguments that were not correctly passed to the model
v3.0.1 (2022-11-10)
- Whisper: do not implicitly add `<|startoftranscript|>` in `generate` since it is not always the first token
v3.0.0 (2022-11-07)
This major version integrates the Whisper speech recognition model published by OpenAI. It also introduces some breaking changes to remove deprecated usages and simplify some modules.
- Remove option `normalize_scores`: the scores are now always divided by `pow(length, length_penalty)` with `length_penalty` defaulting to 1
- Remove option `allow_early_exit`: the beam search now exits early only when no penalties are used
- Rename some classes:
  - `OpenNMTTFConverterV2` -> `OpenNMTTFConverter`
  - `TranslationStats` -> `ExecutionStats`
- Remove compatibility for reading `ScoringResult` as a list of scores: the scores can be accessed with the attribute `log_probs`
- Remove compatibility for reading `ExecutionStats` as a tuple
- Remove support for deprecated Python version 3.6
- Rename the client executable `translate` to a more specific name `ct2-translator`
- Rename or remove some classes and methods:
  - `TranslationStats` -> `ExecutionStats`
  - `GeneratorPool` -> `Generator`
  - `TranslatorPool` -> `Translator`
  - `TranslatorPool::consume_*` -> `Translator::translate_*`
  - `TranslatorPool::consume_stream` -> removed
  - `TranslatorPool::score_stream` -> removed
- Remove support for building with CUDA 10
- Integrate the Whisper speech recognition model published by OpenAI
- Support conversion of models trained with OpenNMT-py V3
- Add method `Generator.forward_batch` to get the full model output for a batch of sequences
- Add Python class `StorageView` to expose C++ methods taking or returning N-dimensional arrays: the class implements the array interface for interoperability with Numpy and PyTorch (see the sketch after this list)
- Add a new configuration file `config.json` in the model directory that contains non-structural model parameters (e.g. related to the input, the vocabulary, etc.)
- Implement the Conv1D layer and operator on CPU and GPU (using oneDNN and cuDNN respectively)
- [C++] Allow registration of external models with `models::ModelFactory`
- Fix conversion of models that use biases only for some QKV projections but not for all
- Fuse masking of the output log probs by aggregating disabled tokens from all related options: `disable_unk`, `min_length`, `no_repeat_ngram_size`, etc.
- Reduce the layer norm epsilon value on GPU to 1e-5 to match the default value in PyTorch
- Move some Transformer model attributes under the encoder/decoder scopes to simplify loading
- Redesign the `ReplicaPool` base class to simplify adding new classes with multiple model workers
- Compile the library with C++17
- Update oneDNN to 2.7.1
- Update oneMKL to 2022.2
- Update pybind11 to 2.10.1
- Update cibuildwheel to 2.11.2
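A minimal sketch of `Generator.forward_batch` and the `StorageView` array interface, assuming `model_dir` is a placeholder path to a converted language model loaded on CPU with a float32 compute type:

```python
import numpy as np

import ctranslate2

generator = ctranslate2.Generator("model_dir")

# Full model output for a batch of (illustrative) token sequences.
output = generator.forward_batch([["Hello", "world"]])

# On CPU, the returned StorageView can be viewed as a Numpy array
# through the array interface.
print(np.array(output).shape)
```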
v2.24.0 (2022-10-03)
- The Linux binaries now use the GNU OpenMP runtime instead of Intel OpenMP to work around an initialization error on systems without `/dev/shm`
- Fix a memory error when running random sampling on GPU
- Optimize the model loading on multiple GPUs by copying the finalized model weights instead of reading the model from disk multiple times
- In the methods `Translator.translate_iterable` and `Translator.score_iterable`, raise an error if the input iterables don't have the same length
- Fix some compilation warnings
v2.23.0 (2022-09-16)
- Build wheels for Python 3.11
- In beam search, get more candidates from the model output and replace finished hypotheses by these additional candidates
- Fix possibly incorrect attention vectors returned from the beam search
- Fix coverage penalty that was actually not applied
- Fix crash when the beam size is larger than the vocabulary size
- Add missing compilation flag `-fvisibility=hidden` when building the Python module
- Update oneDNN to 2.6.2
- Update OpenBLAS to 0.3.21
v2.22.0 (2022-09-02)
`score_batch` methods now return a list of `ScoringResult` instances instead of plain lists of probabilities. In most cases you should not need to update your code: the result object implements the methods `__len__`, `__iter__`, and `__getitem__` so that it can still be used as a list.
- Add methods to efficiently process long iterables (see the sketch after this list):
  - `Translator.translate_iterable`
  - `Translator.score_iterable`
  - `Generator.generate_iterable`
  - `Generator.score_iterable`
- Add decoding option `min_alternative_expansion_prob` to filter out unlikely alternatives in `return_alternatives` mode
- Return `ScoringResult` instances from `score_batch` to include additional outputs. The current attributes are:
  - `tokens`: the list of tokens that were actually scored (including special tokens)
  - `log_probs`: the log probability of each scored token
- Support running `score_batch` asynchronously by setting the `asynchronous` flag
- Fix possibly incorrect results when using `disable_unk` or `use_vmap` with one of the following options:
  - `min_decoding_length`
  - `no_repeat_ngram_size`
  - `prefix_bias_beta`
  - `repetition_penalty`
- Also pad the output layer during scoring to enable Tensor Cores
- Improve the correctness of the model output probabilities when the output layer is padded
- Skip translation when the NLLB input is empty (i.e. when the input only contains EOS and the language token)
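A minimal sketch of `translate_iterable`, assuming `model_dir` is a placeholder path to a converted translation model and `input.txt` contains pre-tokenized lines:

```python
import ctranslate2

translator = ctranslate2.Translator("model_dir")

def read_tokens(path):
    with open(path) as f:
        for line in f:
            yield line.split()

# The iterable is consumed lazily, so long inputs are streamed
# through the model without loading everything in memory.
for result in translator.translate_iterable(read_tokens("input.txt")):
    print(" ".join(result.hypotheses[0]))
```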
v2.21.1 (2022-07-29)
- Fix conversion of NLLB models when `tokenizer_class` is missing from the configuration
v2.21.0 (2022-07-27)
- Support NLLB multilingual models via the Transformers converter
- Support Pegasus summarization models via the Transformers converter
- Do not stop decoding when the EOS token is coming from the user input: this is required by some text generation models like `microsoft/DialoGPT` where EOS is used as a separator
- Fix conversion error for language models trained with OpenNMT-py
- Fix conversion of models that are not using bias terms in the multi-head attention
- Fix data type error when enabling the translation options `return_alternatives` and `return_attention` with a `float16` model
- Improve CPU performance of language models quantized to `int8`
- Implement a new vectorized GELU operator on CPU
- Raise a more explicit error when trying to convert an unsupported Fairseq model
- Update pybind11 to 2.10.0
v2.20.0 (2022-07-06)
- Add generation option `no_repeat_ngram_size` to prevent repetitions of N-grams with a minimum size
- Fix conversion of OpenNMT-tf models that use static position embeddings
- Fix a segmentation fault in `return_alternatives` mode when the target prefix is longer than `max_decoding_length`
- Fix inconsistent state of asynchronous results in Python when a runtime exception is raised
- Remove the `<pad>` token when converting MarianMT models from Transformers: this token is only used to start the decoder from a zero embedding, but it is not included in the original Marian model
- Optimize CPU kernels with vectorized reduction of accumulated values
- Do not modify the configuration passed to `OpenNMTTFConverterV2.from_config`
- Improve Python classes documentation by listing members at the top
v2.19.1 (2022-06-23)
- Fix missing final bias in some MarianMT models converted from Transformers
- Fix missing final layer normalization in OPT models converted from Transformers
- Fix error when converting OpenNMT-tf V1 checkpoints with the new OpenNMT-tf converter
- Reduce model conversion memory usage when the loaded weights are in FP16 and the model is converted with quantization
- Add missing C++ type `ctranslate2::float16_t` in the public headers, which is required to use some functions
- Fix some Python typing annotations
v2.19.0 (2022-06-08)
- Support conversion of decoder-only Transformer models trained with OpenNMT-tf
- Fix conversion error for Transformers' model `facebook/bart-large-cnn`
- Fix crash when scoring empty sequences
- Apply `max_input_length` after all special tokens have been added to the input
- Clear the GPU memory cache when no new batches are immediately available for execution
- Improve functions signature in the generated Python API documentation
- Update oneDNN to 2.6
- Update spdlog to 1.10.0
- Update OpenBLAS to 0.3.20
v2.18.0 (2022-05-23)
- Support Meta's OPT models via the Transformers converter
- Extend the Fairseq converter to support `transformer_lm` models
- Fix conversion error for Marian's pre-norm Transformer models
- Fix conversion error for Transformers' MarianMT models that are missing some configuration fields
- Improve conversion speed of Marian models (optimize the generation of the sinusoidal position encodings)
v2.17.0 (2022-05-09)
- Add a converter for Hugging Face's Transformers. The following models are currently supported:
  - BART
  - M2M100
  - MarianMT
  - MBART
  - OpenAI GPT2
- Revisit the OpenNMT-tf converter to better support custom models and configurations:
  - Extend the conversion script to accept the training configuration
  - Add a new converter class `ctranslate2.converters.OpenNMTTFConverterV2`
- Move all documentation and guides to the website to improve navigation and clarity
- In text generation, include the start token in the output if it is not the BOS token
v2.16.0 (2022-04-28)
- Initial support of language models (see the sketch after this list):
  - Add a high-level class `ctranslate2.Generator` to generate text with language models
  - Add a converter for OpenAI GPT-2 models
  - Update the OpenNMT-py converter to support `transformer_lm` decoders
- Build ARM64 wheels for macOS
- Allow loading custom Fairseq extensions and architectures during conversion with the option `--user_dir`
- Enable conversion of the Fairseq architectures `multilingual_transformer` and `multilingual_transformer_iwslt_de_en`
- Implement random sampling in beam search using the Gumbel-max trick
- Generate and publish the Python API reference to https://opennmt.net/CTranslate2
- Fix model loading on a GPU with index > 0
- Fix memory error when running random sampling on GPU with certain batch sizes
- Fix incorrect tokens order in some converted Marian vocabularies
- Properly count the number of layers before building the encoder/decoder instead of relying on runtime exceptions
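A minimal sketch of the new `ctranslate2.Generator` class, assuming `gpt2_dir` is a placeholder path to a converted GPT-2 model and the prompt is already tokenized with the model's tokenizer:

```python
import ctranslate2

generator = ctranslate2.Generator("gpt2_dir")

results = generator.generate_batch(
    [["The", "Ġquick", "Ġbrown"]],  # GPT-2 BPE tokens for "The quick brown"
    max_length=32,
    sampling_topk=10,
)
print(results[0].sequences[0])
```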
v2.15.1 (2022-04-04)
- Fix missing deactivation of OpenMP threading in GPU execution (regression introduced in version 2.15.0)
v2.15.0 (2022-04-04)
- Expose translator option `max_queued_batches` to configure the maximum number of queued batches (when the queue is full, future requests will block until a free slot is available)
- Allow converters to customize the vocabulary special tokens `<unk>`, `<s>`, and `</s>`
- Fix compatibility of models converted on Windows with other platforms by saving the vocabulary files with the newline character "\n" instead of "\r\n"
- Clarify conversion error when no TensorFlow checkpoints are found in the configured model directory
- Enable fused QKV transposition by switching the heads and time dimensions before the QKV split
- Cache the prepared source lengths mask in the Transformer decoder state and reuse it in the next decoding steps
- Pad the output layer to enable Tensor Cores only once instead of updating the layer on each batch
- Vectorize copy in Concat and Split ops on GPU
- Factorize all OpenMP parallel for loops to call the `parallel_for` function
- Compile CUDA kernels for deprecated Compute Capabilities that are not yet dropped by CUDA:
  - CUDA 11: 3.5 and 5.0
  - CUDA 10: 3.0
v2.14.0 (2022-03-16)
- Include BART and MBART in the list of supported Fairseq architectures
- Add Fairseq converter option `--no_default_special_tokens` to require all special tokens to be set by the user during inference, including the decoder start tokens (for example, this is required by MBART-25 to properly set the language tokens)
- Fix conversion of Post-Norm Transformers trained with OpenNMT-tf
- Fix scoring with Fairseq models that used an incorrect decoder start token (Fairseq uses `</s>` as the decoder start token, not `<s>`)
- Fix scoring result to include the end of sentence token
- Ignore OpenNMT-py options `--alignment_layer` and `--alignment_heads` for models that are not trained with alignments
- Enable batch encoding in `return_alternatives` translation mode (the decoding still runs sequentially)
- Make enumerations `ctranslate2.specs.Activation` and `ctranslate2.specs.EmbeddingsMerge` public since they could be used to configure the Transformer specification
- Update oneDNN to 2.5.3
- Update cpu_features to 0.7.0
- Update cxxopts to 3.0.0
- Update spdlog to 1.9.2
v2.13.1 (2022-03-02)
- Fix conversion error for old OpenNMT-py models that do not have the option `self_attn_type`
v2.13.0 (2022-02-28)
- Add converter for Marian and support the collection of OPUS-MT pretrained models
- Support models applying a layer normalization after the embedding layer (cf. option `--layernorm-embedding` in Fairseq)
- Support models using the Swish (a.k.a. SiLU) activation function
- Support models using custom decoder start tokens, which can be passed in the target prefix
- Remove an unexpected call to a CUDA function in CPU execution when unloading models
- Add option groups in the translation client help output
- Use the new `thrust::cuda::par_nosync` execution policy when calling Thrust functions
- Update Thrust to 1.16.0
- Update pybind11 to 2.9.1
v2.12.0 (2022-02-01)
- Support models using additional source features (a.k.a. factors)
- Fix compilation with CUDA < 11.2
- Fix incorrect revision number reported in the error message for unsupported model revisions
- Improve quantization correctness by rounding the value instead of truncating (this change will only apply to newly converted models)
- Improve the default value of `intra_threads` when the system has less than 4 logical cores
- Update oneDNN to 2.5.2
v2.11.0 (2022-01-11)
- With CUDA >= 11.2, the environment variable `CT2_CUDA_ALLOCATOR` now defaults to `cuda_malloc_async`, which should improve performance on GPU.
- Build Python wheels for AArch64 Linux
- Improve performance of Gather CUDA kernel by using vectorized copy
- Update Intel oneAPI to 2022.1
- Update oneDNN to 2.5.1
- Log some additional information with `CT2_VERBOSE` >= 1:
  - Location and compute type of loaded models
  - Version of the dynamically loaded cuBLAS library
  - Selected CUDA memory allocator
v2.10.1 (2021-12-15)
- Fix stuck execution when loading a model on a second GPU
- Fix numerical error in INT8 quantization on macOS
v2.10.0 (2021-12-13)
`inter_threads` now also applies to GPU translation, where each translation thread uses a different CUDA stream to allow some parts of the GPU execution to overlap.
- Add option `disable_unk` to disable the generation of unknown tokens
- Add function `set_random_seed` to fix the seed in random sampling
- [C++] Add constructors in the `Translator` and `TranslatorPool` classes with a `ModelReader` parameter
- Fix incorrect output from the Multinomial op when running on GPU with a small batch size
- Fix Thrust and CUB headers that were included from the CUDA installation instead of the submodule
- Fix static library compilation with the default build options (`cmake -DBUILD_SHARED_LIBS=OFF`)
- Compile the Docker image and the Linux Python wheels with SSE 4.1 (vectorized kernels are still compiled for AVX and AVX2 with automatic dispatch, but other source files are now compiled with SSE 4.1)
- Enable `/fp:fast` for MSVC to mirror `-ffast-math` that is enabled for GCC and Clang
- Statically link against oneDNN to reduce the size of published binaries:
  - Linux Python wheels: 43MB -> 17MB
  - Windows Python wheels: 41MB -> 11MB
  - Docker image: 733MB -> 600MB
v2.9.0 (2021-12-01)
- Add GPU support to the Windows Python wheels
- Support OpenNMT-py and Fairseq options `--alignment_layer` and `--alignment_heads`, which specify how the multi-head attention is reduced and returned by the Transformer decoder
- Support dynamic loading of CUDA libraries on Windows
- Fix division by zero when normalizing the score of an empty target
- Fix error that was not raised when the input length is greater than the number of position encodings
- Improve performance of random sampling on GPU for large values of `sampling_topk` or when sampling over the full vocabulary
- Include `transformer_align` and `transformer_wmt_en_de_big_align` in the list of supported Fairseq architectures
- Add a CUDA kernel to prepare the length mask to avoid moving back to the CPU
v2.8.1 (2021-11-17)
- Fix dtype error when reading float16 scores in greedy search
- Fix usage of MSVC linker option `/nodefaultlib` that was not correctly passed to the linker
v2.8.0 (2021-11-15)
- The Linux Python wheels now use Intel OpenMP instead of GNU OpenMP for consistency with other published binaries
- Build Python wheels for Windows
- Fix segmentation fault when calling `Translator.unload_model` while an asynchronous translation is running
- Fix implementation of the repetition penalty, which should be applied to all previously generated tokens and not just the tokens of the last step
- Fix missing application of the repetition penalty in greedy search
- Fix incorrect token index when using a target prefix and a vocabulary mapping file
- Set the OpenMP flag when compiling on Windows with `-DOPENMP_RUNTIME=INTEL` or `-DOPENMP_RUNTIME=COMP`
v2.7.0 (2021-11-03)
- Inputs are now truncated after 1024 tokens by default (see translation option `max_input_length`)
- Add translation option `max_input_length` to limit the model input length
- Add translation option `repetition_penalty` to apply an exponential penalty on repeated sequences
- Add scoring option `with_tokens_score` to also output token-level scores when scoring a file
- Adapt the length penalty formula when using `normalize_scores` to match other implementations: the scores are divided by `pow(length, length_penalty)`
- Implement `LayerNorm` with a single CUDA kernel instead of 2
- Simplify the beam search implementation
v2.6.0 (2021-10-15)
- Build wheels for Python 3.10
- Accept passing the vocabulary as an `opennmt.data.Vocab` object or a list of tokens in the OpenNMT-tf converter
- Fix segmentation fault in greedy search when `normalize_scores` is enabled but not `return_scores`
- Fix segmentation fault when `min_decoding_length` and `max_decoding_length` are both set to 0
- Fix segmentation fault when option `sampling_topk` is larger than the vocabulary size
- Fix incorrect score normalization in greedy search when `max_decoding_length` is reached
- Fix incorrect score normalization in the `return_alternatives` translation mode
- Improve error checking when reading the binary model file
- Apply `LogSoftMax` in-place during decoding and scoring
v2.5.1 (2021-10-04)
- Fix logic error in the in-place implementation of the `Gather` op that could lead to incorrect beam search outputs
v2.5.0 (2021-10-01)
- Add an 8-bit GEMM backend on AArch64 using Ruy
- Skip unnecessary transpositions of the projected decoder queries in the multi-head attention
- Use 32-bit indexing in all CUDA kernels to slightly improve performance
- Let the compiler auto-vectorize the `LayerNorm` CPU kernel
- Update Intel oneAPI to 2021.4
v2.4.0 (2021-09-10)
- [Python] Support asynchronous translation: `translate_batch` can return future-like objects with the argument `asynchronous=True` (see the sketch after this list)
- [Python] `translate_batch` now returns a list of `TranslationResult` objects instead of a list of dictionaries (this object can also be indexed as a list of dictionaries for backward compatibility)
- Add options `--source_lang` and `--target_lang` to the Fairseq converter for models that do not include this information
- Fix Fairseq model conversion when the model options are stored in `model["cfg"]["model"]`
- Compile the CPU INT8 quantization kernel with FMA instructions
- Enable packing of the last linear weight when not using dynamic vocabulary reduction
- Replace the generic `Tile` implementation by dedicated CPU and CUDA kernels
- [Python] Implement the `__repr__` method for `TranslationStats` objects
- [Python] Update pybind11 to 2.7.1
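A minimal sketch of asynchronous translation, assuming `model_dir` is a placeholder path to a converted translation model and the tokens are illustrative SentencePiece pieces:

```python
import ctranslate2

translator = ctranslate2.Translator("model_dir", inter_threads=2)

async_results = translator.translate_batch(
    [["▁Hello", "▁world"]],
    asynchronous=True,
)

# Each item is a future-like object; result() blocks until it is done.
for async_result in async_results:
    print(async_result.result().hypotheses[0])
```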
v2.3.2 (2021-08-05)
- Fix GPU execution that gets stuck when applying the GELU activation
v2.3.1 (2021-07-28)
- Fix compilation with CUDA < 10.2
v2.3.0 (2021-07-26)
- Add compute type `int8_float16` for mixed INT8 and FP16 computation on GPU (requires Compute Capability >= 7.0)
- Add methods `Translator.score_batch` and `Translator.score_file` to score existing translations
- Relax the GPU driver requirement for running the Docker image to >= 450.80.02 (same as the published Python package)
v2.2.0 (2021-07-06)
- Add Python utility functions to query the system capabilities (see the sketch after this list):
  - `ctranslate2.get_cuda_device_count`
  - `ctranslate2.get_supported_compute_types`
- Add option `fixed_dictionary` in the Fairseq converter to support multilingual models
- Extend the environment variable `CT2_VERBOSE` to configure more log levels (see the README)
- Fuse activation with bias addition on GPU for a small performance increase
- Make the GELU activation compatible with FP16 execution
- Improve the log format using the spdlog library
- Improve the accuracy of the profiling results on GPU
- Update Intel oneAPI to 2021.3
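A minimal sketch of the new capability queries, useful for picking a device and compute type at runtime:

```python
import ctranslate2

device = "cuda" if ctranslate2.get_cuda_device_count() > 0 else "cpu"

# List the compute types supported on the selected device,
# e.g. to decide between int8 and float32.
print(ctranslate2.get_supported_compute_types(device))
```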
v2.1.0 (2021-06-14)
- Support conversion of Transformer models trained with Fairseq (see the script `ct2-fairseq-converter`)
- Support conversion of models using GELU activations
- Add translation option `normalize_scores` to return scores normalized by the hypothesis length: enabling this option can improve the beam search output for some models
- Add translation option `allow_early_exit` to toggle the beam search early exit optimization: disabling this option has a small negative impact on performance, but it can improve the beam search output when using penalties or normalized scores
- [C++] Add class `BufferedTranslationWrapper` to buffer and batch independent inputs to the same model
- Read the value of the environment variable `OMP_NUM_THREADS` when `intra_threads` is not set
- Improve file translation performance by enabling local sorting by default
- [Python] Improve the error message when converting unsupported models and list all options that are unsupported
- [Python] Return the statistics of `Translator.translate_file` as an object with named properties
- [C++] Fix compilation of the method `TranslatorPool::consume_raw_text_file` that takes streams as inputs
v2.0.0 (2021-06-03)
This major version introduces some breaking changes to simplify model conversion, improve the consistency of user options, and update the Python package to CUDA 11.x. It also comes with internal improvements to facilitate future changes.
- Disable `return_scores` by default as most applications do not use the translation scores
- Replace all Docker images by a single one: `<version>-ubuntu20.04-cuda11.2`
- Replace the CMake option `LIB_ONLY` by `BUILD_CLI`
- Require CMake version >= 3.15 for GPU compilation
- For GPU execution, the Linux Python wheels published on PyPI now require CUDA 11.x to be installed on the system. The CUDA dependencies (e.g. cuBLAS) are no longer included in the package and are loaded dynamically.
- Remove support for converting the TensorFlow SavedModel format (checkpoints should be converted instead)
- Remove the `model_spec` option for converters that can automatically detect it from the checkpoints
- Force translation options to be set with keyword arguments only (see the API reference)
- Rename tokenization callable arguments in `translate_file` for clarity:
  - `tokenize_fn` to `source_tokenize_fn`
  - `detokenize_fn` to `target_detokenize_fn`
- Rename length constraint options for consistency with other APIs:
  - `max_sent_length` to `max_decoding_length`
  - `min_sent_length` to `min_decoding_length`
- Move the `max_batch_size` and `batch_type` options from the `TranslationOptions` structure to the translation methods of `TranslatorPool`
- Simplify the `TranslationResult` structure with public attributes instead of methods
- The asynchronous translation API now returns one future per example instead of a single future for the batch
- Add translation option `prefix_bias_beta` to bias the decoding towards the target prefix (see Arivazhagan et al. 2020)
- Automatically detect the model specification when converting OpenNMT-py models
- Support conversion and execution of Post-Norm Transformers
- Add an experimental asynchronous memory allocator for CUDA 11.2 and above (can be enabled with the environment variable `CT2_CUDA_ALLOCATOR=cuda_malloc_async`)
- Expose the Python package version in `ctranslate2.__version__`
- Fix silent activation of `replace_unknowns` when enabling `return_attention`
- Improve support for the NVIDIA Ampere architecture in prebuilt binaries
- Reduce the size of the Python wheels published on PyPI
- Define a custom CUDA kernel for the GEMM output dequantization instead of a Thrust-based implementation
- Update Thrust to 1.12.0
v1.20.1 (2021-04-29)
- Do not return scores for empty outputs when `return_scores` is disabled
- Do not include the google/cpu_features library in the CTranslate2 installation
v1.20.0 (2021-04-20)
- Drop Python 3.5 support
- Docker image tags suffixed with `-gpu` are no longer updated; prefer tags with an explicit CUDA version
- Fix int8 quantization for rows that only contain zeros
- Fix type error when running the CUDA code path of the Multinomial operator
- Add EOS score to the greedy search final score for consistency with the beam search output
- Use third party library google/cpu_features to resolve CPU features at runtime
- Small optimizations when manipulating tensor shapes and indices
- Internal refactoring of Transformer layers
v1.19.0 (2021-03-31)
- Rename the CMake option `WITH_TESTS` to `BUILD_TESTS`
- Add an "auto" compute type to automatically select the fastest compute type on the current system
- [Python] Clear the memory allocator cache when calling `unload_model`
- [Python] Make the methods `unload_model` and `load_model` thread safe
thread safe - Fix conversion of TensorFlow SavedModel with shared embeddings
- Update Intel oneAPI to 2021.2
- Compile core library with C++14 standard
v1.18.3 (2021-03-02)
- Use Intel OpenMP instead of GNU OpenMP in the Docker images as a workaround for issue #409.
v1.18.2 (2021-02-23)
- Fix crash when enabling coverage penalty in GPU translation
- Fix incorrect value of the AVX2 flag in the `CT2_VERBOSE` output
v1.18.1 (2021-02-01)
- Fix conversion of models setting the attributes `with_source_bos` or `with_source_eos`
v1.18.0 (2021-01-28)
- Some options' default values in the `translate` client have been changed to match the Python API:
  - `batch_size` = 32 (instead of 30)
  - `beam_size` = 2 (instead of 5)
  - `intra_threads` = 4 (instead of 0)
- Support multi-GPU translation: the `device_index` argument can now be set to a list of GPU IDs (see the sketch after this list)
- Improve performance when using multiple GPU translators concurrently in the same process
- [Python] Do nothing when calling `unload_model(to_cpu=True)` on CPU translators
- [Python] Set a default value for the `max_batch_size` argument in method `Translator.translate_file`
- Disable `CT2_TRANSLATORS_CORE_OFFSET` in OpenMP builds, as setting the thread affinity does not work when OpenMP is enabled
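A minimal sketch of multi-GPU translation, assuming `model_dir` is a placeholder path to a converted model and two GPUs are available:

```python
import ctranslate2

# The model is loaded on each listed device, and batches are
# dispatched across the replicas.
translator = ctranslate2.Translator(
    "model_dir",
    device="cuda",
    device_index=[0, 1],
)
```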
v1.17.1 (2021-01-15)
- Fix Python wheel loading error on macOS
v1.17.0 (2021-01-11)
- Linux Python wheels are now compiled under `manylinux2014` and require `pip` version >= 19.3
- Publish Python wheels for macOS (CPU only)
- Support compilation for the ARM 64-bit architecture and add NEON vectorization
- Add new optional GEMM backends: Apple Accelerate and OpenBLAS
- Add the `replace_unknowns` translation option to replace unknown target tokens by the source tokens with the highest attention
- Add flags in the model specification to declare that BOS and/or EOS tokens should be added to the source sequences
- Fix segmentation fault when the model is converted with a wrong vocabulary and predicts an out-of-vocabulary index
- Fix result of vectorized array reduction when the array length is not a multiple of the SIMD register width
- Fix exit code when running `cli/translate -h`
- Improve performance of vectorized vector math by inlining calls to intrinsic functions
- Improve accuracy of the LogSoftMax CUDA implementation
- Improve error message when the `--model` option is not set in `cli/translate`
- Update oneMKL to 2020.1 in published binaries
- Update oneDNN to 2.0 in published binaries
- Update default search paths to support compilation with oneMKL and oneDNN installed from the oneAPI toolkit
v1.16.2 (2020-11-27)
- Fix the cuBLAS version included in the Python wheels published to PyPI. The included library was targeting CUDA 10.2 instead of CUDA 10.1.
- Re-add Python 3.5 wheels on PyPI to give users more time to transition
v1.16.1 (2020-11-23)
- Fuse dequantization and bias addition on GPU for improved INT8 performance
- Improve performance of masked softmax on GPU
- Fix error when building the CentOS 7 GPU Docker image
- The previous version listed "Pad size of INT8 matrices to a multiple of 16 when the GPU has INT8 Tensor Cores". However, the padding was not applied due to a bug and fixing it degraded the performance, so this behavior is not implemented for now.
v1.16.0 (2020-11-18)
- Drop support for Python 2.7 and 3.5
- Add Docker images using CUDA 11.0
- Enable parallel CPU translations from `translate_batch` in Python when setting `inter_threads` > 1 and `max_batch_size` > 0
> 0 - Improve GPU performance on Turing architecture when using a Docker image or the Python package
- Pad size of INT8 matrices to a multiple of 16 when the GPU has INT8 Tensor Cores
- Add information about detected GPU devices in the `CT2_VERBOSE` output
- Update oneDNN to 1.7
- [Python] Improve type checking for some arguments
v1.15.0 (2020-11-06)
- [Experimental] The Python package published on PyPI now includes GPU support. The binary is compiled with CUDA 10.1, but all CUDA dependencies are integrated in the package and do not need to be installed on the system. The only requirement should be a working GPU with driver version >= 418.39.
- Remove the TensorRT dependency to simplify installation and reduce memory usage:
- Reduce GPU Docker images size by 600MB
- Reduce memory usage on the GPU and the system by up to 1GB
- Reduce initialization time during the first GPU translation
- Improve TopK performance on GPU for K < 5
- Improve INT8 performance on GPU
- Accept linear layers without bias when converting models
- Update Intel MKL to 2020.4
- [Python] Improve compatibility with Python 3.9
v1.14.0 (2020-10-13)
- Accept target prefix in file translation APIs
- Fix CUDA illegal memory access when changing the beam size in the same process
- Fix decoding with target prefix that sometimes did not go beyond the prefix
- Fix Intel MKl search paths on macOS
- Update Intel MKL to 2020.3
- Clarify error message when selecting a CUDA device in CPU-only builds
v1.13.2 (2020-08-31)
- Fix model conversion to `float16` when using the Python converters: weights were duplicated and not correctly converted
- Fix incorrect code logic that could lead to incorrect translation results
v1.13.1 (2020-08-06)
- Fix performance regression when decoding with a large beam size on GPU
v1.13.0 (2020-07-30)
- Add environment variable `CT2_TRANSLATORS_CORE_OFFSET` to pin parallel translators to a range of CPU cores (only for `intra_threads` = 1)
- [Python] Add some properties to the `Translator` object:
  - `device`
  - `device_index`
  - `num_translators`
  - `num_queued_batches`
  - `model_is_loaded`
- Improve batch performance of target prefix
- Improve performance when the input batch contains sentences with very different lengths
- Improve beam search performance by expanding the batch size only after the first decoding step
- Optimize Transpose op on GPU for the permutation used in multi-head attention
- Remove padding in returned attention vectors
- Update Intel MKL to 2020.2
v1.12.1 (2020-07-20)
- Fix implicit int16 to float16 model conversion on compatible GPUs
v1.12.0 (2020-07-16)
- Docker images based on Ubuntu 16.04 are no longer updated
- Support the `float16` data type for model conversion (with `--quantization float16`) and computation (with `--compute_type float16`). FP16 execution can improve performance by up to 50% on NVIDIA GPUs with Compute Capability >= 7.0.
- Add Docker images with newer CUDA versions, which can improve performance in some cases:
  - `latest-ubuntu18-cuda10.0` (same as `latest-ubuntu18-gpu`)
  - `latest-ubuntu18-cuda10.1`
  - `latest-ubuntu18-cuda10.2`
  - `latest-centos7-cuda10.0` (same as `latest-centos7-gpu`)
  - `latest-centos7-cuda10.1`
  - `latest-centos7-cuda10.2`
- Allow setting a computation type per device (e.g. `Translator(..., compute_type={"cuda": "float16", "cpu": "int8"})` with the Python API)
- [C++] Add the `ModelReader` interface to customize model loading
- Optimize the Transpose op on CPU for the permutation used in multi-head attention
- Optimize the GELU op on CPU with Intel MKL
- Fix compilation when targeting an architecture and disabling ISA dispatch (e.g. `-DCMAKE_CXX_FLAGS="-march=skylake" -DENABLE_CPU_DISPATCH=OFF`)
- Inline some frequently called methods
v1.11.0 (2020-06-29)
- Add tokenization and detokenization hooks for file translation APIs
- Add alternatives to Intel MKL:
  - Integrate oneDNN for GEMM functions
  - Implement vectorized operators that automatically select the instruction set architecture (ISA) (can be manually controlled with the `CT2_FORCE_CPU_ISA` environment variable)
- When alternatives are available, avoid using Intel MKL on non-Intel processors (can be manually controlled with the `CT2_USE_MKL` environment variable)
- Enable a verbose mode with the environment variable `CT2_VERBOSE=1` to help debug the run configuration (e.g. the detected CPU, whether Intel MKL is being used, etc.)
- Improve numerical precision of SoftMax and LogSoftMax layers on CPU
- Parallelize INT16 quantization/dequantization and ReLU on CPU
- Add back the translation client in CentOS 7 Docker images
v1.10.2 (2020-06-23)
- [Python] Fix error when calling `unload_model(to_cpu=True)` for models with shared weights
- [Python] Do not ignore errors when importing the compiled translator extension
v1.10.1 (2020-05-25)
- Force `intra_threads` to 1 when running a model on GPU to prevent high CPU load
- Improve handling of decoding length constraints when using a target prefix
- Do not raise an error when setting `use_vmap` but no vocabulary map exists
v1.10.0 (2020-04-17)
- Add a coverage penalty as in Wu et al. 2016 with the option `coverage_penalty`
- Allow expressing the batch size in number of tokens with the option `batch_type`
- Allow disabling the translation scores with the option `return_scores` (if disabled, the final SoftMax is skipped during greedy decoding)
- Support compilation without TensorRT by setting `-DWITH_TENSORRT=OFF` during the CMake configuration (in this case, beam search is no longer supported)
- Add an experimental integration of Intel MKL's packed GEMM, which can be enabled by setting the environment variable `CT2_USE_EXPERIMENTAL_PACKED_GEMM=1`
- Remove the direct dependency on cuDNN (still an indirect dependency via TensorRT)
- Static AVX optimization for the ReLU operator
- Remove unnecessary memory initialization when creating temporary buffers
- Dissociate SoftMax and LogSoftMax in profiling report
v1.9.1 (2020-04-08)
- Fix parallel translations when calling `Translator.translate_batch` from multiple Python threads
- Fix crash on invalid `num_hypotheses` values
v1.9.0 (2020-03-24)
- Return 2 additional statistics from file translation APIs:
- the number of translated examples
- the total translation time in milliseconds
- Fix exceptions that were not caught by the Python wrapper
- Fix an invalid insertion in the variables collection while iterating over it
- Optimize filling operation of float storages
- Internal refactoring of decoding functions to make them reusable for other tasks (e.g. generative language models)
v1.8.0 (2020-03-10)
- [Python] Add methods `Translator.unload_model` and `Translator.load_model` to manually manage memory
- [Docker] Move all images to Python 3 only
- Expose options that enable an internal sorting by length to increase the translation efficiency:
  - for file translation: `read_batch_size` contiguous examples will be loaded, sorted by length, and batched with size `max_batch_size`
  - for batch translation: if the batch is larger than `max_batch_size`, examples will be sorted by length and batched with size `max_batch_size`
- Fix another error when releasing a translator that is placed on a GPU that is not GPU 0
- Fix possible memory corruption when creating GPU translators in parallel
- Fix memory that is briefly allocated on GPU 0 when destroying a translator that is placed on another GPU
- Reduce latency of model loading, especially on GPU
v1.7.1 (2020-03-03)
- Revert "Parallelize some low level transformations on CPU" which caused incorrect computation
- Avoid unnecessary TensorFlow runtime initialization when converting checkpoints
- Fix compilation without MKL
v1.7.0 (2020-02-28)
- Add translation option `return_alternatives` to return multiple choices at the first unconstrained decoding position: combined with a target prefix, this can be used to provide alternative words and translations at a specific location in the target
- Support Transformers with different numbers of encoder/decoder layers
- Allow compilation without OpenMP with `-DOPENMP_RUNTIME=NONE`
- Fix SavedModel conversion when TensorFlow Addons 0.8 is installed
- Fix error when releasing a translator/model that is placed on a GPU that is not GPU 0
- Fix memory that was allocated on GPU 0 even when the translator/model was placed on another GPU
- Query GPU int8 support on the first model load, and then cache the result for future loads
- Avoid creating an empty model directory on conversion errors
- Parallelize some low level transformations on CPU
- Reduce memory usage when translating large files by limiting the work queue size
v1.6.3 (2020-02-24)
- Fix incorrectness in relative representation computation
v1.6.2 (2020-02-21)
- Fix conversion of models with shared embeddings
v1.6.1 (2020-02-11)
- [Docker] Remove translation client in CentOS 7 images as it can cause compatibility issues with downstream images
v1.6.0 (2020-02-14)
- Support Transformers with relative position representations (as in Shaw et al. 2018)
- Accept target prefix in batch request
- Support `return_attention` with prefixed translation
v1.5.1 (2020-02-06)
- Fix INT8 translation on CPU with vocabulary map
v1.5.0 (2020-02-06)
- [C++] Add a `max_batch_size` translation option for single translators
- Improve INT8 performance on CPU
- Enable INT8 support in the default Intel MKL build
- Simplify project dependencies:
  - Replace `boost::program_options` with `cxxopts` for the client options
  - Include header-only dependencies as Git submodules (`cxxopts`, `cub`, and `thrust`)
  - Remove MKL-DNN
- Harmonize Python/C++ default values:
  - [Python] Change the default beam size from 4 to 2
  - [C++] Load models on the CPU by default
v1.4.0 (2020-01-20)
- Publish a package on PyPI (without GPU support)
- Add method to convert OpenNMT-tf models directly from a dictionary of variables
- Return statistics from the Python method `Translator.translate_file`
- Add `set_model` methods to support changing models without creating a new `Translator`
- Add a `contains_model` function to check whether a directory could contain a CTranslate2 model
v1.3.0 (2020-01-14)
- Support random sampling (see the `sampling_topk` and `sampling_temperature` translation options)
- Add the `CT2_CUDA_CACHING_ALLOCATOR_CONFIG` environment variable to configure the CUDA caching allocator
- Fix incorrect translations on Windows due to an incompatibility between the compiler OpenMP and Intel OpenMP
- Release cuDNN/cuBLAS/TensorRT handles on thread exit when destroying a `TranslatorPool`
- Remove use of `--{start,end}-group` compiler options when compiling on macOS
- Update Intel MKL to 2020.0 in Docker images
- Load vocabulary assets for SavedModel exported with OpenNMT-tf 2.5 and above
v1.2.3 (2019-12-11)
- Improve translator robustness on empty batch and inputs
- Speed optimization for `LayerNorm`
- Check vocabulary size when converting OpenNMT-tf models
- Add more samples in the execution profiling output which now supports nested functions
v1.2.2 (2019-11-25)
- Fix `PositionEncoder` internal state that was shared with other instances on the same thread
- Replace Boost.Python with pybind11
- Include a Python source distribution in the Docker images
v1.2.1 (2019-11-06)
- Avoid copying decoder states when possible to improve decoding performance (10% to 20% faster)
- Fix execution profiling on GPU (device was not synchronized before measuring the time)
- Include the `Mul` operation in the profiling report
- Add a Python 3 wheel in Ubuntu Docker images
v1.2.0 (2019-10-28)
- Accept Transformer models with custom number of layers and heads
- Add a `--log-profiling` client option to profile ops execution
- Fix conversion error for models having 2 different weights with the same values
- Fix invalid MKL function override after a refactoring
- Add more information and context to several error messages
v1.1.0 (2019-10-18)
- New Docker images: `latest-ubuntu16-gpu`, `latest-ubuntu18`, `latest-ubuntu18-gpu`
- Support OpenNMT-tf Transformer models with shared embeddings
- Update to TensorRT 6
- Make OpenMP runtime configurable
- Reduce the size of models with shared weights on disk and in memory
- Shared words vocabulary is no longer duplicated on disk and in memory
- Improve performance of translation with a vocabulary map on GPU
- Statically link against Intel MKL
- Remove some implementation details from public headers
v1.0.1 (2019-10-08)
- Fix loading of newer OpenNMT-py models
- Promote FP16 to FP32 in model converter scripts
- Improve INT8 performance on CPU and GPU
- Improve performance on GPU by fusing the layer normalization operation `x * gamma + beta`
- Enable INT8 and INT16 computation on all platforms with Intel MKL 2019.5 and above
v1.0.0 (2019-09-23)
First stable release.