vLLM + Mixtral AWQ question about chat template and tokenizer #3092
Michelklingler started this conversation in General
I'm currently running an instance of "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ" on a RTX A6000 ADA.
For some reason I get weird responses when I talk with the AI, or at least not as good as the ones I was getting when I used Ollama as an inference server. I was wondering if I need to specify a chat template location or a tokenizer?
This is the command I use to run the server:
python -u -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --model TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ --quantization gptq --dtype half --api-key BLANK --gpu-memory-utilization 0.87
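For reference, the launch log below shows that the server accepts --tokenizer and --chat-template options (they appear as tokenizer=None and chat_template=None in the args). This is a sketch of what I think passing them explicitly would look like; I haven't confirmed it is needed, and the ./mixtral_chat_template.jinja path is just a placeholder for a local template file:

python -u -m vllm.entrypoints.openai.api_server --host 0.0.0.0 \
  --model TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ \
  --quantization gptq --dtype half --api-key BLANK \
  --gpu-memory-utilization 0.87 \
  --tokenizer mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --chat-template ./mixtral_chat_template.jinja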
And this is the launch log I get:
python -u -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --model TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ --quantization gptq --dtype half --api-key BLANK --gpu-memory-utilization 0.87
INFO 02-28 15:04:29 api_server.py:229] args: Namespace(host='0.0.0.0', port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='BLANK', served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='half', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.87, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization='gptq', enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='cuda', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 02-28 15:04:29 config.py:577] Casting torch.bfloat16 to torch.float16.
WARNING 02-28 15:04:29 config.py:186] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 02-28 15:04:29 llm_engine.py:79] Initializing an LLM engine with config: model='TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ', tokenizer='TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 02-28 15:04:32 weight_utils.py:163] Using model weights format ['*.safetensors']
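In case it is relevant, this is the quick check I have been using to see what prompt the tokenizer builds from a chat. It's only a sketch, and it assumes the GPTQ repo ships the same tokenizer_config.json and chat template as the original mistralai/Mixtral-8x7B-Instruct-v0.1:

# Render the chat prompt locally to compare it against the
# [INST] ... [/INST] format Mixtral-Instruct expects.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ")

messages = [
    {"role": "user", "content": "Hello, who are you?"},
]

# If tok.chat_template is None, the server would presumably need an
# explicit --chat-template file instead of relying on the repo's tokenizer.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(repr(prompt))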
Any advice or support would be appreciated.
Thanks,
Michel