Error when running with vllm==0.9.2 and CUDA 12.6 on Ampere GPUs

#8
opened by EmilPi

I have 4x RTX 3090.
I'm trying to run:

vllm serve --host 0.0.0.0 --port 1238 ai21labs/AI21-Jamba-Mini-1.7-FP8 --max-model-len 32768 --max-num-seqs 2 --tensor-parallel-size 4

and I get the following (part of the log):

...

INFO 07-10 00:40:32 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.2) with config: model='ai21labs/AI21-Jamba-Mini-1.7-FP8', speculative_config=None, tokenizer='ai21labs/AI21-Jamba-Mini-1.7-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=ai21labs/AI21-Jamba-Mini-1.7-FP8, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":2,"local_cache_dir":null}, use_cached_outputs=True, 
WARNING 07-10 00:40:33 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 07-10 00:40:33 [cuda.py:363] Using Flash Attention backend.
INFO 07-10 00:40:37 [__init__.py:244] Automatically detected platform cuda.
INFO 07-10 00:40:37 [__init__.py:244] Automatically detected platform cuda.
INFO 07-10 00:40:37 [__init__.py:244] Automatically detected platform cuda.
(VllmWorkerProcess pid=3666110) INFO 07-10 00:40:39 [multiproc_worker_utils.py:226] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3666108) INFO 07-10 00:40:39 [multiproc_worker_utils.py:226] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3666109) INFO 07-10 00:40:39 [multiproc_worker_utils.py:226] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3666110) INFO 07-10 00:40:39 [cuda.py:363] Using Flash Attention backend.
(VllmWorkerProcess pid=3666108) INFO 07-10 00:40:39 [cuda.py:363] Using Flash Attention backend.
(VllmWorkerProcess pid=3666109) INFO 07-10 00:40:39 [cuda.py:363] Using Flash Attention backend.
INFO 07-10 00:40:46 [__init__.py:1152] Found nccl from library libnccl.so.2
INFO 07-10 00:40:46 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorkerProcess pid=3666108) INFO 07-10 00:40:46 [__init__.py:1152] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3666109) INFO 07-10 00:40:46 [__init__.py:1152] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3666110) INFO 07-10 00:40:46 [__init__.py:1152] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3666109) INFO 07-10 00:40:46 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorkerProcess pid=3666108) INFO 07-10 00:40:46 [pynccl.py:70] vLLM is using nccl==2.26.2

...

(VllmWorkerProcess pid=3666108) INFO 07-10 00:40:56 [default_loader.py:272] Loading weights took 7.79 seconds
(VllmWorkerProcess pid=3666108) WARNING 07-10 00:40:56 [marlin_utils_fp8.py:166] Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
INFO 07-10 00:40:56 [model_runner.py:1203] Model loading took 12.8324 GiB and 8.754978 seconds

...

(VllmWorkerProcess pid=3666108) INFO 07-10 00:41:08 [worker.py:294] model weights take 12.83GiB; non_torch_memory takes 0.19GiB; PyTorch activation peak memory takes 2.33GiB; the rest of the memory reserved for KV Cache is 5.88GiB.
INFO 07-10 00:41:08 [worker.py:294] Memory profiling takes 11.24 seconds
INFO 07-10 00:41:08 [worker.py:294] the current vLLM instance can use total_gpu_memory (23.59GiB) x gpu_memory_utilization (0.90) = 21.23GiB
INFO 07-10 00:41:08 [worker.py:294] model weights take 12.83GiB; non_torch_memory takes 0.19GiB; PyTorch activation peak memory takes 2.33GiB; the rest of the memory reserved for KV Cache is 5.88GiB.
INFO 07-10 00:41:09 [executor_base.py:113] # cuda blocks: 96340, # CPU blocks: 65536
INFO 07-10 00:41:09 [executor_base.py:118] Maximum concurrency for 32768 tokens per request: 47.04x
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] Exception in worker VllmWorkerProcess while processing method initialize_cache.
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] Exception in worker VllmWorkerProcess while processing method initialize_cache.
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] Traceback (most recent call last):
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] Traceback (most recent call last):
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/executor/multiproc_worker_utils.py", line 233, in _run_worker_process
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/executor/multiproc_worker_utils.py", line 233, in _run_worker_process
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2736, in run_method
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2736, in run_method
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     return func(*args, **kwargs)
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     return func(*args, **kwargs)
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/worker.py", line 334, in initialize_cache
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/worker.py", line 334, in initialize_cache
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     self._init_cache_engine()
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     self._init_cache_engine()
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/worker.py", line 340, in _init_cache_engine
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/worker.py", line 340, in _init_cache_engine
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     CacheEngine(self.cache_config, self.model_config,
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     CacheEngine(self.cache_config, self.model_config,
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/cache_engine.py", line 67, in __init__
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/cache_engine.py", line 67, in __init__
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     self.cpu_cache = self._allocate_kv_cache(self.num_cpu_blocks, "cpu")
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     self.cpu_cache = self._allocate_kv_cache(self.num_cpu_blocks, "cpu")
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/cache_engine.py", line 96, in _allocate_kv_cache
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/cache_engine.py", line 96, in _allocate_kv_cache
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     layer_kv_cache = torch.zeros(
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     layer_kv_cache = torch.zeros(
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]                      ^^^^^^^^^^^^
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]                      ^^^^^^^^^^^^
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] RuntimeError: CUDA error: invalid argument
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] RuntimeError: CUDA error: invalid argument
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] 
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] 
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] Exception in worker VllmWorkerProcess while processing method initialize_cache.
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] Traceback (most recent call last):
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/executor/multiproc_worker_utils.py", line 233, in _run_worker_process
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2736, in run_method
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     return func(*args, **kwargs)
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/worker.py", line 334, in initialize_cache
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     self._init_cache_engine()
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/worker.py", line 340, in _init_cache_engine
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     CacheEngine(self.cache_config, self.model_config,
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/cache_engine.py", line 67, in __init__
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     self.cpu_cache = self._allocate_kv_cache(self.num_cpu_blocks, "cpu")
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/cache_engine.py", line 96, in _allocate_kv_cache
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     layer_kv_cache = torch.zeros(
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]                      ^^^^^^^^^^^^
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] RuntimeError: CUDA error: invalid argument
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] 
ERROR 07-10 00:41:09 [engine.py:458] CUDA error: invalid argument
ERROR 07-10 00:41:09 [engine.py:458] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 07-10 00:41:09 [engine.py:458] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 07-10 00:41:09 [engine.py:458] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 07-10 00:41:09 [engine.py:458] Traceback (most recent call last):
ERROR 07-10 00:41:09 [engine.py:458]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 446, in run_mp_engine
ERROR 07-10 00:41:09 [engine.py:458]     engine = MQLLMEngine.from_vllm_config(
ERROR 07-10 00:41:09 [engine.py:458]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-10 00:41:09 [engine.py:458]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 133, in from_vllm_config
ERROR 07-10 00:41:09 [engine.py:458]     return cls(
ERROR 07-10 00:41:09 [engine.py:458]            ^^^^
ERROR 07-10 00:41:09 [engine.py:458]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 87, in __init__
ERROR 07-10 00:41:09 [engine.py:458]     self.engine = LLMEngine(*args, **kwargs)
ERROR 07-10 00:41:09 [engine.py:458]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-10 00:41:09 [engine.py:458]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 268, in __init__
ERROR 07-10 00:41:09 [engine.py:458]     self._initialize_kv_caches()
ERROR 07-10 00:41:09 [engine.py:458]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 426, in _initialize_kv_caches
ERROR 07-10 00:41:09 [engine.py:458]     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
ERROR 07-10 00:41:09 [engine.py:458]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 124, in initialize_cache
ERROR 07-10 00:41:09 [engine.py:458]     self.collective_rpc("initialize_cache",
ERROR 07-10 00:41:09 [engine.py:458]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 332, in collective_rpc
ERROR 07-10 00:41:09 [engine.py:458]     return self._run_workers(method, *args, **(kwargs or {}))
ERROR 07-10 00:41:09 [engine.py:458]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-10 00:41:09 [engine.py:458]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/executor/mp_distributed_executor.py", line 186, in _run_workers
ERROR 07-10 00:41:09 [engine.py:458]     driver_worker_output = run_method(self.driver_worker, sent_method,
ERROR 07-10 00:41:09 [engine.py:458]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-10 00:41:09 [engine.py:458]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2736, in run_method
ERROR 07-10 00:41:09 [engine.py:458]     return func(*args, **kwargs)
ERROR 07-10 00:41:09 [engine.py:458]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-10 00:41:09 [engine.py:458]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/worker.py", line 334, in initialize_cache
ERROR 07-10 00:41:09 [engine.py:458]     self._init_cache_engine()
ERROR 07-10 00:41:09 [engine.py:458]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/worker.py", line 340, in _init_cache_engine
ERROR 07-10 00:41:09 [engine.py:458]     CacheEngine(self.cache_config, self.model_config,
ERROR 07-10 00:41:09 [engine.py:458]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/cache_engine.py", line 67, in __init__
ERROR 07-10 00:41:09 [engine.py:458]     self.cpu_cache = self._allocate_kv_cache(self.num_cpu_blocks, "cpu")
ERROR 07-10 00:41:09 [engine.py:458]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-10 00:41:09 [engine.py:458]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/cache_engine.py", line 96, in _allocate_kv_cache
ERROR 07-10 00:41:09 [engine.py:458]     layer_kv_cache = torch.zeros(
ERROR 07-10 00:41:09 [engine.py:458]                      ^^^^^^^^^^^^
ERROR 07-10 00:41:09 [engine.py:458] RuntimeError: CUDA error: invalid argument
ERROR 07-10 00:41:09 [engine.py:458] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 07-10 00:41:09 [engine.py:458] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 07-10 00:41:09 [engine.py:458] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
...

@EmilPi apologies for the delayed response.

As one of the warnings early in the log indicates, the RTX 3090 does not have native support for FP8 computation.
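As a quick sanity check (a minimal one-liner using PyTorch; device index 0 is just an example), you can print the GPU's compute capability. Native FP8 compute requires compute capability 8.9 (Ada Lovelace) or 9.0 (Hopper), while the RTX 3090 (Ampere) reports 8.6:

python -c "import torch; print(torch.cuda.get_device_capability(0))"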

This can be solved by using the regular (non-FP8) Jamba Mini 1.7 weights and passing --quantization experts_int8 in the vLLM serve command, as shown below.
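For example, a command along these lines should work (a sketch assuming the unquantized weights are the ai21labs/AI21-Jamba-Mini-1.7 repository and keeping the same host/port and parallelism settings as above):

vllm serve --host 0.0.0.0 --port 1238 ai21labs/AI21-Jamba-Mini-1.7 --quantization experts_int8 --max-model-len 32768 --max-num-seqs 2 --tensor-parallel-size 4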
