Error when running with vllm==0.9.2 and CUDA 12.6 on Ampere GPUs

#8
opened by EmilPi

I have 4x RTX 3090.
I'm trying to run:

vllm serve --host 0.0.0.0 --port 1238 ai21labs/AI21-Jamba-Mini-1.7-FP8 --max-model-len 32768 --max-num-seqs 2 --tensor-parallel-size 4

and I get the following (part of the log):

...

INFO 07-10 00:40:32 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.2) with config: model='ai21labs/AI21-Jamba-Mini-1.7-FP8', speculative_config=None, tokenizer='ai21labs/AI21-Jamba-Mini-1.7-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=ai21labs/AI21-Jamba-Mini-1.7-FP8, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":2,"local_cache_dir":null}, use_cached_outputs=True, 
WARNING 07-10 00:40:33 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 07-10 00:40:33 [cuda.py:363] Using Flash Attention backend.
INFO 07-10 00:40:37 [__init__.py:244] Automatically detected platform cuda.
INFO 07-10 00:40:37 [__init__.py:244] Automatically detected platform cuda.
INFO 07-10 00:40:37 [__init__.py:244] Automatically detected platform cuda.
(VllmWorkerProcess pid=3666110) INFO 07-10 00:40:39 [multiproc_worker_utils.py:226] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3666108) INFO 07-10 00:40:39 [multiproc_worker_utils.py:226] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3666109) INFO 07-10 00:40:39 [multiproc_worker_utils.py:226] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3666110) INFO 07-10 00:40:39 [cuda.py:363] Using Flash Attention backend.
(VllmWorkerProcess pid=3666108) INFO 07-10 00:40:39 [cuda.py:363] Using Flash Attention backend.
(VllmWorkerProcess pid=3666109) INFO 07-10 00:40:39 [cuda.py:363] Using Flash Attention backend.
INFO 07-10 00:40:46 [__init__.py:1152] Found nccl from library libnccl.so.2
INFO 07-10 00:40:46 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorkerProcess pid=3666108) INFO 07-10 00:40:46 [__init__.py:1152] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3666109) INFO 07-10 00:40:46 [__init__.py:1152] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3666110) INFO 07-10 00:40:46 [__init__.py:1152] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3666109) INFO 07-10 00:40:46 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorkerProcess pid=3666108) INFO 07-10 00:40:46 [pynccl.py:70] vLLM is using nccl==2.26.2

...

(VllmWorkerProcess pid=3666108) INFO 07-10 00:40:56 [default_loader.py:272] Loading weights took 7.79 seconds
(VllmWorkerProcess pid=3666108) WARNING 07-10 00:40:56 [marlin_utils_fp8.py:166] Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
INFO 07-10 00:40:56 [model_runner.py:1203] Model loading took 12.8324 GiB and 8.754978 seconds

...

(VllmWorkerProcess pid=3666108) INFO 07-10 00:41:08 [worker.py:294] model weights take 12.83GiB; non_torch_memory takes 0.19GiB; PyTorch activation peak memory takes 2.33GiB; the rest of the memory reserved for KV Cache is 5.88GiB.
INFO 07-10 00:41:08 [worker.py:294] Memory profiling takes 11.24 seconds
INFO 07-10 00:41:08 [worker.py:294] the current vLLM instance can use total_gpu_memory (23.59GiB) x gpu_memory_utilization (0.90) = 21.23GiB
INFO 07-10 00:41:08 [worker.py:294] model weights take 12.83GiB; non_torch_memory takes 0.19GiB; PyTorch activation peak memory takes 2.33GiB; the rest of the memory reserved for KV Cache is 5.88GiB.
INFO 07-10 00:41:09 [executor_base.py:113] # cuda blocks: 96340, # CPU blocks: 65536
INFO 07-10 00:41:09 [executor_base.py:118] Maximum concurrency for 32768 tokens per request: 47.04x
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] Exception in worker VllmWorkerProcess while processing method initialize_cache.
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] Exception in worker VllmWorkerProcess while processing method initialize_cache.
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] Traceback (most recent call last):
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] Traceback (most recent call last):
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/executor/multiproc_worker_utils.py", line 233, in _run_worker_process
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/executor/multiproc_worker_utils.py", line 233, in _run_worker_process
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2736, in run_method
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2736, in run_method
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     return func(*args, **kwargs)
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     return func(*args, **kwargs)
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/worker.py", line 334, in initialize_cache
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/worker.py", line 334, in initialize_cache
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     self._init_cache_engine()
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     self._init_cache_engine()
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/worker.py", line 340, in _init_cache_engine
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/worker.py", line 340, in _init_cache_engine
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     CacheEngine(self.cache_config, self.model_config,
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     CacheEngine(self.cache_config, self.model_config,
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/cache_engine.py", line 67, in __init__
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/cache_engine.py", line 67, in __init__
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     self.cpu_cache = self._allocate_kv_cache(self.num_cpu_blocks, "cpu")
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     self.cpu_cache = self._allocate_kv_cache(self.num_cpu_blocks, "cpu")
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/cache_engine.py", line 96, in _allocate_kv_cache
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/cache_engine.py", line 96, in _allocate_kv_cache
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     layer_kv_cache = torch.zeros(
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     layer_kv_cache = torch.zeros(
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]                      ^^^^^^^^^^^^
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]                      ^^^^^^^^^^^^
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] RuntimeError: CUDA error: invalid argument
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] RuntimeError: CUDA error: invalid argument
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorkerProcess pid=3666108) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] 
(VllmWorkerProcess pid=3666110) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] 
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] Exception in worker VllmWorkerProcess while processing method initialize_cache.
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] Traceback (most recent call last):
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/executor/multiproc_worker_utils.py", line 233, in _run_worker_process
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2736, in run_method
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     return func(*args, **kwargs)
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/worker.py", line 334, in initialize_cache
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     self._init_cache_engine()
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/worker.py", line 340, in _init_cache_engine
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     CacheEngine(self.cache_config, self.model_config,
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/cache_engine.py", line 67, in __init__
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     self.cpu_cache = self._allocate_kv_cache(self.num_cpu_blocks, "cpu")
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/cache_engine.py", line 96, in _allocate_kv_cache
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]     layer_kv_cache = torch.zeros(
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239]                      ^^^^^^^^^^^^
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] RuntimeError: CUDA error: invalid argument
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorkerProcess pid=3666109) ERROR 07-10 00:41:09 [multiproc_worker_utils.py:239] 
ERROR 07-10 00:41:09 [engine.py:458] CUDA error: invalid argument
ERROR 07-10 00:41:09 [engine.py:458] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 07-10 00:41:09 [engine.py:458] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 07-10 00:41:09 [engine.py:458] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 07-10 00:41:09 [engine.py:458] Traceback (most recent call last):
ERROR 07-10 00:41:09 [engine.py:458]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 446, in run_mp_engine
ERROR 07-10 00:41:09 [engine.py:458]     engine = MQLLMEngine.from_vllm_config(
ERROR 07-10 00:41:09 [engine.py:458]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-10 00:41:09 [engine.py:458]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 133, in from_vllm_config
ERROR 07-10 00:41:09 [engine.py:458]     return cls(
ERROR 07-10 00:41:09 [engine.py:458]            ^^^^
ERROR 07-10 00:41:09 [engine.py:458]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 87, in __init__
ERROR 07-10 00:41:09 [engine.py:458]     self.engine = LLMEngine(*args, **kwargs)
ERROR 07-10 00:41:09 [engine.py:458]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-10 00:41:09 [engine.py:458]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 268, in __init__
ERROR 07-10 00:41:09 [engine.py:458]     self._initialize_kv_caches()
ERROR 07-10 00:41:09 [engine.py:458]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 426, in _initialize_kv_caches
ERROR 07-10 00:41:09 [engine.py:458]     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
ERROR 07-10 00:41:09 [engine.py:458]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 124, in initialize_cache
ERROR 07-10 00:41:09 [engine.py:458]     self.collective_rpc("initialize_cache",
ERROR 07-10 00:41:09 [engine.py:458]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 332, in collective_rpc
ERROR 07-10 00:41:09 [engine.py:458]     return self._run_workers(method, *args, **(kwargs or {}))
ERROR 07-10 00:41:09 [engine.py:458]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-10 00:41:09 [engine.py:458]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/executor/mp_distributed_executor.py", line 186, in _run_workers
ERROR 07-10 00:41:09 [engine.py:458]     driver_worker_output = run_method(self.driver_worker, sent_method,
ERROR 07-10 00:41:09 [engine.py:458]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-10 00:41:09 [engine.py:458]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2736, in run_method
ERROR 07-10 00:41:09 [engine.py:458]     return func(*args, **kwargs)
ERROR 07-10 00:41:09 [engine.py:458]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-10 00:41:09 [engine.py:458]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/worker.py", line 334, in initialize_cache
ERROR 07-10 00:41:09 [engine.py:458]     self._init_cache_engine()
ERROR 07-10 00:41:09 [engine.py:458]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/worker.py", line 340, in _init_cache_engine
ERROR 07-10 00:41:09 [engine.py:458]     CacheEngine(self.cache_config, self.model_config,
ERROR 07-10 00:41:09 [engine.py:458]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/cache_engine.py", line 67, in __init__
ERROR 07-10 00:41:09 [engine.py:458]     self.cpu_cache = self._allocate_kv_cache(self.num_cpu_blocks, "cpu")
ERROR 07-10 00:41:09 [engine.py:458]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-10 00:41:09 [engine.py:458]   File "/home/ai/3rdparty/vllm_dir/.venv/lib/python3.12/site-packages/vllm/worker/cache_engine.py", line 96, in _allocate_kv_cache
ERROR 07-10 00:41:09 [engine.py:458]     layer_kv_cache = torch.zeros(
ERROR 07-10 00:41:09 [engine.py:458]                      ^^^^^^^^^^^^
ERROR 07-10 00:41:09 [engine.py:458] RuntimeError: CUDA error: invalid argument
ERROR 07-10 00:41:09 [engine.py:458] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 07-10 00:41:09 [engine.py:458] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 07-10 00:41:09 [engine.py:458] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
...

@EmilPi apologies for the delayed response.

As one of the warnings early in the log indicates, the RTX 3090 does not have native support for FP8 computation.
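As a quick sanity check (a minimal one-liner using PyTorch; device index 0 is just an example), you can print the GPU's compute capability. Native FP8 compute requires compute capability 8.9 (Ada Lovelace) or 9.0 (Hopper), while the RTX 3090 (Ampere) reports 8.6:

python -c "import torch; print(torch.cuda.get_device_capability(0))"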

This can be solved by using the regular (non-FP8) Jamba Mini 1.7 weights and passing --quantization experts_int8 in the vLLM serve command, as shown below.
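For example, a command along these lines should work (a sketch assuming the unquantized weights are the ai21labs/AI21-Jamba-Mini-1.7 repository and keeping the same host/port and parallelism settings as above):

vllm serve --host 0.0.0.0 --port 1238 ai21labs/AI21-Jamba-Mini-1.7 --quantization experts_int8 --max-model-len 32768 --max-num-seqs 2 --tensor-parallel-size 4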
