5a1689fc648c8afa04ee040bf1b3526a6fe3d75e [Fix] Fix update_aclgraph_sizes when running MoE models (#913)
3442fbdb235b4c6d72c2bc64a49707a7bd89958e [1/N][UT][v1 MTP] add basic v1 mtp features (#890)
05a471001baf35340e000d74ea24bb1ea153fcc7 bugfix for qwen2_5_vl (#805)
a93bed45350315586a2818e2398099a29b2aa215 [aclgraph] implement NPUPiecewiseBackend to enable aclgraph (#836)
cc74b97f742971634bcdd9282a3cf835fc8275c8 [Bugfix][V1] Fix deepseek with v1 (#958)
e3c7f71462f1c252a60ce6476336fb24311a593c [Perf] Refactor tensor disposal logic to reduce memory usage (#966)
6eddbd2521d9f22b90b302d65f986254251d49cb [CI/UT][PD Disaggregate] Initialize PD Disaggregate UT (#889)
f6e5decc109663f39e44e5fb792988b18a74fd53 [CI] upgrade to vllm 0.9.0 (#959)
9f5ab59e307a66fd0b17916218c9387c328b1c59 [WIP][BugFix] Fix accuracy issues caused by wrong etp_size passed into FusedMoEParallelConfig when using vLLM 0.9.0 (#961)
a0c3e9ba506e8c02b79a1cbe4d3e4daca15eb1d9 [Bugfix] Adjust inputbatch to be compatible with latest vllm (#945)
1f9fb869ad8832fefed92ae6ddaf36552d694c89 [BugFix] Fix accuracy bugs for unquantized deepseekv3 models (#897)
17f05b10893bd18558b3c69f7af880fffc6c1653 [Feature] Add CustomQwen3MoeForCausalLM model (#925)
df58fb80eee24139fc61c495be3ce79cf81b3f73 Spec decode support for V1 Engine (#874)
a970b27e2ddc4de2e49aebf7dca447bb143a1b5d [WIP][Perf] Remove unnecessary padding before MLA V1 prefill (#917)
dc6172efd3860ce95b40a7b3e93611f875f06d40 update attention nz and mla nz (improve TPOP performance by 6ms) (#909)
7153d8890b91b7807e817acc2a32fffbbe41e2fc [Feature] Implement v1 disaggregated prefill in ascend scheduler (#852)
b434f37b46116ec288f10ea75192ce80eb75ea86 [V1] Revert the default value of enable_chunked_prefill in additional… (#935)
46df67a5e9ab73fade08cbb2d8c0155cee7316d1 [bugfix] Improve log level and info for custom ops build (#937)
0f53b138f6ba7d31527bb105c99168fd1cbf42a8 [V1][LoRA][Test] V1 Engine LoRA support & e2e test (#893)
7aa4f85f10d55ca1b4a97f86938c9ef9e9706e44 [Bugfix][kvcache] revert multiple kv cache groups (#923)
b4d6672d018689430551f7ce2d115b4847bce239 [BugFix] Fix chunked prefill bugs in engine v1 (#844)
a73bd6caf44bfe677ffcd18387d3bdac3cf5ad48 [Fix] Set div_mode to False and fix view_as position (#912)
5cf9ff18e91b0b7031c258d71a257b8e24689763 [Performance]: Custom AscendC Kernel of Multi-Step Prepare Input (#814)
00e0243561720ad3ba66ba43e84031dd0e91814b enable online serving quantization (#877)
732664451309f34828a1f387f20cec2cbf757f14 [CI] Fix qwen2.5 vl CI failure (#888)
7a325b2e2d1001a6341ba71eebd8bd8d4458d12b [Bugfix][Model] Fix fusedmoe and make modelrunner_v1 compatible with latest vllm (#867)
1e67089bc970bfe2f667f321fd0c0777a0c55a26 [BugFix] Add all2all when dp_size > 1 and downgrade npu_dequant_swiglu_quant (#819)
68fb63428b8b972dd60ad9189538909b0eb1fcc8 [CI] Patch torch.library.infer_schema for fused moe ops to fix CI (#854)
857f489cbf20ce76f69d065a03597442804ff888 [CI] Patch torch.library.infer_schema for torch 2.5 backward compatibility (#837)
e56447033889ca95df512208cab22ef832bfdf07 [Attention][Kernel] moe support for llama4 and mllama4 (#740)
c6ac399091f927dd743267ed7cebbba957e1a92f [Bugfix] Fix the method of importing environment variables in DeepSee… (#817)
6193ba679b159cdd73b2c145423e494d90301370 [CI] add codespell CI and fix format.sh (#827)
5998704c0857c4139ae196d2ce06d748afcf70e9 [BugFix] Fix ascend scheduler bugs. (#822)
701b0fd95ea188897998d056576d38e9229b6fe0 [Enhancement] Add padding for ACL Graph (#803)
efabd722eb757e49aa309c173bbec91ca8c4ced1 feat: support torchair graph mode in v1 engine (#789)
5305a2ccf943304435a7716120ee6bb5c130a6b8 [Bugfix] Tweak distributed process group initialization and add dummy… (#816)
cdece86f2cf27a47f800403a00f0816b068493a1 [Bugfix] Add max_num_batched_tokens to InputBatch to make main CI pass (#806)
fa99f89e93d1e70d7685a128ef003333ef17b302 [Core] Support the features of prefix cache and chunked prefill in v0/v1 (#782)
324f819b929ae2c30fe629e27df0c37c2cc607b7 [Perf] Optimize fused_experts quantization code to save npu memory (#784)
2c685e3b61b0dae7ab26a95913cee1ba37bdd7c6 [Bugfix] Correct method call for _set_cos_sin_cache (#774)
6c020883a8332b5c519f4f6502733edd9b391c2b [WIP] Add Func: aclgraph_batch_size auto-adjusts to different models (#771)
2e3520e28518ad5009a68efd8d64476f52e6d147 [Bugfix] Fix output tensor shape in vanilla_chunked_prefill and update import paths for model_loader (#773)
2cd036ee8ec737cefa8d30c0acab56e4f18ea189 [Bugfix] fix accuracy problem for quantized deepseek models (#768)
d6e94176528b7b1d7e24e2cfee0b9cc663b8769d [Bugfix] Fix masked_fill_ function typo (#769)
afe1767c17cda86483a0176b451f181989757e41 [Core] Cleanup triton patch which has been fixed in vllm (#764)
d6bfae8eeebedf677b643b712d367a3a69c9cce4 support 32K model len on deepseek r1 W8A8 (#728)
d7e1110c8ed217d8fc01bc5ef4dc180aa47f6376 Re-patch TritonPlaceholder on main to make CI happy (#753)
8b194ad12ec629edda070008bdc332a0157f74ed [Disaggregated Prefill] P2P Disaggregated Prefill based on llm_datadist (#694)
84e2ed898b7edec4f358d0809a2270dfe7bacfd5 performance optimization, usability optimization and API compatibility adjustments for deepseek with npu graph mode (#731)
3a628891ab6473f9b3f052ff5fe3d19bacbae413 [Feature] Add quant description file for new quant model generated by modelslim (#719)
affca6f348f4e7a41428aba6c99ef81c24130e4d [Test] Add accuracy test report workflow (#542)
ba9714ccee43c7c12cca9e7eb02b104b85ffe2c1 Optimize qwen2_vl and qwen2_5_vl (#701)
f8350569e6267675861a8e8e4b975400268e7d2b [CI] upgrade vllm to 0.8.5 (#715)
95e7aa47363760792461525e5e9e8dbe80f85101 [Platform] format platform to make it more clear (#610)
b917361ca55b034606bfafa1f0dac5834321e506 [MISC] Clean up torch_npu (#688)
0329fad9276f4b29a4766edc9c00539b05e0592c [Perf] Deepseekv3 performance optimization for eager mode (#598)
87975fa058fe3f90d204ded42a08989a8dcb413e [Bugfix] Fix early return in CustomDeepseekV2MoE.forward during profile_run (#682)
0dae55a9a3deebdb4f2263011154d886c525fc13 [MISC] fix format check error (#654)
1fce70a2fb2602170781773104a69c69beb16161 [Model] Support common fused moe ops for moe models, such as Qwen3Moe (#709)
40bd6024856b340dfb0ad80f101d96f670c72991 [Feature] Use reshape_and_cache fused op (#706)
54c0e63df7dc513a2c2316e2ccd6ed1dfdefdff3 [MTP] follow custom deepseek modeling changes to support graph mode (#636)
be9e3e85457381fc537bca6c7ad4cb97ad39d32d [Bugfix] Fix triton placeholder patch period (#704)
5de3646522b3de5cf1e06ca579725bbaa5ed3aec [MISC] Make vllm version configurable (#651)
38f34e359f08bcf652d2a95b2a2521880286d710 [Fix] fix deepseek v0 attention eager mode (#671)
2e20797934fe1f357fe840f538543445acd7b92d [BUILD] Upgrade torch-npu to 2.5.1 (#661)
fa4a5d980e8845a88b9162cf169f0a5ab230f8a5 [Bugfix] Remove redundant tensor creation and unused code (#656)
ba3d8aae943271221ff7b1ae74551939fd935ee5 [Model][MiniCPM] support MiniCPM (#645)
742f679c7d56ebea9cee449ba2428940aa353bb9 Remove prompt string from engine core data structures (#663)
3879d9cad95c14e3cce8fc053540e369a39cd341 [CI] Fix sample backward compatibility problem (#648)
d785e785639a0ebcf21c0b5e46ab47a3b041344c [V1] Make V1 engine backward compatible (#637)
a9c6b52205c3911e0725549c0fcdd7089e8f25a7 [Bugfix] Fix qwen2.5-vl position input bug (#639)
05bdcbeae47c7fcb9b1c30cad059abf1d40b5421 support aclgraph (#426)
5c6d05a59e996ab0ce6b91e7d4e267d7be1157f8 support deepseek quant & mix-parallel with graph mode (#585)
e74331a1ede31c69ec0b1b97bd407d38742caa9c Add dp initialize patch with hccl backend (#626)
4a0ce3660ed3188c2531dd63341350df712bd97b [Misc] Remove some parts of metrics patch (#603)
538a69c1459cc8fde032b8db211ea215c78063b9 [Patch] format patch module to make it more clear (#601)
d12a057df850f2664f4388397646ec9c32beb88e Add note for deepseek related docs and remove unnecessary comments (#590)
a8d633f629cb1c6c81c80ba3bf8babcde698bf65 [Bugfix] fix import error (#600)
0ae9ee0f8a0fb8f8c20dd5005932a68550a15f89 [BUGFIX] main-sd-bugfix && [UT] add mtp UT (#593)
5442b463fd7b232fc560dd549e50b737043992cb add doc for patch_config (#574)
12cae04db9ebe6cd70dede409c63ed5d326537fb [quantization] Support w8a8 quantization (#580)
1a1f9a6d894a3947fcff4c5c52fa0846c35e5759 port deepseekv2 and mtp to main branch (#429)
a127cc83f89c249ff062be76124e54a05e2a31c6 catch ImportError when C code not compiled (#575)
65c1f4579fb3cb85f0d56d5b08d4d226aac62d9d [V1][Structured Output] Add `apply_grammar_bitmask()` method to model runner (#555)
84563fc65d938f1b2655eac18164fca5666d3c67 Add sleep mode feature for Ascend NPU (#513)
42c7fbb10eb0d128488c38dc6d1061608fad04cb [Misc] Fix import error and address nits to make CI happy (#563)
66a0837963ff5dd6734907083ca3cf57e6bb223b adopt rope in vllm-ascend (#530)
23f85e3f7425d0041abc28a46d6010740ec36bbc [BugFix] Fix scheduler problems in last PR. (#558)
6ee7f5cf711509349fc06f7871659174395d1801 [SpecDecode] Add spec decode support (#500)
20dff4deffc90dfcb96472de5ae36737e78cba96 [Scheduler] Add AscendScheduler. (#543)
697908f5cd7c65a3a917ec1a962b0886efc98c7e [Platform][Worker][ModelRunner] Add LoRA & Multi-LoRA support (#521)
9935d457289ae85c0bf3ffbe5496875c6eb34782 [CI]Add model basic accuracy test(Qwen2.5-0.5B-Instruct) (#460)
c3d1a3782aee57432b9f6071a1bdc02462117f21 Add pyhccl (#503)
6061f3367010961e6d35b0857c719a1aee391323 [Bugfix][Model] Fix api in DeepSeek model (#545)
415ed027fadb730694855f987f34501481e8431d [V1][Platform] Remove `supports_structured_output()` in platform (#531)
bbe7ccd3664928560169cb0f965a6ee768a5a993 [MISC] Add patch module (#526)
bcbc04f92b258b5b6aa4e7fe4ae073611de97bc1 [Doc] Add environment variables doc (#519)
44a8301424ded94dae83e13b837f5bfc0a1bfc15 [Feature] Add PD separation feature (#432)
c7f6584d75e164f6140f2d822e454d477b887e27 [V1] clean up V1 code (#505)
f6af1d2471b99942d5284d804c10d4cb3c3b38b8 [MISC] fix logger (#515)
9c7428b3d5b63939c15ae713edc3871e51b98cbc [CI] enable custom ops build (#466)
f6cf92e7d55004bf8eb8729d71acaf0635f88883 [quant][bugfix] fix deepseek quant bug (#478)
1d88dacf9f9656b78211dae045227096e47d5795 [V1][Platform] Add `supports_structured_output()` method to Platform (#475)
344228a5da6130bf1d7e5f3c3d04cc8b661748ff [deepseek][bugfix] support deepseek quant (#469)
3f9752f8ee1c71435aeccfd6f753b70989071767 [Bugfix] Lazy import vllm config (#462)
ce8259975e0befc4830152f631c826ccc046e808 [core] Support custom ascendc kernels in vllm-ascend (#233)
14d9a640472f7e63f90c9a138e614c4e6ad0f06e [ModelRunner][V1] Optimize V1 attention mask (#442)
2dbd763584699fbea874b3fa083a8eb273447732 [CI] Fix mypy CI (#443)
31f29b9f30eb65f0a38d30ffe613282ce2f1130a [Core] Make V1 work and enable V1 engine test (#389)
57a84bb7befeaa0dc62aa35fa406e4d6affbfcca [Bug Fix] Fix bug of platform for parameter checking (#411)
b1557abab6534af830f1555f262332aba2bf6e51 fix multistep bug, remove useless code (#355)
122505208ff6284f409846ca7294f4a4b9883285 FastPatch: Optimized Patch Embedding for Qwen2VL (#345)
89ca63a2c2d98dbd153b33596888090426a9e9f0 [Bugfix] Disable torch.compile() (#370)
befbee5883446ccb8b0df255a911168079a414f1 Update README and add collect_env info (#369)
c06af8b2e0f4ace8caf20f4cf4fdbb1978647df6 [V1][Core] Add support for V1 Engine (#295)
7330416de3fd2f8c6b9b82fb1ad0adfb9c70d483 [BugFix] Fix bugs when using ascend quantization (#275)
5c7a95b01d339bd33e343ef14ff497fa1b2f4eea [Attn] Support encoder-only attention with torch sdpa (#290)
12aa7115b58e6def5603e4eae6744f0af8e05634 bugfix for qwen2_vl (#301)
0db6670bfab8cb1d84c9e7270df0a1d42d6ce7ca [Feature] Implement EP-compatible fused_moe (#121)
4c9d78a0354773267b50772ffde86f85d18d3ed0 support multistep decode (#299)
feb6bdb12e6e5c2dfce8dbd9232e97d435b52ecb [Platform][Model Runner] Add hash of request_ids; Change blocksize back to 128. (#293)
faf8cd89cb9a853bd8a3c16b8d7321e2da4a2342 register qwen2_vl to rewrite qwen2_vl forward (#241)
3217f0d10fbbc6e6cc8b0db9594b8cef515b4f90 [Feature] Modify description and api for ascend quantization (#243)
dcd0005058dbd6fd8672378565890cbda924b792 [Fix] Remove npu_group_topk before CANN version update (#242)
0d3463400a8ae776fc637f4db3a464c0d0dc3da6 [Performance] Change the shape of kv_cache to avoid view of k_cache and v_cache. (#204)
503f5045ffdeb26a17cb3a7972a8310dadb9089c [ModelRunner] Remove redundant profile_run() in model runner (#224)
ae49bfd13a8c6c6f549872e6a7666e2d3c68bbae [Core] Support pooling (#229)
b64ee7d346511b6ea7a64b09db58c17aa1c915ef [Dist] Set device as rank (#202)
14bca9911a265bb3c75708dbd4fcdfe56d267db4 [CI] Fix unsolved bugs caused by pta api change. (#190)
1715230867048aaf3102dbe6448b3c476db74c9e [CI] Upgrade to newest pta (MLA and FusedMoE) (#189)
c131e43e7d5983b394d6846de432b3a0d7031935 [Worker] Lazy import torch_npu (#184)
6042c210bc715573a65c76209445a3d92054c1a6 [CI] upgrade to newest pta (#187)
fd18ae649453fa4c31b58b04ba75d3fd9ed0b3d4 [MOE] fix #176 (#179)
ee43179767ba1a61be543ed42beca276bee061eb [ModelRunner] Fix cuda hard code in model runner (#155)
94cd66bba7b8e90a4b00eb92649b1239aabf3780 [CI][UT] enable multimodal ut (#158)
1c238b930d2b21a37140a1b32e5ac12465fe5c6f [worker] remove unused assertion (#161)
7776f2e6a4ef5c372b53c6f092a5062c4d2fe083 [ModelRunner] remove padding for vlm inputs (#150)
79fbb20b4db5538f33ae1d1fc6f531847a42de8b [ModelRunner] remove unused args (follow vllm changes) (#159)
d0b3cb4fa79d5fc7f8245a3c68885ce1fa030ba4 modify: Eliminate redundant operations in the code to improve performance (#137)
202b39a38c2869b0ecc3df486550fb555a2eb0c0 Ray Worker Ops Optimization (#136)
386817b4d1c0781abcc5ab5370da3b444882a74d [Model Runner][Performance] Cache the judgement result of is_encoder_decoder to decrease framework overhead (#138)
dd425d68f8a51a7b1fcb60a193fcd0d3ea1848a6 [Platform] add dispatch key (#17)
5f465010deef1a2b507a107e720c1c366161d820 [Core] Cherry pick from 0.7.1 to keep the main code newest (#127)
8ea8523744138da981bf952f28a5eb304f9898c3 reset default block_size from 16 to 128 (#84)
4544e99d88aed9247381a420c896d39fced69096 [dist] revert communicator patch (#66)
b88443b6c645942b89991c3df35f5485630e8df3 [dist] fix communicator patch (#58)
f762ee89cc2e9fc7378b696139b264d80600adba [Communicator] Add monkey patch (#30)
70068359770b6e8cfcbb9931aa79be50731a274c [attn] fix device of tensors in attention (#25)
8fc5dc966aaf4e174d1ec0d1902c40289411ec0e [Worker] Register mindie_turbo while initializing NPUWorker (#13)
4495fc68389e3fb1ef14534c202948931e38446b bugfix for mrope (#14)
bfccf739e2fe121b54d9b198c2ec205a9379190e [ModelRunner] Refactor model_runner for NPU (#6)
d5e7756028bd5884ade96b654555c375770a2f64 [Core] Init vllm-ascend (#3)