Add triton support
Updated the Triton implementation of rotary to support platforms such as XPU.
Intel is gradually phasing out IPEX. This kernel currently only supports CUDA. This PR implements calling the rotary kernel on XPU without modifying the CUDA interface or calling method.
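For illustration only, here is a minimal sketch of how a launch wrapper might pick a device context without touching the CUDA call path. It is not the code in this PR: `device_guard` is a hypothetical helper name, and the sketch assumes the installed PyTorch exposes a `device` context manager on the backend module (as `torch.cuda` and recent `torch.xpu` do), falling back to a no-op otherwise.

```python
import contextlib


def device_guard(device_type: str, index: int = 0):
    """Return a context manager selecting the launch device (hypothetical helper)."""
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return contextlib.nullcontext()
    # Look up the backend module by device type, e.g. torch.cuda or torch.xpu.
    backend = getattr(torch, device_type, None)
    # Use the backend's `device` context manager if it has one.
    device_ctx = getattr(backend, "device", None)
    if device_ctx is None:
        # Unknown backend or no device context available: do nothing.
        return contextlib.nullcontext()
    return device_ctx(index)
```

A caller could then wrap the Triton launch in `with device_guard(x.device.type, x.device.index): ...`, so the same call site serves both CUDA and XPU tensors.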
Hi, I discussed this with Daniel internally, and they suggest moving this kernel to a separate repo. It would be cool to host it on the Intel org.
The idea is that we keep universal kernels in separate repos and do not mix them with CUDA/ROCm/SYCL/Metal-specific kernels (see the universal tag at https://huggingface.co/kernels-community/triton-layer-norm/blob/main/build.toml#L3 for an example).
The suggested PR/code is also missing build/validation components and needs a proper torch-ext/build.toml. It's really important that the kernel goes through the build cycle to avoid any issues from a non-unique ops identifier.