Linoy Tsaban (linoyts) PRO
AI & ML interests: None yet
Recent Activity
published a Space about 4 hours ago: zerogpu-aoti/Qwen-Image-Edit-Multi-Image
updated a collection about 4 hours ago: Qwen Image Edit Accelerated Inference
updated a Space about 4 hours ago: zerogpu-aoti/Qwen-Image-Edit-Multi-Image

reacted to AdinaY's post with 🔥 4 days ago

reacted to a-r-r-o-w's post with 🔥 about 2 months ago
Caching is an essential technique used in diffusion inference serving for speeding up image/video generations. Diffusers just added support for another caching method: First Block Cache - a technique developed by @chengzeyi, building upon the ideas of TeaCache.
The idea in short: if the model predictions do not vary much over successive inference steps, we can skip the steps where the prediction difference is small. To figure out whether an inference step will make a significant difference to the overall velocity/noise prediction, we calculate the relative difference between the output of the first transformer block at timestep $t$ and at $t-1$, and compare it against a selected threshold. If the difference is lower than the threshold, we skip the step. A higher threshold leads to more steps being skipped. However, skipping too many steps can throw off the model predictions, so we need to test and select the threshold based on the quality-speed tradeoff we're willing to accept for every model we use it with.
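Conceptually, the skip check boils down to something like this (a rough sketch for illustration only; the function name and the exact difference metric are assumptions, not the Diffusers internals):

import torch

def should_skip_step(first_block_out, prev_first_block_out, threshold=0.2):
    # relative L1 difference between the first block's output at timestep t and at t-1
    rel_diff = (first_block_out - prev_first_block_out).abs().mean() / prev_first_block_out.abs().mean()
    # if the first block barely changed, reuse the cached result of the remaining blocks
    return rel_diff.item() < threshold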
Diffusers usage with CogView4:
import torch
from diffusers import CogView4Pipeline
from diffusers.hooks import apply_first_block_cache, FirstBlockCacheConfig

# load the pipeline in bf16 and move it to GPU
pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# enable First Block Cache on the transformer; a higher threshold skips more steps
apply_first_block_cache(pipe.transformer, FirstBlockCacheConfig(threshold=0.2))

prompt = "A photo of an astronaut riding a horse on mars"
image = pipe(prompt, generator=torch.Generator().manual_seed(42)).images[0]
image.save("output.png")
Below, you'll find the benchmarks and visualizations of the predicted output at different blocks of the Flux DiT.
Docs: https://huggingface.co/docs/diffusers/main/en/optimization/cache
PR: https://github.com/huggingface/diffusers/pull/11180
References:
- First Block Cache: https://github.com/chengzeyi/ParaAttention
- TeaCache: https://github.com/ali-vilab/TeaCache

reacted to merve's post with ❤️ 2 months ago
Release picks of the past week are here! Find more models, datasets, and Spaces here: merve/june-20-releases-68594824d1f4dfa61aee3433
🖼️ VLMs/OCR
> moonshotai/Kimi-VL-A3B-Thinking-2506 is a powerful reasoning vision LM with 3B active params, smarter with fewer tokens; supports long documents and videos 👏 (OS)
> nanonets/Nanonets-OCR-s is a 3.75B-param OCR model based on Qwen2.5VL-3B-Instruct (OS)
💬 LLMs
> moonshotai/Kimi-Dev-72B is a strong coding model based on Qwen2.5-72B (OS)
> Mistral released mistralai/Mistral-Small-3.2-24B-Instruct-2506, an update to their former model with better function calling & instruction following (OS)
🗣️ Audio
> Google released google/magenta-realtime for real-time music generation & audio synthesis (cc-by-4)
> kyutai released new speech-to-text models in 1B & 2B sizes (kyutai/stt-1b-en_fr, stt-2b-en_fr) with 0.5s and 2.5s delay
3D
> Tencent released tencent/Hunyuan3D-2.1, an image-to-3D model (see below)

reacted to merve's post with ❤️ 2 months ago
Releases of the past week are here: merve/releases-june-13-6852c3c1eaf1e0c24c958860
Here's our picks 🤓
So many interesting models released past week in open AI! 🤖
🖼️ Computer Vision/VLMs
> nanonets/Nanonets-OCR-s is the new state-of-the-art OCR model that can handle checkboxes, watermarks, tables (OS)
> Meta released facebook/v-jepa-2-6841bad8413014e185b497a6, new sota video embeddings with two new classification models (OS)
> ByteDance-Seed/SeedVR2-3B is a new 3B video restoration model (OS)
Audio
> Stepfun released stepfun-ai/Step-Audio-AQAA, new large (137B 🤯) audio language model that takes in audio and generates audio (OS)
🤖 Robotics
> nvidia released nvidia/GR00T-N1.5-3B, a new open foundation vision-language-action model
3D
> tencent/Hunyuan3D-2.1 is the new version of Hunyuan by Tencent that can generate 3D assets from text and image prompts

reacted to merve's post with 🔥 2 months ago
#CVPR2025 Paper Picks #1
VisionZip is a compression technique that reduces the number of visual tokens to improve performance AND prefill time for vision language models
demo: Senqiao/VisionZip
paper: VisionZip: Longer is Better but Not Necessary in Vision Language Models (2412.04467)
most of the image tokens are redundant for the LLM, so the authors ask "are all visual tokens necessary?"
the method is simple:
keep the tokens with the highest attention scores, merge the rest of the tokens based on similarity, then concatenate the two sets
their method works both training-free and with fine-tuning
the authors report a 5-point improvement on average across vision language tasks + an 8x improvement in prefilling time for Llava-Next 7B and 13B 🤯
removing redundant tokens improves image token quality too 🥹
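As a rough illustration of the idea (a toy sketch, not the authors' implementation; shapes, token counts, and the merge strategy are made up for the example):

import torch
import torch.nn.functional as F

def visionzip_compress(tokens, attn_scores, num_dominant=64, num_contextual=16):
    """tokens: (N, D) visual token features; attn_scores: (N,) attention each token receives."""
    N, _ = tokens.shape
    # 1) keep the tokens the model attends to most ("dominant" tokens)
    dominant_idx = attn_scores.topk(num_dominant).indices
    dominant = tokens[dominant_idx]
    # 2) merge the remaining tokens by similarity into a few "contextual" tokens
    mask = torch.ones(N, dtype=torch.bool)
    mask[dominant_idx] = False
    rest = tokens[mask]
    centers = rest[:num_contextual]  # naive choice of merge centers for this sketch
    sim = F.normalize(rest, dim=-1) @ F.normalize(centers, dim=-1).T
    assign = sim.argmax(dim=-1)
    contextual = torch.stack([
        rest[assign == k].mean(dim=0) if (assign == k).any() else centers[k]
        for k in range(num_contextual)
    ])
    # 3) the LLM now sees num_dominant + num_contextual visual tokens instead of N
    return torch.cat([dominant, contextual], dim=0)

compressed = visionzip_compress(torch.randn(576, 1024), torch.rand(576))
print(compressed.shape)  # torch.Size([80, 1024])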

reacted to fdaudens's post with 🤗 3 months ago
🎵 Dream come true for content creators! TIGER AI can extract voice, effects & music from ANY audio file 🤯
This lightweight model uses frequency band-split technology to separate speech like magic. Kudos to @fffiloni for the amazing demo! fffiloni/TIGER-audio-extraction

reacted to AdinaY's post with 🔥 3 months ago
HunyuanPortrait 🔥 video model by Tencent Hunyuan team.
HunyuanPortrait: Implicit Condition Control for Enhanced Portrait Animation (2503.18860)
tencent/HunyuanPortrait
✨Portrait animation from just one image + a video prompt
✨Diffusion-based, implicit motion control
✨Superior temporal consistency & detail

reacted to sayakpaul's post with 🤗 3 months ago
Diffusers supports a good variety of quantization backends. It can be challenging to navigate through them, given the complex nature of diffusion pipelines in general.
So, @derekl35 set out to write a comprehensive guide that puts users in the front seat. Explore the different backends we support, learn the trade-offs they offer, and finally, check out the cool space we built that lets you compare quantization results.
Give it a go here:
https://lnkd.in/gf8Pi4-2
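If you just want a taste before reading the guide, loading a 4-bit quantized transformer with the bitsandbytes backend looks roughly like this (a minimal sketch; Flux is used here only as an example model, and it is just one of the supported backends):

import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, BitsAndBytesConfig

# 4-bit NF4 quantization via the bitsandbytes backend
quant_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="transformer",
    quantization_config=quant_config, torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()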

reacted to sayakpaul's post with 🔥 3 months ago
Despite the emergence of combining LLM and DiT architectures for T2I synthesis, its design remains severely understudied.
This was done long ago and got into CVPR25 -- super excited to finally share it now, along with the data and code ♥️
We explore several architectural choices that affect this design. We provide an open & reproducible training recipe that works at scale.
Works like Playground v3 have already explored a deep fusion between an LLM and a DiT, sharing their representations through layerwise attention. They exhibit excellent performance on T2I.
Despite its compelling results and other performance virtues, it remains unexplored, which is what we want to improve in our work. Specifically, we take a pre-trained LLM (Gemma-2B) and trainable DiT, and set out to explore what makes a "good deep fusion" between the two for T2I.
We explore several key questions in the work, such as:
Q1: How should we do attention? We considered several alternatives. PixArt-Alpha-style attention (cross-attention) is very promising.
Q2: Should we incorporate additional text modulation?
Q3: Can we eliminate timestep conditioning?
Q4: How do we do positional encodings?
Q5: Do instruction-tuned LLMs help deep fusion?
Q6: Would using a decoder LLM from a multimodal model be helpful?
Q7: Does using a better variant of Gemma help?
Based on these findings, we arrive at FuseDiT, with the following components on top of the base architecture:
* No AdaLN-Zero modules
* 1D + 2D-RoPE
* Gemma 2 2B, adjusting DiT configurations accordingly
We trained FuseDiT on a mixture from CC12M, JourneyDB, & SA (~26M image-text pairs) for 800 steps. While not the best model, it's encouraging to develop something in a guided manner using open datasets.
To know more (code, models, all are available), please check out the paper:
https://lnkd.in/gg6qyqZX.

reacted to AdinaY's post with 🚀 3 months ago
ByteDance is absolutely cooking lately🔥
BAGEL 🥯 a 7B-active-parameter open multimodal foundation model by the ByteDance Seed team.
ByteDance-Seed/BAGEL-7B-MoT
✨ Apache 2.0
✨ Outperforms top VLMs (Qwen2.5-VL & InternVL-2.5)
✨ Mixture-of-Transformer-Experts + dual encoders
✨ Trained on trillions of interleaved tokens

reacted to loubnabnl's post with ❤️ 3 months ago
SmolVLM is now available on PocketPal — you can run it offline on your smartphone to interpret the world around you. 🌍📱
And check out this real-time camera demo by @ngxson , powered by llama.cpp:
https://github.com/ngxson/smolvlm-realtime-webcam
https://x.com/pocketpal_ai

reacted to AdinaY's post with 🚀 3 months ago
Matrix Game 🎮 an interactive foundation model for controllable game world generation, released by Skywork AI.
Skywork/Matrix-Game
✨ 17B, MIT licensed
✨ Diffusion-based image-to-world video generation via keyboard & mouse input
✨ GameWorld Score benchmark for Minecraft world models
✨ Massive Matrix Game Dataset with fine-grained action labels

reacted to merve's post with 🔥 3 months ago
VLMS 2025 UPDATE 🔥
We just shipped a blog on all the latest in vision language models, including
🤖 GUI agents, agentic VLMs, omni models
📑 multimodal RAG
⏯️ video LMs
🤏🏻 smol models
..and more! https://huggingface.co/blog/vlms-2025

reacted to AdinaY's post with 😎 4 months ago
ACE-Step 🎵 a music generation foundation model released by StepFun & ACEStudio
Model: ACE-Step/ACE-Step-v1-3.5B
Demo: ACE-Step/ACE-Step
✨ 3.5B, Apache2.0 licensed
✨ 115× faster than LLMs (4-min music in 20s on A100)
✨ Diffusion + DCAE + linear transformer = speed + coherence
✨ Supports voice cloning, remixing, lyric editing & more

reacted to RiverZ's post with 🤗 4 months ago
🔥 We're thrilled to share some exciting news about ICEdit! Currently, the ICEdit app (RiverZ/ICEdit) has soared to second place on the weekly trend list of Hugging Face Spaces, just trailing behind Qwen3. What's more, it also holds the second position on the overall Space trend list. This achievement wouldn't have been possible without your incredible support and love. A huge thank you to each and every one of you ❤!
🎉 The ICEdit community has been incredibly active, and we've seen a plethora of amazing ComfyUI workflows being shared. For instance, with the help of ComfyUI-nunchaku, you can run ICEdit locally with just 4GB of VRAM. This makes it much more accessible for those with limited hardware resources.
🎇 If you're interested in the detailed information, please head over to our repository. We highly encourage you to give these workflows a try and explore the creative possibilities that ICEdit offers.
Github Repo: https://github.com/River-Zhang/ICEdit
Hugging Face Space: RiverZ/ICEdit

reacted to nyuuzyou's post with 🔥 4 months ago
🖼️ PublicDomainFiles.com Collection - nyuuzyou/publicdomainfiles
Collection of 206,204 Public Domain multimedia files featuring:
- Comprehensive metadata: title, description, creator name, keywords, original page URL, and more.
- Contains various media types including images, clip art, artwork, fonts, videos, and TV shows.
- All content explicitly released into the public domain under the CC0 license.
- Organized in a single train split with 206,204 entries.
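A quick way to poke at the dataset (a minimal sketch using the datasets library; the exact metadata field names may differ from what the first record shows):

from datasets import load_dataset

# stream the single "train" split so the ~206k entries aren't downloaded up front
ds = load_dataset("nyuuzyou/publicdomainfiles", split="train", streaming=True)
print(next(iter(ds)))  # inspect the metadata fields of the first record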

posted an update 4 months ago
FramePack is hands down one of the best OS releases in video generation 🙇🏻‍♀️🤯
✅ fully open sourced + amazing quality + reduced memory + improved speed
but even more - it's gonna facilitate *soooo* many downstream applications
like this version adapted for landscape rotation 👇 https://huggingface.co/spaces/tori29umai/FramePack_rotate_landscape

reacted to RiverZ's post with 🔥 4 months ago
🚀 Excited to Share Our Latest Work: In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer~
🎨 Daily Paper: In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer (2504.20690)
🔓 Code is now open source!
🔥 Huggingface DEMO: RiverZ/ICEdit
🌐 Project Website: https://river-zhang.github.io/ICEdit-gh-pages/
🏠 GitHub Repository: https://github.com/River-Zhang/ICEdit/blob/main/scripts/gradio_demo.py
🤗 Huggingface: sanaka87/ICEdit-MoE-LoRA
📄 arXiv Paper: In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer (2504.20690)
🔥 Why it’s cool:
- Achieves high-quality, multi-task image editing.
- Uses only 1% of the training parameters and 0.1% of the training data compared to existing methods — extremely efficient
- Beats several commercial models on background preservation, ID control, and consistency
- Open-source, low-cost, faster, and stronger — think of it as the “DeepSeek of image editing” 👀
We also implemented a Gradio demo app, available directly in our GitHub repo! And we made a flashy demo video — happy to send it your way!

reacted to abidlabs's post with ❤️ 4 months ago
HOW TO ADD MCP SUPPORT TO ANY 🤗 SPACE
Gradio now supports MCP! If you want to convert an existing Space, like this one hexgrad/Kokoro-TTS, so that you can use it with Claude Desktop / Cursor / Cline / TinyAgents / or any LLM that supports MCP, here's all you need to do:
1. Duplicate the Space (in the Settings Tab)
2. Upgrade the Gradio sdk_version to 5.28 (in the README.md)
3. Set mcp_server=True in launch()
4. (Optionally) add docstrings to the function so that the LLM knows how to use it, like this:
def generate(text, speed=1):
    """
    Convert text to speech audio.

    Parameters:
        text (str): The input text to be converted to speech.
        speed (float, optional): Playback speed of the generated speech.
    """
    ...  # TTS inference goes here; return the generated audio
That's it! Now your LLM will be able to talk to you 🤯
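For context, wiring it all together might look roughly like this (a minimal sketch assuming Gradio >= 5.28; the Interface inputs/outputs are illustrative, not taken from the Kokoro-TTS Space):

import gradio as gr

# assuming the generate() function above; the input/output components are placeholders
demo = gr.Interface(fn=generate, inputs=["text", "number"], outputs="audio")

# exposing the app as an MCP server is a single flag
demo.launch(mcp_server=True)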