Size matters.
I think we need the 10-12B models.
The 10B models are just about right to run on a 16 GB or shared GPU setup.
The 7B models are fine and the 8B models even better, but we are still left needing more.
For my 16 GB setup I found that a 10B was about the largest I could run purely on GPU with the full 128k context...
The 14B and 12B are very sluggish!
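For anyone who wants to sanity-check that 16 GB ceiling, this is the kind of back-of-envelope arithmetic involved. It is only a sketch: the example numbers (layer count, KV heads, quantization levels) are illustrative assumptions, not the spec of any real 10B model.

```python
# Rough VRAM estimate: quantized weights + KV cache (ignores activations and runtime overhead).
def vram_gb(params_b, weight_bpw, n_layers, n_kv_heads, head_dim, ctx_len, kv_cache_bytes):
    weights = params_b * 1e9 * weight_bpw / 8                                    # bytes for the weights
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * ctx_len * kv_cache_bytes   # K and V caches
    return (weights + kv_cache) / 1e9

# Hypothetical ~10B GQA model at roughly Q4, 128k context, 8-bit KV cache (all placeholder values).
print(vram_gb(10, 4.5, 40, 4, 128, 131072, 1))  # ~11 GB, which plausibly leaves headroom on a 16 GB card
```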
With the distilled models such as Qwen 0.6B, I found they were very good, but they failed on big tasks. So it turns out we need more layers.
The 7B Qwens were also nice, but they failed on the same tasks, so I do not think the 7B models were trained any more than the 0.6B, hence the shortfall between models.
I found that the larger-parameter models actually do perform better than some vanilla 7B models.
So in the end, for MoE or other complex models with larger context, the comfort level on a local 16 GB GPU is the 10B models!
Yes, I will say I have run the 14B offloaded, but it left the system slow!
So please give some consideration to why models are released at specific sizes, and how to benefit the little guys! A 10B is also basically trainable locally too (on low context, not full).
The final goal is to have a pretrained model which can be trained and converted to GGUF and hosted locally, so we can get the full circle.
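To make that "full circle" concrete, here is a rough sketch of such a pipeline using llama.cpp. The directory and file names are placeholders, and the exact script names and flags depend on your llama.cpp version, so treat it as illustrative rather than exact.

```python
# Sketch of fine-tune locally -> convert to GGUF -> quantize -> serve locally.
import subprocess

FINETUNED_DIR = "my-finetuned-10b"  # hypothetical output directory of a local LoRA/QLoRA run

# 1) Convert the Hugging Face checkpoint to GGUF (llama.cpp ships a converter script).
subprocess.run(["python", "convert_hf_to_gguf.py", FINETUNED_DIR,
                "--outfile", "my-10b-f16.gguf", "--outtype", "f16"], check=True)

# 2) Quantize so the weights fit comfortably in 16 GB (Q4_K_M is a common middle ground).
subprocess.run(["./llama-quantize", "my-10b-f16.gguf", "my-10b-q4_k_m.gguf", "Q4_K_M"], check=True)

# 3) Host it locally with the llama.cpp server at full context.
subprocess.run(["./llama-server", "-m", "my-10b-q4_k_m.gguf", "-c", "131072"], check=True)
```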
For image models I think 8B is the local max!
For audio models, similar. Perhaps do some testing on some Dell / HP / Asus laptops to see their limitations.
We are seeing high performance with distilled models.
Perhaps even consider a sliced model instead of distillation, i.e. merging to a new model with fewer layers, hence breaking off the front stack.
When training the model, a high concentration on the front stack also enables the large-layer model to perform well in the later layers, as the heavy lifting was already done in the early layers.
There are methods of training and expanding a model layer by layer! So the first 32 layers should be highly focused and the last 32 layers could be left unsupervised, giving the possibility to slice off a section of layers from the front?
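As a very rough illustration of what keeping only the front stack could look like, assuming a Llama/Qwen-style architecture that exposes its decoder blocks as `model.model.layers`; the model name and cut point are placeholders, and the sliced model would still need healing fine-tuning afterwards.

```python
# Minimal sketch: keep the first 32 decoder layers and drop the rest.
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("some-64-layer-model")  # hypothetical checkpoint
keep = 32                                                            # keep the front stack only
model.model.layers = nn.ModuleList(model.model.layers[:keep])
model.config.num_hidden_layers = keep
model.save_pretrained("sliced-32-layer-model")  # then fine-tune to recover quality
```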
I hope all this makes sense 🙏 🤞🙏
Today it's very hard to explain these complexities, experiments and comparisons to people who are just using pretrained models without thought.
I highly appreciate the sharing of models, as it stops the hungry, commercially oriented Americans from being able to sell us model usage token by token at max rates! Very bad 😞.
I also believe that DeepSeek may even be the largest model available today, and ChatGPT may have fewer parameters!! Hence their poor sharing of poor models. They are not fully using the network but using something like Rasa (a chatbot framework for services which uses slot-filling etc.) combined with an agentic workflow! Hence it does perform better, but it's not the model, it's the front end: intent detection, cached queries and RAG! So most of the time your guardrailed response is not new, it's recycled!...
I know about distillations of R1 onto smaller models in the size range you mentioned. But I'm unaware of V3 distillations.
I'm working on the "JPEG" of LLMs. With that, models can be reduced in size almost arbitrarily. Every GB saved will reduce quality, as with JPEG, but it will increase flexibility a lot.
V3.1 671B reduced to 15GB will be like a poor JPEG image of a high-res source and not very usable.
My internal tests show it works quite well for 2-4x size reductions on top of Unsloth models. The sweet spot to me is 0.5 - 2 bpw.
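For anyone following the bpw numbers, the arithmetic for the weights alone (ignoring embeddings and metadata overhead) is simple; this is just illustrative arithmetic, not a claim about any specific release.

```python
# Size of the weights in GB for a given parameter count (in billions) and bits per weight.
def size_gb(params_b, bpw):
    return params_b * 1e9 * bpw / 8 / 1e9

print(size_gb(671, 2.0))   # ~168 GB at 2 bpw
print(size_gb(671, 0.5))   # ~42 GB at 0.5 bpw
print(15 * 8 / 671)        # ~0.18 bpw is what a 15 GB file of a 671B model would imply
```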