Instructions for using embedme/lightonai-lateon-code-edge-f16 with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- llama-cpp-python
How to use embedme/lightonai-lateon-code-edge-f16 with llama-cpp-python:
```python
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="embedme/lightonai-lateon-code-edge-f16",
    filename="lightonai-lateon-code-edge-f16.gguf",
)

output = llm(
    "Once upon a time,",
    max_tokens=512,
    echo=True
)
print(output)
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use embedme/lightonai-lateon-code-edge-f16 with llama.cpp:
Install from brew
```sh
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf embedme/lightonai-lateon-code-edge-f16:F16

# Run inference directly in the terminal:
llama-cli -hf embedme/lightonai-lateon-code-edge-f16:F16
```
Install from WinGet (Windows)
```sh
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf embedme/lightonai-lateon-code-edge-f16:F16

# Run inference directly in the terminal:
llama-cli -hf embedme/lightonai-lateon-code-edge-f16:F16
```
Use pre-built binary
```sh
# Download a pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf embedme/lightonai-lateon-code-edge-f16:F16

# Run inference directly in the terminal:
./llama-cli -hf embedme/lightonai-lateon-code-edge-f16:F16
```
Build from source code
```sh
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf embedme/lightonai-lateon-code-edge-f16:F16

# Run inference directly in the terminal:
./build/bin/llama-cli -hf embedme/lightonai-lateon-code-edge-f16:F16
```
Use Docker
```sh
docker model run hf.co/embedme/lightonai-lateon-code-edge-f16:F16
```
- LM Studio
- Jan
- Ollama
How to use embedme/lightonai-lateon-code-edge-f16 with Ollama:
```sh
ollama run hf.co/embedme/lightonai-lateon-code-edge-f16:F16
```
- Unsloth Studio
How to use embedme/lightonai-lateon-code-edge-f16 with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```sh
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for embedme/lightonai-lateon-code-edge-f16 to start chatting
```
Install Unsloth Studio (Windows)
```powershell
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for embedme/lightonai-lateon-code-edge-f16 to start chatting
```
Using HuggingFace Spaces for Unsloth
```sh
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for embedme/lightonai-lateon-code-edge-f16 to start chatting
```
- Docker Model Runner
How to use embedme/lightonai-lateon-code-edge-f16 with Docker Model Runner:
```sh
docker model run hf.co/embedme/lightonai-lateon-code-edge-f16:F16
```
- Lemonade
How to use embedme/lightonai-lateon-code-edge-f16 with Lemonade:
Pull the model
```sh
# Download Lemonade from https://lemonade-server.ai/
lemonade pull embedme/lightonai-lateon-code-edge-f16:F16
```
Run and chat with the model
```sh
lemonade run user.lightonai-lateon-code-edge-f16-F16
```
List all available models
```sh
lemonade list
```
LateOn-Code-edge (GGUF f16 + Projection)
GGUF conversion of lightonai/LateOn-Code-edge for use with litembeddings.
Model Details
| Property | Value |
|---|---|
| Base model | lightonai/LateOn-Code-edge |
| Architecture | ModernBERT (17M params) |
| Output dimensions | 48 (after projection) |
| Context length | 8,192 tokens |
| Quantization | f16 |
| GGUF size | 34 MB |
| Projection | 256 → 48 (composed from two PyLate Dense layers: 256→512→48) |
| Use case | Fast, CPU-friendly code search with late interaction (ColBERT-style) |
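The late-interaction (MaxSim) scoring mentioned in the table can be sketched in a few lines of NumPy. This is an illustration of the ColBERT scoring rule only, not litembeddings' implementation, and it assumes the token embeddings are already L2-normalized so that dot products act as cosine similarities:

```python
import numpy as np

def maxsim(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style MaxSim: for each query token embedding, take the
    maximum similarity over all document token embeddings, then sum."""
    # (num_query_tokens, num_doc_tokens) similarity matrix via dot products
    sims = query_tokens @ doc_tokens.T
    return float(sims.max(axis=1).sum())

# Toy example with 48-dim vectors, matching the model's output dimension
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 48))   # 4 query tokens
d = rng.normal(size=(10, 48))  # 10 document tokens
print(maxsim(q, d))
```

Because each query token matches its own best document token, MaxSim rewards documents that cover all parts of the query, which is what makes it effective for code search.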
Variants
| Variant | Size | Quality |
|---|---|---|
| f32 | 66 MB | Original precision (lossless) |
| f16 (this repo) | 34 MB | Lossless — 100% top-1 agreement, 240/300 weighted |
| Q8_0 | 19 MB | 79% weighted score, 96-100% top-1 agreement, 3.5× smaller |
Files
| File | Size | Description |
|---|---|---|
| `lightonai-lateon-code-edge-f16.gguf` | 34 MB | ModernBERT encoder in GGUF f16 format |
| `lightonai-lateon-code-edge-f16.projection` | 49 KB | Composed projection matrix (48×256, float32) |
Usage with litembeddings
```sql
.load ./build/litembeddings

-- Load model with projection
SELECT lembed_model('lightonai-lateon-code-edge-f16.gguf',
  '{"colbert_projection": "lightonai-lateon-code-edge-f16.projection"}');

-- Generate token embeddings for code
SELECT lembed_tokens('async fn get_connection(pool: &Pool) -> Result<Connection>');

-- Code search with MaxSim
SELECT
  id, code,
  lembed_maxsim(lembed_tokens('database connection pool'), token_emb) AS score
FROM code_embeddings
ORDER BY score DESC
LIMIT 10;
```
Quantization Quality Benchmark
Tested across 3 codebases (jq/C, Rails/Ruby, FastAPI/Python) with 150 questions total (15 easy + 20 medium + 15 hard per codebase). Weighted scoring: easy×1, medium×2, hard×3 = 100 points per codebase, 300 total.
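The weighted-scoring arithmetic works out as follows, using the per-codebase question counts stated above:

```python
# Per-codebase question counts and weights from the benchmark description
counts  = {"easy": 15, "medium": 20, "hard": 15}
weights = {"easy": 1, "medium": 2, "hard": 3}

max_per_codebase = sum(counts[k] * weights[k] for k in counts)
print(max_per_codebase)      # 100 points per codebase (15 + 40 + 45)
print(max_per_codebase * 3)  # 300 points total across the three corpora
```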
Aggregate Weighted Scores
| Variant | Weighted Score | Percentage |
|---|---|---|
| f32 | 240 / 300 | 80.0% |
| f16 | 240 / 300 | 80.0% |
| Q8_0 | 237 / 300 | 79.0% |
Per-Corpus Scores
| Corpus | f32 | f16 | Q8_0 |
|---|---|---|---|
| jq (C) | 66/100 | 66/100 | 63/100 |
| Rails (Ruby) | 79/100 | 79/100 | 79/100 |
| FastAPI (Python) | 95/100 | 95/100 | 95/100 |
Quantization Quality (Top-1 Agreement vs f32)
| Corpus | f16 | Q8_0 |
|---|---|---|
| jq | 100.0% | 96.0% |
| Rails | 100.0% | 100.0% |
| FastAPI | 100.0% | 98.0% |
Key Findings
- f16 is lossless — identical weighted score (240/300) and 100% top-1 agreement across all codebases
- Q8_0 loses only 1% — 237/300 vs 240/300, drops only on hard queries in jq corpus
- Q8_0 is fastest — 2.5s avg query vs 3.4s f32 vs 13.4s f16 (CPU without FP16 hardware)
- Easy/medium questions show zero quality difference between all variants
Conversion
Converted using litembeddings' ColBERT converter with PyLate projection support:
```sh
python scripts/convert_colbert_to_gguf.py lightonai/LateOn-Code-edge ./models \
  --name lightonai-lateon-code-edge-f16 --quantize f16
```
The converter automatically detects the PyLate two-layer projection structure (1_Dense + 2_Dense) and composes them into a single projection matrix via W_composed = W2 @ W1.
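The composition step can be sketched with NumPy. The shapes follow the 256→512→48 description above; the random matrices are stand-ins for the real PyLate weights, and the sketch assumes bias-free Dense layers, which is what collapsing two layers into a single matrix requires:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the two PyLate layers: 1_Dense maps 256 -> 512, 2_Dense maps 512 -> 48
W1 = rng.normal(size=(512, 256))
W2 = rng.normal(size=(48, 512))

# Two bias-free linear maps compose into one matrix: W2 @ (W1 @ x) == (W2 @ W1) @ x
W_composed = W2 @ W1
print(W_composed.shape)  # (48, 256): the shape stored in the .projection file (as float32)

# Sanity check: applying the composed matrix matches applying the layers in sequence
x = rng.normal(size=(256,))
assert np.allclose(W2 @ (W1 @ x), W_composed @ x)
```

Precomputing `W_composed` means the projection costs one matrix multiply per token at query time instead of two.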