Local Installation Video and Testing - Step by Step

#1
by fahdmirzac - opened

Hi,
Kudos on producing such a sublime model. I did a local installation and testing video:

https://youtu.be/tMZSo21cIPs?si=SkGjJglyclwE7_jz

Thanks and regards,
Fahd

This may not be seen, but it would be great to see some Gemma 3 models tuned for text-embedding benchmarks (e.g. the MTEB Leaderboard). In most of my LLM work I use embedding models like the Qwen3-Embedding series, but there are currently very few high-quality alternatives.
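
For context, the workflow I have in mind is plain sentence embedding plus similarity search. Here is a minimal sketch with sentence-transformers, assuming the Qwen/Qwen3-Embedding-0.6B checkpoint from that series (the exact model id and the example texts are just illustrations, not something from this thread):

    # Sketch of a typical embedding + similarity workflow; model id is an assumption.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

    queries = ["What is C. elegans?"]
    documents = [
        "C. elegans is a small free-living nematode used as a model organism.",
        "Gemma 3 is a family of lightweight open models from Google.",
    ]

    # Encode both sides and rank documents by cosine similarity.
    query_emb = model.encode(queries)
    doc_emb = model.encode(documents)
    print(util.cos_sim(query_emb, doc_emb))

A Gemma 3 checkpoint exposed the same way would slot straight into this kind of retrieval code.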

Thanks for the release :)

Google org

Hi @fahdmirzac ,

Thanks for your interest and great suggestion! We're actively evaluating possible directions for fine-tuning, including for embedding use cases. Your input helps guide priorities — much appreciated!

For some odd reason, I am getting stuck here:

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id)

Anyone else having this issue?

@kaliaanup I'm not having this issue! Did you fix it? Here's the code I wrote to get it to work (I write integration tests in pytest): https://github.com/InServiceOfX/InServiceOfX/blob/master/PythonLibraries/HuggingFace/MoreTransformers/tests/integration_tests/Models/LLMs/test_google_gemma-3-1b-it.py

Basically,

    # Imports needed for this snippet (it runs inside a pytest test function).
    import torch
    from transformers import AutoTokenizer, Gemma3ForCausalLM

    # model_path is the local path to the downloaded gemma-3-1b-it checkpoint.
    tokenizer = AutoTokenizer.from_pretrained(
        model_path,
        local_files_only=True,
        trust_remote_code=True)

    model = Gemma3ForCausalLM.from_pretrained(
        model_path,
        device_map="cuda:0",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        )

    prompt = "What is C. elegans?"
    # Render the chat template to a string here (tokenize=False); tokenization
    # happens below so that generate() gets an explicit attention_mask.
    prompt_str = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        tokenize=False)

    encoded = tokenizer(prompt_str, return_tensors='pt', padding=True).to(
        model.device)

    output = model.generate(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        do_sample=True,
        # temperature, min_p, repetition_penalty suggested by
        # https://huggingface.co/LiquidAI/LFM2-1.2B
        temperature=0.9,
        min_p=0.15,
        repetition_penalty=1.05,
        max_new_tokens=65536
        )

    print(
        "With special tokens: ",
        tokenizer.decode(output[0], skip_special_tokens=False))
    print(
        "Without special tokens: ",
        tokenizer.decode(output[0], skip_special_tokens=True))
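
If you only want the model's reply without the echoed prompt, one small addition (a sketch, not part of the linked test) is to slice off the input tokens before decoding:

    # Keep only the newly generated tokens, dropping the echoed prompt.
    generated = output[0][encoded["input_ids"].shape[-1]:]
    print("Generated only: ", tokenizer.decode(generated, skip_special_tokens=True))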

I wrote my own wrapper class, but tested it on Google's gemma-3-270m-it (the wrapper class code is basically the same as above):

https://github.com/InServiceOfX/InServiceOfX/blob/master/PythonLibraries/HuggingFace/MoreTransformers/tests/integration_tests/Applications/test_ModelAndTokenizer.py#L190-L220
