Lumees-177M: A Compact Language Model with Rotary Position Embeddings

Model Details

Model Name: Lumees-177M Base
Model Type: Autoregressive Language Model
Architecture: Transformer with Rotary Position Embeddings (RoPE)
Parameters: ~177 million
Authors: Hasan KURŞUN, Kerem Berkay YANIK
Organization: Lumees
Year: 2025
License: Apache 2.0

Model Description

Lumees-177M is a compact autoregressive language model built on a transformer architecture enhanced with Rotary Position Embeddings (RoPE). It demonstrates that careful architectural choices and training techniques can deliver strong performance within a small parameter budget, making the model suitable for resource-constrained environments while maintaining competitive text generation capabilities.

Architecture Highlights

  • Rotary Position Embeddings (RoPE): Enables better handling of positional information and improved generalization to longer sequences (a minimal sketch follows this list)
  • Optimized Attention Mechanism: Utilizes Flash Attention when available for efficient computation
  • Layer Normalization: Pre-normalization design for training stability
  • Separate Q/K/V Projections: Dedicated projection layers for improved attention quality
  • GELU Activation: Modern activation function in feed-forward networks
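
To make the RoPE bullet above concrete, the sketch below applies rotary embeddings to a query or key tensor. It is a minimal illustration, not the model's actual implementation; the base frequency of 10000 and the split-half rotation layout are assumptions.

import torch

def apply_rope(x, base=10000.0):
    # x: (batch, heads, seq_len, head_dim); head_dim must be even
    # (for Lumees-177M, head_dim = 640 / 8 = 80)
    batch, heads, seq_len, head_dim = x.shape
    half = head_dim // 2
    # One inverse frequency per channel pair, spanning several octaves
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)  # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by a position-dependent angle; the dot product of
    # rotated q and k then depends only on the relative distance between positions
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

In the attention layer, apply_rope would be called on the query and key tensors right before the score computation; value tensors are left unrotated.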

Key Features

  • Efficient Design: Optimized for inference speed while maintaining quality
  • Robust Training: Incorporates gradient clipping, proper weight initialization, and dropout for stable training
  • Modern Tokenization: Compatible with tiktoken encoding (cl100k_base)
  • Causal Masking: Designed for autoregressive text generation tasks
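
The causal masking bullet above amounts to forbidding attention to future positions. A minimal sketch (illustrative, not the model's internal code):

import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    seq_len, head_dim = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5            # (batch, heads, seq, seq)
    # Mask out the upper triangle so position t only attends to positions <= t
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

On PyTorch 2.x, torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True) implements the same semantics and dispatches to Flash Attention kernels when they are available, which is presumably how the "Flash Attention when available" behavior is realized.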

Intended Use

Primary Use Cases

  • Text Generation: Creative writing, story completion, and general text synthesis
  • Language Modeling: Research into efficient transformer architectures
  • Educational Purposes: Understanding modern language model design and training
  • Prototyping: Base model for fine-tuning on specific domains or tasks
  • Resource-Constrained Deployment: Applications where model size and inference speed matter

Intended Users

  • Researchers studying efficient language model architectures
  • Developers building text generation applications with size constraints
  • Educational institutions teaching modern NLP techniques
  • Organizations requiring local deployment of language models

How to Use

Loading the Model

from transformers import AutoModelForCausalLM
import tiktoken

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    "lumees/lumees-177m-base", 
    trust_remote_code=True
)

# Load tokenizer
tokenizer = tiktoken.get_encoding("cl100k_base")

Text Generation

import torch

# Prepare input
prompt = "Once upon a time"
tokens = tokenizer.encode(prompt)
input_ids = torch.tensor([tokens])
attention_mask = torch.ones_like(input_ids)

# Generate text
outputs = model.generate(
    input_ids, 
    attention_mask=attention_mask,
    max_length=100, 
    temperature=0.8, 
    do_sample=True,
    pad_token_id=tokenizer.eot_token
)

# Decode output
generated_text = tokenizer.decode(outputs[0].tolist())
print(generated_text)

Training Data

The model was trained on a diverse collection of text data processed into Arrow format shards for efficient loading. The training corpus includes various domains to ensure broad language understanding and generation capabilities.

Data Processing

  • Tokenization: tiktoken cl100k_base encoding
  • Sequence Length: 1024 tokens maximum
  • Format: Arrow format for optimized data loading
  • Preprocessing: Standard text cleaning and formatting
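
A rough sketch of the pipeline described above: tokenize raw text with cl100k_base, pack the token stream into 1024-token sequences, and write a shard in Arrow format via the datasets library. The packing strategy and file names are illustrative assumptions, not the exact preprocessing code.

import tiktoken
from datasets import Dataset

enc = tiktoken.get_encoding("cl100k_base")
MAX_LEN = 1024

def pack(texts):
    # Concatenate tokenized documents, separated by the end-of-text token,
    # then cut the stream into fixed-length training sequences
    stream = []
    for text in texts:
        stream.extend(enc.encode(text) + [enc.eot_token])
    return [stream[i:i + MAX_LEN] for i in range(0, len(stream) - MAX_LEN + 1, MAX_LEN)]

texts = ["First document ...", "Second document ..."]     # placeholder corpus
shard = Dataset.from_dict({"input_ids": pack(texts)})
shard.save_to_disk("shards/shard_00000")                  # stored as Arrow files on disk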

Training Procedure

Training Configuration

  • Architecture: 10-layer transformer with 8 attention heads
  • Hidden Dimension: 640
  • Vocabulary Size: ~100,000 tokens (tiktoken cl100k_base)
  • Maximum Sequence Length: 1024 tokens
  • Dropout: 0.1 for regularization
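
For convenience, the configuration above can be summarized in one place. The class and field names below are illustrative, not the repository's actual config keys, and the exact vocabulary count is taken from cl100k_base (100,277 tokens).

from dataclasses import dataclass

@dataclass
class LumeesConfig:            # hypothetical name, for illustration only
    n_layer: int = 10          # transformer blocks
    n_head: int = 8            # attention heads (head_dim = 640 // 8 = 80)
    n_embd: int = 640          # hidden / embedding dimension
    vocab_size: int = 100277   # ~100k, tiktoken cl100k_base
    block_size: int = 1024     # maximum sequence length
    dropout: float = 0.1       # regularization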

Optimization

  • Optimizer: AdamW with weight decay and fused operations
  • Learning Rate Schedule: Linear warmup followed by linear decay
  • Gradient Clipping: Applied for training stability
  • Mixed Precision: FP16 training with TF32 matrix operations for efficiency
  • Performance Optimizations: TF32 and cuDNN benchmarking enabled
  • Distributed Training: Support for multi-GPU setups
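
A minimal sketch of this optimization setup, assuming a Hugging Face-style model whose forward pass returns a .loss, with `model` and `loader` defined as in the earlier sections. The learning rate, weight decay, clipping norm, and schedule lengths are placeholders, not the actual training hyperparameters.

import torch
from torch.optim.lr_scheduler import LambdaLR

# fused AdamW requires parameters on a CUDA device (model.cuda() assumed)
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=0.1, fused=True)
warmup_steps, total_steps = 2_000, 100_000                # placeholder schedule lengths

def lr_lambda(step):
    # Linear warmup to the peak learning rate, then linear decay to zero
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)
scaler = torch.cuda.amp.GradScaler()                      # FP16 mixed precision

for batch in loader:                                      # batches assumed to be on the GPU
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(batch["input_ids"], labels=batch["input_ids"]).loss
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient clipping
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()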

Training Features

  • Flash Attention: Utilized when available for memory efficiency
  • TF32 Optimization: Enabled TensorFloat-32 for faster training on modern GPUs
  • cuDNN Optimizations: Benchmarking enabled for optimal performance
  • Gradient Accumulation: For effective large batch training
  • Checkpointing: Regular model saves during training
  • Monitoring: Comprehensive metrics tracking with Weights & Biases
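
Most of the features above map onto a few standard PyTorch and Weights & Biases calls. The accumulation factor, checkpoint interval, and project name below are illustrative, and `model`, `optimizer`, and `loader` are assumed from the sketch above.

import torch
import wandb

# TensorFloat-32 matmuls and cuDNN autotuning (Ampere-class GPUs or newer)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
torch.backends.cudnn.benchmark = True

wandb.init(project="lumees-177m")                        # placeholder project name
accum_steps = 8                                          # effective batch = micro-batch x 8

for step, batch in enumerate(loader):
    loss = model(batch["input_ids"], labels=batch["input_ids"]).loss
    (loss / accum_steps).backward()                      # accumulate scaled gradients
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        wandb.log({"train/loss": loss.item()}, step=step + 1)
    if (step + 1) % 10_000 == 0:
        torch.save(model.state_dict(), f"ckpt_{step + 1}.pt")   # periodic checkpoint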

Evaluation

The model has been evaluated across multiple dimensions including perplexity on held-out text, next-word prediction accuracy, text generation quality, and basic reading comprehension tasks. Results are competitive for a model of this size, with particular strengths in coherent text generation and modeling of local (short-range) dependencies.

Evaluation Domains

  • Perplexity Assessment: Standard language modeling evaluation (see the sketch after this list)
  • Next-Word Prediction: Token-level prediction accuracy
  • Text Generation Quality: Coherence, creativity, and relevance
  • Basic Comprehension: Simple reasoning and completion tasks
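
As a concrete example for the first item, held-out perplexity is the exponential of the mean token-level cross-entropy. The sketch below assumes the model returns logits of shape (batch, seq_len, vocab_size):

import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, input_ids):
    # input_ids: (batch, seq_len) token ids from held-out text
    logits = model(input_ids).logits                       # (batch, seq_len, vocab)
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_labels = input_ids[:, 1:].reshape(-1)            # predict token t+1 from tokens <= t
    nll = F.cross_entropy(shift_logits, shift_labels)      # mean negative log-likelihood
    return math.exp(nll.item())

Lower is better; a model guessing uniformly over the ~100k-token vocabulary would score a perplexity roughly equal to the vocabulary size.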

Limitations and Considerations

Model Limitations

  • Knowledge Cutoff: Limited to training data knowledge
  • Scale Constraints: 177M parameters may limit complex reasoning capabilities
  • Domain Specificity: Performance may vary across specialized domains
  • Context Length: Maximum sequence length of 1024 tokens
  • Factual Accuracy: May generate plausible but incorrect information

Computational Requirements

  • Memory: Approximately 350 MB in FP16 (177M parameters × 2 bytes per parameter ≈ 354 MB)
  • Inference Speed: Optimized for CPU and GPU inference
  • Hardware: Can run on modest computational resources

Ethical Considerations

  • Bias: May reflect biases present in training data
  • Misuse Prevention: Should not be used for harmful content generation
  • Transparency: Model outputs should be clearly attributed as AI-generated
  • Responsibility: Users should validate outputs for factual accuracy

Environmental Impact

As a compact 177M parameter model, Lumees-177M has a relatively low environmental footprint compared to larger language models. The efficient architecture and training procedures minimize computational requirements while maintaining performance, contributing to more sustainable AI deployment.

Technical Specifications

Model Architecture

  • Embedding Dimension: 640
  • Number of Layers: 10
  • Attention Heads: 8
  • Feed-Forward Dimension: 2560 (4x hidden size)
  • Positional Encoding: Rotary Position Embeddings (RoPE)
  • Activation Function: GELU
  • Normalization: Layer Normalization (pre-norm)
  • Dropout Rate: 0.1
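
A back-of-the-envelope parameter count from these numbers, assuming untied input/output embeddings and the cl100k_base vocabulary size of 100,277 (both assumptions), lands close to the reported ~177M:

vocab, d, layers, ff = 100_277, 640, 10, 2560

token_embedding = vocab * d                  # ~64.2M
lm_head = vocab * d                          # ~64.2M, assuming untied output weights
attention = 4 * d * d                        # Q, K, V, and output projections per layer
feed_forward = 2 * d * ff                    # up- and down-projections per layer
per_layer = attention + feed_forward         # ~4.9M (biases and LayerNorms omitted)

total = token_embedding + lm_head + layers * per_layer
print(f"{total / 1e6:.1f}M parameters")      # ≈ 177.5M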

Compatibility

  • Framework: PyTorch, Transformers
  • Python: 3.8+
  • Hardware: CPU, CUDA-compatible GPUs
  • Operating Systems: Linux, macOS, Windows

Citation

If you use Lumees-177M in your research or applications, please cite:

@misc{lumees177m2025,
  title={Lumees-177M: A Compact Language Model with Rotary Position Embeddings},
  author={Hasan KURŞUN and Kerem Berkay YANIK},
  organization={Lumees},
  year={2025},
  license={Apache-2.0}
}

Acknowledgments

We thank the open-source community for the foundational tools and techniques that made this work possible, including the Transformers library, PyTorch framework, and the research community's contributions to efficient transformer architectures.

Contact

For questions, issues, or collaboration opportunities regarding Lumees-177M, please contact the Lumees team or refer to the model repository for updates and documentation.


This model card follows best practices for responsible AI documentation and transparency in model development and deployment.
