Boolean Search Query Model

This model is fine-tuned to convert natural language queries into boolean search expressions, optimized for academic and research database searching.

Model Details

  • Base Model: Meta-Llama-3.1-8B
  • Training Type: LoRA fine-tuning
  • Task: Converting natural language to boolean search queries
  • Languages: English
  • License: Same as base model (Llama 3.1 Community License)

Intended Use

  • Converting natural language search requests into proper boolean expressions
  • Academic and research database searching
  • Information retrieval query formulation

Performance

Test Results

Base Model vs Fine-tuned Model comparison:

Natural Query: "Studies examining the relationship between exercise and mental health"
Base: exercise AND mental health
Fine-tuned: exercise AND "mental health"  # Properly handles multi-word terms

Natural Query: "Articles about artificial intelligence ethics and regulation or policy"
Base: "artificial intelligence ethics" AND ("regulation" OR "policy")  # Treats AI ethics as one concept
Fine-tuned: "artificial intelligence" AND (ethics OR regulation OR policy)  # Properly splits concepts

Key Improvements

  1. Meta-term Removal

    • Automatically removes terms like "articles", "papers", "research", "studies"
    • Focuses on actual search concepts
  2. Proper Term Quoting

    • Only quotes multi-word phrases
    • Single words remain unquoted
  3. Logical Grouping

    • Appropriate use of parentheses for OR groups
    • Clear operator precedence
  4. Minimal Formatting

    • No unnecessary parentheses
    • No duplicate terms
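
The formatting rules above are mechanical enough to check in code. The following is a minimal sketch of such a check, not part of the model or its training pipeline: `check_boolean_query`, `META_TERMS`, and `TOKEN_RE` are hypothetical names introduced here, and the meta-term set covers only the four examples named in the list.

```python
import re

# Assumption: only the four meta-terms named in the list above are checked.
META_TERMS = {"articles", "papers", "research", "studies"}
OPERATORS = {"AND", "OR", "NOT"}

# A token is a quoted phrase, a parenthesis, or a bare word.
TOKEN_RE = re.compile(r'"[^"]*"|[()]|[^\s()"]+')


def check_boolean_query(query):
    """Return a list of rule violations; an empty list means the query passes."""
    problems = []
    if query.count('"') % 2:
        problems.append("unbalanced quotes")

    depth = 0
    terms = []
    prev_was_bare_term = False
    for tok in TOKEN_RE.findall(query):
        is_bare_term = False
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
            if depth < 0:
                problems.append("unbalanced parentheses")
                depth = 0
        elif tok.startswith('"'):
            # Quoted phrase: must contain at least two words.
            phrase = tok[1:-1].strip()
            terms.append(phrase.lower())
            if len(phrase.split()) < 2:
                problems.append(f"single word quoted: {phrase}")
        elif tok not in OPERATORS:
            # Bare word: must not be a meta-term or directly follow another bare word.
            is_bare_term = True
            terms.append(tok.lower())
            if tok.lower() in META_TERMS:
                problems.append(f"meta-term present: {tok}")
            if prev_was_bare_term:
                problems.append(f"unquoted multi-word phrase near: {tok}")
        prev_was_bare_term = is_bare_term

    if depth > 0:
        problems.append("unbalanced parentheses")
    if len(set(terms)) != len(terms):
        problems.append("duplicate term")
    return problems
```

Applied to the comparison under Test Results, both fine-tuned outputs pass with an empty list, while the two base-model outputs are flagged for an unquoted multi-word phrase and for quoted single words respectively.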

Limitations

  • English language only
  • May not handle specialized domain terminology optimally
  • Limited to boolean operators (AND, OR, NOT)
  • Designed for academic/research context

Training Data

The model was trained on a curated dataset of natural language queries paired with their correct boolean translations. Dataset characteristics:

  • Size: 135 examples
  • Format: Natural query → Boolean expression pairs
  • Source: Manually curated academic search examples
  • Validation: Expert-reviewed for accuracy

Training Process

  • Method: LoRA fine-tuning
  • Hardware: NVIDIA GeForce RTX 4070 Ti SUPER

How to Use

from unsloth import FastLanguageModel

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    "Zwounds/boolean-search-model",
    max_seq_length=2048,
    dtype=None,  # Auto-detect
    load_in_4bit=True
)
FastLanguageModel.for_inference(model)

# Format query
query = "Find papers about climate change and renewable energy"
formatted = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Convert this natural language query into a boolean search query by following these rules:

1. FIRST: Remove all meta-terms from this list (they should NEVER appear in output):
   - articles, papers, research, studies
   - examining, investigating, analyzing
   - findings, documents, literature
   - publications, journals, reviews
   Example: "Research examining X" → just "X"

2. SECOND: Remove generic implied terms that don't add search value:
   - Remove words like "practices," "techniques," "methods," "approaches," "strategies"
   - Remove words like "impacts," "effects," "influences," "role," "applications"
   - For example: "sustainable agriculture practices" → "sustainable agriculture"
   - For example: "teaching methodologies" → "teaching"
   - For example: "leadership styles" → "leadership"

3. THEN: Format the remaining terms:
   CRITICAL QUOTING RULES:
   - Multi-word phrases MUST ALWAYS be in quotes - NO EXCEPTIONS
   - Examples of correct quoting:
     - Wrong: machine learning AND deep learning
     - Right: "machine learning" AND "deep learning"
     - Wrong: natural language processing
     - Right: "natural language processing"
   - Single words must NEVER have quotes (e.g., science, research, learning)
   - Use AND to connect required concepts
   - Use OR with parentheses for alternatives (e.g., ("soil health" OR biodiversity))

Example conversions showing proper quoting:
"Research on machine learning for natural language processing"
→ "machine learning" AND "natural language processing"

"Studies examining anxiety depression stress in workplace"
→ (anxiety OR depression OR stress) AND workplace

"Articles about deep learning impact on computer vision"
→ "deep learning" AND "computer vision"

"Research on sustainable agriculture practices and their impact on soil health or biodiversity"
→ "sustainable agriculture" AND ("soil health" OR biodiversity)

"Articles about effective teaching methods for second language acquisition"
→ teaching AND "second language acquisition"

### Input:
{query}

### Response:
"""

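# Generate boolean query
inputs = tokenizer(formatted, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
# outputs[0] contains the prompt followed by the completion; keep only the new tokens
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
result = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
print(result)  # e.g. "climate change" AND "renewable energy"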
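
Decoded text can include the prompt itself or extra lines after the answer. A minimal post-processing sketch follows, assuming the `### Response:` marker from the prompt template above and assuming the boolean query sits on the first non-empty line of the completion; `extract_boolean_query` is a hypothetical helper, not something shipped with the model.

```python
RESPONSE_MARKER = "### Response:"


def extract_boolean_query(decoded):
    """Return the first non-empty line after the first response marker.

    If the marker is absent, the whole string is treated as the completion.
    """
    _, marker, after = decoded.partition(RESPONSE_MARKER)
    completion = after if marker else decoded
    for line in completion.splitlines():
        if line.strip():
            return line.strip()
    return ""
```

For example, `extract_boolean_query(tokenizer.decode(outputs[0], skip_special_tokens=True))` would drop the echoed prompt and keep only the generated query line.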

Citation

If you use this model in your research, please cite:

@misc{boolean-search-llm,
  title={Boolean Search Query LLM},
  author={Stephen Zweibel},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/Zwounds/boolean-search-model}
}