# MasterControlAIML R1-Qwen2.5-1.5b SFT R1 JSON Unstructured-To-Structured LoRA Model

[Unsloth](https://github.com/unslothai/unsloth)

This repository provides a fine-tuned Qwen2.5 model optimized for transforming unstructured text into structured JSON output that conforms to a predefined schema. The model is fine-tuned from the base model **MasterControlAIML/DeepSeek-R1-Strategy-Qwen-2.5-1.5b-Unstructured-To-Structured** and uses LoRA for efficient adaptation.
> **Key Highlights:**
>
> - **Developed by:** [bhaviktheslider](https://github.com/bhaviktheslider)
> - **License:** [Apache-2.0](LICENSE)
> - **Fine-tuned from:** `MasterControlAIML/DeepSeek-R1-Strategy-Qwen-2.5-1.5b-Unstructured-To-Structured`
> - **Accelerated training:** 2x faster training using [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
---

## Table of Contents

- [Overview](#overview)
- [Features](#features)
- [Installation](#installation)
- [Quick Start](#quick-start)
  - [Using Unsloth for Fast Inference](#using-unsloth-for-fast-inference)
  - [Using Transformers for Inference](#using-transformers-for-inference)
  - [Advanced Example with LangChain Prompt](#advanced-example-with-langchain-prompt)
- [Contributing](#contributing)
- [License](#license)
- [Acknowledgments](#acknowledgments)

---

## Overview
This model is tailored for tasks that require mapping unstructured text (e.g., manuals, QA documents) into a structured JSON format. It supports hierarchical data extraction driven by a given JSON Schema, ensuring that the generated output follows the exact structure and rules the schema defines.
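For instance, given a short fragment of a manual, the model emits a JSON object whose keys follow the supplied schema. The sketch below is purely illustrative (the field names come from the schema shown in the advanced example later in this README; the values are hypothetical, not actual model output):

```python
# Hypothetical illustration of the mapping task (not actual model output):
unstructured_text = """Quality Assurance Manual
1. Introduction: This manual defines the scope of the QA process."""

# Expected shape of the structured result, following the schema fields
# (id, title, level, level_type, component, children) used in this README:
expected_output = {
    "id": "1",
    "title": "Quality Assurance Manual",
    "level": 0,
    "level_type": "ROOT",
    "component": [
        {
            "idc": 1,
            "component_type": "PARAGRAPH",
            "metadata": "Introduction",
            "properties": {"text": "This manual defines the scope of the QA process."},
        }
    ],
    "children": [],
}
```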
---

## Features

- **Efficient inference:** Uses the [Unsloth](https://github.com/unslothai/unsloth) library for fast model loading and generation.
- **Structured output:** Maps text inputs into a strict JSON schema with hierarchical relationships.
- **Flexible integration:** Example snippets cover both the Unsloth API and Hugging Face's Transformers.
- **Advanced prompting:** Includes an example that uses LangChain prompt templates for detailed, instruction-driven output.

---

## Installation

### Prerequisites

- **Python:** 3.8+
- **PyTorch:** preferably with CUDA support
- **Required libraries:** `transformers`, `torch`, `unsloth`, `langchain` (for the advanced example)

### Installation Command

Install the required Python packages with:

```bash
pip install torch transformers unsloth langchain
```
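The inference examples below move tensors to `"cuda"`, so it is worth confirming up front that your PyTorch build can see a GPU. A quick sanity check:

```python
import torch

# Should print True on a CUDA-enabled install; if it prints False,
# adapt the examples below to run on CPU (expect much slower generation).
print(torch.cuda.is_available())
print(torch.__version__)
```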
---

## Quick Start

### Using Unsloth for Fast Inference

The Unsloth library lets you quickly load the model and run inference. Below is a basic example:

```python
from unsloth import FastLanguageModel
import torch

# Specify the model name
MODEL = "MasterControlAIML/R1-Qwen2.5-1.5b-SFT-R1-JSON-Unstructured-To-Structured-lora"

# Load the model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL,
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=False,
)

# Prepare the model for inference
FastLanguageModel.for_inference(model)

# Define a prompt template
ALPACA_PROMPT = """
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}
"""

# Example: create an input and generate output
instruction = "Provide a summary of the Quality Assurance Manual."
prompt = ALPACA_PROMPT.format(instruction, "")
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=2000)

# Decode and print the generated text
print(tokenizer.batch_decode(output, skip_special_tokens=True)[0])
```
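The decoded string contains the full prompt followed by the completion. If you only need the model's answer, one simple way to isolate it (continuing from the snippet above, and assuming the Alpaca-style template is used verbatim) is to split on the response marker:

```python
full_text = tokenizer.batch_decode(output, skip_special_tokens=True)[0]

# Everything after the "### Response:" marker is the model's completion.
response = full_text.split("### Response:")[-1].strip()
print(response)
```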
---

### Using Transformers for Inference

If you prefer to use Hugging Face's Transformers directly, here's an alternative example:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch

MODEL = "MasterControlAIML/R1-Qwen2.5-1.5b-SFT-R1-JSON-Unstructured-To-Structured-lora"

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")

ALPACA_PROMPT = """
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}
"""

# Define your text input
TEXT = "Provide a detailed explanation of the QA processes in manufacturing."
prompt = ALPACA_PROMPT.format(TEXT, "")
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

text_streamer = TextStreamer(tokenizer)

# Generate output with specific generation parameters
with torch.no_grad():
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=2000,
        do_sample=True,  # required for temperature/top_p to take effect
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        streamer=text_streamer,
        pad_token_id=tokenizer.pad_token_id,
    )

# Print the decoded output
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
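The snippet above assumes a CUDA device. On a CPU-only machine, a minimal adjustment (a sketch, not needed if you have a GPU) is to pick the device dynamically before moving the inputs:

```python
import torch

# Fall back to CPU when no GPU is present; loading with
# device_map="auto" already handles placement on the model side.
device = "cuda" if torch.cuda.is_available() else "cpu"
inputs = tokenizer([prompt], return_tensors="pt").to(device)
```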
---

### Advanced Example with LangChain Prompt

For advanced users, the repository includes an example that integrates with LangChain to map hierarchical text data into a JSON schema. It uses a prompt template that instructs the model to produce both the filled JSON object (`<answer>`) and the reasoning behind the mapping decisions (`<think>`).

````python
from langchain_core.prompts import PromptTemplate

SYSTEM_PROMPT = """
### Role:
You are an expert data extractor specializing in mapping hierarchical text data into a given JSON Schema.

### DATA INPUT:
- **Text:** ```{TEXT}```
- **Blank JSON Schema:** ```{SCHEMA}```

### TASK REQUIREMENT:
1. Analyze the given text and map all relevant information strictly into the provided JSON Schema.
2. Provide your output in **two mandatory sections**:
   - **`<answer>`:** The filled JSON object
   - **`<think>`:** Reasoning for the mapping decisions

### OUTPUT STRUCTURE:
```
<think> /* Explanation of mapping logic */ </think>
<answer> /* Completed JSON Object */ </answer>
```

### STRICT RULES FOR GENERATING OUTPUT:
1. **Both Tags Required:**
   - Always provide both the `<think>` and `<answer>` sections.
   - If reasoning is minimal, state: "Direct mapping from text to schema."
2. **JSON Schema Mapping:**
   - Strictly map the text data to the given JSON Schema without modification or omissions.
3. **Hierarchy Preservation:**
   - Maintain proper parent-child relationships and follow the schema's hierarchical structure.
4. **Correct Mapping of Attributes:**
   - Map key attributes, including `id`, `idc`, `idx`, `level_type`, and `component_type`.
5. **JSON Format Compliance:**
   - Escape quotes, replace newlines with `\\n`, avoid trailing commas, and use double quotes exclusively.
6. **Step-by-Step Reasoning:**
   - Explain your reasoning within the `<think>` tag.

### IMPORTANT:
If either the `<think>` or `<answer>` tag is missing, the response will be considered incomplete.
"""

# Create a prompt template with LangChain
system_prompt_template = PromptTemplate(template=SYSTEM_PROMPT, input_variables=["TEXT", "SCHEMA"])

# Format the prompt with your text and JSON schema
system_prompt_str = system_prompt_template.format(
    TEXT="Your detailed text input here...",
    SCHEMA="""{
  "type": "object",
  "properties": {
    "id": {"type": "string", "description": "Unique identifier."},
    "title": {"type": "string", "description": "Section title."},
    "level": {"type": "integer", "description": "Hierarchy level."},
    "level_type": {"type": "string", "enum": ["ROOT", "SECTION", "SUBSECTION", "DETAIL_N"], "description": "Hierarchy type."},
    "component": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "idc": {"type": "integer", "description": "Component ID."},
          "component_type": {"type": "string", "enum": ["PARAGRAPH", "TABLE", "CALCULATION", "CHECKBOX"], "description": "Component type."},
          "metadata": {"type": "string", "description": "Additional metadata."},
          "properties": {"type": "object"}
        },
        "required": ["idc", "component_type", "metadata", "properties"]
      }
    },
    "children": {"type": "array", "items": {}}
  },
  "required": ["id", "title", "level", "level_type", "component", "children"]
}"""
)

# Use the system prompt with your inference code as shown in the previous examples.
````
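Once the model responds, the two mandatory sections can be separated and the JSON validated programmatically. A minimal sketch (assuming the model honors the output structure above; `model_output` stands in for the generated text from either inference example):

```python
import json
import re

model_output = "..."  # replace with the generated text

think_match = re.search(r"<think>(.*?)</think>", model_output, re.DOTALL)
answer_match = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)

if answer_match:
    # json.loads raises a ValueError if the model emitted malformed JSON.
    structured = json.loads(answer_match.group(1).strip())
    print("Reasoning:", think_match.group(1).strip() if think_match else "(none)")
    print(json.dumps(structured, indent=2))
else:
    print("Incomplete response: no <answer> section found.")
```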
---

## Contributing

Contributions, bug reports, and feature requests are welcome! Please open an issue or submit a pull request if you would like to contribute to this project.

---

## License

This project is licensed under the [Apache-2.0 License](LICENSE).

---

## Acknowledgments

- **Unsloth:** for providing fast model inference capabilities. ([GitHub](https://github.com/unslothai/unsloth))
- **Hugging Face:** for the [Transformers](https://github.com/huggingface/transformers) and [TRL](https://github.com/huggingface/trl) libraries.
- **LangChain:** for advanced prompt management and integration.
- And, of course, thanks to the community and contributors who helped shape this project.

---

Enjoy using the model, and happy coding!