---
license: mit
pipeline_tag: text-generation
library_name: transformers
language: [
  'en', 'am', 'ar', 'as', 'az', 'be', 'bg', 'bn', 'br', 'bs', 'ca', 'cs', 'cy', 'da', 'de', 'el',
  'eo', 'es', 'et', 'eu', 'fa', 'ff', 'fi', 'fr', 'fy', 'ga', 'gd', 'gl', 'gn', 'gu', 'ha', 'he',
  'hi', 'hr', 'ht', 'hu', 'hy', 'id', 'ig', 'is', 'it', 'ja', 'jv', 'ka', 'kk', 'km', 'kn', 'ko',
  'ku', 'ky', 'la', 'lg', 'li', 'ln', 'lo', 'lt', 'lv', 'mg', 'mk', 'ml', 'mn', 'mr', 'ms', 'my',
  'ne', 'nl', 'no', 'ns', 'om', 'or', 'pa', 'pl', 'ps', 'pt', 'qu', 'rm', 'ro', 'ru', 'sa', 'si',
  'sc', 'sd', 'sk', 'sl', 'so', 'sq', 'sr', 'ss', 'su', 'sv', 'sw', 'ta', 'te', 'th', 'tl', 'tn',
  'tr', 'ug', 'uk', 'ur', 'uz', 'vi', 'wo', 'xh', 'yi', 'yo', 'zu',
]
datasets:

- ontocord/fineweb-permissive-multilingual-2m
- distily/c4_multilingual_1M
- data-silence/sumnews
- xu-song/cc100-samples
- badrex/llm-emoji-dataset
- fblgit/simple-math
- Gusarich/math-expressions-1m
- neuralwork/arxiver
- christopher/rosetta-code
- nampdn-ai/tiny-codes
- JeanKaddour/minipile

- NousResearch/hermes-function-calling-v1
- simplescaling/s1K-1.1

- mlabonne/open-perfectblend
- allenai/tulu-3-sft-mixture
- rombodawg/Everything_Instruct_Multilingual

- open-r1/OpenR1-Math-220k
- open-thoughts/OpenThoughts-114k
- cognitivecomputations/dolphin-r1
- simplescaling/s1K-1.1
tags:
- chat
- core
- base
- instruct
- reason
---
					
						
# tangled-alpha-0.13-core

```bash
time python -B prepare_base_datasets.py
```
					
						
```
i=0, min_len=0, max_len=1073741824, block_size=8193, chunk_size=16386000, len(dataset)=1496631, len(dataset) * block_size=12261897783
Total number of tokens in the optimized dataset '../base-data-0-0-1073741824-8193-2000' is 12261897783

i=1, min_len=8193, max_len=16385, block_size=16385, chunk_size=16385000, len(dataset)=78802, len(dataset) * block_size=1291170770
Total number of tokens in the optimized dataset '../base-data-1-8193-16385-16385-1000' is 1291170770

i=2, min_len=16385, max_len=32769, block_size=32769, chunk_size=16384500, len(dataset)=23511, len(dataset) * block_size=770431959
Total number of tokens in the optimized dataset '../base-data-2-16385-32769-32769-500' is 770431959

i=3, min_len=32769, max_len=65537, block_size=65537, chunk_size=16384250, len(dataset)=5128, len(dataset) * block_size=336073736
Total number of tokens in the optimized dataset '../base-data-3-32769-65537-65537-250' is 336073736

i=4, min_len=65537, max_len=131073, block_size=131073, chunk_size=16384125, len(dataset)=1169, len(dataset) * block_size=153224337
Total number of tokens in the optimized dataset '../base-data-4-65537-131073-131073-125' is 153224337

46G     ../base-data-0-0-1073741824-8193-2000
4.9G    ../base-data-1-8193-16385-16385-1000
2.9G    ../base-data-2-16385-32769-32769-500
1.3G    ../base-data-3-32769-65537-65537-250
589M    ../base-data-4-65537-131073-131073-125
```
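The numbers in the log above are internally consistent, and a short check makes the relationship between them explicit. The sketch below is not part of `prepare_base_datasets.py`; `BUCKETS` and `check_bucket` are hypothetical names, and the values are copied from the log. It assumes that `chunk_size` is a whole multiple of `block_size` and that the last number in each directory name is that multiple (blocks per chunk).

```python
# Values copied from the log above, one tuple per length bucket:
# (i, min_len, max_len, block_size, chunk_size, len_dataset, reported_tokens)
BUCKETS = [
    (0, 0, 1073741824, 8193, 16386000, 1496631, 12261897783),
    (1, 8193, 16385, 16385, 16385000, 78802, 1291170770),
    (2, 16385, 32769, 32769, 16384500, 23511, 770431959),
    (3, 32769, 65537, 65537, 16384250, 5128, 336073736),
    (4, 65537, 131073, 131073, 16384125, 1169, 153224337),
]


def check_bucket(block_size, chunk_size, len_dataset, reported_tokens):
    """Verify the logged arithmetic for one bucket; return blocks per chunk."""
    blocks_per_chunk, remainder = divmod(chunk_size, block_size)
    if remainder != 0:
        raise ValueError("chunk_size is not a multiple of block_size")
    if len_dataset * block_size != reported_tokens:
        raise ValueError("token total does not match len(dataset) * block_size")
    return blocks_per_chunk


# One entry per bucket, in log order.
blocks_per_chunk = [check_bucket(b, c, n, t) for (_, _, _, b, c, n, t) in BUCKETS]
print(blocks_per_chunk)  # [2000, 1000, 500, 250, 125]
```

Each time `block_size` roughly doubles, the blocks-per-chunk count halves, so `chunk_size` stays close to 16.38M tokens in every bucket.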
					
						
```bash
CUDA_VISIBLE_DEVICES=0 CUDA_LAUNCH_BLOCKING=0 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True litgpt pretrain --config pretrain_base_model_0.yaml
```

```
```
					
						
Back up `wandb`:

```bash
mv wandb wandb-pretrain-base-0
```
					
						
Copy config:

```bash
cp ../config-0.json ../out/pretrain-base-0/final/config.json
```

Chat with model:

```bash
CUDA_VISIBLE_DEVICES=0 CUDA_LAUNCH_BLOCKING=0 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True litgpt chat ../out/pretrain-base-0/final
```

```bash
CUDA_VISIBLE_DEVICES=0 CUDA_LAUNCH_BLOCKING=0 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True time litgpt evaluate --tasks 'leaderboard' --out_dir '../evaluate/pretrain-base-0/leaderboard/' --batch_size '4' --dtype 'bfloat16' '../out/pretrain-base-0/final'
```

```
```

```bash
litgpt convert_pretrained_checkpoint ../out/pretrain-base-0/final ../out/pretrain-base-0/checkpoint
```

```bash
CUDA_VISIBLE_DEVICES=0 CUDA_LAUNCH_BLOCKING=0 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True litgpt pretrain --config pretrain_base_model_1.yaml
```

```bash
litgpt convert_pretrained_checkpoint ../out/pretrain-base-1/final ../out/pretrain-base-1/checkpoint
```

```bash
CUDA_VISIBLE_DEVICES=0 CUDA_LAUNCH_BLOCKING=0 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True litgpt pretrain --config pretrain_base_model_2.yaml
```

```bash
litgpt convert_pretrained_checkpoint ../out/pretrain-base-2/final ../out/pretrain-base-2/checkpoint
```

```bash
CUDA_VISIBLE_DEVICES=0 CUDA_LAUNCH_BLOCKING=0 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True litgpt pretrain --config pretrain_base_model_3.yaml
```
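The pretrain and convert steps above follow a regular pattern: pretrain stage `k`, then convert its `final` checkpoint before starting stage `k + 1`. A minimal sketch that rebuilds that command list is shown below. `stage_commands` is a hypothetical helper, not part of LitGPT; it only builds and prints strings and does not run anything. The `wandb` backup, config copy, chat, and evaluation steps are left out.

```python
# Environment prefix copied from the pretrain commands above.
ENV = (
    "CUDA_VISIBLE_DEVICES=0 CUDA_LAUNCH_BLOCKING=0 "
    "PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True"
)


def stage_commands(n_stages):
    """Build the pretrain/convert command strings for stages 0..n_stages-1."""
    commands = []
    for k in range(n_stages):
        commands.append(f"{ENV} litgpt pretrain --config pretrain_base_model_{k}.yaml")
        # Above, every stage except the last is followed by a checkpoint conversion.
        if k < n_stages - 1:
            out = f"../out/pretrain-base-{k}"
            commands.append(
                f"litgpt convert_pretrained_checkpoint {out}/final {out}/checkpoint"
            )
    return commands


# Four stages (0 to 3), matching the configs referenced above.
for command in stage_commands(4):
    print(command)
```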
					
						
```bash
CUDA_VISIBLE_DEVICES=0 CUDA_LAUNCH_BLOCKING=0 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True time litgpt evaluate --tasks 'leaderboard' --out_dir '../evaluate/pretrain-base-3/leaderboard/' --batch_size '4' --dtype 'bfloat16' '../out/pretrain-base-3/final'
```

```
```