Welcome to Crest 20M Base
This is a tiny 20.75M-parameter model showing how well small models can perform when trained on a modest amount of data.
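For the curious, the 20.75M figure can be roughly reproduced from the hyperparameters listed below. This is a back-of-envelope sketch assuming nanoGPT defaults (GPT-2's 50257-token vocabulary, learned positional embeddings, weight tying between the token embedding and the LM head, and bias terms enabled) — these are assumptions, so check model.py for the exact config:

```python
# Back-of-envelope parameter count for a nanoGPT model with the specs below.
n_layer, n_head, n_embd, block_size, vocab = 10, 8, 256, 512, 50257

wte = vocab * n_embd                       # token embeddings (tied with lm_head)
wpe = block_size * n_embd                  # positional embeddings
per_block = (
    2 * n_embd                             # ln_1 (weight + bias)
    + n_embd * 3 * n_embd + 3 * n_embd     # attn.c_attn
    + n_embd * n_embd + n_embd             # attn.c_proj
    + 2 * n_embd                           # ln_2
    + n_embd * 4 * n_embd + 4 * n_embd     # mlp.c_fc
    + 4 * n_embd * n_embd + n_embd         # mlp.c_proj
)
ln_f = 2 * n_embd
total = wte + wpe + n_layer * per_block + ln_f
non_embedding = total - wpe                # nanoGPT reports params minus wpe
print(f"{non_embedding / 1e6:.2f}M parameters")
# ~20.76M, matching the stated 20.75M up to bias/rounding details
```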
Training data
We trained this model on the first 100 million tokens of the 10BT sample of FineWeb-Edu for 5000 steps, reaching a final training loss of ~4.0 and a validation loss of 4.1566.
Training specs
- Architecture: nanoGPT
- Parameters: 20.75M
- Train Steps: 5000 (5k)
- Learning Rate: 5e-4
- Layers: 10
- Heads: 8
- Embedding Dimension: 256
- Block Size (Context Length): 512
- Batch Size: 32
- Gradient Accumulation Steps: 4
- Compile model: False
- Precision / Device: float16, CUDA on a Kaggle T4 (16 GB) GPU
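These settings imply a fixed token budget per optimizer step, which is where the ~3.28 epochs over the 100M-token dataset comes from:

```python
# Tokens processed per optimizer step and total epochs, from the specs above.
batch_size, grad_accum, block_size, steps = 32, 4, 512, 5000

tokens_per_step = batch_size * grad_accum * block_size  # 65,536
total_tokens = tokens_per_step * steps                  # 327,680,000
epochs = total_tokens / 100_000_000                     # over the 100M-token set
print(tokens_per_step, total_tokens, round(epochs, 2))  # 65536 327680000 3.28
```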
Training code
As with all of our models, you can find the full training code in this repo in train.py, model.py, configurator.py, and prepare.py.
Model weights
The final model weights are available as model.pt in this repo. Use use.py to try out the model :D
Example outputs
Prompt: Artificial Intelligence is
Output:
Artificial Intelligence is the ability to make intelligent decisions.
It is a process of understanding how to do things. It is designed to understand the principles of intelligence and the skills to be successful.
There are various types of intelligence and the ability to communicate information about the process. They can use more than one or more of these functions.
What is the reason for being successful is that they are successful in one or more of those of the tasks. They must be able to use the knowledge to understand and understand information about the process.
What is the best way to understand how to communicate information.
The simplest way to understand the concept of intelligence is to understand how to communicate information about the process of communication.
In addition to being successful in the process of
Prompt: The main concept of physics is
Output:
The main concept of physics is the energy of the universe, the natural world, and the space in which the universe are determined.
When we are in a universe, there are no other elements to go with, or a sphere or sphere or sphere. The universe of the universe is determined by the universe, which they are based on the laws of nature and the universe.
Since we are in the universe, the universe is not just the universe, but the universe is not just the universe. The universe is determined by the universe in the universe by the universe. In the universe, the universe is determined by the universe.
For the universe, the universe is determined by the universe, because the universe is determined by the universe. The universe is determined by the universe to
Prompt: Albert Einstein was
Output:
Albert Einstein was the first to study the evolution of the universe. The universe of stars in the universe is the same as the universe of stars, which is the same as the universe of stars, which is the one and the other. Astronomers are the smallest universe of stars, which are very different from other stars.
According to Einstein, this means that the universe of stars is the same with the same star, which are the same as the universe of stars. These galaxies are called stars. But if we see the universe of stars, we see the stars of stars, which are the same that are the same. As we see the universe of stars in the universe of stars in the universe of stars in the universe of stars in the universe of stars.
Quick Start
Please install tiktoken first (pip install tiktoken)!
If you want to train the model yourself, boot up a fresh T4 (or any other GPU with at least 16 GB of VRAM; if you have less, decrease the batch size and increase the gradient accumulation steps) and start by downloading the needed files from this repository:
```shell
mkdir crest_base_20m
cd crest_base_20m
wget https://huggingface.co/LH-Tech-AI/Crest-20M-Base/resolve/main/prepare.py
wget https://huggingface.co/LH-Tech-AI/Crest-20M-Base/resolve/main/model.py
wget https://huggingface.co/LH-Tech-AI/Crest-20M-Base/resolve/main/train.py
wget https://huggingface.co/LH-Tech-AI/Crest-20M-Base/resolve/main/configurator.py
```
The next step is to prepare the data, so run:
```shell
python3 prepare.py
```
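prepare.py handles the tokenization. As a rough sketch of the on-disk format nanoGPT-style prepare scripts produce (an assumption — see prepare.py for the actual logic), the GPT-2 token ids are stored as a flat uint16 array that training can then memory-map:

```python
# Sketch of the nanoGPT-style train.bin format: a flat uint16 token stream.
import numpy as np
import os
import tempfile

token_ids = [15496, 995, 50256]  # example GPT-2 token ids
path = os.path.join(tempfile.mkdtemp(), "train.bin")
np.array(token_ids, dtype=np.uint16).tofile(path)

# train.py-style read: memory-map the file instead of loading it into RAM
data = np.memmap(path, dtype=np.uint16, mode="r")
print(list(data))  # [15496, 995, 50256]
```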
If all data has loaded, you can start the training:
```shell
python train.py \
  --n_layer=10 \
  --n_head=8 \
  --n_embd=256 \
  --block_size=512 \
  --batch_size=32 \
  --gradient_accumulation_steps=4 \
  --max_iters=5000 \
  --eval_interval=100 \
  --learning_rate=5e-4 \
  --compile=False \
  --dtype='float16' \
  --device='cuda'
```
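While training runs, each iteration samples random windows from the memory-mapped token stream. A simplified numpy stand-in for nanoGPT's batch sampling (not the repo's exact code, which uses torch tensors) looks like this:

```python
# Each batch row is a random block_size window x and its one-token-shifted
# target y, so the model learns next-token prediction at every position.
import numpy as np

def get_batch(data, block_size, batch_size, rng):
    ix = rng.integers(0, len(data) - block_size, size=batch_size)
    x = np.stack([data[i : i + block_size] for i in ix])
    y = np.stack([data[i + 1 : i + 1 + block_size] for i in ix])
    return x, y

data = np.arange(1000, dtype=np.uint16)  # stand-in for the train.bin stream
x, y = get_batch(data, block_size=512, batch_size=32, rng=np.random.default_rng(0))
print(x.shape, y.shape)                  # (32, 512) (32, 512)
assert (x[:, 1:] == y[:, :-1]).all()     # y is x shifted left by one token
```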
Then you'll have to wait until iteration 5000 is reached (the log will show something like iter 5000: loss 4.2044, time 50601.67ms, mfu 2.23%).
Use the final model
To use your trained model - or ours that you can find in this repo as model.pt - you can run:
```python
import torch
import tiktoken
import os
from model import GPTConfig, GPT

out_dir = 'out'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the checkpoint (train.py saves it as out/ckpt.pt; point ckpt_path at
# model.pt instead if you downloaded our weights from this repo)
ckpt_path = os.path.join(out_dir, 'ckpt.pt')
checkpoint = torch.load(ckpt_path, map_location=device)
gptconf = GPTConfig(**checkpoint['model_args'])
model = GPT(gptconf)

# Strip the '_orig_mod.' prefix that torch.compile adds to state dict keys
state_dict = checkpoint['model']
unwanted_prefix = '_orig_mod.'
for k, v in list(state_dict.items()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
model.load_state_dict(state_dict)
model.to(device)
model.eval()

enc = tiktoken.get_encoding("gpt2")
EOS_TOKEN_ID = 50256

def ask_gpt(prompt, max_new_tokens=150, temperature=0.7, top_k=25):
    start_ids = enc.encode(prompt)
    x = torch.tensor(start_ids, dtype=torch.long, device=device)[None, ...]
    with torch.no_grad():
        y = model.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)
    full_ids = y[0].tolist()
    new_ids = full_ids[len(start_ids):]          # keep only the continuation
    response = enc.decode(new_ids)
    response = response.split('<|endoftext|>')[0]  # cut at the first EOS
    return response

print("--- Crest Completion Chat started ---")
while True:
    user_input = input("\nYour Prompt: ")
    if user_input.lower() in ['exit', 'quit']:
        break
    antwort_rest = ask_gpt(user_input)
    print(f"\nCrest Completion: {user_input}{antwort_rest}")
    print("-" * 30)
```
This will produce something like (You: "The climate change is"):
Crest Completion: The climate change is about as much as the global warming is changing. The climate is the result of the climate change.
In the world that is the case with extreme weather conditions and climate change, it makes the world more productive. And it makes the world more productive, like the planet’s climate change.
It’s also why we are interested in climate change, we are interested in climate change, like climate change and climate change. We are interested in climate change and climate change.
The climate change in the world is already underway. It is the next step. The world is going to grow in a world where we live in a world where we live in a global society.
While we are interested in climate change, we are interested
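The temperature and top_k arguments passed to model.generate control how the next token is drawn. A simplified numpy stand-in for that sampling step (not nanoGPT's exact implementation, which operates on torch logits):

```python
# Top-k temperature sampling: keep only the k most likely logits, rescale by
# temperature, softmax, then draw one token id from the resulting distribution.
import numpy as np

def sample_next(logits, temperature=0.7, top_k=25, rng=None):
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if top_k is not None and top_k < len(logits):
        cutoff = np.sort(logits)[-top_k]              # k-th largest logit
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())             # stable softmax
    probs /= probs.sum()
    rng = rng or np.random.default_rng()
    return rng.choice(len(probs), p=probs)

token = sample_next([0.1, 3.0, 2.5, -1.0], temperature=0.7, top_k=2)
print(token)  # 1 or 2 -- only the two highest-logit tokens can be drawn
```

Lower temperature and smaller top_k make the output more deterministic; the defaults in ask_gpt (0.7 and 25) are a middle ground for a model this small.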
Limitations
- This model can't chat - it's a base model!
- This model is really dumb. It has only seen 100 million tokens, for ~3.28 epochs.
- This model is not GPT-5.4 or Opus-4.7! Definitely not. :D
Final thoughts
We think this model shows how well very small models can perform on general world-knowledge data when trained for multiple epochs. We're fairly satisfied with these results and wonder what would happen if we fine-tuned this model with SFT to make it chat.