Welcome to Crest 20M Base

This is a tiny 20.75M-parameter model showing how small models can perform when trained on a small amount of data.

Training data

We trained this model for 5000 steps on the first 100 million tokens of the 10BT sample of FineWeb-Edu, reaching a final train loss of ~4.0 and a validation loss of 4.1566.
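The ~3.28-epoch figure mentioned later in this card follows directly from the hyperparameters listed below; a quick sanity check of the token budget:

```python
# Tokens consumed per optimizer step and total epochs over the 100M-token subset.
# All numbers are taken from the training specs in this card.
batch_size = 32
grad_accum = 4
block_size = 512
steps = 5000

tokens_per_step = batch_size * grad_accum * block_size   # 65,536
total_tokens = tokens_per_step * steps                   # 327,680,000
epochs = total_tokens / 100_000_000

print(tokens_per_step, total_tokens, round(epochs, 2))   # 65536 327680000 3.28
```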

Training specs

  • Architecture: nanoGPT
  • Parameters: 20.75M
  • Train Steps: 5000 (5k)
  • Learning Rate: 5e-4
  • Layers: 10
  • Heads: 8
  • Embed Layers: 256
  • Block Size (context length): 512
  • Batch Size: 32
  • Gradient Accumulation Steps: 4
  • Compile model: False
  • Dtype / Device: float16 on CUDA (Kaggle T4, 16 GB GPU)
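The 20.75M figure is consistent with nanoGPT's parameter accounting under its usual defaults — these are assumptions on our part: vocabulary padded to 50,304, bias=False, the LM head tied to the token embedding, and positional embeddings excluded from the reported count:

```python
# Back-of-the-envelope parameter count for the config in this card, assuming
# nanoGPT defaults: vocab padded to 50,304, bias=False, tied lm_head/wte,
# and positional embeddings excluded from the reported total.
vocab, n_layer, n_embd = 50304, 10, 256

wte = vocab * n_embd                              # token embedding (tied with lm_head)
attn = n_embd * 3 * n_embd + n_embd * n_embd      # c_attn + c_proj
mlp = n_embd * 4 * n_embd + 4 * n_embd * n_embd   # c_fc + c_proj
lns = 2 * n_embd                                  # two LayerNorm weights per block
block = attn + mlp + lns
ln_f = n_embd                                     # final LayerNorm

total = wte + n_layer * block + ln_f
print(total, round(total / 1e6, 2))               # 20747520 20.75
```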

Training code

As with all of our models, you can find the full training code in this repo in the files train.py, model.py, configurator.py and prepare.py.

Model weights

The final model weights can be found as model.pt in this repo. Use use.py to try out the model :D

Example outputs

Prompt: Artificial Intelligence is
Output:

Artificial Intelligence is the ability to make intelligent decisions.
It is a process of understanding how to do things. It is designed to understand the principles of intelligence and the skills to be successful.
There are various types of intelligence and the ability to communicate information about the process. They can use more than one or more of these functions.
What is the reason for being successful is that they are successful in one or more of those of the tasks. They must be able to use the knowledge to understand and understand information about the process.
What is the best way to understand how to communicate information.
The simplest way to understand the concept of intelligence is to understand how to communicate information about the process of communication.
In addition to being successful in the process of

Prompt: The main concept of physics is
Output:

The main concept of physics is the energy of the universe, the natural world, and the space in which the universe are determined.
When we are in a universe, there are no other elements to go with, or a sphere or sphere or sphere. The universe of the universe is determined by the universe, which they are based on the laws of nature and the universe.
Since we are in the universe, the universe is not just the universe, but the universe is not just the universe. The universe is determined by the universe in the universe by the universe. In the universe, the universe is determined by the universe.
For the universe, the universe is determined by the universe, because the universe is determined by the universe. The universe is determined by the universe to

Prompt: Albert Einstein was
Output:

Albert Einstein was the first to study the evolution of the universe. The universe of stars in the universe is the same as the universe of stars, which is the same as the universe of stars, which is the one and the other. Astronomers are the smallest universe of stars, which are very different from other stars.
According to Einstein, this means that the universe of stars is the same with the same star, which are the same as the universe of stars. These galaxies are called stars. But if we see the universe of stars, we see the stars of stars, which are the same that are the same. As we see the universe of stars in the universe of stars in the universe of stars in the universe of stars in the universe of stars.

Quick Start

Please install tiktoken first (pip install tiktoken)!

If you want to train the model yourself, boot up a fresh T4 (or any other GPU with at least 16 GB of VRAM; with less VRAM, decrease the batch size and increase the gradient accumulation steps) and start by downloading the needed files from this repository:

mkdir crest_base_20m
cd crest_base_20m
wget https://huggingface.co/LH-Tech-AI/Crest-20M-Base/resolve/main/prepare.py
wget https://huggingface.co/LH-Tech-AI/Crest-20M-Base/resolve/main/model.py
wget https://huggingface.co/LH-Tech-AI/Crest-20M-Base/resolve/main/train.py
wget https://huggingface.co/LH-Tech-AI/Crest-20M-Base/resolve/main/configurator.py
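The batch-size advice above works because trading batch size for gradient accumulation keeps the effective batch constant — halving batch_size while doubling gradient_accumulation_steps leaves the number of tokens per optimizer step unchanged:

```python
# Tokens per optimizer step = batch_size * grad_accum_steps * block_size.
# Trading batch size for gradient accumulation lowers per-step memory while
# keeping the effective batch (and thus the optimization schedule) the same.
block_size = 512

default  = 32 * 4 * block_size   # this card's settings
low_vram = 16 * 8 * block_size   # half the activation memory, same effective batch

print(default, low_vram)   # 65536 65536
```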

The next step is to prepare the data, so run:

python3 prepare.py

Once the data has been prepared, you can start training:

python train.py \
    --n_layer=10 \
    --n_head=8 \
    --n_embd=256 \
    --block_size=512 \
    --batch_size=32 \
    --gradient_accumulation_steps=4 \
    --max_iters=5000 \
    --eval_interval=100 \
    --learning_rate=5e-4 \
    --compile=False \
    --dtype='float16' \
    --device='cuda'

Then wait until iteration 5000 is reached (the script will log something like iter 5000: loss 4.2044, time 50601.67ms, mfu 2.23%).
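To put the reported losses in perspective, cross-entropy loss converts directly to perplexity via exp(loss); the validation loss of 4.1566 corresponds to a perplexity of roughly 64 over the GPT-2 vocabulary:

```python
import math

# Perplexity = exp(cross-entropy loss), using the val loss reported in this card.
val_loss = 4.1566
print(round(math.exp(val_loss), 1))   # 63.9
```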

Use the final model

To use your trained model - or ours, which you can find in this repo as model.pt - you can run:

import torch
import tiktoken
import os
from model import GPTConfig, GPT

out_dir = 'out'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ckpt_path = os.path.join(out_dir, 'ckpt.pt')  # point this at model.pt if using the downloaded weights
checkpoint = torch.load(ckpt_path, map_location=device)
gptconf = GPTConfig(**checkpoint['model_args'])
model = GPT(gptconf)
state_dict = checkpoint['model']
unwanted_prefix = '_orig_mod.'
for k,v in list(state_dict.items()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
model.load_state_dict(state_dict)
model.to(device)
model.eval()

enc = tiktoken.get_encoding("gpt2")
EOS_TOKEN_ID = 50256  # GPT-2 <|endoftext|> id (decoding below splits on the literal token instead)

def ask_gpt(prompt, max_new_tokens=150, temperature=0.7, top_k=25):
    start_ids = enc.encode(prompt)
    x = torch.tensor(start_ids, dtype=torch.long, device=device)[None, ...]

    with torch.no_grad():
        y = model.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)
        
        full_ids = y[0].tolist()
        new_ids = full_ids[len(start_ids):]
        
        response = enc.decode(new_ids)
        response = response.split('<|endoftext|>')[0]
        return response

print("--- Crest Completion Chat started ---")
while True:
    user_input = input("\nYour Prompt: ")
    if user_input.lower() in ['exit', 'quit']: break
    
    completion = ask_gpt(user_input)
    
    print(f"\nCrest Completion: {user_input}{completion}")
    print("-" * 30)

This will produce something like (You: "The climate change is"):

Crest Completion: The climate change is about as much as the global warming is changing. The climate is the result of the climate change.
In the world that is the case with extreme weather conditions and climate change, it makes the world more productive. And it makes the world more productive, like the planet’s climate change.
It’s also why we are interested in climate change, we are interested in climate change, like climate change and climate change. We are interested in climate change and climate change.
The climate change in the world is already underway. It is the next step. The world is going to grow in a world where we live in a world where we live in a global society.
While we are interested in climate change, we are interested
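For intuition, the temperature and top_k arguments passed to ask_gpt shape each sampling step roughly as follows. This is a minimal pure-Python sketch of the standard temperature + top-k technique, not the model's actual generate implementation:

```python
import math
import random

def sample_next(logits, temperature=0.7, top_k=25, rng=random):
    """Standard temperature + top-k sampling over a list of raw logits."""
    # Keep only the top_k highest logits; everything else gets probability 0.
    cutoff = sorted(logits, reverse=True)[min(top_k, len(logits)) - 1]
    scaled = [l / temperature if l >= cutoff else float('-inf') for l in logits]
    # Softmax over the surviving logits (subtract max for numerical stability).
    m = max(scaled)
    exps = [math.exp(s - m) if s != float('-inf') else 0.0 for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one token id according to the resulting distribution.
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

With top_k=1 this reduces to greedy decoding, and lowering the temperature sharpens the distribution over the kept tokens; temperature=0.7 with top_k=25, as used above, trades a little diversity for more coherent continuations.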

Limitations

  • This model can't chat - it's a base model!
  • This model is really dumb. It has only seen 100 million tokens, repeated for ~3.28 epochs.
  • This model is not GPT-5.4 or Opus-4.7! Definitely not. :D

Final thoughts

We think this model nicely shows how very small models can perform on general world-knowledge data when trained for multiple epochs. We are fairly satisfied with these results and wonder what would happen if we fine-tuned this model with SFT to make it chat.
