---
license: apache-2.0
language:
- en
- code
library_name: PyLate
tags:
- ColBERT
- PyLate
- sentence-transformers
- code-search
- code-retrieval
- late-interaction
- reasoning
base_model: lightonai/GTE-ModernColBERT-v1
datasets:
- nomic-ai/cornstack-python-v1
- nomic-ai/cornstack-java-v1
- nomic-ai/cornstack-javascript-v1
- nomic-ai/cornstack-php-v1
- nomic-ai/cornstack-go-v1
- nomic-ai/cornstack-ruby-v1
pipeline_tag: sentence-similarity
---
# Reason-Code-ModernColBERT

The **first reasoning-enhanced ColBERT model for code search and retrieval**.

It extends the [ReasonIR methodology](https://arxiv.org/abs/2504.20595) to the code domain — generating reasoning-intensive queries that require understanding algorithms, edge cases, and design patterns, not just keyword matching. Built on research from [LightOn AI](https://huggingface.co/lightonai) (ColBERT for code) and [Facebook Research](https://github.com/facebookresearch/ReasonIR) (reasoning-enhanced retrieval).

## Why Reasoning-Enhanced Training for Code?

Standard code search training uses docstring→code pairs. Our approach generates **reasoning-intensive queries** that require understanding the code's algorithm, behavior, and edge cases — not just surface-level keyword matching. This is the same methodology that enabled [Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT) to outperform 7B dense models on reasoning tasks at only 150M parameters.
## Model Details

| Property | Value |
|---|---|
| **Base model** | [lightonai/GTE-ModernColBERT-v1](https://huggingface.co/lightonai/GTE-ModernColBERT-v1) |
| **Architecture** | ColBERT (late-interaction, multi-vector) |
| **Parameters** | 150M |
| **Embedding dim** | 128 per token |
| **Document length** | 512 tokens |
| **Query length** | 128 tokens |
| **Similarity** | MaxSim |
| **Languages** | Python, Java, JavaScript, PHP, Go, Ruby |
| **License** | Apache 2.0 |
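The MaxSim similarity listed above is simple enough to sketch directly. Below is an illustrative NumPy implementation of late-interaction scoring — each query token is matched against its most similar document token and the per-token maxima are summed — using toy 4-dimensional vectors rather than the model's real 128-dimensional token embeddings:

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """MaxSim: for each query token, take its best-matching document
    token (by dot product) and sum those maxima.

    query_emb: (num_query_tokens, dim); doc_emb: (num_doc_tokens, dim).
    Token vectors are assumed L2-normalised, as ColBERT produces them.
    """
    sim = query_emb @ doc_emb.T          # (q_tokens, d_tokens) similarities
    return float(sim.max(axis=1).sum())  # best doc token per query token

def normalise(m: np.ndarray) -> np.ndarray:
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# Toy 4-dim "token embeddings" standing in for encoder output
query = normalise(np.array([[1.0, 0.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0, 0.0]]))
doc_close = normalise(np.array([[0.9, 0.1, 0.0, 0.0],
                                [0.0, 1.0, 0.2, 0.0]]))
doc_far = normalise(np.array([[0.0, 0.0, 1.0, 0.0],
                              [0.0, 0.0, 0.0, 1.0]]))

assert maxsim_score(query, doc_close) > maxsim_score(query, doc_far)
```

Because every query token independently picks its best match, MaxSim rewards documents that cover all parts of a query rather than averaging everything into one vector.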
## Training

### Two-Stage Training Pipeline

**Stage 1: CoRNStack Base (1 epoch)**
- 100,000 high-quality code search pairs from [CoRNStack](https://huggingface.co/collections/nomic-ai/cornstack-67c60fda17322ce742fe9dac) (Apache 2.0)
- 6 languages: Python (25K), Java (20K), JavaScript (15K), PHP (15K), Go (15K), Ruby (10K)
- Loss: 2.42 → 0.63

**Stage 2: Reasoning-Enhanced Fine-Tuning (3 epochs)**
- 9,959 reasoning-intensive code search queries generated from CoRNStack code samples
- Queries require understanding algorithms, edge cases, design patterns, and complexity
- Each query includes a chain-of-thought reasoning prefix (ReasonIR methodology)
- Loss: 2.36 → 0.54
### Training Configuration

```python
# Both stages
model = ColBERT(document_length=512, query_length=128)
loss = CachedContrastive(temperature=1.0, mini_batch_size=32)
batch_size = 256
optim = "adamw_torch"
bf16 = True

# Stage 1: lr=1e-5, 1 epoch, warmup=5%
# Stage 2: lr=5e-6, 3 epochs, warmup=5%
```
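For reference, the `CachedContrastive` loss above is a cached variant of the standard in-batch contrastive (InfoNCE) objective. The sketch below shows that objective in plain NumPy — cross-entropy over a batch's query–document score matrix with positives on the diagonal. It is an illustration only, not PyLate's actual implementation; in particular, the gradient caching that lets the 256-sample batch fit in memory is omitted.

```python
import numpy as np

def in_batch_contrastive_loss(scores: np.ndarray, temperature: float = 1.0) -> float:
    """In-batch contrastive (InfoNCE) loss.

    scores[i, j] is the MaxSim score of query i against document j;
    the diagonal holds the positive pairs, every other column in the
    row serves as a negative. Returns the mean negative log-likelihood.
    """
    logits = scores / temperature
    # Row-wise log-softmax, numerically stabilised
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs.diagonal().mean())

# A batch where each query scores highest on its own document...
good = np.array([[5.0, 0.1],
                 [0.2, 4.0]])
# ...versus one where positives and negatives are indistinguishable
flat = np.ones((2, 2))

assert in_batch_contrastive_loss(good) < in_batch_contrastive_loss(flat)
```

Training pushes the diagonal (positive) scores above the off-diagonal ones, which is exactly what drives the retrieval quality measured below.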
### Hardware

Trained on a single NVIDIA DGX Spark (GB10 Blackwell, 128 GB unified memory).

- Stage 1: ~130 min (391 steps)
- Stage 2: ~37 min (117 steps)
## Benchmark Results

### CodeSearchNet MRR (500 queries per language, 500 candidates)

| Language | GTE-ModernColBERT (base) | **Reason-Code-ModernColBERT (ours)** | Δ |
|------------|:---:|:---:|:---:|
| Python | 0.991 | 0.989 | -0.002 |
| Java | 0.829 | **0.866** | +0.037 |
| JavaScript | 0.802 | **0.839** | +0.037 |
| PHP | 0.841 | **0.862** | +0.021 |
| Go | 0.879 | **0.887** | +0.008 |
| Ruby | 0.773 | **0.831** | +0.058 |
| **Average** | 0.853 | **0.879** | **+0.026** |

Improves on the base model in 5 of 6 languages. Largest gains in Ruby (+5.8pp), Java (+3.7pp), and JavaScript (+3.7pp) — the languages that benefited most from reasoning-enhanced training data. Python is near-ceiling at 0.99.
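For clarity, the MRR reported above is the mean over queries of 1/rank of the true code snippet among the 500 candidates. A minimal sketch of the metric (illustrative, not the exact evaluation script used here):

```python
import numpy as np

def mean_reciprocal_rank(scores: np.ndarray, positive_idx: np.ndarray) -> float:
    """scores: (num_queries, num_candidates) similarity matrix;
    positive_idx[i] is the column holding query i's true code snippet.
    MRR averages 1 / rank of the positive over all queries.
    """
    pos_scores = scores[np.arange(len(scores)), positive_idx]
    # Rank of the positive = 1 + number of candidates scored above it
    ranks = 1 + (scores > pos_scores[:, None]).sum(axis=1)
    return float((1.0 / ranks).mean())

# Toy run: query 0 ranks its positive first, query 1 ranks it second
scores = np.array([[0.9, 0.3, 0.1],
                   [0.7, 0.6, 0.2]])
positives = np.array([0, 1])

assert mean_reciprocal_rank(scores, positives) == (1.0 + 0.5) / 2  # 0.75
```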
## Usage

```python
from pylate import models

model = models.ColBERT(model_name_or_path="ctrltokyo/Reason-Code-ModernColBERT")

queries = ["function that sorts an array in descending order using a comparison-based algorithm"]
code_docs = ["def sort_desc(arr):\n    return sorted(arr, reverse=True)"]

query_embeddings = model.encode(queries, is_query=True)
doc_embeddings = model.encode(code_docs, is_query=False)
```
## Citation

This model extends the methodology from:

```bibtex
@article{shao2025reasonir,
  title={ReasonIR: Training Retrievers for Reasoning Tasks},
  author={Shao, Rulin and Jiang, Rui and Yu, Tao and Hashimoto, Tatsunori},
  journal={arXiv preprint arXiv:2504.20595},
  year={2025}
}

@misc{Reason-ModernColBERT,
  title={Reason-ModernColBERT},
  author={LightOn AI},
  year={2025},
  url={https://huggingface.co/lightonai/Reason-ModernColBERT}
}

@inproceedings{cornstack2025,
  title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
  author={Gangisetty, Zach and others},
  booktitle={ICLR},
  year={2025}
}
```

Built with [PyLate](https://github.com/lightonai/pylate) and [Sentence Transformers](https://www.sbert.net/).