---
license: apache-2.0
language:
- en
- code
library_name: PyLate
tags:
- ColBERT
- PyLate
- sentence-transformers
- code-search
- code-retrieval
- late-interaction
- reasoning
base_model: lightonai/GTE-ModernColBERT-v1
datasets:
- nomic-ai/cornstack-python-v1
- nomic-ai/cornstack-java-v1
- nomic-ai/cornstack-javascript-v1
- nomic-ai/cornstack-php-v1
- nomic-ai/cornstack-go-v1
- nomic-ai/cornstack-ruby-v1
pipeline_tag: sentence-similarity
---
# Reason-Code-ModernColBERT

The **first reasoning-enhanced ColBERT model for code search and retrieval**.

It extends the [ReasonIR methodology](https://arxiv.org/abs/2504.20595) to the code domain — generating reasoning-intensive queries that require understanding algorithms, edge cases, and design patterns, not just keyword matching. Built on research from [LightOn AI](https://huggingface.co/lightonai) (ColBERT for code) and [Facebook Research](https://github.com/facebookresearch/ReasonIR) (reasoning-enhanced retrieval).

## Why Reasoning-Enhanced Training for Code?

Standard code search training uses docstring→code pairs. Our approach generates **reasoning-intensive queries** that require understanding the code's algorithm, behavior, and edge cases — not just surface-level keyword matching. This is the same methodology that enabled [Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT) to outperform 7B dense models on reasoning tasks at only 150M parameters.
## Model Details

| Property | Value |
|---|---|
| **Base model** | [lightonai/GTE-ModernColBERT-v1](https://huggingface.co/lightonai/GTE-ModernColBERT-v1) |
| **Architecture** | ColBERT (late-interaction, multi-vector) |
| **Parameters** | 150M |
| **Embedding dim** | 128 per token |
| **Document length** | 512 tokens |
| **Query length** | 128 tokens |
| **Similarity** | MaxSim |
| **Languages** | Python, Java, JavaScript, PHP, Go, Ruby |
| **License** | Apache 2.0 |
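The MaxSim similarity listed above is simple enough to sketch directly. Below is an illustrative NumPy implementation of late-interaction scoring — each query token is matched against its most similar document token and the per-token maxima are summed — using toy 4-dimensional vectors rather than the model's real 128-dimensional token embeddings:

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """MaxSim: for each query token, take its best-matching document
    token (by dot product) and sum those maxima.

    query_emb: (num_query_tokens, dim); doc_emb: (num_doc_tokens, dim).
    Token vectors are assumed L2-normalised, as ColBERT produces them.
    """
    sim = query_emb @ doc_emb.T          # (q_tokens, d_tokens) similarities
    return float(sim.max(axis=1).sum())  # best doc token per query token

def normalise(m: np.ndarray) -> np.ndarray:
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# Toy 4-dim "token embeddings" standing in for encoder output
query = normalise(np.array([[1.0, 0.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0, 0.0]]))
doc_close = normalise(np.array([[0.9, 0.1, 0.0, 0.0],
                                [0.0, 1.0, 0.2, 0.0]]))
doc_far = normalise(np.array([[0.0, 0.0, 1.0, 0.0],
                              [0.0, 0.0, 0.0, 1.0]]))

assert maxsim_score(query, doc_close) > maxsim_score(query, doc_far)
```

Because every query token independently picks its best match, MaxSim rewards documents that cover all parts of a query rather than averaging everything into one vector.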
## Training

### Two-Stage Training Pipeline

**Stage 1: CoRNStack Base (1 epoch)**
- 100,000 high-quality code search pairs from [CoRNStack](https://huggingface.co/collections/nomic-ai/cornstack-67c60fda17322ce742fe9dac) (Apache 2.0)
- 6 languages: Python (25K), Java (20K), JavaScript (15K), PHP (15K), Go (15K), Ruby (10K)
- Loss: 2.42 → 0.63

**Stage 2: Reasoning-Enhanced Fine-Tuning (3 epochs)**
- 9,959 reasoning-intensive code search queries generated from CoRNStack code samples
- Queries require understanding algorithms, edge cases, design patterns, and complexity
- Each query includes a chain-of-thought reasoning prefix (ReasonIR methodology)
- Loss: 2.36 → 0.54
### Training Configuration

```python
# Both stages
model = ColBERT(document_length=512, query_length=128)
loss = CachedContrastive(temperature=1.0, mini_batch_size=32)
batch_size = 256
optim = "adamw_torch"
bf16 = True

# Stage 1: lr=1e-5, 1 epoch, warmup=5%
# Stage 2: lr=5e-6, 3 epochs, warmup=5%
```
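For reference, the `CachedContrastive` loss above is a cached variant of the standard in-batch contrastive (InfoNCE) objective. The sketch below shows that objective in plain NumPy — cross-entropy over a batch's query–document score matrix with positives on the diagonal. It is an illustration only, not PyLate's actual implementation; in particular, the gradient caching that lets the 256-sample batch fit in memory is omitted.

```python
import numpy as np

def in_batch_contrastive_loss(scores: np.ndarray, temperature: float = 1.0) -> float:
    """In-batch contrastive (InfoNCE) loss.

    scores[i, j] is the MaxSim score of query i against document j;
    the diagonal holds the positive pairs, every other column in the
    row serves as a negative. Returns the mean negative log-likelihood.
    """
    logits = scores / temperature
    # Row-wise log-softmax, numerically stabilised
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs.diagonal().mean())

# A batch where each query scores highest on its own document...
good = np.array([[5.0, 0.1],
                 [0.2, 4.0]])
# ...versus one where positives and negatives are indistinguishable
flat = np.ones((2, 2))

assert in_batch_contrastive_loss(good) < in_batch_contrastive_loss(flat)
```

Training pushes the diagonal (positive) scores above the off-diagonal ones, which is exactly what drives the retrieval quality measured below.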
### Hardware

Trained on a single NVIDIA DGX Spark (GB10 Blackwell, 128 GB unified memory).

- Stage 1: ~130 min (391 steps)
- Stage 2: ~37 min (117 steps)
## Benchmark Results

### CodeSearchNet MRR (500 queries per language, 500 candidates)

| Language | GTE-ModernColBERT (base) | **Reason-Code-ModernColBERT (ours)** | Δ |
|------------|:---:|:---:|:---:|
| Python | 0.991 | 0.989 | -0.002 |
| Java | 0.829 | **0.866** | +0.037 |
| JavaScript | 0.802 | **0.839** | +0.037 |
| PHP | 0.841 | **0.862** | +0.021 |
| Go | 0.879 | **0.887** | +0.008 |
| Ruby | 0.773 | **0.831** | +0.058 |
| **Average** | 0.853 | **0.879** | **+0.026** |

Improves on the base model in 5 of 6 languages. Largest gains in Ruby (+5.8pp), Java (+3.7pp), and JavaScript (+3.7pp) — the languages that benefited most from reasoning-enhanced training data. Python is near-ceiling at 0.99.
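For clarity, the MRR reported above is the mean over queries of 1/rank of the true code snippet among the 500 candidates. A minimal sketch of the metric (illustrative, not the exact evaluation script used here):

```python
import numpy as np

def mean_reciprocal_rank(scores: np.ndarray, positive_idx: np.ndarray) -> float:
    """scores: (num_queries, num_candidates) similarity matrix;
    positive_idx[i] is the column holding query i's true code snippet.
    MRR averages 1 / rank of the positive over all queries.
    """
    pos_scores = scores[np.arange(len(scores)), positive_idx]
    # Rank of the positive = 1 + number of candidates scored above it
    ranks = 1 + (scores > pos_scores[:, None]).sum(axis=1)
    return float((1.0 / ranks).mean())

# Toy run: query 0 ranks its positive first, query 1 ranks it second
scores = np.array([[0.9, 0.3, 0.1],
                   [0.7, 0.6, 0.2]])
positives = np.array([0, 1])

assert mean_reciprocal_rank(scores, positives) == (1.0 + 0.5) / 2  # 0.75
```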
## Usage

```python
from pylate import models

model = models.ColBERT(model_name_or_path="ctrltokyo/Reason-Code-ModernColBERT")

queries = ["function that sorts an array in descending order using a comparison-based algorithm"]
code_docs = ["def sort_desc(arr):\n    return sorted(arr, reverse=True)"]

query_embeddings = model.encode(queries, is_query=True)
doc_embeddings = model.encode(code_docs, is_query=False)
```
## Citation

This model extends the methodology from:

```bibtex
@article{shao2025reasonir,
  title={ReasonIR: Training Retrievers for Reasoning Tasks},
  author={Shao, Rulin and Jiang, Rui and Yu, Tao and Hashimoto, Tatsunori},
  journal={arXiv preprint arXiv:2504.20595},
  year={2025}
}

@misc{Reason-ModernColBERT,
  title={Reason-ModernColBERT},
  author={LightOn AI},
  year={2025},
  url={https://huggingface.co/lightonai/Reason-ModernColBERT}
}

@inproceedings{cornstack2025,
  title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
  author={Gangisetty, Zach and others},
  booktitle={ICLR},
  year={2025}
}
```

Built with [PyLate](https://github.com/lightonai/pylate) and [Sentence Transformers](https://www.sbert.net/).