Model Card for cisco-ai/SecureBERT2.0-biencoder
The SecureBERT 2.0 Bi-Encoder is a cybersecurity-domain sentence-similarity and document-embedding model fine-tuned from SecureBERT 2.0.
It independently encodes queries and documents into a shared vector space for semantic search, information retrieval, and cybersecurity knowledge retrieval.
Model Details
Model Description
- Developed by: Cisco AI
- Model type: Bi-Encoder (Sentence Transformer)
- Architecture: ModernBERT backbone with dual encoders
- Max sequence length: 1024 tokens
- Output dimension: 768
- Language: English
- License: Apache-2.0
- Finetuned from: cisco-ai/SecureBERT2.0-base
Uses
Direct Use
- Semantic search and document similarity in cybersecurity corpora
- Information retrieval and ranking for threat intelligence reports, advisories, and vulnerability notes
- Document embedding for retrieval-augmented generation (RAG) and clustering
Downstream Use
- Threat intelligence knowledge graph construction
- Cybersecurity QA and reasoning systems
- Security operations center (SOC) data mining
Out-of-Scope Use
- Non-technical or general-domain text similarity
- Generative or conversational tasks
Model Architecture
The Bi-Encoder encodes queries and documents independently into a joint vector space.
This architecture enables scalable approximate nearest-neighbor search for candidate retrieval and semantic ranking.
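The retrieval flow above can be sketched with placeholder data: random vectors stand in for `model.encode(...)` output, and a production ANN index (e.g. FAISS) is replaced by brute-force search over L2-normalized embeddings, where dot products equal cosine similarity.

```python
import numpy as np

def l2_normalize(x):
    # Normalize each row to unit length so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def retrieve_top_k(query_emb, doc_embs, k=2):
    # Brute-force nearest-neighbor search over precomputed document embeddings.
    # In production this step is typically handled by an ANN index.
    sims = l2_normalize(query_emb) @ l2_normalize(doc_embs).T  # [n_queries, n_docs]
    return np.argsort(-sims, axis=1)[:, :k]

rng = np.random.default_rng(0)
# Placeholder 768-d vectors standing in for encoded documents.
docs = rng.normal(size=(5, 768))
# A query that is a near-duplicate of document 2.
query = docs[2:3] + 0.01 * rng.normal(size=(1, 768))

top = retrieve_top_k(query, docs, k=2)
print(top[0][0])  # nearest neighbor is document index 2
```

Because documents are encoded independently of queries, the document embeddings can be computed once and indexed offline; only the query is encoded at search time.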
Datasets
Fine-Tuning Datasets
| Dataset Category | Number of Records |
|---|---|
| Cybersecurity QA corpus | 43,000 |
| Security governance QA corpus | 60,000 |
| Cybersecurity instruction–response corpus | 25,000 |
| Cybersecurity rules corpus (evaluation) | 5,000 |
Dataset Descriptions
- Cybersecurity QA corpus: 43,000 question–answer pairs, reports, and technical documents covering network security, malware analysis, cryptography, and cloud security.
- Security governance QA corpus: 60,000 expert-curated governance and compliance QA pairs emphasizing clear, validated responses.
- Cybersecurity instruction–response corpus: 25,000 instructional pairs enabling reasoning and instruction-following.
- Cybersecurity rules corpus: 5,000 structured policy and guideline records used for evaluation.
How to Get Started with the Model
Using Sentence Transformers
```shell
pip install -U sentence-transformers
```
Run Model to Encode
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("cisco-ai/SecureBERT2.0-biencoder")

sentences = [
    "How would you use Amcache analysis to detect fileless malware?",
    "Amcache analysis provides forensic artifacts for detecting fileless malware ...",
    "To capture and display network traffic",
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)
```
Compute Similarity
```python
from sentence_transformers import util

# Pairwise cosine similarity between all embeddings.
similarity = util.cos_sim(embeddings, embeddings)
print(similarity)
```
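`util.cos_sim` returns the matrix of cosine similarities, i.e. dot products of L2-normalized rows. A minimal numpy sketch of the same computation (illustrative only, not the library's implementation), using small 2-d vectors in place of real embeddings:

```python
import numpy as np

def cos_sim(a, b):
    # Cosine similarity matrix: dot products of L2-normalized rows,
    # mirroring what sentence_transformers.util.cos_sim computes.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sim = cos_sim(emb, emb)
print(np.round(sim, 3))
# Diagonal entries are 1.0 (self-similarity); orthogonal vectors score 0.0.
```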
Framework Versions
- python: 3.10.10
- sentence_transformers: 5.0.0
- transformers: 4.52.4
- PyTorch: 2.7.0+cu128
- accelerate: 1.9.0
- datasets: 3.6.0
Training Details
Training Dataset
The model was fine-tuned on cybersecurity-specific paired-sentence data for document embedding and similarity learning.
- Dataset Size: 35,705 samples
- Columns: `sentence_0`, `sentence_1`, `label`
Example Schema
| Field | Type | Description |
|---|---|---|
| sentence_0 | string | Query or short text input |
| sentence_1 | string | Candidate or document text |
| label | float | Similarity score (1.0 = relevant) |
Example Samples
| sentence_0 | sentence_1 | label |
|---|---|---|
| Under what circumstances does attribution bias distort intrusion linking? | Attribution bias in intrusion linking occurs when analysts allow preconceived notions, organizational pressures, or cognitive shortcuts to influence their assessment of attack origins and relationships between incidents... | 1.0 |
| How can you identify store buffer bypass speculation artifacts? | Store buffer bypass speculation artifacts represent side-channel vulnerabilities that exploit speculative execution to leak sensitive information... | 1.0 |
Training Objective and Loss
The model was optimized to maximize semantic similarity between relevant cybersecurity text pairs using contrastive learning.
- Loss Function: MultipleNegativesRankingLoss
Loss Parameters
```json
{
  "scale": 20.0,
  "similarity_fct": "cos_sim"
}
```
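With these parameters, the objective treats each in-batch (`sentence_0`, `sentence_1`) pair as a positive and every other `sentence_1` in the batch as a negative: cross-entropy over scaled cosine similarities, with the matching document as the target class. A hedged numpy sketch of that formula (an illustration, not the sentence-transformers implementation) on toy 8-d vectors:

```python
import numpy as np

def mnr_loss(query_embs, doc_embs, scale=20.0):
    # MultipleNegativesRankingLoss: for row i, (query_i, doc_i) is the positive
    # pair and every doc_j (j != i) is an in-batch negative. The loss is
    # cross-entropy over scaled cosine similarities with target class i.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = scale * (q @ d.T)                   # [batch, batch]
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(1)
queries = rng.normal(size=(4, 8))
aligned = queries + 0.05 * rng.normal(size=(4, 8))  # matched positives
shuffled = rng.normal(size=(4, 8))                  # unrelated pairs

# Matched pairs should yield a much lower loss than unrelated pairs.
print(mnr_loss(queries, aligned) < mnr_loss(queries, shuffled))
```

The `scale` of 20.0 sharpens the softmax so the model is pushed harder to rank the true pair above the in-batch negatives.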
Reference
```bibtex
@article{aghaei2025securebert,
  title={SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence},
  author={Aghaei, Ehsan and Jain, Sarthak and Arun, Prashanth and Sambamoorthy, Arjun},
  journal={arXiv preprint arXiv:2510.00240},
  year={2025}
}
```
Model Card Authors
Cisco AI
Model Card Contact
For inquiries, please contact ai-threat-intel@cisco.com