dev_da_roles_1 — Developer + Data Analyst + Business Analyst Classifier
Binary job-vacancy classifier: detects developer, Data Analyst, or Business Analyst roles (tech) versus other roles (other).
Built on top of cointegrated/rubert-tiny2, a compact BERT model for Russian and English text.
v2 — extends v1 by adding Business Analyst to the positive class and using a longer input context (384 tokens / 2000 chars). Precision improved from 0.880 → 0.922.
Task Definition
The positive class (tech) is defined as:
role_category in TECH_CLASSES AND team_lead == 0
TECH_CLASSES:
- Backend
- Desktop / Systems
- Embedded
- Frontend
- Fullstack
- ML / AI / Data Scientist
- Mobile
- Data Analyst
- Бизнес аналитик (Business Analyst)
Team leads and management roles are intentionally excluded from the positive class.
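The labeling rule above can be sketched in a few lines of Python. This is only an illustration: the function name `make_label` is hypothetical, the Business Analyst category is assumed to be stored as the string `Бизнес аналитик` (without the English gloss), and `"Project Manager"` is a made-up category value for the negative case.

```python
# Category names copied from the TECH_CLASSES list above.
# Assumption: the Business Analyst category is stored as "Бизнес аналитик".
TECH_CLASSES = {
    "Backend",
    "Desktop / Systems",
    "Embedded",
    "Frontend",
    "Fullstack",
    "ML / AI / Data Scientist",
    "Mobile",
    "Data Analyst",
    "Бизнес аналитик",
}


def make_label(role_category: str, team_lead: int) -> int:
    """Return 1 (tech) or 0 (other) following the rule stated above."""
    return int(role_category in TECH_CLASSES and team_lead == 0)


print(make_label("Backend", 0))          # developer, not a team lead
print(make_label("Backend", 1))          # team lead, excluded by design
print(make_label("Project Manager", 0))  # hypothetical non-tech category
```

Under this rule a team lead in a tech category still maps to `other`, which matches the exclusion described above.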
Labels
| id | label |
|---|---|
| 0 | other |
| 1 | tech |
Validation Metrics
| Metric | Value |
|---|---|
| ROC AUC | 0.9815 |
| Precision @ threshold | 0.9219 |
| Recall @ threshold | 0.9506 |
| Best threshold | 0.8791 |
| Target recall | 0.95 |
| Best epoch | 7 |
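One plausible way to obtain a threshold for a given target recall is to take the highest cut-off that still recovers the required share of positive examples. The sketch below illustrates that idea on toy data; it is not the author's selection code, and the names `pick_threshold` and `precision_recall` as well as the sample scores are hypothetical.

```python
import math


def pick_threshold(probs, labels, target_recall=0.95):
    """Return the highest threshold t such that recall of (prob >= t) meets the target."""
    pos_scores = sorted((p for p, y in zip(probs, labels) if y == 1), reverse=True)
    if not pos_scores:
        raise ValueError("need at least one positive example")
    # Number of positives that must score at or above the threshold.
    # The small epsilon guards against floating-point noise in the product.
    needed = max(1, math.ceil(target_recall * len(pos_scores) - 1e-9))
    return pos_scores[needed - 1]


def precision_recall(probs, labels, threshold):
    """Compute precision and recall for the decision rule prob >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy validation scores: five positives, three negatives.
probs = [0.99, 0.95, 0.90, 0.85, 0.40, 0.92, 0.30, 0.10]
labels = [1, 1, 1, 1, 1, 0, 0, 0]

threshold = pick_threshold(probs, labels, target_recall=0.8)
precision, recall = precision_recall(probs, labels, threshold)
print(threshold, precision, recall)
```

Raising the target recall pushes the threshold down and usually costs precision, which is the trade-off the metrics table summarizes at a single operating point.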
Recall by key category (held-out test set):
| Category | Recall |
|---|---|
| Backend | 0.984 |
| Frontend | 1.000 |
| Mobile | 1.000 |
| ML / AI / Data Scientist | 0.976 |
| Data Analyst | 0.916 |
| Business Analyst | 0.895 |
Inference Parameters
- max_length: 384 tokens
- Vacancy text: title + " . " + description, description truncated to 2000 characters
- Decision threshold for class tech: 0.8791
Usage
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "AndreiTolmachev/dev_da_roles_1"
THRESHOLD = 0.8791

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()


def is_tech_role(title: str, description: str = "") -> bool:
    # Build the input as title + " . " + description (first 2000 characters).
    text = f"{title.strip()} . {description[:2000].strip()}"
    enc = tokenizer(text, truncation=True, max_length=384, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    # Index 1 is the "tech" class (see Labels).
    prob_tech = torch.softmax(logits, dim=-1)[0, 1].item()
    return prob_tech >= THRESHOLD


# Developer
print(is_tech_role("Backend Python Developer", "FastAPI, PostgreSQL, Docker, Kubernetes..."))
# Data Analyst
print(is_tech_role("Data Analyst", "SQL, Python, dashboards, product metrics, A/B tests..."))
# Business Analyst
print(is_tech_role("Бизнес аналитик", "Сбор требований, UML, BPMN, работа с командой разработки..."))
# Manager (should return False)
print(is_tech_role("Project Manager", "Agile, управление командой, планирование спринтов..."))
Architecture
- Model: BertForSequenceClassification
- Base model: cointegrated/rubert-tiny2
- Layers: 3, hidden size: 312, attention heads: 12
- Vocab size: 83,828
- Parameters: ~29M
- max_position_embeddings: 2048
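The ~29M parameter figure can be roughly cross-checked from the numbers in the list above. The sketch below assumes a standard BERT layout with a pooler and a two-label classification head, a feed-forward width of 600, and a token-type vocabulary of 2; those last two values are not stated in this card and are assumptions about the base model's configuration.

```python
# Values taken from the Architecture list above.
VOCAB_SIZE = 83_828
HIDDEN = 312
LAYERS = 3
MAX_POSITIONS = 2048
NUM_LABELS = 2

# Assumptions (not stated in this card).
INTERMEDIATE = 600  # feed-forward width
TYPE_VOCAB = 2      # token-type (segment) vocabulary size


def linear(n_in: int, n_out: int) -> int:
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # scale and shift vectors

embeddings = (VOCAB_SIZE + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + layer_norm
attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm  # Q, K, V, output projections
feed_forward = linear(HIDDEN, INTERMEDIATE) + linear(INTERMEDIATE, HIDDEN) + layer_norm
encoder = LAYERS * (attention + feed_forward)
pooler = linear(HIDDEN, HIDDEN)
classifier = linear(HIDDEN, NUM_LABELS)

total = embeddings + encoder + pooler + classifier
print(f"embeddings: {embeddings:,}")
print(f"encoder:    {encoder:,}")
print(f"total:      {total:,}")
```

Under these assumptions the estimate lands close to 29 million, and the embedding tables account for the large majority of it, which is typical for a small encoder with a large vocabulary.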
Training
- Dataset: internal job-vacancy dataset (vacancies_labeled.csv), labeled by an LLM pipeline
- Train/test split: 85% / 15%, stratified by role and team_lead flag
- Loss: weighted cross-entropy (pos_weight = 2.115)
- Optimizer: AdamW, lr=2e-5, linear warmup 10%, grad clip 1.0
- Early stopping: patience=3 on F1 at target recall ≥ 0.95
- Threshold selected to achieve target recall = 0.95
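To make the effect of pos_weight concrete, here is a minimal pure-Python sketch of a per-example weighted cross-entropy. It assumes the weight 2.115 multiplies the loss of positive (tech) examples while negative examples keep weight 1.0; the card does not say exactly how the weight is applied, and the function name `weighted_cross_entropy` is hypothetical.

```python
import math

POS_WEIGHT = 2.115  # value from the Training list above


def weighted_cross_entropy(logits, label, pos_weight=POS_WEIGHT):
    """Per-example loss for two-class logits ordered as [other, tech].

    Assumption: positives are weighted by pos_weight, negatives by 1.0.
    """
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_prob = logits[label] - log_z
    weight = pos_weight if label == 1 else 1.0
    return -weight * log_prob


# With uninformative logits, a missed positive costs 2.115 times a missed negative.
loss_neg = weighted_cross_entropy([0.0, 0.0], label=0)
loss_pos = weighted_cross_entropy([0.0, 0.0], label=1)
print(loss_neg, loss_pos, loss_pos / loss_neg)
```

In PyTorch, a comparable effect can be obtained by passing per-class weights through the `weight` argument of `torch.nn.CrossEntropyLoss`, although its default mean reduction normalizes by the sum of the weights rather than by the batch size.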
Limitations
- Trained primarily on Russian-language IT job vacancies; quality on other domains/languages is not guaranteed.
- Team lead and management roles are treated as other by design.
- Description is truncated to 2000 characters before tokenization.
- The model groups developers, Data Analysts, and Business Analysts into one positive class; it does not distinguish between them.
- Data Analyst recall is ~0.92: vacancies with heavy business/finance framing may be missed.
Version
Hub tag: v2.0-dev-da-ba-r95
Changelog vs v1:
- Added Business Analyst (Бизнес аналитик) to positive class
- Input context extended: max_length 256→384, description 1200→2000 chars
- Precision improved: 0.880 → 0.922
- lr lowered to 2e-5, batch size 32→24 to accommodate longer sequences
License
MIT.