dev_da_roles_1 — Developer + Data Analyst + Business Analyst Classifier

Binary job-vacancy classifier: detects developer, Data Analyst, or Business Analyst roles (tech) versus other roles (other).

Built on top of cointegrated/rubert-tiny2, a compact BERT model for Russian and English text.

v2 extends v1 by adding Business Analyst to the positive class and using a longer input context (max_length of 384 tokens, description truncated to 2000 characters). Precision improved from 0.880 to 0.922.

Task Definition

The positive class (tech) is defined as:

role_category in TECH_CLASSES AND team_lead == 0

TECH_CLASSES:

Backend
Desktop / Systems
Embedded
Frontend
Fullstack
ML / AI / Data Scientist
Mobile
Data Analyst
Бизнес аналитик (Business Analyst)

Team leads and management roles are intentionally excluded from the positive class.
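The labeling rule above can be sketched as a small function. This is only an illustration: the name make_label is hypothetical, and it assumes role_category is a string and team_lead is a 0/1 flag, as the rule suggests.

```python
# Category names copied from the TECH_CLASSES list above.
TECH_CLASSES = {
    "Backend",
    "Desktop / Systems",
    "Embedded",
    "Frontend",
    "Fullstack",
    "ML / AI / Data Scientist",
    "Mobile",
    "Data Analyst",
    "Бизнес аналитик",  # Business Analyst
}


def make_label(role_category: str, team_lead: int) -> int:
    """Hypothetical helper: return 1 (tech) or 0 (other) per the rule above."""
    return int(role_category in TECH_CLASSES and team_lead == 0)


print(make_label("Backend", 0))  # tech category, not a team lead
print(make_label("Backend", 1))  # tech category, but a team lead
```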

Labels

id	label
0	other
1	tech

Validation Metrics

Metric	Value
ROC AUC	0.9815
Precision @ threshold	0.9219
Recall @ threshold	0.9506
Best threshold	0.8791
Target recall	0.95
Best epoch	7
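
One way a threshold can be chosen for a target recall is sketched below. The exact procedure used for this model is not documented here, so this is an assumption: it picks the highest threshold that still keeps the required share of positive examples. The function names are hypothetical.

```python
import math


def threshold_for_recall(y_true, probs, target_recall=0.95):
    """Highest threshold t such that recall of (prob >= t) is at least target_recall."""
    pos_scores = sorted((p for y, p in zip(y_true, probs) if y == 1), reverse=True)
    if not pos_scores:
        raise ValueError("no positive examples")
    # Number of positives that must score at or above the threshold.
    k = max(math.ceil(target_recall * len(pos_scores)), 1)
    return pos_scores[k - 1]


def precision_recall_at(y_true, probs, threshold):
    """Precision and recall of the rule prob >= threshold."""
    tp = sum(1 for y, p in zip(y_true, probs) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(y_true, probs) if y == 0 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, probs) if y == 1 and p < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Raising the target recall lowers the selected threshold, which usually costs precision; the 0.95 target above is the operating point reported in the table.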

Recall by key category (held-out test set):

Category	Recall
Backend	0.984
Frontend	1.000
Mobile	1.000
ML / AI / Data Scientist	0.976
Data Analyst	0.916
Business Analyst	0.895
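
Per-category recall of this kind could be computed roughly as follows. This is a sketch with a hypothetical function name; it assumes each example carries its category, its 0/1 label, and the predicted probability of class tech.

```python
from collections import defaultdict


def recall_by_category(categories, y_true, probs, threshold=0.8791):
    """Share of positive examples in each category scored at or above the threshold."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for cat, y, p in zip(categories, y_true, probs):
        if y != 1:
            continue  # recall is computed over positive examples only
        totals[cat] += 1
        if p >= threshold:
            hits[cat] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}
```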

Inference Parameters

max_length: 384 tokens
Vacancy text: title + " . " + description, description truncated to 2000 characters
Decision threshold for class tech: 0.8791

Usage

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "AndreiTolmachev/dev_da_roles_1"
THRESHOLD = 0.8791

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

def is_tech_role(title: str, description: str = "") -> bool:
    text = f"{title.strip()} . {description[:2000].strip()}"
    enc = tokenizer(text, truncation=True, max_length=384, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    prob_tech = torch.softmax(logits, dim=-1)[0, 1].item()
    return prob_tech >= THRESHOLD

# Developer
print(is_tech_role("Backend Python Developer", "FastAPI, PostgreSQL, Docker, Kubernetes..."))

# Data Analyst
print(is_tech_role("Data Analyst", "SQL, Python, dashboards, product metrics, A/B tests..."))

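# Business Analyst (Russian input; the title means "Business analyst", the description "Requirements gathering, UML, BPMN, working with the development team...")
print(is_tech_role("Бизнес аналитик", "Сбор требований, UML, BPMN, работа с командой разработки..."))

# Manager: should return False (the Russian part of the description means "team management, sprint planning...")
print(is_tech_role("Project Manager", "Agile, управление командой, планирование спринтов..."))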

Architecture

Model: BertForSequenceClassification
Base model: cointegrated/rubert-tiny2
Layers: 3, hidden size: 312, attention heads: 12
Vocab size: 83,828
Parameters: ~29M
max_position_embeddings: 2048
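
The parameter count can be checked with a rough tally of a standard BERT classifier. Two values below are assumptions that do not appear in this card: the feed-forward size (INTERMEDIATE = 600) and the number of segment types (TYPE_VOCAB = 2). Everything else comes from the list above.

```python
# Values taken from the Architecture list above.
VOCAB_SIZE = 83_828
HIDDEN = 312
LAYERS = 3
MAX_POSITIONS = 2048
NUM_LABELS = 2

# ASSUMPTIONS (not stated in this card).
INTERMEDIATE = 600  # feed-forward inner size
TYPE_VOCAB = 2      # segment (token type) embeddings


def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Scale plus shift of a LayerNorm."""
    return 2 * n


embeddings = (VOCAB_SIZE + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + layer_norm(HIDDEN)
per_layer = (
    4 * linear(HIDDEN, HIDDEN)        # query, key, value, attention output
    + linear(HIDDEN, INTERMEDIATE)    # feed-forward up-projection
    + linear(INTERMEDIATE, HIDDEN)    # feed-forward down-projection
    + 2 * layer_norm(HIDDEN)
)
pooler = linear(HIDDEN, HIDDEN)
classifier = linear(HIDDEN, NUM_LABELS)
total = embeddings + LAYERS * per_layer + pooler + classifier

print(f"token embeddings: {VOCAB_SIZE * HIDDEN:,}")
print(f"total: {total:,}")
```

Under these assumptions the tally lands close to the ~29M stated above, with about 26.2M of it in the token embedding table, so the three transformer layers themselves are a small part of the model.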

Training

Dataset: internal job-vacancy dataset (vacancies_labeled.csv), labeled by an LLM pipeline
Train/test split: 85% / 15%, stratified by role and team_lead flag
Loss: weighted cross-entropy (pos_weight = 2.115)
Optimizer: AdamW, lr=2e-5, linear warmup 10%, grad clip 1.0
Early stopping: patience=3 on F1 at target recall ≥ 0.95
Threshold selected to achieve target recall = 0.95
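
The weighted loss can be illustrated in plain Python. How pos_weight was wired into the actual training code is not stated here; the sketch assumes it is the class weight of label 1, that label 0 keeps weight 1.0, and that the batch loss is normalized by the sum of weights (the behavior of torch.nn.CrossEntropyLoss when given a weight tensor). The function name is hypothetical.

```python
import math

POS_WEIGHT = 2.115  # from the Training list above


def weighted_cross_entropy(logits, labels, pos_weight=POS_WEIGHT):
    """Class-weighted cross-entropy over a batch of two-class logits.

    ASSUMPTION: label 1 gets weight pos_weight, label 0 gets weight 1.0,
    and the sum of weighted losses is divided by the sum of weights.
    """
    total_loss = 0.0
    total_weight = 0.0
    for (z0, z1), y in zip(logits, labels):
        # Numerically stable log-softmax over the two logits.
        m = max(z0, z1)
        log_norm = m + math.log(math.exp(z0 - m) + math.exp(z1 - m))
        log_prob = (z1 if y == 1 else z0) - log_norm
        w = pos_weight if y == 1 else 1.0
        total_loss += -w * log_prob
        total_weight += w
    return total_loss / total_weight
```

With a weight above 1 on the positive class, mistakes on tech vacancies contribute more to the batch loss than mistakes on other vacancies, which pushes training toward higher recall.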

Limitations

Trained primarily on Russian-language IT job vacancies; quality on other domains/languages is not guaranteed.
Team lead and management roles are treated as other by design.
Description is truncated to 2000 characters before tokenization.
The model groups developers, Data Analysts, and Business Analysts into one positive class; it does not distinguish between them.
Data Analyst recall is ~0.92 and Business Analyst recall is ~0.90, both lower than for the developer categories; Data Analyst vacancies with heavy business/finance framing may be missed.

Version

Hub tag: v2.0-dev-da-ba-r95

Changelog vs v1:

Added Business Analyst (Бизнес аналитик) to positive class
Input context extended: max_length 256→384, description 1200→2000 chars
Precision improved: 0.880 → 0.922
Learning rate lowered to 2e-5; batch size reduced 32→24 to accommodate longer sequences

License

MIT.
