dev_da_roles_1 — Developer + Data Analyst + Business Analyst Classifier

Binary job-vacancy classifier: detects developer, Data Analyst, or Business Analyst roles (tech) versus other roles (other).

Built on top of cointegrated/rubert-tiny2, a compact BERT model for Russian and English text.

v2: extends v1 by adding Business Analyst to the positive class and using a longer input context (384 tokens / 2000 characters). Precision improved from 0.880 to 0.922.

Task Definition

The positive class (tech) is defined as:

role_category in TECH_CLASSES AND team_lead == 0

TECH_CLASSES:

  • Backend
  • Desktop / Systems
  • Embedded
  • Frontend
  • Fullstack
  • ML / AI / Data Scientist
  • Mobile
  • Data Analyst
  • Бизнес аналитик (Business Analyst)

Team leads and management roles are intentionally excluded from the positive class.
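
The labeling rule above can be written as a small function. This is a minimal sketch, not the actual labeling code: the field names role_category and team_lead come from the rule, while the exact string values of the categories and the helper name make_label are assumptions for illustration.

```python
# Category strings are assumed to match the list above verbatim;
# the real dataset may spell them differently.
TECH_CLASSES = {
    "Backend",
    "Desktop / Systems",
    "Embedded",
    "Frontend",
    "Fullstack",
    "ML / AI / Data Scientist",
    "Mobile",
    "Data Analyst",
    "Бизнес аналитик",
}


def make_label(role_category: str, team_lead: int) -> int:
    """Return 1 (tech) or 0 (other) following the stated rule."""
    return int(role_category in TECH_CLASSES and team_lead == 0)


print(make_label("Backend", 0))          # 1
print(make_label("Backend", 1))          # 0, team leads are excluded
print(make_label("Project Manager", 0))  # 0, not a tech category
```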

Labels

id label
0 other
1 tech

Validation Metrics

Metric                  Value
ROC AUC                 0.9815
Precision @ threshold   0.9219
Recall @ threshold      0.9506
Best threshold          0.8791
Target recall           0.95
Best epoch              7
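
One plausible way to arrive at a "best threshold" for a target recall is to take the highest threshold that still keeps recall at or above the target, then read precision at that point. The sketch below illustrates that idea on toy data; the function names and the selection rule are assumptions, not the author's training code.

```python
import math


def pick_threshold(probs, labels, target_recall=0.95):
    """Highest threshold t such that recall of (prob >= t) is >= target_recall."""
    pos_scores = sorted((p for p, y in zip(probs, labels) if y == 1), reverse=True)
    if not pos_scores:
        raise ValueError("no positive examples")
    # Number of positives that must score at or above the threshold.
    # The small epsilon guards against float rounding just above an integer.
    k = max(1, math.ceil(target_recall * len(pos_scores) - 1e-9))
    return pos_scores[k - 1]


def precision_recall(probs, labels, threshold):
    """Precision and recall of the rule prob >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy scores and labels, not real model output.
probs = [0.95, 0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0]

t = pick_threshold(probs, labels, target_recall=0.75)
print(t)                                   # 0.7
print(precision_recall(probs, labels, t))  # (0.75, 0.75)
```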

Recall by key category (held-out test set):

Category                   Recall
Backend                    0.984
Frontend                   1.000
Mobile                     1.000
ML / AI / Data Scientist   0.976
Data Analyst               0.916
Business Analyst           0.895
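
A per-category recall table like the one above can be computed by grouping positive examples by category and counting how many clear the decision threshold. The sketch below is illustrative only: the row layout (category, label, prob_tech) and the function name are assumptions, and the data is made up.

```python
from collections import defaultdict


def recall_by_category(rows, threshold):
    """rows: iterable of (category, label, prob_tech).

    Returns {category: recall}, computed over positive examples only.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, label, prob in rows:
        if label != 1:
            continue  # recall is defined over true positives of the class
        totals[category] += 1
        if prob >= threshold:
            hits[category] += 1
    return {c: hits[c] / totals[c] for c in totals}


# Toy rows, not real evaluation data.
rows = [
    ("Backend", 1, 0.99),
    ("Backend", 1, 0.95),
    ("Data Analyst", 1, 0.91),
    ("Data Analyst", 1, 0.50),
    ("Backend", 0, 0.10),
]
print(recall_by_category(rows, threshold=0.8791))
# {'Backend': 1.0, 'Data Analyst': 0.5}
```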

Inference Parameters

  • max_length: 384 tokens
  • Vacancy text: title + " . " + description, description truncated to 2000 characters
  • Decision threshold for class tech: 0.8791

Usage

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "AndreiTolmachev/dev_da_roles_1"
THRESHOLD = 0.8791

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

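def is_tech_role(title: str, description: str = "") -> bool:
    """Return True if the vacancy is classified as tech at THRESHOLD."""
    # Input format from Inference Parameters: title + " . " + description (first 2000 chars).
    text = f"{title.strip()} . {description[:2000].strip()}"
    enc = tokenizer(text, truncation=True, max_length=384, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    # Index 1 is the tech class (see Labels).
    prob_tech = torch.softmax(logits, dim=-1)[0, 1].item()
    return prob_tech >= THRESHOLD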

# Developer
print(is_tech_role("Backend Python Developer", "FastAPI, PostgreSQL, Docker, Kubernetes..."))

# Data Analyst
print(is_tech_role("Data Analyst", "SQL, Python, dashboards, product metrics, A/B tests..."))

# Business Analyst
print(is_tech_role("Бизнес аналитик", "Сбор требований, UML, BPMN, работа с командой разработки..."))

# Manager — should return False
print(is_tech_role("Project Manager", "Agile, управление командой, планирование спринтов..."))

Architecture

  • Model: BertForSequenceClassification
  • Base model: cointegrated/rubert-tiny2
  • Layers: 3, hidden size: 312, attention heads: 12
  • Vocab size: 83,828
  • Parameters: ~29M
  • max_position_embeddings: 2048
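
Most of the ~29M parameters sit in the embedding tables rather than in the three encoder layers. The arithmetic below uses only the figures listed above and assumes the standard BERT layout (a word-embedding table of vocab size × hidden size and a position table of max positions × hidden size); token-type embeddings, layer norms, encoder layers, and the classification head are left out.

```python
vocab_size = 83_828
hidden_size = 312
max_positions = 2048

# Embedding tables in a standard BERT layout.
word_embeddings = vocab_size * hidden_size
position_embeddings = max_positions * hidden_size
embedding_total = word_embeddings + position_embeddings

print(word_embeddings)      # 26154336
print(position_embeddings)  # 638976
print(embedding_total)      # 26793312
```

Under these assumptions the two tables account for roughly 26.8M of the ~29M parameters, which is why the large vocabulary, not the depth, dominates model size.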

Training

  • Dataset: internal job-vacancy dataset (vacancies_labeled.csv), labeled by an LLM pipeline
  • Train/test split: 85% / 15%, stratified by role and team_lead flag
  • Loss: weighted cross-entropy (pos_weight = 2.115)
  • Optimizer: AdamW, lr=2e-5, linear warmup 10%, grad clip 1.0
  • Early stopping: patience=3 on F1 at target recall ≥ 0.95
  • Threshold selected to achieve target recall = 0.95
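
To make the weighted loss concrete, here is a dependency-free sketch of class-weighted cross-entropy over two-class logits. It assumes pos_weight is applied as the class weight for label 1 (weights [1.0, 2.115]) and that the mean is normalized by the sum of weights, which is how torch.nn.CrossEntropyLoss behaves with a weight tensor and mean reduction. The card does not say how the weight was applied in training, so treat this as one plausible reading.

```python
import math

POS_WEIGHT = 2.115  # value from the training settings above


def weighted_cross_entropy(logits, labels, pos_weight=POS_WEIGHT):
    """Weighted mean cross-entropy for 2-class logits.

    logits: list of (z_other, z_tech) pairs; labels: list of 0/1.
    Class weights are assumed to be (1.0, pos_weight).
    """
    weights = (1.0, pos_weight)
    total_loss = 0.0
    total_weight = 0.0
    for (z0, z1), y in zip(logits, labels):
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(z0, z1)
        log_norm = m + math.log(math.exp(z0 - m) + math.exp(z1 - m))
        log_prob = (z0, z1)[y] - log_norm
        total_loss += -weights[y] * log_prob
        total_weight += weights[y]
    return total_loss / total_weight


# Toy batch: one correct negative, one missed positive.
logits = [(2.0, -2.0), (2.0, -2.0)]
labels = [0, 1]
print(weighted_cross_entropy(logits, labels, pos_weight=1.0))
print(weighted_cross_entropy(logits, labels))
```

With the weight above 1, the missed positive contributes more to the loss than it would unweighted, which pushes training toward higher recall on the tech class.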

Limitations

  • Trained primarily on Russian-language IT job vacancies; quality on other domains/languages is not guaranteed.
  • Team lead and management roles are treated as other by design.
  • Description is truncated to 2000 characters before tokenization.
  • The model groups developers, Data Analysts, and Business Analysts into one positive class; it does not distinguish between them.
  • Data Analyst recall is ~0.92: vacancies with heavy business/finance framing may be missed.

Version

Hub tag: v2.0-dev-da-ba-r95

Changelog vs v1:

  • Added Business Analyst (Бизнес аналитик) to positive class
  • Input context extended: max_length 256→384, description 1200→2000 chars
  • Precision improved: 0.880 → 0.922
  • Learning rate lowered to 2e-5; batch size reduced 32→24 to accommodate longer sequences

License

MIT.
