---
language:
- code
tags:
- python
- java
- cpp
- ai-detection
- code-analysis
- temporal-cnn
- codet5
metrics:
- f1: 0.9813
---

# ai_code_detect

Binary classifier: human-written vs. AI-generated code. Trained on 500k samples (Python, Java, C++). Macro F1: **0.9813**.

---

## Architecture

Two input streams fused into a single MLP classifier.

**Stream 1: Probabilistic**

Code is passed through `Salesforce/codegen-350M-mono`, and per-token surprisal signals are extracted over a 256-token window:

| # | Feature | Description |
|---|---------|-------------|
| 0 | `log_prob` | Log-probability of the actual token |
| 1 | `log_rank` | Log-rank of the actual token within the predicted distribution |
| 2 | `entropy` | Shannon entropy of the token distribution |
| 3 | `varentropy` | Variance of entropy |
| 4 | `top10_mass` | Probability mass in the top-10 tokens |
| 5 | `gap_1_2` | Log-prob gap between the rank-1 and rank-2 tokens |
| 6 | `surprisal_z` | Per-token surprisal z-score |
| 7 | `entropy_delta` | Entropy change from the previous position |
| 8 | `cum_rank` | Cumulative mean log-rank |
| 9 | `is_special` | Special-token flag |
| 10 | `r10_flag` | Rank ≤ 10 |
| 11 | `r100_flag` | 10 < rank ≤ 100 |

These 12 per-token features are aggregated into 32 sequence-level statistics (moments, autocorrelations, burstiness, etc.) and passed downstream.

**Stream 2: Semantic**

`Salesforce/codet5-base` hidden states are mean-pooled into a 768-dim embedding that captures style, structure, naming, and comment density.

**Classifier**

The token (256-dim), sequence (64-dim), and semantic (768-dim) representations are concatenated into a 1088-dim vector and fed to a 3-layer MLP with LayerNorm, GELU, and dropout, followed by a sigmoid.
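As a rough illustration of Stream 1, the sketch below derives a subset of the per-token features (`log_prob`, `log_rank`, `entropy`, `top10_mass`, `gap_1_2`, and the rank flags) from a matrix of next-token logits. This is a minimal NumPy reconstruction, not the repo's actual feature extractor; in the real pipeline the logits would come from `Salesforce/codegen-350M-mono`.

```python
import numpy as np

def per_token_features(logits, token_ids):
    """Sketch: surprisal features from causal-LM logits.

    logits: (T, V) next-token logits; token_ids: (T,) actual token ids.
    Returns a (T, 7) feature matrix (a subset of the card's 12 features).
    """
    # Numerically stable log-softmax.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)

    T = len(token_ids)
    log_prob = log_probs[np.arange(T), token_ids]            # log-prob of the actual token
    rank = (log_probs > log_prob[:, None]).sum(axis=-1) + 1  # 1 = most likely
    log_rank = np.log(rank)
    entropy = -(probs * log_probs).sum(axis=-1)              # Shannon entropy
    top10_mass = np.sort(probs, axis=-1)[:, -10:].sum(-1)    # mass in the top-10 tokens
    top2 = np.sort(log_probs, axis=-1)[:, -2:]
    gap_1_2 = top2[:, 1] - top2[:, 0]                        # rank-1 minus rank-2 log-prob
    r10_flag = (rank <= 10).astype(float)
    r100_flag = ((rank > 10) & (rank <= 100)).astype(float)
    return np.stack([log_prob, log_rank, entropy, top10_mass,
                     gap_1_2, r10_flag, r100_flag], axis=1)
```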
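The card says the per-token features are collapsed into sequence-level statistics (moments, autocorrelations, burstiness). A minimal sketch of that aggregation step, with hypothetical stat choices (mean, std, lag-1 autocorrelation, and a simple (std − mean)/(std + mean) burstiness score per feature):

```python
import numpy as np

def aggregate_sequence_stats(feats):
    """Sketch: collapse (T, F) per-token features into 4F sequence stats."""
    mean = feats.mean(axis=0)
    std = feats.std(axis=0)
    centered = feats - mean
    # Lag-1 autocorrelation per feature; 0 where the feature is constant.
    denom = (centered ** 2).sum(axis=0)
    num = (centered[:-1] * centered[1:]).sum(axis=0)
    autocorr = np.divide(num, denom, out=np.zeros_like(num), where=denom > 0)
    # Burstiness in [-1, 1]: -1 for constant signals, -> 1 for highly bursty ones.
    burstiness = np.divide(std - mean, std + mean,
                           out=np.zeros_like(std), where=(std + mean) != 0)
    return np.concatenate([mean, std, autocorr, burstiness])
```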
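Stream 2's pooling is a standard mask-aware mean over encoder hidden states. A minimal sketch (the real embedding would come from `Salesforce/codet5-base`; this just shows the pooling arithmetic):

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Sketch: mask-aware mean pooling.

    hidden_states: (T, 768) encoder outputs; attention_mask: (T,) of 0/1.
    Padding positions (mask 0) are excluded from the average.
    """
    mask = attention_mask[:, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=0)
    count = np.maximum(mask.sum(), 1.0)  # guard against an all-padding input
    return summed / count
```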
---

## Performance

Evaluated on 3,000 balanced validation samples (1,000 per language):

| Metric | Score |
|--------|-------|
| Macro F1 | **0.9813** |
| Accuracy | **98.13%** |
| Decision threshold | 0.475 |

Per-language breakdown (p̄ = mean predicted probability of AI authorship):

| Language | Accuracy | Human p̄ | AI p̄ | Gap |
|----------|----------|----------|-------|-----|
| Python | 99.50% | 0.001 | 0.992 | 0.991 |
| Java | 98.00% | 0.043 | 0.968 | 0.926 |
| C++ | 96.90% | 0.063 | 0.966 | 0.903 |

---

## Training

| Setting | Value |
|---------|-------|
| Optimizer | AdamW (encoder lr 8e-6, head lr 3e-5) |
| Scheduler | OneCycleLR with cosine annealing |
| Loss | BCEWithLogitsLoss |
| Regularization | EMA (decay=0.998), dropout, LayerNorm |
| Precision | fp16 via HuggingFace Accelerate |
| Hardware | 2× GPU |
| Epochs | 4 (500k samples) |

---

## How To Use

```python
import os
import sys

from huggingface_hub import hf_hub_download

REPO_ID = "santh-cpu/ai_code_detect"

# Download the inference script from the model repo and import it.
script_path = hf_hub_download(repo_id=REPO_ID, filename="model.py")
sys.path.append(os.path.dirname(script_path))

from model import predict

print(predict("your code here"))
```
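For intuition, the fusion head and the 0.475 decision threshold can be sketched end to end in NumPy. This is illustrative only: the hidden-layer sizes are hypothetical (the card does not state them), and dropout is omitted since it is inactive at inference.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def classify(token_repr, seq_repr, sem_repr, weights, threshold=0.475):
    """Sketch: concatenate the three streams and run a 3-layer MLP head.

    token_repr: (256,), seq_repr: (64,), sem_repr: (768,).
    weights: list of (W, b) pairs; the last pair maps to a single logit.
    """
    x = np.concatenate([token_repr, seq_repr, sem_repr])  # (1088,)
    for W, b in weights[:-1]:
        x = gelu(layer_norm(x @ W + b))
    W, b = weights[-1]
    logit = (x @ W + b).item()
    p = 1.0 / (1.0 + np.exp(-logit))  # sigmoid -> P(AI-generated)
    return p, p >= threshold
```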