VulBERTa-MLP-D2A

Maintained by: claudios

Property         Value
Parameter Count  125M
Model Type       Text Classification
License          MIT
Paper            arXiv:2205.12424
Metrics          Accuracy: 64.71%, F1: 56.93%, ROC-AUC: 71.02%

What is VulBERTa-MLP-D2A?

VulBERTa-MLP-D2A is a deep learning model for detecting security vulnerabilities in source code. It is built on the RoBERTa architecture and uses a custom tokenization pipeline designed for C/C++ code. The model combines pre-training on source code with a classifier that can be applied directly to C/C++ input.

Implementation Details

The model places a Multi-Layer Perceptron (MLP) classification head on top of a pre-trained RoBERTa backbone. Tokenization requires libclang and follows a simplified process with built-in comment removal. Weights are stored as F32 tensors, and the model is trained on the CodeXGLUE Devign dataset.
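To make the "MLP head on a backbone" idea concrete, here is a minimal sketch of a generic two-layer classification head in plain Python. It assumes the backbone has already produced one pooled vector per input. The layer sizes, the tanh activation, and the random weights are illustrative assumptions, not the model's actual configuration or parameters.

```python
import math
import random

def linear(x, weights, bias):
    """Compute weights @ x + bias for a single input vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mlp_head(pooled, w1, b1, w2, b2):
    """Generic MLP head: dense layer, tanh, then a projection to class logits."""
    hidden = [math.tanh(h) for h in linear(pooled, w1, b1)]
    return softmax(linear(hidden, w2, b2))

def rand_matrix(rows, cols):
    """Small random weights, standing in for trained parameters (assumption)."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
HIDDEN = 8       # toy size; a real backbone uses a much larger hidden dimension
NUM_LABELS = 2   # binary case: not vulnerable vs. vulnerable

w1, b1 = rand_matrix(HIDDEN, HIDDEN), [0.0] * HIDDEN
w2, b2 = rand_matrix(NUM_LABELS, HIDDEN), [0.0] * NUM_LABELS

# Stand-in for the pooled sequence representation produced by the backbone.
pooled = [random.uniform(-1.0, 1.0) for _ in range(HIDDEN)]
probs = mlp_head(pooled, w1, b1, w2, b2)
print(probs)
```

The head adds only two small weight matrices on top of the backbone, so nearly all of the parameter count sits in the pre-trained encoder.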

  • Custom tokenization pipeline optimized for source code
  • Pre-trained on real-world C/C++ projects
  • Simplified implementation requiring minimal setup
  • Integrated comment removal in tokenization process
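As an illustration of what comment removal before tokenization involves, the sketch below strips `//` and `/* */` comments from C/C++ text while leaving string and character literals untouched. It is a plain-Python stand-in written for this explanation, not the model's libclang-based tokenizer. The function name `strip_c_comments` is hypothetical, and the sketch ignores corner cases such as C++ raw string literals and backslash-continued line comments.

```python
def strip_c_comments(source: str) -> str:
    """Remove // and /* */ comments, keeping string and char literals intact."""
    out = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        nxt = source[i + 1] if i + 1 < n else ""
        if ch in ('"', "'"):
            # Copy a string or character literal verbatim, honoring escapes.
            quote = ch
            out.append(ch)
            i += 1
            while i < n:
                out.append(source[i])
                if source[i] == "\\" and i + 1 < n:
                    out.append(source[i + 1])  # keep the escaped character
                    i += 2
                    continue
                if source[i] == quote:
                    i += 1
                    break
                i += 1
        elif ch == "/" and nxt == "/":
            # Line comment: skip to the end of the line, keep the newline.
            while i < n and source[i] != "\n":
                i += 1
        elif ch == "/" and nxt == "*":
            # Block comment: skip past the closing */ (or to end of input).
            end = source.find("*/", i + 2)
            i = n if end == -1 else end + 2
            out.append(" ")  # a space keeps neighboring tokens from merging
        else:
            out.append(ch)
            i += 1
    return "".join(out)


code = 'int x = 1; // set x\nchar *s = "a // b"; /* note */ return x;\n'
result = strip_c_comments(code)
print(result)
```

The `//` inside the string literal survives because literals are copied before comment markers are checked. A naive regular expression would get exactly this case wrong.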

Core Capabilities

  • Binary and multi-class vulnerability detection
  • Support for C/C++ code analysis
  • ROC-AUC of 71.02%
  • Precision of 64.80% and recall of 50.76% (F1: 56.93%)
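The reported figures can be cross-checked against each other, since F1 is the harmonic mean of precision and recall. The short sketch below recomputes F1 from the precision and recall values listed above. The helper name `f1_from` is chosen here for illustration.

```python
def f1_from(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

precision = 0.6480  # reported precision
recall = 0.5076     # reported recall

f1 = f1_from(precision, recall)
print(f"F1 = {f1:.2%}")  # prints "F1 = 56.93%"
```

The result matches the F1 value in the metrics table. Because recall is the lower of the two inputs, it pulls F1 well below the precision figure.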

Frequently Asked Questions

Q: What makes this model unique?

VulBERTa-MLP-D2A stands out for its conceptual simplicity. According to the VulBERTa paper (arXiv:2205.12424), the approach achieves state-of-the-art performance across several vulnerability detection datasets while using less training data and fewer parameters than comparable models.

Q: What are the recommended use cases?

The model suits automated security reviews of C/C++ codebases, security scanning in continuous integration pipelines, and research on vulnerability detection. It fits organizations that want to add automated code security analysis to an existing review process.
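For the continuous integration use case, one possible arrangement is to score each function, flag those above a chosen threshold, and fail the build when anything is flagged. The sketch below shows only that gating step. The `Finding` type, the function names, the scores, and the 0.6 threshold are all hypothetical placeholders, not outputs or recommendations from the model.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    function_name: str
    vulnerable_probability: float  # assumed to come from the classifier

def triage(findings, threshold=0.5):
    """Keep findings at or above the threshold, highest probability first."""
    flagged = [f for f in findings if f.vulnerable_probability >= threshold]
    return sorted(flagged, key=lambda f: f.vulnerable_probability, reverse=True)

def exit_code(flagged):
    """Non-zero exit code fails the CI job when anything is flagged."""
    return 1 if flagged else 0

# Hypothetical scores for three functions in a codebase.
findings = [
    Finding("parse_header", 0.91),
    Finding("init_logger", 0.12),
    Finding("copy_packet", 0.67),
]

flagged = triage(findings, threshold=0.6)
for f in flagged:
    print(f"{f.function_name}: {f.vulnerable_probability:.2f}")
print("exit code:", exit_code(flagged))
```

The threshold trades missed vulnerabilities against false alarms, so it is worth tuning on a labeled sample of the target codebase before the check is allowed to block merges.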