---
license: mit
tags:
- code-understanding
- unixcoder
pipeline_tag: feature-extraction
---
# RepoSim4Py

An embedding-based tool for comparing the semantic similarity of Python repositories, using several sources of information from each repository.

## Model Details

**RepoSim4Py** is a pipeline, built on the HuggingFace platform, for generating embeddings for specified GitHub Python repositories.
For each repository, it generates embeddings at several levels from the source code, code documentation, requirements, and README files within the repository.
Averaging the embeddings at each level yields a repository-level mean embedding.
These embeddings can be used to compute semantic similarity at any of these levels, for example with cosine similarity.
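As a minimal sketch of the comparison step (the vectors below are toy values, not real model outputs), cosine similarity between two such embeddings can be computed with NumPy:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors of any shape."""
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for two repository-level embeddings
# (real ones are 768- or 3072-dimensional).
repo_a = np.array([0.2, -1.3, 0.7])
repo_b = np.array([0.1, -1.1, 0.9])
print(cosine_similarity(repo_a, repo_b))
```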
### Model Description

The model used by **RepoSim4Py** is **UniXcoder**, fine-tuned on the [code search task](https://github.com/microsoft/CodeBERT/tree/master/UniXcoder/downstream-tasks/code-search) using the [AdvTest](https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv) dataset.

- **Pipeline developed by:** [Henry65](https://huggingface.co/Henry65)
- **Repository:** [RepoSim4Py](https://github.com/RepoMining/RepoSim4Py)
- **Model type:** **code understanding**
- **Language(s):** **Python**
- **License:** **MIT**

### Model Sources

- **Repository:** [UniXcoder](https://github.com/microsoft/CodeBERT/tree/master/UniXcoder)
- **Paper:** [UniXcoder: Unified Cross-Modal Pre-training for Code Representation](https://arxiv.org/pdf/2203.03850.pdf)
## Uses

Below is an example of how to use the RepoSim4Py pipeline to generate embeddings for GitHub Python repositories.

First, initialise the pipeline:

```python
from transformers import pipeline

model = pipeline(model="Henry65/RepoSim4Py", trust_remote_code=True)
```

Then pass one repository (or several, as a tuple) as input and get the result as a list of dictionaries:

```python
repo_infos = model("lazyhope/python-hello-world")
print(repo_infos)
```
Output (long numpy array outputs are omitted):

```python
[{'name': 'lazyhope/python-hello-world',
  'topics': [],
  'license': 'MIT',
  'stars': 0,
  'code_embeddings': array([[-2.07551336e+00, 2.81387949e+00, 2.35216689e+00, ...]], dtype=float32),
  'mean_code_embedding': array([[-2.07551336e+00, 2.81387949e+00, 2.35216689e+00, ...]], dtype=float32),
  'doc_embeddings': array([[-2.37494540e+00, 5.40957630e-01, 2.29580235e+00, ...]], dtype=float32),
  'mean_doc_embedding': array([[-2.37494540e+00, 5.40957630e-01, 2.29580235e+00, ...]], dtype=float32),
  'requirement_embeddings': array([[0., 0., 0., ...]], dtype=float32),
  'mean_requirement_embedding': array([[0., 0., 0., ...]], dtype=float32),
  'readme_embeddings': array([[-2.1671042 , 2.8404987 , 1.4761417 , ...]], dtype=float32),
  'mean_readme_embedding': array([[-1.91171765e+00, 1.65386486e+00, 9.49612021e-01, ...]], dtype=float32),
  'mean_repo_embedding': array([[-2.0755134, 2.8138795, 2.352167 , ...]], dtype=float32),
  'code_embeddings_shape': (1, 768),
  'mean_code_embedding_shape': (1, 768),
  'doc_embeddings_shape': (1, 768),
  'mean_doc_embedding_shape': (1, 768),
  'requirement_embeddings_shape': (1, 768),
  'mean_requirement_embedding_shape': (1, 768),
  'readme_embeddings_shape': (3, 768),
  'mean_readme_embedding_shape': (1, 768),
  'mean_repo_embedding_shape': (1, 3072)}]
```
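As a sketch of how this output might be used downstream (the repository names and random vectors below are placeholders; only the `mean_repo_embedding` key and its `(1, 3072)` shape come from the output structure above), two repositories' mean embeddings can be compared with cosine similarity:

```python
import numpy as np

# Placeholder results mimicking the pipeline's output structure;
# real embeddings would come from the RepoSim4Py pipeline.
repo_infos = [
    {"name": "org/repo-a", "mean_repo_embedding": np.random.rand(1, 3072).astype(np.float32)},
    {"name": "org/repo-b", "mean_repo_embedding": np.random.rand(1, 3072).astype(np.float32)},
]

a = repo_infos[0]["mean_repo_embedding"].ravel()
b = repo_infos[1]["mean_repo_embedding"].ravel()
similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"{repo_infos[0]['name']} vs {repo_infos[1]['name']}: {similarity:.4f}")
```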
For more details, please refer to [Example.py](https://github.com/RepoMining/RepoSim4Py/blob/main/Script/Example.py). Note that the "github_token" is not required.
## Training Details

Please follow the original [UniXcoder](https://github.com/microsoft/CodeBERT/tree/master/UniXcoder/downstream-tasks/code-search) page for details of fine-tuning it on the code search task.

## Evaluation

We used the [awesome-python](https://github.com/vinta/awesome-python) list, which contains over 400 Python repositories categorised by topic, to label similar repositories.
The evaluation metrics and results can be found in the RepoSim4Py repository, under the [Embedding](https://github.com/RepoMining/RepoSim4Py/tree/main/Embedding) folder.
## Acknowledgements

Many thanks to the authors of the UniXcoder model and the AdvTest dataset, as well as the awesome-python list, for providing a useful baseline.

- **UniXcoder** (https://github.com/microsoft/CodeBERT/tree/master/UniXcoder)
- **AdvTest** (https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv)
- **awesome-python** (https://github.com/vinta/awesome-python)

## Authors

- **Honglin Zhang** (https://github.com/liaomu0926)
- **Rosa Filgueira** (https://www.rosafilgueira.com)