Datasets, tools, and benchmarks for representation learning of code.

Overview


The CodeSearchNet challenge has been concluded

We would like to thank all participants for their submissions, and we hope that this challenge provided practitioners and researchers with insights into the challenges of semantic code search and motivated new research. We encourage everyone to continue using the dataset and the human evaluations, which we now provide publicly. Please see below for details, specifically the Evaluation section.

No new submissions to the challenge will be accepted.


Quickstart

If this is your first time reading this, we recommend skipping this section and reading the following sections first. The commands below assume you have Docker and Nvidia-Docker installed, as well as a GPU that supports CUDA 9.0 or greater. Note: you should only have to run script/setup once to download the data.

# clone this repository
git clone https://github.com/github/CodeSearchNet.git
cd CodeSearchNet/
# download data (~3.5GB) from S3; build and run the Docker container
script/setup
# this will drop you into the shell inside a Docker container
script/console
# optional: log in to W&B to see your training metrics,
# track your experiments, and submit your models to the benchmark
wandb login

# verify your setup by training a tiny model
python train.py --testrun
# see other command line options, try a full training run with default values,
# and explore other model variants by extending this baseline script
python train.py --help
python train.py

# generate predictions for model evaluation
python predict.py -r github/CodeSearchNet/0123456 # this is the org/project_name/run_id

Finally, you can submit your run to the community benchmark by following these instructions.

Introduction

Project Overview

CodeSearchNet is a collection of datasets and benchmarks that explore the problem of code retrieval using natural language. This research is a continuation of some ideas presented in this blog post and is a joint collaboration between GitHub and the Deep Program Understanding group at Microsoft Research - Cambridge. We aim to provide a platform for community research on semantic code search via the following:

  1. Instructions for obtaining large corpora of relevant data
  2. Open source code for a range of baseline models, along with pre-trained weights
  3. Baseline evaluation metrics and utilities
  4. Mechanisms to track progress on a shared community benchmark hosted by Weights & Biases

We hope that CodeSearchNet is a step towards engaging with the broader machine learning and NLP community regarding the relationship between source code and natural language. We describe a specific task here, but we expect and welcome other uses of our dataset.

More context regarding the motivation for this problem is in this technical report. Please cite the dataset and the challenge as:

@article{husain2019codesearchnet,
  title={{CodeSearchNet} challenge: Evaluating the state of semantic code search},
  author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  journal={arXiv preprint arXiv:1909.09436},
  year={2019}
}

Data

The primary dataset consists of 2 million (comment, code) pairs from open source libraries. Concretely, a comment is a top-level function or method comment (e.g. docstrings in Python), and code is an entire function or method. Currently, the dataset contains Python, JavaScript, Ruby, Go, Java, and PHP code. Throughout this repo, we refer to the terms docstring and query interchangeably. We partition the data into train, validation, and test splits such that code from the same repository can only exist in one partition. Currently this is the only dataset on which we train our model. Summary statistics about this dataset can be found in this notebook.

For more information about how to obtain the data, see this section.

Evaluation

The metric we use for evaluation is Normalized Discounted Cumulative Gain. Please reference this paper for further details regarding model evaluation. The evaluation script can be found here.
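
For intuition, the following is a minimal, hypothetical sketch of NDCG for a single query; it is an illustration only (the relevance values stand in for the 0-3 human judgements described below), not the evaluation script shipped with this repository.

import math

def dcg(relevances):
    # Discounted cumulative gain of a ranked list of relevance scores.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    # DCG of the predicted ranking divided by the DCG of the ideal (sorted) ranking.
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance judgements (0-3) in the order the model ranked the snippets:
print(ndcg([3, 0, 2, 1]))  # ~0.93; 1.0 would mean the ideal ordering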

Annotations

We manually annotated retrieval results for the six languages from 99 general queries. This dataset is used as ground-truth data for evaluation only. Please refer to this paper for further details on the annotation process. These annotations were used to compute the scores on the leaderboard. Now that the competition has concluded, you can find the annotations, along with the annotator comments, here.

Setup

You should only have to perform the setup steps once to download the data and prepare the environment.

  1. Due to the complexity of installing all dependencies, we prepared Docker containers to run this code. You can find instructions on how to install Docker in the official docs. Additionally, you must install Nvidia-Docker to satisfy GPU-compute related dependencies. For those who are new to Docker, this blog post provides a gentle introduction focused on data science.

  2. After installing Docker, you need to download the pre-processed datasets, which are hosted on S3. You can do this by running script/setup.

    script/setup
    

    This will build Docker containers and download the datasets. By default, the data is downloaded into the resources/data/ folder inside this repository, with the directory structure described here.

The datasets you will download (most of them compressed) have a combined size of only ~3.5 GB.

  3. To start the Docker container, run script/console:
    script/console
    
    This will land you inside the Docker container, starting in the /src directory. You can detach from/attach to this container to pause/continue your work.

For more about the data, see Data Details below, as well as this notebook.

Data Details

Data Acquisition

If you have run the setup steps above, you will already have the data, and nothing more needs to be done. The data will be available in the /resources/data folder of this repository, with this directory structure.

Schema & Format

Data is stored in jsonlines format. Each line in the uncompressed file represents one example (usually a function with an associated comment). A prettified example of one row is illustrated below.

  • repo: the owner/repo
  • path: the full path to the original file
  • func_name: the function or method name
  • original_string: the raw string before tokenization or parsing
  • language: the programming language
  • code: the part of the original_string that is code
  • code_tokens: tokenized version of code
  • docstring: the top-level comment or docstring, if it exists in the original string
  • docstring_tokens: tokenized version of docstring
  • sha: this field is not being used [TODO: add note on where this comes from?]
  • partition: a flag indicating which partition this datum belongs to (train, valid, test, etc.). This flag is not used by the model; instead, we rely on the directory structure to denote the partition of the data.
  • url: the url for the code snippet including the line numbers

Code, comments, and docstrings are extracted in a language-specific manner, removing artifacts of that language.

{
  'code': 'def get_vid_from_url(url):\n'
          '        """Extracts video ID from URL.\n'
          '        """\n'
          "        return match1(url, r'youtu\\.be/([^?/]+)') or \\\n"
          "          match1(url, r'youtube\\.com/embed/([^/?]+)') or \\\n"
          "          match1(url, r'youtube\\.com/v/([^/?]+)') or \\\n"
          "          match1(url, r'youtube\\.com/watch/([^/?]+)') or \\\n"
          "          parse_query_param(url, 'v') or \\\n"
          "          parse_query_param(parse_query_param(url, 'u'), 'v')",
  'code_tokens': ['def',
                  'get_vid_from_url',
                  '(',
                  'url',
                  ')',
                  ':',
                  'return',
                  'match1',
                  '(',
                  'url',
                  ',',
                  "r'youtu\\.be/([^?/]+)'",
                  ')',
                  'or',
                  'match1',
                  '(',
                  'url',
                  ',',
                  "r'youtube\\.com/embed/([^/?]+)'",
                  ')',
                  'or',
                  'match1',
                  '(',
                  'url',
                  ',',
                  "r'youtube\\.com/v/([^/?]+)'",
                  ')',
                  'or',
                  'match1',
                  '(',
                  'url',
                  ',',
                  "r'youtube\\.com/watch/([^/?]+)'",
                  ')',
                  'or',
                  'parse_query_param',
                  '(',
                  'url',
                  ',',
                  "'v'",
                  ')',
                  'or',
                  'parse_query_param',
                  '(',
                  'parse_query_param',
                  '(',
                  'url',
                  ',',
                  "'u'",
                  ')',
                  ',',
                  "'v'",
                  ')'],
  'docstring': 'Extracts video ID from URL.',
  'docstring_tokens': ['Extracts', 'video', 'ID', 'from', 'URL', '.'],
  'func_name': 'YouTube.get_vid_from_url',
  'language': 'python',
  'original_string': 'def get_vid_from_url(url):\n'
                      '        """Extracts video ID from URL.\n'
                      '        """\n'
                      "        return match1(url, r'youtu\\.be/([^?/]+)') or \\\n"
                      "          match1(url, r'youtube\\.com/embed/([^/?]+)') or "
                      '\\\n'
                      "          match1(url, r'youtube\\.com/v/([^/?]+)') or \\\n"
                      "          match1(url, r'youtube\\.com/watch/([^/?]+)') or "
                      '\\\n'
                      "          parse_query_param(url, 'v') or \\\n"
                      "          parse_query_param(parse_query_param(url, 'u'), "
                      "'v')",
  'partition': 'test',
  'path': 'src/you_get/extractors/youtube.py',
  'repo': 'soimort/you-get',
  'sha': 'b746ac01c9f39de94cac2d56f665285b0523b974',
  'url': 'https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/youtube.py#L135-L143'
}

Summary statistics such as row counts and token length histograms can be found in this notebook.
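
If you want to inspect the data outside the provided notebooks, a minimal sketch for reading one of the gzipped jsonlines files is shown below; the exact file name is an assumption based on the default resources/data/ layout created by script/setup.

import gzip
import json

# Assumed path under the default download layout; adjust as needed.
path = "resources/data/python/final/jsonl/train/python_train_0.jsonl.gz"

with gzip.open(path, "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        example = json.loads(line)  # one function/docstring pair per line
        print(example["func_name"], len(example["code_tokens"]), example["docstring"][:60])
        if i >= 4:  # just peek at the first few rows
            break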

Downloading Data from S3

The shell script /script/setup will automatically download these files into the /resources/data directory. Here are the links to the relevant files for visibility:

The S3 links follow this pattern:

https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/{python,java,go,php,javascript,ruby}.zip

For example, the link for Java is:

https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/java.zip

The full dataset is approximately 20 GB; the compressed per-language downloads total ~3.5 GB. The various files and the directory structure are explained here.
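
If you prefer not to use script/setup, a rough sketch for downloading and unpacking a single language archive with only the Python standard library could look like this (the target directory mirrors the default layout and is an assumption):

import urllib.request
import zipfile

# Any of: python, java, go, php, javascript, ruby
url = "https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/java.zip"
archive = "java.zip"

urllib.request.urlretrieve(url, archive)   # download the compressed archive
with zipfile.ZipFile(archive) as zf:
    zf.extractall("resources/data")        # unpack next to the other languages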

Human Relevance Judgements

To train neural models with a large dataset we use the documentation comments (e.g. docstrings) as a proxy. For evaluation (and the leaderboard), we collected human relevance judgements of pairs of realistic-looking natural language queries and code snippets. Now that the challenge has been concluded, we provide the data here as a .csv, with the following fields:

  • Language: the programming language of the snippet.
  • Query: the natural language query.
  • GitHubUrl: the URL of the target snippet. This matches the url key in the data (see here).
  • Relevance: the 0-3 human relevance judgement, where "3" is the highest score (very relevant) and "0" is the lowest (irrelevant).
  • Notes: a free-text field with notes that annotators optionally provided.
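
As a quick illustration, here is a minimal sketch for loading the annotations with pandas and summarizing them per language; the file name relevance_annotations.csv is a placeholder, so substitute the name of the .csv you actually downloaded.

import pandas as pd

# Placeholder file name; use the actual .csv provided with the challenge data.
annotations = pd.read_csv("relevance_annotations.csv")

# Columns per the description above: Language, Query, GitHubUrl, Relevance, Notes
print(annotations["Language"].value_counts())                # judgements per language
print(annotations.groupby("Language")["Relevance"].mean())   # mean 0-3 relevance per language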

Running Our Baseline Model

We encourage you to reproduce and extend these models, though most variants take several hours to train (and some take more than 24 hours on an AWS P3-V100 instance).

Model Architecture

Our baseline models ingest a parallel corpus of (comments, code) and learn to retrieve a code snippet given a natural language query. Specifically, comments are top-level function and method comments (e.g. docstrings in Python), and code is an entire function or method. Throughout this repo, we refer to the terms docstring and query interchangeably.

The query has a single encoder, whereas each programming language has its own encoder. The available encoders are Neural-Bag-Of-Words, RNN, 1D-CNN, Self-Attention (BERT), and a 1D-CNN+Self-Attention Hybrid.

The diagram below illustrates the general architecture of our baseline models:

[Baseline model architecture diagram]
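
In short, the query encoder and each per-language code encoder map their inputs into a shared vector space, and retrieval ranks candidate snippets by their similarity to the encoded query. A minimal, hypothetical sketch of that retrieval step (not the repository's implementation) is:

import numpy as np

def rank_snippets(query_vec, code_vecs):
    # Rank candidate code snippets by cosine similarity to an encoded query.
    # query_vec: (d,) query embedding; code_vecs: (n, d) snippet embeddings.
    q = query_vec / np.linalg.norm(query_vec)
    c = code_vecs / np.linalg.norm(code_vecs, axis=1, keepdims=True)
    return np.argsort(-(c @ q))  # snippet indices from most to least similar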

Training

This step assumes that you have a suitable Nvidia GPU with CUDA 9.0 installed. We used AWS P3-V100 instances (a p3.2xlarge is sufficient).

  1. Start the model run environment by running script/console:

    script/console
    

    This will drop you into the shell of a Docker container with all necessary dependencies installed, including the code in this repository, along with data that you downloaded earlier. By default, you will be placed in the src/ folder of this GitHub repository. From here you can execute commands to run the model.

  2. Set up W&B (free for open source projects) per the instructions below if you would like to share your results on the community benchmark. This is optional but highly recommended.

  3. The entry point to this model is src/train.py. You can see various options by executing the following command:

    python train.py --help
    

    To test if everything is working on a small dataset, you can run the following command:

    python train.py --testrun
    
  4. Now you are prepared for a full training run. Example commands to kick off training runs:

  • Training a neural-bag-of-words model on all languages

    python train.py --model neuralbow
    

    The above command will assume default values for the location(s) of the training data and a destination where you would like to save the output model. The default location for training data is specified in /src/data_dirs_{train,valid,test}.txt. These files each contain a list of paths where data for the corresponding partition exists. If more than one path is specified (separated by a newline), the data from all the paths will be concatenated together. For example, this is the content of src/data_dirs_train.txt:

    $ cat data_dirs_train.txt
    ../resources/data/python/final/jsonl/train
    ../resources/data/javascript/final/jsonl/train
    ../resources/data/java/final/jsonl/train
    ../resources/data/php/final/jsonl/train
    ../resources/data/ruby/final/jsonl/train
    ../resources/data/go/final/jsonl/train
    

    By default, models are saved in the resources/saved_models folder of this repository.

  • Training a 1D-CNN model on Python data only:

    python train.py --model 1dcnn /trained_models ../resources/data/python/final/jsonl/train ../resources/data/python/final/jsonl/valid ../resources/data/python/final/jsonl/test
    

    The above command overrides the default locations for saving the model to trained_models and also overrides the source of the train, validation, and test sets.

Additional notes:

  • Options for --model are currently listed in src/model_restore_helper.get_model_class_from_name.

  • Hyperparameters are specific to the respective model/encoder classes. A simple trick to discover them is to kick off a run without specifying hyperparameter choices, as that will print a list of all used hyperparameters with their default values (in JSON format).

References

Benchmark

We are using a community benchmark for this project to encourage collaboration and improve reproducibility. It is hosted by Weights & Biases (W&B), which is free for open source projects. Our entries in the benchmark link to detailed logs of our training and evaluation metrics, as well as model artifacts, and we encourage other participants to provide as much detail as possible.

We invite the community to submit their runs to this benchmark to facilitate transparency by following these instructions.

How to Contribute

We anticipate that the community will design custom architectures and use frameworks other than Tensorflow. Furthermore, we anticipate that additional datasets will be useful. It is not our intention to integrate these models, approaches, and datasets into this repository as a superset of all available ideas. Rather, we intend to maintain the baseline models and links to the data in this repository as a central place of reference. We are accepting PRs that update the documentation, link to your project(s) with improved benchmarks, fix bugs, or make minor improvements to the code. Here are more specific guidelines for contributing to this repository; note particularly our Code of Conduct. Please open an issue if you are unsure of the best course of action.


W&B Setup

To initialize W&B:

  1. Navigate to the /src directory in this repository.

  2. If it's your first time using W&B on a machine, you will need to log in:

    $ wandb login
    
  3. You will be asked for your API key, which appears on your W&B profile settings page.

Licenses

The licenses for source code used as data for this project are provided with the data download for each language in _licenses.pkl files.

The code and documentation for this project are released under the MIT License.

Comments
  • Request for a smaller dataset for researchers with lesser resources

    Request for a smaller dataset for researchers with lesser resources

    Thank you for making this amazing problem statement public, along with a very comprehensive dataset!

    Can a smaller dataset (a subset) be made available for independent developers/researchers who might try running this on their personal machines?

    This will open up the problem for a larger audience and may bring in some innovative solutions!

    opened by rajvijay68 8
  • How long does it usually take to review a run?

    How long does it usually take to review a run?

    Hi!

    I've made a custom model and now I'm trying to submit it to the leaderboard. Here is the run: https://app.wandb.ai/github/codesearchnet/runs/lqqo1i4m

    Above it says 'Awaiting review from codesearchnet benchmark' and that's probably the reason why I get an error when I try to 'Publish to GitHub'. Am I doing something wrong, or do I just have to wait?

    Thanks.

    opened by novoselrok 6
  • CodeChallenge run prediction on whole code corpus?

    CodeChallenge run prediction on whole code corpus?

    Did I get it right, that for a submission on the challenge, I have to run those 99 queries against the whole code corpus, and not just the test set? Thanks in advance :)

    opened by yss14 6
  • question: changing the default leaderboard order

    question: changing the default leaderboard order

    The default ordering on the leaderboard is "Mean NDCG" over all results that are not None.

    This means it is very easy to appear #1 by overfitting on a single language.

    What about computing "Mean NDCG" over all 6 languages and replacing None by 0?

    opened by monperrus 6
  • Extract descriptions from `@return` doc comments if method has no summary?

    Extract descriptions from `@return` doc comments if method has no summary?

    e.g. some short methods may contain a description in the return tag, but not a description of the method itself. (to avoid redundancy).

    Doing this would extract more methods, but they may be of lower quality if incorrectly parsed or automatically generated. I'd expect a description such as @return bool STARTDESCRIPTION true if this is a float to be extracted (I'm not familiar with how the data representation works)

    • Some libraries/applications are strict about always having a summary in their coding standards, others aren't.
    • e.g. I've seen @return the description without a type for php
    • Could try to annotate the fact that the fallback was used and this is a non-official summary
        /**
         * @return bool true if this is a float
         */
        public function isFloat()
    

    It would be nice to account for code such as @return HasTemplate<string, stdClass>, etc. Making sure that <([{ in the first token are matched up may be useful as a basic approach (and give up if that fails). (There's no official standard and different static analyzers have their own extensions to the type syntax)

    An example implementation for PHP is https://github.com/phan/phan/blob/2.2.12/src/Phan/Language/Element/MarkupDescription.php - examples of what it generates as descriptions are https://github.com/phan/phan/blob/2.2.12/tests/Phan/Language/Element/MarkupDescriptionTest.php#L132-L155

    enhancement php 
    opened by TysonAndre 6
  • Error while processing a single Python file.

    Error while processing a single Python file.

    In CodeSearchNet/function_parser/function_parser/demo.ipynb, I kept everything the same until the third cell, and then I ran processor.process_single_file(py_file_path), where py_file_path contains the complete path of the .py file that I want to process.

    After executing the above line I got the following error: unhashable type: 'tree_sitter.Node' in file function_parser/function_parser/parsers/language_parser.py.

    Am I missing something?

    opened by vikrant-sahu 5
  • wandb custom submission error

    wandb custom submission error

    I'm currently trying to submit a custom model. After entering the run, evaluating the NDCG score, and writing a brief note on how we approached the results, I get the following error:

    [screenshot: wandb_submission_error]

    Now, when I click refresh, I am redirected to the submission page again. When trying to re-submit, I get the following error: "Invalid CSV format. Please upload a well-formatted CSV file."

    opened by yss14 5
  • Preprocessing of docstrings can be improved

    Preprocessing of docstrings can be improved

    Hello,

    First, thanks for the challenge, the code, and the dataset! Really cool stuff that you're doing, and I want to work on this task. :)

    I've read the Contribution Guidelines and know that you will not change any of the preprocessing code, but nevertheless I want to discuss the preprocessing of the docstrings here in case someone wants to produce a similar dataset (or maybe v3 ;) ) .

    Current preprocessing of docstrings

    I read your code and it seems that this is the way you preprocess the docstrings:

    1. You extract the documentation from the method and strip c-style delimiters https://github.com/github/CodeSearchNet/blob/9356b3181eaa9d4a38df4d309018158fd23448cb/function_parser/function_parser/parsers/commentutils.py#L1
    2. Then you extract the relevant part of the docstring (which acts as a summary) using the following heuristics (a rough sketch follows this list):
       2.1 if \n\n is found, take the part before it;
       2.2 otherwise, take the part before the first @;
       2.3 otherwise, take the full docstring.
       https://github.com/github/CodeSearchNet/blob/9356b3181eaa9d4a38df4d309018158fd23448cb/function_parser/function_parser/parsers/commentutils.py#L18-L24
    3. Finally you tokenize the extracted summary using the following regex: https://github.com/github/CodeSearchNet/blob/9356b3181eaa9d4a38df4d309018158fd23448cb/function_parser/function_parser/parsers/language_parser.py#L5
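
    For readers who want to follow along, here is a rough re-implementation of the summary heuristic from step 2; it is a sketch for illustration, not the linked repository code.

    def get_docstring_summary(docstring):
        # 2.1: take the part before the first blank line, if there is one
        if '\n\n' in docstring:
            return docstring.split('\n\n')[0]
        # 2.2: otherwise take the part before the first @-tag
        if '@' in docstring:
            return docstring[:docstring.find('@')]
        # 2.3: otherwise keep the full docstring
        return docstring

    print(get_docstring_summary('Set servlet names.\n\n@param servletNames the names'))
    # -> 'Set servlet names.'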

    Problems with this pipeline

    This way of preprocessing produces a couple of results that are probably not wanted and could be improved.

    Extracting the summary

    Compare the first 12 lines of the tokenized docstrings of the java train set to the raw ones

    Bind indexed elements to the supplied collection .
    Set {
    Add {
    Set servlet names that the filter will be registered against . This will replace any previously specified servlet names .
    Add servlet names for the filter .
    Set the URL patterns that the filter will be registered against . This will replace any previously specified URL patterns .
    Add URL patterns as defined in the Servlet specification that the filter will be registered against .
    Convenience method to {
    Configure registration settings . Subclasses can override this method to perform additional configuration if required .
    Create a nested {
    Create a nested {
    Create a nested {
    

    As you can see, 6/12 are basically not usable.

    **Bind indexed elements to the supplied collection.** @param name the name of the property to bind @param target the target bindable @param elementBinder the binder to use for elements @param aggregateType the aggregate type, may be a collection or an array @param elementType the element type @param result the destination for results
    **Set** {@link **ServletRegistrationBean**}**s that the filter will be registered against.** @param servletRegistrationBeans the Servlet registration beans
    **Add** {@link **ServletRegistrationBean**}**s for the filter.** @param servletRegistrationBeans the servlet registration beans to add @see #setServletRegistrationBeans
    **Set servlet names that the filter will be registered against. This will replace any previously specified servlet names.** @param servletNames the servlet names @see #setServletRegistrationBeans @see #setUrlPatterns
    **Add servlet names for the filter.** @param servletNames the servlet names to add
    **Set the URL patterns that the filter will be registered against. This will replace any previously specified URL patterns.** @param urlPatterns the URL patterns @see #setServletRegistrationBeans @see #setServletNames
    **Add URL patterns, as defined in the Servlet specification, that the filter will be registered against.** @param urlPatterns the URL patterns
    **Convenience method to** {@link **#setDispatcherTypes(EnumSet) set dispatcher types**} **using the specified elements.** @param first the first dispatcher type @param rest additional dispatcher types
    **Configure registration settings. Subclasses can override this method to perform additional configuration if required.** @param registration the registration
    **Create a nested** {@link **DependencyCustomizer**} **that only applies if any of the specified class names are not on the class path.** @param classNames the class names to test @return a nested {@link DependencyCustomizer}
    **Create a nested** {@link **DependencyCustomizer**} **that only applies if all of the specified class names are not on the class path.** @param classNames the class names to test @return a nested {@link DependencyCustomizer}
    **Create a nested** {@link **DependencyCustomizer**} **that only applies if the specified paths are on the class path.** @param paths the paths to test @return a nested {@link DependencyCustomizer}
    

    However, the relevant information is in the raw docstrings (I added the ** to highlight relevant passages). Simply using the part before the first @ produces pretty bad results (at least in Java), as it's common practice to highlight code blocks or links with javadoc tags. Possible solution: taking everything before the first param (or maybe @param) as the summary and afterwards removing javadoc tags (maybe keeping the tokens inside).
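
    As an illustration of that kind of cleanup, a small hypothetical pass that drops javadoc inline tags (keeping their inner tokens) and strips HTML markup could look like:

    import re

    def clean_javadoc(summary):
        summary = re.sub(r'\{@\w+\s*([^}]*)\}', r'\1', summary)  # {@link Foo} -> Foo
        summary = re.sub(r'<[^>]+>', ' ', summary)               # drop HTML tags such as <p>, <a href=...>
        return re.sub(r'\s+', ' ', summary).strip()

    print(clean_javadoc('Set {@link ServletRegistrationBean}s for the filter. <p>See docs.</p>'))
    # -> 'Set ServletRegistrationBeans for the filter. See docs.'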

    Cleaning

    The preprocessing does not include any cleaning. This manifests in docstrings that contain HTML tags (which are commonly found in javadoc), as well as URLs (which afterwards get tokenized rather verbosely). See these two samples from Java:

    Determine if a uri is in origin - form according to <a href = https : // tools . ietf . org / html / rfc7230#section - 5 . 3 > rfc7230 5 . 3< / a > .
    Determine if a uri is in asterisk - form according to <a href = https : // tools . ietf . org / html / rfc7230#section - 5 . 3 > rfc7230 5 . 3< / a > .
    

    Some stats

    >>> wc -l java.train.comment 
    454436 java.train.comment
    >>> grep -E '<p >|<p >' java.train.comment | wc -l
    42750
    

    At least 10% of the tokenized java docstrings still contain html tags.

    >>> grep '{ @' java.train.comment | wc -l
    44500
    

    Another 10% still contain javadoc.

    >>> grep "{ @inheritDoc }" java.train.comment | wc -l
    1685
    

    2k consist only of a single javadoc tag indicating that the doc was inherited.

    Many of the golang docstrings contain URLs, which are not very useful in the tokenized version the regex produces (see above in java).

    >>> wc -l go.train.comment 
    317822 go.train.comment
    >>> grep -E 'http :|https :' go.train.comment | wc -l
    19753
    

    ~6% contain URLs (starting with http :)

    >>> grep -E "autogenerated|auto generated" go.train.comment | wc -l
    4850
    

    Around 5k auto-generated methods.

    >>> grep "/ *" go.train.comment | wc -l
    33620
    

    10% still contain c-style comment delimiters.

    Tokenization

    Any specific reason you keep punctuation symbols like ., ,, -, /, :, <, >, *, =, @, (, ) as tokens? Is it to keep code in the docstrings?

    Summary

    I really think better cleaning and language-dependent preprocessing would produce higher-quality docstrings. At least for Java, removing javadoc and HTML could be beneficial, as well as using everything before the first param as a summary (maybe in combination with the first-paragraph \n\n heuristic).

    opened by villmow 5
  • Benchmark Submission: paloukari

    Benchmark Submission: paloukari

    This pull request represents a submission to the codesearchnet benchmark.

    It is the main communication channel for you and the reviewers.

    Wandb run results for review

    opened by paloukari 5
  • Refactor of the Model superclass

    Refactor of the Model superclass

    • encoder classes are model class fields instead of constructor arguments – the constructor arguments are now sound wrt. subtyping;
    • allow different encoder types for different code languages (not tested in practice);
    • data, metadata loaders and losses are defined in separate modules.
    opened by vilunov 4
  • script to request and download model NDCG

    script to request and download model NDCG

    Since the relevance_annotations.csv is not available, a script to upload model_predictions.csv and download a model statistics file (NDCG, MRR, for example) would be great.

    opened by celsofranssa 4
  • dataset can not be downloaded

    dataset can not be downloaded

    https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/%7Bpython,java,go,php,ruby,javascript%7D.zip 👆 This URL cannot be opened and returns AccessDenied. Could you please tell me where I should download the dataset now?

    opened by KesuCaso 0
  • Can we combine the original dataset and re-divide it to perform cross-validation?

    Can we combine the original dataset and re-divide it to perform cross-validation?

    How do you divide the training/testing/validation sets? I can't find the basis for the division, only the ratio. If we want to perform cross-validation, can we combine the original training, test, and validation sets and re-divide them?

    opened by SIPUNK 0
  • Clone not working

    Clone not working

    Hello, I tried cloning this repo. Got the following error: error: invalid path 'benchmarks/02-Feb-20-16:33_github_g2od7jac.json' fatal: unable to checkout working tree warning: Clone succeeded, but checkout failed. You can inspect what was checked out with 'git status' and retry with 'git restore --source=HEAD :/'

    I tried git restore, but that resulted in a staged delete of all the code.

    opened by shreethatte 1
  • Expired or Private Links of Java Code Snippets in CodeSearchNET

    Expired or Private Links of Java Code Snippets in CodeSearchNET

    I was trying to access the CodeSearchNet dataset for my work - specifically the Java data - via the link given in the CodeSearchNet repository: https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/java.zip

    While looking at the data parameters and doing initial data exploration, I observed that many of the Java code snippets were taken from public GitHub repositories that have since either been made private or have had their repository links expire.

    Due to the reason mentioned above, I am not able to access their original github repository.

    Can you kindly take a look and let me know if there is any way to obtain the entire GitHub repository from which the Java code snippets and their respective documentation were taken?

    Actually, my work requires cloning the entire project repository from which the CodeSearchNet Java dataset was extracted.

    opened by harshgeek4coder 0
  • Please add the commit id for each language parser

    Please add the commit id for each language parser

    In the Dockerfile of function-parser, the script downloads tree-sitter==0.0.5; however, the language parser repositories have since been updated, which leads to conflicts. Please add a git checkout command (pinned commit id) for each language parser.

    opened by 3usi9 0