CredData is a set of files including credentials in open source projects

Samsung

Last update: Sep 7, 2022

Overview

CredData (Credential Dataset)

Introduction
How To Use
Data Overview
- Data statistics
Data
- Selecting Target Repositories
- Ground Rules for Labeling Suspected Credential Information
Metadata
Obfuscation
License
Directory Structure
Benchmark Result
Used Tools for Benchmarking
Citation
How to Get Involved
How to Contact

Introduction

CredData (Credential Dataset) is a set of files including credentials in open source projects. CredData includes suspicious lines with manual review results and more information such as credential types for each suspicious line.

CredData can be used to develop new tools or improve existing tools. Furthermore, using the benchmark result of the CredData, users can choose a proper tool among open source credential scanning tools according to their use case. We sincerely hope that CredData will help minimize credential leaks.

How To Use

The dataset is distributed as metadata files plus a script that collects the required code files based on that metadata.

To download the data, you need:

Linux-based OS
Python 3.7.10 or higher

Then run:

pip install PyYAML
python download_data.py --data_dir data

A Linux-based OS is required because of an NTFS filename issue: some files that are downloaded and processed have names that are invalid on Windows/NTFS systems (such as version->1.2.js).

Using a lower Python version may result in an OSError: [Errno 40] Too many levels of symbolic links exception.

The resulting dataset has no invalid filenames and can be used on Windows.

The tmp directory can be removed after the dataset is generated.
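
To see which file names would be a problem on Windows, a check along the following lines can help. This is only a sketch: `is_valid_windows_filename` is a hypothetical helper, not part of download_data.py, and it covers only the commonly documented Windows naming rules (reserved characters, reserved device names, trailing space or dot).

```python
import re

# Characters that Windows does not allow in a file name, plus control characters.
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')

# Reserved device names (checked case-insensitively, with or without extension).
_RESERVED = {"CON", "PRN", "AUX", "NUL"}
_RESERVED |= {f"COM{i}" for i in range(1, 10)}
_RESERVED |= {f"LPT{i}" for i in range(1, 10)}


def is_valid_windows_filename(name: str) -> bool:
    """Return True if `name` (a single path component) is usable on Windows."""
    if not name or _INVALID_CHARS.search(name):
        return False
    if name[-1] in " .":
        return False
    stem = name.split(".", 1)[0].upper()
    return stem not in _RESERVED


print(is_valid_windows_filename("version->1.2.js"))  # False: contains '>'
print(is_valid_windows_filename("version-1.2.js"))   # True
```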

Data Overview

Data Statistics

The dataset consists of 19,459,282 lines of code extracted from 11,408 files in 297 different repositories. In total, 59,907 lines of code are labeled, of which 5,882 (9.82%) are labeled as True. Labeled data is divided into 8 major categories according to their properties.

Lines of code by language

Language	Total	Labeled	True	Language	Total	Labeled	True
Text	85,144	8,718	1,634	Config	7,920	308	68
JavaScript	742,704	4,478	1,339	No Extension	48,645	991	55
Python	351,494	4,996	704	Shell	42,019	1,207	52
Go	838,816	5,814	696	Java Properties	1,878	111	38
YAML	74,643	2,521	479	AsciiDoc	27,803	418	37
Markdown	186,099	3,065	372	XML	57,377	1,312	30
Ruby	186,196	4,006	327	Haskell	5,127	67	30
Java	178,326	1,614	271	SQLPL	16,808	594	26
Key	8,803	598	227	reStructuredText	38,267	401	21
PHP	113,865	1,767	209	Smalltalk	92,284	777	18
JSON	15,036,863	9,304	194	TOML	2,566	235	17
TypeScript	151,832	2,357	155	Objective-C	19,840	115	14
Other	1,143,963	5,637	235

True credentials by category

Category	True credentials
Password	2,554
Generic Secret	1,064
Private Key	984
Generic Token	453
Predefined Pattern	236
Authentication Key & Token	47
Seed, Salt, Nonce	35
Other	509
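
As a quick consistency check, the per-category counts can be summed and compared with the totals stated under Data Statistics. The snippet below only re-does that arithmetic with the numbers from the tables above; the variable names are illustrative.

```python
# True credentials per category, copied from the table above.
true_by_category = {
    "Password": 2554,
    "Generic Secret": 1064,
    "Private Key": 984,
    "Generic Token": 453,
    "Predefined Pattern": 236,
    "Authentication Key & Token": 47,
    "Seed, Salt, Nonce": 35,
    "Other": 509,
}
labeled_lines = 59_907  # total labeled lines, from Data Statistics

total_true = sum(true_by_category.values())
share = 100 * total_true / labeled_lines

print(total_true)        # 5882
print(f"{share:.2f}%")   # 9.82%
```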

Data

Selecting Target Repositories

To collect a variety of cases in which credentials exist, we selected publicly accessible repositories on Github through the following process:

1. We wanted to collect credentials from repositories covering various languages, frameworks, and topics, so we first collected 181 topics on Github.
- To select widely known repositories for each topic, we kept only repositories with more than a certain number of stars. 19,486 repositories were selected in this step.
2. We filtered out repositories whose license does not allow use in a dataset, according to the license information provided by Github.
- In some cases, the provided license was inaccurate, so we also conducted a manual review.
3. We filtered the remaining repositories by checking whether, and how often, they contain strings related to the most common credentials, such as 'password' and 'secret'. After that, we executed several open source credential scanning tools.
4. For the results of No.3, we manually reviewed the detection results of all tools. Please check Ground Rules for Labeling Suspected Credential Information for the method used in the review.

As a result, we selected 297 repositories containing at least one line that we suspect holds a credential value.
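
The keyword-based filtering step can be pictured with a small sketch like the one below. It is an assumption-laden illustration, not the script used to build the dataset: `count_keywords` is a hypothetical helper, and only the two keywords named above ('password' and 'secret') are taken from the text.

```python
KEYWORDS = ("password", "secret")  # the two examples named in the text


def count_keywords(text, keywords=KEYWORDS):
    """Count case-insensitive occurrences of each keyword in `text`."""
    lowered = text.lower()
    return {kw: lowered.count(kw) for kw in keywords}


sample = 'PASSWORD = "x"\nsecret_key = 1\napi_secret = 2\n'
print(count_keywords(sample))  # {'password': 1, 'secret': 2}
```

A repository with zero hits for every keyword would be a natural candidate to drop before running the scanning tools.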

Ground Rules for Labeling Suspected Credential Information

It is difficult to know whether a line in the source code is a real credential. However, a human reviewer can estimate how likely it is that a detected result contains actual credential information. We classify the detection results into three credential types.

True : It looks like a real credential value.
False : It looks like a false positive case, not the actual credential value.
Template : It seems that it is not an actual credential, but it is a placeholder. It might be helpful in cases such as ML.

To compose an accurate Ground Truth set, we review the data based on the following 'Ground Rules':

All credentials in test (example) directories should be labeled as True.
Credentials with obvious placeholders (password = ;) should be labeled as Template.
Function calls without string literals (password=getPass();) and environmental variable assignments (password=${pass}) should be labeled as False.
Base64 and other encoded data should be labeled as False. If it is a plaintext credential just encoded to Base64, that should be labeled as True.
Package and resource version hashes are not credentials, so a common hash string (integrity sha512-W7s+uC5bikET2twEFg==) is False.
Be careful about the file type when checking variable assignments:

In a .yaml file, the row (password=my_password) can be a credential, but in .js or .py it cannot. These languages require quotation marks (' or ") for string declarations (password="my_password").
Check that the file you are labeling is not a localization file. For example, config/locales/pt-BR.yml does not contain credentials, just translations, so its lines should be labeled as False.

We found that many credentials exist in directories/files that serve a test purpose, such as test/tests. People often judge that such values contain a real credential, but we do not know whether a given value is an actual usable credential or a value used only for testing. We classify those values as True to avoid missing real usable credentials. Since it may be necessary to separate these values in the future, we have separated the files used for testing from those that are not. (Check the metadata or the data set.)
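
Some of the ground rules above are mechanical enough to pre-sort candidates before a human looks at them. The sketch below is a hypothetical triage helper (`suggest_label` and its regular expressions are assumptions, not part of CredData); it encodes only the function-call, environment-variable, and quoting rules, and returns "Review" for everything it cannot decide. The labels in the dataset itself were assigned by manual review.

```python
import re
from pathlib import PurePosixPath

# File types that, per the rules above, need quotes around string values.
QUOTED_LANGS = {".js", ".py"}

# name = value  or  name: value, with an optional trailing semicolon.
_ASSIGN = re.compile(r"^\s*(?P<name>[\w.]+)\s*[:=]\s*(?P<value>.+?)\s*;?\s*$")
_STRING_LITERAL = re.compile(r'''(['"]).+?\1''')


def suggest_label(path: str, line: str) -> str:
    """Return "False" when a ground rule clearly applies, else "Review"."""
    match = _ASSIGN.match(line)
    if match is None:
        return "Review"
    value = match.group("value")
    has_literal = bool(_STRING_LITERAL.search(value))
    if re.fullmatch(r"\$\{\w+\}", value):
        return "False"  # environment variable assignment
    if re.fullmatch(r"[\w.]+\(.*\)", value) and not has_literal:
        return "False"  # function call without a string literal
    if PurePosixPath(path).suffix in QUOTED_LANGS and not has_literal:
        return "False"  # unquoted value cannot be a string in .js/.py
    return "Review"


print(suggest_label("Login.java", "password=getPass();"))    # False
print(suggest_label("conf.yaml", "password=my_password"))    # Review
print(suggest_label("app.py", 'password="my_password"'))     # Review
```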

Metadata

Metadata includes Ground Truth values and additional information for credential lines detected by various tools.

Properties on the Metadata

Name of property	Data Type	Description
ID	Integer	Credential ID
FileID	String	Filename hash. Used to download the correct file from an external repo
Domain	String	Domain of repository. (ex. Github)
RepoName	String	Name of the project in which the credential was found
FilePath	String	File path where credential information was included
LineStart:LineEnd	Integer:Integer	Line information, it can be single(2:2) or multiple(ex. 2:4 means 2 to 4 inclusive)
GroundTruth	String	Ground Truth of this credential. True / False or Template
ValueStart	Integer	Index of value on the line. always nan if GroundTruth is False.
ValueEnd	Integer	Index of character right after value ends in the line.
InURL	Boolean	Flag to indicate if credential is a part of a URL, such as "http://user:[email protected]"
CharacterSet	String	Characters used in the credential (NumberOnly, CharOnly, Any)
CryptographyKey	String	Type of a key: Private or Public
PredefinedPattern	String	Credential with defined regex patterns (AWS token with `AKIA...` pattern)
VariableNameType	String	Categorize credentials by variable name into Secret, Key, Token, SeedSalt and Auth
Entropy	Float	Shannon entropy of a credential
WithWords	Boolean	Flag to indicate whether a word from https://github.com/first20hours/google-10000-english is included in the credential
Length	Integer	Value length, similar to ValueEnd - ValueStart
Base64Encode	Boolean	Is credential a base64 string?
HexEncode	Boolean	Is credential a hex encoded string? (like `\xFF` or `FF 02 33`)
URLEncode	Boolean	Is credential a url encoded string? (like `one%20two`)
Category	String	Labeled data divided into 8 major categories according to their properties. see Category.

Category

Name	Description
Password	Short secret with entropy <3.5 or Password keyword in variable name
Generic Secret	Secret of any length with high entropy
Private Key	Private cryptographic key
Predefined Pattern	Credential detected based on defined regex, such as Google API Key/JWT/AWS Client ID
Seed, Salt, Nonce	Credential with seed, salt or nonce in variable name
Generic Token	Credential with Token in VariableNameType and not covered by other categories
Authentication Key & Token	Credential with Auth in VariableNameType and not covered by other categories
Other	Any credential that is not covered by the categories above
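
The Entropy property and the entropy threshold in the Password category both rely on Shannon entropy. The text does not state the exact formula or logarithm base used for the metadata, so the following is a sketch under the common assumption of bits per character (base 2) over character frequencies; `shannon_entropy` is an illustrative name.

```python
import math
from collections import Counter


def shannon_entropy(value: str) -> float:
    """Shannon entropy of `value` in bits per character (assumed definition)."""
    if not value:
        return 0.0
    total = len(value)
    counts = Counter(value)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


print(shannon_entropy("abcd"))      # 2.0: four equally likely characters
print(shannon_entropy("admin123"))  # 3.0: eight distinct characters
```

Under this definition, a short value such as admin123 falls below the 3.5 threshold mentioned for the Password category.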

Relationship between Data and Metadata

You can find the metadata files in the meta directory. Each metadata file contains rows with the line location, value index, and GT (GroundTruth) information for the suspected credentials of a specific repository.

Let's look at the meta/02dfa7ec.csv file as an example.

Id,FileID,Domain,RepoName,FilePath,LineStart:LineEnd,GroundTruth,WithWords,ValueStart,ValueEnd,...
34024,61ed9af5,GitHub,02dfa7ec,data/02dfa7ec/test/n.example,83:83,True,Secret,HighEntropy,0,31,73,...

Converting the above line into a table with only the essential columns gives:

...	RepoName	FilePath	LineStart:LineEnd	GroundTruth	...	ValueStart	ValueEnd	...
...	02dfa7ec	data/02dfa7ec/test/n.example	83:83	True	...	31	73	...

This row means that the credential line is on the 83rd line of the data/02dfa7ec/test/n.example file, which is downloaded and obfuscated by the download_data.py script. You can find the n.example file at the above path in the output generated by running download_data.py.

When you check the 83rd line of the file, you can see the following line:

GITHUB_ENTERPRISE_ORG_SECRET=hfbpozfhvuwgtfosmo2imqskc73w04jf3313309829

Here, you can locate the credential using the ValueStart and ValueEnd values in the metadata:

hfbpozfhvuwgtfosmo2imqskc73w04jf3313309829

Note that this value is an obfuscated value, not an actual credential value. For details, please refer to the following Obfuscation section.
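
The lookup described above amounts to a string slice. The sketch below assumes that ValueStart and ValueEnd are 0-based character offsets with ValueEnd exclusive (as the property descriptions suggest), and it assumes the line in the file is indented by two spaces, which is what makes the offsets 31 and 73 line up with the value shown. The helper names are hypothetical.

```python
from typing import Tuple


def parse_line_range(field: str) -> Tuple[int, int]:
    """Split a LineStart:LineEnd field such as "83:83" into two integers."""
    start, end = field.split(":")
    return int(start), int(end)


def extract_value(line: str, value_start: int, value_end: int) -> str:
    """Return the characters from value_start up to, not including, value_end."""
    return line[value_start:value_end]


# Assumption: two leading spaces, so that the value begins at index 31.
line = "  GITHUB_ENTERPRISE_ORG_SECRET=hfbpozfhvuwgtfosmo2imqskc73w04jf3313309829"

print(parse_line_range("83:83"))    # (83, 83)
print(extract_value(line, 31, 73))  # hfbpozfhvuwgtfosmo2imqskc73w04jf3313309829
```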

Obfuscation

If the suspicious lines were included in the dataset as they are, credential values in real use could be leaked, which is not a desirable result.

To avoid such cases, we applied:

Credential values obfuscation in files.
Directory & file name and directory hierarchy obfuscation.

Credential values obfuscation in files

To prevent leakage of the actual credential values in the files, we could mask the lines suspected of containing credentials or change them to random strings. However, masking and random replacement can have side effects on the detection performance of several tools. We therefore applied other methods to substitute the actual credential values while preserving detectability by these tools:

Replacing the real value with an example value where a fixed pattern is clear (ex. AWS Access Key)
Replacing an entire file containing credential information with an example file (ex. X.509 Key)
Generating a random key using a regex pattern derived from the character set and length of the real string
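
The third method, random generation that keeps the character set and length, can be illustrated as follows. This is a minimal sketch of the idea, not the obfuscation code used for CredData: `obfuscate_value` is a hypothetical function that replaces each letter or digit with a random character of the same class and leaves all other characters in place.

```python
import random
import string


def obfuscate_value(value: str, rng: random.Random) -> str:
    """Replace each character with a random one of the same class."""
    out = []
    for ch in value:
        if ch.islower():
            out.append(rng.choice(string.ascii_lowercase))
        elif ch.isupper():
            out.append(rng.choice(string.ascii_uppercase))
        elif ch.isdigit():
            out.append(rng.choice(string.digits))
        else:
            out.append(ch)  # keep separators and punctuation unchanged
    return "".join(out)


rng = random.Random()  # pass random.Random(seed) for reproducible output
print(obfuscate_value("Abc-123_xyZ", rng))
```

Keeping the length and character classes means that scanners relying on length or character-set heuristics should still see a value of the same shape.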

Directory & file name and directory hierarchy obfuscation

Even if the line suspected of containing a credential is obfuscated, the original credential value and its origin could easily be found from the repository information (repo name, directory structure, file name). To prevent this, we obfuscated the directory structure and file names. Files containing lines that are suspected of being credentials or that were detected by a tool are saved in the form data/RepoID/(src|test)/FileID.FILE_EXTENSION. The mapping of RepoID and FileID to the original repositories is not published, but original repository information can be provided, through separate contact, for research purposes and to the actual repository owner. For contact information, please refer to the How to Contact section.

License

Each file remains under the license of its original project. Because file and directory structures are obfuscated, it may be difficult to identify that project directly, but you can check the actual license through the license file corresponding to the RepoID in the license directory.

Directory Structure

📦CredData
 ┃
 ┣ 📂benchmark
 ┃
 ┣ 📂data          ---- To be generated by `download_data.py`
 ┃ ┗ 📂A
 ┃   ┣ 📂src
 ┃   ┃ ┗ 📜a.c     ---- Source File
 ┃   ┃
 ┃   ┣ 📂test
 ┃   ┃ ┗ 📜b.c     ---- Source File but in the test/tests.
 ┃   ┃
 ┃   ┣ 📂other
 ┃   ┃ ┗ 📜c       ---- File has no extension or Readme 
 ┃   ┃
 ┃   ┗ 📜LICENSE(COPYING)   ---- License File for repo A
 ┃
 ┣ 📂meta
 ┃    ┗ 📜A.csv
 ┃
 ┣ 📜snapshot.yaml ---- URL and commit info for used repositories
 ┃
 ┣ 📜README.md
 ┃
 ┣ 📜download_data.py
 ┃
 ┗ 📜CredData.pdf

Benchmark Result

The table below shows the performance metrics of each tool tested on CredData. The content will be updated in detail with the release of our tool in October. For the tools used, see the Used Tools for Benchmarking section below.

Name	TP	FP	TN	FN	FPR	FNR	Precision	Recall	F1
ours (to be released)	4,231	1,592	52,511	1,771	0.0294	0.29506	0.7266	0.7049	0.7156
detect-secrets	2,862	10,467	44,508	3,140	0.1903	0.5231	0.2147	0.4768	0.2961
gitleaks	1,064	1,068	52,838	4,938	0.0198	0.8227	0.49906	0.1772	0.2616
shhgit	324	277	53,629	5,678	0.0051	0.94601	0.5391	0.0539	0.0981
truffleHog	1,756	129,343	41,622	4,246	0.7565	0.7074	0.0133	0.2925	0.0256
credential-digger	637	25,532	49,997	5,365	0.33804	0.8938	0.0243	0.1061	0.0396
wraith(gitrob)	1,504	3,062	52,149	4,498	0.0554	0.7494	0.3293	0.2505	0.2846
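
The rate columns in the table can be derived from the four counts. The sketch below uses the standard confusion-matrix definitions, which appear to be consistent with the table values; `confusion_metrics` is an illustrative helper and does no zero-division handling.

```python
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard rates from true/false positive/negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "Precision": precision,
        "Recall": recall,
        "F1": 2 * precision * recall / (precision + recall),
    }


# Counts from the first row of the table ("ours").
metrics = confusion_metrics(tp=4231, fp=1592, tn=52511, fn=1771)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```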

Used Tools for Benchmarking

Name	URL
truffleHog	https://github.com/trufflesecurity/truffleHog
shhgit	https://github.com/eth0izzle/shhgit
wraith(gitrob)	https://gitlab.com/gitlab-com/gl-security/security-operations/gl-redteam/gitrob
credential-digger	https://github.com/SAP/credential-digger
gitleaks	https://github.com/zricethezav/gitleaks
detect-secrets	https://github.com/Yelp/detect-secrets

Citation

You can use CredData in your research.

@misc{sr-cred21,
    author = {JaeKu Yun and ShinHyung Choi and YuJeong Lee and Oleksandra Sokol and WooChul Shim},
    title = {Project CredData: A Dataset of Credentials for Research},
    howpublished ={ \url{https://github.com/Samsung/CredData}},
    year = {2021}
}

How to Get Involved

In addition to developing under an Open Source license, CredData uses an Open Source Development approach, welcoming everyone to participate, contribute, and engage with each other through the project.

Project Roles

CredData recognizes the following formal roles: Contributor and Maintainer. Informally, the community may organize itself and give rights and responsibilities to the necessary people to achieve its goals.

Contributor

A Contributor is anyone who wishes to contribute to the project, at any level. Contributors are granted the following rights:

Suggest
- Changes to the Ground Truth for existing or newly added code
- New open repositories to be included
Report defects (bugs) and suggestions for enhancement
Participate in the process of reviewing contributions by others

Contributors are required to:

Follow the rules below when adding to the credential dataset
- Additional data must be independent of the original data; it must not affect (change, remove, or conflict with) the original data
- Additional data must not include valid/real credential data, to prevent further exposure of the credential; it must be transformed by the obfuscation rules described in README.md, or changed through another process with a similar obfuscation effect
Receive the approval of a maintainer before their changes are applied

Contributors who show dedication and skill are rewarded with additional rights and responsibilities. Their opinions weigh more when decisions are made, in a fully meritocratic fashion.

Maintainer

A Maintainer is a Contributor who is also responsible for knowing, directing and anticipating the needs of a given Module. As such, Maintainers have the right to set the overall organization of the source code in the Module, and the right to participate in the decision-making. Maintainers are required to review contributors’ requests and decide whether or not to accept the contributed data.

How to Contact

Please post questions, issues, or suggestions in Issues. This is the best way to communicate with the developers.

Comments

Add category to result

We had categories, but the results were not broken down by category. So I added category to the Result class and also added accuracy to the Result class for convenience.

Here is the result:

$ python -m benchmark --scanner credsweeper
result_cnt : 4103, lost_cnt : 65, true_cnt : 3701, false_cnt : 337
credsweeper -> TP : 3701, FP : 337, TN : 19454362, FN : 882, FPR : 0.0000173223, FNR : 0.1924503600, ACC : 0.9999373564, PRC : 0.9165428430, RCL : 0.8075496400, F1 : 0.8586010904
credsweeper Private Key -> TP : 953, FP : 0, TN : 4, FN : 39, FPR : 0E-10, FNR : 0.0393145161, ACC : 0.9608433735, PRC : 1.0000000000, RCL : 0.9606854839, F1 : 0.9799485861
credsweeper Predefined Pattern -> TP : 310, FP : 3, TN : 39, FN : 17, FPR : 0.0714285714, FNR : 0.0519877676, ACC : 0.9457994580, PRC : 0.9904153355, RCL : 0.9480122324, F1 : 0.9687500000
credsweeper Generic Token -> TP : 281, FP : 5, TN : 598, FN : 52, FPR : 0.0082918740, FNR : 0.1561561562, ACC : 0.9391025641, PRC : 0.9825174825, RCL : 0.8438438438, F1 : 0.9079159935
credsweeper Generic Secret -> TP : 975, FP : 4, TN : 214, FN : 81, FPR : 0.0183486239, FNR : 0.0767045455, ACC : 0.9332810047, PRC : 0.9959141982, RCL : 0.9232954545, F1 : 0.9582309582
credsweeper Password -> TP : 986, FP : 129, TN : 4167, FN : 409, FPR : 0.0300279330, FNR : 0.2931899642, ACC : 0.9054647689, PRC : 0.8843049327, RCL : 0.7068100358, F1 : 0.7856573705
credsweeper Other -> TP : 115, FP : 3, TN : 742, FN : 259, FPR : 0.0040268456, FNR : 0.6925133690, ACC : 0.7658623771, PRC : 0.9745762712, RCL : 0.3074866310, F1 : 0.4674796748
credsweeper Seed, Salt, Nonce -> TP : 33, FP : 0, TN : 8, FN : 6, FPR : 0E-10, FNR : 0.1538461538, ACC : 0.8723404255, PRC : 1.0000000000, RCL : 0.8461538462, F1 : 0.9166666667
credsweeper Authentication Key & Token -> TP : 48, FP : 1, TN : 31, FN : 19, FPR : 0.0312500000, FNR : 0.2835820896, ACC : 0.7979797980, PRC : 0.9795918367, RCL : 0.7164179104, F1 : 0.8275862069

opened by csh519 3

Cover case of a failed checkout
Fix the case where a repo might have gone private, or a commit was removed from the history

If move_files encounters an empty repo with a mismatched number of files, the repo is skipped with a message. The related meta file is removed, so the mismatch between meta and real files has no effect later on

If the checkout process fails, the repo is removed, so move_files also skips it

Change the assert with an error to a message

Example result:

Processed: 296/297
Processed: 297/297
Finalizing dataset. Please wait a moment...
Done! All files saved to data
Some repos had a problem with download.
Removing meta so missing files would not count in the dataset statistics:
meta/8c13fe41.csv
meta/d7017e58.csv
You can use git to restore mentioned meta files back

opened by meanrin 2

Duplicated rows in meta files

Hello, thank you for this benchmark and dataset, it is very interesting!

I noticed that there are some duplicated rows in the meta files (when excluding the Id column), as can be seen by running this script:

import os
import pandas as pd
meta_file_list = os.listdir('./meta')
for f in meta_file_list:
     df = pd.read_csv(f"./meta/{f}")
     nb_dups = sum(df.drop(["Id"], axis="columns").duplicated())
     if nb_dups > 1:
         print(f"{f}: {nb_dups} dups")

The output is:

387016a6.csv: 7 dups
e51ae6e8.csv: 3 dups
eb9ba732.csv: 2 dups
1d8ec728.csv: 4 dups
b88a4e51.csv: 4 dups
7738e44d.csv: 3 dups
4997f5e1.csv: 15 dups
ac9be8d9.csv: 16 dups
894e3377.csv: 4 dups
36d0fbbb.csv: 8 dups
c2127ffb.csv: 2 dups
31423103.csv: 4 dups
0064e882.csv: 4 dups
6a132a15.csv: 19 dups
8ae5e55a.csv: 2 dups
a59f7e20.csv: 14 dups
54f6f35d.csv: 3 dups
c9b945fa.csv: 2 dups
fd501154.csv: 9 dups
4fe048e4.csv: 2 dups
83a072aa.csv: 4 dups
78e5819e.csv: 3 dups
4764adaf.csv: 28 dups
14a53c5a.csv: 3 dups
0401c075.csv: 2 dups
bbb4193f.csv: 2 dups
d111114d.csv: 4 dups
784d78e8.csv: 13 dups
533c47c6.csv: 4 dups
798d34aa.csv: 5 dups
ec869dbc.csv: 7 dups
c8c48c5e.csv: 2 dups
28728ab4.csv: 33 dups
77a3d7d7.csv: 3 dups
f8f46739.csv: 3 dups
02dfa7ec.csv: 2 dups
7ba140d7.csv: 5 dups
efb4b495.csv: 3 dups
3bc98c93.csv: 7 dups
91cfdb2b.csv: 2 dups
349ac2b1.csv: 18 dups
660708f3.csv: 3 dups
90aebe4a.csv: 7 dups
264777b9.csv: 5 dups
654956e7.csv: 3 dups
a5c3685e.csv: 2 dups
8e7a08b0.csv: 5 dups
8cda00f3.csv: 6 dups
4dccc5be.csv: 5 dups
afdd01d8.csv: 6 dups
2ba83c6a.csv: 6 dups
5f62aae4.csv: 3 dups
472d4c24.csv: 12 dups
389fd795.csv: 2 dups
1fb36b4f.csv: 3 dups
ec138349.csv: 9 dups
fa71ac83.csv: 2 dups
2c47a91c.csv: 3 dups
50595139.csv: 88 dups
288eaba8.csv: 2 dups
b6b2487d.csv: 8 dups
873d2d8b.csv: 5 dups
39def7b4.csv: 3 dups
0f133e09.csv: 2 dups
49b08818.csv: 2 dups
80815938.csv: 12 dups
c8aa9b49.csv: 8 dups
6e2ed0e4.csv: 2 dups
cc51a2f0.csv: 3 dups
86050693.csv: 3 dups
255bae6f.csv: 10 dups
6c73b80a.csv: 10 dups
a0cd6261.csv: 2 dups
e3c63910.csv: 2 dups
f710ac3c.csv: 12 dups
41659445.csv: 2 dups
87ae3d91.csv: 4 dups
a15774b8.csv: 15 dups
4a099ada.csv: 3 dups
81cd05d0.csv: 40 dups
fdbe07ac.csv: 6 dups
f008dd40.csv: 7 dups
2df212a2.csv: 2 dups
850c2319.csv: 6 dups
f37dd3b3.csv: 3 dups
f623c7b3.csv: 2 dups
d7017e58.csv: 16 dups
d2d68c6f.csv: 17 dups
00408ef6.csv: 6 dups

opened by gg-mmill 2

Add new scanner TruffleHog

TruffleHog recently released v3.0.0, which has a filesystem option, so we can finally test it with CredData.

Add TruffleHog v3.2.3 linux arm64 binary in benchmark/scanner/bin/trufflehog

Add TruffleHog scanner class

opened by csh519 1

CI workflow with update

Removed some non-public repositories. Updated debug info to show which repo is being processed before an exception is thrown. Checks are available at https://github.com/babenek/CredData/pull/1/checks

opened by babenek 0

Remove duplicates

Fix duplicates mentioned in #11

Changes:

Remove duplicates from 129 files

Change stats in readme

Change benchmark results for different scanners (less than 0.1% difference in most cases)
opened by meanrin 0

Update data, Nov 24, 2021

Changes to metadata

Changes:

Number of lines in meta: 59,907 -> 74,549

True cases in meta: 6,002 -> 4,595

Benchmarks for other scanner tools and aggregated statistics are also updated

Fix an issue where some values in the with_words column were equal to 0 or 1, while they should always be T or F

Motivation for changes:

Ambiguous class for placeholder passwords
A lot of passwords labeled as True in the dataset are placeholder passwords such as password = password or password = admin123 (and many similar). But we also have password = password or password = admin123 in Template and False cases. Having the same cases in True, Template, and False creates huge confusion for the model. To make the model perform better, it would make sense to move all such cases under the same class: either True, False, or Template. We propose to move them to Template, but this is an arbitrary choice and can be changed.

Not all keys labeled as True
We found a lot of JWT and AWS keys (and some other cases) that really look True but were either labeled False or not present in meta at all (so not detected by other scanners). We propose to relabel them as True.

Assert/Example/Expect statements
A lot of lines with Assert/Example/Expect functions are labeled as True while being stored in test files and used to validate different functionality in the code. We propose to relabel them as False.
opened by meanrin 0

Fix citation

First, names have to be separated by `and` instead of commas, otherwise biblatex raises an error. See this link for details

Second, add Melkonyan Arkadiy and Dmytro Kuzmenko to authors

opened by ARKAD97 0

Update download_data.py

Change True to T in download script due to the changes in https://github.com/Samsung/CredData/pull/3/commits/fc0caa5e294d4dcda4cc53d23b865deaf60af423

Modified rows used to select credentials for obfuscation

opened by meanrin 0

Update Readme statistics

Update readme statistics and related images

Change False/True in meta GroundTruth to F/T
GroundTruth is a string column (values False/True/Template), but pandas can sometimes interpret it as a boolean one. This may cause errors if a user tries to select true credentials with something like line.GroundTruth == "True", as the comparison will not match values interpreted as bool. Changing to F/T forces pandas to always read it as a string column.
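
The pitfall with the GroundTruth column can be reproduced with a small in-memory CSV. The example below is a sketch: the two-row CSV is made up for illustration, and passing `dtype` to `pandas.read_csv` is shown as one way to keep the column as text, independent of the F/T renaming proposed here.

```python
import io

import pandas as pd

# Illustrative meta fragment in which GroundTruth holds only True/False.
only_bools = "Id,GroundTruth\n1,True\n2,False\n"

# With default type inference the column is parsed as booleans.
inferred = pd.read_csv(io.StringIO(only_bools))

# Forcing a string dtype keeps the original text values.
as_text = pd.read_csv(io.StringIO(only_bools), dtype={"GroundTruth": str})

print(inferred["GroundTruth"].dtype)
print((inferred["GroundTruth"] == "True").sum())  # string comparison on bools
print((as_text["GroundTruth"] == "True").sum())   # string comparison on text
```

A file that also contains Template rows is read as text by default, which is why the behavior can differ from one meta file to the next.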

Sanity check for new statistics data in readme:

True credentials by category 2613 + 1074 + 993 + 462 + 251 + 49 + 40 + 520 = 6002

Lines of code by language

|Language|Total|Labeled|True|
|--------|--------|--------|--------|
|Total sum|19,459,282|59,907|6,002|
|Text|85,144|8,718|1,634|
|Go|838,816|5,808|690|
|Python|351,494|4,969|677|
|YAML|74,643|2,521|479|
|JavaScript|742,704|3,586|448|
|Markdown|186,099|3,064|371|
|Ruby|186,196|4,006|327|
|JSON|15,036,863|9,303|193|
|Java|178,326|1,435|169|
|TypeScript|151,832|2,357|155|
|Other|237,491|625|137|
|Key|8,803|309|115|
|PHP|113,865|1,660|104|
|Config|7,920|308|68|
|No Extension|48,645|990|55|
|Shell|42,019|1,207|52|
|Java Properties|1,878|111|38|
|AsciiDoc|27,803|418|37|
|XML|57,377|1,312|30|
|Haskell|5,127|67|30|
|SQLPL|16,808|594|26|
|reStructuredText|38,267|401|21|
|Smalltalk|92,284|777|18|
|TOML|2,566|235|17|
|Objective-C|19,840|115|14|
|Other|1,143,963|5,636|234|

opened by meanrin 0

Update 'readme.md' to add terminology of performance metrics

For a better understanding of the table of performance metrics, I suggest adding 'terminology' under the benchmark result. If the terminology is not accurate, please make it clear. :)

opened by Seo-Young 0
