Tools for curating biomedical training data for large-scale language modeling

BigScience Workshop

Last update: Dec 25, 2022

Related tags

Text Data & NLP biomedical

Overview

Biomedical Language Modeling

Tools for curating biomedical training data for large-scale language modeling.

Setup

Using conda:

conda env create -f conda.yml

Activate the environment as:

conda activate bigscience-biomedical

Datasets

Spreadsheet of biomedical training sets (currently ~76 datasets).

Experiments

Biomedical Prompting

Comments

Add Chemdner dataset loader
Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.

If the following information is NOT present in the issue, please populate:

Name: name of the dataset

Description: short description of the dataset (or link to social media or blog post)

Paper: link to the dataset paper if available

Data: link to the online home of the dataset

Checkbox

[x] Confirm that this PR is linked to the dataset issue.

[x] Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).

[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.

[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.

[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.

[x] Confirm dataloader script works with datasets.load_dataset function.

[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
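Part of the checklist above can be checked mechanically before opening a PR. The following is a minimal sketch, not part of the repository's test suite: the helper name missing_checklist_items is hypothetical, and it only inspects the source text of a dataloader script for the module-level variables and the three methods named in the checklist.

```python
import ast

# Names taken from the PR checklist.
REQUIRED_VARIABLES = {
    "_CITATION", "_DATASETNAME", "_DESCRIPTION", "_HOMEPAGE", "_LICENSE",
    "_URLs", "_SUPPORTED_TASKS", "_SOURCE_VERSION", "_BIGBIO_VERSION",
}
REQUIRED_METHODS = {"_info", "_split_generators", "_generate_examples"}


def missing_checklist_items(source: str) -> dict:
    """Hypothetical helper: report checklist names absent from a script's source."""
    tree = ast.parse(source)

    # Collect names assigned at module level.
    assigned = set()
    for node in tree.body:
        if isinstance(node, ast.Assign):
            assigned.update(t.id for t in node.targets if isinstance(t, ast.Name))
        elif isinstance(node, ast.AnnAssign) and isinstance(node.target, ast.Name):
            assigned.add(node.target.id)

    # Collect method names defined directly inside any class.
    methods = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            methods.update(
                item.name for item in node.body if isinstance(item, ast.FunctionDef)
            )

    return {
        "variables": sorted(REQUIRED_VARIABLES - assigned),
        "methods": sorted(REQUIRED_METHODS - methods),
    }


# A deliberately incomplete script, used only for illustration.
example = '''
_CITATION = ""
_DATASETNAME = "my_dataset"

class MyDataset:
    def _info(self):
        pass
'''
report = missing_checklist_items(example)
print(report["variables"])
print(report["methods"])
```

A static check like this does not replace running the test suite; it only catches missing names early.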
opened by qanastek 24
Closes #222
Name: PubTator Central

Description: PubTator Central (PTC, https://www.ncbi.nlm.nih.gov/research/pubtator/) is a web service for exploring and retrieving bioconcept annotations in full text biomedical articles. PTC provides automated annotations from state-of-the-art text mining systems for genes/proteins, genetic variants, diseases, chemicals, species and cell lines, all available for immediate download.

Paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3692066/

Data: https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTatorCentral/

Checkbox

[X] Confirm that this PR is linked to the dataset issue.

[X] Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).

[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.

[X] Implement _info(), _split_generators() and _generate_examples() in dataloader script.

[X] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.

[X] Confirm dataloader script works with datasets.load_dataset function.

[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.

Closes #222.
opened by JohnGiorgi 19
Closes #156

Closes #156

Local dataset, output in file: test_log.txt

The dataset's task is Word Sense Disambiguation, mapped to Tasks.NAMED_ENTITY_DISAMBIGUATION. There was a discussion starting from here.

There are many mismatches between entity text and offsets. This is intended, because the ambiguous word can differ slightly from the word as it appears in the text (for example, singular vs. plural).
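To make the kind of mismatch concrete, here is a minimal sketch of an offset check. It assumes entities shaped like a BigBio-style KB record (a list of text pieces and a list of start/end offsets); the function name and the sample document are illustrative and are not taken from the test suite.

```python
def find_offset_mismatches(text, entities):
    """Return (id, annotated, covered) for entities whose text differs from the text at their offsets."""
    mismatches = []
    for entity in entities:
        # Join the pieces of a possibly discontiguous entity with spaces.
        covered = " ".join(text[start:end] for start, end in entity["offsets"])
        annotated = " ".join(entity["text"])
        if covered != annotated:
            mismatches.append((entity["id"], annotated, covered))
    return mismatches


# Made-up example: the ambiguous word is "cold", but one mention is plural.
document = "Patients with colds were given cold compresses."
entities = [
    {"id": "e1", "text": ["cold"], "offsets": [(14, 19)]},  # span covers "colds"
    {"id": "e2", "text": ["cold"], "offsets": [(31, 35)]},  # span covers "cold"
]
mismatches = find_offset_mismatches(document, entities)
print(mismatches)
```

Only the first entity is reported, which is the singular/plural variation described above.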
local dataset

opened by nomisto 16
Create dataset loader for MedHop
Adding a Dataset

Name: MedHop

Description: None provided

Task: QA

Paper: https://transacl.org/ojs/index.php/tacl/article/viewFile/1325/299

Data: http://qangaroo.cs.ucl.ac.uk

License: CC BY-SA 3.0

English QA JSON
opened by jason-fries 16
Closes #64
Name: CODIESP

Description: Collection of 1,000 manually selected clinical case studies in Spanish.

Paper: link to the dataset paper

Data: link to the online home of the dataset

Checkbox

[x] Confirm that this PR is linked to the dataset issue.

[x] Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).

[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.

[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.

[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.

[x] Confirm dataloader script works with datasets.load_dataset function.

[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.

[x] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
opened by mapama247 13

Normalize licenses

The idea is to have classes for different types of licenses, with each license having a subset of the available parameters: name, type, version, link, description.

This entails replacing:

_LICENSE = "CC BY-NC-SA"
_LICENSE = "CC BY-NC-SA 3.0"

With

from bigbio.utils import license
_LICENSE = license.CreativeCommons(type="BY-NC-SA")
_LICENSE = license.CreativeCommons(type="BY-NC-SA", version=3.0)

and

return datasets.DatasetInfo(
    description=_DESCRIPTION,
    features=features,
    homepage=_HOMEPAGE,
    license=str(_LICENSE),
    citation=_CITATION,
)

Special ones are Custom for dataset-specific license and PubliclyAvailable for those datasets which can be freely downloaded but do not provide license information.
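One way such license classes could look is sketched below. This is an illustration under assumptions, not the actual bigbio.utils.license module: the base class name License, the default names, and the string format produced by __str__ are all hypothetical choices made so that str() reproduces the plain strings being replaced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class License:
    """Hypothetical base class holding the parameters listed in the proposal."""
    name: Optional[str] = None
    type: Optional[str] = None
    version: Optional[float] = None
    link: Optional[str] = None
    description: Optional[str] = None

    def __str__(self) -> str:
        # Join whichever of name, type and version are set, e.g. "CC BY-NC-SA 3.0".
        version = None if self.version is None else str(self.version)
        return " ".join(part for part in (self.name, self.type, version) if part)


@dataclass
class CreativeCommons(License):
    name: Optional[str] = "CC"


@dataclass
class PubliclyAvailable(License):
    # Freely downloadable data without license information.
    name: Optional[str] = "Publicly available"


@dataclass
class Custom(License):
    # Dataset-specific license; details would go in link/description.
    name: Optional[str] = "Custom"


print(str(CreativeCommons(type="BY-NC-SA")))
print(str(CreativeCommons(type="BY-NC-SA", version=3.0)))
```

Keeping the conversion in __str__ means the DatasetInfo call only needs str(_LICENSE), as in the snippet above.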

opened by sg-wbi 12

Closes #220

Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.

If the following information is NOT present in the issue, please populate:

Name: name of the dataset
Description: short description of the dataset (or link to social media or blog post)
Paper: link to the dataset paper if available
Data: link to the online home of the dataset

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
[x] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

(bigscience-biomedical) root@docker-desktop:/workspaces/biomedical# python -m tests.test_bigbio biodatasets/n2c2_2014/n2c2_2014.py --data_dir /workspaces/biomedical/biodatasets/n2c2_2014/tar_gz
INFO:__main__:args: Namespace(path='biodatasets/n2c2_2014/n2c2_2014.py', schema=None, subset_id=None, data_dir='/workspaces/biomedical/biodatasets/n2c2_2014/tar_gz', use_auth_token=None)
INFO:__main__:self.PATH: biodatasets/n2c2_2014/n2c2_2014.py
INFO:__main__:self.SUBSET_ID: n2c2_2014
INFO:__main__:self.SCHEMA: None
INFO:__main__:self.DATA_DIR: /workspaces/biomedical/biodatasets/n2c2_2014/tar_gz
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.NAMED_ENTITY_RECOGNITION: 'NER'>, <Tasks.TEXT_CLASSIFICATION: 'TXTCLASS'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'KB', 'TEXT'}
INFO:__main__:schemas_to_check: {'KB', 'TEXT'}
INFO:__main__:Checking load_dataset with config name n2c2_2014_source
WARNING:datasets.builder:Using custom data configuration n2c2_2014_source-57e9df040ed9f011
Downloading and preparing dataset n2c2_2014/n2c2_2014_source to /home/jovyan/.cache/huggingface/datasets/n2c2_2014/n2c2_2014_source-57e9df040ed9f011/1.0.0/1e3e609086987a589ee6db853ac5c55fe73648c31096f144bb0dbbd670cb2f72...
Dataset n2c2_2014 downloaded and prepared to /home/jovyan/.cache/huggingface/datasets/n2c2_2014/n2c2_2014_source-57e9df040ed9f011/1.0.0/1e3e609086987a589ee6db853ac5c55fe73648c31096f144bb0dbbd670cb2f72. Subsequent calls will reuse this data.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 282.37it/s]
INFO:__main__:Checking load_dataset with config name n2c2_2014_bigbio_kb
WARNING:datasets.builder:Using custom data configuration n2c2_2014_bigbio_kb-57e9df040ed9f011
Downloading and preparing dataset n2c2_2014/n2c2_2014_bigbio_kb to /home/jovyan/.cache/huggingface/datasets/n2c2_2014/n2c2_2014_bigbio_kb-57e9df040ed9f011/1.0.0/1e3e609086987a589ee6db853ac5c55fe73648c31096f144bb0dbbd670cb2f72...
Dataset n2c2_2014 downloaded and prepared to /home/jovyan/.cache/huggingface/datasets/n2c2_2014/n2c2_2014_bigbio_kb-57e9df040ed9f011/1.0.0/1e3e609086987a589ee6db853ac5c55fe73648c31096f144bb0dbbd670cb2f72. Subsequent calls will reuse this data.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 384.62it/s]
INFO:__main__:Checking load_dataset with config name n2c2_2014_bigbio_text
WARNING:datasets.builder:Using custom data configuration n2c2_2014_bigbio_text-57e9df040ed9f011
Downloading and preparing dataset n2c2_2014/n2c2_2014_bigbio_text to /home/jovyan/.cache/huggingface/datasets/n2c2_2014/n2c2_2014_bigbio_text-57e9df040ed9f011/1.0.0/1e3e609086987a589ee6db853ac5c55fe73648c31096f144bb0dbbd670cb2f72...
Dataset n2c2_2014 downloaded and prepared to /home/jovyan/.cache/huggingface/datasets/n2c2_2014/n2c2_2014_bigbio_text-57e9df040ed9f011/1.0.0/1e3e609086987a589ee6db853ac5c55fe73648c31096f144bb0dbbd670cb2f72. Subsequent calls will reuse this data.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 514.01it/s]
INFO:__main__:Checking global ID uniqueness
INFO:__main__:Found 12490 unique IDs
INFO:__main__:Gathering schema statistics
INFO:__main__:Gathering schema statistics
train
==========
id: 790
document_id: 790
passages: 790
entities: 17405
normalized: 0
events: 0
coreferences: 0
relations: 0

test
==========
id: 514
document_id: 514
passages: 514
entities: 11462
normalized: 0
events: 0
coreferences: 0
relations: 0

INFO:__main__:Checking if referenced IDs are properly mapped
INFO:__main__:KB ONLY: Checking passage offsets
INFO:__main__:KB ONLY: Checking entity offsets

<SPECIFIC ERRORS ARE HIDDEN> 

There are features with wrong offsets! This is not a hard failure, as it is common for this type of datasets. However, if the error list is long (e.g. >10) you should double check your code. 


INFO:__main__:KB ONLY: Checking event offsets
INFO:__main__:KB ONLY: Checking coref offsets
INFO:__main__:Checking global ID uniqueness
INFO:__main__:Found 514 unique IDs
INFO:__main__:Gathering schema statistics
INFO:__main__:Gathering schema statistics
train
==========
id: 790
document_id: 790
text: 790
labels: 16501

test
==========
id: 514
document_id: 514
text: 514
labels: 10970

.
----------------------------------------------------------------------
Ran 1 test in 11.258s

OK

local dataset

opened by jdposada 12

Closes #246
Closes #246

Description: The PsyTAR dataset contains 891 drug reviews posted by patients on "askapatient.com" about the effectiveness of, and adverse drug events associated with, Zoloft, Lexapro, Cymbalta, and Effexor XR.

Checkbox

[x] Confirm that this PR is linked to the dataset issue.

[x] Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).

[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.

[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.

[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.

[x] Confirm dataloader script works with datasets.load_dataset function.

[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.

local dataset
opened by danilexn 12
Create dataset loader for Bio-SimLex
Name: Bio-SimLex

Description: Noun pairs with similarity scores

Task: Semantic Similarity

Paper: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2039-z

Data: https://github.com/cambridgeltl/bio-simverb/tree/master/wvlib/word-similarities/bio-simlex

License: ?

Motivation: Evaluation

English Semantic Similarity
opened by galtay 12

The dataset's task is Relation Extraction; however, it was part of a challenge for which the correct answers to the test set were never released. Since data["test"]["relations"] is therefore empty for the test set, the test fails with:

ERROR: runTest (__main__.TestDataLoader)
Run all tests that check:
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\ottsi\biomedical\tests\test_bigbio.py", line 116, in runTest
    self.test_schema(schema)
  File "C:\Users\ottsi\biomedical\tests\test_bigbio.py", line 518, in test_schema
    self.assertTrue(self._check_subkey(example[key][0], attrs))
IndexError: list index out of range

----------------------------------------------------------------------
Ran 1 test in 1.195s

FAILED (errors=1)

Edit: On second thought, I think this might be a bug in the test script, since it assumes that a required subkey must contain at least one element. For example, in NER there could also be documents without any entities at all.
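A guard along the following lines would avoid the IndexError. This is a minimal sketch and not the code in tests/test_bigbio.py: check_subkey stands in for the script's _check_subkey, whose exact behaviour is assumed here, and check_first_element is a hypothetical name.

```python
def check_subkey(item: dict, attrs: list) -> bool:
    """Assumed behaviour: every required attribute is present in the item."""
    return all(attr in item for attr in attrs)


def check_first_element(example: dict, key: str, attrs: list) -> bool:
    """Check the first element under `key`, treating an empty list as valid."""
    values = example[key]
    if not values:
        # Nothing to inspect, e.g. a test split without released relations.
        return True
    return check_subkey(values[0], attrs)


empty_ok = check_first_element({"relations": []}, "relations", ["id", "type"])
complete = check_first_element(
    {"relations": [{"id": "r1", "type": "example"}]}, "relations", ["id", "type"]
)
incomplete = check_first_element(
    {"relations": [{"id": "r1"}]}, "relations", ["id", "type"]
)
print(empty_ok, complete, incomplete)
```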

opened by nomisto 11

Add Mantra GSC Dataset.
Closes #137

Checkbox

[x] Confirm that this PR is linked to the dataset issue.

[x] Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).

[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.

[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.

[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.

[x] Confirm dataloader script works with datasets.load_dataset function.

[ ] Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.

[ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
opened by karthikrangasai 9
Closes #854
Add the Paragraph-Level Simplification of Medical Texts dataset. Closes #854

Checkbox

[X] Confirm that this PR is linked to the dataset issue.

[X] Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).

[X] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.

[X] Implement _info(), _split_generators() and _generate_examples() in dataloader script.

[X] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.

[X] Confirm dataloader script works with datasets.load_dataset function.

[X] Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.

[X] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
opened by Miking98 2
Add implementation for the Paragraph-level Simplification of Medical Texts dataset
Adding a Dataset

Name: Paragraph-level Simplification of Medical Texts

Description: A paired dataset of technical medical abstracts and their plain-language summarizations.

Task: SUM

Paper: https://arxiv.org/abs/2104.05767

Data: https://github.com/AshOlogn/Paragraph-level-Simplification-of-Medical-Texts

License: CC_BY_4p0

Motivation: High-quality summarization dataset for translating biomedical technical language -> layman's terms
opened by Miking98 0
Revise implementation of BioRED
This PR improves the implementation of the BioRed corpus:

In the previous implementation, a separate entity was created for each combination of entity mention and database identifier. This has been changed so that a single entity mention carries multiple database IDs.

Furthermore, the name of the database an entity is linked to was added.
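The change described so far can be pictured as grouping raw annotations by mention. The sketch below is an illustration only and is not the BioRED loader's code: the input field names, the function merge_mentions, and the sample identifiers are made up, and the db_name/db_id pairs under "normalized" are assumed to follow the BigBio KB layout.

```python
def merge_mentions(raw_annotations):
    """Group annotations that share span and type into one entity with several IDs."""
    merged = {}
    for ann in raw_annotations:
        key = (ann["start"], ann["end"], ann["type"])
        entity = merged.setdefault(
            key,
            {
                "text": ann["text"],
                "offsets": [ann["start"], ann["end"]],
                "type": ann["type"],
                "normalized": [],
            },
        )
        # Keep the database name next to each identifier.
        entity["normalized"].append({"db_name": ann["db_name"], "db_id": ann["db_id"]})
    return list(merged.values())


# Made-up annotations: the first mention is linked to two databases.
raw = [
    {"text": "aspirin", "start": 0, "end": 7, "type": "Chemical", "db_name": "DB_A", "db_id": "A:1"},
    {"text": "aspirin", "start": 0, "end": 7, "type": "Chemical", "db_name": "DB_B", "db_id": "B:7"},
    {"text": "fever", "start": 17, "end": 22, "type": "Disease", "db_name": "DB_A", "db_id": "A:2"},
]
merged_entities = merge_mentions(raw)
print(len(merged_entities))
print(merged_entities[0]["normalized"])
```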

BioRed only provides abstract-level annotations for entity-linked relation pairs rather than materializing links between all surface-form mentions of a relation. Analogous to BC5CDR, we enumerate all mention pairs involving the entities in the triple.
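Enumerating mention pairs for an abstract-level relation amounts to a Cartesian product over the mentions of the two linked concepts. The following is a minimal sketch under assumptions: the relation and mention structures, the function name expand_relation, and all identifiers are hypothetical and do not come from the loader.

```python
from itertools import product


def expand_relation(relation, mentions_by_concept):
    """Turn one concept-level relation into relations between all mention pairs."""
    head_mentions = mentions_by_concept.get(relation["head"], [])
    tail_mentions = mentions_by_concept.get(relation["tail"], [])
    return [
        {"type": relation["type"], "arg1_id": head, "arg2_id": tail}
        for head, tail in product(head_mentions, tail_mentions)
    ]


# Made-up data: concept C1 is mentioned twice, concept C2 three times.
mentions_by_concept = {"C1": ["m1", "m2"], "C2": ["m3", "m4", "m5"]}
relation = {"head": "C1", "tail": "C2", "type": "Association"}
pairs = expand_relation(relation, mentions_by_concept)
print(len(pairs))
```

If either concept has no mention in the text, the product is empty and no relation is emitted.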
opened by mariosaenger 2
Fix unit test to run local PRs + fix tutorial
Enables unit testing of local scripts with --test_local flag; borrows the test_bigbio_hub.py script. I tested this by making a copy of scitail as test_scitail in the biodatasets folder, and

To replicate:

Copy the scitail folder in bigbio/biodatasets: cp -r bigbio/biodatasets/scitail bigbio/biodatasets/test_scitail.

Rename scitail.py in this new folder to test_scitail.py.

Add bigbiohub.py to the new test_scitail folder.

In the main directory, run python -m tests.test_bigbio_hub test_scitail --test_local.
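The first three steps can also be scripted. The sketch below is a convenience illustration, not part of the PR: the helper name make_local_test_copy is hypothetical, and it assumes bigbiohub.py lives at bigbio/hub/bigbiohub.py. The demo runs against a throwaway directory tree rather than a real checkout.

```python
import shutil
import tempfile
from pathlib import Path


def make_local_test_copy(repo_root, dataset="scitail", prefix="test_"):
    """Copy a dataset folder, rename its loader script, and add bigbiohub.py."""
    repo_root = Path(repo_root)
    source = repo_root / "bigbio" / "biodatasets" / dataset
    target = source.with_name(prefix + dataset)
    shutil.copytree(source, target)
    # Rename scitail.py to test_scitail.py inside the copied folder.
    (target / f"{dataset}.py").rename(target / f"{prefix}{dataset}.py")
    # Assumed location of the hub helper script.
    shutil.copy(repo_root / "bigbio" / "hub" / "bigbiohub.py", target / "bigbiohub.py")
    return target


# Demo on a temporary, fake repository layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "bigbio" / "biodatasets" / "scitail").mkdir(parents=True)
    (root / "bigbio" / "biodatasets" / "scitail" / "scitail.py").write_text("# loader\n")
    (root / "bigbio" / "hub").mkdir(parents=True)
    (root / "bigbio" / "hub" / "bigbiohub.py").write_text("# hub helpers\n")
    target = make_local_test_copy(root)
    copied = sorted(path.name for path in target.iterdir())
print(copied)
```

After running the helper on a real checkout, the last step is unchanged: python -m tests.test_bigbio_hub test_scitail --test_local.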

Note: the contribution guide refers to the templates folder, which has two scripts: bigbiohub and a template file that can be used to fill in the blanks. To avoid a deprecated script, maybe we should automate this so that on every update bigbiohub is copied into the template folder, OR I can just change the tutorial to reflect its actual "default" location of bigbio/hub/bigbiohub.py.

TODO: on 2023/01/03 I'm going to add one more small change that also tests whether the METADATA is in the acceptable set of values to ensure standardization!

@galtay
opened by hakunanatasha 1
Add implementation for the CPI dataset
Adding a Dataset

Name: CPI

Description: The compound-protein relationship (CPI) dataset consists of 2,613 sentences from abstracts containing annotations of proteins, small molecules, and their relationships.

Task: NER,RE,NEN

Paper: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0220925

Data: https://github.com/KerstenDoering/CPI-Pipeline

License: ISC

Motivation: High quality NER and RE annotations
opened by mariosaenger 0
Add implementation for DrugProt data set
Adding a Dataset

Name: DrugProt

Description: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vii/track-1/

Task: NER, RE

Paper: https://biocreative.bioinformatics.udel.edu/media/store/files/2021/Track1_pos_1_BC7_overview.pdf

Data: https://zenodo.org/record/5119892#.Y6A4RtLMKV4

License: Creative Commons Attribution 4.0 International

Motivation: High-quality annotations for NER and RE
opened by mariosaenger 0

Owner

BigScience Workshop

Research workshop on large language models - The Summer of Language Models 21

GitHub

Unsupervised Language Modeling at scale for robust sentiment classification

** DEPRECATED ** This repo has been deprecated. Please visit Megatron-LM for our up to date Large-scale unsupervised pretraining and finetuning code.

1k Nov 17, 2022

BMInf (Big Model Inference) is a low-resource inference package for large-scale pretrained language models (PLMs).

377 Jan 2, 2023

Code for text augmentation method leveraging large-scale language models

HyperMix Code for our paper GPT3Mix and conducting classification experiments using GPT-3 prompt-based data augmentation. Getting Started Installing P

47 Dec 20, 2022

This repository contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages" published in Findings of the Association for Computational Linguistics: ACL 2021.

XL-Sum This repository contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Lang

189 Jan 2, 2023

Ongoing research training transformer language models at scale, including: BERT & GPT-2

What is this fork of Megatron-LM and Megatron-DeepSpeed This is a detached fork of https://github.com/microsoft/Megatron-DeepSpeed, which in itself is

316 Jan 3, 2023

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Megatron (1 and 2) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA.

3.5k Dec 30, 2022

Concept Modeling: Topic Modeling on Images and Text

Concept is a technique that leverages CLIP and BERTopic-based techniques to perform Concept Modeling on images.

120 Dec 27, 2022

A full spaCy pipeline and models for scientific/biomedical documents.

This repository contains custom pipes and models related to using spaCy for scientific documents. In particular, there is a custom tokenizer that adds

1.3k Jan 3, 2023

A full spaCy pipeline and models for scientific/biomedical documents.

This repository contains custom pipes and models related to using spaCy for scientific documents. In particular, there is a custom tokenizer that adds

831 Feb 17, 2021

CBLUE, a Chinese medical information processing benchmark: A Chinese Biomedical Language Understanding Evaluation Benchmark

CBLUE AI (Artificial Intelligence) is playing an indispensable role in the biomedical field, helping improve medical technology. For fur

452 Dec 30, 2022

BERN2: an advanced neural biomedical named entity recognition and normalization tool

BERN2 We present BERN2 (Advanced Biomedical Entity Recognition and Normalization

99 Jan 6, 2023

CCKS-Title-based-large-scale-commodity-entity-retrieval-top1

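Top-1 solution for title-based large-scale commodity entity retrieval. 1. Task introduction. CCKS 2020: title-based large-scale commodity entity retrieval. Given a commodity title, the participating system must match it to the corresponding commodity entity in a given commodity database. Input: the input file contains several lines of commodity titles. Output: each line of the output text contains the commodity entity corresponding to the title, that is, the commodity ID in the given knowledge base,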

43 Nov 11, 2022

Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning

GenSen Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning Sandeep Subramanian, Adam Trischler, Yoshua B

309 Oct 19, 2022

IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. Published at EMNLP 2021 (main conference)

IndoBERTweet 1. Paper Fajri Koto, Jey Han Lau, and Timothy Baldwin. IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effe

40 Nov 30, 2022

Large-scale Knowledge Graph Construction with Prompting

Large-scale Knowledge Graph Construction with Prompting across tasks (predictive and generative), and modalities (language, image, vision + language, etc.)

161 Dec 28, 2022

A Neural Language Style Transfer framework to transfer natural language text smoothly between fine-grained language styles like formal/casual, active/passive, and many more. Created by Prithiviraj Damodaran. Open to pull requests and other forms of collaboration.

Styleformer A Neural Language Style Transfer framework to transfer natural language text smoothly between fine-grained language styles like formal/cas

431 Dec 19, 2022
