Tools for curating biomedical training data for large-scale language modeling

Overview

Biomedical Language Modeling

Tools for curating biomedical training data for large-scale language modeling.

Setup

Using conda:

conda env create -f conda.yml

Activate the environment with:

conda activate bigscience-biomedical
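
As a quick check that the environment works, you can load one of the dataloader scripts in this repo with the Hugging Face datasets library. A minimal sketch (the script path and config name are illustrative; scitail is one of the public datasets under biodatasets/, and loader configs follow the {dataset}_source / {dataset}_bigbio_* naming convention seen in the test logs below):

# Minimal sanity check of the environment: load a public dataloader script from this
# repo with the Hugging Face `datasets` library. The path and config name below are
# illustrative; any public dataset under biodatasets/ should work the same way.
from datasets import load_dataset

dataset = load_dataset("biodatasets/scitail/scitail.py", name="scitail_source")
print(dataset)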

Datasets

Spreadsheet of biomedical training sets (currently ~76 datasets).

Experiments

Biomedical Prompting

Comments
  • Add Chemdner dataset loader

    Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.

    If the following information is NOT present in the issue, please populate:

    • Name: name of the dataset
    • Description: short description of the dataset (or link to social media or blog post)
    • Paper: link to the dataset paper if available
    • Data: link to the online home of the dataset

    Checkbox

    • [x] Confirm that this PR is linked to the dataset issue.
    • [x] Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
    • [x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
    • [x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
    • [x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
    • [x] Confirm dataloader script works with datasets.load_dataset function.
    • [x] Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
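
    As an illustration of the layout this checklist refers to, here is a schematic sketch of a dataloader script. It is not the project's exact template: a real BigBIO loader must use the repo's BigBioConfig class and declare features for the source schema and a bigbio schema; plain datasets.BuilderConfig and a bare DatasetInfo are used here only to keep the sketch self-contained.

    # Schematic sketch of the dataloader layout described by the checklist above.
    # NOTE: illustrative only -- a real BigBIO loader uses BigBioConfig and declares
    # features for the source schema and a bigbio schema.
    import datasets

    _CITATION = """..."""
    _DATASETNAME = "my_dataset"
    _DESCRIPTION = "Short description of the dataset."
    _HOMEPAGE = "https://example.org/my_dataset"                   # illustrative
    _LICENSE = "Unknown"
    _URLs = {"source": "https://example.org/my_dataset/data.zip"}  # illustrative
    _SUPPORTED_TASKS = []  # e.g. [Tasks.NAMED_ENTITY_RECOGNITION]
    _SOURCE_VERSION = "1.0.0"
    _BIGBIO_VERSION = "1.0.0"


    class MyDataset(datasets.GeneratorBasedBuilder):
        # At least one config for the original ("source") schema and one for a bigbio schema.
        BUILDER_CONFIGS = [
            datasets.BuilderConfig(name="my_dataset_source", version=_SOURCE_VERSION),
            datasets.BuilderConfig(name="my_dataset_bigbio_kb", version=_BIGBIO_VERSION),
        ]

        def _info(self):
            return datasets.DatasetInfo(
                description=_DESCRIPTION,
                homepage=_HOMEPAGE,
                license=_LICENSE,
                citation=_CITATION,
            )

        def _split_generators(self, dl_manager):
            data_dir = dl_manager.download_and_extract(_URLs["source"])
            return [
                datasets.SplitGenerator(
                    name=datasets.Split.TRAIN,
                    gen_kwargs={"filepath": data_dir},
                )
            ]

        def _generate_examples(self, filepath):
            # Yield (key, example) pairs matching the features declared in _info().
            yield 0, {"id": "0", "text": "..."}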
    opened by qanastek 24
  • Closes #222

    • Name: PubTator Central
    • Description: PubTator Central (PTC, https://www.ncbi.nlm.nih.gov/research/pubtator/) is a web service for exploring and retrieving bioconcept annotations in full text biomedical articles. PTC provides automated annotations from state-of-the-art text mining systems for genes/proteins, genetic variants, diseases, chemicals, species and cell lines, all available for immediate download.
    • Paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3692066/
    • Data: https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTatorCentral/

    Checkbox

    • [X] Confirm that this PR is linked to the dataset issue.
    • [X] Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
    • [x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
    • [X] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
    • [X] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
    • [X] Confirm dataloader script works with datasets.load_dataset function.
    • [x] Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.

    opened by JohnGiorgi 19
  • Closes #156

    Local dataset, output in file: test_log.txt

The dataset's task is Word Sense Disambiguation, fitted into Tasks.NAMED_ENTITY_DISAMBIGUATION. There was a discussion starting from here.

    There are a lot of mismatches between entities and offsets; this is intended, as there can be slight variations between the ambiguous word and the word in the text (e.g. case, singular/plural, ...).

    local dataset 
    opened by nomisto 16
  • Create dataset loader for MedHop

    Adding a Dataset

    • Name: MedHop
    • Description: None provided
    • Task: QA
    • Paper: https://transacl.org/ojs/index.php/tacl/article/viewFile/1325/299
    • Data: http://qangaroo.cs.ucl.ac.uk
    • License: CC BY-SA 3.0
    English QA JSON 
    opened by jason-fries 16
  • Closes #64

    Checkbox

    • [x] Confirm that this PR is linked to the dataset issue.
    • [x] Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
    • [x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
    • [x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
    • [x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
    • [x] Confirm dataloader script works with datasets.load_dataset function.
    • [x] Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
    • [x] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
    opened by mapama247 13
  • Normalize licenses

The idea is to have classes for different types of licenses, with each license having a subset of available parameters: name, type, version, link, description.

    This entails replacing:

    _LICENSE = "CC BY-NC-SA"
    _LICENSE= "CC BY-NC-SA 3.0"
    

    With

    from bigbio.utils import license
    _LICENSE = license.CreativeCommons(type="BY-NC-SA")
    _LICENSE = license.CreativeCommons(type="BY-NC-SA", version=3.0)
    

    and

    return datasets.DatasetInfo(
                description=_DESCRIPTION,
                features=features,
                homepage=_HOMEPAGE,
                license= str(_LICENSE),
                citation=_CITATION,
            )
    

Special ones are Custom for dataset-specific licenses and PubliclyAvailable for datasets which can be freely downloaded but do not provide license information.
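
    A minimal sketch of what such license helpers could look like (illustrative only; the actual bigbio.utils.license module may define them differently):

    # Illustrative sketch of the proposed license helpers; the real bigbio.utils.license
    # module may differ in names, fields, and string formatting.
    from dataclasses import dataclass
    from typing import Optional


    @dataclass
    class License:
        name: str = ""
        link: Optional[str] = None
        description: Optional[str] = None

        def __str__(self) -> str:
            return self.name


    @dataclass
    class CreativeCommons(License):
        type: str = ""                   # e.g. "BY-NC-SA"
        version: Optional[float] = None  # e.g. 3.0

        def __str__(self) -> str:
            suffix = f" {self.version}" if self.version is not None else ""
            return f"CC {self.type}{suffix}"


    @dataclass
    class Custom(License):
        """Dataset-specific license."""


    @dataclass
    class PubliclyAvailable(License):
        """Freely downloadable, but no license information provided."""


    # Usage mirroring the example above:
    _LICENSE = CreativeCommons(type="BY-NC-SA", version=3.0)
    print(str(_LICENSE))  # -> "CC BY-NC-SA 3.0"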

    opened by sg-wbi 12
  • Closes #220

    Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.

    If the following information is NOT present in the issue, please populate:

    • Name: name of the dataset
    • Description: short description of the dataset (or link to social media or blog post)
    • Paper: link to the dataset paper if available
    • Data: link to the online home of the dataset

    Checkbox

    • [x] Confirm that this PR is linked to the dataset issue.
    • [x] Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
    • [x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
    • [x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
    • [x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
    • [x] Confirm dataloader script works with datasets.load_dataset function.
    • [x] Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
    • [x] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
    (bigscience-biomedical) root@docker-desktop:/workspaces/biomedical# python -m tests.test_bigbio biodatasets/n2c2_2014/n2c2_2014.py --data_dir /workspaces/biomedical/biodatasets/n2c2_2014/tar_gz
    INFO:__main__:args: Namespace(path='biodatasets/n2c2_2014/n2c2_2014.py', schema=None, subset_id=None, data_dir='/workspaces/biomedical/biodatasets/n2c2_2014/tar_gz', use_auth_token=None)
    INFO:__main__:self.PATH: biodatasets/n2c2_2014/n2c2_2014.py
    INFO:__main__:self.SUBSET_ID: n2c2_2014
    INFO:__main__:self.SCHEMA: None
    INFO:__main__:self.DATA_DIR: /workspaces/biomedical/biodatasets/n2c2_2014/tar_gz
    INFO:__main__:Checking for _SUPPORTED_TASKS ...
    INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.NAMED_ENTITY_RECOGNITION: 'NER'>, <Tasks.TEXT_CLASSIFICATION: 'TXTCLASS'>]
    INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'KB', 'TEXT'}
    INFO:__main__:schemas_to_check: {'KB', 'TEXT'}
    INFO:__main__:Checking load_dataset with config name n2c2_2014_source
    WARNING:datasets.builder:Using custom data configuration n2c2_2014_source-57e9df040ed9f011
    Downloading and preparing dataset n2c2_2014/n2c2_2014_source to /home/jovyan/.cache/huggingface/datasets/n2c2_2014/n2c2_2014_source-57e9df040ed9f011/1.0.0/1e3e609086987a589ee6db853ac5c55fe73648c31096f144bb0dbbd670cb2f72...
    Dataset n2c2_2014 downloaded and prepared to /home/jovyan/.cache/huggingface/datasets/n2c2_2014/n2c2_2014_source-57e9df040ed9f011/1.0.0/1e3e609086987a589ee6db853ac5c55fe73648c31096f144bb0dbbd670cb2f72. Subsequent calls will reuse this data.
    100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 282.37it/s]
    INFO:__main__:Checking load_dataset with config name n2c2_2014_bigbio_kb
    WARNING:datasets.builder:Using custom data configuration n2c2_2014_bigbio_kb-57e9df040ed9f011
    Downloading and preparing dataset n2c2_2014/n2c2_2014_bigbio_kb to /home/jovyan/.cache/huggingface/datasets/n2c2_2014/n2c2_2014_bigbio_kb-57e9df040ed9f011/1.0.0/1e3e609086987a589ee6db853ac5c55fe73648c31096f144bb0dbbd670cb2f72...
    Dataset n2c2_2014 downloaded and prepared to /home/jovyan/.cache/huggingface/datasets/n2c2_2014/n2c2_2014_bigbio_kb-57e9df040ed9f011/1.0.0/1e3e609086987a589ee6db853ac5c55fe73648c31096f144bb0dbbd670cb2f72. Subsequent calls will reuse this data.
    100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 384.62it/s]
    INFO:__main__:Checking load_dataset with config name n2c2_2014_bigbio_text
    WARNING:datasets.builder:Using custom data configuration n2c2_2014_bigbio_text-57e9df040ed9f011
    Downloading and preparing dataset n2c2_2014/n2c2_2014_bigbio_text to /home/jovyan/.cache/huggingface/datasets/n2c2_2014/n2c2_2014_bigbio_text-57e9df040ed9f011/1.0.0/1e3e609086987a589ee6db853ac5c55fe73648c31096f144bb0dbbd670cb2f72...
    Dataset n2c2_2014 downloaded and prepared to /home/jovyan/.cache/huggingface/datasets/n2c2_2014/n2c2_2014_bigbio_text-57e9df040ed9f011/1.0.0/1e3e609086987a589ee6db853ac5c55fe73648c31096f144bb0dbbd670cb2f72. Subsequent calls will reuse this data.
    100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 514.01it/s]
    INFO:__main__:Checking global ID uniqueness
    INFO:__main__:Found 12490 unique IDs
    INFO:__main__:Gathering schema statistics
    INFO:__main__:Gathering schema statistics
    train
    ==========
    id: 790
    document_id: 790
    passages: 790
    entities: 17405
    normalized: 0
    events: 0
    coreferences: 0
    relations: 0
    
    test
    ==========
    id: 514
    document_id: 514
    passages: 514
    entities: 11462
    normalized: 0
    events: 0
    coreferences: 0
    relations: 0
    
    INFO:__main__:Checking if referenced IDs are properly mapped
    INFO:__main__:KB ONLY: Checking passage offsets
    INFO:__main__:KB ONLY: Checking entity offsets
    
    <SPECIFIC ERRORS ARE HIDDEN> 
    
    There are features with wrong offsets! This is not a hard failure, as it is common for this type of datasets. However, if the error list is long (e.g. >10) you should double check your code. 
    
    
    INFO:__main__:KB ONLY: Checking event offsets
    INFO:__main__:KB ONLY: Checking coref offsets
    INFO:__main__:Checking global ID uniqueness
    INFO:__main__:Found 514 unique IDs
    INFO:__main__:Gathering schema statistics
    INFO:__main__:Gathering schema statistics
    train
    ==========
    id: 790
    document_id: 790
    text: 790
    labels: 16501
    
    test
    ==========
    id: 514
    document_id: 514
    text: 514
    labels: 10970
    
    .
    ----------------------------------------------------------------------
    Ran 1 test in 11.258s
    
    OK
    
    local dataset 
    opened by jdposada 12
  • Closes #246

• Description: The PsyTAR dataset contains 891 drug reviews posted by patients on "askapatient.com" about the effectiveness and adverse drug events associated with Zoloft, Lexapro, Cymbalta, and Effexor XR.

    Checkbox

    • [x] Confirm that this PR is linked to the dataset issue.
    • [x] Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
    • [x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
    • [x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
    • [x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
    • [x] Confirm dataloader script works with datasets.load_dataset function.
    • [x] Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
    local dataset 
    opened by danilexn 12
  • Create dataset loader for Bio-SimLex

    • Name: Bio-SimLex
    • Description: Noun pairs with similarity scores
    • Task: Semantic Similarity
    • Paper: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2039-z
    • Data: https://github.com/cambridgeltl/bio-simverb/tree/master/wvlib/word-similarities/bio-simlex
    • License: ?
    • Motivation: Evaluation
    English Semantic Similarity 
    opened by galtay 12
  • Closes #217

The dataset's task is Relation Extraction; however, it was part of a challenge in which the correct answers for the test set were never released. Since data["test"]["relations"] is therefore empty for the test set, the test fails with:

    ERROR: runTest (__main__.TestDataLoader)
    Run all tests that check:
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "C:\Users\ottsi\biomedical\tests\test_bigbio.py", line 116, in runTest
        self.test_schema(schema)
      File "C:\Users\ottsi\biomedical\tests\test_bigbio.py", line 518, in test_schema
        self.assertTrue(self._check_subkey(example[key][0], attrs))
    IndexError: list index out of range
    
    ----------------------------------------------------------------------
    Ran 1 test in 1.195s
    
    FAILED (errors=1)
    

Edit: On second thought, I think this might be a bug in the test script, since it assumes that a required subkey has to have elements in it. For example, also in NER there could be documents without any entities at all (see the sketch below for a possible guard).
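
    A minimal sketch of the kind of guard that would avoid the IndexError (names are taken from the traceback above; this is not the actual fix in tests/test_bigbio.py):

    # Illustrative guard: only validate the required subkeys when the list has elements,
    # so splits with legitimately empty lists (e.g. no relations in the test split, or
    # NER documents without entities) do not fail with IndexError.
    def subkeys_are_valid(example: dict, key: str, attrs: set, check_subkey) -> bool:
        if not example[key]:  # empty list: nothing to validate
            return True
        return check_subkey(example[key][0], attrs)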

    opened by nomisto 11
  • Add Mantra GSC Dataset.

    Closes #137

    Checkbox

    • [x] Confirm that this PR is linked to the dataset issue.
    • [x] Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
    • [x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
    • [x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
    • [x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
    • [x] Confirm dataloader script works with datasets.load_dataset function.
    • [ ] Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
    • [ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
    opened by karthikrangasai 9
  • Closes #854

    Add the Paragraph-Level Simplification of Medical Texts dataset. Closes #854

    Checkbox

    • [X] Confirm that this PR is linked to the dataset issue.
    • [X] Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
    • [X] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
    • [X] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
    • [X] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
    • [X] Confirm dataloader script works with datasets.load_dataset function.
    • [X] Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
    • [X] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
    opened by Miking98 2
  • Add implementation for the Paragraph-level Simplification of Medical Texts dataset

    Adding a Dataset

    • Name: Paragraph-level Simplification of Medical Texts
    • Description: A paired dataset of technical medical abstracts and their plain-language summarizations.
    • Task: SUM
    • Paper: https://arxiv.org/abs/2104.05767
    • Data: https://github.com/AshOlogn/Paragraph-level-Simplification-of-Medical-Texts
    • License: CC_BY_4p0
    • Motivation: High-quality summarization dataset for translating biomedical technical language -> layman's terms
    opened by Miking98 0
  • Revise implementation of BioRED

    This PR improves the implementation of the BioRed corpus:

• In the previous implementation, a unique entity was created per entity mention and database identifier. This was fixed so that a single entity mention can have multiple database ids.
    • Furthermore, the name of the database an entity is linked to was added.
    • BioRED only provides abstract-level annotations for entity-linked relation pairs rather than materializing links between all surface-form mentions of a relation. Analogous to BC5CDR, we enumerate all mention pairs involving the entities in the relation (see the sketch below).
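
    A minimal sketch of that enumeration step (illustrative only; the field names and the grouping of mentions by database id are assumptions, not the PR's exact code):

    # Illustrative sketch: expand an abstract-level relation between two database ids into
    # mention-level pairs, analogous to the BC5CDR handling described above. Field names
    # (head_id, tail_id, arg1_id, arg2_id) are assumptions for the sketch.
    from itertools import product

    def expand_relation(relation: dict, mentions_by_db_id: dict) -> list:
        head_mentions = mentions_by_db_id.get(relation["head_id"], [])
        tail_mentions = mentions_by_db_id.get(relation["tail_id"], [])
        return [
            {"type": relation["type"], "arg1_id": head["id"], "arg2_id": tail["id"]}
            for head, tail in product(head_mentions, tail_mentions)
        ]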
    opened by mariosaenger 2
  • Fix unit test to run local PRs + fix tutorial

Enables unit testing of local scripts with the --test_local flag; borrows the test_bigbio_hub.py script. I tested this by making a copy of scitail as test_scitail in the biodatasets folder, as described below.

    To replicate:

• copy the scitail folder in bigbio/biodatasets: cp -r bigbio/biodatasets/scitail bigbio/biodatasets/test_scitail
    • rename scitail.py in this folder to test_scitail.py
    • add bigbiohub.py into this new test_scitail folder
    • in the main directory, run python -m tests.test_bigbio_hub test_scitail --test_local

Note: the contributions guide refers to the templates folder, which has 2 scripts: bigbiohub and a fill-in-the-blanks template file. To avoid a deprecated script, maybe we should automate things so that bigbiohub is copied into the templates folder on every update, OR I can just change the tutorial to reflect its actual "default" location of bigbio/hub/bigbiohub.py (see the sketch below).
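
    A minimal sketch of that copy step (paths come from the note above; when it would run, e.g. a pre-commit hook or CI job, is left open):

    # Illustrative sketch: keep the templates folder in sync with the "default" location
    # bigbio/hub/bigbiohub.py. The trigger (pre-commit hook, CI, release script) is an
    # open question per the note above.
    import shutil

    shutil.copy("bigbio/hub/bigbiohub.py", "templates/bigbiohub.py")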

    TODO: on 2023/01/03 I'm going to add one more small change that also tests whether the METADATA is in the acceptable set of values to ensure standardization!

    @galtay

    opened by hakunanatasha 1
  • Add implementation for the CPI dataset

    Adding a Dataset

    • Name: CPI
• Description: The compound-protein relationship (CPI) dataset consists of 2,613 sentences from abstracts containing annotations of proteins, small molecules, and their relationships.
    • Task: NER,RE,NEN
    • Paper: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0220925
    • Data: https://github.com/KerstenDoering/CPI-Pipeline
    • License: ISC
    • Motivation: High quality NER and RE annotations
    opened by mariosaenger 0
  • Add implementation for DrugProt data set

    Adding a Dataset

    • Name: DrugProt
    • Description: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vii/track-1/
    • Task: NER, RE
    • Paper: https://biocreative.bioinformatics.udel.edu/media/store/files/2021/Track1_pos_1_BC7_overview.pdf
    • Data: https://zenodo.org/record/5119892#.Y6A4RtLMKV4
    • License: Creative Commons Attribution 4.0 International
    • Motivation: High-quality annotations for NER and RE
    opened by mariosaenger 0
Owner
BigScience Workshop
Research workshop on large language models - The Summer of Language Models 21