Reading Wikipedia to Answer Open-Domain Questions

DrQA

This is a PyTorch implementation of the DrQA system described in the ACL 2017 paper Reading Wikipedia to Answer Open-Domain Questions.

Machine Reading at Scale

DrQA is a system for reading comprehension applied to open-domain question answering. In particular, DrQA is targeted at the task of "machine reading at scale" (MRS). In this setting, we are searching for an answer to a question in a potentially very large corpus of unstructured documents (that may not be redundant). Thus the system has to combine the challenges of document retrieval (finding the relevant documents) with that of machine comprehension of text (identifying the answers from those documents).

Our experiments with DrQA focus on answering factoid questions while using Wikipedia as the unique knowledge source for documents. Wikipedia is a well-suited source of large-scale, rich, detailed information. In order to answer any question, one must first retrieve the few potentially relevant articles among more than 5 million, and then scan them carefully to identify the answer.

Note that DrQA treats Wikipedia as a generic collection of articles and does not rely on its internal graph structure. As a result, DrQA can be straightforwardly applied to any collection of documents, as described in the retriever README.

This repository includes code, data, and pre-trained models for processing and querying Wikipedia as described in the paper -- see Trained Models and Data. We also list several different datasets for evaluation, see QA Datasets. Note that this work is a refactored and more efficient version of the original code. Reproduction numbers are very similar but not exact.

Quick Start: Demo

Install DrQA and download our models to start asking open-domain questions!

Run python scripts/pipeline/interactive.py to drop into an interactive session. For each question, the top span and the Wikipedia paragraph it came from are returned.

>>> process('What is question answering?')

Top Predictions:
+------+----------------------------------------------------------------------------------------------------------+--------------------+--------------+-----------+
| Rank |                                                  Answer                                                  |        Doc         | Answer Score | Doc Score |
+------+----------------------------------------------------------------------------------------------------------+--------------------+--------------+-----------+
|  1   | a computer science discipline within the fields of information retrieval and natural language processing | Question answering |    1917.8    |   327.89  |
+------+----------------------------------------------------------------------------------------------------------+--------------------+--------------+-----------+

Contexts:
[ Doc = Question answering ]
Question Answering (QA) is a computer science discipline within the fields of
information retrieval and natural language processing (NLP), which is
concerned with building systems that automatically answer questions posed by
humans in a natural language.
>>> process('What is the answer to life, the universe, and everything?')

Top Predictions:
+------+--------+---------------------------------------------------+--------------+-----------+
| Rank | Answer |                        Doc                        | Answer Score | Doc Score |
+------+--------+---------------------------------------------------+--------------+-----------+
|  1   |   42   | Phrases from The Hitchhiker's Guide to the Galaxy |    47242     |   141.26  |
+------+--------+---------------------------------------------------+--------------+-----------+

Contexts:
[ Doc = Phrases from The Hitchhiker's Guide to the Galaxy ]
The number 42 and the phrase, "Life, the universe, and everything" have
attained cult status on the Internet. "Life, the universe, and everything" is
a common name for the off-topic section of an Internet forum and the phrase is
invoked in similar ways to mean "anything at all". Many chatbots, when asked
about the meaning of life, will answer "42". Several online calculators are
also programmed with the Question. Google Calculator will give the result to
"the answer to life the universe and everything" as 42, as will Wolfram's
Computational Knowledge Engine. Similarly, DuckDuckGo also gives the result of
"the answer to the ultimate question of life, the universe and everything" as
42. In the online community Second Life, there is a section on a sim called
"42nd Life." It is devoted to this concept in the book series, and several
attempts at recreating Milliways, the Restaurant at the End of the Universe, were made.
>>> process('Who was the winning pitcher in the 1956 World Series?')

Top Predictions:
+------+------------+------------------+--------------+-----------+
| Rank |   Answer   |       Doc        | Answer Score | Doc Score |
+------+------------+------------------+--------------+-----------+
|  1   | Don Larsen | New York Yankees |  4.5059e+06  |   278.06  |
+------+------------+------------------+--------------+-----------+

Contexts:
[ Doc = New York Yankees ]
In 1954, the Yankees won over 100 games, but the Indians took the pennant with
an AL record 111 wins; 1954 was famously referred to as "The Year the Yankees
Lost the Pennant". In , the Dodgers finally beat the Yankees in the World
Series, after five previous Series losses to them, but the Yankees came back
strong the next year. On October 8, 1956, in Game Five of the 1956 World
Series against the Dodgers, pitcher Don Larsen threw the only perfect game in
World Series history, which remains the only perfect game in postseason play
and was the only no-hitter of any kind to be pitched in postseason play until
Roy Halladay pitched a no-hitter on October 6, 2010.

Try some of your own! Of course, DrQA might provide alternative facts, so enjoy the ride.

Installing DrQA

Setting up DrQA is easy!

DrQA requires Linux/OSX and Python 3.5 or higher. It also requires installing PyTorch version 1.0. Its other dependencies are listed in requirements.txt. CUDA is strongly recommended for speed, but not necessary.

Run the following commands to clone the repository and install DrQA:

git clone https://github.com/facebookresearch/DrQA.git
cd DrQA; pip install -r requirements.txt; python setup.py develop

Note: requirements.txt includes a subset of all the possible required packages. Depending on what you want to run, you might need to install an extra package (e.g. spacy).

If you use the CoreNLPTokenizer or SpacyTokenizer you also need to download the Stanford CoreNLP jars and spaCy en model, respectively. If you use Stanford CoreNLP, have the jars in your java CLASSPATH environment variable, or set the path programmatically with:

import drqa.tokenizers
drqa.tokenizers.set_default('corenlp_classpath', '/your/corenlp/classpath/*')

IMPORTANT: The default tokenizer is CoreNLP so you will need that in your CLASSPATH to run the README examples.

Ex: export CLASSPATH=$CLASSPATH:/path/to/corenlp/download/*.

If you do not already have a CoreNLP download you can run:

./install_corenlp.sh

Verify that it runs:

from drqa.tokenizers import CoreNLPTokenizer
tok = CoreNLPTokenizer()
tok.tokenize('hello world').words()  # Should complete immediately

For convenience, the Document Reader, Retriever, and Pipeline modules will try to load default models if no model argument is given. See below for downloading these models.

Trained Models and Data

To download all provided trained models and data for Wikipedia question answering, run:

./download.sh

Warning: this downloads a 7.5GB tarball (25GB untarred) and will take some time.

This stores the data in data/ at the file paths specified in the various modules' defaults. This top-level directory can be modified by setting a DRQA_DATA environment variable to point to somewhere else.

Default directory structure (see embeddings for more info on additional downloads for training):

DrQA
├── data (or $DRQA_DATA)
    ├── datasets
    │   ├── SQuAD-v1.1-<train/dev>.<txt/json>
    │   ├── WebQuestions-<train/test>.txt
    │   ├── freebase-entities.txt
    │   ├── CuratedTrec-<train/test>.txt
    │   └── WikiMovies-<train/test/entities>.txt
    ├── reader
    │   ├── multitask.mdl
    │   └── single.mdl
    └── wikipedia
        ├── docs.db
        └── docs-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz

Default model paths for the different modules can also be modified programmatically in the code, e.g.:

import drqa.reader
drqa.reader.set_default('model', '/path/to/model')
reader = drqa.reader.Predictor()  # Default model loaded for prediction

Document Retriever

TF-IDF model using Wikipedia (unigrams and bigrams, 2^24 bins, simple tokenization), evaluated on multiple datasets (test sets, dev set for SQuAD):

Model          SQuAD P@5   CuratedTREC P@5   WebQuestions P@5   WikiMovies P@5   Size
TF-IDF model   78.0        87.6              75.0               69.8             ~13GB

P@5 here is defined as the % of questions for which the answer segment appears in one of the top 5 documents.
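
As a rough sketch, the metric can be computed like this (a simplification using plain substring matching; the real, regex-aware implementation is scripts/retriever/eval.py):

import re

def precision_at_k(ranked_docs, answers, k=5):
    """Percent of questions whose answer appears in a top-k retrieved document.

    ranked_docs: one list of document texts per question, best first.
    answers: one list of acceptable answer strings per question.
    """
    hits = 0
    for docs, golds in zip(ranked_docs, answers):
        text = ' '.join(docs[:k])
        if any(re.search(re.escape(gold), text, re.IGNORECASE) for gold in golds):
            hits += 1
    return 100.0 * hits / len(ranked_docs)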

Document Reader

Model trained only on SQuAD, evaluated in the SQuAD setting:

Model          SQuAD Dev EM   SQuAD Dev F1   Size
Single model   69.4           78.9           ~130MB

Model trained with distant supervision without NER/POS/lemma features, evaluated on multiple datasets (test sets, dev set for SQuAD) in the full Wikipedia setting:

Model             SQuAD EM   CuratedTREC EM   WebQuestions EM   WikiMovies EM   Size
Multitask model   29.5       27.2             18.5              36.9            ~270MB

Wikipedia

Our full-scale experiments were conducted on the 2016-12-21 dump of English Wikipedia. The dump was processed with the WikiExtractor and filtered to remove internal disambiguation, list, index, and outline pages (pages that are typically just links). We store the documents in an SQLite database for which drqa.retriever.DocDB provides an interface.

Database    Num. Documents   Size
Wikipedia   5,075,182        ~13GB
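
To read documents back out of the database, a hedged example (the get_doc_ids/get_doc_text methods and the db_path keyword reflect the current code, but verify against your checkout):

from drqa.retriever import DocDB

db = DocDB(db_path='data/wikipedia/docs.db')  # omit db_path to use the default
doc_ids = db.get_doc_ids()                    # all document ids (article titles)
print(db.get_doc_text(doc_ids[0])[:200])      # raw text of the first article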

QA Datasets

The datasets used for DrQA training and evaluation are SQuAD, WebQuestions, WikiMovies, and CuratedTrec; download.sh fetches all of them and converts them to the formats below.

Format A

The retriever/eval.py, pipeline/eval.py, and distant/generate.py scripts expect the datasets as a .txt file where each line is a JSON encoded QA pair, like so:

'{"question": "q1", "answer": ["a11", ..., "a1i"]}'
...
'{"question": "qN", "answer": ["aN1", ..., "aNi"]}'

Scripts to convert SQuAD and WebQuestions to this format are included in scripts/convert. This is automatically done in download.sh.

Format B

The reader directory scripts expect the datasets as a .json file where the data is arranged like SQuAD:

file.json
├── "data"
│   └── [i]
│       ├── "paragraphs"
│       │   └── [j]
│       │       ├── "context": "paragraph text"
│       │       └── "qas"
│       │           └── [k]
│       │               ├── "answers"
│       │               │   └── [l]
│       │               │       ├── "answer_start": N
│       │               │       └── "text": "answer"
│       │               ├── "id": "<uuid>"
│       │               └── "question": "paragraph question?"
│       └── "title": "document id"
└── "version": 1.1
Entity lists

Some datasets have (potentially large) candidate lists for selecting answers. For example, WikiMovies' answers are OMDb entries while WebQuestions is based on Freebase. If we have known candidates, we can impose that all predicted answers must be in this list by discarding any higher scoring spans that are not.
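
As a sketch, the restriction amounts to filtering the reader's scored spans before taking the top prediction (the names here are hypothetical, not the pipeline's internals):

def restrict_to_candidates(spans, candidates):
    # spans: (answer_text, score) pairs produced by the reader.
    # candidates: set of allowed answer strings, lowercased.
    allowed = [(text, score) for text, score in spans if text.lower() in candidates]
    return sorted(allowed, key=lambda pair: pair[1], reverse=True)

# e.g. restrict_to_candidates([('Don Larsen', 4.5e6), ('1956', 2.1e5)], {'don larsen'})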

DrQA Components

Document Retriever

DrQA is not tied to any specific type of retrieval system -- as long as it effectively narrows the search space and focuses on relevant documents.

Following classical QA systems, we include an efficient (non-machine learning) document retrieval system based on sparse, TF-IDF weighted bag-of-word vectors. We use bags of hashed n-grams (here, unigrams and bigrams).
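
The core idea can be sketched in a few lines (a simplification, not the repository's implementation, which uses a fast hash and a sparse scipy matrix; the IDF form matches the ranker's log((N - Nt + 0.5) / (Nt + 0.5))):

import hashlib
import math
from collections import Counter

NUM_BINS = 2 ** 24  # matches the default bin count above

def hash_ngrams(tokens):
    # Collect unigrams and bigrams, then hash each one into a fixed set of bins.
    grams = list(tokens) + [' '.join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    return Counter(int(hashlib.md5(g.encode()).hexdigest(), 16) % NUM_BINS for g in grams)

def tfidf_score(query_bins, doc_bins, doc_freq, num_docs):
    # Dot product of TF-IDF weighted bag-of-n-gram vectors.
    total = 0.0
    for bin_id, q_tf in query_bins.items():
        d_tf = doc_bins.get(bin_id, 0)
        if d_tf:
            nt = doc_freq.get(bin_id, 0)  # number of documents containing this bin
            idf = math.log((num_docs - nt + 0.5) / (nt + 0.5))
            total += q_tf * d_tf * idf * idf  # IDF weights both query and document
    return total

Here query_bins and doc_bins are outputs of hash_ngrams, and doc_freq counts how many documents each bin appears in.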

To see how to build your own such model on new documents, see the retriever README.

To interactively query Wikipedia:

python scripts/retriever/interactive.py --model /path/to/model

If --model is left out, our default model will be used (assuming it was downloaded).

To evaluate the retriever accuracy (% match in top 5) on a dataset:

python scripts/retriever/eval.py /path/to/format/A/dataset.txt --model /path/to/model

Document Reader

DrQA's Document Reader is a multi-layer recurrent neural network machine comprehension model trained to do extractive question answering. That is, the model tries to find the answer to any question as a text span in one of the returned documents.

The Document Reader was inspired by, and primarily trained on, the SQuAD dataset. It can also be used standalone on such SQuAD-like tasks where a specific context is supplied with the question, the answer to which is contained in the context.
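
For standalone use, the reader can also be driven directly from Python; a hedged example (the Predictor.predict call shown follows the reader README, so treat the exact signature as an assumption):

from drqa.reader import Predictor

predictor = Predictor()  # loads the default model if none is given
document = ('Question Answering (QA) is a computer science discipline within the '
            'fields of information retrieval and natural language processing.')
question = 'What is question answering?'
# predict returns the top_n highest-scoring (span, score) pairs from the document
for span, score in predictor.predict(document, question, top_n=1):
    print(span, score)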

To see how to train the Document Reader on SQuAD, see the reader README.

To interactively ask questions about text with a trained model:

python scripts/reader/interactive.py --model /path/to/model

Again, --model is optional here; a default model will be used if it is left out.

To run model predictions on a dataset:

python scripts/reader/predict.py /path/to/format/B/dataset.json --model /path/to/model

DrQA Pipeline

The full system is linked together in drqa.pipeline.DrQA.

To interactively ask questions using the full DrQA:

python scripts/pipeline/interactive.py

Optional arguments:

--reader-model    Path to trained Document Reader model.
--retriever-model Path to Document Retriever model (tfidf).
--doc-db          Path to Document DB.
--tokenizer       String option specifying tokenizer type to use (e.g. 'corenlp').
--candidate-file  List of candidates to restrict predictions to, one candidate per line.
--no-cuda         Use CPU only.
--gpu             Specify GPU device id to use.

To run predictions on a dataset:

python scripts/pipeline/predict.py /path/to/format/A/dataset.txt

Optional arguments:

--out-dir             Directory to write prediction file to (<dataset>-<model>-pipeline.preds).
--reader-model        Path to trained Document Reader model.
--retriever-model     Path to Document Retriever model (tfidf).
--doc-db              Path to Document DB.
--embedding-file      Expand dictionary to use all pretrained embeddings in this file (e.g. all glove vectors to minimize UNKs at test time).
--candidate-file      List of candidates to restrict predictions to, one candidate per line.
--n-docs              Number of docs to retrieve per query.
--top-n               Number of predictions to make per query.
--tokenizer           String option specifying tokenizer type to use (e.g. 'corenlp').
--no-cuda             Use CPU only.
--gpu                 Specify GPU device id to use.
--parallel            Use data parallel (split across GPU devices).
--num-workers         Number of CPU processes (for tokenizing, etc).
--batch-size          Document paragraph batching size (Reduce in case of GPU OOM).
--predict-batch-size  Question batching size (Reduce in case of CPU OOM).

Distant Supervision (DS)

DrQA's performance improves significantly in the full-setting when provided with distantly supervised data from additional datasets. Given question-answer pairs but no supporting context, we can use string matching heuristics to automatically associate paragraphs to these training examples.

Question: What U.S. state’s motto is “Live free or Die”?

Answer: New Hampshire

DS Document: Live Free or Die “Live Free or Die” is the official motto of the U.S. state of New Hampshire, adopted by the state in 1945. It is possibly the best-known of all state mottos, partly because it conveys an assertive independence historically found in American political philosophy and partly because of its contrast to the milder sentiments found in other state mottos.
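
A minimal sketch of the association heuristic (the function names are hypothetical; the real scripts/distant/generate.py adds tighter filters, e.g. matching named entities from the question):

def distantly_supervise(question, answers, retrieve_docs, split_paragraphs, n_docs=5):
    # Pair a question/answer-only example with retrieved paragraphs that
    # contain one of the answer strings.
    examples = []
    for doc in retrieve_docs(question, n_docs):  # e.g. top TF-IDF documents
        for paragraph in split_paragraphs(doc):
            if any(ans.lower() in paragraph.lower() for ans in answers):
                examples.append({'question': question,
                                 'answers': answers,
                                 'context': paragraph})
    return examples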

The scripts/distant directory contains code to generate and inspect such distantly supervised data. More information can be found in the distant supervision README.

Tokenizers

We provide a number of different tokenizer options for convenience. Each has its own pros/cons based on how many dependencies it requires, overhead for running it, speed, and performance. For our reported experiments we used CoreNLP (but results are all similar).

Available tokenizers:

  • CoreNLPTokenizer: Uses Stanford CoreNLP (option: 'corenlp'). We used v3.7.0. Requires Java 8.
  • SpacyTokenizer: Uses spaCy (option: 'spacy').
  • RegexpTokenizer: Custom regex-based PTB-style tokenizer (option: 'regexp').
  • SimpleTokenizer: Basic alpha-numeric/non-whitespace tokenizer (option: 'simple').

See drqa/tokenizers/__init__.py for the mapping between string option names and tokenizer classes.
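
For example (a hedged sketch; SimpleTokenizer needs no external dependencies, so it makes a safe smoke test):

from drqa import tokenizers

tok = tokenizers.get_class('simple')()  # 'simple' maps to SimpleTokenizer
print(tok.tokenize('Reading Wikipedia to answer open-domain questions.').words())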

Citation

Please cite the ACL paper if you use DrQA in your work:

@inproceedings{chen2017reading,
  title={Reading {Wikipedia} to Answer Open-Domain Questions},
  author={Chen, Danqi and Fisch, Adam and Weston, Jason and Bordes, Antoine},
  booktitle={Association for Computational Linguistics (ACL)},
  year={2017}
}

DrQA Elsewhere

Connection with ParlAI

This implementation of the DrQA Document Reader is closely related to the one found in ParlAI. Here, however, the work is extended to interact with the Document Retriever in the open-domain setting. On the other hand, the implementation in ParlAI is more general, and follows the appropriate API to work in more QA/Dialog settings.

Web UI

Hamed Zaghaghi has provided a wrapper for a Web UI.

License

DrQA is BSD-licensed.

Comments
  • Numpy memory error

    When I am running the python scripts/retriever/interactive.py command, it shows me the error below.

    root@ubuntu-2gb-nyc3-01:~/DrQA# python scripts/retriever/interactive.py
    08/21/2017 08:13:28 AM: [ Initializing ranker... ]
    08/21/2017 08:13:28 AM: [ Loading /root/DrQA/data/wikipedia/docs-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz ]
    Traceback (most recent call last):
      File "scripts/retriever/interactive.py", line 27, in <module>
        ranker = retriever.get_class('tfidf')(tfidf_path=args.model)
      File "/root/DrQA/drqa/retriever/tfidf_doc_ranker.py", line 37, in __init__
        matrix, metadata = utils.load_sparse_csr(tfidf_path)
      File "/root/DrQA/drqa/retriever/utils.py", line 34, in load_sparse_csr
        matrix = sp.csr_matrix((loader['data'], loader['indices'],
      File "/root/anaconda3/lib/python3.6/site-packages/numpy/lib/npyio.py", line 233, in __getitem__
        pickle_kwargs=self.pickle_kwargs)
      File "/root/anaconda3/lib/python3.6/site-packages/numpy/lib/format.py", line 664, in read_array
        array = numpy.ndarray(count, dtype=dtype)
    MemoryError

    I am using it without GPU; below is my system information.

    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                4
    On-line CPU(s) list:   0-3
    Thread(s) per core:    1
    Core(s) per socket:    1
    Socket(s):             4
    NUMA node(s):          1
    Vendor ID:             GenuineIntel
    CPU family:            6
    Model:                 79
    Model name:            Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
    Stepping:              1
    CPU MHz:               2199.998
    BogoMIPS:              4399.99
    Hypervisor vendor:     KVM
    Virtualization type:   full
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              256K
    L3 cache:              30720K
    NUMA node0 CPU(s):     0-3

    Can someone help me to resolve this problem?

    Thank You

    opened by Deepakchawla 19
  • `TIMEOUT: Timeout exceeded` error trying `tok = CoreNLPTokenizer()`

    When I try:

    >>> from drqa.tokenizers import CoreNLPTokenizer
    >>> tok = CoreNLPTokenizer()
    Traceback (most recent call last):
      File "/usr/local/lib/python3.5/dist-packages/pexpect/expect.py", line 99, in expect_loop
        incoming = spawn.read_nonblocking(spawn.maxread, timeout)
      File "/usr/local/lib/python3.5/dist-packages/pexpect/pty_spawn.py", line 462, in read_nonblocking
        raise TIMEOUT('Timeout exceeded.')
    pexpect.exceptions.TIMEOUT: Timeout exceeded.
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/ritwik/rd/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 33, in __init__
        self._launch()
      File "/home/ritwik/rd/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 61, in _launch
        self.corenlp.expect_exact('NLP>', searchwindowsize=100)
      File "/usr/local/lib/python3.5/dist-packages/pexpect/spawnbase.py", line 390, in expect_exact
        return exp.expect_loop(timeout)
      File "/usr/local/lib/python3.5/dist-packages/pexpect/expect.py", line 107, in expect_loop
        return self.timeout(e)
      File "/usr/local/lib/python3.5/dist-packages/pexpect/expect.py", line 70, in timeout
        raise TIMEOUT(msg)
    pexpect.exceptions.TIMEOUT: Timeout exceeded.
    <pexpect.pty_spawn.spawn object at 0x7ff89a70f128>
    command: /bin/bash
    args: ['/bin/bash']
    buffer (last 100 chars): b'@stagwiki: ~/rd/DrQA/data/corenlp\x07\x1b[01;32mritwik@stagwiki\x1b[00m:\x1b[01;34m~/rd/DrQA/data/corenlp\x1b[00m$ '
    before (last 100 chars): b'@stagwiki: ~/rd/DrQA/data/corenlp\x07\x1b[01;32mritwik@stagwiki\x1b[00m:\x1b[01;34m~/rd/DrQA/data/corenlp\x1b[00m$ '
    after: <class 'pexpect.exceptions.TIMEOUT'>
    match: None
    match_index: None
    exitstatus: None
    flag_eof: False
    pid: 17048
    child_fd: 5
    closed: False
    timeout: 60
    delimiter: <class 'pexpect.exceptions.EOF'>
    logfile: None
    logfile_read: None
    logfile_send: None
    maxread: 100000
    ignorecase: False
    searchwindowsize: None
    delaybeforesend: 0
    delayafterclose: 0.1
    delayafterterminate: 0.1
    searcher: searcher_string:
        0: "b'NLP>'"
    

    CLASSPATH is set properly

    corenlp$ echo $CLASSPATH
    /home/ritwik/rd/DrQA/data/corenlp/ejml-0.23.jar /home/ritwik/rd/DrQA/data/corenlp/javax.json-api-1.0-sources.jar /home/ritwik/rd/DrQA/data/corenlp/javax.json.jar /home/ritwik/rd/DrQA/data/corenlp/joda-time-2.9-sources.jar /home/ritwik/rd/DrQA/data/corenlp/joda-time.jar /home/ritwik/rd/DrQA/data/corenlp/jollyday-0.4.9-sources.jar /home/ritwik/rd/DrQA/data/corenlp/jollyday.jar /home/ritwik/rd/DrQA/data/corenlp/protobuf.jar /home/ritwik/rd/DrQA/data/corenlp/slf4j-api.jar /home/ritwik/rd/DrQA/data/corenlp/slf4j-simple.jar /home/ritwik/rd/DrQA/data/corenlp/stanford-corenlp-3.8.0.jar /home/ritwik/rd/DrQA/data/corenlp/stanford-corenlp-3.8.0-javadoc.jar /home/ritwik/rd/DrQA/data/corenlp/stanford-corenlp-3.8.0-models.jar /home/ritwik/rd/DrQA/data/corenlp/stanford-corenlp-3.8.0-sources.jar /home/ritwik/rd/DrQA/data/corenlp/xom-1.2.10-src.jar /home/ritwik/rd/DrQA/data/corenlp/xom.jar
    
    opened by RitwikGopi 15
  • hangs up at "Reading paragraphs"

    hangs up at "Reading paragraphs"

    I'm trying to run a default example. I get no errors but also no output - am I missing something? predict also seems to just hang.

    >>> process('How many counties are in the United States?')
    05/16/2019 11:25:06 AM: [ Processing 1 queries... ]
    05/16/2019 11:25:06 AM: [ Retrieving top 5 docs... ]
    05/16/2019 11:25:10 AM: [ Reading 406 paragraphs... ]

    opened by red8top 12
  • Custom corpus - document retrieval

    Hi, I am facing the following issue, which I cannot find a way around. I want to build a QA system that will answer questions about New York (any domain in general). I created a custom corpus of documents, each of which is somehow related to New York City. The string 'New York' is present in every single one of the documents. Many of these documents are Wikipedia articles.

    Issues when running retriever with custom model:

    • process("Which rivers flow through New York?", k=N); returns very irrelevant articles.
    • process("What is New York?", k=N) returns no documents at all.

    Additionally, when running the retriever with the full Wikipedia dump, both of these queries return relevant articles at the top.

    As mentioned in other issues, this is probably related to the weighting function of the ranker: IDF = log((N - Nt + 0.5) / (Nt + 0.5))

    I am not really sure how to modify the retriever to work with domain-specific questions and documents. Since the results are better on a more generic database, one solution may be to extend my corpus with dummy documents, but I find that very inelegant.

    Any pointers, ideas, suggestions will be much appreciated. Thanks in advance

    opened by xjurko 12
  • Can't generate datasets for distant supervision

    Hello,

    I can't manage to generate the datasets for DS, no matter which tokenizer is used. When attempting with '--tokenizer spacy', the script never gets beyond line 197 of generate.py: q_tokens = workers.map(tokenize_text, questions)

    02/15/2018 04:34:31 PM: [ Processing 3778 question answer pairs... ]
    02/15/2018 04:34:31 PM: [ Will save to data/ds/WebQuestions-train.dstrain and data/ds/WebQuestions-train.dsdev ]
    02/15/2018 04:34:31 PM: [ Loading data/wikipedia/docs-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz ]
    02/15/2018 04:39:45 PM: [ Ranking documents (top 5 per question)... ]
    02/15/2018 04:42:42 PM: [ Pre-tokenizing questions... ]
    

    When using another tokenizer, like '--tokenizer simple', I get the following errors:

    02/16/2018 12:17:18 PM: [ Processing 3778 question answer pairs... ]
    02/16/2018 12:17:18 PM: [ Will save to data/ds/WebQuestions-train.dstrain and data/ds/WebQuestions-train.dsdev ]
    02/16/2018 12:17:18 PM: [ Loading data/wikipedia/docs-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz ]
    02/16/2018 12:22:33 PM: [ Ranking documents (top 5 per question)... ]
    02/16/2018 12:25:26 PM: [ Pre-tokenizing questions... ]
    02/16/2018 12:25:26 PM: [ SimpleTokenizer only tokenizes! Skipping annotators: {'ner'} ]
    02/16/2018 12:25:26 PM: [ SimpleTokenizer only tokenizes! Skipping annotators: {'ner'} ]
    02/16/2018 12:25:26 PM: [ SimpleTokenizer only tokenizes! Skipping annotators: {'ner'} ]
    02/16/2018 12:25:26 PM: [ SimpleTokenizer only tokenizes! Skipping annotators: {'ner'} ]
    02/16/2018 12:25:26 PM: [ SimpleTokenizer only tokenizes! Skipping annotators: {'ner'} ]
    02/16/2018 12:25:26 PM: [ SimpleTokenizer only tokenizes! Skipping annotators: {'ner'} ]
    02/16/2018 12:25:26 PM: [ SimpleTokenizer only tokenizes! Skipping annotators: {'ner'} ]
    02/16/2018 12:25:26 PM: [ SimpleTokenizer only tokenizes! Skipping annotators: {'ner'} ]
    02/16/2018 12:25:36 PM: [ Searching documents... ]
    multiprocessing.pool.RemoteTraceback: 
    """
    Traceback (most recent call last):
      File "/usr/local/Cellar/python3/3.6.4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 119, in worker
        result = (True, func(*args, **kwds))
      File "scripts/distant/generate.py", line 170, in search_docs
        found = find_answer(paragraph, q_tokens, answer, opts)
      File "scripts/distant/generate.py", line 109, in find_answer
        for ne in q_tokens.entity_groups():
    TypeError: 'NoneType' object is not iterable
    """
    

    Any idea of what might be happening would be greatly appreciated.

    opened by ironflood 12
  • TypeError: expected string or buffer

    Hi, I have uploaded 7 documents and one of the documents is like below. "The American Civil War was fought in the United States from 1861 to 1865. The result of a long-standing controversy over slavery, war broke out in April 1861, when Confederates attacked Fort Sumter in South Carolina, shortly after President Abraham Lincoln was inaugurated. The nationalists of the Union proclaimed loyalty to the U.S. Constitution. They faced secessionists of the Confederate States of America, who advocated for states' rights to expand slavery.

    Among the 34 U.S. states in February 1861, seven Southern slave states individually declared their secession from the U.S. to form the Confederate States of America, or the South. The Confederacy grew to include eleven slave states. The Confederacy was never diplomatically recognized by the United States government, nor was it recognized by any foreign country (although Britain and France granted it belligerent status). The states that remained loyal, including the border states where slavery was legal, were known as the Union or the North. The North and South quickly raised volunteer and conscription armies that fought mostly in the South over four years. The Union finally won the war when General Robert E. Lee surrendered to General Ulysses S. Grant at the Battle of Appomattox Court House followed by a series of surrenders by Confederate generals throughout the southern states. Four years of intense combat left 620,000 to 750,000 soldiers dead, a higher number than the number of American military deaths in all other wars combined. Much of the South's infrastructure was destroyed, especially the transportation systems, railroads, mills and houses. The Confederacy collapsed, slavery was abolished, and 4 million slaves were freed. The Reconstruction Era (1863–1877) overlapped and followed the war, with the process of restoring national unity, strengthening the national government, and granting civil rights to freed slaves throughout the country. The Civil War is the most studied and written about episode in American history.

    In the 1860 presidential election, Republicans, led by Abraham Lincoln, supported banning slavery in all the U.S. territories. The Southern states viewed this as a violation of their constitutional rights and as the first step in a grander Republican plan to eventually abolish slavery. The three pro-Union candidates together received an overwhelming 82% majority of the votes cast nationally: Republican Lincoln's votes centered in the north, Democrat Stephen A. Douglas' votes were distributed nationally and Constitutional Unionist John Bell's votes centered in Tennessee, Kentucky, and Virginia. The Republican Party, dominant in the North, secured a plurality of the popular votes and a majority of the electoral votes nationally, so Lincoln was constitutionally elected president. He was the first Republican Party candidate to win the presidency. However, before his inauguration, seven slave states with cotton-based economies declared secession and formed the Confederacy. The first six to declare secession had the highest proportions of slaves in their populations, a total of 49 percent. The first seven with state legislatures to resolve for secession included split majorities for unionists Douglas and Bell in Georgia with 51% and Louisiana with 55%. Alabama had voted 46% for those unionists, Mississippi with 40%, Florida with 38%, Texas with 25%, and South Carolina cast Electoral College votes without a popular vote for president. Of these, only Texas held a referendum on secession." (copied from the Wikipedia article on the American Civil War)

    I have built the TF-IDF model successfully, but I get the error below while querying using python3.6 scripts/pipeline/interactive.py --retriever-model /home/shiva/DrQA/data/sample-tfidf-ngram=2-hash=16777216-tokenizer=corenlp.npz

    process("Ulysses S. Grant") 01/08/2018 08:46:58 PM: [ Processing 1 queries... ] 01/08/2018 08:46:58 PM: [ Retrieving top 5 docs... ] Traceback (most recent call last): File "", line 1, in File "scripts/pipeline/interactive.py", line 81, in process question, candidates, top_n, n_docs, return_context=True File "/home/shiva/DrQA/drqa/pipeline/drqa.py", line 184, in process top_n, n_docs, return_context File "/home/shiva/DrQA/drqa/pipeline/drqa.py", line 217, in process_batch for split in splits: File "/home/shiva/DrQA/drqa/pipeline/drqa.py", line 147, in _split_doc for split in regex.split(r'\n+', doc): File "/usr/local/lib/python3.6/site-packages/regex.py", line 319, in split return _compile(pattern, flags, kwargs).split(string, maxsplit, concurrent) TypeError: expected string or buffer

    Could you please suggest whether I missed anything, or whether there is any limitation in the code?

    opened by shivamani-ans 12
  • Document Reader for different Domain.

    Hi Adam,

    I was able to successfully do the POC on the Document Retriever integrated with ElasticSearch, and am now moving on to the Document Reader and need help. I am planning to do a POC on the Document Reader for a different domain and not use SQuAD-v1.1-dev/train. What are the steps? Do I have to create a JSON file similar to SQuAD-v1.1-dev.json and SQuAD-v1.1-train.json? As per the README there are 2 formats - which one do I have to develop, Format A or Format B? https://github.com/facebookresearch/DrQA/blob/master/scripts/reader/README.md

    If you could quickly list down the steps, that would be great. I know there is a README, but it does not mention the steps for a different domain.

    opened by samdash 11
  • Error when running scripts/pipeline/interactive.py

    Anyone seen this and can point me in the right direction?

    (drqa) eric@WIN-DJKY13BTNQ:~/repos/DrQA$ python scripts/pipeline/interactive.py
    05/26/2018 12:45:57 PM: [ Running on CPU only. ]
    05/26/2018 12:45:57 PM: [ Initializing pipeline... ]
    05/26/2018 12:45:57 PM: [ Initializing document ranker... ]
    05/26/2018 12:45:57 PM: [ Loading /home/eric/repos/DrQA/data/wikipedia/docs-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz ]
    05/26/2018 12:48:46 PM: [ Initializing document reader... ]
    05/26/2018 12:48:46 PM: [ Loading model /home/eric/repos/DrQA/data/reader/multitask.mdl ]
    05/26/2018 12:48:55 PM: [ Initializing tokenizers and document retrievers... ]
    Traceback (most recent call last):
      File "scripts/pipeline/interactive.py", line 70, in <module>
        tokenizer=args.tokenizer
      File "/home/eric/repos/DrQA/drqa/pipeline/drqa.py", line 140, in __init__
        initargs=(tok_class, tok_opts, db_class, db_opts, fixed_candidates)
      File "/home/eric/anaconda3/envs/drqa/lib/python3.6/multiprocessing/context.py", line 119, in Pool
        context=self.get_context())
      File "/home/eric/anaconda3/envs/drqa/lib/python3.6/multiprocessing/pool.py", line 174, in __init__
        self._repopulate_pool()
      File "/home/eric/anaconda3/envs/drqa/lib/python3.6/multiprocessing/pool.py", line 239, in _repopulate_pool
        w.start()
      File "/home/eric/anaconda3/envs/drqa/lib/python3.6/multiprocessing/process.py", line 105, in start
        self._popen = self._Popen(self)
      File "/home/eric/anaconda3/envs/drqa/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
        return Popen(process_obj)
      File "/home/eric/anaconda3/envs/drqa/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
        self._launch(process_obj)
      File "/home/eric/anaconda3/envs/drqa/lib/python3.6/multiprocessing/popen_fork.py", line 66, in _launch
        self.pid = os.fork()
    OSError: [Errno 22] Invalid argument
    
    opened by ericchansen 8
  • eq() TypeError in train.py

    Hi,

    I've been trying to run the train.py file, and am facing the following error:

    Traceback (most recent call last):
      File "scripts/reader/train.py", line 542, in <module>
        main(args)
      File "scripts/reader/train.py", line 484, in main
        validate_unofficial(args, train_loader, model, stats, mode='train')
      File "scripts/reader/train.py", line 253, in validate_unofficial
        accuracies = eval_accuracies(pred_s, target_s, pred_e, target_e)
      File "scripts/reader/train.py", line 345, in eval_accuracies
        if any([1 for _s, _e in zip(target_s[i], target_e[i])
      File "scripts/reader/train.py", line 346, in <listcomp>
        if _s == pred_s[i] and _e == pred_e[i]]):
    TypeError: eq() received an invalid combination of arguments - got (numpy.ndarray), but expected one of:
     * (Tensor other)
          didn't match because some of the arguments have invalid types: (numpy.ndarray)
     * (float other)
          didn't match because some of the arguments have invalid types: (numpy.ndarray)
    

    I see that there is some code above to convert 1D tensors to lists of lists:

    # Convert 1D tensors to lists of lists (compatibility)
    if torch.is_tensor(target_s):
        target_s = [[e] for e in target_s]
        target_e = [[e] for e in target_e]
    

    The type of pred_s[i] is numpy.ndarray.

    The type of target_s[i][0] is torch.Tensor.

    I'm not completely well versed with torch, so I am requesting help, and will meanwhile try to convert the torch.Tensor into a numpy.ndarray during the comparison. Any help on fixing this error properly would be appreciated.

    Thank you.

    opened by smiduthuri 8
  • How to increase length of the span?

    Hi Adam,

    While executing the statement below, I see the span only has a few words. How do I increase the length of the span to get better meaning?

    python scripts/pipeline/predict.py /path/to/format/A/dataset.txt

    Below is the sample I get when executing predict.py. Is there any fine-tuning of the span attribute to increase the length?

     "span": "a recommended update", "doc_score": 91.32551502098157, "span_score": 45858.703125}]
     "span": "included.Aperture 3", "doc_score": 195.0124505464773, "span_score": 7517108.0}]
     "span": "2nd generation", "doc_score": 51.732084228252496, "span_score": 1008.5970458984375}]
     "span": "2 Help", "doc_score": 64.38085395815739, "span_score": 142660.265625}]
     "span": "youre", "doc_score": 87.41417707786954, "span_score": 2699.703125}]
    
    opened by samdash 8
  • Elasticsearch integration

    Enables querying an Elasticsearch server with the Python API. We need to specify the URL, the index name, the fields to search on, and the field name containing the text to retrieve. It works as follows on an existing DrQA model:

    DrQA.process(self, query, candidates=None, top_n=1, n_docs=5, 
                 return_context=False, elastic_url='localhost:9200', elastic_index='pmpsa', 
                 elastic_fields=['content.unigrams', 'content.bigrams', 'doc_name^2'], elastic_text_field='content')
    
    CLA Signed 
    opened by lbaligand 7
  • python scripts/pipeline/interactive.py keep on running for so long?

    >>> process('What is question answering?')
    11/29/2022 08:31:54 AM: [ Processing 1 queries... ]
    11/29/2022 08:31:54 AM: [ Retrieving top 5 docs... ]
    11/29/2022 08:31:55 AM: [ Reading 106 paragraphs... ]
    /home/ubuntu/DrQA/drqa/reader/layers.py:202: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated, please use a mask with dtype torch.bool instead. (Triggered internally at ../aten/src/ATen/native/TensorAdvancedIndexing.cpp:1646.)
      scores.data.masked_fill_(y_mask.data, -float('inf'))
    /home/ubuntu/DrQA/drqa/reader/layers.py:275: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated, please use a mask with dtype torch.bool instead. (Triggered internally at ../aten/src/ATen/native/TensorAdvancedIndexing.cpp:1646.)
      scores.data.masked_fill_(x_mask.data, -float('inf'))
    /home/ubuntu/DrQA/drqa/reader/layers.py:242: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated, please use a mask with dtype torch.bool instead. (Triggered internally at ../aten/src/ATen/native/TensorAdvancedIndexing.cpp:1646.)
      xWy.data.masked_fill_(x_mask.data, -float('inf'))

    opened by ayush431 0
  • The file E:/Deep_learning/DrQA-main\data\wikipedia/docs-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz can't open

    Excuse me, when it runs the code matrix = sp.csr_matrix((loader['data'], loader['indices'], the error is:

    numpy.core._exceptions.MemoryError: Unable to allocate 7.92 GiB for an array with shape (1063277605,) and data type float64

    opened by yqiz-98 0
  • RuntimeError: CUDA error: no kernel image is available for execution on the device

    Good morning, I get the following error when running on CUDA 11:

    RuntimeError: CUDA error: no kernel image is available for execution on the device

    Does DrQA require a lower CUDA version? Is there any way to make it work on CUDA 11? Thanks a lot.

    opened by chikiuso 0
  • ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'

    root@LAPTOP-VF7DMIT4:/mnt/d/E/DL/DrQA-main/DrQA-main/DrQA/DrQA/DrQA# cd DrQA; pip install -r requirements.txt; python setup.py develop
    bash: cd: DrQA: No such file or directory
    ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'

    opened by KeyVux 0