:mag: Transformers at scale for question answering & neural search. Using NLP via a modular Retriever-Reader-Pipeline. Supporting DPR, Elasticsearch, HuggingFace's Modelhub...

Overview

Haystack is an end-to-end framework for Question Answering & Neural search that enables you to ...

... ask questions in natural language and find granular answers in your own documents.
... do semantic document search and retrieve more relevant documents for your search queries.
... search at scale through millions of documents.
... use off-the-shelf models or fine-tune them to your own domain.
... evaluate, benchmark and continuously improve your models via user feedback.
... improve chat bots by leveraging existing knowledge bases for the long tail of queries.
... automate processes by automatically applying a list of questions to new documents and using the extracted answers.

📒 Docs: Usage, Guides, API documentation ...
💻 Installation: How to install
🎨 Key components: Overview of core concepts
👀 Quick Tour: Basic explanation of concepts, options and usage
🎓 Tutorials: Jupyter/Colab Notebooks & Scripts
📊 Benchmarks: Speed & Accuracy of Retriever, Readers and DocumentStores
🔭 Roadmap: Public roadmap of Haystack
❤️ Contributing: We welcome all contributions!
🙏 Slack: Join our community on Slack
🐦 Twitter: Follow us on Twitter for news and updates
📰 Blog: Read our articles on Medium

Core Features

  • Latest models: Utilize all the latest transformer-based models (e.g. BERT, RoBERTa, MiniLM) for extractive QA, generative QA and document retrieval.
  • Modular: Multiple choices to fit your tech stack and use case. Pick your favorite database, file converter or modeling framework.
  • Open: 100% compatible with HuggingFace's model hub. Tight interfaces to other frameworks (e.g. Transformers, FARM, sentence-transformers).
  • Scalable: Scale to millions of docs via retrievers, production-ready backends like Elasticsearch / FAISS and a FastAPI REST API.
  • End-to-End: All tooling in one place: file conversion, cleaning, splitting, training, eval, inference, labeling ...
  • Developer friendly: Easy to debug, extend and modify.
  • Customizable: Fine-tune models to your own domain or implement your custom DocumentStore.
  • Continuous Learning: Collect new training data via user feedback in production & improve your models continuously

Installation

PyPi:

pip install farm-haystack

Master branch (if you want to try the latest features):

git clone https://github.com/deepset-ai/haystack.git
cd haystack
pip install --editable .

To update your installation, just do a git pull. The --editable flag will update changes immediately.

On Windows you might need:

pip install farm-haystack -f https://download.pytorch.org/whl/torch_stable.html

Key Components

  1. FileConverter: Extracts pure text from files (pdf, docx, pptx, html and many more).
  2. PreProcessor: Cleans and splits texts into smaller chunks.
  3. DocumentStore: Database storing the documents, metadata and vectors for our search. We recommend Elasticsearch or FAISS, but also offer more lightweight options for fast prototyping (SQL or In-Memory).
  4. Retriever: Fast algorithms that identify candidate documents for a given query from a large collection of documents. Retrievers narrow down the search space significantly and are therefore key for scalable QA. Haystack supports sparse methods (TF-IDF, BM25, custom Elasticsearch queries) and state-of-the-art dense methods (e.g. sentence-transformers and Dense Passage Retrieval).
  5. Reader: Neural network (e.g. BERT or RoBERTa) that reads through texts in detail to find an answer. The Reader takes multiple passages of text as input and returns top-n answers. Models are trained via FARM or Transformers on SQuAD-like tasks. You can just load a pretrained model from Hugging Face's model hub or fine-tune it on your own domain data.
  6. Generator: Neural network (e.g. RAG) that generates an answer for a given question conditioned on the retrieved documents from the retriever.
  7. Pipeline: Stick building blocks together into highly customized pipelines that are represented as Directed Acyclic Graphs (DAG). Think of it as "Apache Airflow for search".
  8. REST API: Exposes a simple API based on FastAPI for running QA search, uploading files and collecting user feedback for continuous learning.
  9. Haystack Annotate: Create custom QA labels to improve performance of your domain-specific models. Hosted version or Docker images.

Usage

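# Imports (module paths as laid out in the Haystack 0.x releases; adjust for your version)
from haystack import Document
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
from haystack.retriever.dense import DensePassageRetriever
from haystack.reader.farm import FARMReader
from haystack.pipeline import ExtractiveQAPipeline
from haystack.utils import print_answers

# DB to store your docs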
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="",
                                            index="document", embedding_dim=768,
                                            embedding_field="embedding")

# Index your docs
# (Options: Convert text from PDFs etc. via FileConverter; Split and clean docs with the PreProcessor)
docs = [Document(text="Arya accompanies her father Ned and her sister Sansa to King's Landing. Before their departure ...", meta={}), 
       ...]

document_store.write_documents(docs)

# Init Retriever: Fast algorithm to identify most promising candidate docs
# (Options: DPR, TF-IDF, Elasticsearch, Plain Embeddings ..)
retriever = DensePassageRetriever(document_store=document_store,                         
                                query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
                                )
document_store.update_embeddings(retriever)

# Init Reader: Powerful, but slower neural model 
# (Options: FARM or Transformers Framework; Extractive or generative models)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

# The Pipeline sticks together Reader + Retriever to a DAG
# There are many different pipeline types and you can easily build your own
pipeline = ExtractiveQAPipeline(reader, retriever)

# Voilà! Ask a question!
prediction = pipeline.run(query="Who is the father of Arya Stark?", top_k_retriever=10, top_k_reader=3)
print_answers(prediction, details="minimal")

[   {   'answer': 'Eddard',
        'context': """... She travels with her father, Eddard, to 
                   King's Landing when he is made Hand of the King ..."""},
    {   'answer': 'Ned',
        'context': """... girl disguised as a boy all along and is surprised 
                   to learn she is Arya, Ned Stark's daughter ..."""},
    {   'answer': 'Ned',
        'context': """... Arya accompanies her father Ned and her sister Sansa to
                   King's Landing. Before their departure ..."""}
]

Tutorials

Quick Tour

File Conversion | Preprocessing | DocumentStores | Retrievers | Readers | Pipelines | REST API | Labeling Tool

1) File Conversion

What
Different converters to extract text from your original files (pdf, docx, txt, html). While it's almost impossible to cover all types, layouts and special cases (especially in PDFs), we cover the most common formats (incl. multi-column) and extract meta information (e.g. page splits). The converters are easily extendable, so that you can customize them for your files if needed.

Available options

  • Txt
  • PDF
  • Docx
  • Apache Tika (Supports > 340 file formats)

Example

#PDF
from haystack.file_converter.pdf import PDFToTextConverter
converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["de","en"])
doc = converter.convert(file_path=file, meta=None)
# => {"text": "text first page \f text second page ...", "meta": None}

#DOCX
from haystack.file_converter.docx import DocxToTextConverter
converter = DocxToTextConverter(remove_numeric_tables=True, valid_languages=["de","en"])
doc = converter.convert(file_path=file, meta=None)
# => {"text": "some text", "meta": None}
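
The converter output shown above separates pages with a form feed character ("\f"). As a minimal sketch, per-page text could be recovered from such a dict as follows. `split_pages` is a hypothetical helper written for this illustration and is not part of Haystack:

```python
def split_pages(converted: dict) -> list:
    """Split converter output into per-page strings.

    Assumes the dict layout shown above: {"text": "...", "meta": ...},
    with "\f" (form feed) marking each page break.
    """
    return [page.strip() for page in converted["text"].split("\f")]


# Sample input modeled on the example output above
doc = {"text": "text first page \f text second page", "meta": None}
pages = split_pages(doc)
print(pages)
```

Keeping the page boundaries around can be useful if you later want to report on which page an answer was found.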

2) Preprocessing

What
Cleaning and splitting of your texts are crucial steps that will directly impact the speed and accuracy of your search. The splitting of larger texts is especially important for achieving fast query speed. The longer the texts that the retriever passes to the reader, the slower your queries.

Available Options
We provide a basic PreProcessor class that allows you to:

  • clean whitespace, headers, footers and empty lines
  • split by words, sentences or passages
  • option for "overlapping" splits
  • option to never split within a sentence

You can easily extend this class to your own custom requirements.

Example

converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["en"])

processor = PreProcessor(clean_empty_lines=True,
                         clean_whitespace=True,
                         clean_header_footer=True,
                         split_by="word",
                         split_length=200,
                         split_respect_sentence_boundary=True)
docs = []
for f_name, f_path in zip(filenames, filepaths):
    # Optional: Supply any meta data here
    # the "name" field will be used by DPR if embed_title=True, rest is custom and can be named arbitrarily
    cur_meta = {"name": f_name, "category": "a"}  # ... plus any further custom fields
    
    # Run the conversion on each file (PDF -> 1x doc)
    d = converter.convert(f_path, meta=cur_meta)
    
    # clean and split each dict (1x doc -> multiple docs)
    d = processor.process(d)
    docs.extend(d)

# at this point docs will be [{"text": "some", "meta":{"name": "myfilename", "category":"a"}},...]
document_store.write_documents(docs)
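
To make the word-based splitting and the "overlapping" option more concrete, here is a simplified, library-free sketch. It is not the PreProcessor's actual implementation: the function name and the `overlap` parameter are assumptions made for this illustration, and sentence boundaries are ignored.

```python
def split_by_words(text, split_length=200, overlap=0):
    """Split text into chunks of at most `split_length` words.

    Consecutive chunks share `overlap` words, so an answer that sits
    at a chunk border still appears in full in one of the chunks.
    """
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= overlap < split_length:
        raise ValueError("overlap must be in [0, split_length)")

    words = text.split()
    step = split_length - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current chunk has reached the end of the text
        if start + split_length >= len(words):
            break
    return chunks


chunks = split_by_words("a b c d e f g", split_length=3, overlap=1)
print(chunks)
```

Shorter chunks keep the Reader fast, while a small overlap reduces the risk of cutting an answer in half.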

3) DocumentStores

What

  • Store your texts, meta data and optionally embeddings
  • Documents should be chunked into smaller units (e.g. paragraphs) before indexing to make the results returned by the Retriever more granular and accurate.

Available Options

  • Elasticsearch
  • FAISS
  • SQL
  • InMemory

Example

# Run elasticsearch, e.g. via docker run -d -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.6.2

# Connect 
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

# Get all documents
document_store.get_all_documents()

# Query
document_store.query(query="What is the meaning of life?", filters=None, top_k=5)
document_store.query_by_embedding(query_emb, filters=None, top_k=5)
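
Conceptually, an embedding query ranks the stored vectors by their similarity to the query vector. The sketch below shows that idea with a brute-force dot product over an in-memory list. It is an illustration only: the function name, the data layout and the choice of dot product as similarity are assumptions, and production backends such as Elasticsearch or FAISS use their own index structures.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_embedding(query_emb, stored, top_k=5):
    """Return the top_k (doc_id, score) pairs, highest score first.

    `stored` is a list of (doc_id, embedding) pairs.
    """
    scored = ((dot(query_emb, emb), doc_id) for doc_id, emb in stored)
    return [(doc_id, score) for score, doc_id in heapq.nlargest(top_k, scored)]


# Toy two-dimensional embeddings, made up for this example
stored = [("doc_a", [1.0, 0.0]), ("doc_b", [0.0, 1.0]), ("doc_c", [0.5, 0.5])]
result = top_k_by_embedding([1.0, 0.2], stored, top_k=2)
print(result)
```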

-> See docs for details

4) Retrievers

What
The Retriever is a fast "filter" that can quickly go through the full document store and pass a set of candidate documents to the Reader. It is a tool for sifting out the obvious negative cases, saving the Reader from doing more work than it needs to and speeding up the querying process. There are two fundamentally different categories of retrievers: sparse (e.g. TF-IDF, BM25) and dense (e.g. DPR, sentence-transformers).

Available Options

  • DensePassageRetriever
  • ElasticsearchRetriever
  • EmbeddingRetriever
  • TfidfRetriever

Example

retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
                                  use_gpu=True,
                                  batch_size=16,
                                  embed_title=True)
retriever.retrieve(query="Why did the revenue increase?")
# returns: [Document, Document]
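
To illustrate what a sparse retriever does, here is a toy TF-IDF ranking in plain Python. This is a sketch under simplifying assumptions (whitespace tokenization, one common TF-IDF weighting variant) and not the TfidfRetriever's implementation; all names are made up for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def tfidf_rank(query, docs):
    """Return indices of docs that match the query, best match first."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for idx, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                idf = math.log(n_docs / df[term]) + 1.0
                score += (tf[term] / len(tokens)) * idf
        scores.append((score, idx))

    ranked = sorted(scores, key=lambda s: s[0], reverse=True)
    return [idx for score, idx in ranked if score > 0]


docs = [
    "revenue increased due to strong sales",
    "the weather was sunny",
    "sales and revenue both fell",
]
ranking = tfidf_rank("revenue sales", docs)
print(ranking)
```

Sparse methods like this only match on shared terms, which is why dense retrievers can help when the query and the relevant passage use different wording.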

-> See docs for details

5) Readers

What
Neural networks (mostly Transformer-based) that read through texts in detail to find an answer. Use diverse models like BERT, RoBERTa or XLNet trained via FARM or Transformers on SQuAD-like datasets. The Reader takes multiple passages of text as input and returns top-n answers with corresponding confidence scores. Both readers can load either a local model or any public model from Hugging Face's model hub.

Available Options

  • FARMReader: Reader based on FARM incl. extensive configuration options and speed optimizations
  • TransformersReader: Reader based on the pipeline class of HuggingFace's Transformers.
    Both Readers can load models directly from HuggingFace's model hub.

Example

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2",
                use_gpu=False, no_ans_boost=-10, context_window_size=500,
                top_k_per_candidate=3, top_k_per_sample=1,
                num_processes=8, max_seq_len=256, doc_stride=128)

# Optional: Training & eval
reader.train(...)
reader.eval(...)

# Predict
reader.predict(question="Who is the father of Arya Stark?", documents=documents, top_k=3)
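
The step from "multiple passages in" to "top-n answers out" can be pictured as collecting answer candidates from every passage and keeping the highest-scoring ones. The sketch below shows only that aggregation step. The candidate dicts and their scores are invented for illustration, and the real Readers apply additional logic (for example handling of "no answer" predictions) that is not modeled here.

```python
import heapq


def top_n_answers(candidates_per_passage, top_n=3):
    """Merge candidates from all passages and keep the top_n by score."""
    all_candidates = [c for passage in candidates_per_passage for c in passage]
    return heapq.nlargest(top_n, all_candidates, key=lambda c: c["score"])


# One inner list per passage; scores are made up for this example
candidates = [
    [{"answer": "Eddard", "score": 0.93}, {"answer": "King", "score": 0.12}],
    [{"answer": "Ned", "score": 0.81}],
    [{"answer": "Sansa", "score": 0.05}],
]
best = top_n_answers(candidates, top_n=2)
print(best)
```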

-> See docs for details

6) Pipelines

What
In order to build modern search pipelines, you need two things: powerful building blocks and a flexible way to stick them together. The Pipeline class is built exactly for this purpose and enables many search scenarios beyond QA. The core idea: you can build a Directed Acyclic Graph (DAG) where each node is one "building block" (Reader, Retriever, Generator ...).

Available Options

  • Standard nodes: Reader, Retriever, Generator ...
  • Join nodes: For example, combine results of multiple retrievers via the JoinDocuments node
  • Decision Nodes: For example, classify an incoming query and, depending on the result, execute only a certain branch of your graph

Example
A minimal Open-Domain QA Pipeline:

p = Pipeline()
p.add_node(component=retriever, name="ESRetriever1", inputs=["Query"])
p.add_node(component=reader, name="QAReader", inputs=["ESRetriever1"])
res = p.run(query="What did Einstein work on?", top_k_retriever=1)

You can draw the DAG to better inspect what you are building:

p.draw(path="custom_pipe.png")
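
The idea behind a join node, as listed under the available options, can be sketched without Haystack: take the result lists of several retrievers, deduplicate by document id and sort by score. This is a hypothetical illustration and not the JoinDocuments implementation. It also assumes the scores of the retrievers are comparable, which in general they are not without some normalization.

```python
def join_documents(*result_lists):
    """Merge retriever results, keeping the best score per document id."""
    merged = {}
    for results in result_lists:
        for doc in results:
            best = merged.get(doc["id"])
            if best is None or doc["score"] > best["score"]:
                merged[doc["id"]] = doc
    return sorted(merged.values(), key=lambda d: d["score"], reverse=True)


# Invented results of a sparse and a dense retriever
sparse_hits = [{"id": "d1", "score": 0.4}, {"id": "d2", "score": 0.3}]
dense_hits = [{"id": "d2", "score": 0.9}, {"id": "d3", "score": 0.1}]
joined = join_documents(sparse_hits, dense_hits)
print([d["id"] for d in joined])
```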

-> See docs for details and example of more complex pipelines

7) REST API

What
A simple REST API based on FastAPI is provided to:

  • search answers in texts (extractive QA)
  • search answers by comparing user question to existing questions (FAQ-style QA)
  • collect & export user feedback on answers to gain domain-specific training data (feedback)
  • allow basic monitoring of requests (currently via APM in Kibana)

Example
To serve the API, adjust the values in rest_api/config.py and run:

gunicorn rest_api.application:app -b 0.0.0.0:8000 -k uvicorn.workers.UvicornWorker -t 300

You will find the Swagger API documentation at http://127.0.0.1:8000/docs

8) Labeling Tool

  • Use the hosted version (Beta) or deploy it yourself with the Docker Images.
  • Create labels with different techniques: Come up with questions (+ answers) while reading passages (SQuAD style) or have a set of predefined questions and look for answers in the document (~ Natural Questions).
  • Structure your work via organizations, projects, users
  • Upload your documents or import labels from an existing SQuAD-style dataset

❤️ Contributing

We are very open to contributions from the community - be it the fix of a small typo or a completely new feature! You don't need to be a Haystack expert to provide meaningful improvements. To avoid any extra work on either side, please check our Contributor Guidelines first.

We'd also like to invite you to our Slack community channels. Please join here!

Tests will automatically run for every commit you push to your PR. You can also run them locally by executing pytest in your terminal from the root folder of this repository:

All tests:

cd test
pytest

You can also only run a subset of tests by specifying a marker and the optional "not" keyword:

cd test
pytest -m "not elasticsearch"
pytest -m elasticsearch
pytest -m generator
pytest -m tika
pytest -m "not slow"
...
Comments
  • Introduce QueryClassifier

    Is your feature request related to a problem? Please describe. With the new flexible Pipelines introduced in https://github.com/deepset-ai/haystack/pull/596, we can build way more flexible and complex search routes. One common challenge that we saw in deployments: We need to distinguish between real questions and keyword queries that come in. We only want to route questions to the Reader branch in order to maximize the accuracy of results and minimize computation efforts/costs.

    Describe the solution you'd like

    New class QueryClassifier that takes a query as input and determines if it is a question or a keyword query. We could start with a very basic version (maybe even rule-based) here and later extend it to use a classification model.
    The run method would need to return query, "output_1" for a question and query, "output_2" for a keyword query in order to allow branching in the DAG.
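
A very basic rule-based version of the idea above could look like the following sketch. It is hypothetical: the class name, the word list and the heuristics are assumptions made for illustration, and the class does not inherit from any Haystack base class. Only the return convention (query plus "output_1" or "output_2") follows the description above.

```python
# Words that commonly start a question (a small, incomplete list)
QUESTION_WORDS = {
    "who", "what", "when", "where", "why", "how", "which",
    "is", "are", "does", "do", "can",
}


class RuleBasedQueryClassifier:
    """Route questions to "output_1" and keyword queries to "output_2"."""

    def run(self, query):
        stripped = query.strip()
        tokens = stripped.lower().split()
        first_word = tokens[0] if tokens else ""
        is_question = stripped.endswith("?") or first_word in QUESTION_WORDS
        edge = "output_1" if is_question else "output_2"
        return query, edge


classifier = RuleBasedQueryClassifier()
print(classifier.run("Who is the father of Arya Stark?"))
print(classifier.run("arya stark father"))
```

A rule set like this is cheap and transparent, which makes it a reasonable baseline before moving to a trained classification model.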

    Describe alternatives you've considered Later it might also make sense to distinguish into more types (e.g. full sentence but not a question)

    Additional context We could use it like this in a pipeline image

    type:feature Contributions wanted! good second issue 
    opened by tholor 42
  • FARMReader slow

    Question I am running one of the samples in a K8s pod (GPU). It gets stuck in FARMReader for a long time (30+ mins) and then times out. Any reason? All I added was 2 .txt documents.

    reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2",
                        use_gpu=True, return_no_answer=True, no_ans_boost=0.7, context_window_size=200)

    retriever = ElasticsearchRetriever(document_store=document_store)

    pipe = ExtractiveQAPipeline(reader, retriever)

    # predict n answers
    prediction = pipe.run(query=question, top_k_retriever=10, top_k_reader=3)
    

    [2021-05-19 23:34:10 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:8)
    05/19/2021 23:34:10 - INFO - farm.infer - Got ya 23 parallel workers to do inference ...
    05/19/2021 23:34:10 - INFO - farm.infer - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    05/19/2021 23:34:10 - INFO - farm.infer - /w\ /w\ /w\ /w\ /w\ /w\ /w\ /|\ /w\ /w\ /w\ /w\ /w\ /w\ /|\ /w\ /|\ /|\ /|\ /|\ /w\ /w\ /|
    05/19/2021 23:34:10 - INFO - farm.infer - /'\ / \ /'\ /'\ / \ / \ /'\ /'\ /'\ /'\ /'\ /'\ / \ /'\ /'\ / \ /'\ /'\ /'\ /'\ / \ / \ /'
    05/19/2021 23:34:10 - INFO - farm.infer -
    05/19/2021 23:34:10 - INFO - elasticsearch - POST http://10.x.x.x:8071/sidx/_search [status:200 request:0.003s]
    05/19/2021 23:34:10 - WARNING - farm.data_handler.dataset - Could not determine type for feature 'labels'. Converting now to a tensor of default type long.
    05/19/2021 23:34:10 - WARNING - farm.data_handler.dataset - Could not determine type for feature 'labels'. Converting now to a tensor of default type long.
    [2021-05-19 23:34:40 +0000] [8] [WARNING] Worker graceful timeout (pid:8)
    [2021-05-19 23:34:42 +0000] [8] [INFO] Worker exiting (pid: 8)

    
    
    topic:speed 
    opened by bappctl 31
  • proposal: Add PromptNode based on LLM

    Related Issues

    • Proposes a fix for #3306

    Proposed Changes:

    This feature proposes adding PromptNode wrapping an LLM to be used in the Haystack ecosystem.

    How did you test it?

    N/A

    Notes for the reviewer

    See attached issues and the colab notebook for the high-fidelity version of this proposal.

    proposal proposal:active 
    opened by vblagoje 30
  • Update minimum selenium version supported for crawler

    Related Issue(s): Issue #2920

    Proposed changes:

    • Update minimum requirements in setup.cfg
    • Change how the error is raised when running the crawler in google colab so the trace is more obvious

    topic:dependencies 
    opened by sjrl 27
  • Using Columns names instead of ORM to get all documents

    To improve #601

    Generally, ORM objects kept in memory cause performance issues. Hence, using column names directly improves memory usage and performance. Refer to StackOverflow.

    action:benchmark 
    opened by lalitpagaria 27
  • Rasa Integration

    Is your feature request related to a problem? Please describe.

    One of the typical use cases that we saw in the community is using Haystack to boost conversational assistants / chat bots on the long tail of queries. As you can't think of all possible intents beforehand, a QA model is a powerful option to cover unforeseen "information queries".

    The integration of Haystack with the existing conversational frameworks still requires a lot of manual effort and implementation work.

    Describe the solution you'd like Let's build a tighter integration with Rasa!

    From what I see, Rasa supports external "action servers" that can execute custom code and get triggered via a REST API.

    Let's build a "Haystack Action Server", that gets a query and returns answers or documents. From what I see, we only need to comply with this API spec: https://rasa.com/docs/action-server/http-api-spec

    More info about the Action Servers: https://rasa.com/docs/action-server/actions

    Contributions welcome :)

    type:feature good first issue Contributions wanted! 
    opened by tholor 24
  • Issue with Milvus

    Describe the bug We installed farm_reader.

    Error message 1) We installed farm-haystack==0.9.0 2) We installed pymilvus via pip3 install pymilvus==1.1.2

    File "/app/main.py", line 6, in <module>
      from haystack import document_store
    File "/usr/local/lib/python3.7/site-packages/haystack/__init__.py", line 5, in <module>
      from haystack.finder import Finder
    File "/usr/local/lib/python3.7/site-packages/haystack/finder.py", line 8, in <module>
      from haystack.reader.base import BaseReader
    File "/usr/local/lib/python3.7/site-packages/haystack/reader/__init__.py", line 1, in <module>
      from haystack.reader.farm import FARMReader
    File "/usr/local/lib/python3.7/site-packages/haystack/reader/farm.py", line 21, in <module>
      from haystack.document_store.base import BaseDocumentStore
    File "/usr/local/lib/python3.7/site-packages/haystack/document_store/__init__.py", line 4, in <module>
      from haystack.document_store.milvus import MilvusDocumentStore
    File "/usr/local/lib/python3.7/site-packages/haystack/document_store/milvus.py", line 7, in <module>
      from milvus import IndexType, MetricType, Milvus, Status
    ModuleNotFoundError: No module named 'milvus'

    Expected behavior It was working until 25th Jan and it stopped working.

    System:

    • OS: Ubuntu
    • GPU/CPU: CPU
    • Haystack version (commit or version number): farm_reader
    • DocumentStore:
    • Reader:
    • Retriever:
    type:bug topic:document_store topic:dependencies journey:first steps 
    opened by HAIXteam 22
  • Add QueryClassifier incl. baseline models

    Proposed changes:

    • Query classifier updated

    Status (please check what you already did):

    • [] First draft (up for discussions & feedback)
    • [x] Final code
    • [x] Added tests
    • [ ] Updated documentation

    Discussion Points: I have made the changes as per the reviews in the other PR.

    Issue: Linked Issue

    opened by shahrukhx01 22
  • filter parameter ignored in search post /models/{model_id}/doc-qa

    Describe the bug

    We are making a QA request with the following example.

    Screenshot 2021-03-29 at 17 13 35

    Everything works fine except the filter parameter. It seems like the filter parameter is never mapped to the Question object.

    Screenshot 2021-03-29 at 17 13 13

    In the log output you see that the filter is always using the default value that we defined in request.py instead of the one in the post request

    Screenshot 2021-03-29 at 17 16 03 Screenshot 2021-03-29 at 17 17 21

    Expected behavior custom filter should be within the question_request

    • Haystack version (commit or version number): 32050fdce3cc6d55ed744e2475b8c2dcee2edd4b
    type:bug 
    opened by Armbruj 22
  • Haystack with Albert is awesome! XLNet question

    I am in the midst of evaluating Haystack with Albert and so far it looks awesome. Loving it, thanks for sharing.

    I missed the whole Game of Thrones fantasy/drama phenomenon, so for a tutorial I could understand and relate to, I went looking for other content to use with your Tutorial1_Basic_QA_Pipeline.ipynb notebook. Being a Porschephile I settled on:

    import wikipedia
    
    porsche_wikis = wikipedia.search("Porsche", results=25)
    doc_dir = "data/porsche/"
    
    for wiki in porsche_wikis:
        html_page = wikipedia.page(title = wiki, auto_suggest = False)
        text_file = open(doc_dir + wiki.replace('/', ' ') + ".txt", "w+")
        text_file.write(html_page.content)
        text_file.close()
        print(wiki)
    

    I can relate to the above content and ask relevant questions of it "all day long". All other code in your notebook remains the same, except I use my Albert model for QA and it works well:

    reader = FARMReader(model_name_or_path="ahotrod/albert_xxlargev1_squad2_512", 
    use_gpu=True)
    

    For my application/project, I would like to also evaluate XLNet performance with Haystack but I am having trouble loading my XLNet model:

    reader = FARMReader(model_name_or_path="ahotrod/xlnet_large_squad2_512",
    use_gpu=True)
    

    Attached is the complete terminal output text, but bottom-line the error I get is:

    AttributeError: 'XLNetForQuestionAnswering' object has no attribute 'qa_outputs'

    output_term.txt

    This XLNet model was fine-tuned on Transformers v2.1.1 and is the best I have because I and others are having problems fine-tuning XLNet_large under Transformers v2.4.1, https://github.com/huggingface/transformers/issues/2651

    Perhaps this fine-tuned XLNet model & Transformers v2.1.1 is not compatible/missing the attribute mentioned in the error message?

    Looking forward to additional FARM/Haystack QA capabilities you have in the works, thanks for your efforts!

    type:bug 
    opened by ahotrod 22
  • feat: Bump python to 3.10 for gpu docker image, use nvidia/cuda

    Related Issues

    • fixes #3303

    Proposed Changes:

    • Based GPU image on nvidia/cuda image
    • Bump GPU image to Python 3.9
    • CPU image already uses 3.10 - didn't change it

    I wanted to update python to 3.10, currently available via ppa:deadsnakes/ppa apt repository. However, the repository hosting the python 3.10 image for ubuntu seems so overwhelmed that the build timed out, waiting for gpg keys multiple times. These timeouts impede the build process and make it less reliable. Python 3.9 is available via more "standard" repositories - let's use those until 3.10 gets released.

    How did you test it?

    Tested by running haystack demo, need to set the right pipeline path, but the containers start ok.

    Notes for the reviewer

    Supersedes https://github.com/deepset-ai/haystack/pull/3323. Note how I had to remove the beir dependency. Otherwise, I would get

    AttributeError: module 'faiss' has no attribute 'swigfaiss'
    

    topic:docker 
    opened by vblagoje 21
  • Add missing docstrings to PromptNode, PromptTemplate and PromptModel

    Related Issues

    • fixes #3806

    Proposed Changes:

    Adds missing pydoc

    How did you test it?

    No new tests needed

    Notes for the reviewer

    cc @agnieszka-m please check the added docs

    opened by vblagoje 0
  • bug: The `PromptNode` handles all parameters as lists without checking if they are in fact lists

    This is to close #3819

    The new PromptNode cannot work with a Question-Answering Pipeline with Retrievers, because the Retriever expects the query parameter as a string, while the PromptNode assumes that all supplied parameters are lists and NOT strings. Strings will be silently handled as a list by the PromptNode and so split into characters by this line.

    Currently the PromptNode will split that string as if it would be a list, so it will only use the first character of the query string to be substituted into the constructed prompt, which is what this bugfix changes.

    Mock-example:

    prompt_node = PromptNode(prompt_model, default_prompt_template="question-answering")
    
    pipeline = Pipeline()
    pipeline.add_node(component=bm25_retriever, name="bm25_retriever", inputs=["Query"])
    pipeline.add_node(component=prompt_node, name="prompt_node", inputs=["bm25_retriever"])
    output = pipeline.run(query="What are subscription warrants?")  # <= if `query` is a list, then that won't work with the retriever,
                                                                  # while if it is a string, then that won't work with the PromptNode
    

    For the above example the PromptNode will construct the prompt such as: `Given the context please answer the question. Context: ....; Question: W; Answer:`

    Observe the W in the place of the question inside the prompt. This happens because the query="What are subscription warrants?" will be split as if it would be a list, so its first character will be inserted into the prompt instead of the whole question.

    Related Issues

    • fixes #3819

    Proposed Changes:

    The bug was caused by this line splitting all prompt argument values, whether or not they are splittable (e.g. lists). To ensure that the line also works for strings, I have added a line before it that turns non-lists into lists first.
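
The effect described above can be reproduced in a few lines of plain Python. This sketch is not the actual PromptNode code; `first_prompt_value` and `as_list` are hypothetical helpers that only mimic "treat the parameter as a list" and "turn non-lists into lists first".

```python
def first_prompt_value(value):
    """Mimics code that assumes `value` is a list and takes its first item."""
    return value[0]


def as_list(value):
    """Wrap anything that is not already a list into a one-element list."""
    return value if isinstance(value, list) else [value]


query = "What are subscription warrants?"

# Indexing a string yields its first character, without any error
buggy = first_prompt_value(query)

# Wrapping the string first yields the whole query
fixed = first_prompt_value(as_list(query))

print(repr(buggy))
print(repr(fixed))
```

Because strings are sequences in Python, the faulty path raises no exception, which is why the problem only shows up in the constructed prompt.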

    How did you test it?

    I wrote a dedicated unit test for it - plus I have also manually checked the prompt that gets created with the fixed version to ensure that it contains the whole query and not only its first character.

    Notes for the reviewer

    @vblagoje , we have already discussed this on Discord and you have agreed to have me raise this as a PR, see https://discord.com/channels/993534733298450452/993539071815200889/1060815327262408734

    opened by zoltan-fedor 0
  • The new `PromptNode` cannot work with a `Question-Answering` Pipeline with `Retrievers`

    Describe the bug The new PromptNode cannot work with a Question-Answering Pipeline with Retrievers, because the Retriever expects the query parameter as a string, while the PromptNode assumes that all supplied parameters are lists and NOT strings. Strings will be silently handled as a list by the PromptNode and so split into characters by this line.

    Currently the PromptNode will split that string as if it would be a list, so it will only use the first character of the query string to be substituted into the constructed prompt.

    Mock-example:

    prompt_node = PromptNode(prompt_model, default_prompt_template="question-answering")
    
    pipeline = Pipeline()
    pipeline.add_node(component=bm25_retriever, name="bm25_retriever", inputs=["Query"])
    pipeline.add_node(component=prompt_node, name="prompt_node", inputs=["bm25_retriever"])
    output = pipeline.run(query="What are subscription warrants?")  # <= if `query` is a list, then that won't work with the retriever,
                                                                  # while if it is a string, then that won't work with the PromptNode
    

    For the above example the PromptNode will construct the prompt such as: `Given the context please answer the question. Context: ....; Question: W; Answer:`

    Observe the W in the place of the question inside the prompt. This happens because the query="What are subscription warrants?" will be split as if it would be a list, so its first character will be inserted into the prompt instead of the whole question.

    Error message No error message, but the prompt sent to the model is wrong, which makes this a silent error. You can only notice that something is wrong by inspecting the returned answers and seeing that they have no relationship to the original query.

    Expected behavior When a string is supplied as a prompt value to the PromptNode, it should not be split into characters. The PromptNode should accept non-list values and wrap them in a list before processing them as lists.
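
The expected behavior can be illustrated with a minimal, library-independent sketch. The helper name `as_list` is hypothetical and is not part of Haystack; it only shows why indexing a raw string yields a single character and how wrapping the value first avoids that.

```python
from typing import Any, List


def as_list(value: Any) -> List[Any]:
    """Wrap a single value in a list; pass real lists and tuples through."""
    if isinstance(value, str):
        # A string is a sequence of characters, so it must be wrapped explicitly.
        return [value]
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


query = "What are subscription warrants?"

# Indexing the raw string yields its first character: the silent failure described above.
first_item_unsafe = query[0]

# Normalizing first keeps the whole question intact.
first_item_safe = as_list(query)[0]
```

The string check comes first because a `str` would otherwise be treated like any other sequence.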

    Additional context This has been discussed with @vblagoje at https://discord.com/channels/993534733298450452/993539071815200889/1060723643841249320 and agreed that I will submit a PR for it.

    To Reproduce See the above code example.

    FAQ Check

    System:

    • OS: Any
    • GPU/CPU: Any
    • Haystack version (commit or version number): 1.12.2
    • DocumentStore: Any
    • Reader: PromptNode
    • Retriever: Any
    opened by zoltan-fedor 0
  • Cannot run crawler after full install

    Cannot run crawler after full install

    Describe the bug

    I am trying to run the crawler and ran both a full install of haystack (!pip install 'farm-haystack[all]') as well as just the crawler component (!pip install 'farm-haystack[crawler]'). When trying to import I get the following error.

    Error message

    ModuleNotFoundError: No module named 'webdriver_manager'

    The above exception was the direct cause of the following exception:

    ImportError Traceback (most recent call last)

    ImportError: Failed to import 'haystack.nodes.connector.crawler', which is an optional component in Haystack.

    To Reproduce

    basically just ran the install plus import in google colab, followed the docs for the crawler
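
One way to narrow down a report like this is to check, in the same environment, whether the module named in the traceback can be found at all. The sketch below uses only the Python standard library; the helper name `missing_modules` is hypothetical, and it says nothing about which Haystack extra is supposed to install the module.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# 'webdriver_manager' is the module named in the traceback above.
missing = missing_modules(["webdriver_manager"])
if missing:
    print("Missing optional modules:", ", ".join(missing))
```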

    FAQ Check

    System:

    • OS:
    • GPU/CPU:
    • Haystack version (commit or version number):
    • DocumentStore:
    • Reader:
    • Retriever:
    opened by RobinSrimal 1
  • Add a `verbose` option to PromptNode to let users understand the prompts being used

    Add a `verbose` option to PromptNode to let users understand the prompts being used

    Is your feature request related to a problem? Please describe. As a developer working with LLMs via the PromptNode, I want to have more transparency over the exact prompt being sent to the LLM. This helps my work in several scenarios:

    • helps me to debug my pipeline in case the generated predictions are bad (e.g. because wrong variables got ingested or prompt is too long)
    • helps me to learn more about prompts and what's behind existing prompt templates
    • helps me to understand the flow of more complex pipelines where multiple prompts are generated at different steps

    Describe the solution you'd like An optional parameter that I can set to make the logs of the node more verbose. Could be a log level, could be a debug param like we have in Pipelines.run() or a verbose param. Ideally, it should be configurable at init time and for each run of the node.
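
A rough sketch of what such an option could look like, using a stand-in class rather than the real PromptNode. The class `VerbosePrompter`, its `build_prompt` method, and the `verbose` parameter are all hypothetical names chosen for illustration; the request above leaves the actual design open.

```python
import logging
from string import Template
from typing import Optional

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("prompt_sketch")


class VerbosePrompter:
    """Hypothetical stand-in for a prompt-building node; not Haystack's PromptNode."""

    def __init__(self, template: str, verbose: bool = False):
        self.template = Template(template)
        self.verbose = verbose  # init-time default

    def build_prompt(self, verbose: Optional[bool] = None, **variables: str) -> str:
        prompt = self.template.substitute(**variables)
        # A per-call value overrides the init-time default.
        use_verbose = self.verbose if verbose is None else verbose
        if use_verbose:
            logger.info("Prompt sent to the model: %s", prompt)
        return prompt


prompter = VerbosePrompter("Context: $context; Question: $question; Answer:")
prompt = prompter.build_prompt(
    context="...", question="What are subscription warrants?", verbose=True
)
```

Logging the fully substituted prompt, rather than the template, is what would expose problems such as a wrong variable being ingested.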

    type:feature topic:LLM 
    opened by tholor 1
  • fix: gracefully handle `FileExistsError` during `Preprocessor` resource download

    fix: gracefully handle `FileExistsError` during `Preprocessor` resource download

    Related Issues

    • fixes https://github.com/deepset-ai/haystack/issues/3514

    Proposed Changes:

    • fix: use temp path for downloading punkt resources
    • fix: gracefully handle file exists error during download
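
The two changes above follow a common pattern, sketched here with the standard library only. This is not the code from the PR: `install_resource` and the `fetch` callable are hypothetical, and the sketch assumes the resource is a directory that can be moved into place after a private download.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def install_resource(fetch: Callable[[Path], None], target_dir: Path) -> Path:
    """Fetch into a private temp dir, then move it into place.

    If another process created target_dir in the meantime, keep that copy
    instead of failing.
    """
    target_dir = Path(target_dir)
    if target_dir.exists():
        return target_dir
    target_dir.parent.mkdir(parents=True, exist_ok=True)
    tmp_dir = Path(tempfile.mkdtemp(dir=target_dir.parent))
    try:
        fetch(tmp_dir)  # hypothetical callable that writes the files into tmp_dir
        try:
            os.rename(tmp_dir, target_dir)
        except OSError:
            # FileExistsError is a subclass of OSError. If the target now
            # exists, someone else finished first and their copy is kept.
            if not target_dir.exists():
                raise
    finally:
        # No-op after a successful rename; cleans up the temp dir otherwise.
        shutil.rmtree(tmp_dir, ignore_errors=True)
    return target_dir
```

Downloading into a temporary directory means a half-finished download is never visible at the final path, so a concurrent process either sees nothing or sees a complete resource.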

    How did you test it?

    • unit test

    Notes for the reviewer

    Checklist

    topic:preprocessing 
    opened by wochinge 0
Releases(v1.12.2)
  • v1.12.2(Dec 22, 2022)

    What's Changed

    • Fixing the query_batch method of the deepsetcloud document store by @zoltan-fedor in https://github.com/deepset-ai/haystack/pull/3724
    • build: upgrade torch and let transformers pick the version by @julian-risch in https://github.com/deepset-ai/haystack/pull/3727
    • fix: Removed overlooked torch scatter references by @sjrl in https://github.com/deepset-ai/haystack/pull/3719

    Full Changelog: https://github.com/deepset-ai/haystack/compare/v1.12.1...v1.12.2

  • v1.12.2rc1(Dec 22, 2022)

    What's Changed

    • Fixing the query_batch method of the deepsetcloud document store by @zoltan-fedor in https://github.com/deepset-ai/haystack/pull/3724
    • build: upgrade torch and let transformers pick the version by @julian-risch in https://github.com/deepset-ai/haystack/pull/3727
    • fix: Removed overlooked torch scatter references by @sjrl in https://github.com/deepset-ai/haystack/pull/3719

    Full Changelog: https://github.com/deepset-ai/haystack/compare/v1.12.1...v1.12.2rc1

  • v1.12.1(Dec 21, 2022)

    ⭐ Highlights

    Large Language Models with PromptNode

    Introducing PromptNode, a new feature that brings the power of large language models (LLMs) to various NLP tasks. PromptNode is an easy-to-use, customizable node you can run on its own or in a pipeline. We've designed the API to be user-friendly and suitable for everyday experimentation, but also fully compatible with production-grade Haystack deployments.

    By setting a prompt template for a PromptNode you define what task you want it to do. This way, you can have multiple PromptNodes in your pipeline, each performing a different task. But that's not all. You can also inject the output of one PromptNode into the input of another one.

    Out of the box, we support both Google T5 Flan and OpenAI GPT-3 models, and you can even mix and match these models in your pipelines.

    from haystack.nodes.prompt import PromptNode
    
    # Initialize the node:
    prompt_node = PromptNode("google/flan-t5-base")  # try also 'text-davinci-003' if you have an OpenAI key
    
    prompt_node("What is the capital of Germany?")
    

    This node can do a lot more than simply query LLMs: it can manage prompt templates, run batches, share models among instances, be chained together in pipelines, and more. Check its documentation for details!

    Support for BM25Retriever in InMemoryDocumentStore

    InMemoryDocumentStore has always been the go-to document store for small prototypes. The addition of BM25 support makes it officially one of the document stores to support all Retrievers available to Haystack, just like FAISS and Elasticsearch-like stores, but without the external dependencies. Don't use it in your million-documents-throughput deployments to production, though. It's not the fastest document store out there.

    :trophy: Honorable mention to @anakin87 for this outstanding contribution, among many many others! :trophy:

    Haystack is always open to external contributions, and every little bit is appreciated. Don't know where to start? Have a look at the Contributors Guidelines.

    Extended support for Cohere and OpenAI embeddings

    We enabled EmbeddingRetriever to use the latest Cohere multilingual embedding models and OpenAI embedding models.

    Simply use the model's full name (along with your API key) in EmbeddingRetriever to get them:

    # Cohere
    retriever = EmbeddingRetriever(embedding_model="multilingual-22-12", batch_size=16, api_key=api_key)
    # OpenAI
    retriever = EmbeddingRetriever(embedding_model="text-embedding-ada-002", batch_size=32, api_key=api_key, max_seq_len=8191)
    

    Speeding up dense searches in batch mode (Elasticsearch and OpenSearch)

    Whenever you need to execute multiple dense searches at once, ElasticsearchDocumentStore and OpenSearchDocumentStore can now do it in parallel. This not only speeds up run_batch and eval_batch for dense pipelines when used with those document stores but also significantly speeds up multi-embedding retrieval pipelines like, for example, MostSimilarDocumentsPipeline.

    For this, we measured a speed up of up to 49% on a realistic dataset.

    Under the hood, our newly introduced query_by_embedding_batch document store function uses msearch to unleash the full power of your Elasticsearch/OpenSearch cluster.
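
To give a feel for what a batched request looks like, here is a minimal sketch of an Elasticsearch-style multi-search payload for several query embeddings. It is not Haystack's implementation: the function name `build_msearch_payload`, the index name, the `embedding` field name, and the choice of a `script_score` query with `cosineSimilarity` are all assumptions made for illustration.

```python
import json
from typing import List, Sequence


def build_msearch_payload(
    index: str,
    query_embeddings: Sequence[Sequence[float]],
    top_k: int = 10,
    embedding_field: str = "embedding",  # assumed field name
) -> str:
    """Build a newline-delimited JSON body: one header line and one query line per search."""
    lines: List[str] = []
    for embedding in query_embeddings:
        lines.append(json.dumps({"index": index}))
        lines.append(
            json.dumps(
                {
                    "size": top_k,
                    "query": {
                        "script_score": {
                            "query": {"match_all": {}},
                            "script": {
                                "source": f"cosineSimilarity(params.query_vector, '{embedding_field}') + 1.0",
                                "params": {"query_vector": list(embedding)},
                            },
                        }
                    },
                }
            )
        )
    # The multi-search body must end with a newline.
    return "\n".join(lines) + "\n"


payload = build_msearch_payload("document", [[0.1, 0.2], [0.3, 0.4]], top_k=2)
```

All searches travel in one request, so the cluster can work on them concurrently instead of waiting for one round trip per query.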

    :warning: Deprecated Docker images discontinued

    1.12 is the last release we're shipping with the old Docker images deepset/haystack-cpu, deepset/haystack-gpu, and their relative tags. We'll remove the corresponding, deprecated Docker files /Dockerfile, /Dockerfile-GPU, and /Dockerfile-GPU-minimal from the codebase after the release.

    What's Changed

    Pipeline

    • fix: ParsrConverter fails on pages without text by @anakin87 in https://github.com/deepset-ai/haystack/pull/3605
    • fix: Convert eval metrics to python float by @tstadel in https://github.com/deepset-ai/haystack/pull/3612
    • feat: add support for BM25Retriever in InMemoryDocumentStore by @anakin87 in https://github.com/deepset-ai/haystack/pull/3561
    • chore: fix return type of aggregate_labels by @tstadel in https://github.com/deepset-ai/haystack/pull/3617
    • refactor: change MultiModal retriever to be of type DenseRetriever by @mayankjobanputra in https://github.com/deepset-ai/haystack/pull/3598
    • fix: Move entire forward pass of TableQA within torch.no_grad() by @sjrl in https://github.com/deepset-ai/haystack/pull/3636
    • feat: add offsets_in_context to evaluation result by @julian-risch in https://github.com/deepset-ai/haystack/pull/3640
    • bug: Use tqdm auto instead of plain tqdm by @vblagoje in https://github.com/deepset-ai/haystack/pull/3672
    • fix: monkey patch for SklearnQueryClassifier by @anakin87 in https://github.com/deepset-ai/haystack/pull/3678
    • feat: Update table reader tests to check the answer scores by @sjrl in https://github.com/deepset-ai/haystack/pull/3641
    • feat: Adds all_terms_must_match parameter to BM25Retriever at runtime by @ugm2 in https://github.com/deepset-ai/haystack/pull/3627
    • fix: fix PreProcessor split_by schema by @ZanSara in https://github.com/deepset-ai/haystack/pull/3680
    • refactor: Generate JSON schema when missing by @masci in https://github.com/deepset-ai/haystack/pull/3533
    • refactor: replace torch.no_grad with torch.inference_mode (where possible) by @anakin87 in https://github.com/deepset-ai/haystack/pull/3601
    • Adjust get_type() method for pipelines by @vblagoje in https://github.com/deepset-ai/haystack/pull/3657
    • refactor: improve Multilabel design by @anakin87 in https://github.com/deepset-ai/haystack/pull/3658
    • feat: Update cohere embedding models #3704 by @vblagoje https://github.com/deepset-ai/haystack/pull/3704
    • feat: Enable text-embedding-ada-002 for EmbeddingRetriever #3721 by @vblagoje https://github.com/deepset-ai/haystack/pull/3721
    • feat: Expand LLM support with PromptModel, PromptNode, and PromptTemplate by @vblagoje in https://github.com/deepset-ai/haystack/pull/3667

    DocumentStores

    • fix: Flatten DocumentClassifier output in SQLDocumentStore by @anakin87 in https://github.com/deepset-ai/haystack/pull/3273
    • refactor: move milvus tests to their own module by @masci in https://github.com/deepset-ai/haystack/pull/3596
    • feat: store metadata using JSON in SQLDocumentStore by @masci in https://github.com/deepset-ai/haystack/pull/3547
    • fix: Pin faiss-cpu as 1.7.3 seems to have problems by @masci in https://github.com/deepset-ai/haystack/pull/3603
    • refactor: Move InMemoryDocumentStore tests to their own class by @masci in https://github.com/deepset-ai/haystack/pull/3614
    • chore: remove redundant tests by @masci in https://github.com/deepset-ai/haystack/pull/3620
    • refactor: Weaviate query with filters by @ZanSara in https://github.com/deepset-ai/haystack/pull/3628
    • fix: use 9200 as the default port in launch_opensearch() by @masci in https://github.com/deepset-ai/haystack/pull/3630
    • fix: revert Weaviate query with filters and improve tests by @ZanSara in https://github.com/deepset-ai/haystack/pull/3646
    • feat: add query_by_embedding_batch by @tstadel in https://github.com/deepset-ai/haystack/pull/3546
    • refactor: filters type by @tstadel in https://github.com/deepset-ai/haystack/pull/3682
    • fix: pinecone metadata format by @jamescalam in https://github.com/deepset-ai/haystack/pull/3660
    • fix: fixing broken BM25 support with Weaviate - fixes #3720 #3723 by @zoltan-fedor https://github.com/deepset-ai/haystack/pull/3723

    Documentation

    • fix: fixing the url for document merger by @TuanaCelik in https://github.com/deepset-ai/haystack/pull/3615
    • docs: Reformat code blocks in docstrings by @brandenchan in https://github.com/deepset-ai/haystack/pull/3580

    Contributors to Tutorials

    • fix: Tutorial 2, finetune a model, distillation code by Benvii https://github.com/deepset-ai/haystack-tutorials/pull/69
    • chore: Update 01_Basic_QA_Pipeline.ipynb by gsajko https://github.com/deepset-ai/haystack-tutorials/pull/63

    Other Changes

    • test: add test to check id_hash_keys is not ignored by @julian-risch in https://github.com/deepset-ai/haystack/pull/3577
    • fix: remove beir from all-gpu by @ZanSara in https://github.com/deepset-ai/haystack/pull/3669
    • feat: Update DocumentMerger and TextIndexingPipeline imports by @brandenchan in https://github.com/deepset-ai/haystack/pull/3599
    • fix: pin espnet in the audio extra by @ZanSara in https://github.com/deepset-ai/haystack/pull/3693
    • refactor: update Squad data by @espoirMur in https://github.com/deepset-ai/haystack/pull/3513
    • Update CONTRIBUTING.md by @TuanaCelik in https://github.com/deepset-ai/haystack/pull/3624
    • fix: revamp colab extra dependencies by @masci in https://github.com/deepset-ai/haystack/pull/3626
    • refactor: remove test extra by @ZanSara in https://github.com/deepset-ai/haystack/pull/3679
    • fix: remove beir from the base GPU image by @ZanSara in https://github.com/deepset-ai/haystack/pull/3692
    • feat: Bump transformers version to remove torch scatter dependency by @sjrl in https://github.com/deepset-ai/haystack/pull/3703

    New Contributors

    • @espoirMur made their first contribution in https://github.com/deepset-ai/haystack/pull/3513

    Full Changelog: https://github.com/deepset-ai/haystack/compare/v1.11.1...v1.12.1

  • v1.12.0(Dec 21, 2022)

  • v1.12.0rc1(Dec 19, 2022)

    ⭐ Highlights

    Large Language Models with PromptNode

    Introducing PromptNode, a new feature that brings the power of large language models (LLMs) to various NLP tasks. PromptNode is an easy-to-use, customizable node you can run on its own or in a pipeline. We've designed the API to be user-friendly and suitable for everyday experimentation, but also fully compatible with production-grade Haystack deployments.

    By setting a prompt template for a PromptNode you define what task you want it to do. This way, you can have multiple PromptNodes in your pipeline, each performing a different task. But that's not all. You can also inject the output of one PromptNode into the input of another one.

    Out of the box, we support both Google T5 Flan and OpenAI GPT-3 models, and you can even mix and match these models in your pipelines.

    from haystack.nodes.prompt import PromptNode
    
    # Initialize the node:
    prompt_node = PromptNode("google/flan-t5-base")  # try also 'text-davinci-003' if you have an OpenAI key
    
    prompt_node("What is the capital of Germany?")
    

    This node can do a lot more than simply query LLMs: it can manage prompt templates, run batches, share models among instances, be chained together in pipelines, and more. Check its documentation for details!

    Support for BM25Retriever in InMemoryDocumentStore

    InMemoryDocumentStore has always been the go-to document store for small prototypes. The addition of BM25 support makes it officially one of the document stores to support all Retrievers available to Haystack, just like FAISS and Elasticsearch-like stores, but without the external dependencies. Don't use it in your million-documents-throughput deployments to production, though. It's not the fastest document store out there.

    :trophy: Honorable mention to @anakin87 for this outstanding contribution, among many many others! :trophy:

    Haystack is always open to external contributions, and every little bit is appreciated. Don't know where to start? Have a look at the Contributors Guidelines.

    Extended support for Cohere and OpenAI embeddings

    We enabled EmbeddingRetriever to use the latest Cohere multilingual embedding models and OpenAI embedding models.

    Simply use the model's full name (along with your API key) in EmbeddingRetriever to get them:

    # Cohere
    retriever = EmbeddingRetriever(embedding_model="multilingual-22-12", batch_size=16, api_key=api_key)
    # OpenAI
    retriever = EmbeddingRetriever(embedding_model="text-embedding-ada-002", batch_size=32, api_key=api_key, max_seq_len=8191)
    

    Speeding up dense searches in batch mode (Elasticsearch and OpenSearch)

    Whenever you need to execute multiple dense searches at once, ElasticsearchDocumentStore and OpenSearchDocumentStore can now do it in parallel. This not only speeds up run_batch and eval_batch for dense pipelines when used with those document stores but also significantly speeds up multi-embedding retrieval pipelines like, for example, MostSimilarDocumentsPipeline.

    For this, we measured a speed up of up to 49% on a realistic dataset.

    Under the hood, our newly introduced query_by_embedding_batch document store function uses msearch to unleash the full power of your Elasticsearch/OpenSearch cluster.

    :warning: Deprecated Docker images discontinued

    1.12 is the last release we're shipping with the old Docker images deepset/haystack-cpu, deepset/haystack-gpu, and their relative tags. We'll remove the corresponding, deprecated Docker files /Dockerfile, /Dockerfile-GPU, and /Dockerfile-GPU-minimal from the codebase after the release.

    What's Changed

    Pipeline

    • fix: ParsrConverter fails on pages without text by @anakin87 in https://github.com/deepset-ai/haystack/pull/3605
    • fix: Convert eval metrics to python float by @tstadel in https://github.com/deepset-ai/haystack/pull/3612
    • feat: add support for BM25Retriever in InMemoryDocumentStore by @anakin87 in https://github.com/deepset-ai/haystack/pull/3561
    • chore: fix return type of aggregate_labels by @tstadel in https://github.com/deepset-ai/haystack/pull/3617
    • refactor: change MultiModal retriever to be of type DenseRetriever by @mayankjobanputra in https://github.com/deepset-ai/haystack/pull/3598
    • fix: Move entire forward pass of TableQA within torch.no_grad() by @sjrl in https://github.com/deepset-ai/haystack/pull/3636
    • feat: add offsets_in_context to evaluation result by @julian-risch in https://github.com/deepset-ai/haystack/pull/3640
    • bug: Use tqdm auto instead of plain tqdm by @vblagoje in https://github.com/deepset-ai/haystack/pull/3672
    • fix: monkey patch for SklearnQueryClassifier by @anakin87 in https://github.com/deepset-ai/haystack/pull/3678
    • feat: Update table reader tests to check the answer scores by @sjrl in https://github.com/deepset-ai/haystack/pull/3641
    • feat: Adds all_terms_must_match parameter to BM25Retriever at runtime by @ugm2 in https://github.com/deepset-ai/haystack/pull/3627
    • fix: fix PreProcessor split_by schema by @ZanSara in https://github.com/deepset-ai/haystack/pull/3680
    • refactor: Generate JSON schema when missing by @masci in https://github.com/deepset-ai/haystack/pull/3533
    • refactor: replace torch.no_grad with torch.inference_mode (where possible) by @anakin87 in https://github.com/deepset-ai/haystack/pull/3601
    • Adjust get_type() method for pipelines by @vblagoje in https://github.com/deepset-ai/haystack/pull/3657
    • refactor: improve Multilabel design by @anakin87 in https://github.com/deepset-ai/haystack/pull/3658
    • feat: Update cohere embedding models #3704 by @vblagoje https://github.com/deepset-ai/haystack/pull/3704
    • feat: Enable text-embedding-ada-002 for EmbeddingRetriever #3721 by @vblagoje https://github.com/deepset-ai/haystack/pull/3721

    DocumentStores

    • fix: Flatten DocumentClassifier output in SQLDocumentStore by @anakin87 in https://github.com/deepset-ai/haystack/pull/3273
    • refactor: move milvus tests to their own module by @masci in https://github.com/deepset-ai/haystack/pull/3596
    • feat: store metadata using JSON in SQLDocumentStore by @masci in https://github.com/deepset-ai/haystack/pull/3547
    • fix: Pin faiss-cpu as 1.7.3 seems to have problems by @masci in https://github.com/deepset-ai/haystack/pull/3603
    • refactor: Move InMemoryDocumentStore tests to their own class by @masci in https://github.com/deepset-ai/haystack/pull/3614
    • chore: remove redundant tests by @masci in https://github.com/deepset-ai/haystack/pull/3620
    • refactor: Weaviate query with filters by @ZanSara in https://github.com/deepset-ai/haystack/pull/3628
    • fix: use 9200 as the default port in launch_opensearch() by @masci in https://github.com/deepset-ai/haystack/pull/3630
    • fix: revert Weaviate query with filters and improve tests by @ZanSara in https://github.com/deepset-ai/haystack/pull/3646
    • feat: add query_by_embedding_batch by @tstadel in https://github.com/deepset-ai/haystack/pull/3546
    • refactor: filters type by @tstadel in https://github.com/deepset-ai/haystack/pull/3682
    • fix: pinecone metadata format by @jamescalam in https://github.com/deepset-ai/haystack/pull/3660
    • fix: fixing broken BM25 support with Weaviate - fixes #3720 #3723 by @zoltan-fedor https://github.com/deepset-ai/haystack/pull/3723

    Documentation

    • fix: fixing the url for document merger by @TuanaCelik in https://github.com/deepset-ai/haystack/pull/3615
    • docs: Reformat code blocks in docstrings by @brandenchan in https://github.com/deepset-ai/haystack/pull/3580

    Contributors to Tutorials

    • fix: Tutorial 2, finetune a model, distillation code by Benvii https://github.com/deepset-ai/haystack-tutorials/pull/69
    • chore: Update 01_Basic_QA_Pipeline.ipynb by gsajko https://github.com/deepset-ai/haystack-tutorials/pull/63

    Other Changes

    • test: add test to check id_hash_keys is not ignored by @julian-risch in https://github.com/deepset-ai/haystack/pull/3577
    • fix: remove beir from all-gpu by @ZanSara in https://github.com/deepset-ai/haystack/pull/3669
    • feat: Update DocumentMerger and TextIndexingPipeline imports by @brandenchan in https://github.com/deepset-ai/haystack/pull/3599
    • fix: pin espnet in the audio extra by @ZanSara in https://github.com/deepset-ai/haystack/pull/3693
    • refactor: update Squad data by @espoirMur in https://github.com/deepset-ai/haystack/pull/3513
    • Update CONTRIBUTING.md by @TuanaCelik in https://github.com/deepset-ai/haystack/pull/3624
    • fix: revamp colab extra dependencies by @masci in https://github.com/deepset-ai/haystack/pull/3626
    • refactor: remove test extra by @ZanSara in https://github.com/deepset-ai/haystack/pull/3679
    • fix: remove beir from the base GPU image by @ZanSara in https://github.com/deepset-ai/haystack/pull/3692
    • feat: Bump transformers version to remove torch scatter dependency by @sjrl in https://github.com/deepset-ai/haystack/pull/3703

    New Contributors

    • @espoirMur made their first contribution in https://github.com/deepset-ai/haystack/pull/3513

    Full Changelog: https://github.com/deepset-ai/haystack/compare/v1.11.1...v1.12.0rc1

  • v1.11.1(Dec 6, 2022)

    What's Changed

    • fix: Pin faiss-cpu as 1.7.3 seems to have problems by @masci in #3603

    Full Changelog: https://github.com/deepset-ai/haystack/compare/v1.11.0...v1.11.1

  • v1.11.1rc1(Dec 6, 2022)

    What's Changed

    • fix: Pin faiss-cpu as 1.7.3 seems to have problems by @masci in #3603

    Full Changelog: https://github.com/deepset-ai/haystack/compare/v1.11.0...v1.11.1rc1

  • v1.11.0(Nov 21, 2022)

    ⭐ Highlights

    Expanding Haystack’s LLM support further with the new CohereEmbeddingEncoder (#3356)

    Now you can easily create document and query embeddings using Cohere’s large language models: if you have a Cohere account, all you have to do is set the name of one of the supported models (small, medium, or large) and add your API key to the EmbeddingRetriever component in your pipelines (see docs).

    Extracting headlines from Markdown and PDF files (#3445 #3488)

    Using the MarkdownConverter or the ParsrConverter, you can set the parameter extract_headlines to True to extract the headlines from your files together with their start position in the file and their level. Headlines are stored as a list of dictionaries in the Document's meta field "headlines" and are structured as follows:

    {
        "headline": <THE HEADLINE STRING>,
        "start_idx": <IDX OF HEADLINE START IN document.content >,
        "level": <LEVEL OF THE HEADLINE>
    }
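
As a rough illustration of how this metadata could be consumed, the sketch below slices a document's content into sections using the headline start indices. It uses plain dictionaries rather than Haystack's Document class; the sample content, the level values, and the helper name `section_text` are made up, and the headlines are assumed to be sorted by `start_idx`.

```python
from typing import Any, Dict, List

content = "Intro\nHello.\nSetup\nInstall it.\n"

# Sample metadata in the structure shown above (values are illustrative only).
headlines: List[Dict[str, Any]] = [
    {"headline": "Intro", "start_idx": content.index("Intro"), "level": 0},
    {"headline": "Setup", "start_idx": content.index("Setup"), "level": 0},
]


def section_text(content: str, headlines: List[Dict[str, Any]], position: int) -> str:
    """Return the text from one headline up to the next headline (or the end)."""
    start = headlines[position]["start_idx"]
    if position + 1 < len(headlines):
        end = headlines[position + 1]["start_idx"]
    else:
        end = len(content)
    return content[start:end]


first_section = section_text(content, headlines, 0)
last_section = section_text(content, headlines, 1)
```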
    

    Introducing the proposals design process (#3333)

    We've introduced the proposal design process for substantial changes. A proposal is a single Markdown file that explains why a change is needed and how it would be implemented. You can find a detailed explanation of the process and a proposal template in the proposals directory.

    ⚠️ Breaking change: removing Milvus1DocumentStore

    From this version onwards, Haystack no longer supports version 1 of Milvus. We still support Milvus version 2. We removed Milvus1DocumentStore and renamed Milvus2DocumentStore to MilvusDocumentStore.

    What's Changed

    Breaking Changes

    • bug: removed duplicated meta "name" field addition to content before embedding in update_embeddings workflow by @mayankjobanputra in https://github.com/deepset-ai/haystack/pull/3368
    • BREAKING CHANGE: remove Milvus1DocumentStore along with support for Milvus < 2.x by @masci in https://github.com/deepset-ai/haystack/pull/3552

    Pipeline

    • fix: Fix the error of wrong page numbers when documents contain empty pages. by @brunnurs in https://github.com/deepset-ai/haystack/pull/3330
    • bug: change type of split_by to Literal including None by @julian-risch in https://github.com/deepset-ai/haystack/pull/3389
    • Fix: update pyworld pin by @anakin87 in https://github.com/deepset-ai/haystack/pull/3435
    • feat: send event if number of queries exceeds threshold by @vblagoje in https://github.com/deepset-ai/haystack/pull/3419
    • Feat: allow decreasing size of datasets loaded from BEIR by @ugm2 in https://github.com/deepset-ai/haystack/pull/3392
    • feat: add __cointains__ to Span by @ZanSara in https://github.com/deepset-ai/haystack/pull/3446
    • Bug: Fix prompt length computation by @Timoeller in https://github.com/deepset-ai/haystack/pull/3448
    • Add indexing pipeline type by @vblagoje in https://github.com/deepset-ai/haystack/pull/3461
    • fix: warning if doc store similarity function is incompatible with Sentence Transformers model by @anakin87 in https://github.com/deepset-ai/haystack/pull/3455
    • feat: Add CohereEmbeddingEncoder to EmbeddingRetriever by @vblagoje in https://github.com/deepset-ai/haystack/pull/3453
    • feat: Extraction of headlines in markdown files by @bogdankostic in https://github.com/deepset-ai/haystack/pull/3445
    • bug: replace decorator with counter attribute for pipeline event by @julian-risch in https://github.com/deepset-ai/haystack/pull/3462
    • feat: add document_store to all BaseRetriever.retrieve() and BaseRetriever.retrieve_batch() implementations by @ZanSara in https://github.com/deepset-ai/haystack/pull/3379
    • refactor: TableReader by @sjrl in https://github.com/deepset-ai/haystack/pull/3456
    • fix: do not reference package directory in PDFToTextOCRConverter.convert() by @ZanSara in https://github.com/deepset-ai/haystack/pull/3478
    • feat: Create the TextIndexingPipeline by @brandenchan in https://github.com/deepset-ai/haystack/pull/3473
    • refactor: remove YAML save/load methods for subclasses of BaseStandardPipeline by @ZanSara in https://github.com/deepset-ai/haystack/pull/3443
    • fix: strip whitespaces safely from FARMReader's answers by @ZanSara in https://github.com/deepset-ai/haystack/pull/3526

    DocumentStores

    • Document Store test refactoring by @masci in https://github.com/deepset-ai/haystack/pull/3449
    • fix: support long texts for labels in ElasticsearchDocumentStore by @anakin87 in https://github.com/deepset-ai/haystack/pull/3346
    • feat: add SQLDocumentStore tests by @masci in https://github.com/deepset-ai/haystack/pull/3517
    • refactor: Refactor Weaviate tests by @masci in https://github.com/deepset-ai/haystack/pull/3541
    • refactor: Pinecone tests by @masci in https://github.com/deepset-ai/haystack/pull/3555
    • fix: write metadata to SQL Document Store when duplicate_documents!="overwrite" by @anakin87 in https://github.com/deepset-ai/haystack/pull/3548
    • fix: Elasticsearch / OpenSearch brownfield function does not incorporate meta by @tstadel in https://github.com/deepset-ai/haystack/pull/3572
    • fix: discard metadata fields if not set in Weaviate by @masci in https://github.com/deepset-ai/haystack/pull/3578

    UI / Demo

    • refactor: update package strategy in ui by @anakin87 in https://github.com/deepset-ai/haystack/pull/3396

    Documentation

    • docs: Extend utils API docs coverage by @brandenchan in https://github.com/deepset-ai/haystack/pull/3402
    • refactor: simplify Summarizer, add Document Merger by @anakin87 in https://github.com/deepset-ai/haystack/pull/3452
    • feat: introduce proposal design process by @masci in https://github.com/deepset-ai/haystack/pull/3333

    Other Changes

    • fix: Update env variable for model caching timeout by @sjrl in https://github.com/deepset-ai/haystack/pull/3405
    • feat: Add exponential backoff decorator; apply it to OpenAI requests by @vblagoje in https://github.com/deepset-ai/haystack/pull/3398
    • fix: improve Document __repr__ by @anakin87 in https://github.com/deepset-ai/haystack/pull/3385
    • fix: disabling telemetry prevents writing config by @julian-risch in https://github.com/deepset-ai/haystack/pull/3465
    • refactor: Change no_answer attribute by @anakin87 in https://github.com/deepset-ai/haystack/pull/3411
    • feat: Speed up reader tests by @sjrl in https://github.com/deepset-ai/haystack/pull/3476
    • fix: pattern to match tags push by @masci in https://github.com/deepset-ai/haystack/pull/3469
    • fix: using onnx converter on XLMRoberta architecture by @sjrl in https://github.com/deepset-ai/haystack/pull/3470
    • feat: Add headline extraction to ParsrConverter by @bogdankostic in https://github.com/deepset-ai/haystack/pull/3488
    • refactor: upgrade actions version by @ZanSara in https://github.com/deepset-ai/haystack/pull/3506
    • docs: Update docker readme by @brandenchan in https://github.com/deepset-ai/haystack/pull/3531
    • refactor: refactor FAISS tests by @masci in https://github.com/deepset-ai/haystack/pull/3537
    • feat: include error message in HaystackError telemetry events by @vblagoje in https://github.com/deepset-ai/haystack/pull/3543
    • fix: [rest_api] support TableQA in the endpoint /documents/get_by_filters by @ju-gu in https://github.com/deepset-ai/haystack/pull/3551
    • bug: fix release number by @mayankjobanputra in https://github.com/deepset-ai/haystack/pull/3559
    • refactor: Generate JSON schema when missing by @masci in https://github.com/deepset-ai/haystack/pull/3533

    New Contributors

    • @brunnurs made their first contribution in https://github.com/deepset-ai/haystack/pull/3330
    • @mayankjobanputra made their first contribution in https://github.com/deepset-ai/haystack/pull/3368

    Full Changelog: https://github.com/deepset-ai/haystack/compare/v1.10.0...v1.11.0rc1

  • v1.11.0rc1(Nov 18, 2022)

    ⭐ Highlights

    Expanding Haystack’s LLM support further with the new CohereEmbeddingEncoder (#3453)

    Now you can easily create document and query embeddings using Cohere’s large language models: if you have a Cohere account, all you have to do is set the name of one of the supported models (small, medium, or large) and add your API key to the EmbeddingRetriever component in your pipelines (see docs).

    Extracting headlines from Markdown and PDF files (#3445 #3488)

    Using the MarkdownConverter or the ParsrConverter, you can set the parameter extract_headlines to True to extract the headlines from your files together with their start position in the file and their level. Headlines are stored as a list of dictionaries in the Document's meta field "headlines" and are structured as follows:

    {
        "headline": <THE HEADLINE STRING>,
        "start_idx": <IDX OF HEADLINE START IN document.content >,
        "level": <LEVEL OF THE HEADLINE>
    }
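
    As a rough illustration of how this metadata might be consumed, the sketch below looks up the headline that governs a given character offset in document.content. The helper name headline_for_offset and the sample data are hypothetical and not part of Haystack; only the three dictionary keys come from the structure above.

```python
from typing import Dict, List, Optional


def headline_for_offset(headlines: List[Dict], offset: int) -> Optional[Dict]:
    """Return the last headline that starts at or before `offset`, or None."""
    current = None
    # Sort by start_idx so the result does not depend on the list order.
    for headline in sorted(headlines, key=lambda h: h["start_idx"]):
        if headline["start_idx"] <= offset:
            current = headline
        else:
            break
    return current


# Hypothetical sample mirroring the "headlines" meta field.
sample_headlines = [
    {"headline": "Introduction", "start_idx": 0, "level": 0},
    {"headline": "Installation", "start_idx": 120, "level": 1},
    {"headline": "Usage", "start_idx": 480, "level": 1},
]

section = headline_for_offset(sample_headlines, 200)
print(section["headline"])  # Installation
```

    Sorting inside the helper is a defensive choice; if the converter already returns headlines in document order, the sort is a no-op.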
    

    Introducing the proposals design process (#3333)

    We've introduced the proposal design process for substantial changes. A proposal is a single Markdown file that explains why a change is needed and how it would be implemented. You can find a detailed explanation of the process and a proposal template in the proposals directory.

    ⚠️ Breaking change: removing Milvus1DocumentStore

    From this version onwards, Haystack no longer supports version 1 of Milvus. We still support Milvus version 2. We removed Milvus1DocumentStore and renamed Milvus2DocumentStore to MilvusDocumentStore.

    What's Changed

    Breaking Changes

    • bug: removed duplicated meta "name" field addition to content before embedding in update_embeddings workflow by @mayankjobanputra in https://github.com/deepset-ai/haystack/pull/3368
    • BREAKING CHANGE: remove Milvus1DocumentStore along with support for Milvus < 2.x by @masci in https://github.com/deepset-ai/haystack/pull/3552

    Pipeline

    • fix: Fix the error of wrong page numbers when documents contain empty pages. by @brunnurs in https://github.com/deepset-ai/haystack/pull/3330
    • bug: change type of split_by to Literal including None by @julian-risch in https://github.com/deepset-ai/haystack/pull/3389
    • Fix: update pyworld pin by @anakin87 in https://github.com/deepset-ai/haystack/pull/3435
    • feat: send event if number of queries exceeds threshold by @vblagoje in https://github.com/deepset-ai/haystack/pull/3419
    • Feat: allow decreasing size of datasets loaded from BEIR by @ugm2 in https://github.com/deepset-ai/haystack/pull/3392
    • feat: add __contains__ to Span by @ZanSara in https://github.com/deepset-ai/haystack/pull/3446
    • Bug: Fix prompt length computation by @Timoeller in https://github.com/deepset-ai/haystack/pull/3448
    • Add indexing pipeline type by @vblagoje in https://github.com/deepset-ai/haystack/pull/3461
    • fix: warning if doc store similarity function is incompatible with Sentence Transformers model by @anakin87 in https://github.com/deepset-ai/haystack/pull/3455
    • feat: Add CohereEmbeddingEncoder to EmbeddingRetriever by @vblagoje in https://github.com/deepset-ai/haystack/pull/3453
    • feat: Extraction of headlines in markdown files by @bogdankostic in https://github.com/deepset-ai/haystack/pull/3445
    • bug: replace decorator with counter attribute for pipeline event by @julian-risch in https://github.com/deepset-ai/haystack/pull/3462
    • feat: add document_store to all BaseRetriever.retrieve() and BaseRetriever.retrieve_batch() implementations by @ZanSara in https://github.com/deepset-ai/haystack/pull/3379
    • refactor: TableReader by @sjrl in https://github.com/deepset-ai/haystack/pull/3456
    • fix: do not reference package directory in PDFToTextOCRConverter.convert() by @ZanSara in https://github.com/deepset-ai/haystack/pull/3478
    • feat: Create the TextIndexingPipeline by @brandenchan in https://github.com/deepset-ai/haystack/pull/3473
    • refactor: remove YAML save/load methods for subclasses of BaseStandardPipeline by @ZanSara in https://github.com/deepset-ai/haystack/pull/3443
    • fix: strip whitespaces safely from FARMReader's answers by @ZanSara in https://github.com/deepset-ai/haystack/pull/3526

    DocumentStores

    • Document Store test refactoring by @masci in https://github.com/deepset-ai/haystack/pull/3449
    • fix: support long texts for labels in ElasticsearchDocumentStore by @anakin87 in https://github.com/deepset-ai/haystack/pull/3346
    • feat: add SQLDocumentStore tests by @masci in https://github.com/deepset-ai/haystack/pull/3517
    • refactor: Refactor Weaviate tests by @masci in https://github.com/deepset-ai/haystack/pull/3541
    • refactor: Pinecone tests by @masci in https://github.com/deepset-ai/haystack/pull/3555
    • fix: write metadata to SQL Document Store when duplicate_documents!="overwrite" by @anakin87 in https://github.com/deepset-ai/haystack/pull/3548
    • fix: Elasticsearch / OpenSearch brownfield function does not incorporate meta by @tstadel in https://github.com/deepset-ai/haystack/pull/3572
    • fix: discard metadata fields if not set in Weaviate by @masci in https://github.com/deepset-ai/haystack/pull/3578

    UI / Demo

    • refactor: update package strategy in ui by @anakin87 in https://github.com/deepset-ai/haystack/pull/3396

    Documentation

    • docs: Extend utils API docs coverage by @brandenchan in https://github.com/deepset-ai/haystack/pull/3402
    • refactor: simplify Summarizer, add Document Merger by @anakin87 in https://github.com/deepset-ai/haystack/pull/3452
    • feat: introduce proposal design process by @masci in https://github.com/deepset-ai/haystack/pull/3333

    Other Changes

    • fix: Update env variable for model caching timeout by @sjrl in https://github.com/deepset-ai/haystack/pull/3405
    • feat: Add exponential backoff decorator; apply it to OpenAI requests by @vblagoje in https://github.com/deepset-ai/haystack/pull/3398
    • fix: improve Document __repr__ by @anakin87 in https://github.com/deepset-ai/haystack/pull/3385
    • fix: disabling telemetry prevents writing config by @julian-risch in https://github.com/deepset-ai/haystack/pull/3465
    • refactor: Change no_answer attribute by @anakin87 in https://github.com/deepset-ai/haystack/pull/3411
    • feat: Speed up reader tests by @sjrl in https://github.com/deepset-ai/haystack/pull/3476
    • fix: pattern to match tags push by @masci in https://github.com/deepset-ai/haystack/pull/3469
    • fix: using onnx converter on XLMRoberta architecture by @sjrl in https://github.com/deepset-ai/haystack/pull/3470
    • feat: Add headline extraction to ParsrConverter by @bogdankostic in https://github.com/deepset-ai/haystack/pull/3488
    • refactor: upgrade actions version by @ZanSara in https://github.com/deepset-ai/haystack/pull/3506
    • docs: Update docker readme by @brandenchan in https://github.com/deepset-ai/haystack/pull/3531
    • refactor: refactor FAISS tests by @masci in https://github.com/deepset-ai/haystack/pull/3537
    • feat: include error message in HaystackError telemetry events by @vblagoje in https://github.com/deepset-ai/haystack/pull/3543
    • fix: [rest_api] support TableQA in the endpoint /documents/get_by_filters by @ju-gu in https://github.com/deepset-ai/haystack/pull/3551
    • bug: fix release number by @mayankjobanputra in https://github.com/deepset-ai/haystack/pull/3559
    • refactor: Generate JSON schema when missing by @masci in https://github.com/deepset-ai/haystack/pull/3533

    New Contributors

    • @brunnurs made their first contribution in https://github.com/deepset-ai/haystack/pull/3330
    • @mayankjobanputra made their first contribution in https://github.com/deepset-ai/haystack/pull/3368

    Full Changelog: https://github.com/deepset-ai/haystack/compare/v1.10.0...v1.11.0rc1

  • v1.10.0(Oct 25, 2022)

    ⭐ Highlights

    Expanding Haystack's LLM support with the new OpenAIEmbeddingEncoder (#3356)

    Now you can easily create document and query embeddings using large language models: if you have an OpenAI account, all you have to do is set the name of one of the supported models (ada, babbage, davinci or curie) and add your API key to the EmbeddingRetriever component in your pipelines (see docs).

    Multimodal retrieval is here! (#2891)

    Multimodality with Haystack just made a big leap forward with the addition of MultiModalRetriever: a Retriever that can handle different modalities for query and documents independently. Take it for a spin and experiment with new Document formats, like images. You can now use the same Retriever for text-to-image, text-to-table, and text-to-text retrieval but also image similarity, table similarity, and more! Feed your favorite multimodal model to MultiModalRetriever and see it in action.

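    from haystack.document_stores import InMemoryDocumentStore
    from haystack.nodes.retriever.multimodal import MultiModalRetriever

    # CLIP ViT-B-32 produces 512-dimensional embeddings for both text and images
    retriever = MultiModalRetriever(
        document_store=InMemoryDocumentStore(embedding_dim=512),
        query_embedding_model="sentence-transformers/clip-ViT-B-32",
        query_type="text",
        document_embedding_models={"image": "sentence-transformers/clip-ViT-B-32"},
    )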
    

    Multi-platform Docker images

    Starting with 1.10, we're making the deepset/haystack images available for linux/amd64 and linux/arm64.

    ⚠️ Breaking change in embed_queries method (#3252)

    We've changed the text argument in the embed_queries method for DensePassageRetriever and EmbeddingRetriever to queries.

    What's Changed

    Breaking Changes

    • chore: add DenseRetriever abstraction by @tstadel in https://github.com/deepset-ai/haystack/pull/3252

    Pipeline

    • fix: ONNX FARMReader model conversion is broken by @vblagoje in https://github.com/deepset-ai/haystack/pull/3211
    • bug: JoinDocuments nodes produce incorrect results if preceded by another JoinDocuments node by @JeffRisberg in https://github.com/deepset-ai/haystack/pull/3170
    • fix: eval() with add_isolated_node_eval=True breaks if no node supports it by @tstadel in https://github.com/deepset-ai/haystack/pull/3347
    • feat: extract label aggregation by @tstadel in https://github.com/deepset-ai/haystack/pull/3363
    • feat: Add OpenAIEmbeddingEncoder to EmbeddingRetriever by @vblagoje in https://github.com/deepset-ai/haystack/pull/3356
    • fix: stable YAML schema generation by @ZanSara in https://github.com/deepset-ai/haystack/pull/3388
    • fix: Update how schema is ordered by @sjrl in https://github.com/deepset-ai/haystack/pull/3399
    • feat: MultiModalRetriever by @ZanSara in https://github.com/deepset-ai/haystack/pull/2891

    DocumentStores

    • feat: FAISS in OpenSearch: Support HNSW for cosine by @tstadel in https://github.com/deepset-ai/haystack/pull/3217
    • feat: add support for Elasticsearch 7.16.2 by @masci in https://github.com/deepset-ai/haystack/pull/3318
    • refactor: remove dead code from FAISSDocumentStore by @anakin87 in https://github.com/deepset-ai/haystack/pull/3372
    • fix: allow same vector_id in different indexes for SQL-based Document stores by @anakin87 in https://github.com/deepset-ai/haystack/pull/3383

    UI / Demo

    • fix: demo won't start through Docker compose on Apple M1 by @masci in https://github.com/deepset-ai/haystack/pull/3337

    Documentation

    • docs: Fix a docstring in ray.py by @tanertopal in https://github.com/deepset-ai/haystack/pull/3282

    Other Changes

    • refactor: make TransformersDocumentClassifier output consistent between different types of classification by @anakin87 in https://github.com/deepset-ai/haystack/pull/3224
    • Classify pipeline's type based on its components by @vblagoje in https://github.com/deepset-ai/haystack/pull/3132
    • docs: sync Haystack API with Readme by @brandenchan in https://github.com/deepset-ai/haystack/pull/3223
    • fix: MostSimilarDocumentsPipeline doesn't have pipeline property by @vblagoje in https://github.com/deepset-ai/haystack/pull/3265
    • bug: make ElasticSearchDocumentStore use batch_size in get_documents_by_id by @anakin87 in https://github.com/deepset-ai/haystack/pull/3166
    • refactor: better tests for TransformersDocumentClassifier by @anakin87 in https://github.com/deepset-ai/haystack/pull/3270
    • fix: AttributeError in TranslationWrapperPipeline by @nickchomey in https://github.com/deepset-ai/haystack/pull/3290
    • refactor: remove Inferencer multiprocessing by @vblagoje in https://github.com/deepset-ai/haystack/pull/3283
    • fix: opensearch script score with filters by @tstadel in https://github.com/deepset-ai/haystack/pull/3321
    • feat: Adding filters param to MostSimilarDocumentsPipeline run and run_batch by @JacdDev in https://github.com/deepset-ai/haystack/pull/3301
    • feat: add multi-platform Docker images by @masci in https://github.com/deepset-ai/haystack/pull/3354
    • fix: Added checks for DataParallel and WrappedDataParallel by @sjrl in https://github.com/deepset-ai/haystack/pull/3366
    • fix: QuestionGenerator generates wrong document questions for non-default num_queries_per_doc parameter by @vblagoje in https://github.com/deepset-ai/haystack/pull/3381
    • bug: Adds better way of checking query in BaseRetriever and Pipeline.run() by @ugm2 in https://github.com/deepset-ai/haystack/pull/3304
    • feat: Updated EntityExtractor to handle long texts and added better postprocessing by @sjrl in https://github.com/deepset-ai/haystack/pull/3154
    • docs: Add comment about the generation of no-answer samples in FARMReader training by @brandenchan in https://github.com/deepset-ai/haystack/pull/3404
    • feat: Speed up integration tests (nodes) by @sjrl in https://github.com/deepset-ai/haystack/pull/3408
    • fix: Fix the error of wrong page numbers when documents contain empty pages. by @brunnurs in https://github.com/deepset-ai/haystack/pull/3330
    • bug: change type of split_by to Literal including None by @julian-risch in https://github.com/deepset-ai/haystack/pull/3389
    • feat: Add exponential backoff decorator; apply it to OpenAI requests by @vblagoje in https://github.com/deepset-ai/haystack/pull/3398

    New Contributors

    • @tanertopal made their first contribution in https://github.com/deepset-ai/haystack/pull/3282
    • @JeffRisberg made their first contribution in https://github.com/deepset-ai/haystack/pull/3170
    • @JacdDev made their first contribution in https://github.com/deepset-ai/haystack/pull/3301
    • @hsm207 made their first contribution in https://github.com/deepset-ai/haystack/pull/3351
    • @ugm2 made their first contribution in https://github.com/deepset-ai/haystack/pull/3304
    • @brunnurs made their first contribution in https://github.com/deepset-ai/haystack/pull/3330

    Full Changelog: https://github.com/deepset-ai/haystack/compare/v1.9.1...v1.10.0rc1

  • v1.10.0rc1(Oct 20, 2022)

    ⭐ Highlights

    Expanding Haystack's LLM support with the new OpenAIEmbeddingEncoder (#3356)

    Now you can easily create document and query embeddings using large language models: if you have an OpenAI account, all you have to do is set the name of one of the supported models (ada, babbage, davinci or curie) and add your API key to the EmbeddingRetriever component in your pipelines.

    Multimodal retrieval is here! (#2891)

    Multimodality with Haystack just made a big leap forward with the addition of MultiModalRetriever: a Retriever that can handle different modalities for query and documents independently. Take it for a spin and experiment with new Document formats, like images. You can now use the same Retriever for text-to-image, text-to-table, and text-to-text retrieval but also image similarity, table similarity, and more! Feed your favorite multimodal model to MultiModalRetriever and see it in action.

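    from haystack.document_stores import InMemoryDocumentStore
    from haystack.nodes.retriever.multimodal import MultiModalRetriever

    # CLIP ViT-B-32 produces 512-dimensional embeddings for both text and images
    retriever = MultiModalRetriever(
        document_store=InMemoryDocumentStore(embedding_dim=512),
        query_embedding_model="sentence-transformers/clip-ViT-B-32",
        query_type="text",
        document_embedding_models={"image": "sentence-transformers/clip-ViT-B-32"},
    )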
    

    Multi-platform Docker images

    Starting with 1.10, we're making the deepset/haystack images available for linux/amd64 and linux/arm64.

    ⚠️ Breaking change in embed_queries method (#3252)

    We've changed the text argument in the embed_queries method for DensePassageRetriever and EmbeddingRetriever to queries.

    What's Changed

    Breaking Changes

    • chore: add DenseRetriever abstraction by @tstadel in https://github.com/deepset-ai/haystack/pull/3252

    Pipeline

    • fix: ONNX FARMReader model conversion is broken by @vblagoje in https://github.com/deepset-ai/haystack/pull/3211
    • bug: JoinDocuments nodes produce incorrect results if preceded by another JoinDocuments node by @JeffRisberg in https://github.com/deepset-ai/haystack/pull/3170
    • fix: eval() with add_isolated_node_eval=True breaks if no node supports it by @tstadel in https://github.com/deepset-ai/haystack/pull/3347
    • feat: extract label aggregation by @tstadel in https://github.com/deepset-ai/haystack/pull/3363
    • feat: Add OpenAIEmbeddingEncoder to EmbeddingRetriever by @vblagoje in https://github.com/deepset-ai/haystack/pull/3356
    • fix: stable YAML schema generation by @ZanSara in https://github.com/deepset-ai/haystack/pull/3388
    • fix: Update how schema is ordered by @sjrl in https://github.com/deepset-ai/haystack/pull/3399
    • feat: MultiModalRetriever by @ZanSara in https://github.com/deepset-ai/haystack/pull/2891

    DocumentStores

    • feat: FAISS in OpenSearch: Support HNSW for cosine by @tstadel in https://github.com/deepset-ai/haystack/pull/3217
    • feat: add support for Elasticsearch 7.16.2 by @masci in https://github.com/deepset-ai/haystack/pull/3318
    • refactor: remove dead code from FAISSDocumentStore by @anakin87 in https://github.com/deepset-ai/haystack/pull/3372
    • fix: allow same vector_id in different indexes for SQL-based Document stores by @anakin87 in https://github.com/deepset-ai/haystack/pull/3383

    UI / Demo

    • fix: demo won't start through Docker compose on Apple M1 by @masci in https://github.com/deepset-ai/haystack/pull/3337

    Documentation

    • docs: Fix a docstring in ray.py by @tanertopal in https://github.com/deepset-ai/haystack/pull/3282

    Other Changes

    • refactor: make TransformersDocumentClassifier output consistent between different types of classification by @anakin87 in https://github.com/deepset-ai/haystack/pull/3224
    • Classify pipeline's type based on its components by @vblagoje in https://github.com/deepset-ai/haystack/pull/3132
    • docs: sync Haystack API with Readme by @brandenchan in https://github.com/deepset-ai/haystack/pull/3223
    • fix: MostSimilarDocumentsPipeline doesn't have pipeline property by @vblagoje in https://github.com/deepset-ai/haystack/pull/3265
    • bug: make ElasticSearchDocumentStore use batch_size in get_documents_by_id by @anakin87 in https://github.com/deepset-ai/haystack/pull/3166
    • refactor: better tests for TransformersDocumentClassifier by @anakin87 in https://github.com/deepset-ai/haystack/pull/3270
    • fix: AttributeError in TranslationWrapperPipeline by @nickchomey in https://github.com/deepset-ai/haystack/pull/3290
    • refactor: remove Inferencer multiprocessing by @vblagoje in https://github.com/deepset-ai/haystack/pull/3283
    • fix: opensearch script score with filters by @tstadel in https://github.com/deepset-ai/haystack/pull/3321
    • feat: Adding filters param to MostSimilarDocumentsPipeline run and run_batch by @JacdDev in https://github.com/deepset-ai/haystack/pull/3301
    • feat: add multi-platform Docker images by @masci in https://github.com/deepset-ai/haystack/pull/3354
    • fix: Added checks for DataParallel and WrappedDataParallel by @sjrl in https://github.com/deepset-ai/haystack/pull/3366
    • fix: QuestionGenerator generates wrong document questions for non-default num_queries_per_doc parameter by @vblagoje in https://github.com/deepset-ai/haystack/pull/3381
    • bug: Adds better way of checking query in BaseRetriever and Pipeline.run() by @ugm2 in https://github.com/deepset-ai/haystack/pull/3304
    • feat: Updated EntityExtractor to handle long texts and added better postprocessing by @sjrl in https://github.com/deepset-ai/haystack/pull/3154
    • docs: Add comment about the generation of no-answer samples in FARMReader training by @brandenchan in https://github.com/deepset-ai/haystack/pull/3404
    • feat: Speed up integration tests (nodes) by @sjrl in https://github.com/deepset-ai/haystack/pull/3408
    • fix: Fix the error of wrong page numbers when documents contain empty pages. by @brunnurs in https://github.com/deepset-ai/haystack/pull/3330
    • bug: change type of split_by to Literal including None by @julian-risch in https://github.com/deepset-ai/haystack/pull/3389
    • feat: Add exponential backoff decorator; apply it to OpenAI requests by @vblagoje in https://github.com/deepset-ai/haystack/pull/3398

    New Contributors

    • @tanertopal made their first contribution in https://github.com/deepset-ai/haystack/pull/3282
    • @JeffRisberg made their first contribution in https://github.com/deepset-ai/haystack/pull/3170
    • @JacdDev made their first contribution in https://github.com/deepset-ai/haystack/pull/3301
    • @hsm207 made their first contribution in https://github.com/deepset-ai/haystack/pull/3351
    • @ugm2 made their first contribution in https://github.com/deepset-ai/haystack/pull/3304
    • @brunnurs made their first contribution in https://github.com/deepset-ai/haystack/pull/3330

    Full Changelog: https://github.com/deepset-ai/haystack/compare/v1.9.1...v1.10.0rc1

  • v1.9.1(Oct 10, 2022)

    What's Changed

    • fix: Allow less restrictive values for parameters in Pipeline configurations by @bogdankostic in #3345

    Full Changelog: https://github.com/deepset-ai/haystack/compare/v1.9.0...v1.9.1rc1

  • v1.9.1rc1(Oct 10, 2022)

    What's Changed

    • fix: Allow less restrictive values for parameters in Pipeline configurations by @bogdankostic in #3345

    Full Changelog: https://github.com/deepset-ai/haystack/compare/v1.9.0...v1.9.1rc1

  • v1.9.0(Sep 21, 2022)

    ⭐ Highlights

    Haystack 1.9 comes with nice performance improvements and two important pieces of news about its ecosystem. Let's see it in more detail!

    Logging speed set to ludicrous (#3212)

    This feature alone makes Haystack 1.9 worth testing out, just sayin'... We switched from f-strings to the string formatting operator when composing log messages and observed an astonishing speed-up of up to 120% in some pipelines.
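
    The gain comes from deferred interpolation: with %-style arguments, the logging module only builds the message if the record is actually emitted, whereas an f-string is always evaluated before the call. A minimal sketch using only the standard library (the Expensive class and the logger name are made up for illustration):

```python
import logging


class Expensive:
    """Counts how often it is converted to a string."""

    def __init__(self):
        self.str_calls = 0

    def __str__(self):
        self.str_calls += 1
        return "expensive value"


logger = logging.getLogger("lazy_logging_demo")
logger.setLevel(logging.WARNING)  # DEBUG records are discarded

eager = Expensive()
lazy = Expensive()

# f-string: the message is formatted before logging checks the level.
logger.debug(f"Processing {eager}")

# %-style arguments: formatting is skipped because DEBUG is disabled.
logger.debug("Processing %s", lazy)

print(eager.str_calls, lazy.str_calls)  # 1 0
```

    The difference only matters for messages below the active log level, which is typically the bulk of debug logging in a production pipeline.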

    Tutorials moved out! (#3244)

    They grow up so fast! Tutorials now have their own git repository, CI, and release cycle, making it easier than ever to contribute ideas, fixes, and bug reports. Have a look at the tutorials repo, Star it, and open an issue if you have an idea for a new tutorial!

    Docker pull deepset/haystack (#3162)

    A new Docker image shipping Haystack 1.9 is ready to be pulled. It comes in different flavors and versions that you can select with the appropriate Docker tag (have a look at the README). On this occasion, we also revamped the build process, which now uses bake, while the older images are deprecated (see below).

    ⚠️ Deprecation notice

    With the release of the new Docker image deepset/haystack, the following images are now deprecated and won't be updated any more starting with Haystack 1.10:

    New Documentation Site and Haystack Website Revamp:

    The Haystack website is going through a make-over to become a developer portal that surrounds Haystack and NLP topics beyond pure documentation. With that, we've published our new documentation site. From now on, content surrounding pure developer documentation will live under Haystack Documentation, while the Haystack website becomes a place for the community with tutorials, learning material and soon, a place where the community can share their own content too.

    What's Changed

    Pipeline

    • feat: standardize devices parameter and device initialization by @vblagoje in https://github.com/deepset-ai/haystack/pull/3062
    • fix: Reduce GPU to CPU copies at inference by @sjrl in https://github.com/deepset-ai/haystack/pull/3127
    • test: lower low boundary for accuracy in test_calculate_context_similarity_on_non_matching_contexts by @ZanSara in https://github.com/deepset-ai/haystack/pull/3199
    • bug: fix pdftotext installation verification by @banjocustard in https://github.com/deepset-ai/haystack/pull/3233
    • chore: remove f-strings from logs for performance reasons by @ZanSara in https://github.com/deepset-ai/haystack/pull/3212
    • bug: reactivate benchmarks with quick fixes by @tholor in https://github.com/deepset-ai/haystack/pull/2766

    Models

    • fix: Replace multiprocessing tokenization with batched fast tokenization by @vblagoje in https://github.com/deepset-ai/haystack/pull/3089

    DocumentStores

    • bug: OpensearchDocumentStore.custom_mapping should accept JSON strings at validation by @ZanSara in https://github.com/deepset-ai/haystack/pull/3065
    • feat: Add warnings to PineconeDocumentStore about indexing metadata if filters return no documents by @Namoush in https://github.com/deepset-ai/haystack/pull/3086
    • bug: validate custom_mapping as an object by @ZanSara in https://github.com/deepset-ai/haystack/pull/3189

    Tutorials

    • docs: Fix the word length splitting; should be set to 100 not 1,000 by @stevenhaley in https://github.com/deepset-ai/haystack/pull/3133
    • chore: remove tutorials from the repo by @masci in https://github.com/deepset-ai/haystack/pull/3244

    Other Changes

    • chore: Upgrade and pin transformers to 4.21.2 by @vblagoje in https://github.com/deepset-ai/haystack/pull/3098
    • bug: adapt UI random question for streamlit 1.12 and pin to streamlit>=1.9.0 by @anakin87 in https://github.com/deepset-ai/haystack/pull/3121
    • build: pin pydantic to 1.9.2 by @masci in https://github.com/deepset-ai/haystack/pull/3126
    • fix: document FARMReader.train() evaluation report log level by @brandenchan in https://github.com/deepset-ai/haystack/pull/3129
    • feat: add a security policy for Haystack by @masci in https://github.com/deepset-ai/haystack/pull/3130
    • refactor: update dependencies and remove pins by @danielbichuetti in https://github.com/deepset-ai/haystack/pull/3147
    • refactor: update package strategy in rest_api by @masci in https://github.com/deepset-ai/haystack/pull/3148
    • fix: give default index for torch.device('cuda') in initialize_device_settings by @sjrl in https://github.com/deepset-ai/haystack/pull/3161
    • fix: add type hints to all component init constructor parameters by @vblagoje in https://github.com/deepset-ai/haystack/pull/3152
    • fix: Add 15 min timeout for downloading cached HF models by @vblagoje in https://github.com/deepset-ai/haystack/pull/3179
    • fix: replace torch.device("cuda") with torch.device("cuda:0") in devices initialization by @vblagoje in https://github.com/deepset-ai/haystack/pull/3184
    • feat: add health check endpoint to rest api by @danielbichuetti in https://github.com/deepset-ai/haystack/pull/3168
    • refactor: improve support for dataclasses by @danielbichuetti in https://github.com/deepset-ai/haystack/pull/3142
    • feat: Updates docs and types for language param in PreProcessor by @sjrl in https://github.com/deepset-ai/haystack/pull/3186
    • feat: Add option to use MultipleNegativesRankingLoss for EmbeddingRetriever training with sentence-transformers by @bglearning in https://github.com/deepset-ai/haystack/pull/3164
    • refactoring: reimplement Docker strategy by @masci in https://github.com/deepset-ai/haystack/pull/3162
    • refactor: remove pre haystack-1.0 import paths support by @ZanSara in https://github.com/deepset-ai/haystack/pull/3204
    • feat: exponential backoff with exp decreasing batch size for opensearch and elasticsearch client by @ArzelaAscoIi in https://github.com/deepset-ai/haystack/pull/3194
    • feat: add public layout-base extraction support on PDFToTextConverter by @danielbichuetti in https://github.com/deepset-ai/haystack/pull/3137
    • bug: fix embedding_dim mismatch in DocumentStore by @kalki7 in https://github.com/deepset-ai/haystack/pull/3183
    • fix: update rest_api Docker Compose yamls for recent refactoring of rest_api by @nickchomey in https://github.com/deepset-ai/haystack/pull/3197
    • chore: fix Windows CI by @masci in https://github.com/deepset-ai/haystack/pull/3222
    • fix: type of temperature param and adjust defaults for OpenAIAnswerGenerator by @tholor in https://github.com/deepset-ai/haystack/pull/3073
    • fix: handle Documents containing dataframes in Multilabel constructor by @masci in https://github.com/deepset-ai/haystack/pull/3237
    • fix: make pydoc-markdown hook correctly resolve paths relative to repo root by @masci in https://github.com/deepset-ai/haystack/pull/3238
    • fix: proper retrieval of answers for batch eval by @vblagoje in https://github.com/deepset-ai/haystack/pull/3245
    • chore: updating colab links in older docs versions by @TuanaCelik in https://github.com/deepset-ai/haystack/pull/3250
    • docs: establish API docs sync between v1.9.x and Readme by @brandenchan in https://github.com/deepset-ai/haystack/pull/3266

    New Contributors

    • @Namoush made their first contribution in https://github.com/deepset-ai/haystack/pull/3086
    • @kalki7 made their first contribution in https://github.com/deepset-ai/haystack/pull/3183
    • @nickchomey made their first contribution in https://github.com/deepset-ai/haystack/pull/3197
    • @banjocustard made their first contribution in https://github.com/deepset-ai/haystack/pull/3233

    Full Changelog: https://github.com/deepset-ai/haystack/compare/v1.8.0...v1.9.0

  • v1.8.0(Aug 26, 2022)

    ⭐ Highlights

    This release comes with a bunch of new features, improvements and bug fixes. Let us know how you like it on our brand new Haystack Discord server! Here are the highlights of the release:

    Pipeline Evaluation in Batch Mode https://github.com/deepset-ai/haystack/pull/2942

    Pipeline evaluation often uses large datasets. With this new feature, batches of queries can be processed at the same time on a GPU, which shortens evaluation runs, and we are working on further speed improvements. To try it out, you only need to replace the call to pipeline.eval() with pipeline.eval_batch() when you evaluate your question answering pipeline:

    ...
    pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)
    eval_result = pipeline.eval_batch(labels=eval_labels, params={"Retriever": {"top_k": 5}})
    

    Early Stopping in Reader and Retriever Training https://github.com/deepset-ai/haystack/pull/3071

    When training a reader or retriever model, you need to specify the number of training epochs. If the model doesn't further improve after the first few epochs, the training usually still continues for the rest of the specified number of epochs. Early Stopping can now automatically monitor how much the model improves during training and stop the process when there is no significant improvement. Various metrics can be monitored, including loss, EM, f1, and top_n_accuracy for FARMReader or loss, acc, f1, and average_rank for DensePassageRetriever. For example, reader training can be stopped when loss doesn't further decrease by at least 0.001 compared to the previous epoch:

    from haystack.nodes import FARMReader
    from haystack.utils.early_stopping import EarlyStopping
    reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2-distilled")
    reader.train(data_dir="data/squad20", train_filename="dev-v2.0.json", early_stopping=EarlyStopping(min_delta=0.001), use_gpu=True, n_epochs=8, save_dir="my_model")
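
    The stopping rule described above can be sketched in plain Python. This is only an illustration, not Haystack's implementation: the function name first_stopping_epoch and the loss values are made up, the patience argument is an illustrative extra, and the sketch assumes each loss is compared against the best loss seen so far.

```python
from typing import List, Optional


def first_stopping_epoch(losses: List[float], min_delta: float = 0.001, patience: int = 0) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None if it runs to the end."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if best - loss >= min_delta:
            # Significant improvement: remember the new best loss and reset the counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement > patience:
                return epoch
    return None


# Hypothetical per-epoch validation losses
stop_epoch = first_stopping_epoch([0.90, 0.70, 0.65, 0.6495, 0.6494])
print(stop_epoch)
```

    With these made-up losses, the improvement in the fourth epoch is smaller than min_delta, so training would stop there instead of running all remaining epochs.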
    

    PineconeDocumentStore Without SQL Database https://github.com/deepset-ai/haystack/pull/2749

    Thanks to @jamescalam the PineconeDocumentStore does not depend on a local SQL database anymore. So when you initialize a PineconeDocumentStore from now on, all you need to provide is a Pinecone API key:

    from haystack import Document
    from haystack.document_stores import PineconeDocumentStore
    document_store = PineconeDocumentStore(api_key="...")
    docs = [Document(content="...")]
    document_store.write_documents(docs)
    

    FAISS in OpenSearchDocumentStore: https://github.com/deepset-ai/haystack/pull/3101 https://github.com/deepset-ai/haystack/pull/3029

    OpenSearch supports different approximate k-NN libraries for indexing and search. In Haystack's OpenSearchDocumentStore you can now set the knn_engine parameter to choose between nmslib and faiss. When loading an existing index you can also specify a knn_engine and Haystack checks if the same engine was used to create the index. If not, it falls back to slow exact vector calculation.
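
    The engine check described above can be pictured roughly as follows. This is a simplified sketch, not the actual OpenSearchDocumentStore code: the function search_mode and its return values are hypothetical names chosen for illustration.

```python
SUPPORTED_ENGINES = {"nmslib", "faiss"}


def search_mode(requested_engine: str, index_engine: str) -> str:
    """Decide how to search an existing index (hypothetical helper, for illustration only)."""
    if requested_engine not in SUPPORTED_ENGINES:
        raise ValueError(f"knn_engine must be one of {sorted(SUPPORTED_ENGINES)}")
    if requested_engine == index_engine:
        # The index was built with the requested engine: fast approximate k-NN search
        return "approximate_knn"
    # Engine mismatch: fall back to slow exact vector calculation
    return "exact"


print(search_mode("faiss", "faiss"))
print(search_mode("faiss", "nmslib"))
```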

    Highlighted Bug Fixes

    A bug was fixed that prevented users from loading private models in some components because the authentication token wasn't passed on correctly. A second bug was fixed in the schema files affecting parameters that are of type Optional[List[]], in which case the validation failed if the parameter was explicitly set to None.

    • fix: Use use_auth_token in all cases when loading from the HF Hub by @sjrl in https://github.com/deepset-ai/haystack/pull/3094
    • bug: handle Optional params in schema validation by @anakin87 in https://github.com/deepset-ai/haystack/pull/2980

    Other Changes

    DocumentStores

    • feat: Allow exact list matching with field in Elasticsearch filtering by @masci in https://github.com/deepset-ai/haystack/pull/2988

    Documentation

    • refactor: rename master into main in documentation and links by @ZanSara in https://github.com/deepset-ai/haystack/pull/3063
    • docs:fixed typo (or old documentation) in ipynb tutorial 3 by @DavidGerva in https://github.com/deepset-ai/haystack/pull/3033
    • docs: Add OpenAI Answer Generator API by @brandenchan in https://github.com/deepset-ai/haystack/pull/3050

    Crawler

    • fix: update ChromeDriver options on restricted environments and add ChromeDriver options as function parameter by @danielbichuetti in https://github.com/deepset-ai/haystack/pull/3043
    • fix: Crawler quits ChromeDriver on destruction by @danielbichuetti in https://github.com/deepset-ai/haystack/pull/3070

    Other Changes

    • fix(translator): write translated text to output documents, while keeping input untouched by @danielbichuetti in https://github.com/deepset-ai/haystack/pull/3077
    • test: Use random_sample instead of ndarray for random array in OpenSearchDocumentStore test by @bogdankostic in https://github.com/deepset-ai/haystack/pull/3083
    • feat: add progressbar to upload_files() for deepset Cloud client by @tholor in https://github.com/deepset-ai/haystack/pull/3069
    • refactor: update package metadata by @ofek in https://github.com/deepset-ai/haystack/pull/3079

    New Contributors

    • @DavidGerva made their first contribution in https://github.com/deepset-ai/haystack/pull/3033
    • @ofek made their first contribution in https://github.com/deepset-ai/haystack/pull/3079

    ❤️ Big thanks to all contributors and the whole community!

    Full Changelog: https://github.com/deepset-ai/haystack/compare/v1.7.1...v1.8.0

  • v1.7.1(Aug 19, 2022)

    Patch Release

    Main Changes

    • feat: take the list of models to cache instead of hardcoding one by @masci in https://github.com/deepset-ai/haystack/pull/3060

    Other Changes

    • fix: pin version of pyworld to 0.2.12 by @sjrl in https://github.com/deepset-ai/haystack/pull/3047
    • test: update filtering of Pinecone mock to imitate doc store by @jamescalam in https://github.com/deepset-ai/haystack/pull/3020

    Full Changelog: https://github.com/deepset-ai/haystack/compare/v1.7.0...v1.7.1

  • v1.7.0(Aug 15, 2022)

    ⭐ Highlights

    This time we have a couple of smaller yet important feature highlights: lots of them coming from you, our amazing community! 🥂 Alongside that, as we notice more frequent and great contributions from our community, we are also announcing our brand new Haystack Discord server to help us interact better with the people that make Haystack what it is! 🥳

    Here's what you'll find in Haystack 1.7:

    Support for OpenAI GPT-3

    If you always wanted to know how OpenAI's famous GPT-3 model compares to other models, now your time has come. It's been fully integrated into Haystack, so you can use it like any other model. Just sign up to OpenAI, copy your API key, and run the following code. To compare it to other models, check out our evaluation guide.

    from haystack.nodes import OpenAIAnswerGenerator
    from haystack import Document
    
    reader = OpenAIAnswerGenerator(api_key="<your-api-token>", max_tokens=15, temperature=0.3)
    
    docs = [Document(content="""The Big Bang Theory is an American sitcom.
                                The four main characters are all avid fans of nerd culture. 
                                Among their shared interests are science fiction, fantasy, comic books and collecting memorabilia. 
                                Star Trek in particular is frequently referenced""")]
    res = reader.predict(query="Do the main characters of big bang theory like Star Trek?", documents=docs)
    print(res)
    

    https://github.com/deepset-ai/haystack/pull/2605 https://github.com/deepset-ai/haystack/pull/3036

    Zero-Shot Query Classification

    Until now, TransformersQueryClassifier was built very closely around the excellent binary query-type classifier model of shahrukhx01. Although it was already possible to use other Transformer models, the choice was restricted to models that output binary labels. One of our amazing community contributions has now lifted this restriction. But that's not all: @anakin87 added support for zero-shot classification models as well! So now that you're completely free to choose the classification categories you want, you can let your creativity run wild. One thing you could do is customize the behavior of your pipeline based on the semantic category of the query, like this:

    from haystack.nodes import TransformersQueryClassifier
    
    # In zero-shot-classification, you are free to choose the labels
    labels = ["music", "cinema", "food"]
    
    query_classifier = TransformersQueryClassifier(
        model_name_or_path="typeform/distilbert-base-uncased-mnli",
        use_gpu=True,
        task="zero-shot-classification",
        labels=labels,
    )
    
    queries = [
        "In which films does John Travolta appear?",  # query about cinema
        "What is the Rolling Stones first album?",  # query about music
        "Who was Sergio Leone?",  # query about cinema
    ]
    
    for query in queries:
        result = query_classifier.run(query=query)
        print(f'Query "{query}" was sent to {result[1]}')
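
    To make the routing idea concrete, here is a minimal plain-Python sketch. It assumes that the node sends a query to the outgoing edge output_N, where N is the 1-based position of the highest-scoring label in labels; please verify this against the node's documentation. The helper output_edge_for and the scores are hypothetical.

```python
from typing import Dict, List


def output_edge_for(scores: Dict[str, float], labels: List[str]) -> str:
    """Map the highest-scoring label to an edge name, assuming labels[0] -> output_1, and so on."""
    best_label = max(labels, key=lambda label: scores[label])
    return f"output_{labels.index(best_label) + 1}"


labels = ["music", "cinema", "food"]
# Hypothetical scores a zero-shot model might assign to a single query
scores = {"music": 0.08, "cinema": 0.89, "food": 0.03}
edge = output_edge_for(scores, labels)
print(edge)
```

    In a pipeline, each of these edges can then feed a different branch, for example a retriever per category.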
    

    https://github.com/deepset-ai/haystack/pull/2965

    Adding Page Numbers to Document Meta

    Sometimes it's not enough to find the right answer or paragraph inside a document and just print it on the screen. Context matters and thus, for search applications, it's essential to send the user exactly to the place where the information came from. For huge documents, we're just halfway there if the user clicks a result and the document opens. To get to the right position, they still need to search the document using the document viewer. To make it easier, we added the parameter add_page_number to ParsrConverter, AzureConverter and PreProcessor. If you set it to True, it adds a meta field "page" to documents containing the page number of the text snippet or a table within the original file.

    from haystack import Pipeline
    from haystack.nodes import PDFToTextConverter, PreProcessor
    from haystack.document_stores import InMemoryDocumentStore
    
    converter = PDFToTextConverter()
    preprocessor = PreProcessor(add_page_number=True)
    document_store = InMemoryDocumentStore()
    
    pipeline = Pipeline()
    pipeline.add_node(component=converter, name="Converter", inputs=["File"])
    pipeline.add_node(component=preprocessor, name="Preprocessor", inputs=["Converter"])
    # The input name must match the node name used above ("Preprocessor")
    pipeline.add_node(component=document_store, name="DocumentStore", inputs=["Preprocessor"])
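
    How a page number can be derived for a text snippet is sketched below. The sketch assumes that the converter separates pages with form feed characters ("\f") and that the page is found by counting the form feeds before the snippet; the helper page_of and the sample text are hypothetical and not taken from Haystack.

```python
def page_of(text: str, offset: int) -> int:
    """Return the 1-based page of the character at `offset`, assuming form feeds separate pages."""
    return text.count("\f", 0, offset) + 1


# Hypothetical converter output with three pages
document_text = "Intro on page one.\fDetails on page two.\fAppendix on page three."
snippet_start = document_text.index("Details")
page = page_of(document_text, snippet_start)
print(page)
```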
    

    https://github.com/deepset-ai/haystack/pull/2932

    Gradient Accumulation for FARMReader

    Training big Transformer models in low-resource environments is hard. Batch size plays a significant role when it comes to hyper-parameter tuning during the training process. The batch size you can use on your machine is restricted by the amount of memory available on your GPUs. Gradient accumulation is a well-known technique to work around that restriction: the gradients are added up across iterations and the model weights are updated only once after a certain number of iterations. We tested it when we fine-tuned roberta-base on SQuAD, which led to nearly the same results as using a higher batch size. We also used it for training deepset/deberta-v3-large, which significantly outperformed its predecessors (see Question Answering on SQuAD).

    from haystack.nodes import FARMReader
    
    reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad", use_gpu=True)
    data_dir = "data/squad20"
    reader.train(
        data_dir=data_dir,
        train_filename="dev-v2.0.json",
        use_gpu=True,
        n_epochs=1,
        save_dir="my_model",
        grad_acc_steps=8,
    )
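
    Why accumulated gradients behave like a larger batch can be checked with a tiny example. The following is a plain-Python sketch with a one-parameter linear model and made-up data, not the FARMReader training loop; it assumes that each micro-batch gradient is scaled by 1 / grad_acc_steps before being added up.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y)


def mse_gradient(w: float, batch: List[Sample]) -> float:
    """Gradient of the mean squared error of the model y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# Made-up training data
data: List[Sample] = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8),
                      (5.0, 10.1), (6.0, 12.2), (7.0, 13.8), (8.0, 16.1)]
w = 0.5

# Gradient of one large batch with all 8 samples
full_batch_grad = mse_gradient(w, data)

# Four micro-batches of 2 samples: gradients are added up, the weight stays fixed in between
grad_acc_steps = 4
micro_batch_size = len(data) // grad_acc_steps
accumulated_grad = 0.0
for step in range(grad_acc_steps):
    micro_batch = data[step * micro_batch_size:(step + 1) * micro_batch_size]
    # Scale each micro-batch gradient so that the sum matches the large-batch mean
    accumulated_grad += mse_gradient(w, micro_batch) / grad_acc_steps

# A single weight update uses the accumulated gradient
learning_rate = 0.01
w_updated = w - learning_rate * accumulated_grad
print(full_batch_grad, accumulated_grad)
```

    Both gradients agree up to floating-point rounding, while only one micro-batch at a time has to fit into memory.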
    

    https://github.com/deepset-ai/haystack/pull/2925

    Extended Ray Support

    Another great contribution from our community comes from @zoltan-fedor: it's now possible to run more complex pipelines with a dual-retriever setup on Ray. Also, we now support Ray Serve deployment arguments in Pipeline YAMLs so that you can fully control your Ray deployments.

    pipelines:
      - name: ray_query_pipeline
        nodes:
          - name: EmbeddingRetriever
            replicas: 2
            inputs: [ Query ]
            serve_deployment_kwargs:
              num_replicas: 2
              version: Twenty
              ray_actor_options:
                num_gpus: 0.25
                num_cpus: 0.5
              max_concurrent_queries: 17
          - name: Reader
            inputs: [ EmbeddingRetriever ]
    

    https://github.com/deepset-ai/haystack/pull/2981 https://github.com/deepset-ai/haystack/pull/2918

    Support for Custom Sentence Tokenizers in PreProcessor

    In some specific domains (for example, legal with lots of custom abbreviations), the default sentence tokenizer can be improved by some extra training on the domain data. To support a custom model for sentence splitting, @danielbichuetti added the tokenizer_model_folder parameter to PreProcessor.

    from haystack.nodes import PreProcessor
    
    preprocessor = PreProcessor(
        split_length=10,
        split_overlap=0,
        split_by="sentence",
        split_respect_sentence_boundary=False,
        language="pt",
        tokenizer_model_folder="/home/user/custom_tokenizer_models",
    )
    

    https://github.com/deepset-ai/haystack/pull/2783

    Making it Easier to Switch Document Stores

    We had yet another amazing community contribution by @zoltan-fedor, who added support for BM25 with the Weaviate document store. Besides that, we streamlined the methods of BaseDocumentStore and added update_document_meta() to InMemoryDocumentStore. These are all steps to make it easier for you to run the same pipeline with different document stores (for example, use in-memory for quick prototyping, then move to something more production-ready). https://github.com/deepset-ai/haystack/pull/2860 https://github.com/deepset-ai/haystack/pull/2689

    Almost 2x Performance Gain for Electra Reader Models

    We did a major refactoring of our language_modeling module resolving a bug that caused Electra models to execute the forward pass twice. https://github.com/deepset-ai/haystack/pull/2703.

    ⚠️ Breaking Changes

    • Add update_document_meta to InMemoryDocumentStore by @bogdankostic in https://github.com/deepset-ai/haystack/pull/2689
    • Add support for BM25 with the Weaviate document store by @zoltan-fedor in https://github.com/deepset-ai/haystack/pull/2860
    • Extending the Ray Serve integration to allow attributes for Serve deployments by @zoltan-fedor in https://github.com/deepset-ai/haystack/pull/2918
    • bug: make MultiLabel ids consistent across python interpreters by @camillepradel in https://github.com/deepset-ai/haystack/pull/2998

    ⚠️ Breaking Changes for Contributors

    Default Branch will be Renamed to main on Tuesday, 16th of August

    We will rename the default branch from master to main after this release. For a nice recap about good reasons for doing this, have a look at the Software Freedom Conservancy's blog. Whether coming from this repository or from a fork, local clones of the Haystack repository will need to be updated by running the following commands:

    git branch -m master main
    git fetch origin
    git branch -u origin/main main
    git remote set-head origin -a
    

    Pre-Commit Hooks Instead of CI Jobs

    To give you full control over your changes, we switched from CI jobs that automatically reformat files, generate schemas, and so on, to pre-commit hooks. To install them, run:

    pre-commit install
    

    For more information, check our contributing guidelines. https://github.com/deepset-ai/haystack/pull/2819

    Other Changes

    Pipeline

    • Fix _debug info getting lost for previous nodes when using join nodes by @tstadel in https://github.com/deepset-ai/haystack/pull/2776
    • fix pipeline run loop on joined pipelines without debug flag by @tstadel in https://github.com/deepset-ai/haystack/pull/2777
    • Fix crawler long file names by @danielbichuetti in https://github.com/deepset-ai/haystack/pull/2723
    • Prevent PDFToTextConverter from failing on PDFs with spaces in their names by @danielbichuetti in https://github.com/deepset-ai/haystack/pull/2786
    • Passing the meta-data in the summarizer response by @SjSnowball in https://github.com/deepset-ai/haystack/pull/2179
    • Fix YAML validation for ElasticsearchDocumentStore.custom_query by @ZanSara in https://github.com/deepset-ai/haystack/pull/2789
    • Fix gold_contexts_similarity for table retrieval evaluation by @tstadel in https://github.com/deepset-ai/haystack/pull/2815
    • Fix validation for dynamic outgoing edges by @tstadel in https://github.com/deepset-ai/haystack/pull/2850
    • Print eval reports improvements by @vblagoje in https://github.com/deepset-ai/haystack/pull/2941
    • Add progress bar to batch run component ops by @vblagoje in https://github.com/deepset-ai/haystack/pull/2864
    • feat: warn users if they're calling get_all_labels on a document index and vice-versa (Elasticsearch & Opensearch only) by @ZanSara in https://github.com/deepset-ai/haystack/pull/2990
    • Make MultiLabel preserve order by @anakin87 in https://github.com/deepset-ai/haystack/pull/2956
    • bug: fix UnboundLocalError in Pipeline.run_batch() by @anakin87 in https://github.com/deepset-ai/haystack/pull/3016
    • feat: enable the JoinDocuments node to work with documents with score=None by @zoltan-fedor in https://github.com/deepset-ai/haystack/pull/2984
    • Resolving issue 2853: no answer logic in FARMReader by @sjrl in https://github.com/deepset-ai/haystack/pull/2856
    • bug: Make TranslationWrapperPipeline work with QuestionAnswerGenerationPipeline by @bogdankostic in https://github.com/deepset-ai/haystack/pull/3034

    Models

    • Simplify language_modeling.py and tokenization.py by @ZanSara in https://github.com/deepset-ai/haystack/pull/2703
    • Validate OpenAI response by @anakin87 in https://github.com/deepset-ai/haystack/pull/2844
    • remove unnecessary if else block #2835 by @kekayan in https://github.com/deepset-ai/haystack/pull/2842
    • Explicitly specify all parameters to forward call by @vblagoje in https://github.com/deepset-ai/haystack/pull/2886
    • Use batch_size in QuestionGenerator by @GianiStatie in https://github.com/deepset-ai/haystack/pull/2870
    • Generalize special tokens of QuestionGenerator node by @francescocastelli in https://github.com/deepset-ai/haystack/pull/2769
    • Component batch_size should be defined rather than Optional by @vblagoje in https://github.com/deepset-ai/haystack/pull/2958
    • Better check for "DebertaV2" architecture in Trainer.train by @sjrl in https://github.com/deepset-ai/haystack/pull/2966

    DocumentStores

    • Fix confusing elasticsearch exception by @tstadel in https://github.com/deepset-ai/haystack/pull/2763
    • added mock pinecone client by @jamescalam in https://github.com/deepset-ai/haystack/pull/2770
    • changed mock pinecone to use dict rather than list index by @jamescalam in https://github.com/deepset-ai/haystack/pull/2845
    • Handle invalid metadata for SQLDocumentStore by @anakin87 in https://github.com/deepset-ai/haystack/pull/2868
    • Use opensearch-py in OpenSearchDocumentStore by @masci in https://github.com/deepset-ai/haystack/pull/2691
    • Wrap opensearch imports into safe_import by @ZanSara in https://github.com/deepset-ai/haystack/pull/2907
    • Bug fix Weaviate document deletion by @stevenhaley in https://github.com/deepset-ai/haystack/pull/2899
    • switch label variables in test_labels by @jamescalam in https://github.com/deepset-ai/haystack/pull/3011
    • Adding support for additional distance/similarity metrics for Weaviate by @zoltan-fedor in https://github.com/deepset-ai/haystack/pull/3001
    • test: add meta fields for meta_config to be used during testing by @jamescalam in https://github.com/deepset-ai/haystack/pull/3021
    • Fix embeddings_field_supports_similarity of OpenSearchDocumentStore when creating index by @tstadel in https://github.com/deepset-ai/haystack/pull/3030
    • Forbid the key id from Documents to be written in WeaviateDocumentStore by @thenewera-ru in https://github.com/deepset-ai/haystack/pull/2846

    Documentation

    • Trying out some smaller images for docs by @TuanaCelik in https://github.com/deepset-ai/haystack/pull/2772
    • Clean OpenAIAnswerGenerator docstrings by @brandenchan in https://github.com/deepset-ai/haystack/pull/2797
    • Add a custom pydoc renderer for Readme.io by @masci in https://github.com/deepset-ai/haystack/pull/2825
    • Typo README.md by @danielfleischer in https://github.com/deepset-ai/haystack/pull/2895
    • Fix typos in Contributing.md by @stevenhaley in https://github.com/deepset-ai/haystack/pull/2897
    • Fix docs code format for sentence transformers by @bilgeyucel in https://github.com/deepset-ai/haystack/pull/2957
    • Update Seq2SeqGenerator API documentation by @vblagoje in https://github.com/deepset-ai/haystack/pull/2970
    • Add API page for util functions by @brandenchan in https://github.com/deepset-ai/haystack/pull/2863
    • docs: update File Classifier Docstring by @brandenchan in https://github.com/deepset-ai/haystack/pull/3018

    Tutorials

    • Fix load_from_yaml example in the Pipelines tutorial by @agnieszka-m in https://github.com/deepset-ai/haystack/pull/2774
    • Tutorial 12: add introduction by @vblagoje in https://github.com/deepset-ai/haystack/pull/2798
    • Exclude docker from Tutorial 15 by @anakin87 in https://github.com/deepset-ai/haystack/pull/2861
    • Remove logging config from Haystack by @julian-risch in https://github.com/deepset-ai/haystack/pull/2848
    • docs: extend tutorial14 about query classification by @anakin87 in https://github.com/deepset-ai/haystack/pull/3013
    • Tutorial 06: Replace DPR with EmbeddingRetriever by @bglearning in https://github.com/deepset-ai/haystack/pull/2910

    Other Changes

    • API key check in OpenAIAnswerGenerator by @ZanSara in https://github.com/deepset-ai/haystack/pull/2791
    • API tests by @masci in https://github.com/deepset-ai/haystack/pull/2738
    • Allow values that are not dictionaries in the request params in the /search endpoint by @masci in https://github.com/deepset-ai/haystack/pull/2720
    • fix healthcheck cmds for annotation tool postgres by @tstadel in https://github.com/deepset-ai/haystack/pull/2840
    • Remove deprecated method prepare_seq2seq_batch by @anakin87 in https://github.com/deepset-ai/haystack/pull/2852
    • Fix corrupted csv from EvaluationResult.save() by @tstadel in https://github.com/deepset-ai/haystack/pull/2854
    • Fix audio dependency chain issue on Python 3.10 by @danielbichuetti in https://github.com/deepset-ai/haystack/pull/2900
    • Add switch for BiAdaptive and TriAdaptiveModel in Evaluator by @ZanSara in https://github.com/deepset-ai/haystack/pull/2908
    • Fix serialization of numpy arrays and pandas dataframes in REST API by @tstadel in https://github.com/deepset-ai/haystack/pull/2838
    • Update minimum selenium version supported for crawler by @sjrl in https://github.com/deepset-ai/haystack/pull/2921
    • Enable Opensearch unit tests in Windows CI by @masci in https://github.com/deepset-ai/haystack/pull/2936
    • Remove unused variable by @sjrl in https://github.com/deepset-ai/haystack/pull/2974
    • Bump streamlit version to latest by @masci in https://github.com/deepset-ai/haystack/pull/3002
    • Testing order in test_multilabel by @jamescalam in https://github.com/deepset-ai/haystack/pull/3015
    • fix: move azure-core pin into the dev dependency list by @ZanSara in https://github.com/deepset-ai/haystack/pull/3022
    • Fix broken MultiLabel serialization by @tstadel in https://github.com/deepset-ai/haystack/pull/3037

    New Contributors

    • @kekayan made their first contribution in https://github.com/deepset-ai/haystack/pull/2842
    • @sjrl made their first contribution in https://github.com/deepset-ai/haystack/pull/2884
    • @zoltan-fedor made their first contribution in https://github.com/deepset-ai/haystack/pull/2860
    • @danielfleischer made their first contribution in https://github.com/deepset-ai/haystack/pull/2895
    • @stevenhaley made their first contribution in https://github.com/deepset-ai/haystack/pull/2897
    • @GianiStatie made their first contribution in https://github.com/deepset-ai/haystack/pull/2870
    • @bglearning made their first contribution in https://github.com/deepset-ai/haystack/pull/2910
    • @bilgeyucel made their first contribution in https://github.com/deepset-ai/haystack/pull/2957
    • @wochinge made their first contribution in https://github.com/deepset-ai/haystack/pull/2883
    • @camillepradel made their first contribution in https://github.com/deepset-ai/haystack/pull/2998
    • @thenewera-ru made their first contribution in https://github.com/deepset-ai/haystack/pull/2846

    ❤️ Big thanks to all contributors and the whole community!

    Full Changelog: https://github.com/deepset-ai/haystack/compare/v1.6.0...v1.7.0

  • v1.6.0(Jul 6, 2022)

    ⭐ Highlights

    Make Your QA Pipelines Talk with Audio Nodes! (https://github.com/deepset-ai/haystack/pull/2584)

    Indexing pipelines can use a new DocumentToSpeech node, which generates an audio file for each indexed document and stores it alongside the text content in a SpeechDocument. A GPU is recommended for this step to increase indexing speed. During querying, SpeechDocuments allow accessing the stored audio version of the documents the answers are extracted from. There is also a new AnswerToSpeech node that can be used in QA pipelines to generate the audio of an answer on the fly. See the new tutorial for a step by step guide on how to make your QA pipelines talk!

    Save Models to Remote (https://github.com/deepset-ai/haystack/pull/2618)

    A new save_to_remote method was introduced to the FARMReader, so that you can easily upload a trained model to the Hugging Face Model Hub. More of this to come in the following releases!

    from haystack.nodes import FARMReader
    
    reader = FARMReader(model_name_or_path="roberta-base")
    reader.train(data_dir="my_squad_data", train_filename="squad2.json", n_epochs=1, save_dir="my_model")
    
    reader.save_to_remote(repo_id="your-user-name/roberta-base-squad2", private=True, commit_message="First version of my qa model trained with Haystack")
    

    Note that you need to be logged in with transformers-cli login. Otherwise, there will be an error message with instructions on how to log in. Further, if you make your model private by setting private=True, others won't be able to use it and you will need to pass an authentication token when you reload the model from the Model Hub. This token is also created via transformers-cli login.

    new_reader = FARMReader(model_name_or_path="your-user-name/roberta-base-squad2", use_auth_token=True)
    

    Multi-Hop Dense Retrieval (https://github.com/deepset-ai/haystack/pull/2571)

    There is a new MultihopEmbeddingRetriever node that applies iterative retrieval steps and uses a shared encoder for the query and the documents. Used together with a reader node in a QA pipeline, it is suited for answering complex open-domain questions that require "hopping" across multiple relevant documents. See the original paper by Xiong et al. for more details: "Answering complex open-domain questions with multi-hop dense retrieval".

    from haystack.nodes import MultihopEmbeddingRetriever
    from haystack.document_stores import InMemoryDocumentStore
    
    document_store = InMemoryDocumentStore()
    retriever = MultihopEmbeddingRetriever(
        document_store=document_store,
        embedding_model="deutschmann/mdr_roberta_q_encoder",
    )
    

    Big thanks to our community member @deutschmn for the PR!

    InMemoryKnowledgeGraph (https://github.com/deepset-ai/haystack/pull/2678)

    Besides querying texts and tables, Haystack also allows querying knowledge graphs with the help of pre-trained models that translate text queries to graph queries. The latest Haystack release adds an InMemoryKnowledgeGraph, which lets you store knowledge graphs without setting up complex graph databases. Try out the tutorial as a notebook on Colab!

    from pathlib import Path
    from haystack.nodes import Text2SparqlRetriever
    from haystack.document_stores import InMemoryKnowledgeGraph
    from haystack.utils import fetch_archive_from_http
    
    # Fetch data represented as triples of subject, predicate, and object statements
    fetch_archive_from_http(url="https://fandom-qa.s3-eu-west-1.amazonaws.com/triples_and_config.zip", output_dir="data/tutorial10")
    
    # Fetch a pre-trained BART model that translates text queries to SPARQL queries
    fetch_archive_from_http(url="https://fandom-qa.s3-eu-west-1.amazonaws.com/saved_models/hp_v3.4.zip", output_dir="../saved_models/tutorial10/")
    
    # Initialize knowledge graph and import triples from a ttl file
    kg = InMemoryKnowledgeGraph(index="tutorial10")
    kg.create_index()
    kg.import_from_ttl_file(index="tutorial10", path=Path("data/tutorial10/triples.ttl"))
    
    # Initialize retriever from pre-trained model
    kgqa_retriever = Text2SparqlRetriever(knowledge_graph=kg, model_name_or_path=Path("../saved_models/tutorial10/hp_v3.4"))
    
    # Translate a text query to a SPARQL query and execute it on the knowledge graph
    print(kgqa_retriever.retrieve(query="In which house is Harry Potter?"))
    

    Big thanks to our community member @anakin87 for the PR!

    Torch 1.12 and Transformers 4.20.1 Support

    Haystack is now compatible with last week's PyTorch v1.12 release so that you can take advantage of Apple silicon GPUs (Apple M1) for accelerated training and evaluation. PyTorch shared an impressive analysis of speedups over CPU-only here. Haystack is also compatible with the latest Transformers v4.20.1 release and we will continuously ensure that you can benefit from the latest features in Haystack!

    Other Changes

    Pipeline

    • Fix JoinAnswer/JoinNode by @MichelBartels in https://github.com/deepset-ai/haystack/pull/2612
    • Reduce logging messages and simplify logging by @julian-risch in https://github.com/deepset-ai/haystack/pull/2682
    • Correct docstring parameter name by @julian-risch in https://github.com/deepset-ai/haystack/pull/2757
    • AnswerToSpeech by @ZanSara in https://github.com/deepset-ai/haystack/pull/2584
    • Fix params being changed during pipeline.eval() by @tstadel in https://github.com/deepset-ai/haystack/pull/2638
    • Make crawler extract also hidden text by @anakin87 in https://github.com/deepset-ai/haystack/pull/2642
    • Update document scores based on ranker node by @mathislucka in https://github.com/deepset-ai/haystack/pull/2048
    • Improved crawler support for dynamically loaded pages by @danielbichuetti in https://github.com/deepset-ai/haystack/pull/2710
    • Replace deprecated Selenium methods by @ZanSara in https://github.com/deepset-ai/haystack/pull/2724
    • Fix EvaluationSetClient.get_labels() by @tstadel in https://github.com/deepset-ai/haystack/pull/2690
    • Show warning in reader.eval() about differences compared to pipeline.eval() by @tstadel in https://github.com/deepset-ai/haystack/pull/2477
    • Fix using id_hash_keys as pipeline params by @tstadel in https://github.com/deepset-ai/haystack/pull/2717
    • Fix loading of tokenizers in DPR by @bogdankostic in https://github.com/deepset-ai/haystack/pull/2755
    • Add support for Multi-Hop Dense Retrieval by @deutschmn in https://github.com/deepset-ai/haystack/pull/2571
    • Create target folder if not exists in EvalResult.save() by @tstadel in https://github.com/deepset-ai/haystack/pull/2647
    • Validate max_seq_length in SquadProcessor by @francescocastelli in https://github.com/deepset-ai/haystack/pull/2740

    Models

    • Use AutoTokenizer by default, to easily adapt to new models and token… by @apohllo in https://github.com/deepset-ai/haystack/pull/1902
    • first version of save_to_remote for HF from FarmReader by @TuanaCelik in https://github.com/deepset-ai/haystack/pull/2618

    DocumentStores

    • Move Opensearch document store in its own module by @masci in https://github.com/deepset-ai/haystack/pull/2603
    • Extract common code for ES and OS into a base class by @masci in https://github.com/deepset-ai/haystack/pull/2664
    • Fix bugs in loading code from yaml by @masci in https://github.com/deepset-ai/haystack/pull/2705
    • fix error in log message by @anakin87 in https://github.com/deepset-ai/haystack/pull/2719
    • Pin es client to include bugfixes by @masci in https://github.com/deepset-ai/haystack/pull/2735
    • Make check of document & embedding count optional in FAISS and Pinecone by @julian-risch in https://github.com/deepset-ai/haystack/pull/2677
    • In memory knowledge graph by @anakin87 in https://github.com/deepset-ai/haystack/pull/2678
    • Pinecone unary queries upgrade by @jamescalam in https://github.com/deepset-ai/haystack/pull/2657
    • wait for postgres to be ready before data migrations by @masci in https://github.com/deepset-ai/haystack/pull/2654

    Documentation & Tutorials

    • Update docstrings for GPL by @agnieszka-m in https://github.com/deepset-ai/haystack/pull/2633
    • Add GPL API docs, unit tests update by @vblagoje in https://github.com/deepset-ai/haystack/pull/2634
    • Add GPL adaptation tutorial by @vblagoje in https://github.com/deepset-ai/haystack/pull/2632
    • GPL tutorial - add GPU header and open in colab button by @vblagoje in https://github.com/deepset-ai/haystack/pull/2736
    • Add execute_eval_run example to Tutorial 5 by @tstadel in https://github.com/deepset-ai/haystack/pull/2459
    • Tutorial 14 edit by @robpasternak in https://github.com/deepset-ai/haystack/pull/2663

    Misc

    • Replace question issue with link to discussions by @masci in https://github.com/deepset-ai/haystack/pull/2697
    • Upgrade transformers to 4.20.1 by @julian-risch in https://github.com/deepset-ai/haystack/pull/2702
    • Upgrade torch to 1.12 by @julian-risch in https://github.com/deepset-ai/haystack/pull/2741
    • Remove rapidfuzz version pin by @tstadel in https://github.com/deepset-ai/haystack/pull/2730

    New Contributors

    • @ryanrussell made their first contribution in https://github.com/deepset-ai/haystack/pull/2617
    • @apohllo made their first contribution in https://github.com/deepset-ai/haystack/pull/1902
    • @robpasternak made their first contribution in https://github.com/deepset-ai/haystack/pull/2663
    • @danielbichuetti made their first contribution in https://github.com/deepset-ai/haystack/pull/2710
    • @francescocastelli made their first contribution in https://github.com/deepset-ai/haystack/pull/2740
    • @deutschmn made their first contribution in https://github.com/deepset-ai/haystack/pull/2571

    ❤️ Big thanks to all contributors and the whole community!

    Full Changelog: https://github.com/deepset-ai/haystack/compare/v1.5.0...v1.6.0

  • v1.5.0(Jun 2, 2022)

    ⭐ Highlights

    Generative Pseudo Labeling

    Dense retrievers excel when fine-tuned on a labeled dataset of the target domain. However, such datasets rarely exist and are costly to create from scratch with human annotators. Generative Pseudo Labeling (GPL) solves this dilemma by creating labels automatically for you, which makes it a super fast and low-cost alternative to manual annotation. Technically speaking, it is an unsupervised approach for domain adaptation of dense retrieval models. Given a corpus of unlabeled documents from the target domain, it automatically generates queries on that corpus and then uses a cross-encoder model to create pseudo labels for these queries. The pseudo labels can then be used to adapt retriever models to that domain. Here is a code example that shows how to do that in Haystack:

    from haystack.nodes.retriever import EmbeddingRetriever
    from haystack.document_stores import InMemoryDocumentStore
    from haystack.nodes.question_generator.question_generator import QuestionGenerator
    from haystack.nodes.label_generator.pseudo_label_generator import PseudoLabelGenerator
    
    # Initialize any document store and fill it with documents from your domain - no labels needed.
    document_store = InMemoryDocumentStore()
    document_store.write_documents(...) 
    
    # Calculate and store a dense embedding for each document
    retriever = EmbeddingRetriever(document_store=document_store, 
                                   embedding_model="sentence-transformers/msmarco-distilbert-base-tas-b", 
                                   max_seq_len=200)
    document_store.update_embeddings(retriever)
    
    # Use the new PseudoLabelGenerator to automatically generate labels and train the retriever on them
    qg = QuestionGenerator(model_name_or_path="doc2query/msmarco-t5-base-v1", max_length=64, split_length=200, batch_size=12)
    psg = PseudoLabelGenerator(qg, retriever)
    output, _ = psg.run(documents=document_store.get_all_documents()) 
    retriever.train(output["gpl_labels"])
    

    https://github.com/deepset-ai/haystack/pull/2388
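
    To make the pseudo-labeling step more concrete, here is a minimal pure-Python sketch. It assumes the margin formulation used by the GPL method: the label for a generated query is the cross-encoder score of its positive passage minus the score of a negative passage. The function names, the dictionary keys, and the word-overlap scorer are hypothetical stand-ins chosen for illustration. They are not Haystack's implementation, and a real setup would use a trained cross-encoder.

```python
from typing import Callable, Dict, List, Tuple


def toy_cross_encoder(query: str, passage: str) -> float:
    """Hypothetical stand-in for a cross-encoder: share of query words found in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0


def make_pseudo_labels(
    triples: List[Tuple[str, str, str]],
    score: Callable[[str, str], float] = toy_cross_encoder,
) -> List[Dict[str, object]]:
    """Turn (generated query, positive passage, negative passage) triples into margin labels."""
    labels = []
    for query, positive, negative in triples:
        # The pseudo label is the score margin between the positive and the negative passage.
        margin = score(query, positive) - score(query, negative)
        labels.append({"question": query, "pos_doc": positive, "neg_doc": negative, "score": margin})
    return labels


# Illustrative data only: the query would normally come from a question generator.
triples = [("capital of france", "paris is the capital of france", "berlin is the capital of germany")]
labels = make_pseudo_labels(triples)
print(labels[0]["score"])
```

    A larger margin tells the retriever that the positive passage should be ranked well above the negative one for that query.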

    Batch Processing with Query Pipelines

    Every query pipeline now has a run_batch() method, which lets you pass multiple queries to the pipeline at once. Together with a list of queries, you can provide either a single list of documents or a list of lists of documents. In the first case, answers are returned for each query-document pair. In the second case, each query is applied to the list of documents at the same index. A third option is to pass a list containing a single query, which is then applied to each list of documents separately. Here is an example with a pipeline:

    from haystack.pipelines import ExtractiveQAPipeline
    ...
    pipe = ExtractiveQAPipeline(reader, retriever)
    predictions = pipe.pipeline.run_batch(
        queries=["Who is the father of Arya Stark?", "Who is the mother of Arya Stark?"],
        params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}},
    )
    

    And here is an example with a single reader node:

    from haystack.nodes import FARMReader
    from haystack.schema import Document
    
    # predict_batch() is an instance method, so create a reader first
    reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")
    reader.predict_batch(
        queries=["1st sample query", "2nd sample query"],
        documents=[
            [Document(content="sample doc1"), Document(content="sample doc2")],
            [Document(content="sample doc3"), Document(content="sample doc4")],
        ],
    )
    
    # Returns a dictionary of the form:
    # {"queries": ["1st sample query", "2nd sample query"], "answers": [[Answers from doc1 and doc2], [Answers from doc3 and doc4]], ...}
    

    https://github.com/deepset-ai/haystack/pull/2481 https://github.com/deepset-ai/haystack/pull/2575
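
    The three input combinations described above can be sketched in plain Python. This only illustrates how queries and documents are paired. The helper name pair_inputs and the use of strings in place of Document objects are assumptions made for the example, and the code does not reproduce Haystack's internal logic.

```python
from typing import List, Tuple, Union

Doc = str  # stand-in for a Haystack Document


def pair_inputs(
    queries: List[str],
    documents: Union[List[Doc], List[List[Doc]]],
) -> List[Tuple[str, List[Doc]]]:
    """Return the (query, documents) combinations a batch call would process."""
    if not documents:
        return []
    if not isinstance(documents[0], list):
        # Case 1: a single flat list, so every query meets every document.
        return [(query, [doc]) for query in queries for doc in documents]
    if len(queries) == 1:
        # Case 3: one query is applied to each list of documents separately.
        return [(queries[0], docs) for docs in documents]
    if len(queries) != len(documents):
        raise ValueError("Number of queries must match number of document lists.")
    # Case 2: each query is applied to the document list at the same index.
    return list(zip(queries, documents))


print(pair_inputs(["q1", "q2"], [["doc1", "doc2"], ["doc3"]]))
```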

    Pipeline Evaluation with Advanced Label Scopes

    Typically, a predicted answer is considered correct if it matches the gold answer in the set of evaluation labels. Similarly, a retrieved document is considered correct if its ID matches the gold document ID in the labels. Sometimes, however, these simple definitions of "correctness" are not sufficient and you want to further specify the "scope" within which an answer or a document is considered correct. For this reason, EvaluationResult.calculate_metrics() accepts the parameters answer_scope and document_scope.

    As an example, you might consider an answer to be correct only if it stems from a specific context of surrounding words. You can specify answer_scope="context" in calculate_metrics() in that case. See the updated docstrings with a description of the different label scopes or the updated tutorial on evaluation.

    ...
    document_store.add_eval_data(
            filename="data/tutorial5/nq_dev_subset_v2.json",
            preprocessor=preprocessor,
        )
    ...
    eval_labels = document_store.get_all_labels_aggregated(drop_negative_labels=True, drop_no_answers=True)
    eval_result = pipeline.eval(labels=eval_labels, params={"Retriever": {"top_k": 5}})
    metrics = eval_result.calculate_metrics(answer_scope="context")
    print(f'Reader - F1-Score: {metrics["Reader"]["f1"]}')
    

    https://github.com/deepset-ai/haystack/pull/2482
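
    As a rough illustration of what a label scope changes, the following sketch compares a predicted answer with a gold answer under two scopes. The class, the function, and the scope name "any" are assumptions made for this example. Exact string equality is also a simplification, so this is not how calculate_metrics() is implemented.

```python
from dataclasses import dataclass


@dataclass
class AnswerRecord:
    answer: str
    context: str


def is_correct(predicted: AnswerRecord, gold: AnswerRecord, answer_scope: str = "any") -> bool:
    """Decide whether a prediction counts as correct within the given scope."""
    if predicted.answer != gold.answer:
        return False
    if answer_scope == "any":
        # The answer text alone decides.
        return True
    if answer_scope == "context":
        # The answer must also stem from the same surrounding words.
        return predicted.context == gold.context
    raise ValueError(f"Unsupported answer_scope: {answer_scope}")


gold = AnswerRecord(answer="Ned Stark", context="Arya is the daughter of Ned Stark.")
predicted = AnswerRecord(answer="Ned Stark", context="Ned Stark is the Lord of Winterfell.")
print(is_correct(predicted, gold), is_correct(predicted, gold, answer_scope="context"))
```

    The same prediction can therefore count as a hit under a loose scope and as a miss under a stricter one, which is why metrics may drop when the scope is narrowed.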

    Support of DeBERTa Models

    Haystack now supports DeBERTa models! These models come with some smart architectural improvements over BERT and RoBERTa, such as encoding the relative and absolute position of a token in the input sequence. Only the following three lines are needed to train a DeBERTa reader model on the SQuAD 2.0 dataset. Compared to a RoBERTa model trained on that dataset, you can expect a boost in F1-score from ~84% to ~88% ("microsoft/deberta-v3-large" even gets you to an F1-score as high as ~92%).

    from haystack.nodes import FARMReader
    reader = FARMReader(model_name_or_path="microsoft/deberta-v3-base")
    reader.train(data_dir="data/squad20", train_filename="train-v2.0.json", dev_filename="dev-v2.0.json", save_dir="my_model")
    

    https://github.com/deepset-ai/haystack/pull/2097

    ⚠️ Breaking Changes

    • Validation for Ray pipelines by @ZanSara in https://github.com/deepset-ai/haystack/pull/2545
    • Add run_batch method to all nodes and Pipeline to allow batch querying by @bogdankostic in https://github.com/deepset-ai/haystack/pull/2481
    • Support context matching in pipeline.eval() by @tstadel in https://github.com/deepset-ai/haystack/pull/2482

    Other Changes

    Pipeline

    • Add sort arg to JoinAnswers by @brandenchan in https://github.com/deepset-ai/haystack/pull/2436
    • Update run() and run_batch() params descriptions in API by @agnieszka-m in https://github.com/deepset-ai/haystack/pull/2542
    • [CI refactoring] Avoid ray==1.12.0 on Windows by @ZanSara in https://github.com/deepset-ai/haystack/pull/2562
    • Prevent losing names of utilized components when loaded from config by @tstadel in https://github.com/deepset-ai/haystack/pull/2525
    • Do not copy _component_config in get_components_definitions by @ZanSara in https://github.com/deepset-ai/haystack/pull/2574
    • Add run_batch for standard pipelines by @bogdankostic in https://github.com/deepset-ai/haystack/pull/2595
    • Fix Pipeline.get_config() for forked pipelines by @tstadel in https://github.com/deepset-ai/haystack/pull/2616
    • Remove wrong retriever top_1 metrics from print_eval_report by @tstadel in https://github.com/deepset-ai/haystack/pull/2510
    • Handle transformers pipeline flattening lists of length 1 by @MichelBartels in https://github.com/deepset-ai/haystack/pull/2531
    • Fix pipeline.eval with context matching for Table-QA by @tstadel in https://github.com/deepset-ai/haystack/pull/2597
    • set top_k to 5 in SAS to be consistent by @ClaMnc in https://github.com/deepset-ai/haystack/pull/2550

    DocumentStores

    • Make DeepsetCloudDocumentStore work with non-existing index by @bogdankostic in https://github.com/deepset-ai/haystack/pull/2513
    • [Weaviate] Exit the while loop when we query less documents than available by @masci in https://github.com/deepset-ai/haystack/pull/2537
    • Fix knn params for aws managed opensearch by @tstadel in https://github.com/deepset-ai/haystack/pull/2581
    • Fix number of returned values in get_metadata_values_by_key by @bogdankostic in https://github.com/deepset-ai/haystack/pull/2614

    Retriever

    • Simplify loading of EmbeddingRetriever by @bogdankostic in https://github.com/deepset-ai/haystack/pull/2619
    • Add training checkpoint in retriever trainer by @dimitrisna in https://github.com/deepset-ai/haystack/pull/2543
    • Include meta data when computing embeddings in EmbeddingRetriever by @MichelBartels in https://github.com/deepset-ai/haystack/pull/2559

    Documentation

    • fix small typo in Document doc string by @galtay in https://github.com/deepset-ai/haystack/pull/2520
    • rearrange contributing guidelines by @masci in https://github.com/deepset-ai/haystack/pull/2515
    • Documenting output score of JoinDocuments when using concatenation by @MichelBartels in https://github.com/deepset-ai/haystack/pull/2561
    • Minor lg updates to doc strings by @agnieszka-m in https://github.com/deepset-ai/haystack/pull/2585
    • Adjust pydoc markdown config so methods shown with classes by @brandenchan in https://github.com/deepset-ai/haystack/pull/2511
    • Update Ray pipeline docs with validation info by @agnieszka-m in https://github.com/deepset-ai/haystack/pull/2590

    Other Changes

    • Upgrade transformers version to 4.18.0 by @bogdankostic in https://github.com/deepset-ai/haystack/pull/2514
    • Upgrade torch version to 1.11 by @bogdankostic in https://github.com/deepset-ai/haystack/pull/2538
    • Fix tutorials 4, 7 and 8 by @bogdankostic in https://github.com/deepset-ai/haystack/pull/2526
    • Tutorial1: convert_files_to_dicts --> convert_files_to_docs by @ZanSara in https://github.com/deepset-ai/haystack/pull/2546
    • Fix docker image tag with semantic version for releases by @askainet in https://github.com/deepset-ai/haystack/pull/2548
    • added launch_tika method by @anakin87 in https://github.com/deepset-ai/haystack/pull/2567
    • Remove encoding option from PDFToTextOCRConverter by @julian-risch in https://github.com/deepset-ai/haystack/pull/2553
    • Fix StaleElementReferenceException in Crawler by @bogdankostic in https://github.com/deepset-ai/haystack/pull/2591

    New Contributors

    • @galtay made their first contribution in https://github.com/deepset-ai/haystack/pull/2520
    • @masci made their first contribution in https://github.com/deepset-ai/haystack/pull/2515
    • @ClaMnc made their first contribution in https://github.com/deepset-ai/haystack/pull/2550
    • @anakin87 made their first contribution in https://github.com/deepset-ai/haystack/pull/2567
    • @dimitrisna made their first contribution in https://github.com/deepset-ai/haystack/pull/2543

    ❤️ Big thanks to all contributors and the whole community!

  • v1.4.0(May 5, 2022)

    ⭐ Highlights

    Logging Evaluation Results to MLflow

    Logging and comparing the evaluation results of multiple different pipeline configurations is much easier now thanks to the newly implemented MLflowTrackingHead. With our public MLflow instance you can log evaluation metrics and metadata about the pipeline, the evaluation set, and the corpus. Here is an example log file. If you have your own MLflow instance, you can even store the pipeline YAML file and the evaluation set as artifacts. In Haystack, all you need is the execute_eval_run() method:

    eval_result = Pipeline.execute_eval_run(
        index_pipeline=index_pipeline,
        query_pipeline=query_pipeline,
        evaluation_set_labels=labels,
        corpus_file_paths=file_paths,
        corpus_file_metas=file_metas,
        experiment_tracking_tool="mlflow",
        experiment_tracking_uri="http://localhost:5000",
        experiment_name="my-query-pipeline-experiment",
        experiment_run_name="run_1",
        pipeline_meta={"name": "my-pipeline-1"},
        evaluation_set_meta={"name": "my-evalset"},
        corpus_meta={"name": "my-corpus"},
        add_isolated_node_eval=True,
        reuse_index=False
    )
    

    https://github.com/deepset-ai/haystack/pull/2337

    Filtering Answers by Confidence in FARMReader

    The FARMReader now has a confidence_threshold parameter to filter out predictions whose confidence is below the threshold. The threshold is disabled by default but can be set to a value between 0 and 1 when initializing the FARMReader:

    from haystack.nodes import FARMReader
    model = "deepset/roberta-base-squad2"
    reader = FARMReader(model, confidence_threshold=0.5)
    

    https://github.com/deepset-ai/haystack/pull/2376
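
    The effect of the threshold can be pictured with a small standalone sketch. The function name and the dictionary-based answers are hypothetical, and keeping answers whose score equals the threshold is an assumption, since the text only says that predictions below the threshold are filtered out.

```python
from typing import Dict, List, Optional


def filter_by_confidence(
    answers: List[Dict[str, object]],
    confidence_threshold: Optional[float] = None,
) -> List[Dict[str, object]]:
    """Drop answers scoring below the threshold; None means filtering is disabled."""
    if confidence_threshold is None:
        return list(answers)
    return [answer for answer in answers if answer["score"] >= confidence_threshold]


# Illustrative predictions with confidence scores between 0 and 1
answers = [
    {"answer": "Ned Stark", "score": 0.92},
    {"answer": "Robert Baratheon", "score": 0.31},
]
print(filter_by_confidence(answers, confidence_threshold=0.5))
```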

    Deprecating Milvus1DocumentStore & Renaming ElasticsearchRetriever

    The Milvus1DocumentStore is deprecated in favor of the newer Milvus2DocumentStore. Besides big architectural changes that impact performance and reliability, Milvus version 2.0 supports filtering by scalar data types. For Haystack users this means you can now run a query using vector similarity and filter by metadata at the same time! See the Milvus documentation for more details if you need to migrate from Milvus1DocumentStore to Milvus2DocumentStore. https://github.com/deepset-ai/haystack/pull/2495
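
    Conceptually, combining vector similarity with a metadata filter means restricting the candidates by their scalar fields and ranking the remainder by similarity. The brute-force sketch below shows that idea in plain Python. The data layout and function names are made up for illustration, and Milvus performs this inside the database rather than in application code.

```python
from typing import Any, Dict, List


def dot(a: List[float], b: List[float]) -> float:
    """Dot-product similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def filtered_search(
    query_embedding: List[float],
    documents: List[Dict[str, Any]],
    filters: Dict[str, Any],
    top_k: int = 2,
) -> List[Dict[str, Any]]:
    """Keep documents whose metadata matches all filters, then rank them by similarity."""
    candidates = [
        doc for doc in documents
        if all(doc["meta"].get(key) == value for key, value in filters.items())
    ]
    candidates.sort(key=lambda doc: dot(query_embedding, doc["embedding"]), reverse=True)
    return candidates[:top_k]


# Illustrative documents with tiny two-dimensional embeddings
documents = [
    {"id": "a", "embedding": [1.0, 0.0], "meta": {"year": 2021}},
    {"id": "b", "embedding": [0.9, 0.1], "meta": {"year": 2022}},
    {"id": "c", "embedding": [0.2, 0.8], "meta": {"year": 2022}},
]
results = filtered_search([1.0, 0.0], documents, filters={"year": 2022})
print([doc["id"] for doc in results])
```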

    The ElasticsearchRetriever node works not only with the ElasticsearchDocumentStore but also with the OpenSearchDocumentStore, so it is only logical to rename it. It is now called BM25Retriever after the underlying BM25 ranking function. For the same reason, ElasticsearchFilterOnlyRetriever is now called FilterRetriever. Both the deprecated names and the new names still work, but we will drop support for the deprecated names in a future release. An overview of the different DocumentStores in Haystack can be found here. https://github.com/deepset-ai/haystack/pull/2423 https://github.com/deepset-ai/haystack/pull/2461
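
    For readers unfamiliar with the BM25 ranking function that gives the retriever its new name, here is a compact sketch of the standard Okapi BM25 formula with a Lucene-style IDF term. Whitespace tokenization and the parameter values k1=1.2 and b=0.75 are common defaults assumed for this example. In Haystack the scoring is done by Elasticsearch or OpenSearch, not by code like this.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, corpus: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every document in a non-empty corpus against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            # Number of documents that contain the term at least once
            n_containing = sum(1 for other in tokenized if term in other)
            idf = math.log((n_docs - n_containing + 0.5) / (n_containing + 0.5) + 1)
            tf = term_freq[term]
            # Term frequency saturates via k1; b controls document length normalization.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores


corpus = ["arya stark is the daughter of ned stark", "jon snow knows nothing"]
print(bm25_scores("arya stark", corpus))
```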

    Fixing Evaluation Discrepancies

    The evaluation of pipeline nodes with pipeline.eval(add_isolated_node_eval=True) and alternatively with retriever.eval() and reader.eval() gave slightly different results due to a bug in handling no_answers. This bug is fixed now and all different ways to run the evaluation give the same results. https://github.com/deepset-ai/haystack/pull/2381

    ⚠️ Breaking Changes

    • Change return types of indexing pipeline nodes by @bogdankostic in https://github.com/deepset-ai/haystack/pull/2342
    • Upgrade weaviate-client to 3.3.3 and fix get_all_documents by @ZanSara in https://github.com/deepset-ai/haystack/pull/1895
    • Align TransformersReader defaults with FARMReader by @julian-risch in https://github.com/deepset-ai/haystack/pull/2490
    • Change default encoding for PDFToTextConverter from Latin 1 to UTF-8 by @ZanSara in https://github.com/deepset-ai/haystack/pull/2420
    • Validate YAML files without loading the nodes by @ZanSara in https://github.com/deepset-ai/haystack/pull/2438

    Other Changes

    Pipeline

    • Add tests for missing __init__ and super().__init__() in custom nodes by @ZanSara in https://github.com/deepset-ai/haystack/pull/2350
    • Forbid usage of *args and **kwargs in any node's __init__ by @ZanSara in https://github.com/deepset-ai/haystack/pull/2362
    • Change YAML version exception into a warning by @ZanSara in https://github.com/deepset-ai/haystack/pull/2385
    • Make sure that debug=True and params={'debug': True} behaves the same way by @ZanSara in https://github.com/deepset-ai/haystack/pull/2442
    • Add support for positional args in pipeline.get_config() by @tstadel in https://github.com/deepset-ai/haystack/pull/2478
    • enforce same index values before and after saving/loading eval dataframes by @tstadel in https://github.com/deepset-ai/haystack/pull/2398

    DocumentStores

    • Fix sparse retrieval with filters returns results without any text-match by @tstadel in https://github.com/deepset-ai/haystack/pull/2359
    • EvaluationSetClient for deepset cloud to fetch evaluation sets and la… by @FHardow in https://github.com/deepset-ai/haystack/pull/2345
    • Update launch script for Milvus from 1.x to 2.x by @ZanSara in https://github.com/deepset-ai/haystack/pull/2378
    • Use ElasticsearchDocumentStore.get_all_documents in ElasticsearchFilterOnlyRetriever.retrieve by @adri1wald in https://github.com/deepset-ai/haystack/pull/2151
    • Fix and use delete_index instead of delete_documents in tests by @tstadel in https://github.com/deepset-ai/haystack/pull/2453
    • Update docs of DeepsetCloudDocumentStore by @tholor in https://github.com/deepset-ai/haystack/pull/2460
    • Add support for aliases in elasticsearch document store by @ZeJ0hn in https://github.com/deepset-ai/haystack/pull/2448
    • fix dot_product metric by @jamescalam in https://github.com/deepset-ai/haystack/pull/2494
    • Deprecate Milvus1DocumentStore by @bogdankostic in https://github.com/deepset-ai/haystack/pull/2495
    • Fix OpenSearchDocumentStore's __init__ by @ZanSara in https://github.com/deepset-ai/haystack/pull/2498

    Retriever

    • Rename dataset to evaluation_set when logging to mlflow by @tstadel in https://github.com/deepset-ai/haystack/pull/2457
    • Linearize tables in EmbeddingRetriever by @MichelBartels in https://github.com/deepset-ai/haystack/pull/2462
    • Print warning in EmbeddingRetriever if sentence-transformers model used with different model format by @mpangrazzi in https://github.com/deepset-ai/haystack/pull/2377
    • Add flag to disable scaling scores to probabilities by @tstadel in https://github.com/deepset-ai/haystack/pull/2454
    • changing the name of the retrievers from es_retriever to retriever by @TuanaCelik in https://github.com/deepset-ai/haystack/pull/2487
    • Replace dpr with embeddingretriever tut14 by @mkkuemmel in https://github.com/deepset-ai/haystack/pull/2336
    • Support conjunctive queries in sparse retrieval by @tstadel in https://github.com/deepset-ai/haystack/pull/2361
    • Fix: Auth token not passed for EmbeddingRetriever by @mathislucka in https://github.com/deepset-ai/haystack/pull/2404
    • Pass use_auth_token to sentence transformers EmbeddingRetriever by @MichelBartels in https://github.com/deepset-ai/haystack/pull/2284

    Reader

    • Fix TableReader for tables without rows by @bogdankostic in https://github.com/deepset-ai/haystack/pull/2369
    • Match answer sorting in QuestionAnsweringHead with FARMReader by @tstadel in https://github.com/deepset-ai/haystack/pull/2414
    • Fix reader.eval() and reader.eval_on_file() output by @tstadel in https://github.com/deepset-ai/haystack/pull/2476
    • Raise error if torch-scatter is not installed or wrong version is installed by @MichelBartels in https://github.com/deepset-ai/haystack/pull/2486

    Documentation

    • Fix link to squad_to_dpr.py in DPR train tutorial by @raphaelmerx in https://github.com/deepset-ai/haystack/pull/2334
    • Add evaluation and document conversion to tutorial 15 by @MichelBartels in https://github.com/deepset-ai/haystack/pull/2325
    • Replace TableTextRetriever with EmbeddingRetriever in Tutorial 15 by @MichelBartels in https://github.com/deepset-ai/haystack/pull/2479
    • Fix RouteDocuments documentation by @MichelBartels in https://github.com/deepset-ai/haystack/pull/2380

    Other Changes

    • extract extension based on file's content by @GiannisKitsos in https://github.com/deepset-ai/haystack/pull/2330
    • Reduce num REST API workers to accommodate smaller machines by @brandenchan in https://github.com/deepset-ai/haystack/pull/2400
    • Add devices alongside use_gpu in FARMReader by @ZanSara in https://github.com/deepset-ai/haystack/pull/2294
    • Delete files in docs/_src by @brandenchan in https://github.com/deepset-ai/haystack/pull/2322
    • Add apt update in Linux CI by @ZanSara in https://github.com/deepset-ai/haystack/pull/2415
    • Exclude beir from Windows install by @ZanSara in https://github.com/deepset-ai/haystack/pull/2419
    • Added macos version of xpdf in tutorial 8 by @seduerr91 in https://github.com/deepset-ai/haystack/pull/2424
    • Make python-magic fully optional by @ZanSara in https://github.com/deepset-ai/haystack/pull/2412
    • Upgrade xpdf to 4.0.4 by @tholor in https://github.com/deepset-ai/haystack/pull/2443
    • Update xpdfreader package installation by @AI-Ahmed in https://github.com/deepset-ai/haystack/pull/2491

    New Contributors

    • @raphaelmerx made their first contribution in https://github.com/deepset-ai/haystack/pull/2334
    • @FHardow made their first contribution in https://github.com/deepset-ai/haystack/pull/2345
    • @GiannisKitsos made their first contribution in https://github.com/deepset-ai/haystack/pull/2330
    • @mpangrazzi made their first contribution in https://github.com/deepset-ai/haystack/pull/2377
    • @seduerr91 made their first contribution in https://github.com/deepset-ai/haystack/pull/2424
    • @ZeJ0hn made their first contribution in https://github.com/deepset-ai/haystack/pull/2448
    • @AI-Ahmed made their first contribution in https://github.com/deepset-ai/haystack/pull/2491

    ❤️ Big thanks to all contributors and the whole community!

  • v1.3.0(Mar 23, 2022)

    ⭐ Highlights

    Pipeline YAML Syntax Validation

    The syntax of pipeline configurations as defined in YAML files can now be validated. If the validation fails, erroneous components/parameters are identified to make it simple to fix them. Here is a code snippet to manually validate a file:

    from pathlib import Path
    from haystack.pipelines.config import validate_yaml
    validate_yaml(Path("rest_api/pipeline/pipelines.haystack-pipeline.yml"))
    

    Your IDE can also take care of the validation when you edit a pipeline YAML file. The suffix *.haystack-pipeline.yml tells your IDE that this YAML contains a Haystack pipeline configuration and enables some checks and autocompletion features if the IDE is configured that way (YAML plugin for VSCode, Configuration Guide for PyCharm). The schema used for validation can be found in SchemaStore, which points to the schema files for the different Haystack versions. Note that updating the Haystack version might sometimes require small changes to the pipeline YAML files. You can set version: 'unstable' in the pipeline YAML to circumvent the validation, or set it to the latest Haystack version if the components and parameters that you use are compatible with the latest version. https://github.com/deepset-ai/haystack/pull/2226
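
    To give a feel for the kind of problem such a validation can surface, here is a deliberately simplified structural check written against an already parsed configuration. It assumes a layout with a components list and a pipelines list whose nodes refer to components by name. The function name and the error messages are invented for this sketch, and it does not use the JSON schema that the real validation relies on.

```python
from typing import Any, Dict, List


def find_config_errors(config: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in a parsed pipeline configuration."""
    errors = []
    components = config.get("components", [])
    component_names = {component.get("name") for component in components}
    for component in components:
        if "type" not in component:
            errors.append(f"component '{component.get('name')}' has no type")
    for pipeline in config.get("pipelines", []):
        for node in pipeline.get("nodes", []):
            if node.get("name") not in component_names:
                errors.append(
                    f"pipeline '{pipeline.get('name')}': node '{node.get('name')}' is not a declared component"
                )
    return errors


# Illustrative configuration with a misspelled node name
config = {
    "version": "unstable",
    "components": [{"name": "Retriever", "type": "ElasticsearchRetriever"}],
    "pipelines": [{"name": "query", "nodes": [{"name": "Retreiver", "inputs": ["Query"]}]}],
}
print(find_config_errors(config))
```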

    Pinecone DocumentStore

    We added another DocumentStore to Haystack: PineconeDocumentStore! 🎉 Pinecone is a fully managed service for very large scale dense retrieval. To this end, embeddings and metadata are stored in a hosted Pinecone vector database while the document content is stored in a local SQL database. This separation simplifies infrastructure setup and maintenance. In order to use this new document store, all you need is an API key, which you can obtain by creating an account on the Pinecone website. https://github.com/deepset-ai/haystack/pull/2254

    import os
    from haystack.document_stores import PineconeDocumentStore
    document_store = PineconeDocumentStore(api_key=os.environ["PINECONE_API_KEY"])
    

    BEIR Integration

    Fresh from the 🍻 cellar, Haystack now has an integration with our favorite BEnchmarking Information Retrieval tool BEIR. It contains preprocessed datasets for zero-shot evaluation of retrieval models in 17 different languages, which you can use to benchmark your pipelines. For example, a DocumentSearchPipeline can now be evaluated by calling Pipeline.eval_beir() after having installed Haystack with the BEIR dependency via pip install farm-haystack[beir]. Cheers! https://github.com/deepset-ai/haystack/pull/2333

    from haystack.pipelines import DocumentSearchPipeline, Pipeline
    from haystack.nodes import TextConverter, ElasticsearchRetriever
    from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore
    
    text_converter = TextConverter()
    document_store = ElasticsearchDocumentStore(search_fields=["content", "name"], index="scifact_beir")
    retriever = ElasticsearchRetriever(document_store=document_store, top_k=1000)
    
    index_pipeline = Pipeline()
    index_pipeline.add_node(text_converter, name="TextConverter", inputs=["File"])
    index_pipeline.add_node(document_store, name="DocumentStore", inputs=["TextConverter"])
    
    query_pipeline = DocumentSearchPipeline(retriever=retriever)
    
    ndcg, _map, recall, precision = Pipeline.eval_beir(
        index_pipeline=index_pipeline, query_pipeline=query_pipeline, dataset="scifact"
    )
    

    Breaking Changes

    • Make Milvus2DocumentStore compatible with pymilvus>=2.0.0 by @MichelBartels in https://github.com/deepset-ai/haystack/pull/2126
    • Set provider parameter when instantiating onnxruntime.InferenceSession and make device a torch.device in internal methods by @cjb06776 in https://github.com/deepset-ai/haystack/pull/1976

    Pipeline

    • Generate haystack-pipeline-1.2.0.schema.json by @ZanSara in https://github.com/deepset-ai/haystack/pull/2239
    • Add RouteDocuments and JoinAnswers nodes by @bogdankostic in https://github.com/deepset-ai/haystack/pull/2256
    • Refactor Pipeline peripherals by @tstadel in https://github.com/deepset-ai/haystack/pull/2253
    • Allow to deploy and undeploy Pipelines on Deepset Cloud by @tstadel in https://github.com/deepset-ai/haystack/pull/2285
    • Reintroduce debug as a valid global key for Pipeline's params by @ZanSara in https://github.com/deepset-ai/haystack/pull/2298
    • Replace dpr with embeddingretriever tut11 by @mkkuemmel in https://github.com/deepset-ai/haystack/pull/2287
    • Package JSON schemas properly in Haystack by @ZanSara in https://github.com/deepset-ai/haystack/pull/2316
    • Fix dependency graph for indexing pipelines during codegen by @tstadel in https://github.com/deepset-ai/haystack/pull/2311
    • Fix YAML pipeline paths in docker-compose.yml by @ZanSara in https://github.com/deepset-ai/haystack/pull/2335
    • Improve error message for nodes failing validation by @ZanSara in https://github.com/deepset-ai/haystack/pull/2313
    • Fix Pipeline.print_eval_report by @tstadel in https://github.com/deepset-ai/haystack/pull/2271
    • save_to_deepset_cloud: automatically convert document stores by @tstadel in https://github.com/deepset-ai/haystack/pull/2283
    • Sas gpu additions by @thimo72 in https://github.com/deepset-ai/haystack/pull/2308

    Models

    • Update LFQA with the latest LFQA seq2seq and retriever models by @vblagoje in https://github.com/deepset-ai/haystack/pull/2210

    DocumentStores

    • Bulk insert in sql document stores by @OmniscienceAcademy in https://github.com/deepset-ai/haystack/pull/2264
    • 'os' wrapper to function for brownfield support by @TuanaCelik in https://github.com/deepset-ai/haystack/pull/2282
    • Using default OpenSearch parameters by @TuanaCelik in https://github.com/deepset-ai/haystack/pull/2327
    • Fix docker launch scripts by @tstadel in https://github.com/deepset-ai/haystack/pull/2341
    • Fix normalize_embedding using numba by @tstadel in https://github.com/deepset-ai/haystack/pull/2347

    Documentation

    • Update other.yml with new node names by @agnieszka-m in https://github.com/deepset-ai/haystack/pull/2286
    • Bring back init defs to api in v1.2 and latest by @brandenchan in https://github.com/deepset-ai/haystack/pull/2296
    • Remove unneeded files in docs directory by @brandenchan in https://github.com/deepset-ai/haystack/pull/2237
    • change old text to content argument for translator examples by @ju-gu in https://github.com/deepset-ai/haystack/pull/2240

    Tutorials

    • Fix tutorial dataset paths by @julian-risch in https://github.com/deepset-ai/haystack/pull/2340
    • Polish Evaluation Tutorial by @brandenchan in https://github.com/deepset-ai/haystack/pull/2212
    • Comment out Milvus cell on Tutorial6 by @ZanSara in https://github.com/deepset-ai/haystack/pull/2243
    • Change document attribute from text to content by @julian-risch in https://github.com/deepset-ai/haystack/pull/2352
    • Replace dpr with embeddingretriever tut5 by @mkkuemmel in https://github.com/deepset-ai/haystack/pull/2274
    • ipynb: inserted links to graph images by @mkkuemmel in https://github.com/deepset-ai/haystack/pull/2309

    Other Changes

    • Implement Context Matching by @tstadel in https://github.com/deepset-ai/haystack/pull/2293
    • Fix surrounding context extraction in ParsrConverter by @bogdankostic in https://github.com/deepset-ai/haystack/pull/2162
    • Fix table extraction in ParsrConverter by @bogdankostic in https://github.com/deepset-ai/haystack/pull/2262
    • Api pages by @brandenchan in https://github.com/deepset-ai/haystack/pull/2248
    • fix pip backtracking issue by @tstadel in https://github.com/deepset-ai/haystack/pull/2281
    • Update reader/base.py to fix UnboundLocalError in #2273 by @thimo72 in https://github.com/deepset-ai/haystack/pull/2275
    • Remove substrings basic implementation by @dmigo in https://github.com/deepset-ai/haystack/pull/2152
    • adding quotes for zsh shell issue by @TuanaCelik in https://github.com/deepset-ai/haystack/pull/2289
    • Prevent Preprocessor from changing existing documents by @tstadel in https://github.com/deepset-ai/haystack/pull/2297
    • Fix install because of missing jsonschema dependency by @tstadel in https://github.com/deepset-ai/haystack/pull/2315
    • Add basic telemetry features by @julian-risch in https://github.com/deepset-ai/haystack/pull/2314
    • Let SquadData support data from Annotation Tool by @brandenchan in https://github.com/deepset-ai/haystack/pull/2329

    New Contributors

    • @thimo72 made their first contribution in https://github.com/deepset-ai/haystack/pull/2275
    • @agnieszka-m made their first contribution in https://github.com/deepset-ai/haystack/pull/2286
    • @TuanaCelik made their first contribution in https://github.com/deepset-ai/haystack/pull/2289
    • @OmniscienceAcademy made their first contribution in https://github.com/deepset-ai/haystack/pull/2264
    • @jamescalam made their first contribution in https://github.com/deepset-ai/haystack/pull/2254
    • @cjb06776 made their first contribution in https://github.com/deepset-ai/haystack/pull/1976

    ❤️ Big thanks to all contributors and the whole community!

    Source code(tar.gz)
    Source code(zip)
  • v1.2.0(Feb 23, 2022)

    ⭐ Highlights

    Brownfield Support of Existing Elasticsearch Indices

    You have an existing Elasticsearch index from other projects and now want to try out Haystack? The newly added method es_index_to_document_store provides brownfield support of existing Elasticsearch indices by converting each of the records in the provided index to Haystack Document objects and writing them to the specified DocumentStore.

    document_store = es_index_to_document_store(
        document_store=InMemoryDocumentStore(), #or any other Haystack DocumentStore
        original_index_name="existing_index",
        original_content_field="content",
        original_name_field="name",
        included_metadata_fields=["date_field"],
        index="new_index",
    )
    

    It can even be used on a regular basis in order to add new records of the Elasticsearch index to the DocumentStore! https://github.com/deepset-ai/haystack/pull/2229
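
    To make the field mapping concrete, here is a minimal sketch of how a single index record could be turned into a document-like dictionary. It is not Haystack's implementation: the helper record_to_document and the sample record are hypothetical, and only the parameter roles (content field, name field, included metadata fields) are taken from the call above.

```python
from typing import Any, Dict, List, Optional


def record_to_document(
    record: Dict[str, Any],
    content_field: str,
    name_field: Optional[str] = None,
    included_metadata_fields: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Map one raw index record to a document-like dict (hypothetical helper)."""
    meta: Dict[str, Any] = {}
    # The name field, if present, is stored under a fixed "name" key.
    if name_field is not None and name_field in record:
        meta["name"] = record[name_field]
    # Only explicitly listed fields are carried over as metadata.
    for field in included_metadata_fields or []:
        if field in record:
            meta[field] = record[field]
    return {"content": record[content_field], "meta": meta}


record = {
    "content": "Some text",
    "name": "report.txt",
    "date_field": "2022-02-23",
    "internal": 1,  # not listed below, so it is dropped
}
doc = record_to_document(record, "content", "name", ["date_field"])
print(doc)
```

    Fields that are neither the content, the name, nor explicitly included are left out in this sketch, which mirrors the intent of included_metadata_fields.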

    Tapas Reader With Scores

    The new model class TapasForScoredQA introduced in https://github.com/deepset-ai/haystack/pull/1997 supports Tapas Reader models that return confidence scores. When you load a Tapas Reader model, Haystack automatically infers whether the model supports confidence scores and chooses the correct model class under the hood. The returned answers are sorted first by a general table score and then by answer span scores. To try it out, just use one of the new TableReader models:

    reader = TableReader(model_name_or_path="deepset/tapas-large-nq-reader", max_seq_len=512) #or
    reader = TableReader(model_name_or_path="deepset/tapas-large-nq-hn-reader", max_seq_len=512)
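
    The two-level ordering described above (table score first, answer span score second) can be illustrated with plain Python. The ScoredAnswer type, its field names, and the scores are assumptions made for this sketch and are not Haystack's Answer class.

```python
from typing import List, NamedTuple


class ScoredAnswer(NamedTuple):
    """Hypothetical container; Haystack's real answer objects look different."""
    answer: str
    table_score: float
    span_score: float


def rank_answers(answers: List[ScoredAnswer]) -> List[ScoredAnswer]:
    # Sort by table score first; ties are broken by the span score.
    return sorted(answers, key=lambda a: (a.table_score, a.span_score), reverse=True)


answers = [
    ScoredAnswer("42", table_score=0.60, span_score=0.90),
    ScoredAnswer("Berlin", table_score=0.95, span_score=0.40),
    ScoredAnswer("Paris", table_score=0.95, span_score=0.70),
]
ranked = rank_answers(answers)
print([a.answer for a in ranked])
```

    Note that an answer with a high span score can still rank last if its table scored poorly.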
    

    Extended Meta Data Filtering

    We extended the filter capabilities of all(*) document stores to support more complex filter expressions than before. Besides simple selections on multiple fields, you can now use comparison expressions and connect them with boolean operators. If you have used MongoDB, the new syntax should look familiar. Filters are defined as nested dictionaries. A dictionary key can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte"), or a metadata field name.

    Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value.

    If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

    Because of these defaults, there are no breaking changes and you can keep using your existing filter expressions.

    Example:

    filters = {
        "$and": {
            "type": {"$eq": "article"},
            "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
            "rating": {"$gte": 3},
            "$or": {
                "genre": {"$in": ["economy", "politics"]},
                "publisher": {"$eq": "nytimes"}
            }
        }
    }
    # or simpler using default operators
    filters = {
        "type": "article",
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": ["economy", "politics"],
            "publisher": "nytimes"
        }
    }
    

    (*) FAISSDocumentStore and MilvusDocumentStore currently do not support filters during search.
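
    To illustrate how the default operators resolve, here is a small, self-contained evaluator for such nested filter dictionaries. It is a sketch under stated assumptions, not the code the document stores run: the function names are hypothetical, a missing metadata field is assumed never to match, and "$not" is assumed to negate the conjunction of its conditions.

```python
from typing import Any, Dict

LOGICAL_OPERATORS = {"$and", "$or", "$not"}


def compare(operator: str, actual: Any, expected: Any) -> bool:
    """Apply one comparison operator to a metadata value."""
    if actual is None:  # assumption: a missing field never matches
        return False
    if operator == "$eq":
        return actual == expected
    if operator == "$in":
        return actual in expected
    if operator == "$gt":
        return actual > expected
    if operator == "$gte":
        return actual >= expected
    if operator == "$lt":
        return actual < expected
    if operator == "$lte":
        return actual <= expected
    raise ValueError(f"Unknown comparison operator: {operator}")


def matches(filters: Dict[str, Any], meta: Dict[str, Any], operator: str = "$and") -> bool:
    """Evaluate a nested filter dict against one document's metadata."""
    results = []
    for key, value in filters.items():
        if key in LOGICAL_OPERATORS:
            results.append(matches(value, meta, operator=key))
            continue
        # The key is a metadata field name: fill in the default comparison
        # operator ("$in" for lists, "$eq" otherwise) when none is given.
        if not isinstance(value, dict):
            value = {"$in" if isinstance(value, list) else "$eq": value}
        results.append(all(compare(op, meta.get(key), exp) for op, exp in value.items()))
    if operator == "$or":
        return any(results)
    if operator == "$not":  # assumption: negated conjunction
        return not all(results)
    return all(results)  # "$and" is the default


explicit = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"},
        },
    }
}
simple = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {"genre": ["economy", "politics"], "publisher": "nytimes"},
}
meta = {"type": "article", "date": "2018-05-01", "rating": 4,
        "genre": "economy", "publisher": "guardian"}
low_rated = {**meta, "rating": 2}

print(matches(explicit, meta), matches(simple, meta), matches(simple, low_rated))
```

    Both spellings of the filter give the same result for the same metadata, which is the point of the defaults. The date comparison works here only because ISO-formatted date strings sort lexicographically.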

    Code Style and Linting

    In addition to mypy, which we already used for static type checking, we now use pylint for linting, and the Haystack code base now complies with Black formatting standards. As a result, the code is formatted consistently and is easier to read. If you would like to contribute to Haystack, you don't need to worry about that, though: our CI will automatically format your code changes correctly. Our contributor guidelines give more details in case you would like to run the checks locally. https://github.com/deepset-ai/haystack/pull/2115 https://github.com/deepset-ai/haystack/pull/2130

    Installation with fewer dependencies

    Installing Haystack has become easier and faster thanks to optional dependencies. From now on, there is no need to install all dependencies if you don't need them. For example, pip3 install farm-haystack will install the latest release together with only a small subset of packages required for basic Pipelines with an ElasticsearchDocumentStore. As another example, if you are experimenting with FAISSDocumentStore in a Colab notebook, you can install Haystack from the master branch together with the FAISS dependency by running: !pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,faiss]. The installation guide reflects these updates and the full list of subsets of dependencies can be found here. Keep in mind, though, that this system works best with pip versions above 22. https://github.com/deepset-ai/haystack/pull/1994

    ⚠️ Known Issues

    Installing Haystack with all dependencies results in heavy pip backtracking that might never finish. This is due to a dependency conflict that was introduced by a new release of one of our sub-dependencies. To work around this problem, install Haystack like this (the quotes keep shells such as zsh from interpreting the square brackets):

    pip install "farm-haystack[all]" "azure-core<1.23"
    

    This might also be needed for other non-default dependencies (e.g. "farm-haystack[dev]" "azure-core<1.23"). See https://github.com/deepset-ai/haystack/issues/2280 for more information.

    ⚠️ Breaking Changes

    • Improve dependency management by @ZanSara in https://github.com/deepset-ai/haystack/pull/1994
    • Make ui and rest proper packages by @ZanSara in https://github.com/deepset-ai/haystack/pull/2098
    • Add aiorwlock to 'ray' extra & fix maximum version for some dependencies by @ZanSara in https://github.com/deepset-ai/haystack/pull/2140

    🤓 Detailed Changes

    Pipeline

    • Add top_k_join parameter to JoinDocuments.run by @adri1wald in https://github.com/deepset-ai/haystack/pull/2065
    • ✨ Add JSON Schema autogeneration for Pipeline YAML files by @tiangolo in https://github.com/deepset-ai/haystack/pull/2020
    • Make FileTypeClassifier more flexible by @ZanSara in https://github.com/deepset-ai/haystack/pull/2101
    • Query response without answers by @ZanSara in https://github.com/deepset-ai/haystack/pull/2161
    • Generate JSON schema index for Schemastore by @ZanSara in https://github.com/deepset-ai/haystack/pull/2225
    • Fix Pipeline.components by @tstadel in https://github.com/deepset-ai/haystack/pull/2215
    • Join node should allow reciprocal rank fusion as additional merging method by @mathislucka in https://github.com/deepset-ai/haystack/pull/2133
    • Apply filter in eval only if no gold docs are given as input by @julian-risch in https://github.com/deepset-ai/haystack/pull/2154
    • pipeline.save_to_deepset_cloud() by @tstadel in https://github.com/deepset-ai/haystack/pull/2145
    • Fix typo in save_to_deepset_cloud() by @tstadel in https://github.com/deepset-ai/haystack/pull/2189
    • Generate code from pipeline (pipeline.to_code()) by @tstadel in https://github.com/deepset-ai/haystack/pull/2214
    • Allow different filters per query in pipeline evaluation by @julian-risch in https://github.com/deepset-ai/haystack/pull/2068
    • List all pipeline(_configs) on Deepset Cloud by @tstadel in https://github.com/deepset-ai/haystack/pull/2102
    • Evaluating a pipeline consisting only of a reader node by @julian-risch in https://github.com/deepset-ai/haystack/pull/2132
    • DC SDK - load pipeline from deepset cloud by @ArzelaAscoIi in https://github.com/deepset-ai/haystack/pull/2013
    • YAML versioning by @ZanSara in https://github.com/deepset-ai/haystack/pull/2209

    Models

    • Add Tapas reader with scores by @bogdankostic in https://github.com/deepset-ai/haystack/pull/1997
    • Fix finetuning notebook augmentation by @MichelBartels in https://github.com/deepset-ai/haystack/pull/2071
    • Fix Seq2SeqGenerator return type by @tstadel in https://github.com/deepset-ai/haystack/pull/2099
    • Distribute intermediate layer distillation loss calculation over multiple GPUs by @MichelBartels in https://github.com/deepset-ai/haystack/pull/2090
    • Do not apply DataParallel twice by @MichelBartels in https://github.com/deepset-ai/haystack/pull/2095

    DocumentStores

    • Pin Milvus to <2.0.0 by @ZanSara in https://github.com/deepset-ai/haystack/pull/2063
    • fix: get_documents_by_id should return docs for all passed ids by @mathislucka in https://github.com/deepset-ai/haystack/pull/2064
    • Supported Highlighting in Elasticsearch by @SjSnowball in https://github.com/deepset-ai/haystack/pull/1930
    • pass faiss batch_size to sqldocumentstore by @AhmedIdr in https://github.com/deepset-ai/haystack/pull/2061
    • Fixed the Search Field mapping in ElasticSearch DocumentStore by @SjSnowball in https://github.com/deepset-ai/haystack/pull/2080
    • Provide option to recreate es doc store on initialization by @mathislucka in https://github.com/deepset-ai/haystack/pull/2084
    • Fixed performance bug. Using a list where a set is needed. by @baregawi in https://github.com/deepset-ai/haystack/pull/2125
    • Extend metadata filtering support in ElasticsearchDocumentStore by @bogdankostic in https://github.com/deepset-ai/haystack/pull/2108
    • OpenSearchDocumentStore: Extend similarity support by @tstadel in https://github.com/deepset-ai/haystack/pull/2070
    • Speed up query_by_embedding in InMemoryDocumentStore. by @baregawi in https://github.com/deepset-ai/haystack/pull/2091
    • Fix dependency management in Tutorial 6 by @ZanSara in https://github.com/deepset-ai/haystack/pull/2148
    • Enable use of dot_product OpenSearch Script Scoring by @tstadel in https://github.com/deepset-ai/haystack/pull/2168
    • Changed document_store to ElasticsearchDocumentStore by @mkkuemmel in https://github.com/deepset-ai/haystack/pull/2192
    • Support more data types and extended filters in WeaviateDocStore by @bogdankostic in https://github.com/deepset-ai/haystack/pull/2143
    • Adding extended meta data filtering support for InMemoryDocumenStore by @MichelBartels in https://github.com/deepset-ai/haystack/pull/2120
    • Fix ef_search param for hnsw in OpenSearchDocumentStore by @tstadel in https://github.com/deepset-ai/haystack/pull/2227
    • Add Brownfield Support of existing Elasticsearch indices by @bogdankostic in https://github.com/deepset-ai/haystack/pull/2229
    • Introduce readonly DCDocumentStore (without labels support) by @tstadel in https://github.com/deepset-ai/haystack/pull/1991
    • Extend meta data support for SQLDocumentStore by @MichelBartels in https://github.com/deepset-ai/haystack/pull/2199
    • Fix missing embeddings not skipped if filters are used by @MichelBartels in https://github.com/deepset-ai/haystack/pull/2230

    REST API

    • Convert doc embedding from ndarray to list of float for REST API by @julian-risch in https://github.com/deepset-ai/haystack/pull/1901
    • Autogenerate OpenAPI specs file by @ZanSara in https://github.com/deepset-ai/haystack/pull/2047
    • Make openapi.json multiline so the diff is parsable by @ZanSara in https://github.com/deepset-ai/haystack/pull/2163
    • Align REST API and Haystack versions by @ZanSara in https://github.com/deepset-ai/haystack/pull/2164
    • Add DELETE /feedback for testing and make the label's id generate server-side by @ZanSara in https://github.com/deepset-ai/haystack/pull/2159
    • Add type check for meta & add tests by @ZanSara in https://github.com/deepset-ai/haystack/pull/2184
    • Update url in POST /file-upload by @ZanSara in https://github.com/deepset-ai/haystack/pull/2193
    • Versioning openapi.json by @ZanSara in https://github.com/deepset-ai/haystack/pull/2228

    Docker

    • Change docstores_gpu into docstores-gpu in Dockerfile-GPU by @ZanSara in https://github.com/deepset-ai/haystack/pull/2129
    • Remove run_docker_gpu.sh by @ZanSara in https://github.com/deepset-ai/haystack/pull/2003
    • Remove rest extra from Dockerfile-GPU by @ZanSara in https://github.com/deepset-ai/haystack/pull/2122
    • Fix dependency related build issues in Dockerfiles by @ZanSara in https://github.com/deepset-ai/haystack/pull/2135
    • Add docker-compose override file for Traffic Monitoring by @tstadel in https://github.com/deepset-ai/haystack/pull/2224
    • Adding a minimal haystack gpu build by @ArzelaAscoIi in https://github.com/deepset-ai/haystack/pull/2185

    Documentation

    • Remove stray requirements.txt files and update README.md by @ZanSara in https://github.com/deepset-ai/haystack/pull/2075
    • Make the docstring bot work only on master by @ZanSara in https://github.com/deepset-ai/haystack/pull/2078
    • Add who uses Haystack section by @dmigo in https://github.com/deepset-ai/haystack/pull/1975
    • Rename image to fix link in CONTRIBUTING.md by @ZanSara in https://github.com/deepset-ai/haystack/pull/2211
    • Add ADR template for transparent architecture decisions by @tholor in https://github.com/deepset-ai/haystack/pull/2072
    • Update Readme to reflect changes to installation procedure by @brandenchan in https://github.com/deepset-ai/haystack/pull/2157
    • Add REST API and UI installation info to readme by @brandenchan in https://github.com/deepset-ai/haystack/pull/2160
    • Upgrade pydoc-markdown by @ZanSara in https://github.com/deepset-ai/haystack/pull/2117

    CI

    • Introduce pylint & other improvements on the CI by @ZanSara in https://github.com/deepset-ai/haystack/pull/2130
    • Apply black formatting by @ZanSara in https://github.com/deepset-ai/haystack/pull/2115
    • Pylint: solve or silence locally rare warnings by @ZanSara in https://github.com/deepset-ai/haystack/pull/2170
    • Revert "Make the docstring bot work only on master" by @ZanSara in https://github.com/deepset-ai/haystack/pull/2114
    • Fix CI build-cache issue causing code changes to take no effect by @tstadel in https://github.com/deepset-ai/haystack/pull/2082
    • Disable cache on the CI by @ZanSara in https://github.com/deepset-ai/haystack/pull/2083
    • Reintroduce push on master trigger for Linux CI by @ZanSara in https://github.com/deepset-ai/haystack/pull/2127
    • Allow Linux CI to push changes to forks by @ZanSara in https://github.com/deepset-ai/haystack/pull/2182
    • Fix windows ci tests by @tstadel in https://github.com/deepset-ai/haystack/pull/2144
    • Disable autoformat.yml on master by @ZanSara in https://github.com/deepset-ai/haystack/pull/2198
    • Testing actions (@ZanSara) by @hegyibalint in https://github.com/deepset-ai/haystack/pull/2200

    Other Changes

    • Add UnlabeledTextProcessor by @MichelBartels in https://github.com/deepset-ai/haystack/pull/2054
    • fix answer is not subscriptable error by @julian-risch in https://github.com/deepset-ai/haystack/pull/2069
    • Add faiss dependency to tutorial 12 by @julian-risch in https://github.com/deepset-ai/haystack/pull/2109
    • Simplify SQuAD data to df conversion by @mathislucka in https://github.com/deepset-ai/haystack/pull/2124
    • Remove requirements for json schema by @ZanSara in https://github.com/deepset-ai/haystack/pull/2128
    • Move pytest configuration into pyproject.toml by @ZanSara in https://github.com/deepset-ai/haystack/pull/2141
    • Fix MultiLabel creation with aggregate_by_meta by @tstadel in https://github.com/deepset-ai/haystack/pull/2165
    • Add tests on MultiLabel's meta and filter aggregation by @tstadel in https://github.com/deepset-ai/haystack/pull/2169
    • Improve Label and MultiLabel __str__ and __repr__ by @ZanSara in https://github.com/deepset-ai/haystack/pull/2202

    New Contributors

    • @adri1wald made their first contribution in https://github.com/deepset-ai/haystack/pull/2065
    • @tiangolo made their first contribution in https://github.com/deepset-ai/haystack/pull/2020
    • @baregawi made their first contribution in https://github.com/deepset-ai/haystack/pull/2125
    • @mkkuemmel made their first contribution in https://github.com/deepset-ai/haystack/pull/2192
    • @hegyibalint made their first contribution in https://github.com/deepset-ai/haystack/pull/2200

    ❤️ Big thanks to all contributors and the whole community!

    Source code(tar.gz)
    Source code(zip)
  • v1.1.0(Jan 20, 2022)

    ⭐ Highlights

    Model Distillation for Reader Models

    With the new model distillation features, you don't need to choose between accuracy and speed! Now you can compress a large reader model (teacher) into a smaller model (student) while retaining most of the teacher's performance. For example, deepset/tinybert-6l-768d-squad2 is twice as fast as bert-base with an F1 reduction of only 2%.

    To distil your own model, just follow these steps:

    1. Call python augment_squad.py --squad_path <your dataset> --output_path <output> --multiplication_factor 20 where augment_squad.py is our data augmentation script.
    2. Run student.distil_intermediate_layers_from(teacher, data_dir="dataset", train_filename="augmented_dataset.json") where student is a small model and teacher is a highly accurate, larger reader model.
    3. Run student.distil_prediction_layer_from(teacher, data_dir="dataset", train_filename="dataset.json") with the same teacher and student.

    For more information on what kinds of students and teachers you can use and on model distillation in general, just take a look at this guide.

    Integrated vs. Isolated Pipeline Evaluation Modes

    When you evaluate a pipeline, you can now use two different evaluation modes and create an automatic report that shows the results of both. The integrated evaluation (default) shows what result quality users will experience when running the pipeline. The isolated evaluation mode additionally shows what the maximum result quality of a node could be if it received perfect input from the preceding node. This way, you can find out whether the retriever or the reader in an ExtractiveQAPipeline is the bottleneck.

    eval_result = pipeline.eval(labels=eval_labels, add_isolated_node_eval=True)
    pipeline.print_eval_report(eval_result)
    
    ================== Evaluation Report ==================
    =======================================================
                          Query
                            |
                          Retriever
                            |
                            | recall_single_hit:   ...
                            |
                          Reader
                            |
                            | f1 upper  bound:   0.78
                            | f1:   0.65
                            |
                          Output
    

    Because the reader's upper-bound F1 score (0.78) differs considerably from its actual F1 score (0.65) in this report, you would need to improve the predictions of the retriever node to achieve the full performance of this pipeline. Our updated evaluation tutorial lists all the steps to generate an evaluation report with all the metrics you need and their upper bounds for each individual node. The guide explains the two evaluation modes in detail.
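
    The difference between the two modes can be reproduced with a toy example in plain Python. Everything here is hypothetical (the corpus, the deliberately faulty retriever, the trivial reader, and the exact-match metric); it only sketches the idea that the isolated mode feeds the reader the gold document instead of the retriever's output.

```python
from typing import Callable, Dict, List

CORPUS = {
    "d1": "Ned Stark is the father of Arya Stark",
    "d2": "Berlin is the capital of Germany",
    "d3": "Paris is the capital of France",
}

# Toy retriever with one deliberate mistake: the France query gets the Germany document.
RETRIEVED = {
    "Who is the father of Arya Stark?": ["d1"],
    "What is the capital of Germany?": ["d2"],
    "What is the capital of France?": ["d2"],
}


def retrieve(query: str) -> List[str]:
    return RETRIEVED[query]


def read(query: str, doc_ids: List[str]) -> str:
    """Toy reader: answers with the first word of the first document."""
    return CORPUS[doc_ids[0]].split()[0]


def evaluate(
    samples: List[Dict[str, str]],
    retrieve: Callable[[str], List[str]],
    read: Callable[[str, List[str]], str],
) -> Dict[str, float]:
    integrated = isolated = 0
    for sample in samples:
        # Integrated: the reader sees whatever the retriever returned.
        integrated += read(sample["query"], retrieve(sample["query"])) == sample["answer"]
        # Isolated: the reader sees the gold document (perfect input).
        isolated += read(sample["query"], [sample["gold_doc"]]) == sample["answer"]
    n = len(samples)
    return {"integrated": integrated / n, "isolated_upper_bound": isolated / n}


samples = [
    {"query": "Who is the father of Arya Stark?", "gold_doc": "d1", "answer": "Ned"},
    {"query": "What is the capital of Germany?", "gold_doc": "d2", "answer": "Berlin"},
    {"query": "What is the capital of France?", "gold_doc": "d3", "answer": "Paris"},
]
scores = evaluate(samples, retrieve, read)
print(scores)
```

    In this toy setup the reader is perfect when given the gold document, so the whole gap between the two numbers is caused by the retriever.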

    Row-Column-Intersection Model for TableQA

    Now you can use a Row-Column-Intersection model on your own tabular data. To try it out, just replace the declaration of your TableReader:

    reader = RCIReader(row_model_name_or_path="michaelrglass/albert-base-rci-wikisql-row",
                       column_model_name_or_path="michaelrglass/albert-base-rci-wikisql-col")
    

    The RCIReader requires two separate models: one for rows and one for columns. Working on each row and column separately allows it to be used on much larger tables. Unlike the TableReader, it can also return meaningful confidence scores. Please note, however, that it currently does not support aggregations over multiple cells and that it is a bit slower than other approaches.
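
    The row-column-intersection idea can be sketched without any model: score every row and every column independently, then pick the cell where the best combination meets. The scores below are made up, the function best_cell is hypothetical, and adding the two scores is an assumption of this sketch rather than a description of how RCIReader combines them.

```python
from typing import List, Tuple


def best_cell(
    table: List[List[str]],
    row_scores: List[float],
    column_scores: List[float],
) -> Tuple[float, int, str, str]:
    """Return (score, row_index, column_name, value) of the highest-scoring cell."""
    header, *rows = table
    best = None
    for r, row in enumerate(rows):
        for c, column_name in enumerate(header):
            # Assumption: a cell's score is the sum of its row and column scores.
            score = row_scores[r] + column_scores[c]
            if best is None or score > best[0]:
                best = (score, r, column_name, row[c])
    return best


table = [
    ["City", "Country", "Population"],
    ["Berlin", "Germany", "3.6M"],
    ["Paris", "France", "2.1M"],
]
# Made-up relevance scores for the question "What is the population of Paris?"
row_scores = [0.1, 0.9]          # one score per data row
column_scores = [0.2, 0.1, 0.8]  # one score per column

best = best_cell(table, row_scores, column_scores)
print(best[1:])
```

    Because rows and columns are scored one at a time, no single model input has to hold the whole table, which is why the approach scales to larger tables.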

    Advanced File Converters

    Given a file (PDF or DOCX), Haystack now has two file converters that extract text and tables: the ParsrConverter, which is new in this release and based on the open-source Parsr tool by axa-group, and the AzureConverter, which we improved. Both return a list of dictionaries: one dictionary for each table detected in the file and one dictionary containing the text of the file. This format matches the document format and can be used right away for TableQA (see the guide).

    converter = ParsrConverter()
    docs = converter.convert(file_path="samples/pdf/sample_pdf_1.pdf")
    

    ⚠️ Breaking Changes

    • Custom id hashing on documentstore level by @ArzelaAscoIi in https://github.com/deepset-ai/haystack/pull/1910
    • Implement proper FK in MetaDocumentORM and MetaLabelORM to work on PostgreSQL by @ZanSara in https://github.com/deepset-ai/haystack/pull/1990

    🤓 Detailed Changes

    Pipeline

    • Extend TranslationWrapper to work with QA Generation by @julian-risch in https://github.com/deepset-ai/haystack/pull/1905
    • Add nDCG to pipeline.eval()'s document metrics by @tstadel in https://github.com/deepset-ai/haystack/pull/2008
    • change column order for evaluatation dataframe by @ju-gu in https://github.com/deepset-ai/haystack/pull/1957
    • Add isolated node eval mode in pipeline eval by @julian-risch in https://github.com/deepset-ai/haystack/pull/1962
    • introduce node_input param by @tstadel in https://github.com/deepset-ai/haystack/pull/1854
    • Add ParsrConverter by @bogdankostic in https://github.com/deepset-ai/haystack/pull/1931
    • Add improvements to AzureConverter by @bogdankostic in https://github.com/deepset-ai/haystack/pull/1896

    Models

    • Prevent wrapping DataParallel in second DataParallel by @bogdankostic in https://github.com/deepset-ai/haystack/pull/1855
    • Enable batch mode for SAS cross encoders by @tstadel in https://github.com/deepset-ai/haystack/pull/1987
    • Add RCIReader for TableQA by @bogdankostic in https://github.com/deepset-ai/haystack/pull/1909
    • distinguish intermediate layer & prediction layer distillation phases with different parameters by @MichelBartels in https://github.com/deepset-ai/haystack/pull/2001
    • Add TinyBERT data augmentation by @MichelBartels in https://github.com/deepset-ai/haystack/pull/1923
    • Adding distillation loss functions from TinyBERT by @MichelBartels in https://github.com/deepset-ai/haystack/pull/1879

    DocumentStores

    • Raise exception if Elasticsearch search_fields have wrong datatype by @tstadel in https://github.com/deepset-ai/haystack/pull/1913
    • Support custom headers per request in pipeline by @tstadel in https://github.com/deepset-ai/haystack/pull/1861
    • Fix retrieving documents in WeaviateDocumentStore with content_type=None by @bogdankostic in https://github.com/deepset-ai/haystack/pull/1938
    • Fix Numba TypingError in normalize_embedding for cosine similarity by @bogdankostic in https://github.com/deepset-ai/haystack/pull/1933
    • Fix loading a saved FAISSDocumentStore by @bogdankostic in https://github.com/deepset-ai/haystack/pull/1937
    • Propagate duplicate_documents to base class initialization by @yorickvanzweeden in https://github.com/deepset-ai/haystack/pull/1936
    • Fix vector_id collision in FAISS by @yorickvanzweeden in https://github.com/deepset-ai/haystack/pull/1961
    • Unify vector_dim and embedding_dim parameter in Document Store by @mathew55 in https://github.com/deepset-ai/haystack/pull/1922
    • Align similarity scores across document stores by @MichelBartels in https://github.com/deepset-ai/haystack/pull/1967
    • Bugfix - save_to_yaml for OpenSearchDocumentStore by @ArzelaAscoIi in https://github.com/deepset-ai/haystack/pull/2017
    • Fix elasticsearch scores if they are 0.0 by @tstadel in https://github.com/deepset-ai/haystack/pull/1980

    REST API

    • Rely api healthcheck on status code rather than json decoding by @fabiolab in https://github.com/deepset-ai/haystack/pull/1871
    • Bump version in REST api by @tholor in https://github.com/deepset-ai/haystack/pull/1875

    UI / Demo

    • Replace SessionState with Streamlit built-in by @yorickvanzweeden in https://github.com/deepset-ai/haystack/pull/2006
    • Fix demo deployment by @askainet in https://github.com/deepset-ai/haystack/pull/1877
    • Add models to demo docker image by @ZanSara in https://github.com/deepset-ai/haystack/pull/1978

    Documentation

    • Update pydoc-markdown-file-classifier.yml by @brandenchan in https://github.com/deepset-ai/haystack/pull/1856
    • Create v1.0 docs by @brandenchan in https://github.com/deepset-ai/haystack/pull/1862
    • Fix typo by @amotl in https://github.com/deepset-ai/haystack/pull/1869
    • Correct bug with encoding when generating Markdown documentation issue #1880 by @albertovilla in https://github.com/deepset-ai/haystack/pull/1881
    • Minor typo by @javier in https://github.com/deepset-ai/haystack/pull/1900
    • Fixed the grammatical issue in optimization guides #1940 by @eldhoittangeorge in https://github.com/deepset-ai/haystack/pull/1941
    • update link to annotation tool docu by @julian-risch in https://github.com/deepset-ai/haystack/pull/2005
    • Extend Tutorial 5 with Upper Bound Reader Eval Metrics by @julian-risch in https://github.com/deepset-ai/haystack/pull/1995
    • Add distillation to finetuning tutorial by @MichelBartels in https://github.com/deepset-ai/haystack/pull/2025
    • Add ndcg and eval_mode to docs by @tstadel in https://github.com/deepset-ai/haystack/pull/2038
    • Remove hard-coded variables from the Tutorial 15 by @dmigo in https://github.com/deepset-ai/haystack/pull/1984

    Other Changes

    • upgrade transformers to 4.13.0 by @julian-risch in https://github.com/deepset-ai/haystack/pull/1659
    • Fix typo in the Windows CI UI deps by @ZanSara in https://github.com/deepset-ai/haystack/pull/1876
    • Exchanged minimal with minimum in print_answers function call by @albertovilla in https://github.com/deepset-ai/haystack/pull/1890
    • Improved version of print_answers by @albertovilla in https://github.com/deepset-ai/haystack/pull/1891
    • Fix WIndows CI by @bogdankostic in https://github.com/deepset-ai/haystack/pull/1899
    • Changed export to csv method to new answer format by @Johnny-KP in https://github.com/deepset-ai/haystack/pull/1907
    • Unpin ray version by @dmigo in https://github.com/deepset-ai/haystack/pull/1906
    • Fix Windows CI OOM by @tstadel in https://github.com/deepset-ai/haystack/pull/1878
    • Text for contributor license agreement by @PiffPaffM in https://github.com/deepset-ai/haystack/pull/1766
    • Fix issue #1925 - UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow by @AlonEirew in https://github.com/deepset-ai/haystack/pull/1926
    • Update Ray to version 1.9.1 by @bogdankostic in https://github.com/deepset-ai/haystack/pull/1934
    • Fix #1927 - RuntimeError when loading data using data_silo due to many open file descriptors from multiprocessing by @AlonEirew in https://github.com/deepset-ai/haystack/pull/1928
    • Add GitHub Action for Docker Build for GPU by @oryx1729 in https://github.com/deepset-ai/haystack/pull/1916
    • Use Commit ID for Docker tags by @oryx1729 in https://github.com/deepset-ai/haystack/pull/1946
    • Upgrade torch version by @oryx1729 in https://github.com/deepset-ai/haystack/pull/1960
    • check multiprocessing sharing strategy is available by @julian-risch in https://github.com/deepset-ai/haystack/pull/1965
    • fix UserWarning from slow tensor conversion by @mathislucka in https://github.com/deepset-ai/haystack/pull/1948
    • Fix Dockerfile-GPU by @oryx1729 in https://github.com/deepset-ai/haystack/pull/1969
    • Use scikit-learn, not sklearn, in requirements.txt by @benjamin-klara in https://github.com/deepset-ai/haystack/pull/1974
    • Upgrade pillow version to 9.0.0 by @mapapa in https://github.com/deepset-ai/haystack/pull/1992
    • Disable pip cache for Dockerfiles by @oryx1729 in https://github.com/deepset-ai/haystack/pull/2015

    New Contributors

    • @amotl made their first contribution in https://github.com/deepset-ai/haystack/pull/1869
    • @fabiolab made their first contribution in https://github.com/deepset-ai/haystack/pull/1871
    • @albertovilla made their first contribution in https://github.com/deepset-ai/haystack/pull/1881
    • @javier made their first contribution in https://github.com/deepset-ai/haystack/pull/1900
    • @Johnny-KP made their first contribution in https://github.com/deepset-ai/haystack/pull/1907
    • @dmigo made their first contribution in https://github.com/deepset-ai/haystack/pull/1906
    • @eldhoittangeorge made their first contribution in https://github.com/deepset-ai/haystack/pull/1941
    • @yorickvanzweeden made their first contribution in https://github.com/deepset-ai/haystack/pull/1936
    • @benjamin-klara made their first contribution in https://github.com/deepset-ai/haystack/pull/1974
    • @mathew55 made their first contribution in https://github.com/deepset-ai/haystack/pull/1922
    • @mapapa made their first contribution in https://github.com/deepset-ai/haystack/pull/1992

    ❤️ Big thanks to all contributors and the whole community!

    Source code(tar.gz)
    Source code(zip)
    farm-haystack-1.1.0.tar.gz(379.42 KB)
  • v1.0.0(Dec 8, 2021)

    🎁 Haystack 1.0

    We worked hard to bring you an early Christmas present: 1.0 is out! In the last months, we re-designed many essential parts of Haystack, introduced new features, and simplified many user-facing methods. We believe Haystack is now much easier to use and a solid base for many exciting upcoming features that we plan. This release is a major milestone on our journey with you, the community, and we want to thank you again for all the great contributions, discussions, questions, and bug reports that helped us to build a better Haystack. This journey has just started 🚀

    ⭐ Highlights

    Improved Evaluation of Pipelines

    Evaluation helps you find out how well your system is doing on your data. This includes Pipeline level evaluation to ensure that the system's output is really what you're after, but also Node level evaluation so that you can figure out whether it's your Reader or Retriever that is holding back the performance.

    In this release, evaluation is much simpler and cleaner to perform. All the functionality is now baked into the Pipeline class and you can kick off the process by providing Label or MultiLabel objects to the Pipeline.eval() method.

    eval_result = pipeline.eval(
        labels=labels,
        params={"Retriever": {"top_k": 5}},
    )
    

    The output is an EvaluationResult object, which stores each Node's prediction for each sample in a Pandas DataFrame, so you can easily inspect granular predictions and potential mistakes without re-running the whole thing. There is an EvaluationResult.calculate_metrics() method that returns the relevant metrics for your evaluation, and you can print a convenient summary report, as the following snippet shows.

    metrics = eval_result.calculate_metrics()
    
    pipeline.print_eval_report(eval_result)
    

    If you'd like to start evaluating your own systems on your own data, check out our Evaluation Tutorial!

    Table QA

    A lot of valuable information is stored in tables - we've heard this again and again from the community. While they are an efficient structured data format, it hasn't been possible to search for table contents using traditional NLP techniques. But now, with the new TableTextRetriever and TableReader our users have all the tools they need to query for relevant tables and perform Question Answering.

    The TableTextRetriever is the result of our team's research into table retrieval methods which you can read about in this paper that was presented at EMNLP 2021. Behind the scenes, it uses three transformer-based encoders: one for text passages, one for tables, and one for the query. However, in Haystack, you can swap it in for any other dense retrieval model and start working with tables. The TableReader is built upon the TAPAS model, and when handed table-containing Documents, it can return a single cell as an answer or perform an aggregation operation on a set of cells to form a final answer.

    retriever = TableTextRetriever(
        document_store=document_store,
        query_embedding_model="deepset/bert-small-mm_retrieval-question_encoder",
        passage_embedding_model="deepset/bert-small-mm_retrieval-passage_encoder",
        table_embedding_model="deepset/bert-small-mm_retrieval-table_encoder",
        embed_meta_fields=["title", "section_title"]
    )
    
    reader = TableReader(
        model_name_or_path="google/tapas-base-finetuned-wtq",
        max_seq_len=512
    )
    

    Have a look at the Table QA documentation if you'd like to learn more or dive into the Table QA tutorial to start unlocking the information in your table data.

    Improved Debugging of Pipelines & Nodes

    We've made debugging much simpler and also more informative! As long as your node receives a boolean debug argument, it can propagate its input, output or even some custom information to the output of the pipeline. It is now a built-in feature of all existing nodes and can also easily be inherited by your custom nodes.

    result = pipeline.run(
        query="Who is the father of Arya Stark?",
        params={"debug": True}
    )
    
    {'ESRetriever': {'input': {'debug': True,
                               'query': 'Who is the father of Arya Stark?',
                               'root_node': 'Query',
                               'top_k': 1},
                     'output': {'documents': [<Document: {'content': "\n===In the Riverlands===\nThe Stark army reaches the Twins, a bridge strong", ...}>]
                                ...}
    

    To find out more about this feature, check out debugging. To learn how to define custom debug information, have a look at custom debugging.

    FARM Migration

    Those of you following Haystack from its first days will know that Haystack first evolved out of the FARM framework. While FARM is designed to handle diverse NLP models and tasks, Haystack gives full end-to-end support to search and question answering use cases with a focus on coordinating all components that take a proof-of-concept into production.

    Haystack has always relied on FARM for much lower-level processing and modeling. To reduce the implementation overhead and simplify debugging, we have migrated the relevant parts of FARM into the new haystack/modeling package.

    :warning: Breaking Changes & Migration Guide

    Migration to v1.0

    With the release of v1.0, we decided to make some bold changes. We believe this has brought a significant improvement in usability and makes the project more future-proof. This does come with a few breaking changes, and we do our best to guide you on how to go from v0.x to v1.0. For more details, see the Migration Guide, and if you need more guidance, just reach out via Slack.

    New Package Structure & Changed Imports

    Due to the ever-increasing number of Nodes and Document Stores being integrated into Haystack, we felt the need to implement a repository structure that makes it easier to navigate to what you're looking for. We've also shortened the length of the imports.

    haystack.document_stores

    • All Document Stores can now be directly accessed from here
    • Note the pluralization of document_store to document_stores

    haystack.nodes

    • This directory directly contains any class that can be used as a node
    • This includes File Converters and PreProcessors

    haystack.pipelines

    • This contains all the base, custom and pre-made pipeline classes
    • Note the pluralization of pipeline to pipelines

    haystack.utils

    • Any utility functions

    :arrow_right: For the large majority of imports, the old style still works but this will be deprecated in future releases!
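
    The two pluralized package names noted above are the renames most likely to touch existing code. As a minimal sketch, a regular expression can update those two module paths in a source string. The helper name pluralize_imports is hypothetical and not part of Haystack, and it does not handle classes that moved to other packages such as haystack.nodes; those need manual changes.

```python
import re

# Matches only the old singular package names. The trailing `\b` prevents a
# match on the already-pluralized `haystack.document_stores` / `haystack.pipelines`.
_OLD_PACKAGE = re.compile(r"\bhaystack\.(document_store|pipeline)\b")


def pluralize_imports(source: str) -> str:
    """Rewrite `haystack.document_store` and `haystack.pipeline` to their plural names."""
    return _OLD_PACKAGE.sub(r"haystack.\1s", source)


old_code = (
    "from haystack.document_store.elasticsearch import ElasticsearchDocumentStore\n"
    "from haystack.pipeline import Pipeline\n"
)
print(pluralize_imports(old_code))
```

    Running the helper twice is harmless because the pattern never matches the plural names.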

    Primitive Objects

    Instead of relying on dictionaries, Haystack now standardizes more of the inputs and outputs of Nodes using the following primitive classes:

    With these, there is now support for data structures beyond text and the REST API schema is built around their structure. Using these classes also allows for the autocompletion of fields in your IDE.

    Tip: To see examples of these primitive classes being returned, have a look at Ready-Made Pipelines.

    Many of the fields in these classes have also been renamed or removed. You can see a more comprehensive list of them in this Github issue. Below, we will go through a few cases that are likely to impact established workflows.

    Input Document Format

    This dictionary schema used to be the recommended way to prepare your data to be indexed. Now we strongly recommend using our dedicated Document class as a replacement. The text field has been renamed to content to accommodate cases where it is used for another data format, for example in Table QA.

    v0.x:

    doc = {
        'text': 'DOCUMENT_TEXT_HERE',
        'meta': {'name': DOCUMENT_NAME, ...}
    }
    

    v1.0:

    doc = Document(
        content='DOCUMENT_TEXT_HERE',
        meta={'name': DOCUMENT_NAME, ...}
    )
    

    From here, you can take the same steps to write Documents into your Document Store.

    document_store.write_documents([doc])
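
    If you have many v0.x-style dictionaries, the rename from text to content can be sketched as a small helper. This is an illustration only: the function name to_v1_document_fields and the sample file name are hypothetical, and the helper only renames the one field described above.

```python
from typing import Any, Dict


def to_v1_document_fields(old_doc: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a v0.x document dict with `text` renamed to `content`."""
    new_doc = dict(old_doc)  # shallow copy so the caller's dict is left untouched
    if "text" in new_doc:
        new_doc["content"] = new_doc.pop("text")
    return new_doc


old_doc = {"text": "DOCUMENT_TEXT_HERE", "meta": {"name": "example.txt"}}
fields = to_v1_document_fields(old_doc)
print(fields)
# The resulting fields match the keyword arguments shown in the v1.0 example,
# so they could be passed on as Document(**fields).
```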
    

    Response format of Reader

    All Reader Nodes now return Answer objects instead of dictionaries.

    v0.x:

    [
        {
            'answer': 'Fang',
            'score': 13.26807975769043,
            'probability': 0.9657130837440491,
            'context': """Криволапик (Kryvolapyk, kryvi lapy "crooked paws")
                ===Fang (Hagrid's dog)===
                *Chinese (PRC): 牙牙 (ya2 ya) (from 牙 "tooth", 牙,"""
        }
    ]
    

    v1.0:

    [
        <Answer {'answer': 'Eddard', 'type': 'extractive', 'score': 0.9946763813495636, 'context': "s Nymeria after a legendary warrior queen. She travels...", 'offsets_in_document': [{'start': 147, 'end': 153}], 'offsets_in_context': [{'start': 72, 'end': 78}], 'document_id': 'ba2a8e87ddd95e380bec55983ee7d55f', 'meta': {'name': '43_Arya_Stark.txt'}}>,
        <Answer {'answer': 'King Robert', 'type': 'extractive', 'score': 0.9251320660114288, 'context': 'ordered by the Lord of Light. Melisandre later reveals to Gendry that...', 'offsets_in_document': [{'start': 1808, 'end': 1819}], 'offsets_in_context': [{'start': 70, 'end': 81}], 'document_id': '7b67b0e27571c2b2025a34b4db18ad49', 'meta': {'name': '349_List_of_Game_of_Thrones_characters.txt'}}>,
        <Answer {'answer': 'Ned', 'type': 'extractive', 'score': 0.8103329539299011, 'context': " girl disguised as a boy all along and is surprised to learn she is Arya...", 'offsets_in_document': [{'start': 920, 'end': 923}], 'offsets_in_context': [{'start': 74, 'end': 77}], 'document_id': '7b67b0e27571c2b2025a34b4db18ad49', 'meta': {'name': '349_List_of_Game_of_Thrones_characters.txt'}}>,
        ...
    ]
    

    Label Structure

    The attributes of the Label object have gone through some changes. To see their current structure see Label.

    v0.x:

    label = Label(
        question=QUESTION_TEXT_HERE,
        answer=ANSWER_STRING_HERE,
        ...
    )
    

    v1.0:

    label = Label(
        query=QUERY_TEXT_HERE,
        answer=Answer(...),
        ...
    )
    

    REST API Format

    The response format for the /query endpoint matches that of the primitive objects, only in JSON form. This means there are similar breaking changes to those described above for the Answer format of a Reader. In particular, the names of the offset fields have changed and need to be aligned to the new format when coming from Haystack v0.x. For detailed examples and guidance, see the Migration Guide.

    Other breaking changes

    • Save/load of FAISSDocumentstore by @ZanSara in https://github.com/deepset-ai/haystack/pull/1459
    • Add AzureConverter & change response format of FileConverter.convert() by @bogdankostic in https://github.com/deepset-ai/haystack/pull/1813

    :nerd_face: Detailed Changes

    Pipeline

    • Return intermediate nodes output in pipelines by @ZanSara in https://github.com/deepset-ai/haystack/pull/1558
    • Add debug and debug_logs params to standard pipelines by @tholor in https://github.com/deepset-ai/haystack/pull/1586
    • Pipeline node names validation by @ZanSara in https://github.com/deepset-ai/haystack/pull/1601
    • Multi query eval by @tstadel in https://github.com/deepset-ai/haystack/pull/1746
    • Pipelines now tolerate custom _debug content by @ZanSara in https://github.com/deepset-ai/haystack/pull/1756
    • Adding yaml functionality to standard pipelines (save/load...) by @MichelBartels in https://github.com/deepset-ai/haystack/pull/1735
    • Calculation of metrics and presentation of eval results by @tstadel in https://github.com/deepset-ai/haystack/pull/1760
    • Fix loading and saving of EvaluationResult by @tstadel in https://github.com/deepset-ai/haystack/pull/1831
    • remove queries param from pipeline.eval() by @tstadel in https://github.com/deepset-ai/haystack/pull/1836
    • Deprecate old pipeline eval nodes: EvalDocuments and EvalAnswers by @tstadel in https://github.com/deepset-ai/haystack/pull/1778

    Models

    • Farm merging base by @Timoeller in https://github.com/deepset-ai/haystack/pull/1422
    • Add inferencer for QA only by @julian-risch in https://github.com/deepset-ai/haystack/pull/1484
    • Remove mentions of FARM from Ranker comments by @julian-risch in https://github.com/deepset-ai/haystack/pull/1535
    • Remove NER and text classification from model conversion by @julian-risch in https://github.com/deepset-ai/haystack/pull/1536
    • TransformersDocumentClassifier replacing FARMClassifier by @julian-risch in https://github.com/deepset-ai/haystack/pull/1540
    • LFQA: Remove InferenceProcessor dependency by @vblagoje in https://github.com/deepset-ai/haystack/pull/1559
    • Add BatchEncoding flatten by @vblagoje in https://github.com/deepset-ai/haystack/pull/1562
    • Enable GPU usage for question generator by @tholor in https://github.com/deepset-ai/haystack/pull/1571
    • Create EntityExtractor by @ZanSara in https://github.com/deepset-ai/haystack/pull/1573
    • Add more flexible options for model downloads (Proxies, resume_download, local_files_only...) by @tholor in https://github.com/deepset-ai/haystack/pull/1256
    • Add checkpointing for reader.train() to allow stopping + resuming training by @gak97 in https://github.com/deepset-ai/haystack/pull/1554
    • DPR training: Rename TransformersAdamW to AdamW by @ZanSara in https://github.com/deepset-ai/haystack/pull/1613
    • Add TableTextRetriever by @bogdankostic in https://github.com/deepset-ai/haystack/pull/1529
    • Truncate too large tables for TableReader by @bogdankostic in https://github.com/deepset-ai/haystack/pull/1662
    • ensure tf-idf matrix calculation before retrieval by @julian-risch in https://github.com/deepset-ai/haystack/pull/1665
    • Add TableTextRetriever to nodes' init.py by @bogdankostic in https://github.com/deepset-ai/haystack/pull/1678
    • Fix TableReader when model does not select any cells by @bogdankostic in https://github.com/deepset-ai/haystack/pull/1703
    • Standardize initialisation of device settings by @bogdankostic in https://github.com/deepset-ai/haystack/pull/1683
    • fix issue #1687 - DPR training fails on multiple GPU's by @AlonEirew in https://github.com/deepset-ai/haystack/pull/1688
    • Allow TableReader models without aggregation classifier by @bogdankostic in https://github.com/deepset-ai/haystack/pull/1772
    • Huggingface private model support via API tokens (FARMReader) by @ArzelaAscoIi in https://github.com/deepset-ai/haystack/pull/1775
    • private hugging face models for retrievers by @ArzelaAscoIi in https://github.com/deepset-ai/haystack/pull/1785
    • Model Distillation by @MichelBartels in https://github.com/deepset-ai/haystack/pull/1758
    • Added max_seq_length and batch_size params to embeddingretriever by @AhmedIdr in https://github.com/deepset-ai/haystack/pull/1817
    • Fix bug ranker: wrong lambda function by @gabinguo in https://github.com/deepset-ai/haystack/pull/1824

    DocumentStores

    • Fix bug when loading FAISS from supplied config file path by @ZanSara in https://github.com/deepset-ai/haystack/pull/1506
    • Standardize delete_documents(filter=...) across all document stores by @ZanSara in https://github.com/deepset-ai/haystack/pull/1509
    • Update sql.py to ignore multi thread issues. by @adithyaur99 in https://github.com/deepset-ai/haystack/pull/1442
    • [fix] MySQL connection 'check_same_thread' error by @CandiceYu8 in https://github.com/deepset-ai/haystack/pull/1585
    • Delete documents by ID in all document stores by @ZanSara in https://github.com/deepset-ai/haystack/pull/1606
    • Fix Opensearch field type (flattened -> nested) by @tholor in https://github.com/deepset-ai/haystack/pull/1609
    • Add delete_labels() except for weaviate doc store by @julian-risch in https://github.com/deepset-ai/haystack/pull/1604
    • Experimental changes to support Milvus 2.x by @lalitpagaria in https://github.com/deepset-ai/haystack/pull/1473
    • Fix import in Milvus2DocumentStore by @ZanSara in https://github.com/deepset-ai/haystack/pull/1646
    • Allow setting of scroll param in ElasticsearchDocumentStore by @Timoeller in https://github.com/deepset-ai/haystack/pull/1645
    • Rename every occurrence of 'embed_passages' with 'embed_documents' by @ZanSara in https://github.com/deepset-ai/haystack/pull/1667
    • Cosine similarity for the rest of DocStores. by @fingoldo in https://github.com/deepset-ai/haystack/pull/1569
    • Make weaviate more compliant to other doc stores (UUIDs and dummy embeddings) by @julian-risch in https://github.com/deepset-ai/haystack/pull/1656
    • Make FAISSDocumentStore work with yaml by @tstadel in https://github.com/deepset-ai/haystack/pull/1727
    • Capitalize starting letter in params by @nishanthcgit in https://github.com/deepset-ai/haystack/pull/1750
    • Support Tables in all DocumentStores by @bogdankostic in https://github.com/deepset-ai/haystack/pull/1744
    • Facilitate concurrent query / indexing in Elasticsearch with dense retrievers (new skip_missing_embeddings param) by @cvgoudar in https://github.com/deepset-ai/haystack/pull/1762
    • Introduced an arg to add synonyms - Elasticsearch by @SjSnowball in https://github.com/deepset-ai/haystack/pull/1625
    • Allow SQLDocumentStore to filter by many filters by @ZanSara in https://github.com/deepset-ai/haystack/pull/1776

    REST API

    • Add rest api endpoint to delete documents by filter by @ZanSara in https://github.com/deepset-ai/haystack/pull/1546
    • Fix circular import in the REST API by @ZanSara in https://github.com/deepset-ai/haystack/pull/1556
    • Add /documents/get_by_filters endpoint by @ZanSara in https://github.com/deepset-ai/haystack/pull/1580
    • Add a restart policy on-failure to all containers by @ZanSara in https://github.com/deepset-ai/haystack/pull/1664
    • Add execute permissions to file upload folder by @Timoeller in https://github.com/deepset-ai/haystack/pull/1666
    • disable file upload for InMemoryDocStore by @julian-risch in https://github.com/deepset-ai/haystack/pull/1677
    • Improve open api spec by @tholor in https://github.com/deepset-ai/haystack/pull/1700
    • Fix usage of filters in /query endpoint in REST API by @tholor in https://github.com/deepset-ai/haystack/pull/1774
    • ignore empty filters parameter by @julian-risch in https://github.com/deepset-ai/haystack/pull/1783
    • Fix the REST API tests by @ZanSara in https://github.com/deepset-ai/haystack/pull/1791

    UI / Demo

    • Add "API is loading" message in the UI by @ZanSara in https://github.com/deepset-ai/haystack/pull/1493
    • Fix answer format in ui by @tholor in https://github.com/deepset-ai/haystack/pull/1591
    • Change 'ESRetriever' with 'Retriever' in the Streamlit app by @ZanSara in https://github.com/deepset-ai/haystack/pull/1620
    • Public demo by @ZanSara in https://github.com/deepset-ai/haystack/pull/1747
    • Small fixes to the public demo by @ZanSara in https://github.com/deepset-ai/haystack/pull/1781
    • Add missing dependency to the Streamlit container by @ZanSara in https://github.com/deepset-ai/haystack/pull/1798
    • Improve the Random Question functionality by @ZanSara in https://github.com/deepset-ai/haystack/pull/1808
    • Add description to the demo by @ZanSara in https://github.com/deepset-ai/haystack/pull/1809
    • Fix UI demo feedback by @ZanSara in https://github.com/deepset-ai/haystack/pull/1816
    • Remove feedback from no-answers by @ZanSara in https://github.com/deepset-ai/haystack/pull/1827
    • Demo UI add env vars & other small fixes by @ZanSara in https://github.com/deepset-ai/haystack/pull/1828
    • More demo bugfixes by @ZanSara in https://github.com/deepset-ai/haystack/pull/1832
    • Add backlink below the context, if available in the doc's meta by @ZanSara in https://github.com/deepset-ai/haystack/pull/1834

    Documentation

    • changed delete_all_documents to delete_documents in Tutorial5 by @ju-gu in https://github.com/deepset-ai/haystack/pull/1477
    • Regenerate API and Tutorial md files by @brandenchan in https://github.com/deepset-ai/haystack/pull/1480
    • Define SAS model in notebook by @brandenchan in https://github.com/deepset-ai/haystack/pull/1485
    • Update Tutorial1_Basic_QA_Pipeline.ipynb by @julian-risch in https://github.com/deepset-ai/haystack/pull/1489
    • Clarify PDF conversion, languages and encodings by @MarkusSagen in https://github.com/deepset-ai/haystack/pull/1570
    • Fix Tutorials by @tholor in https://github.com/deepset-ai/haystack/pull/1594
    • Update Crawler documentation by @ju-gu in https://github.com/deepset-ai/haystack/pull/1588
    • add note on gpu runtime to tutorial 13 by @julian-risch in https://github.com/deepset-ai/haystack/pull/1614
    • Update jobs link to personio by @julian-risch in https://github.com/deepset-ai/haystack/pull/1611
    • Update jobs link in readme by @julian-risch in https://github.com/deepset-ai/haystack/pull/1629
    • Bugfix Tutorial 5 parameters, adjust default split length by @Timoeller in https://github.com/deepset-ai/haystack/pull/1635
    • Fix parameter names in tutorial 5 and 12 by @julian-risch in https://github.com/deepset-ai/haystack/pull/1639
    • Link the logo to the website by @aantti in https://github.com/deepset-ai/haystack/pull/1649
    • Replace Haystack banner for readme by @brandenchan in https://github.com/deepset-ai/haystack/pull/1654
    • Update README.md by @brandenchan in https://github.com/deepset-ai/haystack/pull/1653
    • fix typo in docstring of crawler by @ju-gu in https://github.com/deepset-ai/haystack/pull/1673
    • Add TableQA tutorial by @bogdankostic in https://github.com/deepset-ai/haystack/pull/1670
    • Add collapsing sections to readme by @brandenchan in https://github.com/deepset-ai/haystack/pull/1663
    • Fix links in readme.md by @brandenchan in https://github.com/deepset-ai/haystack/pull/1682
    • fixed typo by @julian-risch in https://github.com/deepset-ai/haystack/pull/1680
    • Standardize similarity argument description by @brandenchan in https://github.com/deepset-ai/haystack/pull/1684
    • Fix Typo in TableQA Tutorial by @Timoeller in https://github.com/deepset-ai/haystack/pull/1690
    • Improve tutorials' output by @ZanSara in https://github.com/deepset-ai/haystack/pull/1694
    • Tutorial for DocumentClassifier at Index Time by @tstadel in https://github.com/deepset-ai/haystack/pull/1697
    • Update API Reference Pages for v1.0 by @brandenchan in https://github.com/deepset-ai/haystack/pull/1729
    • Add debugging example to tutorial by @brandenchan in https://github.com/deepset-ai/haystack/pull/1731
    • Fix a few details of some tutorials by @ZanSara in https://github.com/deepset-ai/haystack/pull/1733
    • initialize doc store with doc and label index in tutorial 5 by @julian-risch in https://github.com/deepset-ai/haystack/pull/1730
    • Fix Tutorial 11 on Google Colab by @bogdankostic in https://github.com/deepset-ai/haystack/pull/1795
    • Fix link to colab notebook in tutorial 16 by @julian-risch in https://github.com/deepset-ai/haystack/pull/1802

    Other Changes

    • Redesign primitives by @tholor in https://github.com/deepset-ai/haystack/pull/1398
    • Adding prediction head, trainer, evaluator from FARM by @julian-risch in https://github.com/deepset-ai/haystack/pull/1419
    • Farm merging base bogdan by @Timoeller in https://github.com/deepset-ai/haystack/pull/1424
    • Add data, add tests for qa processor, add dpr tests (some failing) by @Timoeller in https://github.com/deepset-ai/haystack/pull/1425
    • Fix DPR tests + add Tokenizer tests by @bogdankostic in https://github.com/deepset-ai/haystack/pull/1429
    • Farm merging base fix test by @Timoeller in https://github.com/deepset-ai/haystack/pull/1444
    • Automate updates docstrings tutorials by @PiffPaffM in https://github.com/deepset-ai/haystack/pull/1461
    • fixed workflow conflict with introducing new one by @PiffPaffM in https://github.com/deepset-ai/haystack/pull/1472
    • feat: normalize embeddings for faiss cosine similarity by @mathislucka in https://github.com/deepset-ai/haystack/pull/1352
    • Remove 'restart=always' from 'haystack-api' in both docker-compose files by @ZanSara in https://github.com/deepset-ai/haystack/pull/1498
    • Add comment to tutorial notebooks about restarting runtime in colab by @bogdankostic in https://github.com/deepset-ai/haystack/pull/1486
    • Feat: Download archive from url without temp file by @lalitpagaria in https://github.com/deepset-ai/haystack/pull/1470
    • Add newline between paragraphs in DocxToTextConverter by @bogdankostic in https://github.com/deepset-ai/haystack/pull/1500
    • Release Docs 0.10.0 by @PiffPaffM in https://github.com/deepset-ai/haystack/pull/1460
    • Simplify tests & allow running on individual doc stores by @tholor in https://github.com/deepset-ai/haystack/pull/1487
    • Replace FARM import statements; add dependencies by @julian-risch in https://github.com/deepset-ai/haystack/pull/1492
    • Fix document_store_type flag for tests with multiple fixtures by @tholor in https://github.com/deepset-ai/haystack/pull/1526
    • Remove double mentions from requirements by @bogdankostic in https://github.com/deepset-ai/haystack/pull/1545
    • Format doc classifier usage example by @julian-risch in https://github.com/deepset-ai/haystack/pull/1550
    • Adding TfidfRetriever to init.py of the retriever package by @mhamdan91 in https://github.com/deepset-ai/haystack/pull/1575
    • Limit generator tests to memory doc store; split pipeline tests by @julian-risch in https://github.com/deepset-ai/haystack/pull/1602
    • Add Table Reader by @bogdankostic in https://github.com/deepset-ai/haystack/pull/1446
    • Switch from dataclass to pydantic dataclass & Fix Swagger API Docs by @tholor in https://github.com/deepset-ai/haystack/pull/1598
    • Use smaller model for one generator test case by @julian-risch in https://github.com/deepset-ai/haystack/pull/1622
    • Make EntityExtractor work when loaded from YAML by @ZanSara in https://github.com/deepset-ai/haystack/pull/1636
    • Improve docker images: Add nltk download, add folder for file upload by @Timoeller in https://github.com/deepset-ai/haystack/pull/1633
    • Refactoring of the haystack package by @ZanSara in https://github.com/deepset-ai/haystack/pull/1624
    • Remove trailing comma in import statement by @julian-risch in https://github.com/deepset-ai/haystack/pull/1655
    • Raise a warning if the 'query' param of the 'query' method of 'ElasticsearchDocumentStore' is not a string by @ZanSara in https://github.com/deepset-ai/haystack/pull/1674
    • Add CI for windows runner by @lalitpagaria in https://github.com/deepset-ai/haystack/pull/1458
    • rename text variable of document to content by @julian-risch in https://github.com/deepset-ai/haystack/pull/1704
    • Simplify logs management by @ZanSara in https://github.com/deepset-ai/haystack/pull/1696
    • Change answer aggregation key to (doc_id, query) instead of (label_id, query) by @julian-risch in https://github.com/deepset-ai/haystack/pull/1726
    • Fix another self.device/s typo by @ZanSara in https://github.com/deepset-ai/haystack/pull/1734
    • Fix print_answers by @ZanSara in https://github.com/deepset-ai/haystack/pull/1743
    • Split pipeline tests into three suites by @ZanSara in https://github.com/deepset-ai/haystack/pull/1755
    • Split summarizer tests in order to make windows CI work again by @tstadel in https://github.com/deepset-ai/haystack/pull/1757
    • Update test_pipeline_extractive_qa.py by @julian-risch in https://github.com/deepset-ai/haystack/pull/1763
    • Exclude test_summarizer_translation.py for windows_ci by @tstadel in https://github.com/deepset-ai/haystack/pull/1759
    • Upgrade torch to v1.10.0 by @bogdankostic in https://github.com/deepset-ai/haystack/pull/1789
    • Adapt docker-compose-gpu.yml to use DPR by default by @ZanSara in https://github.com/deepset-ai/haystack/pull/1810
    • bugfix metadata extraction in form recognizer & split of surrounding content length by @ju-gu in https://github.com/deepset-ai/haystack/pull/1829
    • Fix OOM in test_eval.py Windows CI by @tstadel in https://github.com/deepset-ai/haystack/pull/1830
    • Update evaluation tutorial to cover the new pipeline.eval() by @julian-risch in https://github.com/deepset-ai/haystack/pull/1765
    • Add config for github release notes by @tholor in https://github.com/deepset-ai/haystack/pull/1840
    • Extend categories for release notes by @tholor in https://github.com/deepset-ai/haystack/pull/1841

    New Contributors

    • @mathislucka made their first contribution in https://github.com/deepset-ai/haystack/pull/1352
    • @ZanSara made their first contribution in https://github.com/deepset-ai/haystack/pull/1459
    • @ju-gu made their first contribution in https://github.com/deepset-ai/haystack/pull/1477
    • @adithyaur99 made their first contribution in https://github.com/deepset-ai/haystack/pull/1442
    • @mhamdan91 made their first contribution in https://github.com/deepset-ai/haystack/pull/1575
    • @CandiceYu8 made their first contribution in https://github.com/deepset-ai/haystack/pull/1585
    • @gak97 made their first contribution in https://github.com/deepset-ai/haystack/pull/1554
    • @fingoldo made their first contribution in https://github.com/deepset-ai/haystack/pull/1569
    • @AlonEirew made their first contribution in https://github.com/deepset-ai/haystack/pull/1688
    • @tstadel made their first contribution in https://github.com/deepset-ai/haystack/pull/1697
    • @nishanthcgit made their first contribution in https://github.com/deepset-ai/haystack/pull/1750
    • @ArzelaAscoIi made their first contribution in https://github.com/deepset-ai/haystack/pull/1775
    • @SjSnowball made their first contribution in https://github.com/deepset-ai/haystack/pull/1625
    • @AhmedIdr made their first contribution in https://github.com/deepset-ai/haystack/pull/1817
    • @gabinguo made their first contribution in https://github.com/deepset-ai/haystack/pull/1824

    :heart: Thanks to all contributors and the whole community!

  • v0.10.0(Sep 16, 2021)

    :star: Highlights

    :rocket: Making Pipelines more scalable

    You can now easily scale and distribute Haystack Pipelines thanks to the new integration of the Ray framework (https://ray.io/). Ray allows distributing a Pipeline's components across a cluster of machines. The individual components of a Pipeline can be independently scaled. For instance, an extractive QA Pipeline deployment can have three replicas of the Reader and a single replica for the Retriever. It enables efficient resource utilization by horizontally scaling Components. You can use Ray via the new RayPipeline class (#1255)

    To set the number of replicas, add replicas in the YAML config for the node in a pipeline:

    components:
        ...
    
    pipelines:
      - name: ray_query_pipeline
        type: RayPipeline
        nodes:
          - name: ESRetriever
            replicas: 2  # number of replicas to create on the Ray cluster
            inputs: [ Query ]
    

    A RayPipeline currently can only be created with a YAML Pipeline config:

    from haystack.pipeline import RayPipeline
    pipeline = RayPipeline.load_from_yaml(path="my_pipelines.yaml", pipeline_name="my_query_pipeline")
    pipeline.run(query="What is the capital of Germany?")
    

    See docs for more details

    :heart_eyes: Making Pipelines more user-friendly

    The old Pipeline design came with a couple of flaws:

    • Impossible to route certain parameters (e.g. top_k) to dedicated nodes
    • Incorrect parameters in pipeline.run() are silently swallowed
    • Hard to understand what is in **kwargs when working with node.run() methods
    • Hard to debug

    We tackled those with a big refactoring of the Pipeline class and changed how data is passed between nodes #1321. This comes now with a few breaking changes:

    Component params like top_k and no_ans_boost for Pipeline.run() must now be passed in a params dict:

    pipeline.run(query="Why?", params={"top_k":10, "no_ans_boost":0.5})
    
    

    Component specific top_ks like top_k_reader, top_k_retriever are now replaced with top_k. To disambiguate, the params can be "targeted" to a specific node.

    pipeline.run(query="Why?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}})
    

    See breaking changes section and the docs for details
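
    The difference between global and targeted params can be sketched in plain Python. This is only an illustration of the idea, not Haystack's routing code; the function name resolve_params and the node names are hypothetical. Keys that match a node name are treated as node-specific, and every other key is offered to all nodes.

```python
from typing import Any, Dict, List


def resolve_params(params: Dict[str, Any], node_names: List[str]) -> Dict[str, Dict[str, Any]]:
    """Split a params dict into the keyword arguments each node would receive."""
    # Anything that is not keyed by a node name counts as a global parameter.
    global_params = {k: v for k, v in params.items() if k not in node_names}
    resolved: Dict[str, Dict[str, Any]] = {}
    for name in node_names:
        node_params = dict(global_params)          # start from the global values
        node_params.update(params.get(name, {}))   # targeted values take precedence
        resolved[name] = node_params
    return resolved


nodes = ["Retriever", "Reader"]
print(resolve_params({"top_k": 10}, nodes))
print(resolve_params({"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}, nodes))
```

    With a global top_k both nodes see the same value, while the targeted form lets the Retriever and the Reader use different values.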

    :chart_with_upwards_trend: Better evaluation metric for QA: Semantic Answer Similarity (SAS)

    The evaluation of question answering models compares ground-truth annotations with model predictions. However, as of today, this comparison is mostly lexical-based and therefore misses out on answers that have no lexical overlap but are still semantically similar, thus treating correct answers as false. This underestimation of the true performance of models hinders user acceptance in applications and complicates a fair comparison of different models. Therefore, there is a need for an evaluation metric that is based on semantics instead of pure string similarity. In our recent EMNLP paper, we proposed "SAS", a cross-encoder-based metric for the estimation of semantic answer similarity. We compared it to seven existing metrics and found that it correlates better with human judgement. See our paper #1338

    You can use it in Haystack like this:

    ...
    # initialize the node with a SAS model
    eval_reader = EvalAnswers(sas_model="sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
    
    # define a pipeline 
    p = Pipeline()
    p.add_node(component=retriever, name="ESRetriever", inputs=["Query"])
    p.add_node(component=eval_retriever, name="EvalDocuments", inputs=["ESRetriever"])
    p.add_node(component=reader, name="QAReader", inputs=["EvalDocuments"])
    p.add_node(component=eval_reader, name="EvalAnswers", inputs=["QAReader"])
    ...
    

    See our updated Tutorial 5 for a full example.
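
    To see the limitation of lexical metrics that motivates SAS, here is a minimal sketch of exact match and token-level F1 in plain Python. It is a simplified version of the usual QA metrics (no punctuation or article normalization) and is not Haystack's evaluation code; the answer pairs are made up for illustration.

```python
from collections import Counter


def exact_match(prediction: str, truth: str) -> bool:
    """True only if both strings are identical after trimming and lowercasing."""
    return prediction.strip().lower() == truth.strip().lower()


def token_f1(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall on whitespace tokens."""
    pred_tokens = prediction.lower().split()
    true_tokens = truth.lower().split()
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


# (ground truth, prediction) pairs that a human would judge as equivalent
pairs = [("thirty percent", "30%"), ("Eddard Stark", "Ned Stark")]
for truth, prediction in pairs:
    print(truth, "|", prediction, "|", exact_match(prediction, truth), token_f1(prediction, truth))
```

    Both pairs fail exact match, and the first one also scores zero on F1 because the strings share no tokens, even though the answers mean the same thing.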

    :exploding_head: New nodes: Doc Classifier, Re-Ranker, QuestionGenerator & more

    More nodes, more use cases:

    • FARMClassifier node for Document Classification: tag a document at indexing time or add a class downstream in your inference pipeline #1265
    • SentenceTransformersRanker: Re-Rank your documents after retrieval to maximize the relevance of your results. This implementation uses the popular sentence-transformer models #1209
    • QuestionGenerator: Question Answering systems are trained to find an answer given a question and a document. With the recent advances in generative NLP, there are now models that can read a document and suggest questions that can be answered by that document. All this power is available to you now via the QuestionGenerator class. QuestionGenerator models can be trained using Question Answering datasets. Instead of predicting answers, the QuestionGenerator takes the document as input and is trained to output the questions. This can be useful when you want to add "autosuggest" questions in your search bar or accelerate labeling processes. See docs (#1267)

    :telescope: Better support for OpenSearch

    We now support Approximate nearest neighbour (ANN) search in OpenSearch (#1225) and fixed some initialization issues.

    :bookmark_tabs: New Tutorials

    ⚠️ Breaking Changes

    probability field removed from results #1340

    Having two fields, probability and score, in answers / documents returned from nodes often caused confusion. From now on we'll only have one field called score that is in the range [0,1]. In QA results, this field is populated with the old probability value, so you can simply switch to this one. These fields have changed in both the Python and the REST API.

    Old:

    {
      "query": "Who is the father of Arya Stark?",
      "answers": [
        {
          "answer": "Lord Eddard Stark",
          "score": 14.684528350830078,
          "probability": 0.9044522047042847,
          "context": ...,
          ...
        },
       ...
       ]
    }
    
    

    New:

    {
      "query": "Who is the father of Arya Stark?",
      "answers": [
        {
          "answer": "Lord Eddard Stark",
          "score": 0.9044522047042847,
          "context": ...,
          ...
        },
       ...
       ]
    }
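
Client code that still holds results in the old layout could be adapted with a small helper along these lines. `migrate_answer` is a hypothetical name, not part of Haystack, and the sketch only covers the two fields discussed above.

```python
from typing import Any, Dict


def migrate_answer(old_answer: Dict[str, Any]) -> Dict[str, Any]:
    """Convert an old-style answer dict to the new single-`score` layout."""
    new_answer = dict(old_answer)  # shallow copy, the input stays untouched
    if "probability" in new_answer:
        # The new `score` carries the old `probability` value.
        new_answer["score"] = new_answer.pop("probability")
    return new_answer


old = {
    "answer": "Lord Eddard Stark",
    "score": 14.684528350830078,
    "probability": 0.9044522047042847,
}
new = migrate_answer(old)
print(new)
```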
    
    

    Removed Finder #1326

    After being deprecated a few months ago, Finder is now gone - R.I.P

    Params in Pipeline.run() #1321

    Component params like top_k, no_ans_boost for Pipeline.run() must be passed in a params dict

    Old:

    pipeline.run(query="Why?", top_k_retriever=10, no_ans_boost=0.5)
    

    New:

    pipeline.run(query="Why?", params={"top_k":10, "no_ans_boost":0.5})
    
    

    Component specific top_ks like top_k_reader, top_k_retriever are now replaced with top_k. To disambiguate, the params can be "targeted" to a specific node.

    Old:

    pipeline.run(query="Why?", top_k_retriever=10, top_k_reader=5)
    

    New:

    pipeline.run(query="Why?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}})
    

    Also, custom nodes must not have **kwargs in their run methods anymore and should only return the data (e.g. answers) they produce themselves.
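
A custom node in this style might look roughly like the sketch below. `MinLengthFilter` is a hypothetical example: it uses plain dicts for documents and does not subclass any Haystack base class, so that it runs without Haystack installed. It also assumes that the "output_1" edge-name convention still applies.

```python
from typing import Dict, List, Tuple


class MinLengthFilter:
    """Hypothetical custom node: explicit run() arguments, no **kwargs."""

    outgoing_edges = 1

    def run(self, documents: List[Dict], min_length: int = 10) -> Tuple[Dict, str]:
        # Keep only documents whose text is at least `min_length` characters long.
        kept = [doc for doc in documents if len(doc["text"]) >= min_length]
        # Return only the data this node produces, plus the outgoing edge name.
        return {"documents": kept}, "output_1"


docs = [{"text": "short"}, {"text": "a considerably longer passage"}]
output, edge = MinLengthFilter().run(documents=docs, min_length=10)
print(output, edge)
```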

    :nerd_face: Detailed Changes

    Crawler

    • Serialize crawler output to JSON #1284
    • Add Crawler support for indexing pipeline #1360

    Converter

    • Add ImageToTextConverter and PDFToTextOCRConverter that utilize OCR #1349

    Preprocessor

    • Add PreProcessor optional language parameter. #1160
    • Improve preprocessing logging #1263
    • Make PreProcessor.process() work on lists of documents #1163

    Pipeline

    • Add Ray integration for Pipelines #1255
    • MostSimilarDocumentsPipeline introduced #1413
    • QoL function: access certain nodes in pipeline #1441
    • Refactor replicas config for Ray Pipelines #1378
    • Add simple docs2answer node to allow FAQ style QA / Doc search in API #1361
    • Allow for batch indexing when using Pipelines fix #1168 #1231

    Document Stores

    • Implement OpenSearch ANN #1225
    • Bump Weaviate version to 1.7.0 #1412
    • Catch Elastic's search_phase_execution and raise with descriptive message. #1371
    • Fix behavior of delete_documents() with filters for Milvus #1354
    • delete_all_documents() replaced by delete_documents() #1377
    • Support OpenDistro init #1334
    • Integrate filters with knn queries in OpenDistroElasticsearchDocumentStore #1301
    • feat: add support for elastic search to connect without any authentication #1294
    • Raise warning when labels are overwritten #1257
    • Fix SQLAlchemy relationship warnings #1289
    • Added explicit refresh call during refresh_type is false in update em… #1259
    • Add id in write_labels() for SQLDocumentStore #1253
    • ElasticsearchDocumentStore get_label_count() bug fixed. #1252
    • SQLDocumentStore get_label_count() index bug fixed. #1251

    Retriever

    • Adding multi gpu support for DPR inference #1414
    • Ensure num_hard_negatives is 0 when embedding passages #1402
    • global_loss_buffer_size to the DensePassageRetriever, fix exceeds max_size #1245

    Summarizer

    • Transformer summarizer truncation bug fixed #1309

    Document Classifier

    • Add FARMClassifier node for Document Classification #1265

    Re-Ranker

    • Add SentenceTransformersRanker with pre-trained Cross-Encoder #1209

    Reader

    • Use Reader's device by default #1208

    Generator

    • Add QuestionGenerator #1267

    Evaluation

    • Add new QA eval metric: Semantic Answer Similarity (SAS) #1338

    REST API

    • Fix handling of filters in Search REST API #1431
    • Add support for Dense Retrievers in REST API Indexing Pipeline #1430
    • Add Header in sample REST API Search Request #1293
    • Fix convert integer CONCURRENT_REQUEST_PER_WORKER #1247
    • Env var CONCURRENT_REQUEST_PER_WORKER #1235
    • Small UI and REST API fixes #1223
    • Add scaffold for defining custom components for Pipelines #1205

    Docker

    • Update DocumentStore env in docker-compose #1450
    • Enable docker-compose for GPUs & Add public UI image #1406
    • Fix tesseract installation in Dockerfile #1405

    User Interface

    • Allow multiple files to upload for Haystack UI #1323
    • Add faq annotation #1333
    • Upgrade streamlit #1279

    Documentation and Tutorials

    • new docs version for 0.9.0 #1217
    • Added functionality for Google Colab usecase in Crawler Module #1436
    • Update sentence transformer model in FAQ tutorial #1401
    • crawler api docs updated. #1388
    • Add support for no Docker envs in Tutorial 13 #1365
    • Rag tutorial fixes #1375
    • Editing docs read.me for new docs website workflow #1372
    • Add query classifier usage docs #1348
    • Adding tutorial 13 and 14 #1364
    • Remove Finder from tutorials #1329
    • Tutorial1 remove finder class from import #1328
    • Update docstring for RAG #1149
    • Update README.md for tutorial 13 Question Generation #1325
    • add query classifier colab and jupyter notebook #1324
    • Remove pipeline eval example script #1297
    • Change variable names in tutorials #1286
    • Add links to tutorial 12 to readme #1274
    • Encapsulate tutorial code in method #1266
    • Fix Links #1199

    Misc

    • Improve document stores unit test parametrization #1202
    • Version tag added to Haystack #1216
    • Add type ignore to resolve mypy errors #1427
    • Bump pillow from 8.2.0 to 8.3.2 #1423
    • Add sentence-transformers as mandatory dependency and remove from dev… #1387
    • Adjust WeaviateDocumentStore import #1379
    • Update test documentation in readme #1355
    • Add tests for Crawler #1339
    • Suppress FAISS logs & apex warnings #1315
    • Pin Weaviate version #1306
    • Relax typing for meta data #1224

    🙏 Big thanks to all contributors! :heart:

    A big thank you to all the contributors for this release: @prikmm @akkefa @MichelBartels @hammer @ramgarg102 @bishalgaire @MarkusSagen @dfhssilva @srevinsaju @demarant @mosheber @MichaelBitard @guillim @vblagoje @stefanondisponibile @cambiumproject @bobvanluijt @tanay1337 @Timoeller @annagruendler @PiffPaffM @oryx1729 @bogdankostic @brandenchan @shahrukhx01 @julian-risch @tholor

    We would like to thank everyone who participated in the insightful discussions on GitHub and our community Slack!

  • v0.9.0(Jun 21, 2021)

    :star: Highlights

    Long-Form Question Answering (LFQA)

    Haystack now provides LFQA with a Seq2SeqGenerator for generative QA and a Retribert Retriever thanks to community member @vblagoje. #1086 If you would like to ask questions where the answer is not a short phrase explicitly given in one of the documents but a more elaborate answer, then LFQA is interesting for you. These elaborate answers are generated by combining information from multiple relevant documents.

    Document Re-Ranking

    For pure "semantic document search" use cases that do not need question answering functionality but only document ranking, there is now a new type of node: Ranker. While the Retriever is a perfect fit for document retrieval, we can further improve its results with the Ranker. #1025 To this end, the Ranker uses a pre-trained model to calculate the semantic similarity of the question and each of the top-k retrieved documents. Documents with a high semantic similarity are ranked higher. The combination of a Retriever and a Ranker is especially powerful if you combine a sparse retriever (e.g., the BM25-based ElasticsearchRetriever) with a dense Ranker. A pipeline with a Ranker and a Retriever can be set up in just a few lines of code:

    ...
    retriever = ElasticsearchRetriever(document_store=document_store)
    ranker = FARMRanker(model_name_or_path="deepset/gbert-base-germandpr-reranking")
    
    p = Pipeline()
    p.add_node(component=retriever, name="ESRetriever", inputs=["Query"])
    p.add_node(component=ranker, name="Ranker", inputs=["ESRetriever"])
    ...
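
The retrieve-then-rerank idea itself can be illustrated without any model. In the sketch below, `toy_similarity` is a deliberately crude stand-in for the pre-trained model a real Ranker would use, and both function names are hypothetical.

```python
from typing import Callable, List


def rerank(query: str, documents: List[str], score_fn: Callable[[str, str], float], top_k: int) -> List[str]:
    """Sort already-retrieved documents by a query-document score, best first."""
    ranked = sorted(documents, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_k]


def toy_similarity(query: str, document: str) -> float:
    """Stand-in scorer: fraction of query tokens that also occur in the document."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(document.lower().split())
    return len(query_tokens & doc_tokens) / len(query_tokens) if query_tokens else 0.0


# Pretend these came back from a sparse retriever, in retrieval order.
retrieved = [
    "Berlin is the capital of Germany",
    "Paris is the capital of France",
    "France borders Germany",
]
reranked = rerank("capital of France", retrieved, toy_similarity, top_k=2)
print(reranked)
```

The retriever keeps the candidate set small and cheap to produce, so the more expensive scoring step only has to look at the top-k documents.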
    

    Weaviate

    Thanks to a contribution by our community member @venuraja79, Weaviate is integrated into Haystack as another DocumentStore. #1064 It allows a combination of vector search and scalar filtering, i.e., you can filter for a certain tag and do dense retrieval on that subset. After starting a Weaviate server with docker, it's as simple as:

    from haystack.document_store import WeaviateDocumentStore
    document_store = WeaviateDocumentStore()
    

    Haystack uses the most recent Weaviate version 1.4.0 and the updating of embeddings has also been optimized #1181
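
What "filter for a certain tag and do dense retrieval on that subset" means can be sketched in a few lines of plain Python. This is a conceptual illustration with made-up two-dimensional embeddings and a hypothetical `filtered_dense_search` helper. It says nothing about how Weaviate implements the combination internally.

```python
from typing import Any, Dict, List


def dot(a: List[float], b: List[float]) -> float:
    """Dot-product similarity between two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def filtered_dense_search(
    query_emb: List[float],
    docs: List[Dict[str, Any]],
    filters: Dict[str, Any],
    top_k: int,
) -> List[Dict[str, Any]]:
    # Step 1: scalar filtering keeps only documents whose metadata matches every filter.
    candidates = [d for d in docs if all(d["meta"].get(k) == v for k, v in filters.items())]
    # Step 2: dense retrieval ranks the remaining subset by vector similarity.
    candidates.sort(key=lambda d: dot(query_emb, d["embedding"]), reverse=True)
    return candidates[:top_k]


docs = [
    {"id": "a", "meta": {"tag": "finance"}, "embedding": [0.9, 0.1]},
    {"id": "b", "meta": {"tag": "sports"}, "embedding": [1.0, 0.0]},
    {"id": "c", "meta": {"tag": "finance"}, "embedding": [0.2, 0.8]},
]
hits = filtered_dense_search([1.0, 0.0], docs, {"tag": "finance"}, top_k=2)
print([d["id"] for d in hits])
```

Document "b" is the closest vector overall, but it never reaches the ranking step because the filter removes it first.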

    Query Classifier

    Some search applications need to distinguish between keyword queries and longer textual questions that come in. If you only want to route longer questions to the Reader branch in order to maximize the accuracy of results and minimize computation efforts/costs and route keyword queries to a Document Retriever, you can do that now with a QueryClassifier node thanks to a contribution by @shahrukhx01. #1099 You could use it as shown in this exemplary pipeline: image

    New Tutorials

    1. Tutorial 11: Pipelines #991
    2. Tutorial 12: Generative QA with LFQA #1086

    ⚠️ Breaking Changes

    • Remove Python 3.6 support #1059
    • Refactor REST APIs to use Pipelines #922
    • Bump to FARM 0.8.0, torch 1.8.1 and transformers 4.6.1 #1192

    :nerd_face: Detailed Changes

    Connector

    • Add crawler to get texts from websites #775

    Preprocessor

    • Add white space normalization warning #1022
    • Preserve whitespace during PreProcessor.split() #1121
    • Fix equality check in preprocessor #969

    Pipeline

    • Add validation for root node in Pipeline #987
    • Fix passing a list as parameter value in Pipeline YAML #952
    • Add export of Pipeline YAML config #1003
    • Add config to JoinDocuments node to allow yaml export in pipelines #1134

    Document Stores

    • Integrate Weaviate as another DocumentStore #957 #1064
    • Add OpenDistro init #1101
    • Rename all document stores delete_all_documents() method to delete_documents #1047
    • Fix Elasticsearch connection for non-admin users #1028
    • Fix update_embeddings() for FAISSDocumentStore #978
    • Feature: Enable AWS Elasticsearch IAM connection #965
    • Fix optional FAISS import #971
    • Make FAISS import conditional #970
    • Benchmark milvus #850
    • Improve Milvus HNSW Performance #1127
    • Update Milvus benchmarks #1128
    • Upgrade milvus to 1.1.0 #1066
    • Update tests for FAISSDocumentStore #999
    • Add L2 support for FAISS HNSW #1138
    • Improve the speed of FAISSDocumentStore.delete_documents() #1095
    • Add options for handling duplicate documents (skip, fail, overwrite) #1088
    • Update Embeddings - Use update instead of replace #1181
    • Improve the progress bar in update_embeddings() + Fix filters in update_embeddings() #1063
    • Using text hash as id to prevent document duplication #1000

    Retriever

    • DPR Training parameter #989
    • Removed single_model_path; added infer_tokenizer to dpr load() #1060
    • Integrate sentence transformers into benchmarks #843
    • added use_amp to the train method, in order to use mixed precision training #1048

    Ranker

    • Re-ranking component for document search without QA #1025
    • Remove quickfix from reader and ranker #1196
    • Distinguish labels for calculating similarity scores #1124

    Query Classifier

    • Fix typo in Query Classifier Exception Message #1190
    • Add QueryClassifier incl. baseline models #1099

    Reader

    • Filtering duplicate answers #1021
    • Add ONNXRuntime support #157
    • Remove unused function _get_pseudo_prob #1201

    Generator

    • Integrate LFQA with Haystack - inferencing #1086

    Evaluation Nodes

    • Reduce precision in pipeline eval print functions #943
    • Fix division by zero error in EvalRetriever #938
    • Add evaluation nodes for Pipelines #904
    • Add More top_k handling to EvalDocuments #1133
    • Prevent merge of same questions on different documents during evaluation #1119

    REST API

    • adding root_path option #982
    • Add PDF converter dependencies Docker #1107
    • Disable Gunicorn preload option #960

    User Interface

    • change file-upload response to sidebar #1018
    • Add File Upload Functionality in UI #995
    • Streamlit UI Evaluation mode #920
    • Fix evaluation mode in UI #1024
    • Fix typo in streamlit UI #1106

    Documentation and Tutorials

    • Add about sections to Tutorial 12 #1195
    • Tutorial update #1166
    • Documentation update #1162
    • Add FAQ page #1151
    • Refresh API docs #1152
    • Add docu of confidence scores and calibration method #1131
    • Adding indentation to markup files #947
    • Update preprocessing.md #1087
    • Add badges to readme #1136
    • Regen api docs #1015
    • Docs: Add usage information detailes for aws elastic search service #1008
    • Add tutorial pages #1013
    • Pipelines tutorial #991
    • knowledge graph documentation #979
    • knowledge graph example #934
    • Add Milvus to the retriever / document store table #931
    • New docs version #964
    • Update Documentation #976
    • update api markdown files and add markdown file for ranker #1198
    • Reformat FAQ page #1177
    • Minor change with a link to the Weaviate docs #1180
    • Add links to GitHub Discussion and SO #984
    • Update milvus links and docstrings #959
    • Fixed link to dpr #962
    • Removed comma from last item in json list #1114
    • Fixing inconsistency #926

    Misc

    • Squad tools #1029
    • Bugfix setting of device by defaulting to "cpu" #1182
    • Fixing issues caused due to mypy upgrade #1165
    • Remove Duplicate Benchmark Run #1132
    • Fixing grpcio-tools to version of colab's pre-installed grpcio #1113
    • Update farm version #936

    🙏 Big thanks to all contributors! :heart:

    A big thank you to all the contributors for this release: @PiffPaffM @oryx1729 @jacksbox @guillim @Timoeller @aantti @tholor @brandenchan @julian-risch @bhadreshpsavani @akkefa @mosheber @lalitpagaria @Avi777 @MichaelBitard @AlviseSembenico @shahrukhx01 @venuraja79 @bobvanluijt @vblagoje @cvgoudar

    We would like to thank everyone who participated in the insightful discussions on GitHub and our community Slack!

  • v0.8.0(Apr 13, 2021)

    :star: Highlights

    This is a major Haystack release with many new features. The release blog post has a detailed summary. Below are the top highlights:

    Milvus Document Store

    Milvus is an open-source vector database. With the MilvusDocumentStore contributed by @lalitpagaria, embedding based Retrievers like the DensePassageRetriever or EmbeddingRetriever can use production-ready Milvus servers for large-scale deployments.

    Knowledge Graph

    An experimental integration for KnowledgeGraphs is introduced using GraphDB. The GraphDBKnowledgeGraph stores Triples and executes SPARQL queries. It can be integrated with Text2SparqlRetriever to convert natural language queries to SPARQL.

    Pipeline configuration with YAML

    The Pipelines can now be configured with YAML. This enables easier sharing of query & indexing configuration, reproducible setups, A/B testing of Pipelines, and moving from development to the production environment.

    REST APIs

    The REST APIs are revamped to use Pipelines for Query & Indexing files. The YAML configurations are in rest_api/pipelines.yaml. The new API endpoints are more generic to accommodate custom Pipeline configurations.

    Confidence Scores

    The answers now have a probability score that is better calibrated to the model's confidence. It ranges from 0 to 1, where 0 signifies very low confidence and 1 very high confidence.

    Web Crawler

    A Selenium based web crawler is now part of Haystack, thanks to @divya-19 for the contribution. It takes as input a list of URLs and converts extracted text to Haystack Documents.

    ⚠️ Breaking Changes

    REST APIs

    The REST APIs got a major revamp with this release.

    • /doc-qa & /faq-qa endpoints are replaced with a more generic POST /query endpoint. This new endpoint uses Pipelines under the hood, which can be configured in rest_api/pipelines.yaml.

    • The new /query endpoint expects a single query per request instead of a list of query strings. The new request format is:

      {
          "query": "Why did the revenue change?"
      }
      

      and the response looks like this:

      {
          "query": "Why did the revenue change?",
          "answers": [
              {
                  "answer": "rapid technological change and evolving industry standards",
                  "question": null,
                  "score": 0.543937623500824,
                  "probability": 0.014070278964936733,
                  "context": "tion process. The market for our products is intensely competitive and is characterized by rapid technological change and evolving industry standards.",
                  "offset_start": 91,
                  "offset_end": 149,
                  "offset_start_in_doc": 511,
                  "offset_end_in_doc": 569,
                  "document_id": "f30273b2-4d49-40d8-8824-43b3b6a0ea57",
                  "meta": {
                      "_split_id": "7"
                  }
              },
              {
                   // other answers
              }
          ]
      }
      
    • The /doc-qa-feedback & /faq-qa-feedback endpoints are replaced with a new generic /feedback endpoint.
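
A client for the new endpoint could be sketched with the standard library as follows. The host and port in `API_URL` are an assumption (adjust them to your deployment), both helper names are hypothetical, and the request is only built here, not sent. The second helper assumes that offset_start and offset_end are character positions within context, which is consistent with the sample response above when its context is written with single spaces.

```python
import json
import urllib.request

# Assumption: the REST API is reachable at this address.
API_URL = "http://localhost:8000/query"


def build_query_request(query: str) -> urllib.request.Request:
    """Build a POST /query request that carries a single query string."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def answer_from_offsets(answer: dict) -> str:
    """Recover the answer text from its context via the character offsets."""
    return answer["context"][answer["offset_start"]:answer["offset_end"]]


request = build_query_request("Why did the revenue change?")
# To actually send it: urllib.request.urlopen(request), then json.load() the response.

sample_answer = {
    "answer": "rapid technological change and evolving industry standards",
    "context": "tion process. The market for our products is intensely competitive and is "
               "characterized by rapid technological change and evolving industry standards.",
    "offset_start": 91,
    "offset_end": 149,
}
print(answer_from_offsets(sample_answer))
```

The offsets are handy for highlighting the answer inside its context in a UI.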

    Created At Timestamp

    Previously, all documents/labels in SQLDocumentStore and FAISSDocumentStore had a field called created to store the creation timestamp, while ElasticsearchDocumentStore did not have any timestamp field. Now, all document stores have a created_at field for documents and labels.

    RAGenerator

    The top_k_answers parameter in the RAGenerator is renamed to top_k for consistency across Haystack components.

    Custom Query for Elasticsearch

    The placeholder terms in custom_query should not have quotes around them. See more details here.

    :nerd_face: Detailed Changes

    Pipeline

    • Fix execution of Pipelines with parallel nodes #901 (@oryx1729)
    • Add abstract run method to basecomponent #887 (@tholor)
    • Add support for parallel paths in Pipeline #884 (@oryx1729)
    • Add runtime parameters to component initialization #873 (@oryx1729 )
    • Add support for indexing pipelines #816 (@oryx1729 )
    • Adding translator with many generic input parameter support #782 (@lalitpagaria)
    • Fix building Pipeline with YAML #800 (@oryx1729)
    • Load Pipeline with YAML config file #785 (@oryx1729)
    • Add evaluation nodes for Pipelines #904 (@brandenchan)
    • Fix passing a list as parameter value in Pipeline YAML #952 (@oryx1729)

    Document Store

    • Fixes elasticsearch auth #871 (@grafke)
    • Allow more options for elasticsearch client (auth, multiple hosts) #845 (@tholor)
    • Fix ElasticsearchDocumentStore.query_by_embedding() #823 (@oryx1729)
    • Introduce incremental updates for embeddings in document stores #812 (@oryx1729)
    • Add method to get metadata values for a key from Elasticsearch #776 (@oryx1729)
    • Fix refresh behaviour for Elasticsearch delete #794 (@oryx1729)
    • Milvus integration #771 (@lalitpagaria)
    • Add flag for use of window queries in SQLDocumentStore #768 (@oryx1729)
    • Remove quotes around placeholders in Elasticsearch custom query #762 (@oryx1729)
    • Fix delete_all_documents for the SQLDocumentStore #761 (@oryx1729)

    Retriever

    • Improve dpr conversion #826 (@Timoeller)
    • Fix DPR training batch size #898 (@brandenchan)
    • Upgrade FAISS to 1.7.0 #834 (@tholor)
    • Allow non-standard Tokenizers (e.g. CamemBERT) for DPR via new arg #811(@psorianom)

    Modeling

    • Add model versioning support #784 (@brandenchan)
    • Improve preprocessing and adding of eval data #780 (@Timoeller)
    • SQuAD to DPR dataset converter #765 (@psorianom)
    • Remove RAG todos after transformers update #781 (@Timoeller)
    • Update farm version #936 (@Timoeller)

    REST API

    • Refactor REST APIs to use Pipelines #922 (@oryx1729)
    • Add PDF converter in Dockerfiles #877 (@oryx1729)
    • Update GPU Dockerimage (Cuda 11, Fix faiss) #836 (@tholor)
    • Add API endpoint to export accuracy metrics from user feedback + created_at timestamp #803(@tholor)
    • Fix file upload API #808 (@oryx1729)

    File Converter

    • Add Markdown file convertor #875 (@lalitpagaria)
    • Fix encoding for pdftotext (Russian characters, German umlauts etc). Fix version in download instructions #813 (@tholor)

    Crawler

    • Add crawler to get texts from websites #775 (@DIVYA-19)

    Knowledge Graph

    • knowledge graph example #934 (@julian-risch)

    Annotation Tool

    • Annotation Tool: data is not persisted when using local version #853 #855(@venuraja79)

    Search UI

    • Fix UI when API returns fewer answers than expected #828(@tholor)

    CI

    • Revamp CI #825 (@oryx1729)
    • Fix mypy typing #792 (@oryx1729)
    • Fix pdftotext dependency in CI #788 (@tholor)

    Misc Fixes

    • Adding indentation to markup files #947 (@julian-risch)
    • Reduce precision in pipeline eval print functions #943 (@lewtun)
    • Fix division by zero error in EvalRetriever #938 (@lewtun)
    • Logged warning in Faiss and Milvus for filters #913 (@peteradorjan)
    • fixed "cannot allocate memory" exception by specifying max_processes #910(@mosheber)
    • Fix error when is_impossible not exist #870 (@voidful)
    • Fix validation for split_respect_sentence_boundary in Preprocessor #869 (@oryx1729)
    • Fix boolean progress_bar for disabling tqdm progressbar #863 (@tholor)
    • Remove conditional import of FAISS for Windows #819 (@oryx1729)
    • Make tqdm progress bars optional (less verbose prod logs) #796 (@tholor)
    • Fix error when is_impossible not is_impossible and json dump encoding error #868 (@voidful)
    • fix download ntlk preprocessor #852 (@mrtunguyen)

    Documentation

    • Add Milvus to the retriever / document store table #931 (@lewtun)
    • Fixing inconsistency #926 (@guillim)
    • Better default value for mp chunksize #923 (@Timoeller)
    • Run Grammarly over README.md #890 (@peterdemin)
    • Remove tf-idf youtube link #888 (@ms10596)
    • Add Milvus Documentation #838 (@brandenchan)
    • Fix link to Quick Demo in ToC. #831 (@aantti)
    • Revamp Readme #820 (@brandenchan)
    • Update tutorials (torch versions, ES version, replace Finder with Pipeline) #814 (@tholor)
    • Choose correct similarity fns during benchmark runs & re-run benchmarks #773 (@brandenchan)
    • Docs v0.7.0 #757 (@PiffPaffM)
    • Fix top_k param in RAG tutorials #906 (@Timoeller)
    • Integrate sentence transformers into benchmarks #843 (@Timoeller)

    🙏 Thanks to our contributors

    A big thank you to all the contributors for this release: @aantti, @brandenchan, @DIVYA-19, @grafke, @guillim, @julian-risch, @lalitpagaria, @lewtun, @mosheber, @mrtunguyen, @ms10596, @oryx1729, @peteradorjan, @PiffPaffM, @psorianom, @tholor, @Timoeller, @venuraja79, and @voidful.

    We would like to thank everyone who participated in the insightful discussions on GitHub and our community Slack!

  • v0.7.0(Jan 21, 2021)

    :star: Highlights

    New Slack Channel

    As many people in the community asked us for it, we decided to open a slack channel! Join us and ask questions, show what you've built with Haystack, and simply exchange with like-minded folks!

    :point_right: https://haystack.deepset.ai/community/join

    Optimizing Memory + CPU consumption of documentstores for large datasets (#733)

    Interacting with large datasets can be challenging for the local memory. Therefore, we ...

    1. ... add batch_size parameters for most methods of the document store, so that only smaller chunks of documents are loaded at a time
    2. ... add a get_all_documents_generator() method that "streams" documents one by one from your document store.

    Both help to lower the memory footprint significantly, especially when calling methods like update_embeddings() on datasets with more than 1 million docs.
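
The streaming idea can be sketched with an ordinary Python generator. Here a list stands in for the document store and `stream_documents` is a hypothetical function, so this shows the pattern rather than the actual document store code.

```python
from typing import Dict, Iterator, List


def stream_documents(store: List[Dict], batch_size: int = 2) -> Iterator[Dict]:
    """Yield documents one by one while fetching them in batches of `batch_size`."""
    for start in range(0, len(store), batch_size):
        # Only this slice is held in memory at a time; a real store would
        # run one query per batch here.
        batch = store[start:start + batch_size]
        for doc in batch:
            yield doc


store = [{"id": i, "text": f"doc {i}"} for i in range(5)]
ids = [doc["id"] for doc in stream_documents(store, batch_size=2)]
print(ids)
```

The caller iterates as if over one long list, while peak memory depends on `batch_size` instead of on the size of the dataset.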

    Add Simple Demo UI (#671)

    Thanks to our community member @tanmaylaud, we now have a great and simple UI that allows you to easily try your search pipelines. Ask questions, see the results, change basic config params, debug the API response and give your colleagues a better flavor of what you are building ...


    Support for summarization models (#698)

    Thanks to another community contribution from @lalitpagaria we now also support summarization models like PEGASUS in Haystack. You can use them ...

    ... standalone:

    docs = [Document(text="PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. "
                          "The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by "
                          "the shutoffs which were expected to last through at least midday tomorrow.")]
    
    summarizer = TransformersSummarizer(model_name_or_path="google/pegasus-xsum")
    summary = summarizer.predict(documents=docs, generate_single_summary=False)
    

    ... as a node in your pipeline:

    ...
    pipeline.add_node(component=summarizer, name="Summarizer", inputs=["Retriever"])
    
    

    ... by simply calling a predefined pipeline that first retrieves and then summarizes the resulting docs:

    ...
    pipe = SearchSummarizationPipeline(summarizer=summarizer, retriever=retriever)
    result = pipe.run(query="Why were the blackouts scheduled?")
    

    We see many interesting use cases around search for it. For example, running semantic document search and displaying the summary of docs as a "preview" in the results.

    New Tutorials

    1. Wonder how to train a DPR retriever on your own domain dataset? Check out this new tutorial!
    2. Proper preprocessing (Cleaning, Splitting etc.) of docs can have a big impact on your performance. Check out this new tutorial to learn more about it.

    :warning: Breaking Changes

    Dropping index_buffer_size from FAISSDocumentStore

    We removed the arg index_buffer_size from the init of FAISSDocumentStore. "Buffering" is now handled via the new batch_size arguments that you can pass to most methods like write_documents(), update_embeddings() and get_all_documents().

    Renaming of Preprocessor arg

    Old:

    PreProcessor(..., split_stride=5)
    

    New:

    PreProcessor(..., split_overlap=5)
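
What an overlap between neighbouring splits means can be shown with a small sliding-window function over words. This is a sketch of the general technique with a hypothetical `split_with_overlap` helper, not the PreProcessor's actual code.

```python
from typing import List


def split_with_overlap(words: List[str], split_length: int, split_overlap: int) -> List[List[str]]:
    """Cut `words` into windows of `split_length` that share `split_overlap` words."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    step = split_length - split_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + split_length])
        if start + split_length >= len(words):
            break  # the last window already reached the end of the text
    return chunks


words = "a b c d e f g".split()
chunks = split_with_overlap(words, split_length=4, split_overlap=2)
print(chunks)
```

Each chunk repeats the last two words of the previous one, so an answer that straddles a split boundary still appears whole in at least one chunk.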
    

    :nerd_face: Detailed Changes

    Preprocessing / File Conversion

    • Using PreProcessor functions on eval data #751

    DocumentStore

    • Support filters for DensePassageRetriever + InMemoryDocumentStore #754
    • use Path class in add_eval_data of haystack.document_store.base.py #745
    • Make batchwise adding of evaluation data possible #717
    • Change signature and docstring for ca_certs parameter #730
    • Rename label id field for elastic & add UPDATE_EXISTING_DOCUMENTS to API config #728
    • Fix SQLite errors in tests #723
    • Add support for custom embedding field for InMemoryDocumentStore #640
    • Using Columns names instead of ORM to get all documents #620

    Other

    • Generate docstrings and deploy to branches to Staging (Website) #731
    • Script for releasing docs #736
    • Increase FARM to Version 0.6.2 #755
    • Reduce memory consumption of fetch_archive_from_http #737
    • Add links to more resources #746
    • Fix Tutorial 9 #734
    • Adding a guard that prevents the tutorial from being executed in every subprocess on windows #729
    • Add ID to Label schema #727
    • Automate docstring and tutorial generation with every push to master #718
    • Pass custom label index name to REST API #724
    • Correcting pypi download badge #722
    • Fix GPU docker build #703
    • Remove sourcerer.io widget #702
    • Haystack logo is not visible on github mobile app #697
    • Update pipeline documentation and readme #693
    • Enable GPU args in tutorials #692
    • Add docs v0.6.0 #689

    Big thanks to all contributors :heart: !

    @Rob192 @antoniolanza1996 @tanmaylaud @lalitpagaria @Timoeller @tanaysoni @bogdankostic @aantti @brandenchan @PiffPaffM @julian-risch

  • v0.6.0(Dec 17, 2020)

    :star: Highlights

    Flexible Pipelines powered by DAGs (#596)

    In order to build modern search pipelines, you need two things: powerful building blocks and a flexible way to stick them together. While we always had great building blocks in Haystack, we didn't have a good way to stick them together so far. That's why we put a lot of thought into it in the last weeks and came up with a new Pipeline class that enables many new search scenarios beyond QA. The core idea: you can build a Directed Acyclic Graph (DAG) where each node is one "building block" (Reader, Retriever, Generator ...). Here's a simple example for a "standard" Open-Domain QA Pipeline:

    p = Pipeline()
    p.add_node(component=retriever, name="ESRetriever1", inputs=["Query"])
    p.add_node(component=reader, name="QAReader", inputs=["ESRetriever1"])
    res = p.run(query="What did Einstein work on?", top_k_retriever=1)
    
    

    You can draw the DAG to better inspect what you are building:

    p.draw(path="custom_pipe.png")
    


    Multiple retrievers

    You can now also use multiple Retrievers and join their results:

    p = Pipeline()
    p.add_node(component=es_retriever, name="ESRetriever", inputs=["Query"])
    p.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["Query"])
    p.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults", inputs=["ESRetriever", "DPRRetriever"])
    p.add_node(component=reader, name="QAReader", inputs=["JoinResults"])
    res = p.run(query="What did Einstein work on?", top_k_retriever=1)
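
To make the joining step more concrete, here is a rough sketch of what concatenating two result lists could look like. It does not use Haystack: plain dicts with an `id` key stand in for documents, and dropping duplicate ids while keeping first-seen order is an assumption made for this illustration, not a description of how `JoinDocuments` is implemented.

```python
def concatenate_results(*result_lists):
    """Merge several lists of documents, keeping the first copy of each id."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc in results:
            # Assumption: every document carries a unique "id" key.
            if doc["id"] not in seen:
                seen.add(doc["id"])
                merged.append(doc)
    return merged


# Made-up results from two hypothetical retrievers.
es_docs = [{"id": "a", "text": "Relativity"}, {"id": "b", "text": "Photoelectric effect"}]
dpr_docs = [{"id": "b", "text": "Photoelectric effect"}, {"id": "c", "text": "Brownian motion"}]

joined = concatenate_results(es_docs, dpr_docs)
print([doc["id"] for doc in joined])
```

The document found by both retrievers appears only once in the merged list, so the Reader does not process it twice.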
    


    Custom nodes

    You can easily build your own custom nodes. Just respect the following requirements:

    1. Add a method run(self, **kwargs) to your class. **kwargs will contain the output from the previous node in your graph.
    2. Do whatever you want within run() (e.g. reformatting the query)
    3. Return a tuple that contains your output data (for the next node) and the name of the outgoing edge: output_dict, "output_1"
    4. Add a class attribute outgoing_edges = 1 that defines the number of output options from your node. You only need a higher number here if you have a decision node (see below).
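
The four requirements above can be sketched as a small standalone class. This is a minimal illustration, not code from Haystack: the class name `QueryLowercaser` and its behavior are hypothetical, and it assumes that the incoming kwargs contain a `query` key, as in the pipeline examples above.

```python
class QueryLowercaser:
    """Hypothetical custom node that normalizes the query before retrieval."""

    # Requirement 4: a single outgoing edge, since this is not a decision node.
    outgoing_edges = 1

    def run(self, **kwargs):
        # Requirement 1: kwargs carries the output of the previous node.
        # Assumption: it contains a "query" key.
        output = dict(kwargs)
        # Requirement 2: arbitrary processing, here trimming and lowercasing.
        output["query"] = output["query"].strip().lower()
        # Requirement 3: return the output data plus the outgoing edge name.
        return output, "output_1"


node = QueryLowercaser()
result, edge = node.run(query="  What did Einstein work on? ", top_k_retriever=1)
print(result["query"], edge)
```

Calling `run()` directly like this is a convenient way to check a node in isolation before adding it to a pipeline with `add_node()`.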

    Decision nodes

    You can also add decision nodes, where only one "branch" is executed afterwards. This allows you, for example, to classify an incoming query and route it to different modules depending on the result:

        class QueryClassifier():
            outgoing_edges = 2
    
            def run(self, **kwargs):
                if "?" in kwargs["query"]:
                    return (kwargs, "output_1")
    
                else:
                    return (kwargs, "output_2")
    
        pipe = Pipeline()
        pipe.add_node(component=QueryClassifier(), name="QueryClassifier", inputs=["Query"])
        pipe.add_node(component=es_retriever, name="ESRetriever", inputs=["QueryClassifier.output_1"])
        pipe.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["QueryClassifier.output_2"])
        pipe.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults",
                      inputs=["ESRetriever", "DPRRetriever"])
        pipe.add_node(component=reader, name="QAReader", inputs=["JoinResults"])
        res = pipe.run(query="What did Einstein work on?", top_k_retriever=1)
    

    Default Pipelines (replacing the "Finder")

    Last but not least, we added some "Default Pipelines" that allow you to run standard patterns with very few lines of code. This is replacing the Finder class which is now deprecated.

    from haystack.pipeline import DocumentSearchPipeline, ExtractiveQAPipeline, GenerativeQAPipeline, FAQPipeline, Pipeline, JoinDocuments
    
    # Extractive QA
    qa_pipe = ExtractiveQAPipeline(reader=reader, retriever=retriever)
    res = qa_pipe.run(query="When was Kant born?", top_k_retriever=3, top_k_reader=5)
    
    # Document Search
    doc_pipe = DocumentSearchPipeline(retriever=retriever)
    res = doc_pipe.run(query="Physics Einstein", top_k_retriever=1)
    
    # Generative QA
    doc_pipe = GenerativeQAPipeline(generator=rag_generator, retriever=retriever)
    res = doc_pipe.run(query="Physics Einstein", top_k_retriever=1)
    
    # FAQ based QA
    doc_pipe = FAQPipeline(retriever=retriever)
    res = doc_pipe.run(query="How can I change my address?", top_k_retriever=3)
    
    

    We plan many more features around the new pipelines incl. parallelized execution, distributed execution, definition via YAML files, dry runs ...

    New DocumentStore for the Open Distro of Elasticsearch (#676)

    From now on we also support the Open Distro of Elasticsearch. This allows you to use many of the hosted Elasticsearch services (e.g. from AWS) more easily with Haystack. Usage is similar to the regular ElasticsearchDocumentStore:

    document_store = OpenDistroElasticsearchDocumentStore(host="localhost", port="9200", ...)
    

    :warning: Breaking Changes

    As Haystack is extending from QA to further search types, we decided to rename all parameters from question to query. This includes for example the predict() methods of the Readers but also several other places. See #614 for details.
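
For code that still passes `question=...`, a thin compatibility shim can ease the migration. The decorator below is purely hypothetical and not part of Haystack, and `fake_predict` is only a stand-in for any callable that expects `query` after the rename.

```python
import functools


def rename_question_kwarg(func):
    """Hypothetical shim: forward a legacy `question` kwarg as `query`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if "question" in kwargs:
            if "query" in kwargs:
                raise TypeError("Pass either 'question' or 'query', not both.")
            kwargs["query"] = kwargs.pop("question")
        return func(*args, **kwargs)
    return wrapper


@rename_question_kwarg
def fake_predict(query, top_k=1):
    # Stand-in for a method that accepts `query` after the rename.
    return {"query": query, "top_k": top_k}


legacy = fake_predict(question="When was Kant born?", top_k=3)
current = fake_predict(query="When was Kant born?", top_k=3)
print(legacy == current)
```

A shim like this is best treated as temporary. Updating the call sites to `query` directly keeps the code aligned with the new parameter names.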

    :nerd_face: Detailed Changes

    Preprocessing / File Conversion

    • Redone: Fix concatenation of sentences in PreProcessor. Add stride for word-based splits with sentence boundaries #641
    • Add needed whitespace before sentence start #582

    DocumentStore

    • Scale dot product into probabilities #667
    • Add refresh_type param for Elasticsearch update_embeddings() #630
    • Add return_embedding parameter for get_all_documents() #615
    • Adding support for update_existing_documents to sql and faiss document stores #584
    • Add filters for delete_all_documents() #591

    Retriever

    • Fix saving tokenizers in DPR training + unify save and load dirs #682
    • fix a typo, num_negatives -> num_positives #681
    • Refactor DensePassageRetriever._get_predictions #642
    • Move DPR embeddings from GPU to CPU straight away #618
    • Add MAP retriever metric for open-domain case #572

    Reader / Generator

    • add GPU support for rag #669
    • Enable dynamic parameter updates for the FARMReader #650
    • Add option in API Config to configure if reader can return "No Answer" #609
    • Fix various generator issues #590

    Pipeline

    • Add support for building custom Search Pipelines #596
    • Add set_node() for Pipeline #659
    • Add support for aggregating scores in JoinDocuments node #683
    • Add pipelines for GenerativeQA & FAQs #645

    Other

    • Cleanup Pytest Fixtures #639
    • Add latest benchmark run #652
    • Fix image links in tutorial #663
    • Update query arg in Tutorial 7 #656
    • Fix benchmarks #648
    • Add link to FAISS Info in documentation #643
    • Improve User Feedback Documentation #539
    • Add formatting checks for shell scripts #627
    • Update md files for API docs #631
    • Clean API docs and increase coverage #621
    • Add boxes for recommendations #629
    • Automate benchmarks via CML #518
    • Add contributor hall of fame #628
    • README: Fix link to roadmap #626
    • Fix docstring examples #604
    • Cleaning the api docs #616
    • Fix link to DocumentStore page #613
    • Make more changes to documentation #578
    • Remove column in benchmark website #608
    • Make benchmarks clearer #606
    • Fixing defaults configs for rest_apis #583
    • Allow list of filter values in REST API #568
    • Fix CI bug due to new Elasticsearch release and new model release #579
    • Update Colab Torch Version #576

    :heart: Big thanks to all contributors!

    @sadakmed @Krak91 @icy @lalitpagaria @guillim @tanaysoni @tholor @timoeller @PiffPaffM @bogdankostic

  • v0.5.0(Nov 6, 2020)

    Highlights

    :speech_balloon: Generative Question Answering via RAG (#484)

    Thanks to our community member @lalitpagaria, Haystack now also supports generative QA via Retrieval Augmented Generation ("RAG"). Instead of "finding" the answer within a document, these models generate the answer. In that sense, RAG follows a similar approach to GPT-3, but it comes with two huge advantages for real-world applications: a) it has a manageable model size, and b) the answer generation is conditioned on retrieved documents, i.e. the model can easily adjust to domain documents even after training has finished (in contrast, GPT-3 relies on the web data seen during training).

    Example:

        question = "who got the first nobel prize in physics?"
    
        # Retrieve related documents from retriever
        retrieved_docs = retriever.retrieve(query=question)
    
        # Now generate answer from question and retrieved documents
        predicted_result = generator.predict(
            question=question,
            documents=retrieved_docs,
            top_k=1
        )
    

    You can already play around with it in this minimal tutorial.

    We are looking forward to improving this class of models further in the next months and already plan a tighter integration into the Finder class.

    :arrow_upper_right: Better DPR (incl. training) (#527)

    We migrated the existing DensePassageRetriever to its own pipeline based on FARM. This allows better modularization and, most importantly, simple training of DPR models! You can either train models from scratch or take an existing DPR model and fine-tune it on your own domain data. The required training data consists of queries and positive passages (i.e. passages that are related to your query / contain the answer), and the format complies with the one in the original DPR codebase.

    Example:

    dense_passage_retriever.train(self,
                                  data_dir: str,
                                  train_filename: str,
                                  dev_filename: str = None,
                                  test_filename: str = None,
                                  batch_size: int = 16,
                                  embed_title: bool = True,
                                  num_hard_negatives: int = 1,
                                  n_epochs: int = 3)
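
For orientation, a single training record in the layout of the original DPR codebase might look like the sketch below. The field names (`question`, `answers`, `positive_ctxs`, `negative_ctxs`, `hard_negative_ctxs`) follow the training files published with DPR; the content, the temporary directory and the filename are made up for illustration, so check them against the data you actually use.

```python
import json
import os
import tempfile

# One training sample. Field names follow the original DPR training files;
# the content itself is illustrative.
sample = {
    "question": "who got the first nobel prize in physics?",
    "answers": ["Wilhelm Conrad Röntgen"],
    "positive_ctxs": [
        {
            "title": "Nobel Prize in Physics",
            "text": "The first Nobel Prize in Physics was awarded in 1901 "
                    "to Wilhelm Conrad Röntgen.",
        }
    ],
    "negative_ctxs": [],
    "hard_negative_ctxs": [
        {
            "title": "Nobel Prize in Chemistry",
            "text": "The first Nobel Prize in Chemistry was awarded in 1901 "
                    "to Jacobus Henricus van 't Hoff.",
        }
    ],
}

# Hypothetical location and filename; these are the values that would be
# passed as data_dir and train_filename.
data_dir = tempfile.mkdtemp()
train_filename = "train.json"
path = os.path.join(data_dir, train_filename)

with open(path, "w", encoding="utf-8") as f:
    json.dump([sample], f, ensure_ascii=False, indent=2)

# Read the file back to confirm it is valid JSON with the expected fields.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)
print(len(loaded), sorted(loaded[0]))
```

With `num_hard_negatives=1`, as in the signature above, each record would need at least one entry under `hard_negative_ctxs`.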
    
    

    Future improvements: At the moment training is only supported on single GPUs. We will add support for Multi-GPU Training via DDP soon.

    📊 New Benchmarks

    We are happy to introduce a new benchmark section on our website! Do you wonder whether you should use BERT, RoBERTa or MiniLM for your reader? Is it worth using DPR for retrieval instead of Elastic's BM25? How would this impact speed and accuracy?

    See the relevant metrics here to guide your decision: :point_right: https://haystack.deepset.ai/bm/benchmarks

    We will extend this section over time with more models, metrics and key parameters.

    :warning: Breaking Changes

    Consistent parameter naming for TransformersReader #510

    # old
    TransformersReader(model="distilbert-base-uncased-distilled-squad", ...)

    # new
    TransformersReader(model_name_or_path="distilbert-base-uncased-distilled-squad", ...)
    

    FAISS: Remove phi normalization, support more index types #467

    New default index type is "Flat" and params have changed slightly:

    
    # old
    FAISSDocumentStore(
            sql_url: str = "sqlite:///",
            index_buffer_size: int = 10_000,
            vector_size: int = 768,
            faiss_index: Optional[IndexHNSWFlat] = None,
    )

    # new
    FAISSDocumentStore(
            sql_url: str = "sqlite:///",
            index_buffer_size: int = 10_000,
            vector_dim: int = 768,
            faiss_index_factory_str: str = "Flat",
            faiss_index: Optional[faiss.swigfaiss.Index] = None,
            return_embedding: Optional[bool] = True,
            **kwargs,
    )
    

    DPR signature

    Splitting max_seq_len into two independent params. Removing remove_sep_tok_from_untitled_passages param.

    # old
    DensePassageRetriever(
                     document_store: BaseDocumentStore,
                     query_embedding_model: str = "facebook/dpr-question_encoder-single-nq-base",
                     passage_embedding_model: str = "facebook/dpr-ctx_encoder-single-nq-base",
                     max_seq_len: int = 256,
                     use_gpu: bool = True,
                     batch_size: int = 16,
                     embed_title: bool = True,
                     remove_sep_tok_from_untitled_passages: bool = True
                     )
    
    # new
    DensePassageRetriever(
                     document_store: BaseDocumentStore,
                     query_embedding_model: Union[Path, str] = "facebook/dpr-question_encoder-single-nq-base",
                     passage_embedding_model: Union[Path, str] = "facebook/dpr-ctx_encoder-single-nq-base",
                     max_seq_len_query: int = 64,
                     max_seq_len_passage: int = 256,
                     use_gpu: bool = True,
                     batch_size: int = 16,
                     embed_title: bool = True,
                     use_fast_tokenizers: bool = True,
                     similarity_function: str = "dot_product"
                     )
    

    Detailed Changes

    Preprocessing / File Conversion

    • Add preprocessing pipeline #473
    • Restructure checks in PreProcessor #504
    • Updated the example code to Indexing PDF / Docx files #502
    • Fix meta data = None in PreProcessor #496
    • add explicit encoding mode to file_converter/txt.py #478
    • Skip file conversion if file type is not supported #456

    DocumentStore

    • Add support for MySQL database #556
    • Allow configuration of Elasticsearch Analyzer (e.g. for other languages) #554
    • Add support to return embedding #514
    • Fix scoring in Elasticsearch for dot product #517
    • Allow filters for get_document_count() #512
    • Make creation of label index optional #490
    • Fix update_embeddings function in FAISSDocumentStore #481
    • FAISS Store: allow multiple write calls and fix potential memory leak in update_embeddings #422
    • Enable bulk operations on vector IDs for FAISSDocumentStore #460
    • fixing ElasticsearchDocumentStore initialisation #415
    • bug: filters on a query_by_embedding #464

    Retriever

    • DensePassageRetriever: Add Training, Refactor Inference to FARM modules #527
    • Fix retriever evaluation metrics #547
    • Add save and load method for DPR #550
    • Typo in dense.py comment #545
    • Make returning predictions in Finder & Retriever eval() possible #524
    • Make title info optional when evaluating on QA data #494
    • Make sentence-transformers usage more user-friendly #439

    Reader

    • Fix FARMReader.eval() handling of no_answers #531
    • Added automatic mixed precision (AMP) support for reader training from Haystack side #463
    • Update ONNX conversion for FARMReader #438

    Other

    • Fix sentencepiece dependencies in Dockerfiles #553
    • Update Dockerfile #537
    • Removing (deprecated) warnings from the Haystack codebase. #530
    • Pytest fix memory leak and put pytest marker on slow tests #520
    • [enhancement] Create deploy_website.yml #450
    • Add Docker Images & Setup for the Annotation Tool #444

    REST API

    • Make filter value optional in REST API #497
    • Add Elasticsearch Query DSL compliant Query API #471
    • Allow configuration of log level in REST API #541
    • Add create_index and similarity metric to api config #493
    • Add deepcopy for meta dicts in answers #485
    • Fix windows platform installation #480
    • Update GPU docker & fix race condition with multiple workers #436

    Documentation / Benchmarks / Tutorials

    • New readme #534
    • Add public roadmap #432
    • Time and performance benchmarks for all readers and retrievers #339
    • Added new formatting for examples in docstrings #555
    • Update annotation docs for website #505
    • Add annotation tool manual to README.md #523
    • Change metric to queries per second on benchmarks webpage #529
    • Add --ci and --update-json to CLI for benchmarks #522
    • Add requirement to colab notebooks #509
    • Update doc string for ElasticsearchDocumentStore.write_documents() & sync markdown files #501
    • Add versioning docs #495
    • READ.me for Docstring Generation #468
    • Separate data and view for benchmarks #451
    • Update DPR docstring for embed_title #459
    • Update Tutorial4_FAQ_style_QA.py #416

    :heart: Big thanks to all contributors!

    @lalitpagaria @guillim @elyase @kolk @rsanjaykamath @antoniolanza1996 @Zenahr @Futurne @tanaysoni @tholor @timoeller @PiffPaffM @bogdankostic
