# Cherche

Neural search
Cherche (meaning "search" in French) allows you to create a neural search pipeline using retrievers and pre-trained language models as rankers. Cherche is meant to be used with small to medium-sized corpora. Cherche's main strength is its ability to build diverse, end-to-end pipelines.
## 🤖 Installation

```sh
pip install cherche
```

To install the development version:

```sh
pip install git+https://github.com/raphaelsty/cherche
```
## 📜 Documentation

Documentation is available here. It provides details about retrievers, rankers, pipelines, question answering, summarization, and examples.
## 💨 QuickStart

### 📑 Documents

Cherche allows finding the right document within a list of objects. Here is an example of a corpus.

```python
from cherche import data

documents = data.load_towns()
documents[:3]
```
```python
[{'id': 0,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'Paris is the capital and most populous city of France.'},
 {'id': 1,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': "Since the 17th century, Paris has been one of Europe's major centres of science, and arts."},
 {'id': 2,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The City of Paris is the centre and seat of government of the region and province of Île-de-France.'}]
```
### 🔍 Retriever ranker

Here is an example of a neural search pipeline composed of a TfIdf retriever that quickly filters documents, followed by a ranking model. The ranking model sorts the documents produced by the retriever based on the semantic similarity between the query and the documents.
```python
from cherche import data, retrieve, rank
from sentence_transformers import SentenceTransformer

# List of dicts
documents = data.load_towns()

# Retrieve on fields title and article
retriever = retrieve.TfIdf(key="id", on=["title", "article"], documents=documents, k=30)

# Rank on fields title and article
ranker = rank.Encoder(
    key="id",
    on=["title", "article"],
    encoder=SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
    k=3,
    path="encoder.pkl",
)

# Pipeline creation
search = retriever + ranker

search.add(documents=documents)

search("Bordeaux")
```
```python
[{'id': 57, 'similarity': 0.69513476},
 {'id': 63, 'similarity': 0.6214991},
 {'id': 65, 'similarity': 0.61809057}]
```
Map the index to the documents to access their contents.
```python
search += documents
search("Bordeaux")
```
```python
[{'id': 57,
  'title': 'Bordeaux',
  'url': 'https://en.wikipedia.org/wiki/Bordeaux',
  'article': 'Bordeaux ( bor-DOH, French: [bɔʁdo] (listen); Gascon Occitan: Bordèu [buɾˈðɛw]) is a port city on the river Garonne in the Gironde department, Southwestern France.',
  'similarity': 0.69513476},
 {'id': 63,
  'title': 'Bordeaux',
  'url': 'https://en.wikipedia.org/wiki/Bordeaux',
  'article': 'The term "Bordelais" may also refer to the city and its surrounding region.',
  'similarity': 0.6214991},
 {'id': 65,
  'title': 'Bordeaux',
  'url': 'https://en.wikipedia.org/wiki/Bordeaux',
  'article': "Bordeaux is a world capital of wine, with its castles and vineyards of the Bordeaux region that stand on the hillsides of the Gironde and is home to the world's main wine fair, Vinexpo.",
  'similarity': 0.61809057}]
```
## 👻 Retrieve

Cherche provides different retrievers that filter input documents based on a query; a sketch of swapping one retriever for another follows the list below.
- retrieve.Elastic
- retrieve.TfIdf
- retrieve.Lunr
- retrieve.BM25Okapi
- retrieve.BM25L
- retrieve.Flash
- retrieve.Encoder
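Retrievers share a common constructor pattern. Here is a minimal sketch of swapping in the Lunr retriever, assuming retrieve.Lunr accepts the same key, on, documents, and k parameters as the retrieve.TfIdf example above:

```python
from cherche import data, retrieve

documents = data.load_towns()

# Assumption: retrieve.Lunr follows the same constructor pattern as retrieve.TfIdf.
retriever = retrieve.Lunr(key="id", on=["title", "article"], documents=documents, k=30)

# A retriever is called directly with a query and returns matching keys.
retriever("capital of France")
```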
## 🤗 Rank

Cherche rankers are compatible with SentenceTransformers models, Hugging Face sentence similarity models, Hugging Face zero-shot classification models, and of course with your own models.
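For example, here is a minimal sketch of a pipeline with a zero-shot ranker, assuming rank.ZeroShot wraps a Hugging Face zero-shot-classification pipeline and accepts the same key, on, and k parameters as rank.Encoder; the model name is an illustrative choice, not one prescribed by Cherche:

```python
from cherche import data, rank, retrieve
from transformers import pipeline

documents = data.load_towns()

retriever = retrieve.TfIdf(key="id", on=["title", "article"], documents=documents, k=30)

# Assumption: rank.ZeroShot takes a Hugging Face zero-shot-classification
# pipeline and the same key / on / k parameters as rank.Encoder.
ranker = rank.ZeroShot(
    key="id",
    on=["title", "article"],
    encoder=pipeline("zero-shot-classification", model="typeform/distilbert-base-uncased-mnli"),
    k=3,
)

search = retriever + ranker
search.add(documents=documents)
search("Bordeaux wine")
```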
## Summarization and question answering
Cherche provides modules dedicated to summarization and question answering. These modules are compatible with Hugging Face's pre-trained models and can be fully integrated into neural search pipelines.
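As a sketch of what such a pipeline could look like, assuming a qa.QA module that wraps a Hugging Face question-answering pipeline and reads answers from a given field (the parameters and model shown here are assumptions, not confirmed API):

```python
from cherche import data, qa, retrieve
from transformers import pipeline

documents = data.load_towns()

retriever = retrieve.TfIdf(key="id", on=["title", "article"], documents=documents, k=3)

# Assumption: qa.QA accepts a Hugging Face question-answering pipeline and the
# field to extract answers from; mapping the documents gives it access to the text.
question_answering = qa.QA(
    model=pipeline("question-answering", model="deepset/roberta-base-squad2"),
    on="article",
)

search = retriever + documents + question_answering
search("What is the capital of France?")
```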
## 👏 Acknowledgements

The BM25 models available in Cherche are wrappers around rank_bm25. The Elastic retriever is a wrapper around the Python Elasticsearch Client. The TfIdf retriever is a wrapper around scikit-learn's TfidfVectorizer. The Lunr retriever is a wrapper around Lunr.py. The Flash retriever is a wrapper around FlashText. The DPR and Encoder rankers are wrappers dedicated to the use of the pre-trained models of SentenceTransformers in a neural search pipeline. The ZeroShot ranker is a wrapper dedicated to the use of the zero-shot sequence classifiers of Hugging Face in a neural search pipeline.
## 👀 See also

Cherche is a minimalist solution and meets a need for modularity. Cherche is the way to go if you start with a list of documents as JSON with multiple fields to search on and want to create pipelines. Also, Cherche is well suited for medium-sized corpora.

Do not hesitate to look at Haystack, Jina, or TxtAi, which offer very advanced solutions for neural search and are great.
## 💾 Dev Team

The Cherche dev team is made up of Raphaël Sourty and François-Paul Servant.