An easier way to build neural search on the cloud
Jina is a deep learning-powered search framework for building cross-/multi-modal search systems (e.g. text, images, video, audio) on the cloud.
Docs • Hello World • Quick Start • Learn • Examples • Contribute • Jobs • Website • Slack
Installation
| | x86/64, arm/v6, v7, v8 (Apple M1) on Linux/macOS with Python 3.7/3.8/3.9 | Docker Users |
|---|---|---|
| Standard | pip install -U jina | docker run jinaai/jina:latest |
| Daemon | pip install -U "jina[daemon]" | docker run --network=host jinaai/jina:latest-daemon |
| With Extras | pip install -U "jina[devel]" | docker run jinaai/jina:latest-devel |
| Dev/Pre-Release | pip install --pre jina | docker run jinaai/jina:master |
Version identifiers are explained here. To install Jina with extra dependencies please refer to the docs. Jina can run on Windows Subsystem for Linux. We welcome the community to help us with native Windows support.
👋
🌍
Jina "Hello, World!" Just starting out? Try Jina's "Hello, World" - a simple image neural search demo for Fashion-MNIST. No extra dependencies needed, simply run:
jina hello-world # more options in --help
...or even easier for Docker users, no install required:
docker run -v "$(pwd)/j:/j" jinaai/jina hello-world --workdir /j && open j/hello-world.html
# replace "open" with "xdg-open" on Linux
Covid-19 Chatbot
For NLP engineers, we provide a simple chatbot demo for answering Covid-19 questions. You will need PyTorch and Transformers, which can be installed along with Jina:
pip install "jina[torch,transformers]"
jina hello-world-chatbot
This downloads the CovidQA dataset and tells Jina to index 418 question-answer pairs with DistilBERT. Indexing takes about one minute on CPU. It then opens a web page where you can type in questions and ask Jina.
Get Started
🥚
Fundamental
CRUD Functions
First we look at basic CRUD operations. In Jina, CRUD corresponds to four functions: index (create), search (read), update, and delete. Take the Documents below as an example:
import numpy as np
from jina import Document
docs = [Document(id='🐲', embedding=np.array([0, 0]), tags={'guardian': 'Azure Dragon', 'position': 'East'}),
        Document(id='🐦', embedding=np.array([1, 0]), tags={'guardian': 'Vermilion Bird', 'position': 'South'}),
        Document(id='🐢', embedding=np.array([0, 1]), tags={'guardian': 'Black Tortoise', 'position': 'North'}),
        Document(id='🐯', embedding=np.array([1, 1]), tags={'guardian': 'White Tiger', 'position': 'West'})]
Let's build a Flow with a simple indexer:
from jina import Flow
f = Flow().add(uses='_index')
Document and Flow are basic concepts in Jina, which will be explained later. _index is a built-in embedding + structured storage that one can use out of the box.
Index
# save four docs (both embedding and structured info) into storage
with f:
    f.index(docs, on_done=print)

Search
# retrieve the top-3 neighbours of 🐲; this prints 🐲🐦🐢 with scores 0, 1, 1 respectively
with f:
    f.search(docs[0], top_k=3, on_done=lambda x: print(x.docs[0].matches))
{"id": "🐲", "tags": {"guardian": "Azure Dragon", "position": "East"}, "embedding": {"dense": {"buffer": "AAAAAAAAAAAAAAAAAAAAAA==", "shape": [2], "dtype": "<i8"}}, "score": {"opName": "NumpyIndexer", "refId": "🐲"}, "adjacency": 1}
{"id": "🐦", "tags": {"position": "South", "guardian": "Vermilion Bird"}, "embedding": {"dense": {"buffer": "AQAAAAAAAAAAAAAAAAAAAA==", "shape": [2], "dtype": "<i8"}}, "score": {"value": 1.0, "opName": "NumpyIndexer", "refId": "🐲"}, "adjacency": 1}
{"id": "🐢", "tags": {"guardian": "Black Tortoise", "position": "North"}, "embedding": {"dense": {"buffer": "AAAAAAAAAAABAAAAAAAAAA==", "shape": [2], "dtype": "<i8"}}, "score": {"value": 1.0, "opName": "NumpyIndexer", "refId": "🐲"}, "adjacency": 1}

Update
# update 🐲's embedding in the storage
docs[0].embedding = np.array([1, 1])
with f:
    f.update(docs[0])

Delete
# remove 🐦 and 🐲 from the storage
with f:
    f.delete(['🐦', '🐲'])
Document
Document is Jina's primitive data type. It can contain text, an image, an ndarray, an embedding or a URI, accompanied by rich meta information. To construct a Document, one can use:
import numpy
from jina import Document

text_from_file = 'print("hello, world")'  # any string will do, e.g. the content of a .py file
doc1 = Document(content=text_from_file, mime_type='text/x-python')  # a text Document that contains Python code
doc2 = Document(content=numpy.random.random([10, 10]))  # an ndarray Document
A Document can recurse both vertically and horizontally, i.e. it can hold nested sub-documents and matched documents. To better see the recursive structure of a Document, one can use the .plot() function. If you are using JupyterLab/Notebook, all Document objects are auto-rendered.
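As a quick illustration, here is a minimal sketch of building such a recursive structure by hand (this assumes the chunks and matches containers expose an append method; the contents are made up for the example):

from jina import Document

root = Document(content='i am the root document')
root.chunks.append(Document(content='i am a nested sub-document'))  # vertical recursion
root.matches.append(Document(content='i am a matched document'))    # horizontal recursion
root.plot()  # visualize the recursive structure; auto-rendered in JupyterLab/Notebook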
MultimodalDocument
A MultimodalDocument is a Document composed of multiple Documents from different modalities (e.g. text, image, audio).
Jina provides multiple ways to build a multimodal Document. For example, one can provide the modality names and the content in a dict:
import PIL.Image
from jina import MultimodalDocument

document = MultimodalDocument(modality_content_map={
    'title': 'my holiday picture',
    'description': 'the family having fun on the beach',
    'image': PIL.Image.open('path/to/image.jpg')
})
One can also compose a MultimodalDocument from multiple Documents directly:
import PIL.Image
from jina.types import Document, MultimodalDocument

doc_title = Document(content='my holiday picture', modality='title')
doc_desc = Document(content='the family having fun on the beach', modality='description')
doc_img = Document(content=PIL.Image.open('path/to/image.jpg'), modality='image')
doc_img.tags['date'] = '10/08/2019'

document = MultimodalDocument(chunks=[doc_title, doc_desc, doc_img])
Fusion Embeddings from Different Modalities
To extract fused embeddings from different modalities, Jina provides the BaseMultiModalEncoder abstract class, which has a unique encode interface:
def encode(self, *data: 'numpy.ndarray', **kwargs) -> 'numpy.ndarray':
...
The MultimodalDriver provides data to the MultimodalDocument in the correct, expected order. In the example below, the image embedding is passed to the encoder as the first argument and the text as the second.
!MyMultimodalEncoder
with:
  positional_modality: ['image', 'text']
requests:
  on:
    [IndexRequest, SearchRequest]:
      - !MultiModalDriver {}
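For illustration, a minimal sketch of such an encoder might look like the following; the concatenation-based fusion and the import path of BaseMultiModalEncoder are assumptions for this example, not the approach used in the TIRG example referenced below:

import numpy as np
from jina.executors.encoders.multimodal import BaseMultiModalEncoder  # import path is an assumption

class MyMultimodalEncoder(BaseMultiModalEncoder):
    def encode(self, *data: 'np.ndarray', **kwargs) -> 'np.ndarray':
        # with positional_modality: ['image', 'text'], data[0] carries the image
        # content and data[1] the text content, batched along the first axis
        image, text = data
        # toy "fusion": flatten each modality and concatenate along the last axis
        return np.concatenate([image.reshape(len(image), -1),
                               text.reshape(len(text), -1)], axis=-1)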
Interested readers can refer to the jina-ai/examples project "How to build a multimodal search engine for image retrieval using TIRG (Composing Text and Image for Image Retrieval)" for the usage of MultimodalDriver and BaseMultiModalEncoder in practice.
Flow
Jina provides a high-level Flow API to simplify building CRUD workflows. To create a new Flow:
from jina import Flow
f = Flow().add()
This creates a simple Flow with one Pod. You can chain multiple .add()s in a single Flow.
To visualize the Flow, simply chain it with .plot('my-flow.svg'). If you are using a Jupyter notebook, the Flow object is displayed inline without plot.
Gateway is the entrypoint of the Flow.
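For instance, a minimal sketch of a two-Pod Flow and its visualization (the Pod names here are arbitrary):

from jina import Flow

f = (Flow()
     .add(name='step1')
     .add(name='step2'))
f.plot('my-flow.svg')  # writes the Flow topology (Gateway -> step1 -> step2) to an SVG file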
Get the vibe? Now we are talking! Let's learn more about the basic concepts and features in Jina.
🐣
Basic
Feed Data
To use a Flow, open it via the with context manager, like you would open a file in Python. Now let's create some empty Documents and index them:
from jina import Flow, Document

with Flow().add() as f:
    f.index((Document() for _ in range(10)))
Flow supports the CRUD operations: index, search, update, delete. It also provides sugary syntax for ndarray, CSV, ndjson and arbitrary files.
Input: numpy.ndarray
with f:
    f.index_ndarray(numpy.random.random([4, 2]))
Input four Documents; document.blob holds one row of the ndarray.

Input: CSV
with f, open('index.csv') as fp:
    f.index_csv(fp, field_resolver={'pic_url': 'uri'})
Each line in the CSV is constructed as a Document; the field_resolver maps the CSV field pic_url to document.uri.

Input: JSON Lines/ndjson/LDJSON
with f, open('index.ndjson') as fp:
    f.index_ndjson(fp, field_resolver={'question_id': 'id'})
Each line in the ndjson file is constructed as a Document; the field_resolver maps the JSON field question_id to document.id.

Input: Files with wildcards
with f:
    f.index_files(['/tmp/*.mp4', '/tmp/*.pdf'])
Each file captured by the wildcards is constructed as a Document.
Fetch Result
Once a request is done, callback functions are fired. A Jina Flow implements a Promise-like interface: you can add the callback functions on_done, on_error, on_always to hook into different events. In the example below, our Flow passes the message, then prints the result on success. If something goes wrong, it beeps. Finally, the result is written to output.txt.
import numpy
from jina import Flow

def beep(*args):
    # make a beep sound
    import os
    os.system('echo -n "\a";')

with Flow().add() as f, open('output.txt', 'w') as fp:
    f.index(numpy.random.random([4, 5, 2]),
            on_done=print, on_error=beep, on_always=lambda x: fp.write(x.json()))
Add Logic
To add logic to the Flow, use the uses parameter to attach a Pod with an Executor. uses accepts multiple value types, including a class name, a Docker image, (inline) YAML or a built-in shortcut.
f = (Flow().add(uses='MyBertEncoder')  # class name of a Jina Executor
           .add(uses='docker://jinahub/pod.encoder.dummy_mwu_encoder:0.0.6-0.9.3')  # the image name
           .add(uses='myencoder.yml')  # YAML serialization of a Jina Executor
           .add(uses='!WaveletTransformer | {freq: 20}')  # inline YAML config
           .add(uses='_pass')  # built-in shortcut executor
           .add(uses={'__cls': 'MyBertEncoder', 'with': {'param': 1.23}}))  # dict config object with __cls keyword
The power of Jina lies in its decentralized architecture: each add creates a new Pod, and these Pods can be run as a local thread/process, a remote process, inside a Docker container, or even inside a remote Docker container.
Inter & Intra Parallelism
Chaining .add()s creates a sequential Flow. For parallelism, use the needs parameter:
f = (Flow().add(name='p1', needs='gateway')
           .add(name='p2', needs='gateway')
           .add(name='p3', needs='gateway')
           .needs(['p1', 'p2', 'p3'], name='r1').plot())
p1, p2 and p3 now subscribe to the Gateway and conduct their work in parallel. The last .needs() blocks all Pods until they finish their work. Note that parallelism can also be performed inside a Pod using the parallel argument:
f = (Flow().add(name='p1', needs='gateway')
           .add(name='p2', needs='gateway')
           .add(name='p3', parallel=3)
           .needs(['p1', 'p3'], name='r1').plot())
Decentralized Flow
A Flow does not have to be local-only: you can run any Pod on a remote machine. In the example below, the host keyword puts gpu_pod on a remote machine for parallelization, whereas the other Pods stay local. Extra file dependencies that need to be uploaded are specified via the upload_files keyword.
On the remote machine 123.456.78.9
# have Docker installed
docker run --name=jinad --network=host -v /var/run/docker.sock:/var/run/docker.sock jinaai/jina:latest-daemon --port-expose 8000
# to stop it
docker rm -f jinad

Local
import numpy as np
from jina import Flow

f = (Flow()
     .add()
     .add(name='gpu_pod',
          uses='mwu_encoder.yml',
          host='123.456.78.9:8000',
          parallel=2,
          upload_files=['mwu_encoder.py'])
     .add())

with f:
    f.index_ndarray(np.random.random([10, 100]), on_done=print)
We provide a demo server on cloud.jina.ai:8000. Give the following snippet a try!
from jina import Flow

with Flow().add().add(host='cloud.jina.ai:8000') as f:
    f.index(['hello', 'world'])
Asynchronous Flow
Synchronous from the outside, Jina runs asynchronously underneath: it manages the event loop(s) for scheduling jobs. If you want more control over the event loop, use AsyncFlow.
Unlike Flow, the CRUD operations of AsyncFlow accept input and output functions as async generators. This is useful when your data sources involve other asynchronous libraries (e.g. motor for MongoDB):
import asyncio
from jina import AsyncFlow, Document

async def input_fn():
    for _ in range(10):
        yield Document()
        await asyncio.sleep(0.1)

with AsyncFlow().add() as f:
    async for resp in f.index(input_fn):
        print(resp)
AsyncFlow is particularly useful when Jina is used as part of a bigger integration, where another heavy-lifting job runs concurrently:
import asyncio
import numpy
from jina import AsyncFlow

async def run_async_flow_5s():  # WaitDriver pauses 5s, making the total roundtrip ~5s
    with AsyncFlow().add(uses='- !WaitDriver {}') as f:
        async for resp in f.index_ndarray(numpy.random.random([5, 4])):
            print(resp)

async def heavylifting():  # total roundtrip takes ~5s
    print('heavylifting other io-bound jobs, e.g. download, upload, file io')
    await asyncio.sleep(5)
    print('heavylifting done after 5s')

async def concurrent_main():  # about 5s; due to dispatch cost it can't be exactly 5s, usually <7s
    await asyncio.gather(run_async_flow_5s(), heavylifting())

if __name__ == '__main__':
    asyncio.run(concurrent_main())
AsyncFlow is also very useful when using Jina inside a Jupyter Notebook. As Jupyter/IPython already manages an event loop, and thanks to autoawait, AsyncFlow runs out-of-the-box in Jupyter.
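For example, a minimal sketch of a Jupyter cell (assuming IPython 7+, where autoawait drives the top-level async for):

import numpy
from jina import AsyncFlow

# inside a Jupyter cell there is no need for asyncio.run(); autoawait drives the loop
with AsyncFlow().add() as f:
    async for resp in f.index_ndarray(numpy.random.random([10, 2])):
        print(resp)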
That's all you need to know for understanding the magic behind hello-world. Now let's dive into it!
🐥
Breakdown of hello-world
Customize Encoder
Let's first build a naive image encoder that embeds images into vectors using an orthogonal projection. To do this, we simply inherit from BaseImageEncoder, a base class from the jina.executors.encoders module. We then override its __init__() and encode() methods.
import numpy as np
from jina.executors.encoders import BaseImageEncoder

class MyEncoder(BaseImageEncoder):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        np.random.seed(1337)
        H = np.random.rand(784, 64)
        u, s, vh = np.linalg.svd(H, full_matrices=False)
        self.oth_mat = u @ vh

    def encode(self, data: 'np.ndarray', *args, **kwargs):
        return (data.reshape([-1, 784]) / 255) @ self.oth_mat
Jina provides a family of Executor classes, which summarize frequently-used algorithmic components in neural search. This family consists of encoders, indexers, crafters, evaluators, and classifiers, each with a well-designed interface. You can find the list of all 107 built-in executors here. If they don't meet your needs, inheriting from one of them is the easiest way to bootstrap your own Executor. Simply use our Jina Hub CLI:
pip install "jina[hub]" && jina hub new
Test Encoder in Flow
Let's test our encoder in the Flow with some synthetic data:
import numpy
from jina import Flow
from jina.types.ndarray.generic import NdArray

def validate(req):
    assert len(req.docs) == 100
    assert NdArray(req.docs[0].embedding).value.shape == (64,)

f = Flow().add(uses='MyEncoder')
with f:
    f.index_ndarray(numpy.random.random([100, 28, 28]), on_done=validate)
All good! Now our validate function confirms that all one hundred 28x28 synthetic images have been embedded into 100x64 vectors.
Parallelism & Batching
By setting a larger input, you can play with batch_size and parallel:
f = Flow().add(uses='MyEncoder', parallel=10)
with f:
    f.index_ndarray(numpy.random.random([60000, 28, 28]), batch_size=1024)
Add Data Indexer
Now we need to add an indexer to store all the embeddings and the images for later retrieval. Jina provides a simple numpy-powered vector indexer, NumpyIndexer, and a key-value indexer, BinaryPbIndexer. We can combine them in a single YAML file:
!CompoundIndexer
components:
  - !NumpyIndexer
    with:
      index_filename: vec.gz
  - !BinaryPbIndexer
    with:
      index_filename: chunk.gz
metas:
  workspace: ./
! tags a structure with a class name, while with defines the arguments for initializing this class object.
Essentially, the above YAML config is equivalent to the following Python code:
from jina.executors.indexers.vector import NumpyIndexer
from jina.executors.indexers.keyvalue import BinaryPbIndexer
from jina.executors.indexers import CompoundIndexer
a = NumpyIndexer(index_filename='vec.gz')
b = BinaryPbIndexer(index_filename='chunk.gz')
c = CompoundIndexer()
c.components = lambda: [a, b]
Compose Flow from YAML
Now let's add our indexer YAML file to the Flow with .add(uses=). Let's also add two shards to the indexer to improve its scalability:
f = Flow().add(uses='MyEncoder', parallel=2).add(uses='myindexer.yml', shards=2).plot()
When you have many arguments, constructing a Flow in Python can get cumbersome. In that case, you can simply move all arguments into one flow.yml:
!Flow
version: '1.0'
pods:
  - name: encode
    uses: MyEncoder
    parallel: 2
  - name: index
    uses: myindexer.yml
    shards: 2
And then load it in Python:
f = Flow.load_config('flow.yml')
Search
Querying a Flow is similar to what we did with indexing. Simply load the query Flow and switch from f.index to f.search. Say you want to retrieve the top 50 documents that are similar to your query and then plot them in HTML:
f = Flow.load_config('flows/query.yml')
with f:
    # plot_in_html is your own callback that renders the matches into an HTML page
    f.search_ndarray(numpy.random.random([10, 28, 28]), shuffle=True, on_done=plot_in_html, top_k=50)
Evaluation
To compute precision and recall on the retrieved results, you can add _eval_pr, a built-in evaluator for computing precision & recall.
f = (Flow().add(...)
.add(uses='_eval_pr'))
You can construct an iterator of query and groundtruth pairs and feed it to the Flow f, via:
from jina import Document

def query_generator():
    for _ in range(10):
        q = Document()
        # now construct the expected matches as groundtruth
        gt = Document(q, copy=True)  # make sure 'gt' is identical to 'q'
        gt.matches.append(...)
        yield q, gt

f.search(query_generator(), ...)
REST Interface
In practice, the query Flow and the client (i.e. the data sender) are often physically separated. Moreover, the client may prefer to use a REST API rather than gRPC when querying. You can set port_expose to a public port and turn on REST support with restful=True:
f = Flow(port_expose=45678, restful=True)
with f:
    f.block()
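A client can then talk to the Flow over HTTP. Below is a rough sketch using Python's requests; the /api/search endpoint and the payload schema are assumptions that may differ between Jina versions, so check the REST API docs of your version:

import requests

# query the RESTful gateway started above (endpoint and payload schema are assumptions)
resp = requests.post('http://localhost:45678/api/search',
                     json={'top_k': 10, 'data': ['text:hello-world']})
print(resp.json())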
That is the essence behind jina hello-world. It is merely a taste of what Jina can do. We’re really excited to see what you do with Jina! You can easily create a Jina project from templates with one terminal command:
pip install "jina[hub]" && jina hub new --type app
This creates a Python entrypoint, YAML configs and a Dockerfile. You can start from there.
Learn
Jina 101: First Things to Learn About Jina (English • 日本語 • Français • Português • Deutsch • Русский язык • 中文 • عربية)

Examples (example code to build your own projects; view all):

- Semantic Wikipedia Search with Transformers and DistilBERT: Brand new to neural search? See a simple text-search example to understand how Jina works.
- Add Incremental Indexing to Wikipedia Search: Index more effectively by adding incremental indexing to your Wikipedia search.
- Build an NLP Semantic Search System with Transformers: Upgrade from plain search to sentence search and practice your Flows and Pods by searching sentences from Wikipedia.
- Search Lyrics with Transformers and PyTorch: Get a better understanding of chunks by searching a lyrics database. Now with a shiny front-end!
- Google's Big Transfer Model in (Poké-)Production: Use SOTA visual representation for searching Pokémon!
- Search YouTube Audio Data with VGGish: A demo of neural search for audio data based on the VGGish model.
- Search Tumblr GIFs with KerasEncoder: Use prefetching and sharding to improve the performance of your index and query Flows when searching animated GIFs.
Please check our examples repo for advanced and community-submitted examples.
Want to read more? Check our Founder Han Xiao's blog and our official blog.
Documentation
Apart from the learning resources provided above, we highly recommend you go through our documentation to master Jina.
Our docs are built on every push, merge, and release of Jina's master branch. Documentation for older versions is archived here.
Are you a "Doc"-star? Join us! We welcome all kinds of improvements on the documentation.
Contributing
We welcome all kinds of contributions from the open-source community, individuals and partners. We owe our success to your active involvement.
✨
Contributors
Community
- Code of conduct - play nicely with the Jina community
- Slack workspace - join #general on our Slack to meet the team and ask questions
- YouTube channel - subscribe to the latest video tutorials, release demos, webinars and presentations.
- LinkedIn - get to know Jina AI as a company and find job opportunities
- Twitter - follow and interact with us using the hashtag #JinaSearch
- Company - learn more about our company and how we are fully committed to open-source.
Open Governance
As part of our open governance model, we host Jina's Engineering All Hands in public. This Zoom meeting recurs on the second Tuesday of each month, 14:00-15:30 (CET). Everyone can join via the following calendar invite.
The meeting will also be live-streamed and later published to our YouTube channel.
Join Us
Jina is an open-source project. We are hiring full-stack developers, evangelists, and PMs to build the next neural search ecosystem in open source.
License
Copyright (c) 2020-2021 Jina AI Limited. All rights reserved.
Jina is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.