txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.

Overview

Build AI-powered semantic search applications

Traditional search systems use keywords to find data. Semantic search applications have an understanding of natural language and identify results that have the same meaning, not necessarily the same keywords.

Backed by state-of-the-art machine learning models, data is transformed into vector representations for search (also known as embeddings). Innovation is happening at a rapid pace; models can now understand concepts in documents, audio, images and more.
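
To make the idea of vector search concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are hand-made for illustration (real embeddings come from a model and have hundreds of dimensions), and the `cosine` and `search` helpers are hypothetical names, not part of txtai.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector, index, limit=1):
    """Return the top `limit` (id, score) pairs, best match first."""
    scores = [(uid, cosine(query_vector, vector)) for uid, vector in index.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:limit]


# Hand-made "embeddings" (illustrative values, not model output)
index = {
    "lottery win": [0.9, 0.1, 0.0],
    "bear attack": [0.1, 0.9, 0.2],
    "ice shelf collapse": [0.0, 0.3, 0.9],
}

# Assumed vector for a query such as "feel good story"
query = [0.8, 0.2, 0.1]

print(search(query, index, limit=1))
```

The query shares no keywords with any entry; the match is found purely by vector proximity, which is the property semantic search relies on.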

Summary of txtai features:

  • 🔎 Large-scale similarity search with multiple index backends (Faiss, Annoy, Hnswlib)
  • 📄 Create embeddings for text snippets, documents, audio, images and video. Supports transformers and word vectors.
  • 💡 Machine-learning pipelines to run extractive question-answering, zero-shot labeling, transcription, translation, summarization and text extraction
  • ↪️ Workflows that join pipelines together to aggregate business logic. txtai processes can be microservices or full-fledged indexing workflows.
  • 🔗 API bindings for JavaScript, Java, Rust and Go
  • ☁️ Cloud-native architecture that scales out with container orchestration systems (e.g. Kubernetes)

Applications range from similarity search to complex NLP-driven data extractions to generate structured databases. The following applications are powered by txtai.

Application | Description
paperai | AI-powered literature discovery and review engine for medical/scientific papers
tldrstory | AI-powered understanding of headlines and story text
neuspo | Fact-driven, real-time sports event and news site
codequestion | Ask coding questions directly from the terminal

txtai is built with Python 3.6+, Hugging Face Transformers, Sentence Transformers and FastAPI.

Installation

The easiest way to install is via pip and PyPI:

pip install txtai

Python 3.6+ is supported. Using a Python virtual environment is recommended.

See the detailed install instructions for more information covering installing from source, environment specific prerequisites and optional dependencies.

Examples

The examples directory has a series of notebooks and applications giving an overview of txtai. See the sections below.

Semantic Search

Build semantic/similarity/vector search applications.

Notebook | Description
Introducing txtai | Overview of the functionality provided by txtai
Build an Embeddings index with Hugging Face Datasets | Index and search Hugging Face Datasets
Build an Embeddings index from a data source | Index and search a data source with word embeddings
Add semantic search to Elasticsearch | Add semantic search to existing search systems
API Gallery | Using txtai in JavaScript, Java, Rust and Go
Similarity search with images | Embed images and text into the same space for search
Distributed embeddings cluster | Distribute an embeddings index across multiple data nodes

Pipelines and Workflows

NLP-backed data transformation pipelines and workflows.

Notebook | Description
Extractive QA with txtai | Introduction to extractive question-answering with txtai
Extractive QA with Elasticsearch | Run extractive question-answering queries with Elasticsearch
Apply labels with zero shot classification | Use zero shot learning for labeling, classification and topic modeling
Building abstractive text summaries | Run abstractive text summarization
Extract text from documents | Extract text from PDF, Office, HTML and more
Transcribe audio to text | Convert audio files to text
Translate text between languages | Streamline machine translation and language detection
Run pipeline workflows | Simple yet powerful constructs to efficiently process data

Model Training

Train NLP models.

Notebook | Description
Train a text labeler | Build text sequence classification models
Train without labels | Use zero-shot classifiers to train new models
Train a QA model | Build and fine-tune question-answering models
Export and run models with ONNX | Export models with ONNX, run natively in JavaScript, Java and Rust

Applications

Series of example applications with txtai.

Application | Description
Demo query shell | Basic similarity search example. Used in the original txtai demo.
Book search | Book similarity search application. Index book descriptions and query using natural language statements.
Image search | Image similarity search application. Index a directory of images and run searches to identify images similar to the input query.
Wiki search | Wikipedia search application. Queries Wikipedia API and summarizes the top result.
Workflow builder | Build and execute txtai workflows. Connect summarization, text extraction, transcription, translation and similarity search pipelines together to run unified workflows.

Documentation

Full documentation on txtai is available, including configuration settings for pipelines, workflows, indexing and the API.

Contributing

For those who would like to contribute to txtai, please see this guide.

Comments
  • A .Net binding for txtai?

    Hello community, it would be very nice to make a .Net binding. Where can I find a guide for that? So please if anyone can give me some instructions on how to get in that would be great. Of course, I would also like to know if this binding issue is only handled by maintainers or if it is something open to the community.

    opened by oliver021 21
  • embeddings.index Truncation RuntimeError: The size of tensor a (889) must match the size of tensor b (512) at non-singleton dimension 1

    Hello, when I try to run the indexing step, I get this error.

    Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
    
    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    <ipython-input-33-6e863ca8aecc> in <module>
    ----> 1 embeddings.index(to_index)
    
    ~\Anaconda3\envs\bert2\lib\site-packages\txtai\embeddings.py in index(self, documents)
         80 
         81         # Transform documents to embeddings vectors
    ---> 82         ids, dimensions, stream = self.model.index(documents)
         83 
         84         # Load streamed embeddings back to memory
    
    ~\Anaconda3\envs\bert2\lib\site-packages\txtai\vectors.py in index(self, documents)
        245                 if len(batch) == 500:
        246                     # Convert batch to embeddings
    --> 247                     uids, dimensions = self.batch(batch, output)
        248                     ids.extend(uids)
        249 
    
    ~\Anaconda3\envs\bert2\lib\site-packages\txtai\vectors.py in batch(self, documents, output)
        279 
        280         # Build embeddings
    --> 281         embeddings = self.model.encode(documents, show_progress_bar=False)
        282         for embedding in embeddings:
        283             if not dimensions:
    
    ~\Anaconda3\envs\bert2\lib\site-packages\sentence_transformers\SentenceTransformer.py in encode(self, sentences, batch_size, show_progress_bar, output_value, convert_to_numpy, convert_to_tensor, device, normalize_embeddings)
        192 
        193             with torch.no_grad():
    --> 194                 out_features = self.forward(features)
        195 
        196                 if output_value == 'token_embeddings':
    
    ~\Anaconda3\envs\bert2\lib\site-packages\torch\nn\modules\container.py in forward(self, input)
        117     def forward(self, input):
        118         for module in self:
    --> 119             input = module(input)
        120         return input
        121 
    
    ~\Anaconda3\envs\bert2\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
        887             result = self._slow_forward(*input, **kwargs)
        888         else:
    --> 889             result = self.forward(*input, **kwargs)
        890         for hook in itertools.chain(
        891                 _global_forward_hooks.values(),
    
    ~\Anaconda3\envs\bert2\lib\site-packages\sentence_transformers\models\Transformer.py in forward(self, features)
         36             trans_features['token_type_ids'] = features['token_type_ids']
         37 
    ---> 38         output_states = self.auto_model(**trans_features, return_dict=False)
         39         output_tokens = output_states[0]
         40 
    
    ~\Anaconda3\envs\bert2\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
        887             result = self._slow_forward(*input, **kwargs)
        888         else:
    --> 889             result = self.forward(*input, **kwargs)
        890         for hook in itertools.chain(
        891                 _global_forward_hooks.values(),
    
    ~\Anaconda3\envs\bert2\lib\site-packages\transformers\models\bert\modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
        962         head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
        963 
    --> 964         embedding_output = self.embeddings(
        965             input_ids=input_ids,
        966             position_ids=position_ids,
    
    ~\Anaconda3\envs\bert2\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
        887             result = self._slow_forward(*input, **kwargs)
        888         else:
    --> 889             result = self.forward(*input, **kwargs)
        890         for hook in itertools.chain(
        891                 _global_forward_hooks.values(),
    
    ~\Anaconda3\envs\bert2\lib\site-packages\transformers\models\bert\modeling_bert.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds, past_key_values_length)
        205         if self.position_embedding_type == "absolute":
        206             position_embeddings = self.position_embeddings(position_ids)
    --> 207             embeddings += position_embeddings
        208         embeddings = self.LayerNorm(embeddings)
        209         embeddings = self.dropout(embeddings)
    
    RuntimeError: The size of tensor a (889) must match the size of tensor b (512) at non-singleton dimension 1
    

    Where to_index =

     [('0015023cc06b5362d332b3baf348d11567ca2fbb',
      'The RNA pseudoknots in foot-and-mouth disease virus are dispensable for genome replication but essential for the production of infectious virus. 2 3\nword count: 194 22 Text word count: 5168 23 24 25 author/funder. All rights reserved. No reuse allowed without permission. Abstract 27 The positive stranded RNA genomes of picornaviruses comprise a single large open reading 28 frame flanked by 5′ and 3′ untranslated regions (UTRs). Foot-and-mouth disease virus (FMDV) 29 has an unusually large 5′ UTR (1.3 kb) containing five structural domains. These include the 30 internal ribosome entry site (IRES), which facilitates initiation of translation, and the cis-acting 31 replication element (cre). Less well characterised structures are a 5′ terminal 360 nucleotide 32 stem-loop, a variable length poly-C-tract of approximately 100-200 nucleotides and a series of 33 two to four tandemly repeated pseudoknots (PKs). We investigated the structures of the PKs 34 by selective 2′ hydroxyl acetylation analysed by primer extension (SHAPE) analysis and 35 determined their contribution to genome replication by mutation and deletion experiments. 36 SHAPE and mutation experiments confirmed the importance of the previously predicted PK 37 structures for their function. Deletion experiments showed that although PKs are not essential 38',
      None),
     ('00340eea543336d54adda18236424de6a5e91c9d',
      'Analysis Title: Regaining perspective on SARS-CoV-2 molecular tracing and its implications\nDuring the past three months, a new coronavirus (SARS-CoV-2) epidemic has been growing exponentially, affecting over 100 thousand people worldwide, and causing enormous distress to economies and societies of affected countries. A plethora of analyses based on viral sequences has already been published, in scientific journals as well as through non-peer reviewed channels, to investigate SARS-CoV-2 genetic heterogeneity and spatiotemporal dissemination. We examined all full genome sequences currently available to assess the presence of sufficient information for reliable phylogenetic and phylogeographic studies. Our analysis clearly shows severe limitations in the present data, in light of which any finding should be considered, at the very best, preliminary and hypothesis-generating. Hence the need for avoiding stigmatization based on partial information, and for continuing concerted efforts to increase number and quality of the sequences required for robust tracing of the epidemic.',
      None),
     ('004f0f8bb66cf446678dc13cf2701feec4f36d76',
      'Healthcare-resource-adjusted vulnerabilities towards the 2019-nCoV epidemic across China\n',
      None), ...]
    

    How do I fix this? I don't see anywhere in the documentation about this. I assume the error message:

    Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. 
    Default to no truncation.
    

    is related, and I need to set a max_length to 512 so that any documents that are larger than 512 get truncated to 512 tokens, but I don't see anywhere how to do that...
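
A minimal sketch of the truncation being asked about, assuming the model accepts at most 512 positions and that two of them are reserved for special tokens such as [CLS] and [SEP]. The `truncate` helper is hypothetical and operates on an already tokenized list of ids; it is not a txtai or sentence-transformers function.

```python
def truncate(token_ids, max_length=512, special_tokens=2):
    """Clip a token id sequence so it fits max_length once special tokens are added."""
    if max_length <= special_tokens:
        raise ValueError("max_length must leave room for at least one content token")

    return token_ids[: max_length - special_tokens]


# A document that tokenizes to 889 ids, as in the error message above
tokens = list(range(889))
clipped = truncate(tokens)

print(len(tokens), "->", len(clipped))
```

With Hugging Face tokenizers, the same effect is normally requested by passing `truncation=True` together with `max_length` when calling the tokenizer, rather than by slicing ids by hand.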

    opened by shinthor 16
  • POST Error indexing images via Embeddings API service

    I'm getting the following error, when indexing images using POST to the txtai service url http://txtai.default.127.0.0.1.sslip.io/add.

    "detail":[{"loc":["body"],"msg":"value is not a valid list","type":"type_error.list"}]}

    Possible related to the FastAPI endpoint?

    The same cluster is successful with text documents, but unsure how to index images.

    Is it possible to periodically index images in a remote s3 directory via a workflow?

    My current workflow YAML is:

    writable: true
    path: /tmp/index.tar.gz
    cloud:
      provider: s3
      container: index
      key: "<key>"
      secret: "<secret>"
      host: txtai.s3.amazonaws.com
      port: 80
    
    embeddings:
      path: "sentence-transformers/clip-ViT-B-32-multilingual-v1"
      content: true
    

    I'm hoping to implement the Images embedding search in a workflow configuration, as in the examples/images.ipynb notebook

    opened by edanweis 14
  • External vectors

    First of all, thank you for adding this feature!

    I am attempting to recreate the example locally from the docs:

    import numpy as np
    import requests
    from txtai.embeddings import Embeddings
    
    data = ['test', 'test2']
    def transform(inputs):
      response = requests.post("https://api-inference.huggingface.co/pipeline/feature-extraction/sentence-transformers/nli-mpnet-base-v2",
                               json={"inputs": inputs})
    
      return np.array(response.json(), dtype=np.float32)
    
    # Index data using vectors from Inference API
    embeddings = Embeddings({"method": "external", "transform": transform, "content": True})
    embeddings.index([(uid, text, None) for uid, text in enumerate(data)])
    
    print("%-20s %s" % ("Query", "Best Match"))
    print("-" * 50)
    
    print(embeddings.search('test', 1))
    

    when I run the search, I receive an empty array. Perhaps data must be a tuple?

    opened by luquitared 13
  • Is there any relation between embedding data and /similarity API ?

    I have a lot of test data and embedding data /add-ed and /upsert-ed in a system, In that system when I do /search request

    (with an SQL query like Select id,text, score, data from txtai where similar("what is the specialtiy of this ?",'8000') AND something ='something' ... )

    it takes around 0 to 3 seconds to pick 10 results. I'm ok with that(because it may take time since the number data I have stored is larger like thousands of text documents' paragraphs extracted and stored ).

    I think only /search API or /batchsearch API is gonna look into the 1000s of paragraphs to find the response. But /similarity API is not like them... (please correct me if I'm wrong) Because when we use http://localhost:8000/similarity we ourselves giving the list of texts and question in the request body.. and txtai is gonna return the list of id,score pairs only.

    What I'm trying to say is, when I did the following (POST) request by port forwarding the txtAI from a Linux system that already has a lot of embeddings stored. It took 600ms to 1200ms to get the response (I checked this timing by sending requests few more times using Postman .)

    curl -X POST "http://localhost:8000/similarity" -H "Content-Type: application/json" -d '{"query": "feel good story", "texts": ["Maine man wins $1M from $25 lottery ticket", "Dont sacrifice slower friends in a bear attack"]}'

    But when I do the same in a fresh system or system with lesser embedding data, it is only taking 50ms or lesser than that to find the response.

    Is there any relation between embedding data and /similarity API ?

    (For a text list with 2 or 3 sentence, it is fine to take 600 to 1200 ms, but when I try to do the same with 100s of texts and 1 question.. it is taking 13 or more seconds)

    My use case is to run the /similarity operation and calculate id,score pairs for 100s of texts and 1 question in 1 to 3 seconds or less. I could achieve this on a system with less trained (embedding) data, but not in the other case.

    So that's why I want to know, Is there any relation between embedding data and /similarity API ?

    Actually I'm having a semantic search program in golang in which I'm using txtAI's APIs to perform searching, I'm making use of /search and /similarity APIs to fetch the results for the user query. Which is like after getting the 10 results from /search , I will be doing /similarity operation on each of 10 results's metadata keywords (that I have trained along with the "text" and "id" at the time of indexing them which will be in the "data" field ) following this approach I could improve the search result accuracy that I have expected, but the time it takes is more than 13 seconds when I have more keywords or tags in the "data" field.. I noticed that this time taken is only because of the /similarity API. Im pointing out the same in the following to make it clear to understand,

    1. Do /search with a limit 10, Get 10 results and scores
    2. Recalculate each result's score - 1 by 1 by.. finding /similarity between "user question"(query) and metadata content(texts) of each result - get the list of id and scores.
    3. Merge these /similarity scores with /search scores and compute the scores again.
    4. Now sort the results based on score again
    5. Return 10 results to user.
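
The five steps above can be sketched as follows. This is an illustration under stated assumptions, not txtai's implementation: `overlap_similarity` is a hypothetical stand-in for a /similarity call that returns (index, score) pairs, the 50/50 weighting is arbitrary, and sending all metadata texts in one batched call (instead of one call per result) is only a guess at where the time goes.

```python
def overlap_similarity(query, texts):
    """Stand-in scorer: Jaccard word overlap, returned as (index, score) pairs."""
    query_words = set(query.lower().split())
    scores = []
    for index, text in enumerate(texts):
        words = set(text.lower().split())
        union = query_words | words
        scores.append((index, len(query_words & words) / len(union) if union else 0.0))
    return scores


def rerank(query, results, similarity, weight=0.5, limit=10):
    """
    Merge search scores with similarity scores and re-sort.

    results: list of (id, search score, metadata text) from step 1
    similarity: callable(query, texts) returning (index, score) pairs
    """
    # Step 2: score every metadata text in a single batched call
    texts = [text for _, _, text in results]
    similarity_scores = dict(similarity(query, texts))

    # Step 3: weighted merge of the two scores
    merged = [
        (uid, weight * score + (1 - weight) * similarity_scores.get(index, 0.0))
        for index, (uid, score, _) in enumerate(results)
    ]

    # Steps 4 and 5: sort by merged score and return the top results
    return sorted(merged, key=lambda item: item[1], reverse=True)[:limit]


results = [
    ("a", 0.60, "lottery winner story"),
    ("b", 0.70, "bear attack advice"),
    ("c", 0.50, "feel good story"),
]

for uid, score in rerank("feel good story", results, overlap_similarity):
    print(uid, round(score, 2))
```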

    Please suggest me an efficient way to improve the search speed and search accuracy.

    opened by akset2X 10
  • The result of the "embeddings.search()" method has a length that does not match the input argument limit

    I am trying to do a search using a query with the same limit as the input data length, but why is the output data length half of the input data length?

    why len(result) < len(data)?

    def get_embeddings(model_name, data):
        embeddings = Embeddings(
            {"path": model_name, "content": True, "objects": True})
        embeddings.index([(id, text, None) for id, text in enumerate(data)])
        return embeddings
    
    def search(queries, embeddings, limit):
        return [result for result in embeddings.similarity(queries, limit)]
    
    queries = (string)
    data = [string, ...]
    embeddings = get_embeddings(model_name, data)
    result = search(queries, embeddings, len(data))
    opened by muazhari 10
  • Add Cross-Encoder support to Similarity pipeline

    It seems like Cross Encoders are the preferred model for doing Re-Ranking of search results that were generated by another means (BM25, vector search etc...). However, if I provide one of these models as the path, all the results just have scores of 0.5.

    Sentence Transformers recommends doing this. https://www.sbert.net/examples/applications/retrieve_rerank/README.html

    In particular, their msmarco-minilm models seem ideal as a default (maybe the L-6 version?) https://www.sbert.net/docs/pretrained-models/ce-msmarco.html

    Haystack's implementation uses this in its Ranker node. https://haystack.deepset.ai/pipeline_nodes/ranker

    opened by nickchomey 7
  • txtai Similarity really slow with ElasticSearch

    I've noticed when running ElasticSearch and txtai.pipeline for Similarity, the search (ranksearch) is very slow. When trying to search for 1 item, it can take up to 10 seconds.

    The code I'm using is:

    from txtai.pipeline import Similarity
    from elasticsearch import Elasticsearch, helpers
    
    # Connect to ES instance
    es = Elasticsearch(hosts=["http://localhost:9200"], timeout=60, retry_on_timeout=True)
    
    def ranksearch(query, limit):
      results = [text for _, text in search(query, limit * 10)]
      return [(score, results[x]) for x, score in similarity(query, results)][:limit]
    
    def search(query, limit):
      query = {
          "size": limit,
          "query": {
              "query_string": {"query": query}
          }
      }
    
      results = []
      for result in es.search(index="articles", body=query)["hits"]["hits"]:
        source = result["_source"]
        results.append((min(result["_score"], 18) / 18, source["title"]))
      return results
    
    similarity = Similarity("valhalla/distilbart-mnli-12-3")
    
    limit = 1
    query = "Bad News"
    print(ranksearch(query, limit))
    
    opened by Xyphius 7
  • add multi-label classification

    This branch adds one more type of trainer. This is the multi-label classification model. The edits are minor, but it was useful for me to train a multi_label classifier with txtai. I really appreciate txtai, let me know if you request any edits. Thanks

    opened by abdullahtarek 7
  • Incremental Indexing

    Is it possible to add new data in previously created index files ? E.g. If I have created and saved a Embedding Index file with 5K records and, after some days I have new 2K data which I want to include in my old index file then can I load the old index file and append the new 2K data into it so that when we query in the updated embedding index it will include the new 2K data too ? OR I need to re-index the whole 7K (old 5k + new 2K) data together ?

    opened by roy-sr 7
  • [Python 3.9, Mac OS] Code hangs while building embedding index

    Minimal example below. I suspect something weird is happening to multithreading, but was not able to confirm or find a resolution.

    Python version - 3.9.1
    OS - Mac Big Sur
    
    import os
    import tempfile
    import pandas as pd
    
    
    from txtai.tokenizer import Tokenizer
    from txtai.vectors import WordVectors
    from txtai.embeddings import Embeddings
    
    # generate random data
    random_array = ["a" + pd.util.testing.rands_array(3, 2) for i in range(100)]
    df = pd.DataFrame(random_array, columns=list("AB"))
    
    
    with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as output:
        tokens = output.name
    
        for row in range(df.shape[0]):
            output.write(" ".join(Tokenizer.tokenize(df.iloc[row, 0])) + "\n")
    
    WordVectors.build(tokens, 10, 1, "search-data")
    
    os.remove(tokens)
    
    def stream():
        for row in range(df.shape[0]):
            uid = row
            tokens = Tokenizer.tokenize(df.iloc[row, 0])
            document = (uid, tokens, None)
            yield document
    
    embeddings = Embeddings({"path": "search-data.magnitude"})
    
    # Code hangs here
    embeddings.index(stream())
    
    opened by hardikudeshi 7
  • Allow save embeddings (the tmpfile.npy); or option to return embeddings with query

    I'd like to work with the actual numpy embeddings (the ones buffered here). Context on why later. I thought it was storevectors, but that looks to save the transformers model; and the embeddings file is the ANN index. I see two solutions: add a vector field to the SQL search query (eg, select id, text, vector from txtai) which will run faiss.reconstruct() here. I don't know how reconstruct works, but I see it used for that purpose in haystack. My concern is that this solution is lossy (does reconstruct try to regenerate the original vector?). Also you'd have to have an equivalent for the other ANNs. So I think a better approach is:

    Add another config like save_actual_embeddings, which will copy over the tmpfile.npy. BUT, it would be hard to map the original IDs in - unless you did something dirty like the first column of the numpy array is the id (text, where the other columns are floats). I think a better approach is to save it as Pandas (or Feather / Arrow / whatever), with an ID column and vector column. I think this could be a simple add. But you could take it even further and scrap sqlite, just use the Pandas dataframe (wherein you can do sql-like queries, fields, etc) - two birds.


    The context (feel free to skip). I'm building search. When they search, they get their search results like normal. But also, in the side bar there are a bunch of resources related to the results of their search. So it would take the vectors of all the returned results, np.mean() them, then send them to the various resource searches. Specifically it's a journal app, and a user will search through past entries. The result of that search will be say 10 entries. np.mean(10_entry_vectors) now goes to book-search, therapist-search, etc. So I don't want to search books/therapists from their query, but from their entries which result from their query. I also want to cluster their resultant entries (I still need to explore txtai's graph feature), for which I'd need the vectors. And finally (I'll submit a separate ticket), I'd like to search directly by vector, not text (using said mean).
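
The averaging idea described above can be sketched in plain Python. All vectors and ids here are made up for illustration, and `mean_vector` and `cosine` are hypothetical helpers (the element-wise mean plays the role of np.mean over the result vectors); none of this is txtai API.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Vectors of the journal entries returned by the user's search (illustrative)
entry_vectors = [[1.0, 0.0], [0.8, 0.2]]

# Single vector summarizing the result set
centroid = mean_vector(entry_vectors)

# Vectors for a separate resource index, e.g. books (illustrative)
books = {"book-1": [1.0, 0.1], "book-2": [0.0, 1.0]}

# Rank resources by similarity to the centroid instead of the text query
ranked = sorted(books, key=lambda uid: cosine(centroid, books[uid]), reverse=True)
print(ranked)
```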

    opened by lefnire 6
  • What are the API endpoints to use semantic graph ?

    Glad to know that txtai added the semantic graph as a new feature. How can we actually use it from programs written in other languages, where we expect it to be exposed as an API?

    Are there any API endpoints for it, like search and extract? Is anybody using it? Please let me know. It would help (at least for me) to have details on how to express the Python script as a YAML config file that uses graphs, categories and topic modeling.

    Following is my config file content to start the server using uvicorn..

    # Index file path
    path: ./tmp/index
    
    # Allow indexing of documents
    writable: True
    
    # Embeddings index
    embeddings:
      path: sentence-transformers/all-MiniLM-L6-v2
      content: true
    # I manually added these below lines upto extractor part
      functions:
      - name: graph
        function: graph.attribute
      expressions:
      - name: category
        expression: graph(indexid, 'category')
      - name: topic
        expression: graph(indexid, 'topic')
      - name: topicrank
        expression: graph(indexid, 'topicrank')
      graph:
        limit: 15
        minscore: 0.1
        topics:
          categories:
          - Society & Culture
          - Science & Mathematics
          - Health
          - Education & Reference
          - Computers & Internet
          - Sports
          - Business & Finance
          - Entertainment & Music
          - Family & Relationships
          - Politics & Government
    
    extractor:
      path: distilbert-base-cased-distilled-squad
    
    textractor:
        paragraphs: true
        minlength: 100
        join: false
    

    Output:

    ...
    ModuleNotFoundError: No module named 'graph' 
    ERROR: Application startup failed. Exiting
    
    opened by akset2X 5
  • SQLite error on upsert and PIL exception

    After indexing/upserting around 200k images, I'm now seeing this exception about an SQLite constraint error. Any suggestions to recover?

        embeddings.upsert(images(x, y))
      File "/home/loop/apps/anaconda3/envs/pytorch/lib/python3.10/site-packages/txtai/embeddings/base.py", line 158, in upsert
        ids, _, embeddings = transform(documents, buffer)
      File "/home/loop/apps/anaconda3/envs/pytorch/lib/python3.10/site-packages/txtai/embeddings/transform.py", line 58, in __call__
        ids, dimensions, batches, stream = self.model.index(self.stream(documents), self.batch)
      File "/home/loop/apps/anaconda3/envs/pytorch/lib/python3.10/site-packages/txtai/vectors/base.py", line 79, in index
        for document in documents:
      File "/home/loop/apps/anaconda3/envs/pytorch/lib/python3.10/site-packages/txtai/embeddings/transform.py", line 112, in stream
        self.load(batch, offset)
      File "/home/loop/apps/anaconda3/envs/pytorch/lib/python3.10/site-packages/txtai/embeddings/transform.py", line 142, in load
        self.database.insert(batch, self.offset)
      File "/home/loop/apps/anaconda3/envs/pytorch/lib/python3.10/site-packages/txtai/database/sqlite.py", line 151, in insert
        self.insertsection(index, uid, document, tags, entry)
      File "/home/loop/apps/anaconda3/envs/pytorch/lib/python3.10/site-packages/txtai/database/sqlite.py", line 433, in insertsection
        self.cursor.execute(SQLite.INSERT_SECTION, [index, uid, text, tags, entry])
    sqlite3.IntegrityError: UNIQUE constraint failed: sections.indexid
    
    Exception ignored in: <generator object images at 0x7fe384949770>
    RuntimeError: generator ignored GeneratorExit
    

    Current config (embeddings.info() output):

    {
      "backend": "faiss",
      "batch": 2,
      "build": {
        "create": "2022-11-24T23:38:28Z",
        "python": "3.10.6",
        "settings": {
          "components": "IDMap,Flat"
        },
        "system": "Linux (x86_64)",
        "txtai": "5.1.0"
      },
      "content": true,
      "dimensions": 512,
      "encodebatch": 2,
      "method": "sentence-transformers",
      "offset": 201487,
      "path": "clip-ViT-B-32",
      "update": "2022-11-25T03:38:44Z"
    }
    
    opened by LoopControl 5
  • Add translation pipeline parameter to return selected models

    The translation pipeline seamlessly loads and uses a series of models to run the translations.

    It would be beneficial to have a parameter to also return the associated models and detected languages to help with explainability and debugging.

    opened by davidmezzetti 0
  • GitHub Actions build error with torch 1.13 on macOS

    This issue has re-emerged with torch 1.13. The same issue was reported in #300 with Python 3.7 and torch 1.12.

    It appears Transformers is also having issues - https://github.com/huggingface/transformers/pull/19989

    Pin build to torch==1.12.1 for now.

    bug 
    opened by davidmezzetti 1
  • Add Unsupervised Learning/Domain Adaptation Pipelines

    I was quite happy to see the new Cross Encoder support in the Similarity Pipeline given how well it has ranked in various tests - e.g. from this paper from Nils Reimers of SBERT and the SBERT documentation's recommendations for BM25 Retrieval + CE Re-Ranking. However, I'm finding it to be rather slow, even with a small number of documents from the initial BM25 step.

    Nils subsequently published this paper about a new technique, Generative Pseudo Labelling (GPL), for unsupervised label generation for any text corpus which ultimately generates a fine-tuned embedding model.

    It is nearly as relevant as BM25 + Cross Encoder Re-Ranking, particularly when combined with TSDAE (for "denoising"/precleaning text) [image]

    but it would be much faster at query-time than CE Re-Ranking (table from the first paper linked above - presumably would be similar to other embedding models, e.g. 200ms): [image]

    This would surely be expensive to pre-generate, but so are most dense vector embeddings and that's generally acceptable because it is done one time. The important part is the cross encoder-level relevance with normal bi-encoder speed.

    I'm not sure what the best approach would be to incorporate this into txtai, but it seems like TSDAE could be an optional pre-processing pipeline and then GPL could be an augmentation to the Labels pipeline (which seems to require you to pre-suggest labels rather than have them generated automatically) and the HFTrainer pipeline.

    And, perhaps, the GPL pipeline could be made up of individual pipeline components for the 3 steps:

    1. Query generation
    2. Negative Mining
    3. Pseudo Labelling

    (I'm not as sure about steps 2 and 3, but the 1st step for Query Generation would certainly be a useful standalone Pipeline)
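To make the three proposed steps concrete, here is a toy sketch of how they could chain together. Everything in it is a stand-in chosen for illustration: `generate_query`, `mine_negative` and `pseudo_label` are hypothetical names, and a simple token-overlap score replaces the generative model, the retriever and the cross-encoder a real GPL setup would use. It is not txtai code and not the reference GPL implementation.

```python
def tokens(text):
    """Lowercased set of whitespace-separated tokens."""
    return set(text.lower().split())


def overlap(a, b):
    """Jaccard overlap between two texts; a toy stand-in for a learned scorer."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def generate_query(passage, size=3):
    """Step 1 stand-in: a real setup would use a generative model to write queries."""
    return " ".join(passage.split()[:size])


def mine_negative(query, positive, corpus):
    """Step 2: pick the most similar passage that is not the positive one."""
    candidates = [p for p in corpus if p != positive]
    return max(candidates, key=lambda p: overlap(query, p))


def pseudo_label(query, positive, negative):
    """Step 3: margin between the positive and negative scores."""
    return overlap(query, positive) - overlap(query, negative)


corpus = [
    "faiss builds vector indexes for similarity search",
    "faiss builds on optimized clustering routines",
    "sqlite stores rows in a single file",
]

# Each training row: (query, positive passage, hard negative, margin label)
training = []
for passage in corpus:
    query = generate_query(passage)
    negative = mine_negative(query, passage, corpus)
    training.append((query, passage, negative, pseudo_label(query, passage, negative)))
```

Each row in `training` pairs a generated query with its source passage, a mined hard negative and a margin label. In an actual pipeline those rows would be the input for fine-tuning an embedding model.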

    opened by nickchomey 4
Releases(v5.2.0)
  • v5.2.0(Dec 20, 2022)

    This release adds TextToSpeech and Cross-Encoder pipelines. The performance of the embeddings.batchtransform method was significantly improved, which speeds up building semantic graphs. Default configuration is now available for Embeddings, allowing an Embeddings instance to be created with no arguments, as with Pipelines.

    See below for full details on the new features, improvements and bug fixes.

    New Features

    • Add Cross-Encoder support to Similarity pipeline (#372)
    • Create compression package (#376)
    • Add TextToSpeech pipeline (#389)
    • Add TextToSpeech Notebook (#391)
    • Add default configuration for Embeddings (#393)

    Improvements

    • Filter HF API list models request (#381)
    • Split pipeline extras by function area (#387)
    • Update data package to handle label arrays (#388)
    • Modify transcription pipeline to accept raw waveform data (#390)
    • Transcription pipeline improvements (#392)
    • Allow searching by embedding (#396)
    • Modified logger configuration in __init__.py (libraries shouldn't modify root logger) - Thank you @adin786! (#397)
    • Pass evaluation metrics to underlying Trainer (#398)
    • Improve batchtransform performance (#399)

    Bug Fixes

    • Example 31 - Duplicate image detection not working (#357)
    • All sorts of issues with Example 18 - Export and run models with ONNX (#369)
    • Fix issue with select distinct bug (#379)
    • Update build script and tests to address issues with latest version of FastAPI (#380)
    • Fix issue with similar and bracket SQL expressions embedded in functions (#382)
    • Fix bug with embeddings functions and application config bug (#400)
  • v5.1.0(Oct 18, 2022)

    This release adds new model support for the translation pipeline, OpenAI Whisper support in the transcription pipeline and ARM Docker images. Topic modeling was also updated with improvements, including how to use BM25/TF-IDF indexes to drive topic models.

    See below for full details on the new features, improvements and bug fixes.

    New Features

    • Multiarch docker image (#324)
    • Add notebook covering classic topic modeling with BM25 (#360)

    Improvements

    • Read authentication parameters from storage task (#332)
    • Update scoring algorithms (#351)
    • Add config option for list of stopwords to ignore with topic generation (#352)
    • Allow for setting custom translation model path (#355)
    • Update caption pipeline to call image-to-text pipeline (#361)
    • Update transcription pipeline to call automatic-speech-recognition pipeline (#362)
    • Only pass tokenizer to pipeline when necessary (#363)
    • Improve default max length logic for text generation (#364)
    • Update transcription notebook (#365)
    • Update translation notebook (#366)
    • Move mkdocs dependencies from docs.yml to setup.py (#368)

    Bug Fixes

    • GitHub Actions build error with torch 1.12 on macOS (#300)
    • SQLite JSON support not built into Python Windows builds < 3.9 (#356)
    • Use tags field in application.add (#359)
    • Fix issue with Application autosequencing (#367)
  • v5.0.0(Sep 27, 2022)

    🎈🎉🥳 We're excited to announce the release of txtai 5.0! 🥳🎉🎈

    Thank you to the txtai community! Please remember to ⭐ txtai!

    txtai 5.0 is a major new release. This release adds the semantic graph along with enabling external integrations. It also adds a number of improvements and bug fixes.

    New Features

    • Add scoring-based search (#327)
    • Add notebook demonstrating functionality of individual embeddings components (#328)
    • Add SQL expression columns (#338)
    • Add semantic graph component (#339)
    • Add notebook covering Semantic Graphs (#341)
    • Add graph documentation (#343)
    • Allow custom ann, database and graph instances (#344)

    Improvements

    • Clarify embeddings.save documentation (#325)
    • Modify embeddings search candidate default logic (#326)
    • Update console to conditionally import library (#333)
    • Update ANN package to make terminology more consistent (#334)
    • Support non-text document elements in Applications (#335)
    • Update workflow documentation to note generator execution (#336)
    • Update audio transcription notebook to include example with OpenAI Whisper (#345)

    Bug Fixes

    • Calling scoring.index with no tokens parsed results in error (#337)
    • Fix cached_path error with transformers v4.22 (#340)
    • Fix docker command "--it". Thank you to @lipusz! (#346)
    • Error loading compressed indexes in console bug (#347)
  • v4.6.0(Aug 15, 2022)

    🎈🎉🥳 txtai turns 2 🎈🎉🥳

    We're excited to release the 25th version of txtai, marking its 2 year anniversary. Thank you to the txtai community. Please remember to ⭐ txtai!

    txtai 4.6 is a large but backwards compatible release! This release adds better integration between embeddings and workflows. It also adds a number of significant performance improvements and bug fixes.

    New Features

    • Add transform workflow action to application (#281)
    • Add ability to resolve workflows within applications (#290)
    • OFFSET in sql query statement (#293)
    • Add webpage summary image generation notebook (#299)
    • Add notebook on running txtai with native code (#304)
    • Add mmap parameter to Faiss (#308)
    • Add indexing guide to docs (#312)

    Improvements

    • Consume generator outputs in workflow tasks (#291)
    • Update pipeline workflow notebook (#292)
    • Update tabular notebook (#297)
    • Lower required version of Pillow library to prevent unnecessary upgrades (#303)
    • Embeddings vector batch improvements (#309)
    • Use single constant for current pickle protocol (#310)
    • Move quantize config param to Faiss (#311)
    • Update documentation with new demo and diagrams (#313)
    • Improve embeddings performance with large query limits (#318)

    Bug Fixes

    • ModuleNotFoundError: No module named 'transformers.hf_api' (#274)
    • Dependency issue with ONNX and Protobuf (#285)
    • The key should be writable instead of path. Thank you to @csnelsonchu! (#287)
    • Fix breaking change in build script from mkdocstrings bug (#289)
    • Index id sync issue when inserting multiple data types (text, documents, objects) into Embeddings (#294)
    • Labels pipeline outputs changed with transformers 4.20.0 (#295)
    • Tabular pipeline throws error when processing list fields (#296)
    • txtai load testing (#305)
    • Add cloud config to application.upsert method (#306)
  • v4.5.0(May 17, 2022)

    This release adds the following new features, improvements and bug fixes.

    New Features

    • Add scripts to train bashsql query translation model (#271)
    • Add QA database example notebook (#272)
    • Add CITATION file (#273)

    Improvements

    • Improve efficiency of external vectors (#275)
    • Refactor vectors package to improve code reuse (#276)
    • Add logic to detect external vectors method (#277)

    Bug Fixes

    • Fix summary pipeline issue with transformers>=4.19.0 (#278)
  • v4.4.0(Apr 20, 2022)

    This release adds the following new features, improvements and bug fixes.

    New Features

    • Add semantic search explainability (#248)
    • Add notebook covering model explainability (#249)
    • Add txtai console (#252)
    • Add sequences pipeline (#261)
    • Add scripts to train query translation models (#265)
    • Add query translation logic in embeddings searches (#266)
    • Add notebook for query translation (#269)

    Improvements

    • Update HFTrainer to support sequence-sequence models (#262)

    Bug Fixes

    • Unit tests failing with tokenizers>= 0.12 (#253)
    • Running default.config.yml returns TypeError: register() got an unexpected keyword argument 'ids' (#256)
    • Unit tests failing with transformers==4.18.0 (#258)
    • Update precommit to use latest version of psf black (#259)
  • v4.3.1(Mar 11, 2022)

    This release adds the following new features, improvements and bug fixes.

    Bug Fixes

    • Fix word embeddings regression with batch transformation (#245)
  • v4.3.0(Mar 10, 2022)

    This release adds the following new features, improvements and bug fixes.

    New Features

    • Add notebook covering txtai embeddings index file structure (#237)
    • Add Image Hash pipeline (#240)
    • Add support for custom SQL functions in embeddings queries (#241)
    • Add notebook for Embeddings SQL functions (#243)
    • Add notebook for near-duplicate image detection (#244)

    Improvements

    • Rename SQLException to SQLError (#232)
    • Refactor API instance into a separate package (#233)
    • API should raise an error if attempting to modify a read-only index (#235)
    • Add last update field to index metadata (#236)
    • Update transcription pipeline to use AutoModelForCTC (#238)

    Bug Fixes

    • Ensure limit always set in embeddings search/batchsearch (#234)
    • Fix issue with parsing multiline SQL statements bug (#242)
  • v4.2.1(Feb 28, 2022)

  • v4.2.0(Feb 24, 2022)

    This release adds the following new features, improvements and bug fixes.

    New Features

    • Add notebook for workflow notifications (#225)
    • Add default and custom docker configurations (#226)
    • Create docker configuration for AWS Lambda (#228)
    • Add support for loading/storing embedding indexes on cloud storage (#229)

    Improvements

    • Add support for SQL || operator (#223)
    • Add flag to disable loading index data in API (#230)

    Bug Fixes

    • Modify database decoder methods to check for None (#220)
    • Modify embeddings search to make return type consistent when index initialized and not initialized (#221)
    • Embeddings index returning malformed JSON errors in certain situations (#222)
    • Check for empty documents input before indexing (#224)
  • v4.1.0(Feb 3, 2022)

    This release adds the following new features, improvements and bug fixes.

    New Features

    • Add entity extraction pipeline (#203)
    • Add workflow scheduling (#206)
    • Add workflow search task to API (#210)
    • Add Console Task (#215)
    • Add Export Task (#216)
    • Add notebook for workflow scheduling (#218)

    Improvements

    • Default documentation theme using system preference (#197)
    • Improve multi-user experience for workflow application (#198)
    • Documentation improvements (#200)
    • Add social preview image for documentation (#201)
    • Add links to txtai in all example notebooks (#202)
    • Add limit parameter to API search method (#208)
    • Add documentation on local API instances (#209)
    • Add shorthand syntax for creating workflow tasks in API (#211)
    • Accept functions as workflow task actions in API (#213)

    Bug Fixes

    • Object detection model fails to load additional models (#204)
    • Update unit tests to limit cpu usage for word vector tests (#207)
    • Add better error handling around unindexed embedding instances (#212)
    • Fix issue when workflow task generates no output (#214)
    • Add lock to API search methods (#217)
  • v4.0.0(Jan 11, 2022)

    🎈🎉🥳 We're excited to announce the release of txtai 4.0! 🥳🎉🎈

    Thank you to the growing txtai community. This couldn't be done without you. Please remember to ⭐ txtai if it has been helpful.

    txtai 4.0 is a major release with a significant number of new features. This release adds content storage, querying with sql, object storage, reindexing, index compression, external vectors and more!

    To quantify the changes, the code base increased by 50% with 36 resolved issues, by far the biggest release of txtai. These changes were designed to be fully backward compatible but keep in mind it is a new major release.

    What's new in txtai 4.0 covers all the changes with detailed examples. The documentation site has also been refreshed.

    New Features

    • Store text content (#168)
    • Add option to index dictionaries of content (#169)
    • Add SQL support for generating combined embeddings + database queries (#170)
    • Add reindex method to embeddings (#171)
    • Add index archive support (#172)
    • Add close method to embeddings (#173)
    • Update API to work with embeddings + database search (#176)
    • Add content option to tabular pipeline (#177)
    • Update workflow example to support embeddings content (#179)
    • Add index metadata to embeddings config (#180)
    • Add object storage (#183)
    • Aggregate partial query results when clustering (#184)
    • Add function parameter to embeddings reindex (#185)
    • Add support for user defined column aliases (#186)
    • Use SQL bracket notation to support multi word and more complex JSON path expressions (#187)
    • Support SQLite 3.22+ (#190)
    • Add pre-computed vector support (#192)
    • Change document/object inserts to only keep latest record (#193)
    • Update documentation with 4.0 changes (#196)

    Improvements

    • Modify workflow to select batches with slices (#158)
    • Add tensor support to workflows (#159)
    • Read YAML config if provided as a file path (#162)
    • Make adding pipelines to API easier (#163)
    • Process task actions concurrently (#164)
    • Add tensor workflow notebook (#167)
    • Update default ANN parameters (#174)
    • Require Python 3.7+ (#175)
    • Consistently name embeddings id fields (#178)
    • Add txtai version attribute (#181)
    • Refresh notebooks for 4.0 (#188)
    • Modify embeddings to only iterate over input documents once (#189)
    • Improve efficiency of vector transformations (#191)

    Bug Fixes

    • Add thread lock around API write calls (#160)
    • Expose caption and objects pipeline via API (#161)
    • Change pickle calls to use protocol supporting lowest Python version (#182)
    • HFOnnx expects ORT provider bug (#195)
  • v3.7.0(Nov 23, 2021)

    This release adds the following new features, improvements and bug fixes.

    New Features

    • Add object detection pipeline (#148)
    • Add image caption pipeline (#149)
    • Add retrieval task (#150)
    • Add no-op pipeline (#152)
    • Add new workflow functionality (#155)

    Improvements

    • Add Korean translation to README.md. Thank you @0206pdh! (#138)
    • Add links to external articles (#139)
    • Update example applications to be consistent (#140)
    • Add an article summarization example (#144)
    • Add fallback mode for textractor (#145)
    • Reorganize pipeline package (#147)
    • Update optional package tests to simulate missing packages (#154)
    • Add parameter to flatten labels output (#153)
    • Update documentation with latest changes (#156)

    Bug Fixes

    • Fix bug with importing service task when workflow extra not installed (#146)
    • Fix inconsistencies with url based tasks (#151)
  • v3.6.0(Nov 8, 2021)

    This release adds the following new features, improvements and bug fixes.

    New Features

    • Add post workflow action to API (#129)
    • Add tabular pipeline (#134)
    • Enhance ServiceTask to support additional use cases (#135)
    • Add notebook for tabular pipeline (#136)
    • Add topn option to extractor pipeline (#137)

    Improvements

    • Refactor registering new auto models to use methods in Transformers library (#128)
    • Update workflow example application (#130)

    Bug Fixes

    • No issues this release
  • v3.5.0(Oct 18, 2021)

    This release adds the following new features, improvements and bug fixes.

    New Features

    • Add scikit-learn to ONNX export pipeline (#124)
    • Add registry methods for auto models (#126)
    • Add notebook to demonstrate loading scikit-learn and PyTorch models (#127)

    Improvements

    • Add parameter to return raw model outputs for labels pipeline (#123)
    • Add parameter to use standard pooling for TransformersVectors (#125)

    Bug Fixes

    • Pass model configuration to ONNX Models (#121)
    • Fix incorrect import in Notebooks (#122)
  • v3.4.0(Oct 7, 2021)

    This release adds the following new features, improvements and bug fixes.

    New Features

    • Create notebook using extractive qa to build structured data (#117)
    • Modify extractor pipeline to support similarity pipeline backed context (#119)

    Improvements

    • Improve performance of extractor context queries (#120)

    Bug Fixes

    • Update labels pipeline to filter text classification output (#116)
    • Fix issues with Transformers 4.11.2 (#118)
  • v3.3.0(Sep 10, 2021)

    This release adds the following new features, improvements and bug fixes.

    New Features

    • Add ONNX export pipeline (#107)
    • Add notebook for ONNX pipeline (#108)
    • Add ONNX support for Embeddings and Pipelines (#109)
    • Support QA models in Trainer pipeline (#111)
    • Add notebook for training QA models (#115)

    Improvements

    • Remove deprecated packages (#114)

    Bug Fixes

    • Fix issues with latest Transformers version (#110)
  • v3.2.0(Aug 17, 2021)

    This release adds the following new features, improvements and bug fixes.

    New Features

    • Enhance Labels pipeline to support standard text classification models (#95)
    • Add Trainer pipeline (#96)
    • Modularize txtai install (#97)
    • Evaluate if faiss-cpu can be used as default across all platforms (#98)
    • Add vector method for sentence-transformers (#101)

    Improvements

    • Add book search example application (#91)
    • Add wiki search example application (#92)
    • Change tokenization to default to false for TransformerVectors (#99)
    • Infer vector method using path (#100)
    • Improve performance when running models through transformers (#102)
    • Update notebooks and example applications (#103)

    Bug Fixes

    • Clear workflow batch during processing bug (#90)
  • v3.1.0(May 22, 2021)

    This release adds the following new features:

    • Add support for update/delete embeddings index operations (#86)
    • Add Embeddings Cluster component (#87)
    • Switch default backend on Windows to Hnswlib (#88)
    • Add notebook covering distributed embedding clusters (#89)
  • v3.0.0(May 4, 2021)

    txtai 3.0.0 is a major release with a significant number of new features. This release overhauls the project structure, consolidates logic into pipelines and introduces workflows.

    Summary of txtai features:

    • 🔎 Large-scale similarity search with multiple index backends (Faiss, Annoy, Hnswlib)
    • 📄 Create embeddings for text snippets, documents, audio and images. Supports transformers and word vectors.
    • 💡 Machine-learning pipelines to run extractive question-answering, zero-shot labeling, transcription, translation, summarization and text extraction
    • ↪️ Workflows that join pipelines together to aggregate business logic. txtai processes can be microservices or full-fledged indexing workflows.
    • 🔗 API bindings for JavaScript, Java, Rust and Go
    • ☁️ Cloud-native architecture that scales out with container orchestration systems (e.g. Kubernetes)

    New Features

    • Add Docker file for API (#59)
    • Require Faiss 1.7.0 (#60)
    • Add summary pipeline (#65)
    • Add text extraction pipeline (#66)
    • Add transcription pipeline (#67)
    • Add translation pipeline (#68)
    • Add workflow framework (#69)
    • Add additional pipeline abstraction layer for tensor frameworks (#70)
    • Add tests for new v3 functionality (#71)
    • Add notebooks covering new v3 functionality (#73)
    • Add Pipeline Factory (#76)
    • Add API extensions (#77)
    • Add workflow builder application (#80)
    • Add text segmentation pipeline (#81)
    • Add workflow to API (#82)
    • Add service workflow task (#83)
    • Add object storage workflow task (#84)
    • Add URL workflow task (#85)

    Improvements

    • Refactor code into smaller components and modules (#63)
    • Modify pipeline to accept GPU device id (#64)
    • Allow direct download of sentence-transformer models (#72)
    • Update documentation, add site through GitHub pages (#75)
    • Modularize the API (#78)
    • Add default truncation to pipelines (#79)

    Bug Fixes

    • Non intuitive behaviour of Tokenizer (#61)
    • [Python 3.9, Mac OS] Code hangs while building embedding index (#62)
    • embeddings.index Truncation RuntimeError: The size of tensor a (889) must match the size of tensor b (512) at non-singleton dimension 1 (#74)
  • v2.0.0(Jan 13, 2021)

    txtai 2.0.0 is a major release with a significant number of new features. This release brings a new zero-shot similarity pipeline, a more streamlined and consistent API, batch support for all modules and integration with Hugging Face Datasets.

    In addition to Python, txtai has API support for JavaScript, Java, Rust and Go.

    New Features

    • [BREAKING CHANGES] Make API definitions consistent (#54)
    • Zero-shot similarity pipeline (#21, #49)
    • Add batch support for all modules (#18, #53)
    • Add example notebook integrating Hugging Face Datasets (#26)
    • Add example notebook that adds semantic search to existing system (#57)

    Improvements

    • Add API tests, increase test coverage (#42)
    • Refactor pipeline component (#44)
    • Upgrade to Transformers 4.x (#45)
    • Review, organize and update example notebooks (#52)
    • Allow setting ANN index parameters (#55)
    • Modify API add method to stream data (#56)

    Bug Fixes

    • Fix language support issues (#39, #43)
  • v1.5.0(Nov 21, 2020)

    This release adds the following enhancements and bug fixes:

    • Refresh example notebooks and add notebook on labeling (#40)
    • Enhance API to fully support all txtai functionality (#41)
  • v1.4.0(Nov 3, 2020)

    This release adds the following enhancements and bug fixes:

    • Split extractor embedding query and QA calls (#35)
    • Upgrade to Faiss 1.6.4 (#36)
    • Migrate build to GitHub Actions (#38)
  • v1.3.0(Oct 11, 2020)

    This release adds the following enhancements and bug fixes:

    • Added FastAPI interface (#12)
    • Fix tokenization error in notebook (#28)
    • Added text labeling interface using zero shot classifier (#30)
    • Update macOS version in Travis CI script
  • v1.2.1(Sep 11, 2020)

  • v1.2.0(Sep 10, 2020)

    This release adds the following enhancements and bug fixes:

    • Add unit tests and integrate Travis CI (#7)
    • Add documentation for Embeddings settings to README (#11)
    • Compatibility issues with transformers 3.1 and sentence-transformers (#20)
    • Add batch indexing for transformer indices (#23)
    • Add option to store word vectors with embeddings model (#24)
  • v1.1.0(Aug 18, 2020)

    This release adds the following enhancements and bug fixes:

    • Fully support Windows and macOS installs (#1, #2, #8, #9)
    • Add support for additional index backends, Annoy (#4) and hnswlib (#5)
    • Support string ids (#6)
    • Enable flag to enable/disable Faiss SQ8 quantization (#10)
  • v1.0.0(Aug 11, 2020)

Owner
NeuML
Applying machine learning to solve everyday problems
NeuML
rclip - AI-Powered Command-Line Photo Search Tool

rclip is a command-line photo search tool based on the awesome OpenAI's CLIP neural network.

Yurij Mikhalevich 394 Dec 12, 2022
A fast, efficiency python package for searching and getting search results with many different search engines

search A fast, efficiency python package for searching and getting search results with many different search engines. Installation To install the pack

Neurs 0 Oct 6, 2022
Deep Image Search - AI-Based Image Search Engine

Deep Image Search is an AI-based image search engine that includes deep transfer learning features Extraction and tree-based vectorized search technique.

null 144 Jan 5, 2023
Search emails from a domain through search engines

EmailFinder - search emails through Search Engines

Josué Encinar 155 Dec 30, 2022
Image search service based on imgsmlr extension of PostgreSQL. Support image search by image.

imgsmlr-server Image search service based on imgsmlr extension of PostgreSQL. Support image search by image. This is a sample application of imgsmlr.

jie 45 Dec 12, 2022
GitScanner is a script to make it easy to search for Exposed Git through an advanced Google search.

GitScanner Legal disclaimer Usage of GitScanner for attacking targets without prior mutual consent is illegal. It is the end user's responsibility to

Kaio Gomes 3 Oct 28, 2022
Reverse-ikea-image-search - A simple image of ikea search using jina.ai

IKEA Reverse Image Search This is a demo project to fetch ikea product images(IK

SOUVIK GHOSH 4 Mar 8, 2022
Google Project: Search and auto-complete sentences within given input text files, manipulating data with complex data-structures.

Auto-Complete Google Project In this project there is an implementation for one feature of Google's search engines - AutoComplete. Autocomplete, or wo

Hadassah Engel 10 Jun 20, 2022
document organizer with tags and full-text-search, in a simple and clean sqlite3 schema

document organizer with tags and full-text-search, in a simple and clean sqlite3 schema

Manos Pitsidianakis 152 Oct 29, 2022
This project is a sample demo of Arxiv search related to AI/ML Papers built using Streamlit, sentence-transformers and Faiss.

This project is a sample demo of Arxiv search related to AI/ML Papers built using Streamlit, sentence-transformers and Faiss.

Karn Deb 49 Oct 30, 2022
Full-text multi-table search application for Django. Easy to install and use, with good performance.

django-watson django-watson is a fast multi-model full-text search plugin for Django. It is easy to install and use, and provides high quality search

Dave Hall 1.1k Jan 3, 2023
πŸ” Messages Searcher is make for search custom message in all channels in guild and dm.

?? Messages Searcher is make for search custom message in all channels in guild and dm.

Kaneki 33 Dec 31, 2022
Inverted index creation and query search mechanism on Wikipedia pages.

WikiPedia Search Engine Step 1 : Installing Requirements Install "stemming" module for python using pip. Step 2 : Parsing the Data To parse the data,

Piyush Atri 1 Nov 27, 2021
ForFinder is a search tool for folder and files

ForFinder is a search tool for folder and files. You can use that when you Source Code Analysis at your project's local files or other projects that you are download. Enter a root path and keyword to ForFinder.

Çağrı Aliş 7 Oct 25, 2022
Modular search for Django

Haystack Author: Daniel Lindsley Date: 2013/07/28 Haystack provides modular search for Django. It features a unified, familiar API that allows you to

Haystack Search 3.4k Jan 4, 2023
Full text search for flask.

flask-msearch Installation To install flask-msearch: pip install flask-msearch # when MSEARCH_BACKEND = "whoosh" pip install whoosh blinker # when MSE

honmaple 197 Dec 29, 2022
Senginta is All in one Search Engine Scrapper for used by API or Python Module. It's Free!

Senginta is All in one Search Engine Scrapper. With traditional scrapping, Senginta can be powerful to get result from any Search Engine, and convert to Json. Now support only for Google Product Search Engine (GShop, GVideo and many too) and Baidu Search Engine.

null 33 Nov 21, 2022
Google Search Engine Results Pages (SERP) in locally, no API key, no signup required

Local SERP Google Search Engine Results Pages (SERP) in locally, no API key, no signup required Make sure the chromedriver and required package are in

theblackcat102 4 Jun 29, 2021
A web search server for ParlAI, including Blenderbot2.

Description A web search server for ParlAI, including Blenderbot2. Querying the server: The server reacting correctly: Uses html2text to strip the mar

Jules Gagnon-Marchand 119 Jan 6, 2023